I spent an evening tinkering with Claude and the latest in multimodal LLMs to do something very simple: read one of my home videos, understand its visuals and audio, and generate a suggested follow-up question.
The result is a complete video analysis system that combines computer vision, speech recognition, and natural language generation, all running locally on a 2020 MacBook Air with no GPU. Claude Code dramatically accelerated the iteration cycle (to nobody’s surprise?). I was even able to revive this old blog of mine without having to remember the local build instructions :)!
What It Does
Point it at a video file, and the system:
- Analyzes key frames to understand what’s happening visually
- Transcribes the audio to extract spoken content
- Generates contextually-aware follow-up questions that connect what it saw with what it heard
This is multimodal analysis in action: a question like “What game are they playing with the guava?” could only emerge from combining visual context (a child in a garden) with audio understanding (since I was using a tiny vision model, the fact that my son was holding guavas from our garden would not have been evident without the audio transcript).
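To make the “connect what it saw with what it heard” step concrete, here is a minimal sketch of how the final question could be generated: the visual summary and the transcript get stitched into a single prompt for the text model. It assumes the `ollama` Python client and a running Ollama server; the prompt wording and variable names are illustrative, not the exact code from my script.

```python
# Minimal sketch of the follow-up-question step (not the exact script).
# Assumes `pip install ollama`, a running Ollama server, and qwen2.5:3b pulled.
import ollama

visual_summary = "A young child stands on a gravel path, holding two small yellow round objects."
transcript = "What is it? Guava? Guava? Picked from our own garden. See? Let's go wash the guava and eat it."

prompt = (
    "Here is what a short home video shows and what was said in it.\n"
    f"Visual summary: {visual_summary}\n"
    f"Audio transcript: {transcript}\n"
    "Ask one natural follow-up question that connects the visuals with the audio."
)

response = ollama.chat(model="qwen2.5:3b", messages=[{"role": "user", "content": prompt}])
print(response["message"]["content"])
```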
The Stack
Building this required orchestrating three specialized models:
- Moondream (1.7GB) for vision - surprisingly capable for its size
- Whisper tiny (75MB) for speech - OpenAI’s smallest Whisper variant, light enough for CPU-only inference
- Qwen 2.5 3B (1.9GB) for language generation - generates coherent questions from combined context
The vision and language models are served locally via Ollama, Whisper runs in-process, and OpenCV handles the video processing. Total footprint: ~4GB of models.
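For a sense of how the pieces talk to each other, here is a rough sketch of one extracted frame going through Moondream and the clip’s audio going through Whisper. It assumes the `ollama` and `openai-whisper` packages plus ffmpeg on the path; the file names and prompt are placeholders rather than the real script.

```python
# Rough sketch: describe one frame with Moondream, transcribe the clip with Whisper.
# Assumes `pip install ollama openai-whisper`, ffmpeg installed, and `ollama serve` running.
import ollama
import whisper

# Vision: Moondream describes a single frame extracted earlier with OpenCV.
vision = ollama.chat(
    model="moondream",
    messages=[{
        "role": "user",
        "content": "Describe this frame in two or three sentences.",
        "images": ["frame_001.jpg"],  # placeholder path to an extracted frame
    }],
)
print(vision["message"]["content"])

# Audio: Whisper tiny runs fine on CPU (it falls back from FP16 to FP32,
# as the warning in the console output below shows).
model = whisper.load_model("tiny")
result = model.transcribe("guava.mov")
print(result["text"])
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text'].strip()}")
```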
Why This Approach Matters
Privacy: No video data leaves the machine. All three models run entirely on your hardware, with zero network calls during inference.
Economics: Zero marginal cost per video analyzed. Download the models once, run as you please.
Control: Swap models, tune parameters, understand the pipeline. No black-box APIs. A good test bed for tinkering with multimodal LLMs.
The Implementation
Working with Claude Code made the development surprisingly smooth, from initial prototyping to dependency wrangling to performance optimization. The technical challenges were real (NumPy version conflicts, performance tuning for CPU-only hardware), but having an AI pair programmer accelerated the iteration cycle significantly. I can’t say I missed the good old days of wrestling with pip/conda/brew.
A modular architecture helped: the system breaks into three independent scripts (video_summarizer.py, audio_transcriber.py, video_analyzer.py) that work standalone or combined, so you can use just the components you need.
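Since the console output below only shows the combined entry point, here is a hedged sketch of how the pieces could be driven independently or together from Python; it assumes each script accepts a video path on the command line, the way video_analyzer.py does in the run below.

```python
# Sketch of driving the scripts standalone or combined (assumes each one
# takes a video path argument, as video_analyzer.py does in the console output).
import subprocess
import sys

video = sys.argv[1] if len(sys.argv) > 1 else "guava.mov"

# Standalone pieces: just the visual summary, or just the transcript.
subprocess.run([sys.executable, "video_summarizer.py", video], check=True)
subprocess.run([sys.executable, "audio_transcriber.py", video], check=True)

# Combined pipeline: visual analysis + transcription + follow-up question.
subprocess.run([sys.executable, "video_analyzer.py", video], check=True)
```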
Real Results
Testing on a 10-second family video:

Visual: “A young child stands on a gravel path, wearing a blue hat and white sweater with black stripes, holding two yellow balls while surrounded by plants and flowers.”
Audio: [Transcribed speech with timestamps]
[0.0s - 2.0s] What is it?
[2.0s - 3.0s] Gwava?
[3.0s - 4.0s] Gwava?
[4.0s - 5.0s] Gwava?
[5.0s - 7.0s] Picked from our own garden.
[7.0s - 8.0s] See?
[8.0s - 11.0s] Okay, let's go wash the gwava and eat it.
[11.0s - 12.0s] Okay?
Generated Question: “What game are they playing with the guava?”
Console Output:
(base) vatsan@srivatsans-air ~ % python video_analyzer.py ~/Downloads/guava.mov
/opt/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.26.4)
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
============================================================
🎬 COMPLETE VIDEO ANALYZER
============================================================
Video: /Users/vatsan/Downloads/guava.mov
Vision Model: moondream
Whisper Model: tiny
Text Model: qwen2.5:3b
============================================================
STEP 1: Visual Analysis
============================================================
📹 Video FPS: 30.00
⏱️ Extracting 1 frame every 3 seconds...
Frame 1/5 extracted
Frame 2/5 extracted
Frame 3/5 extracted
Frame 4/5 extracted
✅ Extracted 4 frames
🔍 Analyzing frames with vision model...
📸 Frame 1/4:
Sending to moondream... ............................................................. ✓
The image shows a young child standing on a gravel path, wearing a blue beanie and a white sweater with black stars. The child is holding two yellow balls in their hands. The setting appears to be an outdoor garden or park area, as evidenced by the presence of bushes and trees nearby.
📸 Frame 2/4:
Sending to moondream... ..................................................... ✓
The image shows a young child standing on a sidewalk, wearing a blue beanie and holding two tennis balls in their hands. The child is positioned near the center of the frame, with a path running through the scene leading towards a house in the background.
📸 Frame 3/4:
Sending to moondream... .......................................................................... ✓
The image depicts a serene garden scene with a small tree growing on the right side of the frame, surrounded by various plants and shrubs. The garden is enclosed within a wooden fence that adds to its rustic charm. In the background, there are trees with green leaves against an overcast sky, creating a sense of tranquility in the outdoor setting.
📸 Frame 4/4:
Sending to moondream... ....................................................................... ✓
The image shows a young child standing on the edge of a sidewalk, wearing a blue beanie and a white sweater with black stripes. The child is holding two yellow balls in their hands. In front of them, there are several plants and flowers, including purple flowers and green bushes. A house can be seen in the background behind the child.
============================================================
📝 Generating video summary...
============================================================
============================================================
STEP 2: Audio Transcription
============================================================
============================================================
🎙️ Transcribing audio...
============================================================
Loading Whisper 'tiny' model...
Transcribing (this may take a moment)...
/opt/anaconda3/lib/python3.9/site-packages/whisper/transcribe.py:132: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detected language: English
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1205/1205 [00:00<00:00, 1229.96frames/s]
✅ Transcript: What is it? Vala? Vala? Pick from our own garden. See? Okay, let's go wash the guava and eat it. Okay? Let's go.
============================================================
STEP 3: Follow-up Question Generation
============================================================
============================================================
🤔 Generating follow-up question...
============================================================
============================================================
📊 FINAL RESULTS
============================================================
🎥 VIDEO SUMMARY:
------------------------------------------------------------
In this video, we see a young child transitioning from an outdoor garden setting with trees and shrubs to a more urban environment centered on a sidewalk. Initially positioned near a wooded area, the scene then shifts to an outdoor garden with a small tree and wooden fence, offering a serene backdrop before concluding with the child in a more urbanized setting holding yellow balls, surrounded by plants and flowers.
🎙️ AUDIO TRANSCRIPT:
------------------------------------------------------------
Language: en
What is it? Guava? Guava? Pick from our own garden. See? Okay, let's go wash the guava and eat it. Okay? Let's go.
❓ FOLLOW-UP QUESTION:
------------------------------------------------------------
What are they planning to do with the guava?
Processing time: ~2 minutes. Fast enough for batch processing, slow enough to appreciate what older, CPU-only hardware can still accomplish with the right optimizations.
Takeaway
I was pleasantly surprised by how little of my evening it took to build a simple multimodal application, with no expensive infrastructure or cloud dependencies required. A few gigabytes of models, some Python glue code, and a good coding assistant got me surprisingly far.
I must admit, coding hasn’t been my day job for a while now, since I’ve been supporting MLEs as an EM, though I still tinker with code on and off. The productivity gains from AI coding assistants are real: what might have taken a day or two of debugging and iteration was compressed into a couple of hours of my evening. The code is modular and available for anyone who wants local-first AI tools they can understand and control.
Technical details: Uses OpenCV for video processing (handles .mov files natively), Ollama for local model serving, all inference happens on-device. Tested on macOS but should work cross-platform with minor adjustments.
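As a closing illustration of the OpenCV side, here is a small sketch of the kind of frame sampling the console output suggests (one frame every 3 seconds); the function name and output paths are mine, and the real script’s parameters may differ.

```python
# Sketch of sampling one frame every few seconds with OpenCV (parameters illustrative).
import cv2

def extract_frames(video_path: str, every_n_seconds: float = 3.0) -> list:
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS metadata is missing
    step = max(1, int(fps * every_n_seconds))
    saved, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            out = f"frame_{len(saved) + 1:03d}.jpg"
            cv2.imwrite(out, frame)
            saved.append(out)
        index += 1
    cap.release()
    return saved

if __name__ == "__main__":
    print(extract_frames("guava.mov"))
```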