Video, Audio, and Any-to-Any Models
How Whisper, V-JEPA, Sora-class video generators, MusicGen, and unified any-to-any models extend the multimodal stack beyond static images.
The vision-and-text stack covered in earlier notes only handles the easy half of the multimodal problem. Real signals arrive as time-series: audio waveforms, video frames, sensor traces. The recipes for these modalities have converged on a shared template - tokenise the input, train a transformer over the tokens, possibly do it jointly with text - but the tokenisers themselves do most of the architectural work, and the unified "any-to-any" models that promise one network for everything are still part promise, part product.
Whisper as the audio analogue of CLIP
OpenAI's Whisper (Radford et al, 2022) did for automatic speech recognition what CLIP did for image classification: it was trained on 680,000 hours of weakly supervised multilingual audio-text pairs scraped from the web, and shipped as one model that handles 99 languages, transcription, and speech-to-English translation zero-shot. The architecture is unspectacular - a log-Mel spectrogram fed into an encoder-decoder transformer - but the data scale and multitask training format are the contribution.
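The multitask format is easiest to see as a decoder prefix. The sketch below builds such a prefix from special tokens of the kind described in the Whisper paper. The helper name `whisper_style_prompt` is hypothetical and is not part of any Whisper package; treat it as an illustration of the idea, not the library's API.

```python
def whisper_style_prompt(language: str, task: str, timestamps: bool = False) -> list[str]:
    """Build a Whisper-style special-token prefix (hypothetical helper).

    The decoder is conditioned on these tokens, so one set of weights can be
    asked for transcription or translation in a given language.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask for plain text without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens


print(whisper_style_prompt("fr", "translate"))
```

Switching from French transcription to French-to-English translation changes one token in the prefix, not the model.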
What Whisper changed:
- One model replaces a per-language stack. Pre-2022 production ASR meant separate models per language with separate fine-tuning.
- Robustness to noise, accent, music. The web-scale data covered real-world conditions that hand-curated speech datasets miss.
- Encoder as a general audio feature extractor. Whisper's encoder is a widely reused audio backbone for downstream tasks (audio classification, speaker recognition adapters, multimodal LLMs that take speech input).
The next-step research (SeamlessM4T, Voicebox, Moshi) extends to streaming, voice generation, and cross-lingual speech-to-speech. These systems keep the large-scale pretraining recipe but generally pair it with their own speech encoders or codecs rather than reusing Whisper's encoder.
Video understanding: V-JEPA, VideoCoCa, VideoLLaMA
Video adds a dimension and a problem. The dimension is time; the problem is that uniformly sampling frames throws away temporal structure, while dense processing is prohibitively expensive.
Three lines of approach:
- V-JEPA (Bardes et al, 2024). Self-supervised and non-generative. Trained to predict masked spatio-temporal regions in an abstract feature space rather than reconstructing pixels. The non-reconstructive objective means the model is free to ignore unpredictable details (exact textures, background motion) and focus on what is structurally predictable. Strong frozen-feature evaluation on action recognition (81.9% on Kinetics-400 with a ViT-H/16).
- VideoCoCa and similar contrastive models. Extend CLIP-style image-text contrastive pretraining to video-text pairs. Good for retrieval and zero-shot classification; less natural for fine-grained temporal reasoning.
- VideoLLaMA, Video-LLaVA, Video-ChatGPT. Slot a video encoder (often a CLIP variant applied per-frame plus a temporal aggregator, or a dedicated video backbone) into the LLaVA-style "encoder + projector + LLM" pattern. Currently the dominant open path for video question answering.
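The feature-space objective from the V-JEPA bullet can be sketched with toy linear "encoders". Everything here is an assumption made for illustration: the sizes, the linear maps standing in for ViTs, the mean-pooled predictor, and the L1 loss with an exponential-moving-average target encoder (which is how the V-JEPA paper describes its setup, not something this note specifies).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen arbitrarily for the sketch.
num_tokens, pixel_dim, feat_dim = 16, 48, 8
patches = rng.normal(size=(num_tokens, pixel_dim))  # stand-in for spatio-temporal patches

# Single linear maps standing in for the real transformer encoders.
w_context = rng.normal(size=(pixel_dim, feat_dim)) * 0.1
w_target = w_context.copy()      # target encoder starts as a copy of the context encoder
w_predictor = np.eye(feat_dim)   # toy predictor

# Hide 6 of the 16 tokens from the context encoder.
mask = np.zeros(num_tokens, dtype=bool)
mask[rng.choice(num_tokens, size=6, replace=False)] = True


def feature_prediction_loss(patches, mask, w_context, w_target, w_predictor):
    """L1 loss between predicted and target features at masked positions only."""
    context_feats = patches[~mask] @ w_context           # encoder sees visible tokens
    pooled = context_feats.mean(axis=0)                   # crude summary of the context
    predicted = np.tile(pooled @ w_predictor, (int(mask.sum()), 1))
    targets = patches[mask] @ w_target                    # targets live in feature space, not pixels
    return float(np.abs(predicted - targets).mean())


def ema_update(w_target, w_context, momentum=0.99):
    """Move the target encoder slowly towards the context encoder."""
    return momentum * w_target + (1.0 - momentum) * w_context


loss = feature_prediction_loss(patches, mask, w_context, w_target, w_predictor)
w_target = ema_update(w_target, w_context)
print(f"toy feature-prediction loss: {loss:.4f}")
```

The point of the sketch is where the loss is computed: no pixel is ever reconstructed, so detail that the target encoder discards costs the model nothing.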
The hard sub-problem common to all three is token budget. A 1-minute video sampled at 1 fps, at the 576 tokens per frame a CLIP ViT-L/14 produces at 336 px, is 60 frames x 576 tokens = 34,560 tokens. Compressing this without losing the answer-relevant frame is what perceiver resamplers, temporal pooling, and Q-Former-style learned queries exist to do.
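The arithmetic and the compression step can be sketched together. The function names, the 32-dimensional features, and the choice of 64 queries are assumptions for illustration; the queries are random here, whereas a real resampler or Q-Former learns them and adds projections, multiple heads, and several layers.

```python
import numpy as np


def video_token_count(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Visual tokens produced by encoding every sampled frame independently."""
    return int(duration_s * fps) * tokens_per_frame


def resample_with_queries(tokens: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Single-head cross-attention: each query attends over all input tokens.

    The output has one row per query, so the sequence length handed to the
    LLM is set by the number of queries, not by the video length.
    """
    dim = tokens.shape[-1]
    scores = queries @ tokens.T / np.sqrt(dim)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens


rng = np.random.default_rng(0)
n_tokens = video_token_count(duration_s=60, fps=1, tokens_per_frame=576)
tokens = rng.normal(size=(n_tokens, 32)).astype(np.float32)
queries = rng.normal(size=(64, 32)).astype(np.float32)   # stand-in for learned queries

compressed = resample_with_queries(tokens, queries)
print(n_tokens, compressed.shape)   # 34560 (64, 32)
```

The trade-off is visible in the shapes: 34,560 tokens collapse to 64, and whether the answer-relevant frame survives depends entirely on what the queries learn to attend to.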
Sora, Veo, Runway: text-to-video generation
Text-to-video was the obvious next target after Stable Diffusion. The 2024-2025 wave delivered:
- OpenAI Sora (February 2024). A DiT operating on spatio-temporal latent patches produced by a 3D VAE. The "patches as tokens" formulation lets the same architecture handle different resolutions, durations, and aspect ratios. Up to 60-second 1080p clips with strong physical consistency.
- Google DeepMind Veo (Veo 3 / Veo 3.1 by 2025). Native audio generation alongside video, with stronger camera control and reference-image conditioning than the original Sora release.
- Runway Gen-3 / Gen-4. Production-focused, shorter clips, stronger fine-grained controls for filmmakers (motion brush, camera path).
The shared technical pattern is: tokenise video into a 3D latent grid with a video VAE (open models commonly use around 8x compression per spatial dimension and 4x-8x in time), train a DiT or flow-matching model on those latent tokens, and condition on text via cross-attention. Classifier-free guidance applies as in image diffusion. Sampling cost is dominated by the number of tokens, which is why current commercial video models are still in the seconds-to-minute clip range.
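A rough sketch of the two pieces of that pattern that reduce to arithmetic: the size of the latent token grid and the classifier-free guidance combination. The compression factors, patch size, and clip dimensions in the example are illustrative assumptions, not the configuration of any named model.

```python
import math

import numpy as np


def latent_token_count(frames: int, height: int, width: int,
                       t_comp: int, s_comp: int, patch: int = 1) -> int:
    """Tokens seen by the denoiser after VAE compression and patchification."""
    lt = math.ceil(frames / t_comp)
    lh = math.ceil(math.ceil(height / s_comp) / patch)
    lw = math.ceil(math.ceil(width / s_comp) / patch)
    return lt * lh * lw


def classifier_free_guidance(pred_uncond: np.ndarray, pred_cond: np.ndarray,
                             scale: float) -> np.ndarray:
    """Push the prediction away from the unconditional one, towards the text-conditioned one."""
    return pred_uncond + scale * (pred_cond - pred_uncond)


# Assumed example: 5 s at 24 fps, 1280x720, 4x temporal and 8x spatial
# compression, 2x2 spatial patches.
tokens = latent_token_count(frames=120, height=720, width=1280,
                            t_comp=4, s_comp=8, patch=2)
print(tokens)   # 108000

guided = classifier_free_guidance(np.zeros(3), np.ones(3), scale=7.5)
print(guided)
```

Even with aggressive compression, five seconds of 720p video is on the order of 100k tokens per denoising step under these assumptions, which is the cost pressure the paragraph above describes.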
The Sora technical report's title framed video generation models as "world simulators". Whether the simulation is physically accurate enough to be useful for robotics or games is the open question driving 2025 video-model research.
Audio generation: AudioCraft, MusicGen, Suno
Audio generation followed the same trajectory as image generation, with a tokeniser-and-transformer recipe:
- EnCodec (Defossez et al, 2022). A neural audio codec that compresses 24 kHz audio into discrete tokens at ~75 Hz across multiple codebooks (residual vector quantisation). The same role the VAE plays for latent diffusion.
- MusicGen (Copet et al, 2023). A single transformer language model autoregressing over EnCodec tokens, conditioned on text or melody. Open weights, 3.3B parameters in the largest variant, generates ~12 seconds of music in real time on an A100.
- AudioGen. Same recipe, trained on environmental and sound-effects data instead of music.
- Suno, Udio. Closed commercial systems producing full songs (vocals, lyrics, instruments) at consumer quality. Architectures undisclosed but presumably similar tokeniser + transformer with much more training data.
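Residual vector quantisation, the mechanism behind the EnCodec bullet, fits in a few lines. This is a toy sketch with random codebooks and made-up sizes (8-dimensional frames, 4 codebooks of 16 entries), so it shows the mechanics only; a trained codec learns its codebooks so that each stage shrinks the residual.

```python
import numpy as np


def rvq_encode(frames: np.ndarray, codebooks: list[np.ndarray]) -> np.ndarray:
    """Quantise each frame with a cascade of codebooks.

    Stage k picks the nearest codeword to whatever stage k-1 failed to explain,
    so every frame becomes one integer per codebook.
    """
    residual = frames.copy()
    codes = []
    for codebook in codebooks:
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        residual = residual - codebook[idx]
    return np.stack(codes, axis=1)


def rvq_decode(codes: np.ndarray, codebooks: list[np.ndarray]) -> np.ndarray:
    """Sum the selected codeword from every stage."""
    return sum(cb[codes[:, i]] for i, cb in enumerate(codebooks))


rng = np.random.default_rng(0)
frame_rate_hz, seconds, dim = 75, 2, 8            # 75 Hz matches the figure quoted above
frames = rng.normal(size=(frame_rate_hz * seconds, dim))
codebooks = [rng.normal(size=(16, dim)) for _ in range(4)]

codes = rvq_encode(frames, codebooks)
recon = rvq_decode(codes, codebooks)
print(codes.shape, codes.size)   # (150, 4) 600
```

The token count a language model has to handle is frame rate times number of codebooks times duration: 600 tokens for two seconds in this toy setup.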
Speech generation (Tortoise, ElevenLabs, OpenAI's voice models) uses a related stack with an additional phoneme or alignment frontend. The recipe is portable: any 1D continuous signal becomes a candidate for the tokenise-and-model treatment.
Any-to-any unified models
The end state most labs are pointing at is a single model that consumes and produces text, images, audio, and video interchangeably. Three notable attempts:
- Unified-IO and Unified-IO 2 (Lu et al, 2022; 2023). Map every modality into a shared discrete vocabulary (images and dense outputs via a VQ-GAN-style tokeniser, audio via a codec), then train one encoder-decoder transformer on dozens of tasks. Trades peak quality on any single modality for genuine task generality.
- 4M and 4M-21 (Mizrahi et al, 2023; Bachmann et al, 2024). Train a single transformer with a masked-modelling objective over tokenised modalities, scaled to 21 in 4M-21 (RGB, depth, semantic masks, edges, captions, bounding boxes, camera intrinsics, etc.). Modalities can be chained at inference: image -> depth -> captions -> semantic mask in one model.
- GPT-4o and Gemini. Closed frontier systems that handle text + image + audio + (Gemini) video natively. The architectural details are not public, but the behaviour - speech in and speech out from a single model, with voice style controllable from text prompts - is consistent with a unified token-space model.
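The "shared vocabulary" idea behind these designs can be sketched as a simple offset scheme: each modality keeps its own tokeniser, and its local token ids are shifted into a disjoint slice of one global id space. The vocabulary sizes and helper names below are hypothetical and are not taken from any of the models above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalityRange:
    name: str
    offset: int
    size: int


def build_shared_vocab(sizes: dict[str, int]) -> dict[str, ModalityRange]:
    """Assign each modality a contiguous, non-overlapping slice of global ids."""
    ranges, offset = {}, 0
    for name, size in sizes.items():
        ranges[name] = ModalityRange(name, offset, size)
        offset += size
    return ranges


def to_global(ranges: dict[str, ModalityRange], modality: str, local_id: int) -> int:
    r = ranges[modality]
    if not 0 <= local_id < r.size:
        raise ValueError(f"{local_id} is outside the {modality} vocabulary")
    return r.offset + local_id


def to_local(ranges: dict[str, ModalityRange], global_id: int) -> tuple[str, int]:
    for r in ranges.values():
        if r.offset <= global_id < r.offset + r.size:
            return r.name, global_id - r.offset
    raise ValueError(f"{global_id} is outside the shared vocabulary")


# Hypothetical sizes for a text tokeniser, an image tokeniser, and an audio codec.
vocab = build_shared_vocab({"text": 32000, "image": 8192, "audio": 2048})
print(to_global(vocab, "image", 5))   # 32005
print(to_local(vocab, 40199))         # ('audio', 7)
```

Once every modality is an integer in the same range, a single transformer with a single softmax can emit any of them, and the id range of each predicted token says which detokeniser to call.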
What unified architectures buy you, and where they still split
Unified models are seductive because they promise:
- One serving stack instead of N.
- Cross-modal transfer (audio understanding helps video understanding).
- Genuinely native interleaved interaction.
In practice, four forces still favour specialisation:
| Force | Why it splits the stack |
|---|---|
| Tokeniser specialisation | A great audio codec is not a great image VAE. Each modality has its own bandwidth and compression sweet spot. |
| Data availability | High-quality paired multimodal data is scarce; specialised unimodal pretraining still produces better encoders. |
| Inference economics | A Sora-class video request can cost orders of magnitude more compute than an image request (on the order of 1000x). Routing video requests to a dedicated stack is cheaper. |
| Latency profiles | Real-time speech needs streaming decoders; image generation does not. Different serving tradeoffs. |
The current consensus is "unified at the user interface, specialised at the compute layer" - frontier products route requests behind a unified API to different models, with growing but not total weight-sharing.
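A minimal sketch of that pattern, assuming a hypothetical `unified_generate` entry point and stub backends: callers see one function, and a dispatch table picks the specialised model. Real products add queuing, batching, and per-modality hardware pools behind the same shape.

```python
from typing import Callable


# Stub backends standing in for separately served models (all hypothetical).
def text_backend(prompt: str) -> str:
    return f"[text model] {prompt}"


def image_backend(prompt: str) -> str:
    return f"[image model] {prompt}"


def speech_backend(prompt: str) -> str:
    return f"[streaming speech model] {prompt}"


def video_backend(prompt: str) -> str:
    return f"[video model] {prompt}"


BACKENDS: dict[str, Callable[[str], str]] = {
    "text": text_backend,
    "image": image_backend,
    "speech": speech_backend,
    "video": video_backend,
}


def unified_generate(prompt: str, output_modality: str) -> str:
    """Single public entry point; the modality decides which stack does the work."""
    try:
        backend = BACKENDS[output_modality]
    except KeyError:
        raise ValueError(f"unsupported modality: {output_modality!r}") from None
    return backend(prompt)


print(unified_generate("a cat surfing at sunset", "video"))
```

Weight-sharing can then grow behind the interface: two entries in the table can point at the same model without any caller noticing.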
Where it falls down
- Long-duration video. Models that produce 1-minute clips struggle to maintain identity, scene consistency, and physics across many minutes. The token count grows linearly with duration, and full self-attention compute grows quadratically with the token count.
- Cross-modal reasoning under load. "Listen to the audio, watch the video, then answer this text question." Frontier models can do it; open models are still wobbly.
- Fine-grained audio understanding. Music transcription, speaker diarisation across long meetings, sub-second event localisation - specialist models still win, often by large margins.
- Robotics and embodied use. "World simulator" video models do not yet ground predictions in actions accurately enough for closed-loop control. Diffusion Policy and RT-X-class action models are the parallel research thread.
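The scaling argument in the long-duration bullet can be made concrete with a small calculation. It assumes a fixed token rate per second of video and full (dense) self-attention; sparse or windowed attention would change the exponent.

```python
def attention_scaling(duration_s: float, base_duration_s: float = 60.0) -> tuple[float, float]:
    """Return (token growth, attention-pair growth) relative to a base clip length.

    Tokens scale linearly with duration; the number of query-key pairs in
    dense self-attention scales with the square of the token count.
    """
    token_ratio = duration_s / base_duration_s
    return token_ratio, token_ratio ** 2


for minutes in (1, 2, 5, 10):
    tokens_x, attention_x = attention_scaling(minutes * 60)
    print(f"{minutes:>2} min: {tokens_x:.0f}x tokens, {attention_x:.0f}x attention pairs")
```

Going from a 1-minute clip to a 10-minute clip means 10x the tokens but 100x the attention pairs per layer under these assumptions.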
Further reading
- Robust Speech Recognition via Large-Scale Weak Supervision - the Whisper paper, Radford et al 2022.
- Revisiting Feature Prediction for Learning Visual Representations from Video - the V-JEPA paper, Bardes et al 2024.
- V-JEPA: The next step toward advanced machine intelligence (Meta AI blog) - readable companion to the paper.
- Veo 3.1 (Google DeepMind) - the current frontier text-to-video model with native audio.
- Simple and Controllable Music Generation - the MusicGen paper.
- facebookresearch/audiocraft - reference implementation for MusicGen, AudioGen, EnCodec.
- 4M: Massively Multimodal Masked Modeling - one of the cleanest any-to-any architectures with open weights.
- Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks - Lu et al 2022, the all-in-one design at scale.