Reference notes.
Multimodal AI systems process and generate multiple types of data — text, images, audio, video — often in combination. This enables richer understanding and more natural interaction.
Modalities
- Text — Natural language, code
- Vision — Images, screenshots, diagrams
- Audio — Speech, music, sounds
- Video — Sequences of frames with optional audio
- 3D — Point clouds, meshes, scenes
- Structured data — Tables, graphs, time series
Vision-Language Models
Capabilities
- Image understanding — Describe, analyse, answer questions about images
- GUI understanding — Parse screenshots into UI elements for agentic control (e.g. OmniParser)
- OCR — Extract text from images
- Chart/diagram analysis — Interpret visual data
- Visual reasoning — Solve problems using visual information
- Image generation — Create images from text descriptions
Models
See Foundation Models for the latest model versions.
Commercial:
- GPT-5.5 / GPT-5.2 (OpenAI) — Strong general vision, native multimodal
- Claude Opus 4.7 / Sonnet 4.6 (Anthropic) — Document and chart understanding, improved vision in 4.7
- Gemini 3.1 Pro / Flash (Google) — Native multimodal, Deep Think, long video (1M context)
Open-source:
- LLaVA — Popular open VLM family
- Qwen2.5-VL (Alibaba) — Strong multilingual, high resolution
- InternVL 2.5 — Competitive open model
- Llama 4 Scout / Maverick (Meta) — Native multimodal MoE
- Gemma 3 (Google) — Open weights, vision support
- PaliGemma 2 (Google) — Research-friendly
Image Generation
Diffusion models:
- DALL-E 3 (OpenAI) — Text-to-image, good prompt following
- Midjourney — Artistic, high aesthetic quality
- Flux 2 (Black Forest Labs) — Current open-weights frontier, ~2x faster than Flux pro
- Stable Diffusion 3.5 (Stability AI) — Most widely deployed open weights, mature ecosystem
- Imagen 3 (Google) — High fidelity
Key concepts:
- Text-to-image generation
- Image-to-image (style transfer, edits)
- Inpainting (fill in regions)
- ControlNet — Guided generation (pose, edges)
Audio Models
Speech-to-Text (STT / ASR)
| Model | Provider | Notes |
|---|---|---|
| Whisper | OpenAI | Open-source, multilingual |
| Deepgram | Deepgram | Fast, streaming |
| AssemblyAI | AssemblyAI | Speaker diarisation |
| Google Speech | Google | Many languages |
Text-to-Speech (TTS)
| Model | Provider | Notes |
|---|---|---|
| ElevenLabs | ElevenLabs | Realistic, voice cloning |
| Cartesia Sonic | Cartesia | Fast, low latency |
| OpenAI TTS | OpenAI | Natural voices |
| XTTS | Coqui (archived) | Open-source |
| Bark | Suno | Open-source, expressive |
Speech-to-Speech
Real-time conversation without an intermediate text step:
- GPT-4o Realtime API
- Gemini Live
- Moshi (Kyutai, 2024) — Open unified speech-text foundation model. Treats dialogue as speech-to-speech autoregression over interleaved audio tokens, achieving ~200ms end-to-end latency. Streams continuously rather than alternating turns.
- End-to-end large audio-language models (LALMs) increasingly retain prosody, emotion, speaker identity, and ambient sound rather than collapsing audio to text first.
Audio Generation
- Music: Suno, Udio, MusicGen
- Sound effects: AudioGen, ElevenLabs
Video Models
Video Understanding
- Process and answer questions about video content
- Gemini can process hours of video natively
- GPT-5.x and Claude Opus 4.7 can analyse video frames
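Models that accept images rather than raw video are typically given a subset of frames. The following is a minimal sketch of uniform frame sampling, assuming the clip has already been decoded and its frame count is known. The function name `sample_frame_indices` and the `max_frames` budget are illustrative, not part of any provider's API.

```python
def sample_frame_indices(total_frames: int, max_frames: int) -> list[int]:
    """Pick up to max_frames evenly spaced frame indices from a clip."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    # Split the clip into max_frames equal segments and take each segment's centre.
    step = total_frames / max_frames
    return [int(step * i + step / 2) for i in range(max_frames)]


# Example: a 100-frame clip reduced to 4 frames.
indices = sample_frame_indices(100, 4)
print(indices)
```

The selected frames would then be sent as individual images alongside the question. Uniform sampling is only one option; scene-change detection or denser sampling around events may suit some clips better.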
Video Generation
Rapidly evolving area:
- Sora 2 (OpenAI) — High fidelity, physics, synced audio. Consumer app shut down April 2026; API to follow Sept 2026
- Veo 3 / 3.1 (Google DeepMind, 2025) — Native synchronised audio: ambient sound, dialogue lip-synced to visible mouth movement, and effects matched to on-screen action — all from a single text prompt.
- Runway Gen-3 Alpha — Creative tools
- Kling 3.0 (Kuaishou) — Realistic motion, native audio
- Seedance 2.0 (ByteDance) — Pioneered unified joint audio-video generation, later followed by Veo 3.1 and Kling 3.0
- Pika — Short clips
- Stable Video Diffusion — Open weights
By early 2026 four of six major video models (Sora 2, Veo 3.1, Kling 3.0, Seedance 2.0) generate audio natively. The transition from silent video + post-production audio to single-pass joint generation is the defining 2025-2026 shift in this space.
World Models
Generative models that learn an internal simulator of physics, dynamics, and 3D scene structure. Distinct from clip-generation models because they aim to be navigable and consistent over time, supporting agent training and robotics rather than passive video.
- Genie 3 (DeepMind, Aug 2025) — First real-time interactive general-purpose world model. Generates navigable 3D worlds at 24 fps and 720p that stay consistent for several minutes, with a promptable event module (change weather/terrain mid-scene). Autoregressive, with no hard-coded physics engine.
- V-JEPA 2 (Meta, 2025) — Video Joint-Embedding Predictive Architecture trained on internet-scale video. It predicts in embedding space rather than pixel space and supports zero-shot robot planning on unseen objects/environments.
- GAIA-3, Marble, Interactive World Simulator (2026) — Workflow-shaped variants extending world models toward evaluation infrastructure and robotics-training pipelines.
Robotics Foundation Models (VLAs)
Vision-Language-Action models extend LLMs by emitting low-level motor commands alongside text/vision understanding.
- π0 / π0.7 (Physical Intelligence, 2024-2026) — Generalist policy trained on robot trajectories from 8 embodiments. π0.6 adds RL-from-experience fine-tuning to lift real-world success rates. π0.7 is steerable via language, metadata, and visual subgoals, and matches specialist policies on tasks such as making coffee, folding laundry, and assembling boxes.
- RT-2 / RT-X (Google DeepMind) — Earlier VLAs co-fine-tuned on web-scale vision-language data and robot trajectories from the cross-embodiment Open X-Embodiment dataset.
- OpenVLA, Octo — Open-weight VLA baselines.
Architecture Patterns
Early Fusion
Combine modalities at input level. Process jointly from the start.
- Native multimodal (Gemini)
- Single unified model
Late Fusion
Process modalities separately, combine at decision/output level.
- Specialist encoders
- Cross-attention between modalities
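The difference between the two fusion points can be illustrated with a toy sketch. It assumes each modality has already been turned into either a token sequence (early fusion) or a dictionary of per-class scores (late fusion). The function names and the weighted-average rule are illustrative choices, not a description of any particular model.

```python
def early_fusion(text_tokens, image_tokens):
    """Early fusion: build one joint sequence for a single model to process."""
    return list(image_tokens) + list(text_tokens)


def late_fusion(scores_by_modality, weights=None):
    """Late fusion: combine each modality's class scores by weighted average."""
    names = list(scores_by_modality)
    weights = weights or {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    labels = scores_by_modality[names[0]].keys()
    return {
        label: sum(weights[n] * scores_by_modality[n][label] for n in names) / total
        for label in labels
    }


# Early fusion: image and text tokens share one input sequence.
joint = early_fusion(["what", "is", "this"], ["<img_0>", "<img_1>"])

# Late fusion: separate vision and audio classifiers, merged at the decision level.
fused = late_fusion({
    "vision": {"cat": 0.8, "dog": 0.2},
    "audio": {"cat": 0.4, "dog": 0.6},
})
print(joint)
print(max(fused, key=fused.get))
```

In the late-fusion case each encoder can be trained and swapped independently, at the cost of the modalities never attending to each other before the final decision.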
Cross-modal Embeddings
Map different modalities to shared embedding space.
- CLIP — Text and images
- ImageBind — Six modalities
- CLAP — Text and audio
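Once text and images live in the same space, zero-shot classification reduces to a nearest-neighbour lookup by cosine similarity. The sketch below uses hand-written 3-dimensional toy vectors in place of real encoder outputs; in practice the vectors would come from a model such as CLIP, and the helper names here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_label(image_vec, text_vecs):
    """Return the caption whose embedding is closest to the image embedding."""
    return max(text_vecs, key=lambda label: cosine(image_vec, text_vecs[label]))


# Toy embeddings (assumed values, standing in for real encoder outputs).
text_vecs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
}
image_vec = [0.8, 0.2, 0.1]

print(zero_shot_label(image_vec, text_vecs))
```

The same lookup works in the other direction (text query, image candidates), which is how shared embedding spaces support cross-modal retrieval.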
Adapter-based
Add modality adapters to existing LLMs.
- LLaVA — Vision adapter for Llama
- Cheaper than training from scratch
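The adapter idea can be sketched as a small projection that maps vision-encoder features into the LLM's embedding width, so image patches can be placed in the same sequence as text tokens. This is a toy pure-Python version under stated assumptions: a single linear layer, made-up dimensions, and hypothetical function names. Real adapters are trained and may use an MLP or other modules instead.

```python
def project(patch_features, weight, bias):
    """Linear adapter: map each vision feature (d_vision) to the LLM width (d_model).

    weight has shape [d_model][d_vision]; bias has length d_model.
    """
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in patch_features
    ]


def build_llm_input(patch_features, text_embeddings, weight, bias):
    """Prepend projected image patches to the text token embeddings."""
    return project(patch_features, weight, bias) + text_embeddings


# Toy sizes: d_vision = 2, d_model = 3 (assumed for illustration).
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
patches = [[1.0, 2.0]]
text = [[0.0, 0.0, 0.0]]

sequence = build_llm_input(patches, text, weight, bias)
print(sequence)
```

Only `weight` and `bias` would need training in this setup, which is why the approach is cheaper than training a multimodal model from scratch.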
Applications
Document Understanding
- Parse complex documents (PDFs, forms)
- Extract information from screenshots
- Analyse charts and diagrams
Creative Tools
- Image editing with natural language
- Video generation and editing
- Music and audio creation
Accessibility
- Image descriptions for visually impaired users
- Speech interfaces
- Real-time translation
Robotics
- Visual perception for navigation
- Natural language instructions
- Multimodal world understanding
Healthcare
- Medical image analysis
- Radiology reports from scans
- Surgical video analysis
Challenges
Alignment Across Modalities
Ensuring semantic alignment between different representations.
Hallucinations in Vision
Models may misidentify objects or fabricate visual details.
Temporal Understanding
Video and audio require understanding time, sequencing, and causality.
Evaluation
Multimodal benchmarks are less mature than text-only ones.
Compute Requirements
Multiple modalities multiply data and processing needs.
Evaluation Benchmarks
Vision-Language
- VQAv2 — Visual question answering
- MMMU — Multimodal understanding
- ChartQA — Chart understanding
- DocVQA — Document understanding
- TextVQA — Text in images
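VQA-style benchmarks commonly score a prediction against several human answers per question. A simplified sketch of that soft-accuracy rule follows, assuming ten human answers per question and only basic normalisation (lowercasing and trimming). Official evaluation scripts apply further answer normalisation, so this should be read as an approximation.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(number of matching annotators / 3, 1)."""
    def normalise(s: str) -> str:
        return s.strip().lower()

    pred = normalise(prediction)
    matches = sum(1 for answer in human_answers if normalise(answer) == pred)
    # An answer given by at least three annotators counts as fully correct.
    return min(matches / 3.0, 1.0)


# Toy example: two annotators said "red", eight said "dark red".
answers = ["red"] * 2 + ["dark red"] * 8
print(vqa_accuracy("red", answers))
print(vqa_accuracy("Dark Red", answers))
```

The partial credit reflects genuine disagreement among annotators rather than treating a single reference answer as the only correct one.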
Audio
- LibriSpeech — ASR benchmark
- CommonVoice — Multilingual ASR
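ASR benchmarks such as these are usually reported as word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. A minimal sketch, assuming whitespace tokenisation and no text normalisation (real evaluations typically normalise casing and punctuation first):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.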
APIs and Tools
Multimodal LLMs
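# OpenAI Vision
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
image_url = "https://example.com/photo.jpg"  # placeholder: any publicly reachable image URL

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{
        "role": "user",
        "content": [
            # A single user message can mix text and image parts.
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)
print(response.choices[0].message.content)
Image Generation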
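# DALL-E 3
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.images.generate(
    model="dall-e-3",
    prompt="A sunset over mountains",
    size="1024x1024",
)
print(response.data[0].url)  # URL of the generated image
Resources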
- CLIP Paper — Learning visual concepts from text
- LLaVA Paper — Visual instruction tuning
- Whisper Paper — Robust speech recognition
- ImageBind Paper — Unified embedding space
- Gemini Technical Report
- OpenAI Vision Guide