Reference notes.

Multimodal AI systems process and generate multiple types of data — text, images, audio, video — often in combination. This enables richer understanding and more natural interaction.

Modalities

  • Text — Natural language, code
  • Vision — Images, screenshots, diagrams
  • Audio — Speech, music, sounds
  • Video — Sequences of frames with optional audio
  • 3D — Point clouds, meshes, scenes
  • Structured data — Tables, graphs, time series

Vision-Language Models

Capabilities

  • Image understanding — Describe, analyse, answer questions about images
  • GUI understanding — Parse screens into interactable elements for agentic control (e.g. OmniParser)
  • OCR — Extract text from images
  • Chart/diagram analysis — Interpret visual data
  • Visual reasoning — Solve problems using visual information
  • Image generation — Create images from text descriptions

Models

See Foundation Models for the latest model versions.

Commercial:

  • GPT-5.5 / GPT-5.2 (OpenAI) — Strong general vision, native multimodal
  • Claude Opus 4.7 / Sonnet 4.6 (Anthropic) — Document and chart understanding, improved vision in 4.7
  • Gemini 3.1 Pro / Flash (Google) — Native multimodal, Deep Think, long video (1M context)

Open-source:

  • LLaVA — Popular open VLM family
  • Qwen2.5-VL (Alibaba) — Strong multilingual, high resolution
  • InternVL 2.5 — Competitive open model
  • Llama 4 Scout / Maverick (Meta) — Native multimodal MoE
  • Gemma 3 (Google) — Open weights, vision support
  • PaliGemma 2 (Google) — Research-friendly

Image Generation

Diffusion models:

  • DALL-E 3 (OpenAI) — Text-to-image, good prompt following
  • Midjourney — Artistic, high aesthetic quality
  • Flux 2 (Black Forest Labs) — Current open-weights frontier, ~2x faster than Flux pro
  • Stable Diffusion 3.5 (Stability AI) — Most widely deployed open weights, mature ecosystem
  • Imagen 3 (Google) — High fidelity

Key concepts:

  • Text-to-image generation
  • Image-to-image (style transfer, edits)
  • Inpainting (fill in regions)
  • ControlNet — Guided generation (pose, edges)
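As a rough illustration of the inpainting idea, the sketch below shows only the final mask-based compositing step: pixels outside the mask are kept from the original and pixels inside it are taken from newly generated content. The function name `composite_inpaint` and the tiny integer "images" are hypothetical, and the generation step itself (the diffusion model) is assumed to have run already.

```python
def composite_inpaint(original, generated, mask):
    """Keep original pixels where mask is 0, take generated pixels where mask is 1."""
    return [
        [gen if m else orig for orig, gen, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


# Toy 2x3 single-channel "images"; real images would be arrays of RGB values.
original = [[10, 10, 10], [10, 10, 10]]
generated = [[99, 98, 97], [96, 95, 94]]
mask = [[0, 1, 0], [0, 1, 1]]  # 1 marks the region to fill in

result = composite_inpaint(original, generated, mask)
```

Real pipelines usually apply the mask inside the denoising loop as well, so that the filled region blends with its surroundings; this sketch leaves that part out.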

Audio Models

Speech-to-Text (STT / ASR)

Model         | Provider   | Notes
Whisper       | OpenAI     | Open-source, multilingual
Deepgram      | Deepgram   | Fast, streaming
AssemblyAI    | AssemblyAI | Speaker diarisation
Google Speech | Google     | Many languages

Text-to-Speech (TTS)

Model          | Provider         | Notes
ElevenLabs     | ElevenLabs       | Realistic, voice cloning
Cartesia Sonic | Cartesia         | Fast, low latency
OpenAI TTS     | OpenAI           | Natural voices
XTTS           | Coqui (archived) | Open-source
Bark           | Suno             | Open-source, expressive

Speech-to-Speech

Real-time conversation without an intermediate text step:

  • GPT-4o Realtime API
  • Gemini Live
  • Moshi (Kyutai, 2024) — Open unified speech-text foundation model. Treats dialogue as speech-to-speech autoregression over interleaved audio tokens, achieving ~200ms end-to-end latency. Streams continuously rather than alternating turns.
  • End-to-end large audio-language models (LALMs) increasingly retain prosody, emotion, speaker identity, and ambient sound rather than collapsing audio to text first.

Audio Generation

  • Music: Suno, Udio, MusicGen
  • Sound effects: AudioGen, ElevenLabs

Video Models

Video Understanding

  • Process and answer questions about video content
  • Gemini can process hours of video natively
  • GPT-5.x and Claude Opus 4.7 can analyse video frames
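When a model takes images rather than video files, one common approach is to sample a handful of frames and send them as separate images. The sketch below picks evenly spaced frame indices; the function name `sample_frame_indices` and the midpoint-of-segment strategy are illustrative assumptions, not any provider's documented method.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across a clip."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Fewer frames than requested samples: use every frame.
        return list(range(total_frames))
    # Split the clip into num_samples equal segments and take each midpoint.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    # A 10-second clip at 30 fps, reduced to 5 frames.
    print(sample_frame_indices(300, 5))
```

Even sampling is cheap but can miss short events; scene-change detection is a frequent alternative when the budget of frames is small.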

Video Generation

Rapidly evolving area:

  • Sora 2 (OpenAI) — High fidelity, physics, synced audio. Consumer app shut down April 2026; API to follow Sept 2026
  • Veo 3 / 3.1 (Google DeepMind, 2025) — Native synchronised audio: ambient sound, dialogue lip-synced to visible mouth movement, and effects matched to on-screen action — all from a single text prompt.
  • Runway Gen-3 Alpha — Creative tools
  • Kling 3.0 (Kuaishou) — Realistic motion, native audio
  • Seedance 2.0 (ByteDance) — Pioneered unified joint audio-video generation, later followed by Veo 3.1 and Kling 3.0
  • Pika — Short clips
  • Stable Video Diffusion — Open weights

By early 2026 four of six major video models (Sora 2, Veo 3.1, Kling 3.0, Seedance 2.0) generate audio natively. The transition from silent video + post-production audio to single-pass joint generation is the defining 2025-2026 shift in this space.

World Models

Generative models that learn an internal simulator of physics, dynamics, and 3D scene structure. Distinct from clip-generation models because they aim to be navigable and consistent over time, supporting agent training and robotics rather than passive video.

  • Genie 3 (DeepMind, Aug 2025) — First real-time interactive general-purpose world model. Generates navigable 3D worlds at 24fps, 720p, several minutes of consistency, with a promptable event module (change weather/terrain mid-scene). Auto-regressive, no hard-coded physics engine.
  • V-JEPA 2 (Meta, 2025) — Video Joint-Embedding Predictive Architecture trained on internet-scale video, predicts in embedding space rather than pixel space. Supports zero-shot robot planning on unseen objects/environments.
  • GAIA-3, Marble, Interactive World Simulator (2026) — Workflow-shaped variants extending world models toward evaluation infrastructure and robotics-training pipelines.

Robotics Foundation Models (VLAs)

Vision-Language-Action models extend LLMs by emitting low-level motor commands alongside text/vision understanding.

  • π0 / π0.7 (Physical Intelligence, 2024-2026) — Generalist policy trained on robot trajectories from 8 embodiments. π0.7 is steerable via language, metadata, and visual subgoals; matches specialist policies on tasks like making coffee, folding laundry, and assembling boxes. π0.6 adds RL-from-experience fine-tuning to lift real-world success rates.
  • RT-2 / RT-X (Google DeepMind) — Earlier VLAs co-fine-tuned on web-scale vision-language data and robot trajectories from the cross-embodiment Open X-Embodiment dataset.
  • OpenVLA, Octo — Open-weight VLA baselines.
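One way a language model can emit motor commands is to discretise each continuous action dimension into a fixed number of bins and treat the bin index as a token, which is the approach RT-2 takes with 256 bins per dimension. The sketch below illustrates that idea only; the action range of [-1, 1] and the function names are assumptions, and other VLAs produce continuous actions instead.

```python
NUM_BINS = 256  # bins per action dimension, as in RT-2


def action_to_tokens(action, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map each continuous action value to an integer bin index."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        fraction = (clipped - low) / (high - low)
        # The top edge of the range falls into the last bin.
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def tokens_to_action(tokens, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map bin indices back to the centre value of each bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


# Hypothetical 3-dimensional action, e.g. an end-effector displacement.
tokens = action_to_tokens([-1.0, 0.0, 1.0])
decoded = tokens_to_action(tokens)
```

The round trip loses at most half a bin width per dimension, which is the precision cost of representing actions as discrete tokens.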

Architecture Patterns

Early Fusion

Combine modalities at input level. Process jointly from the start.

  • Native multimodal (Gemini)
  • Single unified model
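A toy sketch of early fusion, under the assumption that both modalities are first mapped to vectors of the same width and then joined into one sequence for a single model. All sizes, weights, and function names here are hypothetical, and random matrices stand in for learned parameters.

```python
import random

EMBED_DIM = 4  # toy width, chosen for readability


def make_matrix(rows, cols, rng):
    """Random matrix standing in for learned weights or input data."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def embed_text(token_ids, embedding_table):
    """Look up one vector per text token."""
    return [embedding_table[t] for t in token_ids]


def embed_patches(patches, projection):
    """Linearly project each flattened image patch to EMBED_DIM values."""
    return [
        [sum(p * w for p, w in zip(patch, row)) for row in projection]
        for patch in patches
    ]


def early_fusion(image_vectors, text_vectors):
    """Join both modalities into one sequence for a single shared model."""
    return image_vectors + text_vectors


rng = random.Random(0)
VOCAB_SIZE, PATCH_SIZE = 10, 6
embedding_table = make_matrix(VOCAB_SIZE, EMBED_DIM, rng)
projection = make_matrix(EMBED_DIM, PATCH_SIZE, rng)  # one weight row per output dim

patches = make_matrix(3, PATCH_SIZE, rng)  # three fake image patches
token_ids = [1, 4, 7]                      # three fake text tokens

sequence = early_fusion(
    embed_patches(patches, projection),
    embed_text(token_ids, embedding_table),
)
```

The point of the sketch is that after embedding, the downstream model sees one sequence and does not need separate pathways per modality.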

Late Fusion

Process modalities separately, combine at decision/output level.

  • Specialist encoders
  • Cross-attention between modalities
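Decision-level late fusion can be sketched as a weighted average of class probabilities produced by separate per-modality models. The function name `late_fusion`, the modality names, and the probabilities are made up for illustration.

```python
def late_fusion(probs_by_modality, weights=None):
    """Weighted average of per-modality class probabilities."""
    names = list(probs_by_modality)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal trust by default
    total = sum(weights[name] for name in names)
    num_classes = len(probs_by_modality[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        for i, p in enumerate(probs_by_modality[name]):
            fused[i] += weights[name] * p / total
    return fused


# Hypothetical outputs of two specialist classifiers over two classes.
predictions = {"vision": [0.6, 0.4], "audio": [0.2, 0.8]}
fused = late_fusion(predictions)
best_class = max(range(len(fused)), key=fused.__getitem__)
```

Because each specialist runs independently, a missing modality can be handled by dropping its entry, at the cost of losing any interaction between modalities before the decision.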

Cross-modal Embeddings

Map different modalities to shared embedding space.

  • CLIP — Text and images
  • ImageBind — Six modalities
  • CLAP — Text and audio
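Once text and images live in the same embedding space, cross-modal retrieval reduces to a nearest-neighbour search. The sketch below uses cosine similarity over hand-written three-dimensional vectors; the file names and embedding values are invented, and a real system would obtain them from an encoder such as CLIP.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(text_embedding, image_embeddings):
    """Return the name of the image closest to the text query."""
    return max(
        image_embeddings,
        key=lambda name: cosine(text_embedding, image_embeddings[name]),
    )


# Invented embeddings standing in for encoder outputs.
IMAGE_EMBEDDINGS = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.1, 0.9, 0.1],
    "beach.jpg": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend this encodes the text "a photo of a dog"
best_match = retrieve(query, IMAGE_EMBEDDINGS)
```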

Adapter-based

Add modality adapters to existing LLMs.

  • LLaVA — Vision adapter (CLIP encoder plus projection layer) for Llama-family LLMs
  • Cheaper than training from scratch
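The cost argument can be made concrete with a parameter count. The sketch below assumes a single linear projection from vision features to the LLM's embedding width, with the encoder and LLM kept frozen; the dimensions (1024 and 4096) and the 7-billion-parameter LLM size are illustrative assumptions, not figures from this document.

```python
def linear_params(in_dim: int, out_dim: int) -> int:
    """Parameter count of one linear layer: weight matrix plus bias."""
    return in_dim * out_dim + out_dim


VISION_DIM = 1024            # assumed width of the vision encoder output
LLM_DIM = 4096               # assumed width of the LLM token embeddings
LLM_PARAMS = 7_000_000_000   # assumed size of the frozen LLM

adapter_params = linear_params(VISION_DIM, LLM_DIM)
trainable_fraction = adapter_params / (adapter_params + LLM_PARAMS)

if __name__ == "__main__":
    print(f"adapter parameters: {adapter_params:,}")
    print(f"trainable share:    {trainable_fraction:.4%}")
```

Under these assumptions the adapter is a tiny fraction of the total, which is why training it alone is far cheaper than training a multimodal model from scratch. Setups that also fine-tune the LLM would change the numbers.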

Applications

Document Understanding

  • Parse complex documents (PDFs, forms)
  • Extract information from screenshots
  • Analyse charts and diagrams

Creative Tools

  • Image editing with natural language
  • Video generation and editing
  • Music and audio creation

Accessibility

  • Image descriptions for visually impaired users
  • Speech interfaces
  • Real-time translation

Robotics

  • Visual perception for navigation
  • Natural language instructions
  • Multimodal world understanding

Healthcare

  • Medical image analysis
  • Radiology reports from scans
  • Surgical video analysis

Challenges

Alignment Across Modalities

Ensuring semantic alignment between different representations.

Hallucinations in Vision

Models may misidentify objects or fabricate visual details.

Temporal Understanding

Video and audio require understanding time, sequencing, and causality.

Evaluation

Multimodal benchmarks are less mature than text-only ones.

Compute Requirements

Multiple modalities multiply data and processing needs.

Evaluation Benchmarks

Vision-Language

  • VQAv2 — Visual question answering
  • MMMU — Multimodal understanding
  • ChartQA — Chart understanding
  • DocVQA — Document understanding
  • TextVQA — Text in images
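VQAv2 scores an answer by agreement with ten human annotators: an answer given by at least three of them counts as fully correct. The sketch below implements the commonly cited simplified form, min(matches / 3, 1); the official evaluation script also averages over annotator subsets and applies heavier answer normalisation, both omitted here.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(number of matching humans / 3, 1)."""
    normalised = prediction.strip().lower()
    matches = sum(
        1 for answer in human_answers if answer.strip().lower() == normalised
    )
    return min(matches / 3, 1.0)


# Invented annotations for one question: two say "red", eight say "dark red".
answers = ["red"] * 2 + ["dark red"] * 8
score_partial = vqa_accuracy("red", answers)
score_full = vqa_accuracy("Dark Red", answers)
score_zero = vqa_accuracy("blue", answers)
```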

Audio

  • LibriSpeech — ASR benchmark
  • CommonVoice — Multilingual ASR
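ASR benchmarks such as LibriSpeech are usually reported as word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. A minimal sketch, assuming whitespace tokenisation and no text normalisation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Published results typically normalise case and punctuation before scoring, so raw numbers from this sketch may differ from leaderboard figures.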

APIs and Tools

Multimodal LLMs

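# OpenAI Vision: ask a question about an image referenced by URL
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
image_url = "https://example.com/photo.jpg"  # placeholder; replace with a real image URL

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}}
        ]
    }]
)
print(response.choices[0].message.content)  # the model's text answer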

Image Generation

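# DALL-E 3: generate one image from a text prompt
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",
    prompt="A sunset over mountains",
    size="1024x1024"
)
print(response.data[0].url)  # URL of the generated image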

Resources