Multimodal AI systems process and generate multiple types of data — text, images, audio, video — often in combination. This enables richer understanding and more natural interaction.

Modalities

  • Text — Natural language, code
  • Vision — Images, screenshots, diagrams
  • Audio — Speech, music, sounds
  • Video — Sequences of frames with optional audio
  • 3D — Point clouds, meshes, scenes
  • Structured data — Tables, graphs, time series

Vision-Language Models

Capabilities

  • Image understanding — Describe, analyse, answer questions about images
  • OCR — Extract text from images
  • Chart/diagram analysis — Interpret visual data
  • Visual reasoning — Solve problems using visual information
  • Image generation — Create images from text descriptions

Models

Commercial:

  • GPT-4o, GPT-4V (OpenAI) — Strong general vision
  • Claude 3 (Anthropic) — Document and chart understanding
  • Gemini 1.5 (Google) — Native multimodal, long video

Open-source:

  • LLaVA — Popular open VLM family
  • Qwen-VL — Strong multilingual
  • InternVL — Competitive open model
  • PaliGemma (Google) — Research-friendly

Image Generation

Diffusion models:

  • DALL-E 3 (OpenAI) — Text-to-image, good prompt following
  • Midjourney — Artistic, high aesthetic quality
  • Stable Diffusion (Stability AI) — Open weights, customisable
  • Imagen (Google) — High fidelity

Key concepts:

  • Text-to-image generation
  • Image-to-image (style transfer, edits)
  • Inpainting (fill in regions; see the sketch after this list)
  • ControlNet — Guided generation (pose, edges)
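
A minimal inpainting sketch, assuming the Hugging Face diffusers library and a Stable Diffusion inpainting checkpoint (the checkpoint name, file names, and prompt are illustrative):

# Inpainting sketch with diffusers (checkpoint and file names are illustrative)
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.png").convert("RGB")  # original image
mask = Image.open("mask.png").convert("RGB")    # white pixels mark regions to fill

result = pipe(prompt="a wooden bench in a park",
              image=image, mask_image=mask).images[0]
result.save("inpainted.png")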

Audio Models

Speech-to-Text (STT / ASR)

  Model           Provider     Notes
  Whisper         OpenAI       Open-source, multilingual
  Deepgram        Deepgram     Fast, streaming
  AssemblyAI      AssemblyAI   Speaker diarisation
  Google Speech   Google       Many languages
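
A minimal sketch of hosted Whisper transcription via the OpenAI Python SDK (the audio file name is a placeholder):

# Speech-to-text with hosted Whisper (audio file name is a placeholder)
from openai import OpenAI

client = OpenAI()
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )
print(transcript.text)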

Text-to-Speech (TTS)

  Model            Provider     Notes
  ElevenLabs       ElevenLabs   Realistic, voice cloning
  Cartesia Sonic   Cartesia     Fast, low latency
  OpenAI TTS       OpenAI       Natural voices
  XTTS             Coqui        Open-source
  Bark             Suno         Open-source, expressive
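
A minimal TTS sketch using the OpenAI SDK (the voice name, input text, and output file are illustrative):

# Text-to-speech with the OpenAI SDK (voice and output file are illustrative)
from openai import OpenAI

client = OpenAI()
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Multimodal models are becoming the default interface."
)
with open("speech.mp3", "wb") as f:
    f.write(speech.content)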

Speech-to-Speech

Real-time conversation without an intermediate text step:

  • GPT-4o Realtime API
  • Gemini Live

Audio Generation

  • Music: Suno, Udio, MusicGen
  • Sound effects: AudioGen, ElevenLabs
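
A short text-to-music sketch, assuming the Hugging Face transformers MusicGen integration (model size and prompt are illustrative):

# Text-to-music with MusicGen via transformers (model size and prompt are illustrative)
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["calm lo-fi beat with soft piano"], padding=True, return_tensors="pt")
audio = model.generate(**inputs, max_new_tokens=256)  # roughly five seconds of audio

rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("music.wav", rate=rate, data=audio[0, 0].numpy())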

Video Models

Video Understanding

  • Process and answer questions about video content
  • Gemini 1.5 can process hours of video
  • GPT-4o can analyse sampled video frames (see the sketch below)
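
A rough sketch of frame-based video analysis with GPT-4o, assuming OpenCV for frame extraction (the file name, sampling rate, and frame cap are placeholders):

# Sample frames from a video and ask GPT-4o about them (file name and sampling are placeholders)
import base64
import cv2
from openai import OpenAI

client = OpenAI()

frames = []
video = cv2.VideoCapture("clip.mp4")
while True:
    ok, frame = video.read()
    if not ok:
        break
    frames.append(frame)
video.release()

# Keep every 30th frame, capped at 20 images, to stay within context limits
content = [{"type": "text", "text": "Summarise what happens in this video."}]
for frame in frames[::30][:20]:
    _, buffer = cv2.imencode(".jpg", frame)
    b64 = base64.b64encode(buffer).decode("utf-8")
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}]
)
print(response.choices[0].message.content)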

Video Generation

Rapidly evolving area:

  • Sora (OpenAI) — High fidelity, physics understanding
  • Runway Gen-3 — Creative tools
  • Pika — Short clips
  • Stable Video Diffusion — Open weights

Architecture Patterns

Early Fusion

Combine modalities at the input level and process them jointly from the start.

  • Native multimodal (Gemini)
  • Single unified model

Late Fusion

Process each modality separately, then combine at the decision/output level; a cross-attention sketch follows the list below.

  • Specialist encoders
  • Cross-attention between modalities
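
A minimal late-fusion sketch in PyTorch, assuming pre-computed features from separate text and vision encoders (all dimensions are illustrative):

# Late fusion via cross-attention between pre-computed text and image features (dims illustrative)
import torch
import torch.nn as nn

d_model = 512
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
decision_head = nn.Linear(d_model, 2)  # e.g. a binary decision over the fused representation

text_feats = torch.randn(1, 32, d_model)    # from a text encoder: (batch, tokens, dim)
image_feats = torch.randn(1, 196, d_model)  # from a vision encoder: (batch, patches, dim)

# Text tokens attend to image patches; the pooled result feeds the decision head
fused, _ = cross_attn(query=text_feats, key=image_feats, value=image_feats)
logits = decision_head(fused.mean(dim=1))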

Cross-modal Embeddings

Map different modalities to shared embedding space.

  • CLIP — Text and images
  • ImageBind — Six modalities
  • CLAP — Text and audio
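
For instance, CLIP scores image-text similarity in its shared space; a sketch using the Hugging Face transformers CLIP integration (the image path and captions are placeholders):

# Image-text similarity in CLIP's shared embedding space (image path and captions are placeholders)
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                   images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # how well the image matches each caption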

Adapter-based

Add modality adapters to existing LLMs.

  • LLaVA — Vision adapter for Llama
  • Cheaper than training from scratch
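
A toy sketch of the adapter idea: project frozen vision-encoder features into the LLM's embedding space and prepend them to the text tokens (dimensions and module shapes are illustrative, not LLaVA's exact configuration):

# LLaVA-style adapter: map vision features into the LLM embedding space (dims illustrative)
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096
projector = nn.Sequential(nn.Linear(vision_dim, llm_dim),
                          nn.GELU(),
                          nn.Linear(llm_dim, llm_dim))

patch_feats = torch.randn(1, 576, vision_dim)  # frozen vision encoder output
text_embeds = torch.randn(1, 64, llm_dim)      # LLM token embeddings for the prompt

visual_tokens = projector(patch_feats)         # only the projector needs training
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # sequence fed to the LLM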

Applications

Document Understanding

  • Parse complex documents (PDFs, forms)
  • Extract information from screenshots
  • Analyse charts and diagrams

Creative Tools

  • Image editing with natural language
  • Video generation and editing
  • Music and audio creation

Accessibility

  • Image descriptions for visually impaired users
  • Speech interfaces
  • Real-time translation

Robotics

  • Visual perception for navigation
  • Natural language instructions
  • Multimodal world understanding

Healthcare

  • Medical image analysis
  • Radiology reports from scans
  • Surgical video analysis

Challenges

Alignment Across Modalities

Ensuring semantic alignment between different representations.

Hallucinations in Vision

Models may misidentify objects or fabricate visual details.

Temporal Understanding

Video and audio require understanding of time, sequencing, and causality.

Evaluation

Multimodal benchmarks are less mature than their text-only counterparts.

Compute Requirements

Multiple modalities multiply data and processing needs.

Evaluation Benchmarks

Vision-Language

  • VQAv2 — Visual question answering
  • MMMU — Multimodal understanding
  • ChartQA — Chart understanding
  • DocVQA — Document understanding
  • TextVQA — Text in images

Audio

  • LibriSpeech — ASR benchmark
  • CommonVoice — Multilingual ASR

APIs and Tools

Multimodal LLMs

# OpenAI Vision
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
image_url = "https://example.com/photo.jpg"  # placeholder: any accessible image URL

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}}
        ]
    }]
)
print(response.choices[0].message.content)

Image Generation

# DALL-E 3 (reuses the client initialised above)
response = client.images.generate(
    model="dall-e-3",
    prompt="A sunset over mountains",
    size="1024x1024"
)
print(response.data[0].url)  # URL of the generated image

Resources