Multimodal AI systems process and generate multiple types of data — text, images, audio, video — often in combination. This enables richer understanding and more natural interaction.
Modalities
- Text — Natural language, code
- Vision — Images, screenshots, diagrams
- Audio — Speech, music, sounds
- Video — Sequences of frames with optional audio
- 3D — Point clouds, meshes, scenes
- Structured data — Tables, graphs, time series
Vision-Language Models
Capabilities
- Image understanding — Describe, analyse, answer questions about images
- OCR — Extract text from images
- Chart/diagram analysis — Interpret visual data
- Visual reasoning — Solve problems using visual information
- Image generation — Create images from text descriptions
Models
Commercial:
- GPT-4o, GPT-4V (OpenAI) — Strong general vision
- Claude 3 (Anthropic) — Document and chart understanding
- Gemini 1.5 (Google) — Native multimodal, long video
Open-source:
- LLaVA — Popular open VLM family
- Qwen-VL — Strong multilingual support
- InternVL — Competitive open model
- PaliGemma (Google) — Research-friendly
Image Generation
Diffusion models:
- DALL-E 3 (OpenAI) — Text-to-image, good prompt following
- Midjourney — Artistic, high aesthetic quality
- Stable Diffusion (Stability AI) — Open weights, customisable
- Imagen (Google) — High fidelity
Key concepts:
- Text-to-image generation
- Image-to-image (style transfer, edits)
- Inpainting (fill in regions)
- ControlNet — Guided generation (pose, edges)
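As a concrete example of text-to-image generation with open weights, here is a minimal sketch using Stable Diffusion through the Hugging Face `diffusers` library; the checkpoint name, prompt, and the assumption of a CUDA GPU are illustrative rather than prescriptive.

```python
# Minimal text-to-image sketch with diffusers (assumes torch + diffusers are
# installed and a CUDA GPU is available; checkpoint and prompt are examples).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("A sunset over mountains, oil painting style").images[0]
image.save("sunset.png")
```

The same library also exposes image-to-image, inpainting, and ControlNet pipelines for the guided-generation concepts listed above.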
Audio Models
Speech-to-Text (STT / ASR)
| Model | Provider | Notes |
|---|---|---|
| Whisper | OpenAI | Open-source, multilingual |
| Deepgram | Deepgram | Fast, streaming |
| AssemblyAI | AssemblyAI | Speaker diarisation |
| Google Speech-to-Text | Google | Many languages |
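For instance, transcription with the open-source Whisper package takes only a few lines; this sketch assumes the `openai-whisper` package and ffmpeg are installed and that `audio.mp3` exists locally.

```python
# Speech-to-text with open-source Whisper ("audio.mp3" and the model size
# are placeholders; requires the openai-whisper package and ffmpeg).
import whisper

model = whisper.load_model("base")      # larger checkpoints trade speed for accuracy
result = model.transcribe("audio.mp3")  # language is auto-detected by default
print(result["text"])
```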
Text-to-Speech (TTS)
| Model | Provider | Notes |
|---|---|---|
| ElevenLabs | ElevenLabs | Realistic, voice cloning |
| Cartesia Sonic | Cartesia | Fast, low latency |
| OpenAI TTS | OpenAI | Natural voices |
| XTTS | Coqui | Open-source |
| Bark | Suno | Open-source, expressive |
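A hedged sketch of text-to-speech via the OpenAI API; the model and voice names are examples, so check the current documentation for available options.

```python
# Text-to-speech with the OpenAI API (model and voice names are examples).
from openai import OpenAI

client = OpenAI()
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Multimodal systems can speak as well as listen.",
)
speech.write_to_file("speech.mp3")  # save the returned audio to disk
```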
Speech-to-Speech
Real-time conversation with no intermediate text step:
- GPT-4o Realtime API
- Gemini Live
Audio Generation
- Music: Suno, Udio, MusicGen
- Sound effects: AudioGen, ElevenLabs
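As an example of open text-to-music generation, MusicGen can be run through Hugging Face `transformers`; the checkpoint, prompt, and token budget below are illustrative.

```python
# Text-to-music with MusicGen via transformers (checkpoint, prompt, and
# max_new_tokens are illustrative; roughly 50 tokens per second of audio).
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["lo-fi beat with warm piano"], padding=True, return_tensors="pt")
audio = model.generate(**inputs, max_new_tokens=256)
print(audio.shape)  # (batch, channels, samples)
```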
Video Models
Video Understanding
- Process and answer questions about video content
- Gemini 1.5 can process hours of video
- GPT-4o can analyse sampled video frames (see the sketch below)
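Because most chat-style vision APIs accept images rather than raw video, a common pattern is to sample frames and send them as a batch of images. A minimal sketch, assuming `opencv-python` and the OpenAI client; the file name, sampling rate, and frame cap are placeholders.

```python
# Sample roughly one frame per second from a video and ask GPT-4o about the
# frames (file name, sampling rate, and frame cap are placeholders).
import base64
import cv2
from openai import OpenAI

video = cv2.VideoCapture("clip.mp4")
fps = int(video.get(cv2.CAP_PROP_FPS)) or 30
frames = []
index = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    if index % fps == 0:  # keep about one frame per second
        success, buffer = cv2.imencode(".jpg", frame)
        frames.append(base64.b64encode(buffer).decode("utf-8"))
    index += 1
video.release()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [{"type": "text", "text": "Summarise what happens in this clip."}]
        + [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
            for b64 in frames[:10]  # cap how many frames are sent
        ],
    }],
)
print(response.choices[0].message.content)
```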
Video Generation
Rapidly evolving area:
- Sora (OpenAI) — High fidelity, physics understanding
- Runway Gen-3 — Creative tools
- Pika — Short clips
- Stable Video Diffusion — Open weights
Architecture Patterns
Early Fusion
Combine modalities at the input level and process them jointly from the start.
- Native multimodal (Gemini)
- Single unified model
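A toy PyTorch sketch of the early-fusion idea: image patches and text tokens are projected into one embedding space and processed as a single sequence by a shared transformer (all names, dimensions, and layer counts are illustrative).

```python
# Toy early-fusion model: one shared transformer over a single sequence of
# image-patch embeddings and text-token embeddings (dimensions illustrative).
import torch
import torch.nn as nn

class EarlyFusionModel(nn.Module):
    def __init__(self, vocab_size=32000, patch_dim=768, d_model=512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.patch_proj = nn.Linear(patch_dim, d_model)  # map patches into token space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, patch_feats, token_ids):
        # Concatenate image and text into one sequence, then process jointly.
        seq = torch.cat([self.patch_proj(patch_feats), self.text_embed(token_ids)], dim=1)
        return self.backbone(seq)

model = EarlyFusionModel()
out = model(torch.randn(2, 196, 768), torch.randint(0, 32000, (2, 16)))
print(out.shape)  # (2, 196 + 16, 512)
```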
Late Fusion
Process modalities separately, then combine them at the decision or output level.
- Specialist encoders
- Cross-attention between modalities
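By contrast, a late-fusion sketch keeps the modalities in separate encoders and only combines their output features, here via cross-attention followed by a decision head (again purely illustrative shapes and modules).

```python
# Toy late fusion: assume two modality encoders already produced features,
# then fuse them with cross-attention (shapes and dimensions illustrative).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.head = nn.Linear(d_model, 2)  # e.g. a binary decision head

    def forward(self, text_feats, image_feats):
        # Text features attend over image features; fusion happens only here.
        fused, _ = self.cross_attn(query=text_feats, key=image_feats, value=image_feats)
        return self.head(fused.mean(dim=1))  # pool, then classify

fusion = CrossModalFusion()
logits = fusion(torch.randn(2, 16, 512), torch.randn(2, 196, 512))
print(logits.shape)  # (2, 2)
```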
Cross-modal Embeddings
Map different modalities into a shared embedding space.
- CLIP — Text and images
- ImageBind — Six modalities
- CLAP — Text and audio
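The sketch below uses CLIP through Hugging Face `transformers` to score how well each candidate caption matches an image in the shared embedding space; the checkpoint and image path are placeholders.

```python
# Score image-text similarity in CLIP's shared embedding space
# (checkpoint and image path are placeholders; requires transformers + Pillow).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
captions = ["a photo of a dog", "a photo of a cat"]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarity over the captions
print(dict(zip(captions, probs[0].tolist())))
```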
Adapter-based
Add modality adapters to existing LLMs.
- LLaVA — Vision adapter for Llama
- Cheaper than training from scratch
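A minimal sketch of the adapter idea: a small trainable projector maps features from a frozen vision encoder into the LLM's embedding space, so the LLM itself is not trained from scratch (module names and sizes below are hypothetical).

```python
# LLaVA-style adapter sketch: project frozen vision features into the LLM's
# token-embedding space; only the projector is trainable (sizes illustrative).
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # A two-layer MLP projector, similar in spirit to LLaVA-1.5.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats):
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_feats)

adapter = VisionAdapter()
visual_tokens = adapter(torch.randn(1, 576, 1024))
# visual_tokens are prepended to the text embeddings fed to the (frozen) LLM.
print(visual_tokens.shape)  # (1, 576, 4096)
```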
Applications
Document Understanding
- Parse complex documents (PDFs, forms)
- Extract information from screenshots
- Analyse charts and diagrams
Creative Tools
- Image editing with natural language
- Video generation and editing
- Music and audio creation
Accessibility
- Image descriptions for visually impaired users
- Speech interfaces
- Real-time translation
Robotics
- Visual perception for navigation
- Natural language instructions
- Multimodal world understanding
Healthcare
- Medical image analysis
- Radiology reports from scans
- Surgical video analysis
Challenges
Alignment Across Modalities
Ensuring semantic alignment between the representations of different modalities.
Hallucinations in Vision
Models may misidentify objects or fabricate visual details.
Temporal Understanding
Video and audio require understanding of time, sequencing, and causality.
Evaluation
Multimodal benchmarks are less mature than their text-only counterparts.
Compute Requirements
Multiple modalities multiply data and processing needs.
Evaluation Benchmarks
Vision-Language
- VQAv2 — Visual question answering
- MMMU — Multimodal understanding
- ChartQA — Chart understanding
- DocVQA — Document understanding
- TextVQA — Text in images
Audio
- LibriSpeech — ASR benchmark
- CommonVoice — Multilingual ASR
APIs and Tools
Multimodal LLMs
```python
# OpenAI Vision: ask GPT-4o to describe an image. The `image_url` variable is
# a placeholder for a public URL (or base64 data URL) pointing at the image.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)
print(response.choices[0].message.content)
```
Image Generation
```python
# DALL-E 3: generate an image from a text prompt.
from openai import OpenAI

client = OpenAI()
response = client.images.generate(
    model="dall-e-3",
    prompt="A sunset over mountains",
    size="1024x1024",
)
print(response.data[0].url)  # URL of the generated image
```
Resources
- CLIP Paper — Learning visual concepts from text
- LLaVA Paper — Visual instruction tuning
- Whisper Paper — Robust speech recognition
- ImageBind Paper — Unified embedding space
- Gemini Technical Report
- OpenAI Vision Guide