ARTICLE

Multimodal Models

Updated 2 May 2025

modelsmultimodalvisiongpt-4ogeminiclaude

Multimodal Models

The world isn’t made of text. It’s made of images and sounds and video and diagrams and faces and handwriting on whiteboards. Multimodal models understand and generate across these forms — natively, in one model, not by bolting separate systems together.

This is the direction everything is heading. In 2024 the leading models gained vision. In 2025 they gained native audio. Soon the question won’t be “which modality” — it’ll be “why would anyone build a model that can’t handle everything?”

What “Multimodal” Actually Means

A multimodal model processes different types of input and output within a single architecture. It doesn’t convert an image to text and then feed it to a text model — the model’s internal representations span modalities.

Level	Description	Example
Level 1: Separate encoders	Image → text. Text → answer. Pass the text to the LLM.	GPT-4V (early 2023)
Level 2: Shared representation	Image and text share the same embedding space. Both are native.	GPT-4o, Claude 3.5
Level 3: Natively multimodal	One model, all modalities, from training through inference. Can generate across modalities.	Gemini 2.5, GPT-4o

Most “multimodal” models today sit at Level 2 — they can see images and read text as native inputs, but generation is still mostly text. Level 3 (full output across modalities) is emerging with image generation in Gemini and GPT-4o.

The Current Landscape

GPT-4o — OpenAI

“Omni” means all modalities in one. GPT-4o can see images (and video frames), hear audio, read text, and generate text and images natively. The Advanced Voice Mode lets you speak to it and it responds in a natural voice with emotion and pacing. It’s the most complete multimodal consumer product.

Inputs: Text, images, audio, video (as frame sequences) Outputs: Text, images, speech Context: 128K tokens The vibe: This is what talking to a computer should feel like. Fast, fluid, multi-sensory.

Gemini 2.5 Pro — Google DeepMind

Built multimodal from the ground up — not retrofitted. Gemini’s architecture was designed to handle text, images, audio, video, and code as first-class inputs from the start. The 1M+ token context window means you can feed it hours of video and ask questions about specific moments.

Inputs: Text, images, audio, video (direct, not frame-by-frame), code Outputs: Text, images (via Imagen), audio (via TTS) Context: 1M tokens (2M announced) The vibe: The research lab release. Extremely capable, but the consumer experience lags behind the raw capability.

Claude 3.5 Sonnet / 4 — Anthropic

Claude handles text and images natively — document analysis, screenshot understanding, diagram interpretation. Audio is not yet a native modality (it goes through transcription). Claude’s approach to multimodality is deliberate: nail text+vision first, add audio when it’s ready.

Inputs: Text, images (including multi-page PDFs, screenshots, photos) Outputs: Text Context: 200K tokens The vibe: The careful one. Won’t see audio natively yet, but what it does see, it understands deeply.

Pixtral — Mistral AI

Mistral’s multimodal model. Handles images and text natively. Smaller than the frontier models but efficient and available with open weights. Good for European deployments where data sovereignty matters.

Inputs: Text, images Outputs: Text The vibe: Efficient European multimodal. Not flashy, but solid and self-hostable.

What Multimodal Unlocks

Document Understanding

Feed a model a PDF and ask it questions. No OCR, no conversion, no separate steps. The model sees the layout, the tables, the charts, the handwriting in the margins. This was the first killer use case for multimodal and it’s still the most practically useful.

Claude and Gemini excel here. GPT-4o is strong. For legal contracts, medical records, engineering diagrams, financial reports — this capability alone is transforming document-heavy industries.

Visual Reasoning

“What’s wrong with this diagram?” “Where in this photo could the problem be?” “What does this X-ray show?” Models that can see and reason about what they see are being used in medicine, manufacturing, architecture, and anywhere else where visual inspection matters.

Screenshot & UI Understanding

A screenshot of an error message. A mockup of a design. A photo of a competitor’s product. Multimodal models can interpret these directly — no need to describe them in text first. This is why coding agents with vision can use your browser, and why AI assistants can help you navigate software.

Accessibility

A blind person photographs their surroundings and the AI describes them. Someone who can’t speak types to an AI that responds with natural speech. Multimodality isn’t just more capable — it’s more inclusive.

The Limitations

Multimodality sounds like magic, but it’s early. Current limitations:

Video understanding is shallow — Models can tell you what’s in a video frame, but struggle with what happened between frames. Temporal reasoning is hard.
Audio generation quality gap — TTS from multimodal models isn’t yet at ElevenLabs quality
Expensive compute — Processing video and audio tokens is much more costly than text
Hallucination in vision — Models sometimes “see” things that aren’t there, especially in complex or low-quality images
Consistency — The same model might give different answers to the same visual question on different attempts

These are engineering problems, not fundamental barriers. They’ll improve.

The Multimodal Agents Coming

The real frontier is multimodal AI agents that can act in the visual world — not just see it. Models that can:

Control a computer by seeing the screen and moving the mouse (Claude Computer Use, OpenAI Operator)
Navigate a physical environment through a camera feed
Watch a process and intervene when something goes wrong

This is early but moving fast. See AI Agents for more on where this is heading.

How to Choose

If you want…	Use…
Best all-around multimodal consumer product	GPT-4o (ChatGPT Plus)
Best document analysis, long PDFs	Claude 3.5 Sonnet / 4
Best for video, massive context	Gemini 2.5 Pro
Open-weight multimodal	Pixtral (Mistral), Llama 4 (Meta)
Enterprise, compliance needs	Claude Enterprise, GPT-4o Enterprise

Go Deeper

AI Models — The complete model landscape
Text Models (LLMs) — The text foundation these build on
Image Generation Models — Text-to-image, the visual output side
Audio & Speech AI — Audio generation and understanding
Video AI Models — Video generation
AI Agents — Where multimodal models become autonomous tools
AI Intelligence Hub — Back to the hub home

Multimodal Models

What “Multimodal” Actually Means

The Current Landscape

GPT-4o — OpenAI

Gemini 2.5 Pro — Google DeepMind

Claude 3.5 Sonnet / 4 — Anthropic

Pixtral — Mistral AI

What Multimodal Unlocks

Document Understanding

Visual Reasoning

Screenshot & UI Understanding

Accessibility

The Limitations

The Multimodal Agents Coming

How to Choose

Go Deeper

Sources