AI Models

Updated 2 May 2025

Tags: moc, models, llm, vision, audio, video, multimodal

Every AI product you use — ChatGPT, Claude, Midjourney, Sora — is a thin interface over a model. The model is the intelligence. Understanding which models exist, what they can do, and how they differ is the key to making informed choices.

This section maps the landscape across every modality: text, image, video, audio, code, and the increasingly important multimodal models that handle several at once.

Text Models (LLMs)

The foundation: Large Language Models that read and generate text. They power chatbots, coding tools, analysis, and increasingly agentic workflows.

Model	Company	Open?	Notes
GPT-4o / GPT-4.1	OpenAI	Closed	Largest user base. Strong general purpose.
o1 / o3	OpenAI	Closed	Reasoning models — think step by step before answering.
Claude 3.5 / 4	Anthropic	Closed	Best at long documents, coding, careful reasoning. Safety-focused.
Gemini 2.5	Google DeepMind	Closed	Massive context window (1M+). Deep Google integration.
Grok 3	xAI	Closed	Integrated with X/Twitter. Fewer guardrails.
Llama 3 / 4	Meta AI	Open weights	Most popular open model. Huge community.
Mistral Large	Mistral AI	Mixed	European. Efficient. Strong multilingual.
DeepSeek V3 / R1	DeepSeek	Open weights	Chinese lab. Competitive performance at lower cost. Transparent reasoning (R1).
Qwen 2.5	Alibaba	Open weights	Strong multilingual and coding. Range of sizes.
Command R+	Cohere	Closed	Enterprise-focused. Strong RAG capabilities.
Amazon Nova	Amazon	Closed	AWS-integrated. Multiple price/performance tiers.
Phi-4	Microsoft	Open weights	Small but capable. 14B parameters competing with much larger models.
Gemma	Google DeepMind	Open weights	Lightweight. Research-friendly. Based on Gemini tech.

See Text Models (LLMs) for a deep dive into the LLM landscape, reasoning models, benchmarks, and how to choose. See How LLMs Work for the technical foundations. See Training & Fine-Tuning for how these models are created.

Image Generation

Text-to-image models that create visual content from descriptions. The area where AI’s creative capabilities are most visible — and most legally contested.

Model	Company	Notes
DALL-E 3	OpenAI	Integrated into ChatGPT. Best prompt adherence.
Midjourney v6	Midjourney	Aesthetic quality leader. Discord-native.
Stable Diffusion 3 / SDXL	Stability AI	Open weights. Community ecosystem. ControlNet.
Flux	Black Forest Labs	Newest contender. High quality. Open variants. Built by ex-Stability AI team.
Ideogram	Ideogram	Specialises in rendering readable text in images. Canva integration.
Imagen 3	Google DeepMind	Google’s entry. Photorealism focus.
Firefly	Adobe	Trained on licensed data. Commercial-safe. Photoshop integration.
Leonardo AI	Leonardo AI	Game assets, concept art, character design.
Grok Image (Aurora)	xAI	Integrated into X/Twitter. Flux-based.

See Image Generation Models for a deep dive into each model, how they work, and the copyright debate these models sit at the centre of. See AI Bias & Fairness for representation issues in generated images.

Video Generation

Text-to-video and image-to-video. The newest frontier. Moving from impressive demos to usable tools.

Model	Company	Status	Notes
Sora	OpenAI	Released	Physics-aware generation. Cinematic quality.
Veo 2	Google DeepMind	Released	Strong motion understanding. YouTube integration planned.
Runway Gen-3	Runway	Released	Pioneer in the space. Creative professionals.
Kling 1.6	Kuaishou	Released	Chinese. Surprisingly strong.
Pika 2.0	Pika Labs	Released	Quick iterations. Good for social content.
Hailuo / MiniMax	MiniMax	Released	Emerging competitor.

The video space is moving fast. What looks incredible today will seem primitive in a year. The key question isn’t quality — it’s control, consistency, and commercial usability.

See Video AI Models for a deep dive into each model, how they work, and the implications. Video models raise urgent questions about deepfakes, consent, and authenticity. See AI Security.

Audio & Speech

Three distinct subfields that are converging:

Text-to-Speech (TTS)

Model/Service	Company	Notes
ElevenLabs	ElevenLabs	Market leader. Voice cloning. Emotional range.
OpenAI TTS	OpenAI	Built into ChatGPT voice mode. Natural prosody.
Bark	Suno AI	Open source. Multilingual. Music and effects.
XTTS	Coqui AI	Open source. Voice cloning. Community-driven.
Azure Neural TTS	Microsoft	Enterprise. Many languages. Custom voice training.

Speech-to-Text (STT)

Model/Service	Company	Notes
Whisper	OpenAI	Open source. Gold standard. Multilingual.
Deepgram	Deepgram	API-first. Real-time. Enterprise focused.
Google Cloud STT	Google	Mature. Many languages. Streaming.
AssemblyAI	AssemblyAI	Speaker diarisation. Summarisation.

Music Generation

Model/Service	Company	Notes
Suno v4	Suno AI	Full song generation from text. Vocals included.
Udio	Udio	High fidelity. Genre range.
MusicLM	Google DeepMind	Research. High quality but limited access.

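Voice cloning and TTS raise critical safety concerns: voice scams, identity theft, and consent. The EU AI Act addresses synthetic audio chiefly through transparency obligations for deepfakes, and treats some voice-based biometric uses as high-risk.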

Multimodal Models

Models that understand and generate across multiple modalities. This is where everything is heading.

Model	Company	Modalities	Notes
GPT-4o	OpenAI	Text, image, audio, video	“Omni” — one model, all modalities
Gemini 2.5 Pro	Google DeepMind	Text, image, audio, video	Native multimodal. Huge context (1M+).
Claude 3.5 Sonnet / 4	Anthropic	Text, image	Vision + text. Strong document analysis.
Pixtral	Mistral AI	Text, image	Open-weight multimodal. Efficient.
Llama 4	Meta AI	Text, image, video	Open-weight. Meta’s first natively multimodal release.

Multimodal is the future because the world isn’t text-only. These models can look at a screenshot, listen to a conversation, read a document, and respond in any combination.

See Multimodal Models for a deep dive into how multimodality works, what it unlocks, and the agents it enables.

Speech to Speech

See Nvidia Personaplex vs OpenAI Realtime for a comparison of speech-to-speech models.

Coding Models

Models optimised for understanding and generating code. The backbone of AI coding agents.

Model/Product	Company	Notes
Claude Code	Anthropic	Terminal-native. Reads codebases. Plans and executes.
Cursor Tab	Cursor	IDE-integrated. Fast autocomplete. Context-aware.
GitHub Copilot	Microsoft/OpenAI	Most widely adopted. VS Code native.
Devin	Cognition AI	Fully autonomous AI software engineer.
Aider	Open source	Git-aware pair programming with any LLM backend.
Codex	OpenAI	Original code model. Now folded into GPT-4.
Codex CLI	OpenAI	Terminal-based agent. OpenAI’s answer to Claude Code.
Windsurf	Codeium	AI-first IDE. Cascade mode for multi-file reasoning.
DeepSeek Coder V2	DeepSeek	Open. Competitive with GPT-4 on code benchmarks.
Qwen Coder 2.5	Alibaba	Open. Strong at multiple languages.
StarCoder2	BigCode	Open. Permissively licensed training data.

See Coding Models for a deep dive into each tool, the levels of coding AI, and how to choose. See AI Agents for how these models work as autonomous coding agents, not just autocomplete.

Embedding Models

Models that turn text (and increasingly, images and audio) into lists of numbers — vectors that capture meaning. Embeddings are the invisible infrastructure behind search, retrieval, and semantic understanding.

Model	Company	Notes
text-embedding-3	OpenAI	Small/large variants. Strong all-round.
Voyage AI	Voyage AI (MongoDB)	Specialised embeddings. Good for code and legal text. Recommended by Anthropic.
Cohere Embed	Cohere	Multilingual. Optimised for RAG.
Jina Embeddings	Jina AI	Open source. 8K context. Multilingual.
BGE	BAAI	Open source. Has ranked at the top of the MTEB leaderboard.
E5 / GTE	Microsoft/Alibaba	Strong open-source baselines.
Nomic Embed	Nomic AI	Open. Built for visualisation and search.
NV-Embed	NVIDIA	Open. Has ranked at the top of the MTEB leaderboard.

Embeddings are what make RAG work. They power semantic search, recommendation systems, clustering, and the retrieval step in virtually every AI agent. Without good embeddings, your AI has no memory of your documents.
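
To make "vectors that capture meaning" concrete, here is a minimal sketch of the retrieval step in semantic search. It assumes each document has already been turned into a vector by some embedding model; the three-dimensional vectors, the document names, and the `search` helper are hypothetical and chosen only for illustration (real embeddings have hundreds or thousands of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; a real system would get these from an embedding model.
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "office dog photos": [0.0, 0.1, 0.9],
}


def search(query_vector, docs, top_k=2):
    """Rank documents by similarity to the query vector and return the best names."""
    scored = [(cosine_similarity(query_vector, vec), name) for name, vec in docs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Pretend this is the embedding of "how do I get my money back?"
query = [0.8, 0.2, 0.0]
print(search(query, DOCS))  # ['refund policy', 'shipping times']
```

The query shares no keywords with "refund policy", yet it ranks first because the two vectors point in nearly the same direction. In a RAG setup, the top-ranked documents would then be passed to the language model as context.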

How to Choose

The model you need depends on what you’re building:

Use Case	Best Options	Why
General chat/assistant	GPT-4o, Claude 3.5 Sonnet, Gemini 2.5	Balanced capability
Long documents, analysis	Claude (200K context)	Best at staying coherent over long inputs
Coding	Claude Code, Cursor, DeepSeek Coder	Purpose-built for code workflows
Image generation	Midjourney (quality), SDXL/Flux (control)	Depends on your need
Video	Sora (quality), Runway (workflow)	Space is still maturing
Voice (TTS / STT)	ElevenLabs (TTS quality), Whisper (STT)	Clear leaders per task
Multimodal	GPT-4o, Gemini 2.5	Native multi-input understanding
Embeddings / search	OpenAI text-embedding-3, Cohere Embed, BGE	Depends on language and scale
Reasoning / hard problems	o1/o3, DeepSeek R1, Claude 4	Step-by-step thinking
Privacy / local	Llama 3/4, Mistral, Qwen (e.g. run via Ollama)	Run on your hardware
Cost sensitive	DeepSeek, Qwen, open models	Open = cheaper at scale
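
If you want to apply the table above inside an application, one simple option is a lookup keyed by use case. The sketch below is only an illustration: the `RECOMMENDATIONS` dictionary copies a few rows from the table, while the `recommend` function, its `must_run_locally` flag, and the fallback to general-purpose models are assumptions rather than an established method.

```python
# A few rows from the table above, keyed by a lowercase use-case label.
RECOMMENDATIONS = {
    "general chat": ["GPT-4o", "Claude 3.5 Sonnet", "Gemini 2.5"],
    "long documents": ["Claude"],
    "coding": ["Claude Code", "Cursor", "DeepSeek Coder"],
    "reasoning": ["o1/o3", "DeepSeek R1", "Claude 4"],
    "privacy": ["Llama 3/4", "Mistral", "Qwen"],
    "cost": ["DeepSeek", "Qwen"],
}


def recommend(use_case, must_run_locally=False):
    """Return candidate models for a use case (hypothetical helper).

    A hard constraint such as local-only deployment overrides the use case,
    because closed API-only models cannot satisfy it.
    """
    if must_run_locally:
        return RECOMMENDATIONS["privacy"]
    key = use_case.lower().strip()
    # Unknown use cases fall back to the general-purpose options.
    return RECOMMENDATIONS.get(key, RECOMMENDATIONS["general chat"])


print(recommend("Coding"))
print(recommend("coding", must_run_locally=True))
```

The design point is that constraints (privacy, cost) tend to narrow the field before capability does, so it is worth checking them first.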

The Bigger Picture

Models are commoditising. What cost millions to build in 2023 can be replicated for thousands in 2025. Open-weight models keep closing the gap with closed frontier models.

What matters increasingly is not the model alone but the system around it: retrieval, agent orchestration, prompting, fine-tuning for your domain, and safety guardrails.

The model is the engine. Everything else is the car.

Go Deeper

Text Models (LLMs) — Deep dive into the LLM landscape, reasoning models, and benchmarks
Image Generation Models — How text-to-image models work and compare
Video AI Models — The emerging video generation space
Audio & Speech AI — TTS, speech-to-text, voice cloning, music generation
Multimodal Models — Models that handle text, image, audio, and video together
Coding Models — Specialist models for software development
How LLMs Work — Understand what’s happening inside text models
Training & Fine-Tuning — How models are created and customised
AI Companies — Who builds these models
AI Agents — How models become autonomous tools
RAG & Retrieval — How embeddings and retrieval give models access to your data
AI Security — Safety considerations for model deployment
AI Safety & Ethics — The broader implications
AI Intelligence Hub — Back to the hub home

Sources

Chatbot Arena — Community-driven model rankings
Hugging Face Open LLM Leaderboard — Open model benchmarks
Artificial Analysis — Model performance and pricing comparison