AI Models

Updated 2 May 2025
Tags: moc, models, llm, vision, audio, video, multimodal

Every AI product you use — ChatGPT, Claude, Midjourney, Sora — is a thin interface over a model. The model is the intelligence. Understanding which models exist, what they can do, and how they differ is the key to making informed choices.

This section maps the landscape across every modality: text, image, video, audio, code, and the increasingly important multimodal models that handle several at once.


Text Models (LLMs)

The foundation: large language models (LLMs) that read and generate text. They power chatbots, coding tools, analysis, and increasingly agentic workflows.

| Model | Company | Open? | Notes |
| --- | --- | --- | --- |
| GPT-4o / GPT-4.1 | OpenAI | Closed | Largest user base. Strong general purpose. |
| o1 / o3 | OpenAI | Closed | Reasoning models: think step by step before answering. |
| Claude 3.5 / 4 | Anthropic | Closed | Best at long documents, coding, careful reasoning. Safety-focused. |
| Gemini 2.5 | Google DeepMind | Closed | Massive context window (1M+). Deep Google integration. |
| Grok 3 | xAI | Closed | Integrated with X/Twitter. Fewer guardrails. |
| Llama 3 / 4 | Meta AI | Open weights | Most popular open model. Huge community. |
| Mistral Large | Mistral AI | Mixed | European. Efficient. Strong multilingual. |
| DeepSeek V3 / R1 | DeepSeek | Open weights | Chinese lab. Competitive performance at lower cost. Transparent reasoning (R1). |
| Qwen 2.5 | Alibaba | Open weights | Strong multilingual and coding. Range of sizes. |
| Command R+ | Cohere | Closed | Enterprise-focused. Strong RAG capabilities. |
| Amazon Nova | Amazon | Closed | AWS-integrated. Multiple price/performance tiers. |
| Phi-4 | Microsoft | Open weights | Small but capable. 14B parameters competing with much larger models. |
| Gemma | Google DeepMind | Open weights | Lightweight. Research-friendly. Based on Gemini tech. |

See Text Models (LLMs) for a deep dive into the LLM landscape, reasoning models, benchmarks, and how to choose. See How LLMs Work for the technical foundations. See Training & Fine-Tuning for how these models are created.


Image Generation

Text-to-image models that create visual content from descriptions. The area where AI’s creative capabilities are most visible — and most legally contested.

| Model | Company | Notes |
| --- | --- | --- |
| DALL-E 3 | OpenAI | Integrated into ChatGPT. Best prompt adherence. |
| Midjourney v6 | Midjourney | Aesthetic quality leader. Discord-native. |
| Stable Diffusion 3 / SDXL | Stability AI | Open weights. Community ecosystem. ControlNet. |
| Flux | Black Forest Labs | Newest contender. High quality. Open variants. Built by ex-Stability AI team. |
| Ideogram | Ideogram | Specialises in rendering readable text in images. Canva integration. |
| Imagen 3 | Google DeepMind | Google’s entry. Photorealism focus. |
| Firefly | Adobe | Trained on licensed data. Commercial-safe. Photoshop integration. |
| Leonardo AI | Leonardo AI | Game assets, concept art, character design. |
| Grok Image (Aurora) | xAI | Integrated into X/Twitter. Aurora is xAI’s in-house model; Grok’s earlier image feature was Flux-based. |

See Image Generation Models for a deep dive into each model, how they work, and the copyright debate these models sit at the centre of. See AI Bias & Fairness for representation issues in generated images.


Video Generation

Text-to-video and image-to-video. The newest frontier. Moving from impressive demos to usable tools.

| Model | Company | Status | Notes |
| --- | --- | --- | --- |
| Sora | OpenAI | Released | Physics-aware generation. Cinematic quality. |
| Veo 2 | Google DeepMind | Released | Strong motion understanding. YouTube integration planned. |
| Runway Gen-3 | Runway | Released | Pioneer in the space. Creative professionals. |
| Kling 1.6 | Kuaishou | Released | Chinese. Surprisingly strong. |
| Pika 2.0 | Pika Labs | Released | Quick iterations. Good for social content. |
| Hailuo / MiniMax | MiniMax | Released | Emerging competitor. |

The video space is moving fast. What looks incredible today will seem primitive in a year. The key question isn’t quality — it’s control, consistency, and commercial usability.

See Video AI Models for a deep dive into each model, how they work, and the implications. Video models raise urgent questions about deepfakes, consent, and authenticity. See AI Security.


Audio & Speech

Three distinct subfields that are converging:

Text-to-Speech (TTS)

| Model/Service | Company | Notes |
| --- | --- | --- |
| ElevenLabs | ElevenLabs | Market leader. Voice cloning. Emotional range. |
| OpenAI TTS | OpenAI | Built into ChatGPT voice mode. Natural prosody. |
| Bark | Suno AI | Open source. Multilingual. Music and effects. |
| XTTS | Coqui AI | Open source. Voice cloning. Community-driven. |
| Azure Neural TTS | Microsoft | Enterprise. Many languages. Custom voice training. |

Speech-to-Text (STT)

| Model/Service | Company | Notes |
| --- | --- | --- |
| Whisper | OpenAI | Open source. Gold standard. Multilingual. |
| Deepgram | Deepgram | API-first. Real-time. Enterprise focused. |
| Google Cloud STT | Google | Mature. Many languages. Streaming. |
| AssemblyAI | AssemblyAI | Speaker diarisation. Summarisation. |

Music Generation

| Model/Service | Company | Notes |
| --- | --- | --- |
| Suno v4 | Suno AI | Full song generation from text. Vocals included. |
| Udio | Udio | High fidelity. Genre range. |
| MusicLM | Google DeepMind | Research. High quality but limited access. |

Voice cloning and TTS raise critical safety concerns: voice scams, identity theft, and consent. The EU AI Act places transparency obligations on synthetic audio such as deepfakes, and classifies some voice-related uses as high-risk.


Multimodal Models

Models that understand and generate across multiple modalities. This is where everything is heading.

| Model | Company | Modalities | Notes |
| --- | --- | --- | --- |
| GPT-4o | OpenAI | Text, image, audio, video | “Omni”: one model, all modalities. |
| Gemini 2.5 Pro | Google DeepMind | Text, image, audio, video | Native multimodal. Huge context (1M+). |
| Claude 3.5 Sonnet / 4 | Anthropic | Text, image | Vision + text. Strong document analysis. |
| Pixtral | Mistral AI | Text, image | Open-weight multimodal. Efficient. |
| Llama 4 | Meta AI | Text, image, video | Open-weight. Meta’s first natively multimodal release. |

Multimodal is the future because the world isn’t text-only. These models can look at a screenshot, listen to a conversation, read a document, and respond in any combination.

See Multimodal Models for a deep dive into how multimodality works, what it unlocks, and the agents it enables.


Speech to Speech

Nvidia Personaplex Vs OpenAI Realtime

Coding Models

Models optimised for understanding and generating code. The backbone of AI coding agents.

| Model/Product | Company | Notes |
| --- | --- | --- |
| Claude Code | Anthropic | Terminal-native. Reads codebases. Plans and executes. |
| Cursor Tab | Cursor | IDE-integrated. Fast autocomplete. Context-aware. |
| GitHub Copilot | Microsoft/OpenAI | Most widely adopted. VS Code native. |
| Devin | Cognition AI | Fully autonomous AI software engineer. |
| Aider | Open source | Git-aware pair programming with any LLM backend. |
| Codex | OpenAI | Original code model. Now folded into GPT-4. |
| Codex CLI | OpenAI | Terminal-based agent. OpenAI’s answer to Claude Code. |
| Windsurf | Codeium | AI-first IDE. Cascade mode for multi-file reasoning. |
| DeepSeek Coder V2 | DeepSeek | Open. Competitive with GPT-4 on code benchmarks. |
| Qwen2.5-Coder | Alibaba | Open. Strong at multiple languages. |
| StarCoder2 | BigCode | Open. Permissively licensed training data. |

See Coding Models for a deep dive into each tool, the levels of coding AI, and how to choose. See AI Agents for how these models work as autonomous coding agents, not just autocomplete.


Embedding Models

Models that turn text (and increasingly, images and audio) into lists of numbers — vectors that capture meaning. Embeddings are the invisible infrastructure behind search, retrieval, and semantic understanding.

| Model | Company | Notes |
| --- | --- | --- |
| text-embedding-3 | OpenAI | Small/large variants. Strong all-round. |
| Voyage AI | Voyage AI (MongoDB) | Recommended in Anthropic’s documentation. Specialised embeddings. Good for code and legal text. |
| Cohere Embed | Cohere | Multilingual. Optimised for RAG. |
| Jina Embeddings | Jina AI | Open source. 8K context. Multilingual. |
| BGE | BAAI | Open source. Has ranked at the top of the MTEB leaderboard. |
| E5 / GTE | Microsoft/Alibaba | Strong open-source baselines. |
| Nomic Embed | Nomic AI | Open. Built for visualisation and search. |
| NV-Embed | NVIDIA | Open. Has also topped MTEB; rankings shift often. |

Embeddings are what make RAG work. They power semantic search, recommendation systems, clustering, and the retrieval step in virtually every AI agent. Without good embeddings, your AI has no memory of your documents.
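
To make the retrieval step concrete, here is a minimal sketch of semantic search over embedding vectors. It assumes the vectors already exist: the three-dimensional numbers below are made up for illustration, whereas a real embedding model returns hundreds or thousands of dimensions. The names `cosine_similarity`, `retrieve`, `TOY_INDEX`, and `TOY_QUERY` are hypothetical and not taken from any library.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

def retrieve(query_vector, index, k=2):
    """Return the k document texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Made-up vectors standing in for the output of an embedding model.
TOY_INDEX = {
    "Refund policy: returns accepted within 30 days": [0.9, 0.1, 0.0],
    "Shipping times for international orders": [0.2, 0.9, 0.1],
    "How to reset your account password": [0.0, 0.1, 0.95],
}

# Made-up vector standing in for the embedded question "Can I get my money back?"
TOY_QUERY = [0.85, 0.2, 0.05]

if __name__ == "__main__":
    for text in retrieve(TOY_QUERY, TOY_INDEX):
        print(text)
```

The query shares no keywords with the refund document, yet it ranks first because the two vectors point in nearly the same direction. In a RAG system the retrieved texts would then be passed to the LLM as context.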


How to Choose

The model you need depends on what you’re building:

| Use Case | Best Options | Why |
| --- | --- | --- |
| General chat/assistant | GPT-4o, Claude 3.5 Sonnet, Gemini 2.5 | Balanced capability |
| Long documents, analysis | Claude (200K context) | Best at staying coherent over long inputs |
| Coding | Claude Code, Cursor, DeepSeek Coder | Purpose-built for code workflows |
| Image generation | Midjourney (quality), SDXL/Flux (control) | Depends on your need |
| Video | Sora (quality), Runway (workflow) | Space is still maturing |
| Voice / TTS | ElevenLabs (quality), Whisper (STT) | Clear leaders per task |
| Multimodal | GPT-4o, Gemini 2.5 | Native multi-input understanding |
| Embeddings / search | OpenAI text-embedding-3, Cohere Embed, BGE | Depends on language and scale |
| Reasoning / hard problems | o1/o3, DeepSeek R1, Claude 4 | Step-by-step thinking |
| Privacy / local | Llama 3/4, Mistral, Qwen (run locally, e.g. via Ollama) | Run on your hardware |
| Cost sensitive | DeepSeek, Qwen, open models | Open = cheaper at scale |
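
If you route requests to different models in code, a few rows of this table can be captured as a small lookup. The sketch below is one possible shape, not a recommended architecture: the key names, the `SHORTLISTS` dictionary, and the `shortlist` function are all hypothetical, and the model names are plain labels copied from the table rather than real API identifiers.

```python
from typing import Dict, List

# Shortlists copied from the "How to Choose" table; the keys are hypothetical labels.
SHORTLISTS: Dict[str, List[str]] = {
    "general_chat": ["GPT-4o", "Claude 3.5 Sonnet", "Gemini 2.5"],
    "long_documents": ["Claude"],
    "coding": ["Claude Code", "Cursor", "DeepSeek Coder"],
    "reasoning": ["o1/o3", "DeepSeek R1", "Claude 4"],
    "local": ["Llama 3/4", "Mistral", "Qwen"],
    "cost_sensitive": ["DeepSeek", "Qwen"],
}

def shortlist(use_case: str, default: str = "general_chat") -> List[str]:
    """Return candidate models for a use case, falling back to the default list."""
    key = use_case.strip().lower().replace(" ", "_")
    # Return a copy so callers cannot mutate the shared table by accident.
    return list(SHORTLISTS.get(key, SHORTLISTS[default]))

if __name__ == "__main__":
    print(shortlist("Coding"))
    print(shortlist("something else"))  # unknown use cases fall back to general chat
```

Keeping the mapping in one place makes it cheap to update when a new model displaces an old one.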

The Bigger Picture

Models are commoditising. What cost millions to build in 2023 can be replicated for thousands in 2025. Open-weight models keep closing the gap with closed frontier models.

What matters increasingly is not the model alone but the system around it: retrieval, agent orchestration, prompting, fine-tuning for your domain, and safety guardrails.

The model is the engine. Everything else is the car.

