AI Models
Every AI product you use — ChatGPT, Claude, Midjourney, Sora — is a thin interface over a model. The model is the intelligence. Understanding which models exist, what they can do, and how they differ is the key to making informed choices.
This section maps the landscape across every modality: text, image, video, audio, code, and the increasingly important multimodal models that handle several at once.
Text Models (LLMs)
The foundation: Large Language Models (LLMs) read and generate text. They power chatbots, coding tools, analysis, and increasingly agentic workflows.
| Model | Company | Open? | Notes |
|---|---|---|---|
| GPT-4o / GPT-4.1 | OpenAI | Closed | Largest user base. Strong general purpose. |
| o1 / o3 | OpenAI | Closed | Reasoning models — think step by step before answering. |
| Claude 3.5 / 4 | Anthropic | Closed | Best at long documents, coding, careful reasoning. Safety-focused. |
| Gemini 2.5 | Google DeepMind | Closed | Massive context window (1M+). Deep Google integration. |
| Grok 3 | xAI | Closed | Integrated with X/Twitter. Fewer guardrails. |
| Llama 3 / 4 | Meta AI | Open weights | Most popular open model. Huge community. |
| Mistral Large | Mistral AI | Mixed | European. Efficient. Strong multilingual. |
| DeepSeek V3 / R1 | DeepSeek | Open weights | Chinese lab. Competitive performance at lower cost. Transparent reasoning (R1). |
| Qwen 2.5 | Alibaba | Open weights | Strong multilingual and coding. Range of sizes. |
| Command R+ | Cohere | Closed | Enterprise-focused. Strong RAG capabilities. |
| Amazon Nova | Amazon | Closed | AWS-integrated. Multiple price/performance tiers. |
| Phi-4 | Microsoft | Open weights | Small but capable. 14B parameters competing with much larger models. |
| Gemma | Google DeepMind | Open weights | Lightweight. Research-friendly. Based on Gemini tech. |
See Text Models (LLMs) for a deep dive into the LLM landscape, reasoning models, benchmarks, and how to choose. See How LLMs Work for the technical foundations. See Training & Fine-Tuning for how these models are created.
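Context-window figures such as the 1M+ quoted for Gemini 2.5 are measured in tokens, not characters. As a rough sketch of how to sanity-check whether a document fits, the snippet below assumes about four characters per token for English text; that ratio is a common rule of thumb, not a real tokenizer, and the helper names (`estimate_tokens`, `fits`) and the output reserve are hypothetical. The limits are the figures quoted in this section.

```python
# Rough context-window check. All names here are illustrative, not a vendor API.
CHARS_PER_TOKEN = 4  # assumption: rule-of-thumb ratio for English text

# Token limits as quoted in this section (not looked up from any API).
CONTEXT_WINDOWS = {
    "Claude": 200_000,
    "Gemini 2.5": 1_000_000,
}


def estimate_tokens(text: str) -> int:
    """Estimate the token count from character length."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits(text: str, model: str, reserve_for_output: int = 4_000) -> bool:
    """True if the text plus a reserve for the reply fits the model's window."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_WINDOWS[model]


document = "word " * 300_000  # 1.5 million characters of filler text
print(estimate_tokens(document))     # 375001
print(fits(document, "Claude"))      # False
print(fits(document, "Gemini 2.5"))  # True
```

For real workloads, count tokens with the provider's own tokenizer, since the ratio varies by language and content type.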
Image Generation
Text-to-image models that create visual content from descriptions. The area where AI’s creative capabilities are most visible — and most legally contested.
| Model | Company | Notes |
|---|---|---|
| DALL-E 3 | OpenAI | Integrated into ChatGPT. Best prompt adherence. |
| Midjourney v6 | Midjourney | Aesthetic quality leader. Discord-native. |
| Stable Diffusion 3 / SDXL | Stability AI | Open weights. Community ecosystem. ControlNet. |
| Flux | Black Forest Labs | Newest contender. High quality. Open variants. Built by ex-Stability AI team. |
| Ideogram | Ideogram | Specialises in rendering readable text in images. Canva integration. |
| Imagen 3 | Google DeepMind | Google’s entry. Photorealism focus. |
| Firefly | Adobe | Trained on licensed data. Commercial-safe. Photoshop integration. |
| Leonardo AI | Leonardo AI | Game assets, concept art, character design. |
| Grok Image (Aurora) | xAI | Integrated into X/Twitter. Flux-based. |
See Image Generation Models for a deep dive into each model and how they work. Image generation models are at the centre of the copyright debate, which that page also covers. See AI Bias & Fairness for representation issues in generated images.
Video Generation
Text-to-video and image-to-video. The newest frontier. Moving from impressive demos to usable tools.
| Model | Company | Status | Notes |
|---|---|---|---|
| Sora | OpenAI | Released | Physics-aware generation. Cinematic quality. |
| Veo 2 | Google DeepMind | Released | Strong motion understanding. YouTube integration planned. |
| Runway Gen-3 | Runway | Released | Pioneer in the space. Creative professionals. |
| Kling 1.6 | Kuaishou | Released | Chinese. Surprisingly strong. |
| Pika 2.0 | Pika Labs | Released | Quick iterations. Good for social content. |
| Hailuo / MiniMax | MiniMax | Released | Emerging competitor. |
The video space is moving fast. What looks incredible today will seem primitive in a year. The key question isn’t quality — it’s control, consistency, and commercial usability.
See Video AI Models for a deep dive into each model, how they work, and the implications. Video models raise urgent questions about deepfakes, consent, and authenticity. See AI Security.
Audio & Speech
Three distinct subfields that are converging:
Text-to-Speech (TTS)
| Model/Service | Company | Notes |
|---|---|---|
| ElevenLabs | ElevenLabs | Market leader. Voice cloning. Emotional range. |
| OpenAI TTS | OpenAI | Built into ChatGPT voice mode. Natural prosody. |
| Bark | Suno AI | Open source. Multilingual. Music and effects. |
| XTTS | Coqui AI | Open source. Voice cloning. Community-driven. |
| Azure Neural TTS | Microsoft | Enterprise. Many languages. Custom voice training. |
Speech-to-Text (STT)
| Model/Service | Company | Notes |
|---|---|---|
| Whisper | OpenAI | Open source. Gold standard. Multilingual. |
| Deepgram | Deepgram | API-first. Real-time. Enterprise focused. |
| Google Cloud STT | Google | Mature. Many languages. Streaming. |
| AssemblyAI | AssemblyAI | Speaker diarisation. Summarisation. |
Music Generation
| Model/Service | Company | Notes |
|---|---|---|
| Suno v4 | Suno AI | Full song generation from text. Vocals included. |
| Udio | Udio | High fidelity. Genre range. |
| MusicLM | Google DeepMind | Research. High quality but limited access. |
Voice cloning and TTS raise critical safety concerns — voice scams, identity theft, consent. The EU AI Act classifies some of these as high-risk.
Multimodal Models
Models that understand and generate across multiple modalities. This is where everything is heading.
| Model | Company | Modalities | Notes |
|---|---|---|---|
| GPT-4o | OpenAI | Text, image, audio, video | “Omni” — one model, all modalities |
| Gemini 2.5 Pro | Google DeepMind | Text, image, audio, video | Native multimodal. Huge context (1M+). |
| Claude 3.5 Sonnet / 4 | Anthropic | Text, image | Vision + text. Strong document analysis. |
| Pixtral | Mistral AI | Text, image | Open-weight multimodal. Efficient. |
| Llama 4 | Meta AI | Text, image, video | Open-weight. Meta’s first natively multimodal release. |
Multimodal is the future because the world isn’t text-only. These models can look at a screenshot, listen to a conversation, read a document, and respond in any combination.
See Multimodal Models for a deep dive into how multimodality works, what it unlocks, and the agents it enables.
Speech to Speech
Models that take spoken input and respond with speech directly. See Nvidia Personaplex Vs OpenAI Realtime for a comparison.
Coding Models
Models optimised for understanding and generating code. The backbone of AI coding agents.
| Model/Product | Company | Notes |
|---|---|---|
| Claude Code | Anthropic | Terminal-native. Reads codebases. Plans and executes. |
| Cursor Tab | Cursor | IDE-integrated. Fast autocomplete. Context-aware. |
| GitHub Copilot | Microsoft/OpenAI | Most widely adopted. VS Code native. |
| Devin | Cognition AI | Fully autonomous AI software engineer. |
| Aider | Open source | Git-aware pair programming with any LLM backend. |
| Codex | OpenAI | Original code model. Now folded into GPT-4. |
| Codex CLI | OpenAI | Terminal-based agent. OpenAI’s answer to Claude Code. |
| Windsurf | Codeium | AI-first IDE. Cascade mode for multi-file reasoning. |
| DeepSeek Coder V2 | DeepSeek | Open. Competitive with GPT-4 on code benchmarks. |
| Qwen Coder 2.5 | Alibaba | Open. Strong at multiple languages. |
| StarCoder2 | BigCode | Open. Permissively licensed training data. |
See Coding Models for a deep dive into each tool, the levels of coding AI, and how to choose. See AI Agents for how these models work as autonomous coding agents, not just autocomplete.
Embedding Models
Models that turn text (and increasingly, images and audio) into lists of numbers — vectors that capture meaning. Embeddings are the invisible infrastructure behind search, retrieval, and semantic understanding.
| Model | Company | Notes |
|---|---|---|
| text-embedding-3 | OpenAI | Small/large variants. Strong all-round. |
| Voyage AI | Voyage AI (MongoDB) | Recommended in Anthropic's docs. Specialised embeddings. Good for code and legal text. |
| Cohere Embed | Cohere | Multilingual. Optimised for RAG. |
| Jina Embeddings | Jina AI | Open source. 8K context. Multilingual. |
| BGE | BAAI | Open source. Has ranked at the top of the MTEB leaderboard. |
| E5 / GTE | Microsoft/Alibaba | Strong open-source baselines. |
| Nomic Embed | Nomic AI | Open. Built for visualisation and search. |
| NV-Embed | NVIDIA | Open. Has held the top MTEB rank. |
Embeddings are what make RAG work. They power semantic search, recommendation systems, clustering, and the retrieval step in virtually every AI agent. Without good embeddings, your AI has no memory of your documents.
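To make "vectors that capture meaning" concrete, here is a minimal sketch of the retrieval step: rank documents by cosine similarity between their embedding and the query's embedding. The three-dimensional vectors are made up for illustration (real embedding models return hundreds or thousands of dimensions), and the names `documents`, `query_vector`, and `rank` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy embeddings; a real system would get these from an embedding model.
documents = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "office dog photos": [0.0, 0.1, 0.9],
}

# Pretend this is the embedding of "how do I get my money back?"
query_vector = [0.8, 0.2, 0.0]


def rank(query, docs):
    """Return document names ordered from most to least similar to the query."""
    return sorted(
        docs,
        key=lambda name: cosine_similarity(query, docs[name]),
        reverse=True,
    )


ranking = rank(query_vector, documents)
print(ranking)  # ['refund policy', 'shipping times', 'office dog photos']
```

In a RAG pipeline, the top-ranked documents would then be passed to the LLM as context. Production systems usually replace the linear scan with a vector index.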
How to Choose
The model you need depends on what you’re building:
| Use Case | Best Options | Why |
|---|---|---|
| General chat/assistant | GPT-4o, Claude 3.5 Sonnet, Gemini 2.5 | Balanced capability |
| Long documents, analysis | Claude (200K context) | Best at staying coherent over long inputs |
| Coding | Claude Code, Cursor, DeepSeek Coder | Purpose-built for code workflows |
| Image generation | Midjourney (quality), SDXL/Flux (control) | Depends on your need |
| Video | Sora (quality), Runway (workflow) | Space is still maturing |
| Voice / speech | ElevenLabs (TTS quality), Whisper (STT) | Clear leaders per task |
| Multimodal | GPT-4o, Gemini 2.5 | Native multi-input understanding |
| Embeddings / search | OpenAI text-embedding-3, Cohere Embed, BGE | Depends on language and scale |
| Reasoning / hard problems | o1/o3, DeepSeek R1, Claude 4 | Step-by-step thinking |
| Privacy / local | Llama 3/4, Mistral, Qwen (e.g. run via Ollama) | Run on your hardware |
| Cost sensitive | DeepSeek, Qwen, open models | Open = cheaper at scale |
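When a project has more than one requirement, the table can be treated as a lookup and the rows intersected. The sketch below encodes a few rows (a subset, copied from the table) and returns the models that satisfy every stated need, falling back to the combined list when nothing overlaps. The names `RECOMMENDATIONS` and `shortlist` are hypothetical, and the matching is by exact string, so "Claude Code" and "Claude 4" count as different entries.

```python
# A subset of the "How to Choose" table as a lookup. Illustrative only.
RECOMMENDATIONS = {
    "general chat": ["GPT-4o", "Claude 3.5 Sonnet", "Gemini 2.5"],
    "coding": ["Claude Code", "Cursor", "DeepSeek Coder"],
    "reasoning": ["o1/o3", "DeepSeek R1", "Claude 4"],
    "local": ["Llama 3/4", "Mistral", "Qwen"],
    "cost sensitive": ["DeepSeek", "Qwen"],
}


def shortlist(*needs):
    """Models listed under every need; if none overlap, all candidates combined."""
    candidate_sets = [set(RECOMMENDATIONS[need]) for need in needs]
    common = set.intersection(*candidate_sets)
    if common:
        return sorted(common)
    return sorted(set.union(*candidate_sets))


print(shortlist("local", "cost sensitive"))  # ['Qwen']
print(len(shortlist("coding", "reasoning")))  # 6 (no overlap, so all candidates)
```

A shortlist like this is only a starting point; the final choice still depends on testing candidates against your own data.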
The Bigger Picture
Models are commoditising. What cost millions to build in 2023 can be replicated for thousands in 2025. Open-weight models keep closing the gap with closed frontier models.
What matters increasingly is not the model alone but the system around it: retrieval, agent orchestration, prompting, fine-tuning for your domain, and safety guardrails.
The model is the engine. Everything else is the car.
Go Deeper
- Text Models (LLMs) — Deep dive into the LLM landscape, reasoning models, and benchmarks
- Image Generation Models — How text-to-image models work and compare
- Video AI Models — The emerging video generation space
- Audio & Speech AI — TTS, speech-to-text, voice cloning, music generation
- Multimodal Models — Models that handle text, image, audio, and video together
- Coding Models — Specialist models for software development
- How LLMs Work — Understand what’s happening inside text models
- Training & Fine-Tuning — How models are created and customised
- AI Companies — Who builds these models
- AI Agents — How models become autonomous tools
- RAG & Retrieval — How embeddings and retrieval give models access to your data
- AI Security — Safety considerations for model deployment
- AI Safety & Ethics — The broader implications
- AI Intelligence Hub — Back to the hub home
Sources
- Chatbot Arena — Community-driven model rankings
- Hugging Face Open LLM Leaderboard — Open model benchmarks
- Artificial Analysis — Model performance and pricing comparison