Text Models (LLMs)
Text Models (LLMs)
Large Language Models are the core of the current AI revolution. They power chatbots, coding assistants, search engines, writing tools, and increasingly, autonomous agents. They are the engine under the hood of almost everything in this hub.
Understanding what models exist, how they differ, and what each is good at is fundamental to making sense of AI in 2025.
The Frontier: Closed Models
These are the most capable models on the planet, served through APIs and chat interfaces. You can’t download them or run them yourself.
GPT-4o / GPT-4.1 — OpenAI
The most widely used LLM family. GPT-4o (“omni”) is natively multimodal (text, image, audio) and powers ChatGPT. GPT-4.1 is the latest API-only variant optimised for developer use — better instruction following, longer context, improved coding.
Strengths: Broad capability, strong multilingual, massive user base, deep ecosystem (tools, plugins, APIs) Weaknesses: Closed, expensive at scale, occasionally overconfident on wrong answers Context: 128K tokens (GPT-4o), 1M (GPT-4.1)
Claude 3.5 / 4 — Anthropic
The safety-focused alternative. Claude models are known for careful reasoning, honesty about uncertainty, and genuine skill at long-form analysis and coding. Claude 4 Opus is the frontier, Claude 4 Sonnet is the sweet spot for most tasks, and Claude 3.5 Haiku handles speed-sensitive workloads.
Strengths: Long-context coherence (200K), nuanced instruction following, coding, document analysis Weaknesses: Sometimes overly cautious, can refuse harmless requests, smaller ecosystem than OpenAI Context: 200K tokens
Gemini 2.5 — Google DeepMind
Google’s frontier model. Massive context window, natively multimodal, and deeply integrated with Google’s ecosystem — Search, YouTube, Workspace, Android. Gemini 2.5 Pro introduced “thinking” capabilities (step-by-step reasoning at inference time).
Strengths: Huge context (1M+ tokens), search grounding, Google integration, multimodal from the ground up Weaknesses: Less independent developer ecosystem, Google’s brand in AI is still rebuilding after early missteps Context: 1M tokens (Pro), expanding to 2M
Grok 3 — xAI
Elon Musk’s entry. Grok positions itself as the “anti-woke” model — fewer guardrails, more willing to engage with controversial topics. Trained on X/Twitter data, integrated into the X platform. Grok 3 closed much of the gap with GPT-4o and Claude.
Strengths: Real-time X data access, less filtered responses, competitive reasoning Weaknesses: Smaller ecosystem, polarising brand, behind on some benchmarks vs GPT/Claude
DeepSeek V3 / R1 — DeepSeek
The Chinese lab that shocked the industry. DeepSeek V3 matches frontier models at a fraction of the training cost. R1 is their reasoning model — think step-by-step, openly published chain-of-thought, competitive with OpenAI’s o1. Both have open weights.
Strengths: Open weights, cost-efficient training, competitive performance, transparent reasoning (R1) Weaknesses: Chinese data regulations, some censorship on politically sensitive topics, IP questions about training data Context: 128K tokens
Command R+ — Cohere
Enterprise-first. Cohere focuses on business use cases — RAG, embeddings, search, and tool use — rather than general-purpose chat. Command R+ is their most capable model, optimised for retrieval-augmented generation workflows.
Strengths: RAG performance, enterprise compliance, multilingual (especially Arabic, Japanese, Korean) Weaknesses: Less capable on creative/general tasks, smaller developer mindshare Context: 128K tokens
Mistral Large — Mistral AI
Europe’s champion. Mistral builds efficient models that compete near the frontier. Known for strong multilingual performance, lean architecture, and a mix of open and commercial releases. Mistral Large is their flagship, competitive with Claude and GPT on many tasks.
Strengths: Efficient, strong multilingual (French, Spanish, German, etc.), European data sovereignty Weaknesses: Smaller than the largest labs, less research output, narrower product range Context: 128K tokens
Amazon Nova — Amazon
Amazon’s entry into the LLM market, built for AWS. The Nova family includes models at multiple price/performance tiers. Tight integration with Bedrock (Amazon’s AI platform) is the play — they don’t need to win on pure capability if they win on AWS developer convenience.
Strengths: AWS integration, multiple tiers, good performance-to-cost ratio Weaknesses: Late to the race, less brand recognition in AI, behind frontier on raw benchmarks
The Open-Weight Movement
These models have publicly available weights. You can download them, run them on your own hardware (or a cloud VM), and fine-tune them on your data. No API calls, no data leaving your environment, no usage limits except your hardware.
Llama 3 / 4 — Meta AI
The most important open-weight model in the world. Meta’s strategy is to give the models away, build an ecosystem, and commoditise their competitors’ products. Llama 3.1 405B was a landmark — the first open model to genuinely match GPT-4 class on many benchmarks. Llama 4 brought native multimodality.
Why it matters: Llama proved that open models can compete with closed frontier models. It’s the default choice for companies that want AI capabilities without sending data to OpenAI or Anthropic. The community around it is enormous — fine-tunes, tools, and deployment patterns for every use case.
Qwen 2.5 — Alibaba
China’s strongest open-weight model. Qwen 2.5 is competitive with Llama across languages and excels at coding. Available in sizes from 0.5B to 72B parameters.
Strengths: Coding benchmarks, multilingual (especially Chinese, Arabic, Southeast Asian languages), range of sizes Weaknesses: Chinese government influence over some versions, less documentation in English
Mixtral / Mistral Open — Mistral AI
Mixtral 8x22B uses a mixture-of-experts architecture — only a fraction of the model activates for any given input. This makes it surprisingly fast and cheap to run relative to its quality. Mistral Nemo and Small fill the smaller-model niches.
Strengths: Efficient architecture, strong multilingual, European data governance Weaknesses: Smaller parameter counts mean it can’t match the largest models on truly hard reasoning
Gemma — Google DeepMind
Google’s open-weight offering. Smaller, research-friendly models based on the same technology as Gemini. Useful for experimentation, fine-tuning, and academic work. Not trying to compete on raw capability — more like “here’s the ingredients, go cook.”
Phi-4 — Microsoft
Microsoft’s small but capable model. Phi models are built for efficiency — strong performance from relatively few parameters. Phi-4 (14B) competes with much larger models on reasoning and math. Good for edge devices and applications where compute is constrained.
Open-Source Ecosystem
Beyond the major releases, there’s a thriving community of fine-tunes, merges, and derivatives:
| Model | Based On | Notable For |
|---|---|---|
| Hermes (Nous Research) | Llama | Uncensored, creative writing, roleplay |
| Dolphin | Llama | Uncensored, broad use |
| CodeLlama | Llama | Meta’s code-specialised variant |
| Yi (01.AI) | Original | Strong bilingual (Chinese/English) |
| OLMo (AI2) | Original | Fully open — model + data + training code |
Reasoning Models: The New Paradigm
A new category emerged in late 2024: models that “think” before responding.
OpenAI o1 / o3
First to popularise “test-time compute” — the model spends extra time reasoning through problems rather than generating an immediate answer. Dramatically better at math, logic, and scientific reasoning. Slower and more expensive, but when you need to think hard about something, this is the approach.
DeepSeek R1
An open-weight reasoning model that shakes the field. R1 shows its full chain of thought (unlike o1, which hides it). Matches or exceeds o1 on reasoning benchmarks. Openly published — you can see exactly how it thinks through a problem. This transparency is a big deal for safety research.
Claude 4 Opus (Thinking Mode)
Claude 4 supports “extended thinking” — controllable depth of reasoning before output. You can set a thinking budget (how many tokens to spend reasoning). More thinking = better answers on hard problems.
Gemini 2.5 Pro (Thinking)
Google added thinking capabilities to Gemini 2.5 Pro. The model can reason at length before producing output, significantly improving performance on math, science, and complex analysis.
Why this matters: Reasoning models represent a shift in how we think about scaling AI. Instead of just making models bigger, we’re making them think longer. This is compute at inference time rather than training time. It means capability keeps improving even as training hits diminishing returns. See Scaling Laws for Neural Language Models for the background.
How to Choose
| Use Case | Best Options | Why |
|---|---|---|
| General chat/assistant | GPT-4o, Claude 3.5 Sonnet | Balanced, reliable, large ecosystems |
| Long documents, deep analysis | Claude (200K context) | Best long-context coherence |
| Privacy, local deployment | Llama, Qwen, Mixtral | Run on your hardware |
| Cost-sensitive, high-volume | DeepSeek V3, Qwen, Llama (self-hosted) | Open models = cheaper at scale |
| Hard math, science, logic | o1/o3, DeepSeek R1, Claude 4 | Reasoning models specifically designed for this |
| Coding | Claude 3.5 Sonnet, DeepSeek Coder, Qwen Coder | See Coding Models |
| Enterprise RAG | Command R+, GPT-4.1 | Optimised for retrieval workflows |
| Multilingual (European) | Mistral Large, Llama | Strong across European languages |
| Multilingual (Asian) | Qwen, Command R+ | Good Chinese, Arabic, Japanese, Korean |
| Least guardrails | Grok, Hermes (open) | Willing to discuss most topics |
Benchmarks & Their Limits
The community uses several benchmarks to compare models, but they all have flaws:
| Benchmark | Measures | The Grain of Salt |
|---|---|---|
| MMLU | Knowledge across 57 subjects | Models can memorise test answers |
| HumanEval | Code generation | Doesn’t capture real-world coding |
| GSM8K | Grade-school math | Models get better at this test, not necessarily math |
| Chatbot Arena | Human preference (blind votes) | Subjective, biases toward certain styles |
| MATH | Competition-level math | Narrow — doesn’t reflect most actual use |
Benchmarks are useful signals, not final answers. The best way to choose a model is to test it on your actual use cases. Every use case has quirks that benchmarks don’t capture.
Chatbot Arena (chat.lmsys.org) is the most practically useful — it captures what people actually prefer when using models side-by-side, without knowing which they’re voting for.
What to Watch
- Reasoning-as-standard — Will “thinking” become the default mode for all frontier models?
- Context windows — 1M tokens is here. 10M is coming. What changes when you can give a model an entire library?
- Open frontier — Llama 4 closed the gap significantly. Will an open model fully match the frontier in 2025?
- Training cost collapse — DeepSeek showed you can train frontier-competitive models for far less than previously claimed. Is the moat evaporating?
- Agentic workflows — Models aren’t just chat anymore; they’re acting in the world. Tool use, multi-step planning, code execution.
Go Deeper
- AI Models — The complete model landscape across all modalities
- Coding Models — Models specialised for software development
- Multimodal Models — Models that handle text, image, audio together
- Image Generation Models — Text-to-image models
- How LLMs Work — What’s happening inside these models
- Training & Fine-Tuning — How models are created and customised
- AI Companies — Who builds these models
- AI Intelligence Hub — Back to the hub home
Sources
- Chatbot Arena — Community-driven blind model rankings
- Open LLM Leaderboard — Open model benchmarks
- Artificial Analysis — Model performance, pricing, and speed comparison
- Hugging Face — Repository of open-weight models