ARTICLE

Text Models (LLMs)

Updated 2 May 2025

modelsllmtextgptclaudegeminillamadeepseekqwen

Text Models (LLMs)

Large Language Models are the core of the current AI revolution. They power chatbots, coding assistants, search engines, writing tools, and increasingly, autonomous agents. They are the engine under the hood of almost everything in this hub.

Understanding what models exist, how they differ, and what each is good at is fundamental to making sense of AI in 2025.

The Frontier: Closed Models

These are the most capable models on the planet, served through APIs and chat interfaces. You can’t download them or run them yourself.

GPT-4o / GPT-4.1 — OpenAI

The most widely used LLM family. GPT-4o (“omni”) is natively multimodal (text, image, audio) and powers ChatGPT. GPT-4.1 is the latest API-only variant optimised for developer use — better instruction following, longer context, improved coding.

Strengths: Broad capability, strong multilingual, massive user base, deep ecosystem (tools, plugins, APIs) Weaknesses: Closed, expensive at scale, occasionally overconfident on wrong answers Context: 128K tokens (GPT-4o), 1M (GPT-4.1)

Claude 3.5 / 4 — Anthropic

The safety-focused alternative. Claude models are known for careful reasoning, honesty about uncertainty, and genuine skill at long-form analysis and coding. Claude 4 Opus is the frontier, Claude 4 Sonnet is the sweet spot for most tasks, and Claude 3.5 Haiku handles speed-sensitive workloads.

Strengths: Long-context coherence (200K), nuanced instruction following, coding, document analysis Weaknesses: Sometimes overly cautious, can refuse harmless requests, smaller ecosystem than OpenAI Context: 200K tokens

Gemini 2.5 — Google DeepMind

Google’s frontier model. Massive context window, natively multimodal, and deeply integrated with Google’s ecosystem — Search, YouTube, Workspace, Android. Gemini 2.5 Pro introduced “thinking” capabilities (step-by-step reasoning at inference time).

Strengths: Huge context (1M+ tokens), search grounding, Google integration, multimodal from the ground up Weaknesses: Less independent developer ecosystem, Google’s brand in AI is still rebuilding after early missteps Context: 1M tokens (Pro), expanding to 2M

Grok 3 — xAI

Elon Musk’s entry. Grok positions itself as the “anti-woke” model — fewer guardrails, more willing to engage with controversial topics. Trained on X/Twitter data, integrated into the X platform. Grok 3 closed much of the gap with GPT-4o and Claude.

Strengths: Real-time X data access, less filtered responses, competitive reasoning Weaknesses: Smaller ecosystem, polarising brand, behind on some benchmarks vs GPT/Claude

DeepSeek V3 / R1 — DeepSeek

The Chinese lab that shocked the industry. DeepSeek V3 matches frontier models at a fraction of the training cost. R1 is their reasoning model — think step-by-step, openly published chain-of-thought, competitive with OpenAI’s o1. Both have open weights.

Strengths: Open weights, cost-efficient training, competitive performance, transparent reasoning (R1) Weaknesses: Chinese data regulations, some censorship on politically sensitive topics, IP questions about training data Context: 128K tokens

Command R+ — Cohere

Enterprise-first. Cohere focuses on business use cases — RAG, embeddings, search, and tool use — rather than general-purpose chat. Command R+ is their most capable model, optimised for retrieval-augmented generation workflows.

Strengths: RAG performance, enterprise compliance, multilingual (especially Arabic, Japanese, Korean) Weaknesses: Less capable on creative/general tasks, smaller developer mindshare Context: 128K tokens

Mistral Large — Mistral AI

Europe’s champion. Mistral builds efficient models that compete near the frontier. Known for strong multilingual performance, lean architecture, and a mix of open and commercial releases. Mistral Large is their flagship, competitive with Claude and GPT on many tasks.

Strengths: Efficient, strong multilingual (French, Spanish, German, etc.), European data sovereignty Weaknesses: Smaller than the largest labs, less research output, narrower product range Context: 128K tokens

Amazon Nova — Amazon

Amazon’s entry into the LLM market, built for AWS. The Nova family includes models at multiple price/performance tiers. Tight integration with Bedrock (Amazon’s AI platform) is the play — they don’t need to win on pure capability if they win on AWS developer convenience.

Strengths: AWS integration, multiple tiers, good performance-to-cost ratio Weaknesses: Late to the race, less brand recognition in AI, behind frontier on raw benchmarks

The Open-Weight Movement

These models have publicly available weights. You can download them, run them on your own hardware (or a cloud VM), and fine-tune them on your data. No API calls, no data leaving your environment, no usage limits except your hardware.

Llama 3 / 4 — Meta AI

The most important open-weight model in the world. Meta’s strategy is to give the models away, build an ecosystem, and commoditise their competitors’ products. Llama 3.1 405B was a landmark — the first open model to genuinely match GPT-4 class on many benchmarks. Llama 4 brought native multimodality.

Why it matters: Llama proved that open models can compete with closed frontier models. It’s the default choice for companies that want AI capabilities without sending data to OpenAI or Anthropic. The community around it is enormous — fine-tunes, tools, and deployment patterns for every use case.

Qwen 2.5 — Alibaba

China’s strongest open-weight model. Qwen 2.5 is competitive with Llama across languages and excels at coding. Available in sizes from 0.5B to 72B parameters.

Strengths: Coding benchmarks, multilingual (especially Chinese, Arabic, Southeast Asian languages), range of sizes Weaknesses: Chinese government influence over some versions, less documentation in English

Mixtral / Mistral Open — Mistral AI

Mixtral 8x22B uses a mixture-of-experts architecture — only a fraction of the model activates for any given input. This makes it surprisingly fast and cheap to run relative to its quality. Mistral Nemo and Small fill the smaller-model niches.

Strengths: Efficient architecture, strong multilingual, European data governance Weaknesses: Smaller parameter counts mean it can’t match the largest models on truly hard reasoning

Gemma — Google DeepMind

Google’s open-weight offering. Smaller, research-friendly models based on the same technology as Gemini. Useful for experimentation, fine-tuning, and academic work. Not trying to compete on raw capability — more like “here’s the ingredients, go cook.”

Phi-4 — Microsoft

Microsoft’s small but capable model. Phi models are built for efficiency — strong performance from relatively few parameters. Phi-4 (14B) competes with much larger models on reasoning and math. Good for edge devices and applications where compute is constrained.

Open-Source Ecosystem

Beyond the major releases, there’s a thriving community of fine-tunes, merges, and derivatives:

Model	Based On	Notable For
Hermes (Nous Research)	Llama	Uncensored, creative writing, roleplay
Dolphin	Llama	Uncensored, broad use
CodeLlama	Llama	Meta’s code-specialised variant
Yi (01.AI)	Original	Strong bilingual (Chinese/English)
OLMo (AI2)	Original	Fully open — model + data + training code

Reasoning Models: The New Paradigm

A new category emerged in late 2024: models that “think” before responding.

OpenAI o1 / o3

First to popularise “test-time compute” — the model spends extra time reasoning through problems rather than generating an immediate answer. Dramatically better at math, logic, and scientific reasoning. Slower and more expensive, but when you need to think hard about something, this is the approach.

DeepSeek R1

An open-weight reasoning model that shakes the field. R1 shows its full chain of thought (unlike o1, which hides it). Matches or exceeds o1 on reasoning benchmarks. Openly published — you can see exactly how it thinks through a problem. This transparency is a big deal for safety research.

Claude 4 Opus (Thinking Mode)

Claude 4 supports “extended thinking” — controllable depth of reasoning before output. You can set a thinking budget (how many tokens to spend reasoning). More thinking = better answers on hard problems.

Gemini 2.5 Pro (Thinking)

Google added thinking capabilities to Gemini 2.5 Pro. The model can reason at length before producing output, significantly improving performance on math, science, and complex analysis.

Why this matters: Reasoning models represent a shift in how we think about scaling AI. Instead of just making models bigger, we’re making them think longer. This is compute at inference time rather than training time. It means capability keeps improving even as training hits diminishing returns. See Scaling Laws for Neural Language Models for the background.

How to Choose

Use Case	Best Options	Why
General chat/assistant	GPT-4o, Claude 3.5 Sonnet	Balanced, reliable, large ecosystems
Long documents, deep analysis	Claude (200K context)	Best long-context coherence
Privacy, local deployment	Llama, Qwen, Mixtral	Run on your hardware
Cost-sensitive, high-volume	DeepSeek V3, Qwen, Llama (self-hosted)	Open models = cheaper at scale
Hard math, science, logic	o1/o3, DeepSeek R1, Claude 4	Reasoning models specifically designed for this
Coding	Claude 3.5 Sonnet, DeepSeek Coder, Qwen Coder	See Coding Models
Enterprise RAG	Command R+, GPT-4.1	Optimised for retrieval workflows
Multilingual (European)	Mistral Large, Llama	Strong across European languages
Multilingual (Asian)	Qwen, Command R+	Good Chinese, Arabic, Japanese, Korean
Least guardrails	Grok, Hermes (open)	Willing to discuss most topics

Benchmarks & Their Limits

The community uses several benchmarks to compare models, but they all have flaws:

Benchmark	Measures	The Grain of Salt
MMLU	Knowledge across 57 subjects	Models can memorise test answers
HumanEval	Code generation	Doesn’t capture real-world coding
GSM8K	Grade-school math	Models get better at this test, not necessarily math
Chatbot Arena	Human preference (blind votes)	Subjective, biases toward certain styles
MATH	Competition-level math	Narrow — doesn’t reflect most actual use

Benchmarks are useful signals, not final answers. The best way to choose a model is to test it on your actual use cases. Every use case has quirks that benchmarks don’t capture.

Chatbot Arena (chat.lmsys.org) is the most practically useful — it captures what people actually prefer when using models side-by-side, without knowing which they’re voting for.

What to Watch

Reasoning-as-standard — Will “thinking” become the default mode for all frontier models?
Context windows — 1M tokens is here. 10M is coming. What changes when you can give a model an entire library?
Open frontier — Llama 4 closed the gap significantly. Will an open model fully match the frontier in 2025?
Training cost collapse — DeepSeek showed you can train frontier-competitive models for far less than previously claimed. Is the moat evaporating?
Agentic workflows — Models aren’t just chat anymore; they’re acting in the world. Tool use, multi-step planning, code execution.

Go Deeper

AI Models — The complete model landscape across all modalities
Coding Models — Models specialised for software development
Multimodal Models — Models that handle text, image, audio together
Image Generation Models — Text-to-image models
How LLMs Work — What’s happening inside these models
Training & Fine-Tuning — How models are created and customised
AI Companies — Who builds these models
AI Intelligence Hub — Back to the hub home

Sources

Chatbot Arena — Community-driven blind model rankings
Open LLM Leaderboard — Open model benchmarks
Artificial Analysis — Model performance, pricing, and speed comparison
Hugging Face — Repository of open-weight models