ARTICLE

Text Models (LLMs)

Updated 2 May 2025
modelsllmtextgptclaudegeminillamadeepseekqwen

Text Models (LLMs)

Large Language Models are the core of the current AI revolution. They power chatbots, coding assistants, search engines, writing tools, and increasingly, autonomous agents. They are the engine under the hood of almost everything in this hub.

Understanding what models exist, how they differ, and what each is good at is fundamental to making sense of AI in 2025.


The Frontier: Closed Models

These are the most capable models on the planet, served through APIs and chat interfaces. You can’t download them or run them yourself.

GPT-4o / GPT-4.1 — OpenAI

The most widely used LLM family. GPT-4o (“omni”) is natively multimodal (text, image, audio) and powers ChatGPT. GPT-4.1 is the latest API-only variant optimised for developer use — better instruction following, longer context, improved coding.

Strengths: Broad capability, strong multilingual, massive user base, deep ecosystem (tools, plugins, APIs) Weaknesses: Closed, expensive at scale, occasionally overconfident on wrong answers Context: 128K tokens (GPT-4o), 1M (GPT-4.1)

Claude 3.5 / 4 — Anthropic

The safety-focused alternative. Claude models are known for careful reasoning, honesty about uncertainty, and genuine skill at long-form analysis and coding. Claude 4 Opus is the frontier, Claude 4 Sonnet is the sweet spot for most tasks, and Claude 3.5 Haiku handles speed-sensitive workloads.

Strengths: Long-context coherence (200K), nuanced instruction following, coding, document analysis Weaknesses: Sometimes overly cautious, can refuse harmless requests, smaller ecosystem than OpenAI Context: 200K tokens

Gemini 2.5 — Google DeepMind

Google’s frontier model. Massive context window, natively multimodal, and deeply integrated with Google’s ecosystem — Search, YouTube, Workspace, Android. Gemini 2.5 Pro introduced “thinking” capabilities (step-by-step reasoning at inference time).

Strengths: Huge context (1M+ tokens), search grounding, Google integration, multimodal from the ground up Weaknesses: Less independent developer ecosystem, Google’s brand in AI is still rebuilding after early missteps Context: 1M tokens (Pro), expanding to 2M

Grok 3 — xAI

Elon Musk’s entry. Grok positions itself as the “anti-woke” model — fewer guardrails, more willing to engage with controversial topics. Trained on X/Twitter data, integrated into the X platform. Grok 3 closed much of the gap with GPT-4o and Claude.

Strengths: Real-time X data access, less filtered responses, competitive reasoning Weaknesses: Smaller ecosystem, polarising brand, behind on some benchmarks vs GPT/Claude

DeepSeek V3 / R1 — DeepSeek

The Chinese lab that shocked the industry. DeepSeek V3 matches frontier models at a fraction of the training cost. R1 is their reasoning model — think step-by-step, openly published chain-of-thought, competitive with OpenAI’s o1. Both have open weights.

Strengths: Open weights, cost-efficient training, competitive performance, transparent reasoning (R1) Weaknesses: Chinese data regulations, some censorship on politically sensitive topics, IP questions about training data Context: 128K tokens

Command R+ — Cohere

Enterprise-first. Cohere focuses on business use cases — RAG, embeddings, search, and tool use — rather than general-purpose chat. Command R+ is their most capable model, optimised for retrieval-augmented generation workflows.

Strengths: RAG performance, enterprise compliance, multilingual (especially Arabic, Japanese, Korean) Weaknesses: Less capable on creative/general tasks, smaller developer mindshare Context: 128K tokens

Mistral Large — Mistral AI

Europe’s champion. Mistral builds efficient models that compete near the frontier. Known for strong multilingual performance, lean architecture, and a mix of open and commercial releases. Mistral Large is their flagship, competitive with Claude and GPT on many tasks.

Strengths: Efficient, strong multilingual (French, Spanish, German, etc.), European data sovereignty Weaknesses: Smaller than the largest labs, less research output, narrower product range Context: 128K tokens

Amazon Nova — Amazon

Amazon’s entry into the LLM market, built for AWS. The Nova family includes models at multiple price/performance tiers. Tight integration with Bedrock (Amazon’s AI platform) is the play — they don’t need to win on pure capability if they win on AWS developer convenience.

Strengths: AWS integration, multiple tiers, good performance-to-cost ratio Weaknesses: Late to the race, less brand recognition in AI, behind frontier on raw benchmarks


The Open-Weight Movement

These models have publicly available weights. You can download them, run them on your own hardware (or a cloud VM), and fine-tune them on your data. No API calls, no data leaving your environment, no usage limits except your hardware.

Llama 3 / 4 — Meta AI

The most important open-weight model in the world. Meta’s strategy is to give the models away, build an ecosystem, and commoditise their competitors’ products. Llama 3.1 405B was a landmark — the first open model to genuinely match GPT-4 class on many benchmarks. Llama 4 brought native multimodality.

Why it matters: Llama proved that open models can compete with closed frontier models. It’s the default choice for companies that want AI capabilities without sending data to OpenAI or Anthropic. The community around it is enormous — fine-tunes, tools, and deployment patterns for every use case.

Qwen 2.5 — Alibaba

China’s strongest open-weight model. Qwen 2.5 is competitive with Llama across languages and excels at coding. Available in sizes from 0.5B to 72B parameters.

Strengths: Coding benchmarks, multilingual (especially Chinese, Arabic, Southeast Asian languages), range of sizes Weaknesses: Chinese government influence over some versions, less documentation in English

Mixtral / Mistral Open — Mistral AI

Mixtral 8x22B uses a mixture-of-experts architecture — only a fraction of the model activates for any given input. This makes it surprisingly fast and cheap to run relative to its quality. Mistral Nemo and Small fill the smaller-model niches.

Strengths: Efficient architecture, strong multilingual, European data governance Weaknesses: Smaller parameter counts mean it can’t match the largest models on truly hard reasoning

Gemma — Google DeepMind

Google’s open-weight offering. Smaller, research-friendly models based on the same technology as Gemini. Useful for experimentation, fine-tuning, and academic work. Not trying to compete on raw capability — more like “here’s the ingredients, go cook.”

Phi-4 — Microsoft

Microsoft’s small but capable model. Phi models are built for efficiency — strong performance from relatively few parameters. Phi-4 (14B) competes with much larger models on reasoning and math. Good for edge devices and applications where compute is constrained.

Open-Source Ecosystem

Beyond the major releases, there’s a thriving community of fine-tunes, merges, and derivatives:

ModelBased OnNotable For
Hermes (Nous Research)LlamaUncensored, creative writing, roleplay
DolphinLlamaUncensored, broad use
CodeLlamaLlamaMeta’s code-specialised variant
Yi (01.AI)OriginalStrong bilingual (Chinese/English)
OLMo (AI2)OriginalFully open — model + data + training code

Reasoning Models: The New Paradigm

A new category emerged in late 2024: models that “think” before responding.

OpenAI o1 / o3

First to popularise “test-time compute” — the model spends extra time reasoning through problems rather than generating an immediate answer. Dramatically better at math, logic, and scientific reasoning. Slower and more expensive, but when you need to think hard about something, this is the approach.

DeepSeek R1

An open-weight reasoning model that shakes the field. R1 shows its full chain of thought (unlike o1, which hides it). Matches or exceeds o1 on reasoning benchmarks. Openly published — you can see exactly how it thinks through a problem. This transparency is a big deal for safety research.

Claude 4 Opus (Thinking Mode)

Claude 4 supports “extended thinking” — controllable depth of reasoning before output. You can set a thinking budget (how many tokens to spend reasoning). More thinking = better answers on hard problems.

Gemini 2.5 Pro (Thinking)

Google added thinking capabilities to Gemini 2.5 Pro. The model can reason at length before producing output, significantly improving performance on math, science, and complex analysis.

Why this matters: Reasoning models represent a shift in how we think about scaling AI. Instead of just making models bigger, we’re making them think longer. This is compute at inference time rather than training time. It means capability keeps improving even as training hits diminishing returns. See Scaling Laws for Neural Language Models for the background.


How to Choose

Use CaseBest OptionsWhy
General chat/assistantGPT-4o, Claude 3.5 SonnetBalanced, reliable, large ecosystems
Long documents, deep analysisClaude (200K context)Best long-context coherence
Privacy, local deploymentLlama, Qwen, MixtralRun on your hardware
Cost-sensitive, high-volumeDeepSeek V3, Qwen, Llama (self-hosted)Open models = cheaper at scale
Hard math, science, logico1/o3, DeepSeek R1, Claude 4Reasoning models specifically designed for this
CodingClaude 3.5 Sonnet, DeepSeek Coder, Qwen CoderSee Coding Models
Enterprise RAGCommand R+, GPT-4.1Optimised for retrieval workflows
Multilingual (European)Mistral Large, LlamaStrong across European languages
Multilingual (Asian)Qwen, Command R+Good Chinese, Arabic, Japanese, Korean
Least guardrailsGrok, Hermes (open)Willing to discuss most topics

Benchmarks & Their Limits

The community uses several benchmarks to compare models, but they all have flaws:

BenchmarkMeasuresThe Grain of Salt
MMLUKnowledge across 57 subjectsModels can memorise test answers
HumanEvalCode generationDoesn’t capture real-world coding
GSM8KGrade-school mathModels get better at this test, not necessarily math
Chatbot ArenaHuman preference (blind votes)Subjective, biases toward certain styles
MATHCompetition-level mathNarrow — doesn’t reflect most actual use

Benchmarks are useful signals, not final answers. The best way to choose a model is to test it on your actual use cases. Every use case has quirks that benchmarks don’t capture.

Chatbot Arena (chat.lmsys.org) is the most practically useful — it captures what people actually prefer when using models side-by-side, without knowing which they’re voting for.


What to Watch

  • Reasoning-as-standard — Will “thinking” become the default mode for all frontier models?
  • Context windows — 1M tokens is here. 10M is coming. What changes when you can give a model an entire library?
  • Open frontier — Llama 4 closed the gap significantly. Will an open model fully match the frontier in 2025?
  • Training cost collapse — DeepSeek showed you can train frontier-competitive models for far less than previously claimed. Is the moat evaporating?
  • Agentic workflows — Models aren’t just chat anymore; they’re acting in the world. Tool use, multi-step planning, code execution.

Go Deeper

Sources

enes