PulseAugur

The state of frontier AI models, 2026

A landscape survey of the labs shipping at the frontier — what they released, what's interesting about it, and where the field is going next.

By Chris Valentine

GPT-5 family

GPT-5 launched in August 2025. The shape of the release was the most interesting part. Rather than ship one larger model named "GPT-5," OpenAI shipped a unified system that internally routes between a fast tier (gpt-5), a reasoning tier (gpt-5-thinking), and an extended-research tier (gpt-5-pro). The router decides which one gets your query based on prompt complexity, so most ChatGPT users see a single interface rather than a model picker. API users can pin to a specific tier for predictable latency and cost.
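The router's behavior can be illustrated with a toy heuristic. This is a sketch only — OpenAI's actual router is a learned model, and the scoring signals and thresholds here are invented; only the tier names come from the launch:

```python
def route_query(prompt: str, tools_requested: bool = False) -> str:
    """Toy complexity router: pick a GPT-5 tier for a prompt.

    Illustrative only -- the real router is a learned model;
    these signals and thresholds are invented for the sketch.
    """
    score = 0
    score += len(prompt) // 500                       # long prompts lean complex
    score += 2 * sum(kw in prompt.lower()
                     for kw in ("prove", "derive", "debug", "step by step"))
    if tools_requested:
        score += 3                                    # multi-tool work leans deep

    if score >= 5:
        return "gpt-5-pro"        # extended-research tier
    if score >= 2:
        return "gpt-5-thinking"   # reasoning tier
    return "gpt-5"                # fast tier

print(route_query("What's the capital of France?"))
print(route_query("Prove step by step that sqrt(2) is irrational."))
```

API users pinning a tier corresponds to skipping a router like this entirely and naming the model directly.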

The capabilities split: the fast tier is roughly comparable to GPT-4o on most benchmarks but with measurably better instruction-following and lower hallucination rates. gpt-5-thinking exposes a configurable reasoning-token budget up to roughly 128K tokens, which is what makes it viable for math, code, and multi-step problems where the chain itself is the value. gpt-5-pro runs deep-research workflows — multi-tool, multi-source, hour-long sessions that look more like an agent than a chat.

Context window: 256K standard. Multimodal first-class — vision and image generation are flags on the same endpoint, not separate models. Voice mode shipped in Q4 2025 through the Realtime API.

The interesting bet underneath the launch: OpenAI moved off the "next bigger model" cadence. GPT-5 is not 10× GPT-4o; it's GPT-4o-class with infrastructure (routing, reasoning, tools, memory) that lifts the underlying model's effective performance. The next year of progress, on this thesis, comes from scaffolding rather than parameter count. PulseAugur tracks every OpenAI release at /entity/openai.

Claude 4.x family

Anthropic shipped Sonnet 4 and Opus 4 in mid-2025, then iterated rapidly: Sonnet 4.5 (September 2025), Opus 4.1, Opus 4.5, Haiku 4.5 (October 2025), and most recently Opus 4.7 (early 2026) with a 1M-token context window in beta. The family tree is still Haiku for cost-sensitive workloads, Sonnet for general production, Opus for the hardest problems — but the iteration cadence is unlike any prior frontier-lab cycle.

What's distinctive about the line: tool-use is first-class everywhere. The models are trained to chain tool calls without prompting tricks. Computer-use mode — a Claude agent that screenshots a virtual computer, generates clicks and keystrokes, and operates a real browser — is in production for agentic workflows. Model Context Protocol (MCP), Anthropic's open spec for tool servers, has become the de facto standard; Microsoft Copilot, Cursor, Zed, Replit, and dozens of IDE and agent platforms ship MCP integrations.
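The shape of a tool-chaining loop is worth seeing concretely. This is a generic sketch, not MCP itself — the message format and the dict protocol (`tool`/`args` vs. `final`) are invented for the illustration, and `scripted_model` stands in for a real LLM call:

```python
import json

def run_tool_loop(model_step, tools, user_msg, max_steps=5):
    """Minimal agent loop: call the model, execute any requested tool,
    feed the result back, stop when the model returns a final answer.
    The dict protocol ('tool'/'args' or 'final') is invented for this sketch."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = model_step(messages)
        if "final" in reply:
            return reply["final"]
        result = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("no final answer within step budget")

# Demo with a stubbed "model" that requests one tool call, then answers.
def scripted_model(messages):
    if any(m["role"] == "tool" for m in messages):
        return {"final": "2 + 2 = 4"}
    return {"tool": "add", "args": {"a": 2, "b": 2}}

print(run_tool_loop(scripted_model, {"add": lambda a, b: a + b}, "What is 2+2?"))
```

Real MCP servers externalize the `tools` dict over a standard transport; the loop itself keeps the same shape.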

Context: 200K standard, with Sonnet 4.5 and Opus 4.7 offering 1M in beta. Opus 4.7 in particular has become the model of choice for long-context agentic tasks — codebase-spanning refactors, document-heavy research, multi-week conversations that hold state.

Pricing: Sonnet 4.5 sits at $3 input / $15 output per million tokens; Opus 4.7 is $15 / $75 baseline. Prompt caching gives a ~90% discount on cached input prefix; the batch API gives 50% off with a 24-hour SLA. Real-world cost per token compresses meaningfully against the sticker price.
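The compression against sticker price is easy to quantify. A minimal sketch using the Sonnet 4.5 prices listed above; the workload mix (100K-token prompt, 80% cache hit rate, 2K output) is an invented example:

```python
def effective_cost_usd(input_tokens: int, output_tokens: int,
                       cached_fraction: float,
                       in_price: float = 3.0, out_price: float = 15.0,
                       cache_discount: float = 0.90) -> float:
    """Blended cost per request in USD; prices are per million tokens.

    cached_fraction: share of input tokens served from the prompt cache.
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    return (fresh * in_price
            + cached * in_price * (1 - cache_discount)
            + output_tokens * out_price) / 1_000_000

# 100K input tokens, 80% cached, 2K output: $0.114 vs $0.33 uncached
print(round(effective_cost_usd(100_000, 2_000, 0.80), 4))
print(round(effective_cost_usd(100_000, 2_000, 0.00), 4))
```

At an 80% cache hit rate the request costs roughly a third of the uncached price, which is the "real-world compression" the paragraph describes.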

Anthropic has positioned the family for production deployment more than viral consumer chat. Their growth has come through API + Claude.ai + integrations rather than a Super Bowl advertisement.

Gemini 3

Gemini 3 Pro launched in early 2026, replacing Gemini 2.5 Pro and 2.5 Flash. Google DeepMind's positioning: multimodal-native everything, and the longest context window on offer. Gemini 3 Pro accepts up to 2M tokens standard, with an experimental long-context tier reaching 10M for document-heavy workloads.

The capability split: Gemini 3 Pro is the flagship, Gemini 3 Flash trades some capability for speed, Gemini 3 Flash-Lite handles high-throughput workloads, and Gemini 3 Pro Deep Think is the reasoning variant. AI Studio (Google's developer surface) and Vertex AI (enterprise) both ship Gemini 3 directly.

What's distinctive: deep integration with Google Workspace. Docs, Sheets, Slides, Gmail, and Meet all have Gemini 3 features built in for AI Premium and Workspace Business+ subscribers. Every Gmail user is one upsell away from being a Gemini user. The enterprise on-ramp is the moat.

Multimodal in: text, image, video (~2 hours), audio (~22 hours), code. Multimodal out: text + image generation (Imagen 4 integration) + voice via Gemini Live. The audio-in length is notable — transcribing a podcast or recorded meeting in one pass is a clean fit for the model.

The strategic context that matters: Google's AI Overviews (the LLM-summary block above traditional search results) runs on a Gemini variant. The integration story is what's driving Google's defensive AI strategy more than the standalone Gemini chat app — every search query becomes a Gemini query at the edge.

Llama 4

Meta released the Llama 4 family in 2025: Llama 4 Scout (small, edge-deployable), Llama 4 Maverick (mid-tier MoE), and Llama 4 Behemoth (largest, delayed for safety review then shipped early 2026). Open weights with the Llama Community License — usage permissive for most cases, with a carve-out restricting services with more than 700M monthly active users.

What's distinctive: Mixture-of-Experts at scale. Maverick uses 17B active parameters from a 400B parameter pool, hitting Claude 3.5 Sonnet-class performance at meaningfully lower inference cost. Behemoth is the larger sibling — roughly 2T parameters total with 288B active. The MoE architecture is the bet that you can scale parameter count without proportionally scaling inference cost; the early production data has supported that.
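The active-vs-total split works because each token is routed to only a few experts per layer. A toy top-k MoE layer makes the mechanism concrete — the dimensions, routing weights, and expert count here are invented and far smaller than Llama 4's real architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d = 16, 2, 8                 # toy sizes, not Llama 4's

router_w = rng.normal(size=(d, n_experts))     # routing weights
experts = rng.normal(size=(n_experts, d, d))   # one weight matrix per expert

def moe_layer(x):
    """Route token vector x to its top-k experts; mix outputs by a softmax gate."""
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]          # indices of the chosen experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_layer(rng.normal(size=d))
# Only top_k / n_experts of the expert parameters are touched per token:
print(f"active fraction: {top_k / n_experts:.2%}")
```

Scaling `n_experts` grows total parameters while per-token FLOPs stay pinned to `top_k` — the same lever behind Maverick's 17B-active / 400B-total split.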

Context: Scout supports 10M tokens (the largest open-weight context window at release); Maverick supports 1M. Multimodal in: text, image, audio. Multimodal out: text only.

Distribution: HuggingFace, llama.com, plus cloud-provider hosted endpoints — AWS Bedrock, Azure AI Foundry, Groq, Together, Replicate. The open-weights distribution is what makes Llama 4 the foundation for most downstream fine-tunes; the ecosystem is denser than any other lab's.

The strategic pattern: weights as loss leader. Llama is infrastructure for Meta AI (the consumer chatbot in WhatsApp, Instagram, and Facebook) and Meta AI Studio (the agent platform). Giving away the weights expands the surface area on which Meta sells inference and enterprise integration.

Mistral Large 3

Mistral AI's flagship dense model. Mistral Large 3 (also branded Mistral Medium 3 in some configurations — the naming has been inconsistent) shipped in early 2026, replacing Mistral Large 2. The French lab has held to a hybrid open/closed-weights strategy: smaller models (Mistral 7B, Codestral, Mistral Nemo) ship open-weight; the largest ships closed-weight, API-only.

Capabilities: ~123B parameters, 128K context, strong on European languages — French, German, Spanish, Italian, Portuguese, Dutch. Function calling is competitive with GPT-4o. Pricing is deliberately positioned to undercut OpenAI and Anthropic's flagships at $2 input / $6 output per million tokens.

Le Chat, Mistral's consumer surface, expanded in 2025 with web search, code interpreter, and image generation. More importantly, Mistral has signed sovereign-AI partnerships with several European governments — France, Germany, the Netherlands — positioning itself as the "European AI champion" with regulatory blessing the US labs lack.

What matters strategically: Mistral is the only frontier-tier lab that's not American or Chinese. For European enterprise sales the geopolitical positioning may matter more than raw model quality. EU companies that can't take a dependency on US infrastructure for compliance reasons have one realistic option, and Mistral is it.

DeepSeek V4

DeepSeek's V3 shipped in December 2024 and reset expectations for what open-weights could do at frontier scale. V3 hit GPT-4o-class performance at a fraction of the training cost (claimed $5.6M), released open-weight, and triggered a global rerating of US AI lab valuations within a few weeks. V3.1 followed in mid-2025; V4 shipped in Q1 2026.

Architecture: 671B total parameters, ~37B active, MLA (Multi-head Latent Attention) for memory efficiency. The DeepSeek-R1 reasoning variant — also open-weight — is what made the original splash. R1's January 2025 release of a Claude-3-Sonnet-class reasoning model at zero cost is what triggered the famous one-day NVIDIA stock drop.

What's interesting about DeepSeek beyond the model itself: research velocity. They publish frequently, ship open weights with permissive licenses, and reliably surprise the field. What looks like a single quarterly release is usually paired with an arXiv paper that becomes the citation root for the next year of academic work. PulseAugur's arXiv ingest captures these papers within minutes; cluster pages tie the paper to the broader release coverage.

Qwen 3

Alibaba's Qwen series. Qwen 3 launched in 2025 as a family ranging from Qwen 3 0.6B (edge) up through Qwen 3 235B (flagship), with intermediate sizes at 1.7B, 4B, 8B, 14B, 32B, and 72B parameters. The 32B and 72B variants are the most-downloaded open-weight LLMs on HuggingFace as of early 2026.

Distinctive: best-in-class multilingual coverage — 119 languages with strong performance, particularly Chinese, Japanese, Korean, and the South-Asian language families that other labs underweight. The Qwen-VL multimodal variants are competitive with the closed-weights state of the art on document understanding benchmarks.

License: Apache 2.0 for most sizes — the truly permissive end of the open-weights spectrum — with Qwen-specific terms on the largest models. More permissive than Llama. That license posture is what's driven Qwen's adoption in academic deployments and emerging-market production where Llama's 700M-MAU restriction creates compliance overhead.

Strategic pattern: open weights for ecosystem reach, Alibaba Cloud monetizes the inference. Same shape as Meta's Llama play, but with stronger Chinese-language anchoring. For multilingual deployments outside Western markets Qwen is frequently the practical choice.

Open-weights ecosystem

Beyond the labs above, the open-weights ecosystem has matured into a real second tier. Notable additional lines:

  • Phi-4 (Microsoft Research) — 14B distilled model that punches well above its weight on reasoning benchmarks. Permissive license; production-ready for cost-sensitive workloads.
  • OLMo 2 / OLMo 3 (Allen Institute for AI) — fully open: weights, training data, training code, eval suite. The reproducibility benchmark for the field.
  • Gemma 3 (Google) — open-weights cousin to Gemini, 1B / 4B / 12B / 27B sizes, permissive license.
  • Llama-derived fine-tunes — WizardLM, Nous Hermes, Dolphin — community-fine-tuned variants for specific instruction styles or domains.
  • Qwen-derived fine-tunes — a smaller but growing community of specialty Qwen tunes for non-English and multilingual workloads.

What this ecosystem means for production: there is now a tier of "good enough" open-weights models for most workloads where the closed-frontier 90th-percentile capability isn't required. Self-hosted cost per inferred token at the open-weights tier is 5–10× cheaper than the closed-frontier; cloud providers (Together, Groq, Cerebras, Fireworks, Replicate) compete on hosted-inference economics for these models.

The open-weights ecosystem has also become the substrate of the GPU-inference economy. Cerebras and Groq's inference-optimized chips deliver Llama 4 / Qwen 3 / DeepSeek V4 at 2,000+ tokens per second, several times faster than the closed-frontier models on commodity hardware. That latency advantage compounds across agentic workloads where each step blocks the next.
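Why generation speed compounds across an agent chain is simple arithmetic: each step must finish generating before the next can start. A sketch with invented step counts and per-step overhead:

```python
def agent_wall_time(steps: int, tokens_per_step: int, tok_per_sec: float,
                    overhead_sec: float = 0.5) -> float:
    """Wall-clock seconds for a sequential agent chain: every step blocks
    the next, so per-token generation time multiplies across the chain."""
    return steps * (tokens_per_step / tok_per_sec + overhead_sec)

# 20-step agent generating 800 tokens per step, at two illustrative speeds
for tps in (60, 2000):
    print(f"{tps:>5} tok/s -> {agent_wall_time(20, 800, tps):.1f}s")
```

At 60 tok/s the chain takes minutes; at 2,000 tok/s it finishes in well under a minute, which is the latency advantage the paragraph describes.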

Comparative summary

A working comparison for production deployment decisions, as of May 2026. Choose your model by the workload, not the leaderboard ranking — a specialty model tuned for your task will usually beat a generalist that sits two ranks higher on lmarena.ai.

Model                  Context   Best for
GPT-5 (auto-router)    256K      Generalist API + ChatGPT consumer
GPT-5 thinking         256K      Math, code, multi-step reasoning
Claude Sonnet 4.5      1M        Production agents, long-running workflows, MCP
Claude Opus 4.7        1M        Hardest problems, codebase synthesis, long context
Gemini 3 Pro           2M        Multimodal, audio/video, Workspace integration
Llama 4 Maverick       1M        Self-hosted, fine-tuning, low cost-per-token
Llama 4 Scout          10M       Edge deployment, document-heavy, on-device
DeepSeek V4            128K      Self-hosted frontier, research, reasoning
Qwen 3 235B            128K      Multilingual deployments, Apache license required
Mistral Large 3        128K      EU regulatory contexts, European-language quality

The asymmetry that matters in 2026: open-weights at the Llama-Scout / Qwen / DeepSeek tier is "good enough" for ~70% of workloads where the closed-frontier was previously assumed. Closed-frontier still wins on the hardest reasoning problems, longest agentic chains, and on multimodal tasks involving video. Pick accordingly.

How PulseAugur tracks model releases

Every model release on this list became a cluster on PulseAugur. Each cluster page consolidates the lab's announcement, the technical report, the third-party benchmark threads, the Hacker News reaction, the Reddit discussion, and the developer reactions on Bluesky and Mastodon — ranked by signal across our 200+ source set.

Live feeds: /topic/model-release for the running list of recent releases. /entity/openai, /entity/anthropic, /entity/google-deepmind, /entity/meta-ai, /entity/mistral, and /entity/deepseek for per-lab coverage. New releases appear within minutes of the vendor blog post or arXiv paper; cluster scores update hourly as citations and replication signals arrive.