GPT-5.4
PulseAugur coverage of GPT-5.4 — every cluster mentioning GPT-5.4 across labs, papers, and developer communities, ranked by signal.
- developed by OpenAI 100%
- subsidiary of OpenAI 100%
- instance of large language model 90%
- used by codex 90%
- developed by Microsoft Research 90%
- competes with DeepSeek 80%
- competes with Claude Opus 4.6 70%
- competes with Gemini 3.1 Pro 70%
- authored by arXiv 70%
- competes with Claude Sonnet 4.6 70%
- used by arXiv 70%
- competes with Claude Opus 4.7 70%
- 2026-05-26 research_milestone An evaluation found GPT-5.4 to be the only model that consistently improved code efficiency when prompted.
- Z.ai's GLM-5.1 tops coding benchmark as open-weight model
Z.ai has released GLM-5.1, a 744B parameter Mixture-of-Experts model that achieved a score of 58.4% on the SWE-Bench Pro leaderboard in April 2026. This marks the first open-weight model to surpass leading proprietary m…
- AI agents use executable world models to solve ARC-AGI-3 benchmark
A new research paper introduces an executable world model approach for AI agents tackling the ARC-AGI-3 benchmark. This system uses Python to maintain and verify a world model, refactoring it for simplicity and planning…
- AI generates Traditional Chinese IEPs, outperforming GPT-5.4
Researchers have developed a novel method for automatically generating Individualized Education Programs (IEPs) in Traditional Chinese, addressing a significant gap in special-education NLP. The proposed Corpus-Grounded…
- OpenAI releases Python SDK for Codex agent
OpenAI has released an official Python SDK for its Codex agent, simplifying its integration into Python-based applications. Previously, developers had to rely on shell commands or a TypeScript SDK, which was inconvenien…
- AI agents wreck finance workflows via shared context, not model limits
An analysis of financial automation workflows highlights that using a single, always-on AI agent across personal, rental, and business accounts leads to dangerous "confident nonsense." The core issue is not the AI model…
- AI cost tracking shifts to per-request attribution for better financial oversight
Developers are increasingly focused on tracking the precise cost of AI model usage, moving beyond simple monthly invoices to per-request attribution. This granular approach allows teams to understand which specific feat…
- New GCF format outperforms JSON and TOON in LLM data handling benchmark
A new benchmark reveals that common data formats like JSON and TOON struggle with large language models, failing to maintain accuracy and validity at scale. The study found that JSON breaks down with as few as 500 recor…
- Anthropic ships Claude Opus 4.8, accelerating AI agent migration needs
Anthropic has released Claude Opus 4.8, continuing a rapid release cycle with new versions appearing every 5-7 weeks. This accelerated pace means that production agents relying on fixed model versions will require frequ…
- Promptra offers Russian businesses access to GPT-5.4, GLM 5.1, and DeepSeek V4 Pro APIs
Promptra is offering API access to several advanced LLMs, including OpenAI's GPT-5.4, Z.ai's GLM 5.1, and DeepSeek V4 Pro, with payment in Russian rubles and full documentation for businesses. GPT-5.4 is positioned as a…
- Promptra enables Russian developers to access Anthropic's Claude Sonnet 4.6
A Russian company, Promptra, is offering access to Anthropic's Claude Sonnet 4.6 model, enabling developers in Russia to use the AI with local currency payments and necessary documentation. This solution addresses commo…
- SoftBank integrates AGENTIC STAR; Amazon Bedrock adds OpenAI GPT-5.5
SoftBank is integrating AGENTIC STAR with Box's MCP server to enhance AI capabilities. Separately, Amazon Bedrock has begun offering OpenAI's GPT-5.5 and GPT-5.4 models, along with Codex, to users.
- Estonia benchmark: Claude Opus 4.7 best resists Russian propaganda
Estonia's Language Institute has released a new benchmark called "Propaganda Resistance" to evaluate how well large language models can withstand Russian state-sponsored disinformation. The benchmark tested 14 types of …
- OpenAI models on AWS signal shift in AI distribution strategy
OpenAI's advanced models, including GPT-5.5 and GPT-5.4, are now accessible via AWS Bedrock, marking a significant shift in distribution strategy. This move allows enterprises to integrate these models through their exi…
- Claude Opus 4.7 leads AI debates, influencing other models
Claude Opus 4.7 has demonstrated the highest influence in AI debates, successfully persuading other models to change their stance nearly 3,000 times. This finding comes from an analysis of 30,000 AI Roundtable sessions,…
- New benchmark measures LLM manipulative behavior in dialogues
Researchers have developed CogManip, a new benchmark designed to evaluate the manipulative behaviors of large language models in multi-turn conversations. The benchmark assesses 15 distinct manipulation strategies acros…
- Hugging Face expands voice agent benchmark to 3 domains, 121 tools
Hugging Face has released EVA-Bench Data 2.0, an expanded benchmark for evaluating voice agents. This new version broadens its scope to three enterprise domains: Airline Customer Service Management, Enterprise IT Servic…
- Ideogram 4.0 leads open image model releases; Microsoft details MAI-Thinking-1
Ideogram has released version 4.0 of its open-source image generation model, which is now considered the best available in its category. This release, alongside Reve's advancements, highlights significant progress in AI…
- New KINA benchmark ranks Gemini 3.1 Pro highest, surpassing Claude and GPT-5
A new benchmark called KINA has been introduced to evaluate large language models across 261 fine-grained disciplines, addressing issues of scaling-driven design and annotation quality. The benchmark, comprising 899 ite…
- GPT-5.4 over-edits code, costing 6.5x more than Claude Opus
A new analysis reveals that GPT-5.4 exhibits a significant over-editing tendency, producing outputs that are functionally correct but structurally diverge from the original code far more than necessary. This behavior re…
- New DeskCraft benchmark tests AI agents on complex professional tasks
Researchers have introduced DeskCraft, a new benchmark designed to evaluate desktop agents on complex, long-horizon professional tasks and human-in-the-loop collaboration. This benchmark includes tasks in creative and e…