GPT-5.4
PulseAugur coverage of GPT-5.4 — every cluster mentioning GPT-5.4 across labs, papers, and developer communities, ranked by signal.
- subsidiary of OpenAI 100%
- developed by OpenAI 100%
- instance of large-language models 90%
- used by Codex 90%
- developed by Microsoft Research 90%
- competes with DeepSeek 80%
- competes with Claude Opus 4.6 70%
- competes with Gemini 3.1 Pro 70%
- competes with Claude Sonnet 4.6 70%
- authored by arXiv 70%
- used by arXiv 70%
- competes with Claude Opus 4.7 70%
- 2026-05-26 research_milestone: An evaluation found GPT-5.4 to be the only model that consistently improved code efficiency when prompted.
25 days with sentiment data
- Alibaba's Qwen 3.6 open-weight model rivals frontier AI on coding tasks
Alibaba's Qwen 3.6 model family, particularly the 27B dense variant, has demonstrated performance competitive with leading frontier models like GPT-5.4 and Claude 4.6 on coding tasks. This open-weight model, runnable on…
- AI models fail to reliably forecast scientific progress, study finds
A new benchmark called CUSP has been developed to evaluate AI's ability to forecast scientific progress. The study found that current frontier AI models struggle with predicting the realization and timing of scientific …
- Microsoft Security Copilot uses AI agent for autonomous threat detection
Microsoft has developed a Dynamic Threat Detection Agent (DTDA) integrated into its Security Copilot, designed to autonomously investigate security incidents and generate novel alerts. This agent utilizes a unified acti…
- New attack method enhances adversarial transferability in MLLMs
Researchers have developed FRA-Attack, a novel method to improve the transferability of adversarial attacks against multimodal large language models (MLLMs). This technique utilizes frequency-domain regularization to al…
- Developer finds Claude Code Extension optimal for AI-assisted coding
A software developer details their journey to find the optimal AI coding assistant, ultimately settling on VS Code with the Claude Code Extension and a MAX plan. They found that while tools like GitHub Copilot and Curso…
- LLMs struggle to simulate real human behavior, new research shows
Two new research papers explore the limitations of current large language models in simulating realistic human behavior. The first paper, "OmniBehavior," introduces a benchmark using real-world data and finds that LLMs …
- Databricks launches beta Unity AI Gateway Guardrails for AI security
Databricks has launched a beta version of its Unity AI Gateway Guardrails, designed to enhance the security and compliance of AI applications. These guardrails help prevent sensitive data leakage, protect against malici…
- LLMs generate gendered behaviors, impacting trust calibration in agents
Researchers have developed a method to generate multimodal behaviors for socially interactive agents, aiming to calibrate user trust based on an agent's capabilities and benevolence. The study utilized GPT-5.4 to produc…
- Alibaba Qwen 3.7 previews top Chinese models in text and vision benchmarks
Alibaba's Qwen team has released preview versions of its Qwen 3.7 Max and Qwen 3.7 Plus models, showcasing rapid iteration cycles. The Qwen 3.7 Max model has achieved top rankings among Chinese models in text-based benc…
- AI agents struggle with research rigor despite generating papers
A new study published on arXiv introduces ResearchArena, a framework designed to evaluate the capabilities of AI agents in conducting research autonomously. The system allowed agents like Claude Code, Codex, and Kimi Co…
- Cursor launches Composer 2.5 AI coding assistant with enhanced intelligence
Cursor has released Composer 2.5, an updated AI coding assistant that offers improved intelligence and reliability for long-running tasks. This new version is built upon Moonshot AI's Kimi K2.5 architecture and incorpor…
- AI systems take top spots in EgoVis 2026 challenges
Two research teams have presented technical reports for challenges at the EgoVis 2026 conference. One team, JFAA, secured first place in the EPIC-KITCHENS-100 Action Anticipation Challenge using a JEPA-based method for …
- DeepSeek V4 launches with 1.6T MoE, 1M context, and lower costs
DeepSeek V4, an open-weight model family, has been released with a 1.6-trillion-parameter Mixture-of-Experts architecture that activates only 49 billion parameters per token. This new model boasts a 1-million-token cont…
- Open-weight AI models cost developers fraction of traditional inference
A developer detailed their experience using open-weight AI models for a coding project, incurring a cost of only $5 for over 400 million tokens via a subscription service. This contrasts sharply with the estimated $138.…
- New benchmark tests AI agents on complex, iterative engineering tasks
A new benchmark, Frontier-Eng Bench, has been released to evaluate AI agents on complex engineering tasks that lack standardized answers. This benchmark moves beyond simple problem-solving by requiring agents to propose…
- New benchmark CUActSpot targets complex interactions for AI agents
Researchers have introduced CUActSpot, a new benchmark designed to evaluate computer-use agents (CUAs) on complex and infrequent interactions across multiple modalities. The benchmark addresses the long-tail issue in GU…
- No single AI model leads all benchmarks, report finds
A new report indicates that no single AI model consistently leads across all benchmarks, with different models excelling in specific areas like coding or math. The evaluation process itself is also complex, as multiple …
- AI models fail to detect danger in long transcripts
A new paper reveals that leading AI models like Opus 4.6, GPT 5.4, and Gemini 3.1 exhibit significant performance degradation when classifying long transcripts, a crucial task for monitoring coding agents. These models …
- LLMs evaluated for air traffic safety analysis
Researchers are exploring the use of large language models (LLMs) for enhancing safety in air traffic control (ATC) and around non-towered airports. One study proposes a vision-language model approach to analyze radio c…
- Microsoft Research: LLMs corrupt 25% of documents in delegated tasks
A new benchmark, DELEGATE-52, developed by Microsoft Research, reveals that current large language models significantly corrupt documents during delegated workflows. Even advanced models like Gemini 3.1 Pro, Claude 4.6 …