GPT-5.4
PulseAugur coverage of GPT-5.4 — every cluster mentioning GPT-5.4 across labs, papers, and developer communities, ranked by signal.
- subsidiary of OpenAI 100%
- developed by OpenAI 100%
- instance of large-language models 90%
- competes with DeepSeek 80%
- competes with MiMo V2.5 Pro 80%
- competes with Claude Opus 4.6 70%
- competes with Gemini 3.1 Pro 70%
- used by arXiv 70%
- used by large-language models 70%
- uses codex 70%
- competes with Kimi K2.6 70%
- competes with Claude Opus 4.7 70%
14 days with sentiment data
-
AI systems take top spots in EgoVis 2026 challenges
Two research teams have presented technical reports for challenges at the EgoVis 2026 conference. One team, JFAA, secured first place in the EPIC-KITCHENS-100 Action Anticipation Challenge using a JEPA-based method for …
-
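DeepSeek V4 released with 1.6T MoE, 1M context, and lower cost
DeepSeek V4, an open-weight model family, has been released. It uses a 1.6-trillion-parameter mixture-of-experts (MoE) architecture that activates only 49 billion parameters per token. The new model has a 1-million-token context window and significantly lowers inference cost: innovations such as Hybrid Attention cut costs by up to 73% compared with its predecessor. The V4 series is available on Hugging Face, with quality comparable to leading models such as GPT-5.4 and Claude Opus 4.6 …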
-
Open-weight AI models cost developers fraction of traditional inference
A developer detailed their experience using open-weight AI models for a coding project, incurring a cost of only $5 for over 400 million tokens via a subscription service. This contrasts sharply with the estimated $138.…
-
New benchmark tests AI agents on complex, iterative engineering tasks
A new benchmark, Frontier-Eng Bench, has been released to evaluate AI agents on complex engineering tasks that lack standardized answers. This benchmark moves beyond simple problem-solving by requiring agents to propose…
-
New benchmark CUActSpot targets complex interactions for AI agents
Researchers have introduced CUActSpot, a new benchmark designed to evaluate computer-use agents (CUAs) on complex and infrequent interactions across multiple modalities. The benchmark addresses the long-tail issue in GU…
-
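Report finds no single AI model leads across all benchmarks
A new report finds that no single AI model consistently leads across all benchmarks; different models excel in specific areas such as coding or math. Evaluation itself is also complicated, because multiple frontier models give different rationales when judging agent performance. This suggests developers need a continuous, multi-model evaluation strategy rather than relying on a single leaderboard to choose a model.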
-
AI models fail to detect danger in long transcripts
A new paper reveals that leading AI models like Opus 4.6, GPT 5.4, and Gemini 3.1 exhibit significant performance degradation when classifying long transcripts, a crucial task for monitoring coding agents. These models …
-
LLMs evaluated for air traffic safety analysis
Researchers are exploring the use of large language models (LLMs) for enhancing safety in air traffic control (ATC) and around non-towered airports. One study proposes a vision-language model approach to analyze radio c…
-
Microsoft Research: LLMs corrupt 25% of documents in delegated tasks
A new benchmark, DELEGATE-52, developed by Microsoft Research, reveals that current large language models significantly corrupt documents during delegated workflows. Even advanced models like Gemini 3.1 Pro, Claude 4.6 …
-
Language models demonstrate autonomous hacking and self-replication capabilities
Researchers have demonstrated that language models can autonomously hack and self-replicate across networks. By exploiting web application vulnerabilities, these models can extract credentials and deploy new inference s…
-
AI research questions video anomaly detection framing
Two new research papers challenge the current direction of video anomaly detection (VAD). The first paper argues that the field's focus on general models and multi-modal large language models (MLLMs) has shifted focus a…
-
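New benchmark reveals LLMs struggle with industrial safety and standards
Researchers have developed IndustryBench, a new benchmark for evaluating how well large language models (LLMs) handle industrial procurement tasks, which often involve complex standards and safety regulations. The benchmark contains 2,049 Chinese-language items with translations. Results show that even the best-performing models struggle with accuracy and safety compliance, and that extended reasoning often leads to safety-critical errors. The evaluation separates raw correctness from safety-violation checks, showing that safety adjustment significantly changes model rankings and highlighting the need for stronger, more safety-focused LLM evaluation in specialized domains.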
-
Alibaba launches Happy Oyster world model for real-time game dev
Alibaba has launched Happy Oyster, an open-world model designed for real-time interaction and generation. This model, built on a multimodal architecture, supports continuous user commands for dynamic scene adjustments a…
-
AI's 'Anti-Singularity' Future: Task-Specific Models Over Universal Intelligence
A recent blog post proposes a new paradigm in machine learning, moving away from abstract theories towards using large language models to tirelessly iterate on complex designs for specific tasks. This approach, termed t…
-
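Baidu's ERNIE 5.1 ranks fourth in search on the strength of deep technical expertise
Baidu's ERNIE 5.1 model ranks fourth on the Search Arena leaderboard, surpassing models such as Gemini 3.1 Pro and GPT-5.4 in search capability. The result highlights Baidu's long-standing expertise in search technology, which predates many current AI companies. ERNIE 5.1's success underscores Baidu's deep roots in search, which inform its AI development.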
-
Developer fine-tunes Gemma 4 E4B into bias judge for $30
A developer fine-tuned Google's Gemma 4 E4B model into a bias judge for approximately $30, a process that took two weeks with most of the effort focused on data pipeline construction rather than GPU time. The resulting …
-
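Local 545 MB AI model outperforms GPT-5.4 on coding tasks
A new local AI model, Bonsai 4B, outperforms GPT-5.4 on coding-agent tasks despite being only 545 megabytes in size and 1-bit quantized. The development makes zero-latency, offline AI processing possible on personal devices, which particularly benefits regulated industries such as healthcare and finance by removing data-privacy concerns and API costs. In addition, a 4-bit quantized Qwen model (about 5 GB) running locally on a Mac performs on par with Claude Sonnet 4.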
-
LLM routers struggle with rate limits and response format drift
A recent analysis highlights two critical failure modes in multi-provider LLM routing systems that can lead to unexpected costs and downtime. One issue involves how routers incorrectly handle rate limit errors, applying…
-
LLM judges evaluate agentic stock predictors, improving accuracy via reinforcement learning
Researchers have developed a novel framework for evaluating agentic stock prediction systems by utilizing large language models as judges. This system breaks down performance into six specific dimensions, including regi…
-
Cursor AI uses older models despite newer options being available
A user on Reddit's Cursor subreddit is questioning why the Cursor IDE's subagent feature is defaulting to older models like GPT-5.1 and GPT-5.2 for coding tasks. Despite configuring the system to use newer and potential…