GPT-5.4

PulseAugur coverage of GPT-5.4 — every cluster mentioning GPT-5.4 across labs, papers, and developer communities, ranked by signal.

Total · 30 days

71

71 in the last 90 days

Releases · 30 days

1

1 in the last 90 days

Papers · 30 days

43

43 in the last 90 days

Tier distribution · 90 days

frontier release 2
significant 5
research 23
tool 37
commentary 4

Relationships

Sentiment · 30 days

Sentiment data available for 14 days

Recent · Page 3 of 4 · 71 items total

COMMENTARY · CL_37155 · May 7 · 18:27

AI developers face rate limits, latency; routing is key

Developers are encountering significant challenges with API rate limits and latency when using AI models, particularly from Anthropic. These issues often stem from architectural choices that rely on a single provider fo…
RESEARCH · CL_22056 · May 7 · 13:59

New method corrects Simpson's Paradox to improve AI text detection

Researchers have identified a significant issue in detecting machine-generated text, stemming from a phenomenon akin to Simpson's Paradox. Current methods average token scores, which masks a non-uniform signal across th…
TOOL · CL_20502 · May 7 · 04:00

Adversarial examples trick VLMs into laundering AI authority, spreading misinformation

Researchers have demonstrated a new vulnerability in vision-language models (VLMs) called "AI authority laundering." This attack involves subtly altering images so that VLMs confidently provide authoritative responses a…
TOOL · CL_20391 · May 7 · 04:00

AsymmetryZero framework operationalizes human preferences for AI evaluation

Researchers have introduced AsymmetryZero, a framework designed to translate human expert preferences into measurable semantic evaluations for AI models. This system aims to address the difficulty of encoding subjective…
SIGNIFICANT · CL_19920 · May 6 · 19:39

Z.AI's GLM 5.1 model leads in long-horizon agentic tasks, outperforming rivals

Z.AI has released its GLM 5.1 model, an open-source option designed for long-horizon agentic tasks capable of running autonomously for up to 8 hours. This model reportedly outperforms GPT-5.4, Claude Opus 4.6, and Gemin…
RESEARCH · CL_20622 · May 6 · 17:42

New MRI-Eval benchmark reveals LLMs struggle with GE scanner operations

Researchers have developed MRI-Eval, a new benchmark designed to assess large language models' understanding of MRI physics and GE scanner operations. The benchmark, comprising 1365 questions across three difficulty tie…
TOOL · CL_15946 · May 5 · 04:00

New dataset and benchmark advance Bangla text-to-gloss translation for BdSL

Researchers have developed the first dataset and benchmark for Bangla text-to-gloss translation, addressing a significant gap for the Bangla Sign Language (BdSL) community. The dataset includes manually annotated and sy…
TOOL · CL_13262 · May 2 · 19:49

Fabrica launches as a terminal-based coding agent supporting multiple AI models

Fabrica is a new terminal-based coding agent harness developed in Rust. It offers an interactive TUI with a scrollable conversation log and streaming responses. The tool supports multiple AI providers, including Google …
RESEARCH · CL_12039 · May 1 · 09:29

Google DeepMind's AI Co-Clinician beats GPT-5.4 in medical tests, aids doctors

Google DeepMind has developed an AI co-clinician designed to assist physicians with diagnostics and patient care, aiming to reduce errors and improve efficiency. In blind evaluations, this AI demonstrated superior perfo…
RESEARCH · CL_11817 · May 1 · 04:00

GPT-5.4 leads LLMs in new EU digital battery passport conformance task

Researchers have introduced BatteryPass-12K, the first dataset designed for classifying digital battery passport conformance, in anticipation of the EU's upcoming battery regulation. They evaluated 22 language models, f…
RESEARCH · CL_11687 · May 1 · 04:00

AI agent swarms may fail due to 'Inverse-Wisdom Law,' study finds

A new paper introduces the Inverse-Wisdom Law, challenging the assumption that AI agent swarms benefit from the "Wisdom of the Crowd." The research demonstrates that these swarms can prioritize internal architectural ag…
RESEARCH · CL_11488 · Apr 30 · 15:01

New VeriGround model achieves reliable circuit-to-Verilog code generation

Researchers have identified a significant reliability issue in multimodal large language models (MLLMs) when generating hardware description language (HDL) code from circuit diagrams. This "Mirage" phenomenon occurs whe…
TOOL · CL_09121 · Apr 29 · 13:47

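Lingo.dev releases v1.0 with an AI-powered localization engine

Lingo.dev has launched version 1.0 of its localization platform, introducing retrieval-augmented localization (RAL). This approach injects glossary context and brand voice rules into LLM requests to improve translation accuracy and prevent terminology drift. The platform supports a range of LLM providers, integrates via CLI, CI/CD, and API, and provides detailed logging for quality assurance.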
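
The retrieval-augmented localization idea in this item (injecting glossary context and brand voice rules into an LLM request) can be illustrated with a minimal sketch. This is not Lingo.dev's implementation or API: `LocalizationContext`, `retrieve_glossary`, and `build_prompt` are hypothetical names, and the substring-based retrieval is an assumption chosen only to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class LocalizationContext:
    """Hypothetical container for glossary entries and brand voice rules."""
    glossary: dict[str, str]  # source term -> required target term
    voice_rules: list[str] = field(default_factory=list)


def retrieve_glossary(source_text: str, ctx: LocalizationContext) -> dict[str, str]:
    """Return only the glossary entries whose source term appears in the text.

    Assumption: a case-insensitive substring match stands in for retrieval.
    """
    lowered = source_text.lower()
    return {
        term: target
        for term, target in ctx.glossary.items()
        if term.lower() in lowered
    }


def build_prompt(source_text: str, target_locale: str, ctx: LocalizationContext) -> str:
    """Assemble an LLM prompt with the retrieved terms and voice rules injected."""
    terms = retrieve_glossary(source_text, ctx)
    lines = [f"Translate the text below into {target_locale}."]
    if terms:
        lines.append("Use these glossary translations exactly:")
        lines.extend(f"- {src} -> {dst}" for src, dst in terms.items())
    if ctx.voice_rules:
        lines.append("Follow these brand voice rules:")
        lines.extend(f"- {rule}" for rule in ctx.voice_rules)
    lines.append("Text:")
    lines.append(source_text)
    return "\n".join(lines)
```

Only glossary entries relevant to the source string are injected, which keeps the prompt short while still pinning the terminology the translation must use. The resulting prompt string would then be sent to whichever LLM provider is configured.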
SIGNIFICANT · CL_08510 · Apr 29 · 04:10

AWS launches Amazon Quick, integrates OpenAI models into Bedrock

Amazon Web Services has launched Amazon Quick, an AI agent designed to integrate with local files, emails, and applications to streamline workflows. The company also announced a deeper partnership with OpenAI, bringing …
FRONTIER RELEASE · CL_08402 · Apr 29 · 00:52

Xiaomi open-sources MiMo-V2.5 AI models, showcasing macOS simulation and high token efficiency

Xiaomi has officially open-sourced its MiMo-V2.5 series of AI models, including the flagship MiMo-V2.5 Pro agent model. These models demonstrate strong performance, rivaling top closed-source models like Claude Opus 4.6…
FRONTIER RELEASE · CL_07657 · Apr 28 · 12:16

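Xiaomi's open-source MiMo-v2.5-Pro model rivals top AI coding assistants

Xiaomi has released MiMo-v2.5-Pro, an open-source language model focused on coding that shows impressive capability on complex tasks. The model completed a university-level compiler project within a few hours, built a fully functional video editor application from a vague prompt, and solved analog circuit design problems. MiMo-v2.5-Pro performs strongly on coding benchmarks, rivaling top closed-source models such as GPT-5.4 and Claude Opus 4.6, and is now available on HuggingFace.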
RESEARCH · CL_06722 · Apr 28 · 04:00

Frontier LLMs like GPT-5.4 and Claude Opus 4.7 show significant verbal tics

A new paper analyzes the prevalence of verbal tics, such as repetitive phrases and sycophantic openers, in eight leading large language models. Researchers developed a Verbal Tic Index (VTI) to quantify these tics, find…
RESEARCH · CL_08361 · Apr 27 · 23:48

Claude Opus 4.7 leads frontier agents in AI research acceleration benchmark

A new research paper proposes a benchmark to assess AI's ability to autonomously implement machine learning pipelines, aiming to detect early signs of recursive self-improvement. Frontier coding agents were tasked with …
RESEARCH · CL_04389 · Apr 26 · 20:01

GPT-5.4 and Claude Opus 4.6 fail banking benchmark, scoring 0% client-ready outputs

A new benchmark called BankerToolBench has revealed significant shortcomings in current large language models when applied to financial tasks. GPT-5.4, Claude Opus 4.6, and other models were tested on simulated junior i…
FRONTIER RELEASE · CL_03105 · Apr 25 · 05:00

DeepSeek releases V4 Pro and Flash models with 1M context, runs on Huawei chips

DeepSeek has released its new V4 family of models, including V4 Pro and V4 Flash, which boast a 1 million token context window. These models were trained on 32 trillion tokens and feature a novel hybrid attention system…