PulseAugur / Brief

last 24h
[25/25] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. GPT-5.5 tops the benchmarks but sits at #22 for actual usage - I built a live index that tracks both (open source)

    A new open-source index called AgentTape ranks AI models on a blend of benchmark performance, actual usage, cost, and speed. OpenAI's GPT-5 models currently dominate the top rankings; GPT-5.5 leads the quality benchmarks but sits at #22 for usage because it is new and expensive. The index aims to give a more holistic view of model performance than benchmarks alone, one that reflects real-world utility.

    IMPACT Provides a new metric for evaluating AI models that balances benchmarks with real-world adoption and cost.
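
    As a rough illustration of how a blended index of this kind can reorder models, here is a minimal sketch. The weights, field names, model names, and numbers are all invented for the example; the summary above does not say how AgentTape actually combines its signals.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    benchmark: float  # normalized to 0..1, higher is better
    usage: float      # normalized to 0..1, higher means more adoption
    cost: float       # normalized to 0..1, higher means cheaper
    speed: float      # normalized to 0..1, higher means faster


# Hypothetical weights (assumption): they sum to 1.0 and are NOT AgentTape's.
WEIGHTS = {"benchmark": 0.4, "usage": 0.3, "cost": 0.15, "speed": 0.15}


def blended_score(model: ModelStats) -> float:
    """Weighted sum of the four normalized signals."""
    return sum(weight * getattr(model, field) for field, weight in WEIGHTS.items())


def rank(models: list[ModelStats]) -> list[ModelStats]:
    """Sort models from highest to lowest blended score."""
    return sorted(models, key=blended_score, reverse=True)


# Made-up data: a benchmark leader with low adoption and a high price,
# against a slightly weaker model that is cheap, fast, and widely used.
models = [
    ModelStats("benchmark_leader", benchmark=1.0, usage=0.1, cost=0.2, speed=0.5),
    ModelStats("workhorse", benchmark=0.8, usage=0.9, cost=0.8, speed=0.8),
]

for m in rank(models):
    print(f"{m.name}: {blended_score(m):.3f}")
```

    With these invented inputs the benchmark leader ranks second, which is the kind of gap between benchmark rank and overall rank that the item describes.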

  2. [AINews] OpenAI GPT-next disproves 80 year old Erdős planar unit distance problem for under $1000

    OpenAI has announced that an internal model, speculated to be a version of GPT-5, has disproven an 80-year-old mathematical conjecture known as the Erdős planar unit distance problem. This general-purpose reasoning model achieved the result for under $1000, a feat that mathematicians are hailing as a significant milestone for AI in scientific discovery. The model's extensive output suggests that advanced reasoning capabilities are emerging in LLMs, potentially extending beyond mathematics to other scientific fields.

    IMPACT Demonstrates advanced reasoning capabilities in LLMs, potentially accelerating scientific discovery across various fields.

  3. Fei-Fei Li strikes again, ImageNet for spatial intelligence is here

    A new benchmark called ESI-Bench has been released by Fei-Fei Li's team to evaluate embodied spatial intelligence in AI. Unlike previous benchmarks that assumed optimal observation, ESI-Bench requires AI agents to actively take actions to gather information, closing the perception-action loop. Initial tests with leading models like GPT-5 and Gemini revealed that current AI struggles with active exploration and decision-making, exhibiting "action blindness" and metacognitive deficits, indicating that the primary challenge lies in strategic action rather than pure perception.

    IMPACT Sets a new standard for embodied AI evaluation, highlighting action and metacognition as key challenges.

  4. Claude Sonnet 4.6 vs GPT-4.1 vs Gemini 2.5 Flash: which wins JSON extraction?

    A recent benchmark evaluated six large language models on their ability to extract structured JSON data from customer support emails. The analysis found that Anthropic's Claude Haiku 4.5 offered the best value, achieving high accuracy at a significantly lower cost than more powerful models. Gemini 2.5 Flash was fast and inexpensive but less accurate, and in particular it hallucinated data. The study suggests using Haiku for most extraction tasks, Sonnet for more complex reasoning, and avoiding more expensive frontier models for simple data extraction.

    IMPACT Identifies the most cost-effective LLM for structured data extraction, guiding developers on model selection for production features.
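
    For readers running a similar comparison, the sketch below shows one simple way to score an extraction: parse the model output, check required fields, and flag values that do not appear verbatim in the source email as possible hallucinations. The field names and the sample email are hypothetical, and the benchmark's own scoring method is not described in the summary above.

```python
import json

# Hypothetical schema (assumption): the benchmark's real fields are not given.
REQUIRED_FIELDS = {"customer_email": str, "order_id": str, "issue_type": str}

# Fields whose values should be copied verbatim from the email text.
VERBATIM_FIELDS = ("customer_email", "order_id")


def validate_extraction(raw_output: str, source_email: str) -> list[str]:
    """Return a list of problems; an empty list means the extraction passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for field: {field}")

    # Crude grounding check: a verbatim field that is absent from the source
    # text was probably invented by the model.
    for field in VERBATIM_FIELDS:
        value = data.get(field)
        if isinstance(value, str) and value not in source_email:
            problems.append(f"possibly hallucinated value for: {field}")
    return problems


email = "Hi, my order A-1042 arrived broken. Reach me at jo@example.com."
good = '{"customer_email": "jo@example.com", "order_id": "A-1042", "issue_type": "damaged_item"}'
bad = '{"customer_email": "jo@example.com", "order_id": "B-9999", "issue_type": "damaged_item"}'

print(validate_extraction(good, email))
print(validate_extraction(bad, email))
```

    The substring check is deliberately strict and only suits fields that are copied from the text; categorical fields such as the hypothetical `issue_type` need a labeled reference answer instead.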

  5. The latest ARFBench benchmark proves that in diagnosing system failures, engineers still crush GPT-5.5 and Gemini. The reality of production systems is brutal

    A new benchmark called ARFBench reveals that human engineers still significantly outperform AI models like GPT-5 and Gemini in diagnosing system failures. The results challenge the marketing claims of AI's full autonomy in production environments, highlighting the current limitations of AI in complex troubleshooting tasks.

    IMPACT Highlights current AI limitations in complex diagnostic tasks, suggesting human expertise remains critical for system failure analysis.

  6. Optuna Tutorial: Automate Hyperparameter Tuning for ML Models in Python. How Optuna's define-by-run API, TPE sampler, and pruners automate hyperparameter tuning

    Several recent posts explore advancements and applications in AI agents, particularly for coding and reasoning tasks. Topics include building autonomous coding agents that can open GitHub pull requests, using patterns like Continual Harness for self-improving agents, and integrating tools like Cursor into agent workflows. The limitations of LLM reasoning in causal inference and new approaches to browser fingerprinting for web scraping are also discussed, alongside efforts to automate hyperparameter tuning for machine learning models.

    IMPACT Explores practical applications and limitations of AI agents in coding, reasoning, and web scraping, offering insights for developers.

  7. DeepSeek V4 on Huawei's Ascend 950: A Real Stress Test for China's AI Chip Ecosystem

    DeepSeek's V4 model has been validated for inference on Huawei's Ascend 950 chip, marking a significant step for China's domestic AI hardware. This validation required substantial engineering effort, including rewriting numerous CUDA operators and extensive testing, to achieve performance parity with NVIDIA's offerings for inference workloads within China. The Ascend 950 features a unique dual-architecture design with high-bandwidth memory to address both compute-bound and memory-bound phases of LLM operations, though its widespread adoption is hindered by manufacturing capacity limitations.

    IMPACT Validates domestic AI hardware for inference, potentially reducing reliance on foreign suppliers within China.

  8. Ricoh develops a high-performance Japanese large language model equivalent to GPT-5 with enhanced inference performance | Ricoh Co., Ltd. https://www.yayafa.com/2804982/

    Ricoh has developed a new Japanese large language model that matches GPT-5's performance, particularly in reasoning capabilities. This advanced model is designed to enhance AI applications and services. Separately, Needswell has introduced a new introductory training program for Microsoft 365 Copilot.

    IMPACT Ricoh's new Japanese LLM could advance AI capabilities in the region, while Needswell's training program aims to boost adoption of Microsoft's AI assistant.

  9. Reasoning Effort: Low, Medium, High: When Each Setting Actually Pays Off

    The `reasoning_effort` setting in LLMs like OpenAI's GPT-5 and Anthropic's models controls the amount of internal chain-of-thought processing before an answer is generated. Higher settings can improve performance on complex tasks like multi-step math or code generation with verification, but they significantly increase costs, potentially by 6-8x compared to lower settings. This increase often goes unnoticed during initial testing if the evaluation set consists mainly of simpler prompts, leading to unexpected budget overruns in production.

    IMPACT Explains how LLM configuration choices directly impact operational costs and performance trade-offs for AI applications.
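
    To make the cost effect above concrete, here is a rough sketch of per-request cost at two effort levels. It assumes reasoning tokens are billed at the output-token rate; the prices and token counts are invented for illustration and come from neither the article nor any provider's price list.

```python
def request_cost(
    input_tokens: int,
    visible_output_tokens: int,
    reasoning_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one request in dollars.

    Assumption: hidden reasoning tokens are billed like output tokens.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


# Hypothetical prices, in dollars per million tokens.
INPUT_PRICE = 1.0
OUTPUT_PRICE = 10.0

# Same prompt and same visible answer; only the hidden reasoning budget differs.
low = request_cost(1_000, 500, 200, INPUT_PRICE, OUTPUT_PRICE)
high = request_cost(1_000, 500, 5_000, INPUT_PRICE, OUTPUT_PRICE)
ratio = high / low

print(f"low effort:  ${low:.4f} per request")
print(f"high effort: ${high:.4f} per request")
print(f"ratio: {ratio:.1f}x")
```

    With these made-up numbers the high setting costs about 7x the low one, which falls inside the 6-8x range quoted above. An evaluation set of short, easy prompts would rarely trigger that much reasoning, which is how the gap can stay hidden until production traffic arrives.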

  10. The Comeback of M365 Copilot, Called 'Junk', with GPT-5 as a Turning Point: The Secret to Utilization is 'Escaping Prompt Craftsmen' - ITmedia AI+

    Microsoft has introduced four new AI-related certifications to address the growing demand for AI professionals. Separately, there are reports that Elon Musk's xAI may have failed to pay a $420 fee for tax data. Additionally, Microsoft's M365 Copilot, initially criticized, is reportedly seeing a turnaround, with GPT-5 cited as a potential catalyst for improved performance and a shift away from prompt engineering expertise.

    IMPACT New certifications may help address AI talent shortages, while xAI's payment issue highlights operational challenges in the AI sector.

  11. Code-Driven Visual Perception: Why "Understanding Code" is the Real Key for Large Models to Conquer STEM Problems | CVPR 2026

    Researchers from Shanghai Jiao Tong University and the Qwen team have introduced CodePercept, a novel approach to enhance large language models' visual perception capabilities, particularly for STEM tasks. Their research suggests that improving visual perception, rather than just reasoning, is the key bottleneck for models tackling science and math problems. CodePercept leverages code as a precise language for visual understanding, enabling models to generate executable code that accurately represents image content, thereby overcoming the inherent ambiguity of natural language descriptions.

    IMPACT This approach could significantly improve LLMs' ability to understand and solve complex STEM problems by enhancing their visual perception through precise code-based representations.

  12. Ranked AI models by what people actually use instead of benchmark scores - the benchmark champion barely makes the top 20

    A new ranking system based on actual user adoption and discussion, rather than solely benchmark scores, reveals a significant divergence in AI model popularity. GPT-5 emerges as the top-ranked model by usage, despite newer versions like GPT-5.5 and Gemini 3.1 Pro scoring higher on benchmarks. The data suggests that factors like cost, speed, and availability heavily influence user choices, often leading them to opt for less powerful but more accessible models like Google's Flash Lite over top-tier benchmark performers.

    IMPACT Highlights the disconnect between benchmark performance and real-world AI model adoption, emphasizing cost and speed as key user drivers.

  13. Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

    Researchers have developed a method for language models to predict the success of scientific research ideas before experimentation. By training models on a dataset of comparative idea evaluations, they achieved significant accuracy in forecasting empirical outcomes. This approach, particularly when framed as a reasoning task using Reinforcement Learning with Verifiable Rewards, allows even smaller, compute-efficient models to act as objective verifiers, potentially accelerating autonomous scientific discovery.

    IMPACT Enables efficient filtering of AI-generated research ideas, accelerating scientific discovery.

  14. DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline

    Researchers have developed DrugRAG, a novel retrieval-augmented generation pipeline designed to enhance the performance of large language models (LLMs) on pharmacy-related question-answering tasks. In their study, they evaluated ten LLMs, finding that GPT-5 and o3 performed best on a 141-question dataset. DrugRAG, which integrates structured drug information without altering model architecture, improved accuracy across several models by up to 21 percentage points, with the largest gains on smaller open-source models.

    IMPACT Provides a practical method to enhance LLM accuracy for specialized knowledge domains like pharmacy.

  15. Dissecting Embodied Abilities in Multimodal Language Models through Skill-level Evaluation and Diagnosis

    Researchers have introduced BEAR, a new benchmark designed to evaluate and diagnose the skill-level capabilities of embodied multimodal large language models (MLLMs). This benchmark decomposes embodied tasks into 14 distinct atomic skills, providing more granular insights into model failures than previous task-level evaluations. Evaluations on BEAR revealed that perceptual limitations and unstable spatiotemporal modeling are significant bottlenecks for current MLLMs. To address these issues, the team developed BEAR-Agent, a conversational agent that enhances MLLMs with visual and spatial reasoning tools, demonstrating substantial performance improvements on the benchmark and in robotic experiments.

    IMPACT Identifies key weaknesses in embodied AI, guiding future research towards improved perception and spatiotemporal reasoning for robotic agents.

  16. Evaluating Commercial AI Chatbots as News Intermediaries

    A new study evaluated six major AI chatbots on their ability to accurately report emerging news facts. While top models achieved over 90% accuracy on multiple-choice questions, their performance dropped significantly in free-response formats and particularly on questions with false premises. The research also highlighted a notable accuracy disparity across languages: Hindi queries yielded lower accuracy, which points to a bias towards English-language sources.

    IMPACT Highlights critical limitations in AI news intermediaries, including regional bias and vulnerability to misinformation, impacting reliable information dissemination.

  17. Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents

    Researchers have introduced a new framework called Think Thrice Before You Speak (TTBYS) to enhance the Theory of Mind (ToM) capabilities in large language models for persuasive dialogue. This framework addresses limitations in current models by explicitly modeling the sequential dependencies among mental states like beliefs and desires, using the Belief-Desire-Intention (BDI) framework. To support this, they also created a large dataset, ToM-based Broad Persuasive Dialogues (ToM-BPD), and demonstrated that a Qwen3-8B model augmented with TTBYS outperformed GPT-5 on predicting mental states and persuasive strategies.

    IMPACT Enhances LLM reasoning for persuasive dialogue, potentially improving human-AI interaction in sensitive applications.

  18. LWiAI Podcast #245 - TML-Interaction, Claude For Legal, Sam Altman on Stand

    OpenAI has launched new voice intelligence features, including GPT Realtime 2 powered by GPT-5, offering real-time translation and transcription with an emphasis on reduced latency and larger context windows. Anthropic is expanding its vertical product offerings with Claude for Legal and increased availability through AWS, while also developing methods to train ethical reasoning in agents. Meanwhile, Thinking Machines has previewed a novel conversational system, though it remains inaccessible to the public.

    IMPACT New voice features and specialized legal AI tools signal continued vertical integration and performance improvements in large language models.

  19. FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

    Researchers have introduced FineBench, a new benchmark designed to evaluate the fine-grained human activity understanding capabilities of vision-language models (VLMs). The benchmark includes nearly 200,000 question-answer pairs across 64 long-form videos, focusing on detailed actions and interactions. Evaluations showed that while proprietary models like GPT-5 performed adequately, open-source VLMs struggled with spatial reasoning and subtle movement distinctions. To address these limitations, the team also proposed FineAgent, a framework that enhances VLMs using a localizer and descriptor, demonstrating improved performance on FineBench.

    IMPACT Establishes a new standard for evaluating VLMs' nuanced human activity understanding, potentially driving development of more capable models.

  20. Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

    A new study reveals that the vulnerability of frontier multimodal large language models (MLLMs) to jailbreak attacks is significantly influenced by language and modality. Researchers found that while linguistic framing attacks were less effective in Spanish compared to English, visually explicit multimodal attacks became more potent. This suggests that alignment failures operate through distinct language- and modality-specific mechanisms, leading to different safety rankings across languages. The findings highlight the need for safety evaluation frameworks to account for these cross-lingual and cross-modal differences.

    IMPACT Demonstrates that current safety evaluations may not generalize across languages, necessitating redesigned frameworks for global MLLM deployment.

  21. An OpenAI model has disproved a central conjecture in discrete geometry

    OpenAI's general-purpose reasoning model has disproved an 80-year-old conjecture in discrete geometry, known as the unit distance problem. This marks a significant advancement for AI in mathematics, as the model autonomously generated a novel disproof that challenges long-held beliefs in the field. Unlike a previous claim that was retracted, this breakthrough has been validated by mathematicians, including those who previously expressed skepticism.

    IMPACT Demonstrates AI's capability for original discovery, potentially accelerating breakthroughs in science and engineering.

  22. Finally Arrived in Japan! An Editor Tried Meta x EssilorLuxottica's New "AI Smart Glasses" | Lifehacker Japan https://www.yayafa.com/2803934/

    Microsoft has integrated GPT-5.5 Thinking and ChatGPT Images 2.0 into its Microsoft 365 Copilot, aiming to enhance its capabilities beyond initial criticisms. This move is part of a broader trend where companies like Meta are also advancing AI-powered hardware, such as their new AI smart glasses developed with EssilorLuxottica, which are seeing increased adoption and upgrades.

    IMPACT Enhances productivity tools with advanced AI capabilities and signals continued innovation in AI-powered hardware.

  23. How to choose the right open model for production

    Choosing the right open-source AI model for production requires careful consideration of factors like transparency, adaptability, and control. While proprietary models offer tiered options, open models allow for deeper customization and ownership. However, legal licensing requirements, such as Apache-2.0 or MIT, must be strictly adhered to for commercial use, and model size should correlate with the capability tier of comparable closed models.

    IMPACT Provides guidance for AI operators on selecting and implementing open-source models effectively.

  24. Where's the raccoon with the ham radio? (ChatGPT Images 2.0)

    OpenAI has released its latest image generation model, ChatGPT Images 2.0, which Sam Altman claims is a significant leap comparable to the jump from GPT-3 to GPT-5. Early tests suggest the new model excels at complex illustrations, particularly detailed scenes such as a "Where's Waldo" style image with a raccoon holding a ham radio, a task that previous models struggled with. While the model demonstrates impressive capabilities, it is less reliable at solving its own generated puzzles: in one instance it failed to accurately identify the hidden raccoon.

    IMPACT Sets a new benchmark for complex image generation, potentially influencing creative industries and AI model development.

  25. Computer-Using Agent

    OpenAI has released AgentKit, a comprehensive suite of tools designed to streamline the development, deployment, and optimization of AI agents. This new toolkit includes an Agent Builder for visual workflow creation, a Connector Registry for managing data integrations, and ChatKit for embedding agentic UIs. Concurrently, Google DeepMind has introduced CodeMender, an AI agent focused on automatically identifying and fixing software vulnerabilities, and AlphaEvolve, a Gemini-powered agent for algorithm discovery and optimization. OpenAI also detailed its Computer-Using Agent (CUA), which interacts with digital interfaces like a human, achieving state-of-the-art results on various benchmarks.

    IMPACT New agent development tools and specialized AI agents for coding and security will accelerate software development and improve code quality.