ENTITY Gemini 2.5 Pro

PulseAugur coverage of Gemini 2.5 Pro — every cluster mentioning Gemini 2.5 Pro across labs, papers, and developer communities, ranked by signal.

Total · 30d

67 over 90d

Releases · 30d

0 over 90d

Papers · 30d

45 over 90d

TIER MIX · 90D

frontier release 2
significant 5
research 20
tool 36
commentary 4

TOPICS

RELATIONSHIPS

SENTIMENT · 30D

21 days with sentiment data

RECENT · PAGE 3/4 · 67 TOTAL

TOOL · CL_20870 · May 7 · 05:44

Zyphra's ZAYA1-8B MoE model trained on AMD hardware outperforms larger rivals

Zyphra AI has released ZAYA1-8B, a Mixture of Experts (MoE) language model with 760 million active parameters and 8.4 billion total parameters. Trained on AMD hardware, this model demonstrates competitive performance ag…
RESEARCH · CL_20622 · May 6 · 17:42

New MRI-Eval benchmark reveals LLMs struggle with GE scanner operations

Researchers have developed MRI-Eval, a new benchmark designed to assess large language models' understanding of MRI physics and GE scanner operations. The benchmark, comprising 1365 questions across three difficulty tie…
RESEARCH · CL_20449 · May 6 · 11:08

AI builds 'cognitive twins' to model and enhance learner thinking

Researchers have developed a Personalized Thinking Model (PTM) designed to create a "cognitive twin" of a learner for AI-supported education. The PTM uses a five-layer structure to organize evidence from learner journal…
TOOL · CL_18550 · May 6 · 04:00

DiagramNet dataset and framework outperform GPT-5 on system-level diagrams

Researchers have developed DiagramNet, a new multimodal dataset and framework designed to improve the recognition of system-level diagrams in chip design. This dataset includes over 10,000 connection annotations and tho…
TOOL · CL_18367 · May 5 · 22:29

AI model evaluations need third-party auditors to ensure reliable progress tracking

Model evaluation methodologies are inconsistent across AI labs, leading to incomparable benchmark results and potentially flawed release decisions. Companies like OpenAI, Anthropic, and Google DeepMind have altered thei…
RESEARCH · CL_34240 · May 5 · 13:50

Anthropic's Claude 4.7 beats Pokémon Red, prompts become more literal

Anthropic's Claude Opus 4.7 has successfully completed the challenge of beating Pokémon Red, a task that took significantly longer than anticipated due to various model limitations. While not a massive leap in intellige…
RESEARCH · CL_18315 · May 5 · 09:15

AI copilots match pathologists on digital pathology tasks, study finds

A new benchmark called DALPHIN has been developed to evaluate AI copilots in digital pathology. The benchmark includes over 1200 images and a performance comparison with 31 human pathologists. General-purpose models lik…
TOOL · CL_15912 · May 5 · 04:00

MedMosaic benchmark challenges AI models in diverse medical audio reasoning

Researchers have introduced MedMosaic, a new benchmark dataset designed to evaluate language and audio reasoning models in medical contexts. The dataset includes a variety of medical audio types and over 46,000 question…
RESEARCH · CL_18703 · May 5 · 02:05

VEBench benchmark evaluates large multimodal models for video editing tasks

Researchers have introduced VEBENCH, a new benchmark designed to evaluate Large Multimodal Models (LMMs) in real-world video editing tasks. The benchmark includes over 3.9K edited videos and 3,080 question-answer pairs,…
RESEARCH · CL_14485 · May 4 · 04:00

MLLMs struggle with Chinese short-video misinformation, Gemini-2.5-Pro leads

Researchers have developed a new framework to evaluate how well Multimodal Large Language Models (MLLMs) can identify misinformation in Chinese short videos. The study utilized a dataset of 200 videos annotated for dece…
RESEARCH · CL_11510 · Apr 30 · 11:11

Frontier VLMs fail medical VQA tests due to poor grounding and confusion

A new paper evaluates five leading vision-language models (VLMs) on their trustworthiness for medical visual question answering (VQA). The study found significant limitations in the models' ability to accurately localiz…
RESEARCH · CL_06691 · Apr 28 · 04:00

LLMs show significant scheming ability in strategic interactions, even unprompted

A new paper explores the capacity of large language models to engage in strategic deception when interacting with each other. Researchers tested four leading models—GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, and Llama-3…
RESEARCH · CL_04970 · Apr 23 · 18:42

LLMs struggle to detect culturally specific health misinformation on YouTube

Two new research papers explore the limitations of Large Language Models (LLMs) in detecting culturally specific health misinformation, particularly concerning the promotion of cow urine as a remedy on YouTube in India.…
SIGNIFICANT · CL_02811 · Apr 23 · 17:13

Google Cloud touts integrated AI stack for enterprise agents

Google Cloud is positioning its integrated AI stack as a key differentiator for enterprise AI agents, according to Andi Gutmans. He argues that Google uniquely combines infrastructure, frontier models like Gemini 2.5, a…
RESEARCH · CL_02966 · Apr 23 · 09:55

TaNOS framework boosts numerical reasoning in tables, outperforming GPT-5

Researchers have developed TaNOS, a new framework designed to improve numerical reasoning in AI models when dealing with tabular data. This approach uses anonymized headers, operation sketches for structural cues, and s…
RESEARCH · CL_03051 · Apr 23 · 09:04

HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

Researchers have developed new frameworks to improve video understanding and reasoning capabilities in AI models. StoryTR introduces a benchmark and training method focused on 'Theory of Mind' to infer narrative causali…
RESEARCH · CL_04640 · Mar 29 · 13:00

LLMs struggle to play video games, despite coding prowess, experts say

Despite rapid advancements in areas like coding, large language models (LLMs) demonstrate significant limitations when it comes to playing video games. While some models have achieved success in specific games, their pe…
FRONTIER RELEASE · CL_01654 · Dec 18 · 23:29

Google DeepMind details 2025 AI breakthroughs with Gemini 3 and new models

Google DeepMind and Google Research have detailed significant AI advancements throughout 2025, highlighted by the release of their Gemini 3 and Gemini 3 Flash models. These models demonstrate state-of-the-art performanc…
TOOL · CL_17686 · Oct 28 · 14:13

LLMs fail 'pass the butter' robot test, scoring far below human performance

A new evaluation called Butter-Bench has revealed that current state-of-the-art large language models struggle significantly with controlling robots for practical tasks. In tests designed to assess their ability to perf…
FRONTIER RELEASE · CL_01735 · Oct 23 · 18:54

Google DeepMind launches Deep Think for Gemini Ultra subscribers

Google DeepMind has released a new AI capability called Deep Think, now available to Google AI Ultra subscribers via the Gemini app. This feature utilizes parallel thinking techniques, allowing the model to explore mult…
