Gemini 3.1-pro-preview
PulseAugur coverage of Gemini 3.1-pro-preview — every cluster mentioning Gemini 3.1-pro-preview across labs, papers, and developer communities, ranked by signal.
- 2026-06-01 product_launch Gemini 3.1 Pro Preview is highlighted for its ability to directly transcribe audio input.
Gemini 3.1 Pro Preview may show inconsistent performance in financial decision-making tasks
The new 1rok benchmark tests LLMs on stock-picking, a task that requires decision-making under uncertainty. Gemini 3.1 Pro Preview is included, but no results for it have been reported yet. Because the benchmark targets practical, downstream performance rather than traditional test scores, the model could prove inconsistent at selecting profitable stocks compared with models that have more established real-world usage data.
Gemini 3.1 Pro Preview may struggle with complex IT incident diagnosis
The recent ITBench-AA benchmark, which evaluates frontier AI models on enterprise IT tasks such as SRE, shows that even advanced models score below 50% on diagnosing Kubernetes incidents. The available evidence does not detail Gemini 3.1 Pro Preview's own score, but frontier models in general struggle with root-cause analysis and with avoiding false positives in complex scenarios, so the model is likely to face the same difficulties.
Gemini 3.1 Pro Preview passes initial safety audits for code sabotage
Recent AI safety audits, which use environment blueprints to make evaluations more realistic, tested Gemini 3.1 Pro Preview for code sabotage. Across 160 trials the audits found no egregious scheming behavior, which suggests the model does not engage in this kind of sabotage under the audited conditions.
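As a rough guide to how much a clean result over 160 trials can support, the sketch below computes the largest underlying rate of egregious behavior that would still be consistent with seeing zero events. It assumes the trials are independent and identically distributed and that "no egregious scheming" means exactly zero events; neither assumption is stated in the audit coverage, and the function name is illustrative only.

```python
def zero_event_upper_bound(trials: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on an event rate when 0 events occur in `trials` trials.

    Solves (1 - p) ** trials = 1 - confidence for p, which is the exact
    binomial (Clopper-Pearson) bound for the zero-event case.
    Assumes independent, identically distributed trials (an assumption,
    not something the audit coverage states).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


if __name__ == "__main__":
    bound = zero_event_upper_bound(160)
    # The "rule of three" shortcut, 3 / n, gives a close approximation.
    print(f"exact bound: {bound:.4f}, rule of three: {3 / 160:.4f}")
```

Under those assumptions, zero events in 160 trials is still compatible with an underlying rate of just under 2%, so the result is better read as an upper limit on how often the behavior occurs than as evidence that it never does.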
Gemini 3.1 Pro Preview may lag in real-world adoption compared to GPT-5 models
AgentTape ranks models on a blend of benchmark performance, actual usage, cost, and speed, and GPT-5 models currently dominate its top rankings. Gemini 3.1 Pro Preview, meanwhile, appears in new specialized benchmarks (ITBench-AA, 1rok) without a clear lead. It is therefore plausible that its real-world adoption currently trails that of the leading GPT-5 models. Future usage data from indices like AgentTape will be key to verifying this.
Gemini 3.1 Pro Preview shows mixed results in specialized benchmarks
While Gemini 3.1 Pro Preview was tested for code sabotage in AI safety audits and performed adequately, it has not yet demonstrated top-tier performance in newly released benchmarks like ITBench-AA or 1rok, which focus on enterprise IT tasks and stock-picking respectively. This suggests Gemini 3.1 Pro Preview may have specific strengths but is not universally outperforming competitors like GPT-5.5 across all emerging, practical evaluation domains.
-
Google's Gemini 3.5 Flash disappoints on Android benchmark; Pixel Drop features leaked
Google has inadvertently revealed upcoming features for its Pixel Drop update, including "Screen Reactions" for creating reaction videos and Gemini Omni for AI-powered multimedia content generation. Separately, the new …
-
ChatGPT market share dips below 50% as users migrate to rivals
ChatGPT's market share has fallen below 50% for the first time, with users shifting to alternatives like Google's Gemini, Anthropic's Claude, and xAI's Grok. In a separate development, Vercel has released 'eve,' an open…
-
New method boosts video QA accuracy using cross-model disagreement
Researchers have developed a novel inference-time procedure called disagreement-based cross-model routing to improve video question answering accuracy. This method leverages the variance in outputs from a primary video …
-
Gemini 3.5 Flash disappoints on Android benchmarks, costs more than predecessor
Google's new Gemini 3.5 Flash model has underperformed in Android development benchmarks, scoring lower than its predecessor, Gemini 3.1 Pro Preview. The model also incurred significantly higher costs per execution, rep…
-
AI model performance heavily depends on prompting method, study finds
A new study published on arXiv reveals that the way AI models are prompted, or "scaffolded," significantly impacts their measured performance. Researchers found that the choice of scaffold alone could alter a model's ac…
-
New KINA benchmark ranks Gemini 3.1 Pro highest, surpassing Claude and GPT-5
A new benchmark called KINA has been introduced to evaluate large language models across 261 fine-grained disciplines, addressing issues of scaling-driven design and annotation quality. The benchmark, comprising 899 ite…
-
LLM constraint injection method boosts optimization modeling accuracy
Researchers have developed a new method called constraint injection to improve how large language models handle complex optimization problems. This technique addresses the issue of LLMs incorrectly adding or omitting co…
-
Gemini 3.1 Pro Preview offers direct audio transcription via API
A guide details how to use AI models for audio transcription, distinguishing between speech recognition and text post-processing. It highlights Google's Gemini 3.1 Pro Preview as a model capable of directly processing a…
-
New Benchmark Tests LLMs on Scientific Hypothesis Generation
A new benchmark called ProjectionBench has been developed to evaluate the scientific hypothesis generation capabilities of large language models. This framework progressively reveals information from research papers, al…
-
Frontier AI models fail new IT benchmark, scoring below 50%
A new benchmark, ITBench-AA, has been released to evaluate the capabilities of frontier AI models on enterprise IT tasks, specifically focusing on Site Reliability Engineering (SRE). In initial tests, even the most adva…
-
New ATLAS benchmark reveals long-context LLM performance shifts
A new benchmarking framework called ATLAS has been introduced to more comprehensively evaluate the long-context abilities of language models. Unlike previous methods that often report single scores or narrow task perfor…
-
AI safety audits improved with environment blueprints
Researchers have developed a new pipeline to generate environment blueprints for more realistic and consistent AI safety audits. This method was tested using the Petri auditor to evaluate Gemini 3.1 Pro Preview for code…
-
AgentTape index ranks AI models by usage, not just benchmarks
A new open-source index called AgentTape ranks AI models based on a blend of benchmark performance, actual usage, cost, and speed. Currently, OpenAI's GPT-5 models dominate the top rankings, with GPT-5.5 specifically ex…
-
LLM benchmark 1rok pits GPT-5.5, Gemini 3.1, Grok 4.3 in stock-picking contest
A new benchmark, dubbed 1rok, has been launched to evaluate the stock-picking capabilities of frontier large language models. The benchmark assigns each participating LLM a virtual portfolio of $100,000 and tasks them w…
-
New benchmark CiteVQA exposes "Attribution Hallucination" in LLMs
Researchers have introduced CiteVQA, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to accurately attribute answers to specific source regions within documents. Unlike pre…
-
AI Labs Pivot to Agent Products Amidst DeepSeek's Price Cuts
Researchers have developed a benchmark to test Large Language Models' ability to handle temporal changes in legal statutes, identifying issues like outdated information and recency bias. Meanwhile, the AI industry is se…
-
[GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Researchers are developing new benchmarks and evaluation methods for large language models (LLMs) in mathematical reasoning and educational assessment. New datasets like ESTBook and Math-PT aim to go beyond simple accur…