Brief

last 24h

[10/10] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · arXiv cs.CL English(EN) · 4d

Self-Harness: Harnesses That Improve Themselves

Researchers have developed a method called Self-Harness that lets LLM-based agents improve their own operational harnesses. The process is iterative: the agent identifies model-specific failure patterns, generates targeted harness modifications, and validates the changes through regression testing. Applied to three different base models on the Terminal-Bench 2.0 benchmark, Self-Harness significantly boosted performance, pointing toward self-optimizing AI agents.

IMPACT Enables LLM agents to autonomously adapt and improve their interaction with environments, potentially leading to more robust and efficient AI systems.
TOOL · arXiv cs.AI English(EN) · 1w

What Makes Interaction Trajectories Effective for Training Terminal Agents?

A new research paper examines what makes interaction trajectories effective for training AI agents and finds that a model's standalone performance does not determine how well it teaches. Agents fine-tuned on trajectories from a lower-scoring model, DeepSeek-V3.2, generalized better than those trained on trajectories from a higher-scoring model, Claude Opus 4.6. The paper attributes this "pedagogical paradox" to Environment-Grounded Supervision (EGS), which exposes inspect-act-verify behaviors and lets students internalize problem-solving routines. The study also reports exceptional data efficiency, with Qwen3-32B achieving state-of-the-art performance using significantly less data.

IMPACT Suggests a shift in AI agent training from outcome-matching to harness engineering for better generalization.
SIGNIFICANT · dev.to — LLM tag (CA) · 2w

Qwen3.7-Max: Alibaba's Agent-First 1M-Context LLM Developer Guide

Alibaba has released Qwen3.7-Max, an agent-first LLM with a 1 million token context window, capable of autonomous coding tasks. The model demonstrated a 35-hour coding session without human intervention, optimizing code for unfamiliar hardware and achieving a 10x speedup on a custom chip performance kernel. While independent reproduction of this demo is pending, Qwen3.7-Max shows strong performance on benchmarks like Terminal-Bench 2.0 and MCP-Atlas, surpassing some competitors, though it trails in graduate-level science reasoning and has a lower attempt rate.

IMPACT Sets a new bar for agentic coding and long-context reasoning, potentially pressuring competitors in specialized tasks.
TOOL · Towards AI English(EN) · 2w

Claude Code vs Codex vs Antigravity: Which AI Coding Agent Should You Use?

A recent comparison evaluated three AI coding agents: OpenAI's Codex (powered by GPT-5.5), Anthropic's Claude Code (using Claude Sonnet 4.6), and Google's Antigravity (with Gemini 3.5 Flash). The experiment focused on real-world engineering tasks to determine which agent performed best. GPT-5.5 excelled in terminal command execution, Claude Sonnet 4.6 led in SWE-Bench for production code tasks, and Gemini 3.5 Flash demonstrated superior multi-tool orchestration capabilities and speed.

IMPACT Provides comparative performance data to help developers choose the most effective AI coding agent for specific tasks.
- Anthropic
- OpenAI
- Google
- GPT-5.5
- Codex
- Claude Code
- SWE-Bench
- Claude Sonnet 4.6
- Terminal-Bench 2.0
- MCP Atlas
- Gemini 3.5 Flash
TOOL · dev.to — LLM tag English(EN) · 3w

Why your local LLM aces benchmarks but fails real terminal tasks

Local large language models often perform poorly on multi-step terminal tasks despite excelling at standard benchmarks like MMLU. The discrepancy arises because traditional benchmarks measure single-turn reasoning and do not capture an agent's need to choose tools, parse messy output, maintain state, and recover from errors. To address this, new agentic benchmarks such as Terminal-Bench 2.0 are emerging; they evaluate models in a sandbox environment and grade task completion rather than intermediate reasoning alone.

IMPACT Highlights the gap between LLM benchmark performance and real-world agentic capabilities, suggesting a need for more robust evaluation methods.
- Qwen3.6
- LLM
- MMLU
- Terminal-Bench 2.0
- HumanEval
TOOL · dev.to — LLM tag English(EN) · 3w

llama.cpp MTP Boost, New Gemma-4 GGUF, & Qwen 3.6 Local Benchmarks

The llama.cpp project has integrated Multi-Token Prediction (MTP), leading to an 11.5% speed increase for 27B Qwen models in local inference. A new finetuned Gemma-4 model, optimized for creative writing and available in GGUF format, has been released for use with Ollama. Additionally, Qwen 3.6 models have demonstrated competitive performance on the Terminal-Bench 2.0 leaderboard, even surpassing Gemini 2.5 Pro in certain local coding tasks.

IMPACT Local LLM inference performance is boosted by llama.cpp's MTP integration, while new finetunes and benchmark results highlight community-driven model specialization.
SIGNIFICANT · 雷峰网 (Leiphone) 中文(ZH) · 1mo

"Two-Track Hands-On Test" of Qwen 3.6-Plus: Is Agentic Coding Already Capable of "Carrying the Load"?

Alibaba's Qwen 3.6-Plus model has demonstrated advanced capabilities in complex decision-making and agentic coding tasks, according to a recent evaluation. The model successfully generated a detailed implementation plan for an AI learning assistant system for schools, balancing budget, equity, and risk factors, and dynamically adjusted the plan in response to simulated crises. In a coding test, Qwen 3.6-Plus developed a functional AI TODO Board application, handling natural language input, task decomposition, and AI-driven suggestions, while also performing systematic bug fixes and adhering to UI/UX design principles.

IMPACT Sets a new benchmark for AI agentic capabilities in complex planning and full-cycle software development.
RESEARCH · Mastodon — mastodon.social English(EN) · 1mo · [4 sources]

Laguna XS.2 and M.1 https://poolside.ai/blog/laguna-a-deeper-dive

Poolside AI has released two new agentic coding models, Laguna M.1 and Laguna XS.2, along with their agent training and operation runtime. Laguna M.1 is a large Mixture of Experts (MoE) model trained on 30T tokens using NVIDIA Hopper GPUs, while Laguna XS.2 is a smaller, open-weight model available under an Apache 2.0 license. These models are designed for long-horizon tasks and aim to enable more capable AI agents that can write and execute code.

IMPACT Provides open-weight agentic coding models, potentially accelerating development of more capable AI agents.
RESEARCH · Ben's Bites English(EN) · 2mo · [4 sources]

Anthropic built a model too risky to release

Anthropic has developed a new AI model named Claude Mythos, which demonstrates significant advancements in benchmark performance, particularly in identifying software vulnerabilities. Due to its advanced capabilities in finding and exploiting security flaws, Anthropic has opted not to release Mythos publicly. Instead, the company is providing limited access to select organizations through "Project Glasswing" to aid in cybersecurity research and vulnerability discovery, alongside a substantial commitment to open-source security initiatives.

IMPACT Restricted release of advanced AI model highlights growing safety concerns and the potential for AI in cybersecurity, influencing future development and deployment strategies.
- Claude Sonnet
- Anthropic
- Meta
- Claude Mythos
- Firefox
- Project Glasswing
- Claude Opus
- OpenBSD
- FFmpeg
- Muse Spark
- Terminal-Bench 2.0
- Sonnet 4.6
- Opus 4.6
- SWE-bench Pro
FRONTIER RELEASE · Google DeepMind English(EN) · 6mo

Start building with Gemini 3

Google DeepMind has launched Gemini 3 Pro, its latest and most intelligent model, which demonstrates significant improvements in reasoning and coding capabilities. The model surpasses previous versions and excels in agentic workflows and complex zero-shot tasks, topping leaderboards like the WebDev Arena. Gemini 3 Pro is integrated into new platforms like Google Antigravity and is available via the Gemini API, enabling developers to build applications more efficiently using natural language prompts.