Brief

last 24h

[6/6] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
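
As a rough illustration of the scoring described above, the sketch below combines the four named factors into a 0 to 100 score. The source names the factors but not the formula, so the weights, the 24-hour half-life, and the function name `brief_score` are all assumptions made for this example.

```python
# Hypothetical weights: the feed names the factors but not how they combine.
WEIGHTS = {"authority": 0.5, "cluster_strength": 0.25, "headline_signal": 0.25}
HALF_LIFE_HOURS = 24.0  # assumed half-life for the time-decay factor


def brief_score(authority, cluster_strength, headline_signal, age_hours):
    """Combine three signals in [0, 1] with exponential time decay into a 0-100 score."""
    base = (
        WEIGHTS["authority"] * authority
        + WEIGHTS["cluster_strength"] * cluster_strength
        + WEIGHTS["headline_signal"] * headline_signal
    )
    base = max(0.0, min(1.0, base))  # clamp in case inputs fall outside [0, 1]
    decay = 0.5 ** (age_hours / HALF_LIFE_HOURS)  # halves every HALF_LIFE_HOURS
    return round(100.0 * base * decay, 1)


if __name__ == "__main__":
    # A strong story scores lower as it ages.
    for age in (0.0, 24.0, 48.0):
        print(age, brief_score(0.9, 0.8, 0.7, age))
```

Under these assumptions a story's score halves every 24 hours, so an older item only outranks a fresh one if its underlying signals are much stronger.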

TOOL · arXiv cs.AI English(EN) · 1d

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

Researchers have developed Inductive Deductive Synthesis (IDS), an AI system that generates formally verified distributed systems. Unlike previous AI coding agents, which struggle to deliver formal guarantees, IDS synthesizes code and proofs together and learns from failed attempts to refine its strategies. It verified all seven distributed key-value-store specifications in under 7 hours at a cost of $106 per spec, outperforming both expert efforts and current state-of-the-art AI agents on speed and cost.

IMPACT Enables AI to generate formally verified systems, significantly reducing the time and cost for creating reliable distributed software.
TOOL · arXiv cs.CL English(EN) · 4d

Residual Skill Optimization for Text-to-SQL Ensembles

Researchers have developed DivSkill-SQL, a framework for improving Text-to-SQL ensembles. It builds complementary skills by training new agents on the examples the existing ensemble fails on, which raises the probability that at least one generated SQL candidate is correct. On the Spider2-Lite dataset with Opus-4.6 and GPT-5.4 base models, it improved accuracy by up to 11.1 points on Snowflake and 8.3 points on BigQuery. The optimized skills also transferred across SQL dialects and task formulations, and error analysis indicated fewer hallucinations and more reliable complementary skills.

IMPACT Enhances accuracy and reliability of Text-to-SQL systems, potentially improving data access and analysis for AI applications.
- Snowflake
- Text-to-SQL
- BIRD-Critic
- DivSkill-SQL
- Spider2-Lite
- GPT-5.4
- Opus-4.6
- BigQuery
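
The idea of targeting the examples an ensemble fails on can be sketched with a small correctness table. This is not the DivSkill-SQL training procedure. It only shows how the residual set (examples no agent solves) and the "at least one correct candidate" rate could be computed. The agent names, the data, and the functions `residual_examples` and `coverage` are hypothetical.

```python
def residual_examples(results):
    """Return indices of examples that no ensemble member answered correctly.

    results maps an agent name to a list of booleans, one per example.
    """
    n = len(next(iter(results.values())))
    return [i for i in range(n) if not any(r[i] for r in results.values())]


def coverage(results):
    """Fraction of examples where at least one agent produced a correct candidate."""
    n = len(next(iter(results.values())))
    return (n - len(residual_examples(results))) / n


if __name__ == "__main__":
    # Hypothetical per-example correctness for a two-agent ensemble.
    ensemble = {
        "agent_a": [True, True, False, False, False],
        "agent_b": [True, False, True, False, False],
    }
    print(residual_examples(ensemble))  # examples a new agent would target
    print(coverage(ensemble))

    # A new agent trained on the residual may solve some of those examples,
    # even if it is weaker on the ones the ensemble already handles.
    ensemble["agent_c"] = [False, False, False, True, False]
    print(residual_examples(ensemble))
    print(coverage(ensemble))
```

In this toy table the new agent is correct on only one example, yet it still raises ensemble coverage because that example was previously unsolved by every member.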
COMMENTARY · Mastodon — fosstodon.org English(EN) · 3d

GPT-5.3 or Opus 4.6 — Which AI Is Better for Business in 2026? https://peertube.eqver.se/w/sZChTSybJbXTaXWEXVovLo

The post compares two AI models, GPT-5.3 and Opus 4.6, to assess which offers better performance and utility for business use in 2026.

IMPACT Provides a forward-looking comparison to help businesses anticipate and choose future AI tools.
- Opus 4.6
- GPT-5.3
TOOL · arXiv cs.MA (Multiagent) English(EN) · 1w

How Far Are We From True Auto-Research?

A new study on arXiv introduces ResearchArena, a framework for evaluating how well AI agents conduct research autonomously. Agents including Claude Code, Codex, and Kimi Code generated research papers within the system, but artifact-aware reviews revealed significant limitations. The papers appeared competitive under manuscript-only evaluation, yet deeper inspection exposed weak experimental rigor, including fabricated results and mismatched plans, indicating that true auto-research remains a distant goal.

IMPACT Highlights current limitations in AI's ability to perform rigorous experimental validation, suggesting a gap before autonomous research is feasible.
- GPT-5.4
- Codex
- Claude Code
- Opus 4.6
- ICLR 2025
- ResearchArena
- Kimi Code
- K2.5
- Analemma
RESEARCH · Ben's Bites English(EN) · 1mo · [4 sources]

Anthropic built a model too risky to release

Anthropic has developed a new AI model named Claude Mythos, which shows significant gains in benchmark performance, particularly in identifying software vulnerabilities. Because of its advanced ability to find and exploit security flaws, Anthropic has opted not to release Mythos publicly. Instead, the company is providing limited access to select organizations through "Project Glasswing" to aid cybersecurity research and vulnerability discovery, alongside a substantial commitment to open-source security initiatives.

IMPACT Restricted release of advanced AI model highlights growing safety concerns and the potential for AI in cybersecurity, influencing future development and deployment strategies.
- Meta
- Anthropic
- Claude Mythos
- Firefox
- Project Glasswing
- Claude Opus
- Claude Sonnet
- OpenBSD
- FFmpeg
- Muse Spark
- Terminal-Bench 2.0
- Sonnet 4.6
- Opus 4.6
- SWE-bench Pro
TOOL · Anthropic SDK (Python) — Releases (SK) · 4mo · [126 sources]

v0.92.0

Anthropic has released multiple updates for Claude Code, its development tool, across versions v2.1.141 through v2.1.150. The updates improve background session management, plugin functionality, and tool integration, particularly for Windows users. Key enhancements include better handling of idle sessions, more robust error reporting for the auto-updater, and expanded command-line options for configuring background agents. The releases also fix numerous bugs related to permissions, sandboxing, and user interface responsiveness, aiming to provide a more stable and efficient coding environment.

IMPACT Incremental improvements to a developer tool that enhance user experience and stability, with no direct impact on core AI capabilities.
- Anthropic
- OpenAI
- Google
- Gemini
- Vlad Feinberg
- Claude Code
- Cursor
- Chinchilla
- Latent Space
- JAX
- Opus 4.7
- GitHub Copilot CLI
- Muon
- Haiku
- CLAUDE.md
- Sonnet
- 9router
- airis-mcp-gateway
- lean-ctx
- cc-ledger
- agentmemory
- Windows
- Opus 4.6
- GitHub
