Brief

last 24h

[9/9] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
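
As a rough illustration of how a 0 to 100 score could combine the four factors named above, the sketch below uses a weighted sum of three signals multiplied by an exponential time decay. The weights, the 24-hour half-life, and the name `score_story` are assumptions made for this example; the brief does not state its actual formula.

```python
# Hypothetical weights; the brief does not publish its real ones.
WEIGHTS = {"authority": 0.4, "cluster": 0.35, "headline": 0.25}
HALF_LIFE_HOURS = 24.0  # assumed half-life for the time-decay term


def score_story(authority: float, cluster: float, headline: float,
                age_hours: float) -> float:
    """Combine three signals in [0, 1], decay by age, return a 0-100 score."""
    for value in (authority, cluster, headline):
        if not 0.0 <= value <= 1.0:
            raise ValueError("signals must lie in [0, 1]")
    if age_hours < 0:
        raise ValueError("age_hours must be non-negative")
    base = (WEIGHTS["authority"] * authority
            + WEIGHTS["cluster"] * cluster
            + WEIGHTS["headline"] * headline)
    # The score halves every HALF_LIFE_HOURS.
    decay = 0.5 ** (age_hours / HALF_LIFE_HOURS)
    return round(100.0 * base * decay, 1)
```

Under these assumed settings, a story with perfect signals loses half its score after one day, so recent items from strong clusters rise to the top.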

TOOL · dev.to — Claude Code tag English(EN) · 5h

One Markdown File Made My AI Agent 23 Points Smarter

SkillOpt, a method developed by Microsoft and three Chinese universities, shows that a single Markdown file can significantly improve AI agent performance. Supplied as context at inference time, the file raised GPT-5.5's scores by an average of 23 points across six procedural benchmarks. The approach outperformed handwritten instructions, LLM-generated instructions, and four specialized training methods.

IMPACT This method suggests a simple yet effective way to enhance AI agent capabilities, potentially influencing how agents are optimized and deployed.
- SkillOpt
- Microsoft
- GPT-5.5
- TextGrad
- GEPA AI Agent
- EvoSkill
- Susiloharjo
- Trace2Skill
RESEARCH · MarkTechPost English(EN) · 1w · [4 sources]

A Coding Implementation on Microsoft SkillOpt for Instrumented Prompt Optimization, Skill Evolution Analysis, and Baseline Comparison

Researchers have developed VISTA, a new framework for automatically optimizing prompts used with large language models. This method aims to overcome limitations in existing reflective prompt optimization techniques, which can be opaque and lead to performance degradation. VISTA decouples hypothesis generation from prompt rewriting, enabling more interpretable optimization traces and improved accuracy on complex tasks like arithmetic word problems.

IMPACT Introduces a more interpretable and effective method for prompt engineering, potentially improving LLM performance on complex reasoning tasks.
- Fine-Tuning
- Retrieval-Augmented Generation
- Large Language Models
- Prompt Engineering
- GPT-4o-mini
- OpenAI
- LiteLLM
- MarkTechPost
- gpt-4.1
- Medium
- AIME2025
- GSM8K
- VISTA
- SkillOpt
- LLM
- GPT-4o
- Microsoft
RESEARCH · r/LocalLLaMA English(EN) · 1w · [2 sources]

Levi: Run AlphaEvolve on your local QWEN 30B

LEVI is a new open-source system that emulates AlphaEvolve's capabilities at a reported cost up to 35 times lower. Its core principle is that smaller language models can match or exceed larger ones when paired with optimized search architectures and intelligent routing. The system has performed strongly on code and prompt optimization tasks, outperforming existing frameworks on benchmarks like ADRS and IFBench while using fewer computational resources.

IMPACT This system could enable more accessible and cost-effective AI development and experimentation by leveraging smaller models.
- AlphaEvolve
- LEVI
- Claude Code
- Codex
- Qwen-30B
- Claude Opus
- Google
- Qwen3-30B-A3B
- GPT-5
- TPU Research Cloud
- IFBench
- HotpotQA
TOOL · Mastodon — fosstodon.org English(EN) · 1w

Researchers have introduced GEPA, a reflective prompt-evolution framework that improves how language models solve arithmetic word problems. Starting from weak s

Researchers have developed GEPA, a new framework designed to enhance the problem-solving capabilities of language models, particularly for arithmetic word problems. The system begins with basic prompts and iteratively refines them by creating deterministic benchmarks, establishing structured evaluation methods, and evolving the instructions and the model's output format together. The resulting improvements have been shown to generalize to new, unseen datasets.

IMPACT Enhances LLM reasoning for complex word problems, potentially improving performance in educational and analytical applications.
- language models
RESEARCH · arXiv cs.CL English(EN) · 1mo · [3 sources]

CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution

Researchers have developed CANTANTE, a new framework for optimizing the configuration of multi-agent systems built on large language models. It addresses the problem of assigning credit when only system-level scores are available by decomposing rewards into per-agent update signals. Evaluated on programming, mathematical reasoning, and question-answering tasks, CANTANTE outperformed existing methods and unoptimized prompts while incurring lower inference costs.

IMPACT Introduces a novel method for optimizing multi-agent LLM systems, potentially improving performance and efficiency in complex tasks.
- HotpotQA
- LLM
- MIPROv2
- MBPP
- GSM8K
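
To make the credit-assignment problem concrete, here is a minimal sketch of one simple contrastive scheme: for each agent, compare the mean system score of rollouts that used a given prompt variant against the mean of rollouts that did not. This is an illustrative assumption, not CANTANTE's published algorithm; the function `per_agent_credit` and the rollout layout are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

# A rollout pairs the prompt variant used by each agent with the single
# system-level score that was observed (hypothetical data layout).
Rollout = Tuple[Dict[str, str], float]


def per_agent_credit(rollouts: List[Rollout]) -> Dict[str, Dict[str, float]]:
    """Per agent and variant: mean score with the variant minus mean without."""
    credit: Dict[str, Dict[str, float]] = {}
    for agent in rollouts[0][0]:
        credit[agent] = {}
        variants = {config[agent] for config, _ in rollouts}
        for variant in variants:
            with_v = [s for config, s in rollouts if config[agent] == variant]
            without_v = [s for config, s in rollouts if config[agent] != variant]
            if without_v:  # a contrast needs at least one other variant
                credit[agent][variant] = mean(with_v) - mean(without_v)
    return credit


rollouts = [
    ({"planner": "A", "coder": "X"}, 0.9),
    ({"planner": "A", "coder": "Y"}, 0.5),
    ({"planner": "B", "coder": "X"}, 0.8),
    ({"planner": "B", "coder": "Y"}, 0.4),
]
credit = per_agent_credit(rollouts)
```

In this toy data the coder's variant matters far more than the planner's, so a per-agent signal would direct most of the update toward the coder's prompt.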
TOOL · arXiv cs.LG English(EN) · 1mo

P^2O: Joint Policy and Prompt Optimization

Researchers have developed a new method called P^2O (Joint Policy and Prompt Optimization) to address the issue of advantage collapse in Reinforcement Learning with Verifiable Rewards (RLVR) for large language models. This technique alternates between continuous policy updates and discrete prompt evolution, using the GEPA algorithm to discover effective prompts for challenging samples. By distilling these prompts into the model's parameters, P^2O improves out-of-distribution generalization and achieves up to a 9.5% performance increase over existing methods.

IMPACT Introduces a novel approach to enhance LLM reasoning by combining prompt engineering with reinforcement learning, potentially improving performance on complex tasks.
- Kaiqi Zhang
- RLVR
- LLM
TOOL · arXiv cs.CL English(EN) · 1mo

To Write or to Automate Linguistic Prompts, That Is the Question

A new research paper compares automated prompt optimization with expert-crafted prompts for large language models. The study systematically evaluated hand-crafted prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment tasks. Automated and manual prompts often yielded similar quality, with performance varying by task and model configuration.

IMPACT Investigates whether automated prompt optimization can match or exceed expert prompt engineering for LLMs.
TOOL · Mastodon — sigmoid.social English(EN) · 1mo

Apple's "Reinforced Agent": a reviewer agent vets tool calls before execution instead of recovering after errors. +5.5% on BFCL irrelevance, +7.1% on τ²-Bench m

Apple researchers have developed a "Reinforced Agent" in which a reviewer agent verifies tool calls before execution, aiming to prevent errors rather than correct them post-hoc. The approach demonstrated significant improvements on benchmarks like BFCL irrelevance and τ²-Bench, with reasoning-model reviewers achieving a 3:1 helpful-to-harmful ratio. GEPA prompt optimization added a modest further gain without requiring model retraining.

IMPACT This agent's proactive error prevention could enhance the reliability and safety of AI systems interacting with external tools.
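
The review-before-execute pattern described above can be sketched as a gate in front of the tool dispatcher. This is a minimal illustration only: the rule-based `review` function stands in for a reviewer model, and `ToolCall`, `execute_with_review`, and the `required_args` table are hypothetical names, not Apple's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Set, Tuple


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any]


def review(call: ToolCall, required_args: Dict[str, Set[str]]) -> Tuple[bool, str]:
    """Hypothetical reviewer: a rule check standing in for a reviewer model."""
    if call.name not in required_args:
        return False, f"unknown tool: {call.name}"
    missing = required_args[call.name] - set(call.args)
    if missing:
        return False, f"missing arguments: {sorted(missing)}"
    return True, "ok"


def execute_with_review(call: ToolCall,
                        tools: Dict[str, Callable[..., Any]],
                        required_args: Dict[str, Set[str]]) -> Dict[str, Any]:
    """Run the tool only if the reviewer approves; otherwise report why not."""
    approved, reason = review(call, required_args)
    if not approved:
        # The call is rejected before any side effect can happen.
        return {"executed": False, "reason": reason}
    return {"executed": True, "result": tools[call.name](**call.args)}


tools = {"add": lambda a, b: a + b}
required_args = {"add": {"a", "b"}}
ok = execute_with_review(ToolCall("add", {"a": 1, "b": 2}), tools, required_args)
blocked = execute_with_review(ToolCall("delete_all", {}), tools, required_args)
```

Because rejection happens before dispatch, an irrelevant or malformed call never reaches the tool, which is the difference from recovering after an error.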
TOOL · Mastodon — sigmoid.social English(EN) · 1mo

GEPA optimizes prompts in compound AI systems by reading failed trajectories in natural language and editing the prompt of the module that caused the failure. A

Researchers have developed GEPA, a new method for optimizing prompts in compound AI systems. GEPA reads failed execution trajectories in natural language and edits the prompt of the specific module that caused each failure. In tests across six tasks, GEPA outperformed the GRPO method by an average of 6% while using significantly fewer rollouts.

IMPACT This new method could lead to more efficient and effective AI systems by automating prompt refinement and reducing trial-and-error.
- GRPO
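
The loop described in this item (run the system, find the module that failed, edit only that module's prompt) can be sketched with a toy two-module pipeline. Everything here is a stand-in: `run_pipeline` fakes the LLM calls with keyword checks, `reflect` is a hypothetical rule-based substitute for an LLM reading the failure, and the sketch leaves out whatever candidate-selection strategy the real method uses.

```python
from typing import Dict, List, Tuple

Prompts = Dict[str, str]


def run_pipeline(prompts: Prompts, x: int) -> Tuple[bool, str, str]:
    """Toy two-module system returning (success, failing_module, feedback).

    Stand-in for LLM calls: each module "works" only if its prompt
    contains the instruction it needs. The input x is unused in this toy.
    """
    if "show steps" not in prompts["solver"]:
        return False, "solver", "solver skipped intermediate steps"
    if "integer only" not in prompts["formatter"]:
        return False, "formatter", "formatter returned text around the number"
    return True, "", ""


def reflect(feedback: str) -> str:
    """Hypothetical reflection step; a real system would ask an LLM."""
    return "show steps" if "steps" in feedback else "integer only"


def optimize(prompts: Prompts, inputs: List[int], budget: int = 5) -> Prompts:
    """Repeatedly patch the prompt of whichever module caused a failure."""
    prompts = dict(prompts)  # do not mutate the caller's prompts
    for _ in range(budget):
        results = [run_pipeline(prompts, x) for x in inputs]
        failures = [r for r in results if not r[0]]
        if not failures:
            break
        _, module, feedback = failures[0]
        # Edit only the module that the failed trajectory points at.
        prompts[module] = prompts[module] + " " + reflect(feedback)
    return prompts


seed = {"solver": "Solve the problem.", "formatter": "Print the answer."}
tuned = optimize(seed, inputs=[1, 2, 3])
```

In this toy run the solver prompt is patched first and the formatter prompt second, after which every input succeeds and the loop stops before exhausting its budget.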