Researchers have introduced the Precise Debugging Benchmark (PDB), a framework for evaluating the debugging capabilities of large language models. The framework converts existing coding datasets into debugging benchmarks by automatically generating buggy programs with synthesized atomic bugs. PDB employs novel metrics such as edit-level precision and bug-level recall to assess how accurately models fix code. Experiments showed that leading models such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, despite high test pass rates, struggle with precision and often over-edit solutions.
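To make the two metrics concrete, here is a minimal sketch of how edit-level precision and bug-level recall might be computed. The summary does not give the paper's exact definitions, so the formulas below are assumptions: precision is taken as the share of the model's edited lines that were actually required, and recall as the share of injected bugs that were fixed. The function names and input shapes are hypothetical.

```python
def edit_level_precision(model_edited_lines, required_lines):
    """Assumed definition: fraction of the model's edited lines
    that coincide with lines a minimal fix needed to change."""
    model = set(model_edited_lines)
    required = set(required_lines)
    if not model:
        return 0.0
    return len(model & required) / len(model)


def bug_level_recall(bug_fixed):
    """Assumed definition: fraction of injected atomic bugs that
    the model's patch fixed. `bug_fixed` maps bug id -> bool."""
    if not bug_fixed:
        return 0.0
    return sum(1 for fixed in bug_fixed.values() if fixed) / len(bug_fixed)


# Hypothetical case: two bugs were injected on lines 3 and 7.
# The model fixes both but also rewrites lines 4, 8, and 9.
precision = edit_level_precision([3, 4, 7, 8, 9], [3, 7])
recall = bug_level_recall({"bug_1": True, "bug_2": True})
print(precision, recall)
```

Under these assumed definitions, the hypothetical patch fixes every bug (recall 1.0) yet only two of its five edited lines were necessary (precision 0.4), which is the over-editing pattern the summary describes.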
Impact: Highlights a gap in current LLM coding abilities, suggesting a need for new post-training methods that improve precise debugging.
Ranking rationale: New academic paper introducing a benchmark and evaluation metrics for LLM debugging capabilities.
AI-generated summary · Google Gemini · from 1 source.