A new study published on arXiv investigates post-hoc falsification operators for small, frozen code models and finds that most operators do not improve accuracy over standard baselines such as Best-of-N. The research identifies a "coverage wall" and "capability scissors" as key limitations. One method, "expression-layer recovery", showed promise: it recovers correct programs that standard extractors discard, improving the performance of DeepSeek-Coder-1.3B on benchmarks such as HumanEval+.
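For readers unfamiliar with the Best-of-N baseline, the idea can be sketched roughly as follows: sample N candidate programs, run each against the available tests, and keep the first one that passes. This is a minimal illustration under stated assumptions, not the study's harness. The helper names (`load_candidate`, `passes`, `best_of_n`) and the toy `add` task are hypothetical, and the summary does not describe how the paper samples or scores candidates.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

# A test case is (positional arguments, expected return value).
TestCase = Tuple[Tuple[Any, ...], Any]


def load_candidate(source: str, func_name: str) -> Optional[Callable[..., Any]]:
    """Execute candidate source and return the named function, or None on failure.

    Note: exec() on untrusted model output is unsafe; a real harness would
    run candidates in a sandbox with time and memory limits.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
    except Exception:
        return None  # syntax errors and import-time crashes count as failures
    fn = namespace.get(func_name)
    return fn if callable(fn) else None


def passes(fn: Callable[..., Any], tests: Sequence[TestCase]) -> bool:
    """Return True only if fn produces the expected value on every test."""
    try:
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False  # a runtime error on any test counts as a failure


def best_of_n(
    candidates: Sequence[str], func_name: str, tests: Sequence[TestCase]
) -> Optional[str]:
    """Return the first candidate source that passes all tests, else None."""
    for source in candidates:
        fn = load_candidate(source, func_name)
        if fn is not None and passes(fn, tests):
            return source
    return None


# Hypothetical toy task: two sampled candidates for an `add` function.
candidates = [
    "def add(a, b):\n    return a - b\n",  # wrong
    "def add(a, b):\n    return a + b\n",  # correct
]
tests = [((1, 2), 3), ((0, 0), 0)]
selected = best_of_n(candidates, "add", tests)
```

Note that `best_of_n` can only return a program that is already among the samples: if every candidate is wrong, it returns `None`, and no selection rule applied to the same pool can do better.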
IMPACT Suggests that current methods for verifying and repairing code generated by small models are insufficient, highlighting the need for better evaluation harnesses.
RANK_REASON The cluster contains a research paper published on arXiv detailing a measurement study of post-hoc falsification operators for code models.
- arXiv
- DeepSeek-Coder-1.3B
- Hugging Face
- HumanEval+
- MBPP+
- alphaXiv
- CatalyzeX
- DagsHub
- Gotit.pub
- ScienceCast
AI-generated summary · Google Gemini · from 2 sources.