Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 6h

SWE-IF: Aligning Code Evaluation with Human Preference

Researchers have introduced SWE-IF, an evaluation framework that assesses how well large language models (LLMs) follow code instructions, rather than measuring functional correctness alone. The framework includes a taxonomy of 30 verifiable code instructions, each paired with a deterministic verifier, and aims to capture the 'vibe check' that reflects human preference for clean, intent-preserving, and correct code. Evaluations of 31 LLMs showed that instruction following is a key differentiator between models, and that a composite score combining functional correctness and instruction following correlates best with human preference.
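To make the idea of a deterministic verifier concrete, here is a minimal Python sketch. It is not taken from SWE-IF: the two example instructions (every function has a docstring, no line exceeds a length limit), the function names, and the all-or-nothing composite rule are all assumptions chosen for illustration. The brief does not state how the framework actually combines the two signals.

```python
import ast
from typing import Iterable


def has_docstrings(source: str) -> bool:
    """Hypothetical verifier: every function in `source` has a docstring."""
    tree = ast.parse(source)
    funcs = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return all(ast.get_docstring(f) is not None for f in funcs)


def within_line_limit(source: str, limit: int = 79) -> bool:
    """Hypothetical verifier: no line is longer than `limit` characters."""
    return all(len(line) <= limit for line in source.splitlines())


def composite_score(tests_passed: bool, checks: Iterable[bool]) -> float:
    """Assumed composite rule: 1.0 only if the unit tests pass AND every
    instruction check passes; 0.0 otherwise."""
    return 1.0 if tests_passed and all(checks) else 0.0


if __name__ == "__main__":
    sample = 'def add(a, b):\n    """Return the sum."""\n    return a + b\n'
    checks = [has_docstrings(sample), within_line_limit(sample)]
    # tests_passed would come from running the task's unit tests;
    # it is hard-coded here for illustration.
    print(composite_score(tests_passed=True, checks=checks))
```

Because both checks are pure functions of the source text, the same input always yields the same verdict, which is what lets instruction following be scored without a human or model judge.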

IMPACT This new evaluation framework could lead to LLMs that generate more human-aligned and maintainable code, improving developer productivity.

LLMs
SWE-IF
VeriCode
Ming Zhong