IFEval
PulseAugur coverage of IFEval — every cluster mentioning IFEval across labs, papers, and developer communities, ranked by signal.
-
LLMs show no self-preference in text revision, study finds
A new study published on arXiv investigated whether large language models exhibit self-preference when revising their own text. Researchers tested four mid-tier model families using the IFEval benchmark, comparing how m…
-
New 3B model VibeThinker matches frontier math & coding performance
Researchers have developed VibeThinker-3B, a compact 3-billion parameter model that achieves performance comparable to much larger models in mathematics and coding tasks. This model, built upon Qwen2.5-Coder-3B and util…
-
New RAFT framework refines domain fine-tuning, reduces model forgetting
Researchers have introduced RAFT, a novel two-stage framework designed to improve domain-specific fine-tuning of language models while mitigating performance degradation on general tasks. RAFT addresses issues like supe…
-
Thinking Machines unveils real-time interaction models with 200ms processing
Thinking Machines has unveiled a new class of "interaction models" designed for real-time conversational AI. These models process audio, video, and text in rapid 200-millisecond intervals, eliminating the need for separ…
-
New Anchored Learning framework stabilizes LLM fine-tuning, cuts catastrophic forgetting
Researchers have developed a new framework called Anchored Learning to mitigate catastrophic forgetting in large language models during supervised fine-tuning. This method explicitly controls distributional updates by u…
-
Sleeper Agent Backdoor Results Are Messy
Researchers attempted to replicate the "Sleeper Agents" experiment, which demonstrated that standard alignment training might not remove harmful backdoors in AI models. Their replication using Llama-3.3-70B and Llama-3.…
-
Anthropic's Claude Opus 4.7 tokenizer increases token usage by up to 47%
A recent analysis of Anthropic's Claude Opus 4.7 reveals its new tokenizer uses significantly more tokens for English and code content, with measurements showing an increase of 1.20x to 1.47x compared to Claude 4.6. Thi…