Researchers have found that the labels used to present context to language models significantly affect how the models use that context. In tests across models including GPT-5.5 and DeepSeek V4 Pro, labels such as "Instruction:" or "Reference:" led to much higher adoption of injected information, while "Example:" labels suppressed it. The findings suggest that the framing of context changes how models use the information provided, and that benchmarks should control for these presentation choices. AI
IMPACT Highlights the need for standardized context presentation in RAG benchmarks to ensure reliable model performance evaluation.
RANK_REASON Academic paper detailing new findings on language model behavior.
AI-generated summary · Google Gemini · from 1 source.
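As a rough illustration of what "controlling for presentation choices" could look like in a benchmark harness, the sketch below builds one prompt per label while holding the context and question fixed, so that only the label varies. It is not taken from the paper: the function names, the prompt template, the "Context:" baseline label, and the substring-based adoption metric are all assumptions made for illustration.

```python
# Labels "Instruction:", "Reference:" and "Example:" come from the summary;
# "Context:" is an assumed neutral baseline, not something the source names.
LABELS = ["Instruction:", "Reference:", "Example:", "Context:"]


def build_prompt(label: str, context: str, question: str) -> str:
    """Wrap the same context under a given label (hypothetical template)."""
    return f"{label}\n{context}\n\nQuestion: {question}\nAnswer:"


def label_variants(context: str, question: str) -> dict[str, str]:
    """Return one prompt per label, keeping context and question identical."""
    return {label: build_prompt(label, context, question) for label in LABELS}


def adoption_rate(answers: list[str], injected_fact: str) -> float:
    """Fraction of answers that contain the injected string.

    A crude, assumed metric: case-insensitive substring match.
    """
    if not answers:
        return 0.0
    hits = sum(injected_fact.lower() in answer.lower() for answer in answers)
    return hits / len(answers)
```

A harness along these lines would send each variant to the model under test, collect the answers per label, and compare `adoption_rate` across labels; large gaps between labels would indicate that the presentation choice, rather than the content, is driving the result.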