One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

Study finds prompt sensitivity undermines embedding model evaluation

By the PulseAugur editorial team · [2 sources] · 2026-05-21 14:27

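A new research paper identifies a major flaw in how instruction-tuned embedding models are evaluated. The study shows that using a single prompt per task produces misleading performance scores and unstable leaderboard rankings. The researchers found that the choice of prompt wording can substantially change a model's reported performance, which suggests that current evaluation methods are inadequate.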
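To make the ranking instability concrete, here is a minimal Python sketch of a multi-prompt comparison. It is not the paper's protocol: the model names, the scores, and the helper functions `ranking` and `summarize` are all hypothetical and invented purely for illustration. It ranks three models under each phrasing of the same instruction, then reports the mean and spread across phrasings.

```python
from statistics import mean, stdev

# Hypothetical scores (made-up numbers, NOT results from the paper):
# scores[model][i] is the task score obtained with the i-th phrasing
# of the same instruction.
scores = {
    "model_a": [0.62, 0.58, 0.66],
    "model_b": [0.64, 0.55, 0.60],
    "model_c": [0.59, 0.60, 0.61],
}


def ranking(scores, prompt_idx):
    """Rank models (best first) using only one prompt phrasing."""
    return sorted(scores, key=lambda m: scores[m][prompt_idx], reverse=True)


def summarize(scores):
    """Mean and sample standard deviation across prompt phrasings."""
    return {m: (mean(v), stdev(v)) for m, v in scores.items()}


n_prompts = len(next(iter(scores.values())))
per_prompt_rankings = [ranking(scores, i) for i in range(n_prompts)]
winners = {r[0] for r in per_prompt_rankings}  # top model under each phrasing
summary = summarize(scores)

for i, r in enumerate(per_prompt_rankings):
    print(f"prompt {i}: {' > '.join(r)}")
for m, (mu, sd) in summary.items():
    print(f"{m}: mean={mu:.3f} sd={sd:.3f}")
```

In this toy data each phrasing crowns a different model, so a leaderboard built on any single prompt would report a different winner. Reporting the mean together with the spread across phrasings is one simple way to expose that sensitivity.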

Impact: Highlights a key flaw in current embedding model evaluation methods, which could lead to more robust benchmark design.

Ranking rationale: This cluster contains an academic paper detailing new research findings.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Yevhen Kostiuk, Kenneth Enevoldsen · 2026-05-22 04:00

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

arXiv:2605.22544v1 Announce Type: new Abstract: Instruction embedding models have become common among state-of-the-art models, however are evaluated using a single prompt per task. The single-point evaluation ignores a main problem of the instruction-based approach namely: sensit…

arXiv cs.CL TIER_1 English(EN) · Kenneth Enevoldsen · 2026-05-21 14:27

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

Instruction embedding models have become common among state-of-the-art models, however are evaluated using a single prompt per task. The single-point evaluation ignores a main problem of the instruction-based approach namely: sensitivity to the phrasing of the instruction. We pre…