New research quantifies agreement between data-influence and data-similarity in LLMs

By PulseAugur Editorial · [1 source] · 2026-06-22 17:00

Researchers have quantified how closely data-similarity and data-influence measures agree when used to trace LLM outputs back to their training data. The findings indicate significant overlap between the two measures, along with an asymmetry: data-influence assigns more consistent ranks to the top documents identified by data-similarity. The asymmetry was observed across experiments with OLMo2-1B, Qwen3-1.7B, LlaMa3.2-1B, Gemma3-1B, and GPT2. The study proposes exploiting it for a better cost-accuracy trade-off by using data-influence to refine data-similarity results.
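The refinement idea can be pictured as a two-stage ranking: shortlist documents with the cheap measure, then order only that shortlist with the expensive one. The sketch below is a minimal illustration of that pattern, not the paper's procedure. The function names (`top_k`, `rerank`, `overlap_at_k`), the overlap metric, and the toy scores are all assumptions made for this example.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def rerank(similarity, influence_fn, k):
    """Shortlist k documents by similarity, then order them by influence.

    influence_fn (hypothetical) is called only on the shortlist, which is
    where any cost saving would come from.
    """
    shortlist = top_k(similarity, k)
    scored = {i: influence_fn(i) for i in shortlist}
    return sorted(shortlist, key=lambda i: scored[i], reverse=True)


def overlap_at_k(scores_a, scores_b, k):
    """Fraction of documents shared by the two top-k lists (one simple
    agreement measure; the paper may use a different one)."""
    return len(set(top_k(scores_a, k)) & set(top_k(scores_b, k))) / k


# Made-up scores for six training documents, for illustration only.
similarity = [0.91, 0.15, 0.78, 0.40, 0.85, 0.05]
influence = [0.30, 0.02, 0.55, 0.60, 0.10, 0.01]

calls = []  # records which documents needed an influence computation


def influence_fn(i):
    calls.append(i)
    return influence[i]


result = rerank(similarity, influence_fn, k=3)
print(result)      # shortlist from similarity, ordered by influence
print(len(calls))  # number of influence evaluations performed
print(overlap_at_k(similarity, influence, 3))
```

In this toy setup the influence function runs on three documents rather than all six. On a real corpus the shortlist size would be the knob that trades cost against accuracy.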

IMPACT Suggests a cheaper route to tracing LLM outputs to training data: shortlist documents with data-similarity, then refine the shortlist with data-influence.

RANK_REASON The cluster contains an academic paper detailing a new research methodology for understanding LLM behavior.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.LG TIER_1 English(EN) · Mohammad Emtiyaz Khan · 2026-06-22 17:00

Quantifying the Agreement Between Data-Influence and Data-Similarity to Understand LLM Behavior

One way to understand LLM behavior is to trace its output back to the training data. Two types of measures are commonly used for output tracing: data-similarity and data-influence. The former is cheaper while the latter is believed to be more accurate. Even though many works have…
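To make the distinction concrete, here is a toy sketch of one possible instance of each measure. It assumes cosine similarity between input vectors as the data-similarity score and a first-order gradient dot product as the data-influence score, both on a two-parameter linear model with squared loss. These choices, and every name in the code, are assumptions for illustration. The excerpt above does not say which measures the paper uses.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Data-similarity stand-in: cosine of the angle between two vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def loss_gradient(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    residual = dot(w, x) - y
    return [residual * xi for xi in x]


def gradient_influence(w, train, query):
    """Data-influence stand-in: dot product of the training-example and
    query loss gradients (a first-order approximation; sign conventions
    vary between methods)."""
    return dot(loss_gradient(w, *train), loss_gradient(w, *query))


w = [1.0, 0.0]                # current model weights (made up)
query = ([1.0, 0.0], 0.0)     # (input, target) the model gets wrong
train_a = ([2.0, 0.0], 2.0)   # same direction as the query, already fit
train_b = ([1.0, 1.0], 3.0)   # less similar, but not yet fit

for name, ex in [("a", train_a), ("b", train_b)]:
    sim = cosine_similarity(query[0], ex[0])
    inf = gradient_influence(w, ex, query)
    print(name, round(sim, 3), inf)
```

Example `a` is maximally similar to the query yet has zero influence here, because the model already fits it and its gradient vanishes. This is one simple way the two kinds of measure can rank documents differently, and it also shows why influence costs more: it needs a gradient per document rather than a single vector comparison.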
