Brief · PulseAugur

RESEARCH · arXiv cs.CL English(EN) · 6d · [3 sources]

Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

A new research paper challenges the common view that self-training flattens language model output, arguing that it restructures language instead. The study found that surface-level linguistic features such as discourse markers increase, while deeper syntactic structures such as questions and passives decline. The paper's "Structural Depth Hypothesis" posits that the decay rate of a linguistic feature is determined primarily by its structural complexity, not just its frequency in the model's output.

IMPACT: Shows that self-training changes language model outputs unevenly across linguistic features, with implications for data curation and LLM text detection.

Pythia
Structural Depth Hypothesis
GPT-2
Pythia-410M
Pythia-2.8B
OPT-1.3B
GPT-2 124M
Pythia-1.4B
Self-Training