Brief · PulseAugur

RESEARCH · arXiv cs.CL · 4d · [2 sources]

Memorization Dynamics of Fill-in-the-Middle Pretraining

Researchers investigated how the fill-in-the-middle (FIM) pretraining objective affects language model memorization compared with standard left-to-right (LTR) training. Using Llama 3.2 models and a corpus containing repeated text, the study found that under FIM training, verbatim extraction scales linearly with the number of data repetitions. It also found that FIM recall depends strongly on prefix context, and that measuring memorization requires careful choice of span length and probe format.
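For readers unfamiliar with the objective, the following is a minimal sketch of how a document might be rearranged for FIM training and how a verbatim-recall check could look. It is not the study's code: the sentinel strings `<PRE>`, `<SUF>`, and `<MID>`, the prefix-suffix-middle ordering, the character-level split, and the helper names are all assumptions made for illustration.

```python
import random

# Hypothetical sentinel strings; real tokenizers define their own special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def split_spans(text, rng):
    """Cut text at two random character positions into (prefix, middle, suffix)."""
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    return text[:i], text[i:j], text[j:]

def to_fim_psm(text, rng):
    """Rearrange a document into prefix-suffix-middle order (an assumed format)."""
    prefix, middle, suffix = split_spans(text, rng)
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

def is_verbatim(generated, reference):
    """Exact-match test: does the generated span reproduce the reference span?"""
    return generated.strip() == reference.strip()

doc = "It was the best of times, it was the worst of times."
# Use identically seeded generators so both calls pick the same cut points.
prefix, middle, suffix = split_spans(doc, random.Random(0))
example = to_fim_psm(doc, random.Random(0))
```

An LTR model would see `doc` unchanged and predict each token from everything to its left. In the FIM arrangement the model is conditioned on both the prefix and the suffix before producing the middle, which is why the prefix context and the length of the masked span both matter when probing for memorization.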

IMPACT The findings clarify how the choice of pretraining objective (FIM versus LTR) shapes model memorization, which could guide future model development toward desired recall behavior.

Llama 3.2
FineWeb-Gutenberg corpus
Fill-in-the-Middle (FIM)
Left-to-right (LTR)