Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation
Researchers have developed Taylor-Calibrate, an initialization method for converting pretrained Transformer models into hybrid linear attention models. Converting a pretrained Transformer into a Gated DeltaNet student is brittle, largely because the student introduces new dynamic parameters that must be set somehow. Taylor-Calibrate sets them in a principled way: it uses Taylor-guided statistics of the teacher's attention to configure the student's value projections, memory timescales, and gating dynamics. The result is significantly stronger zero-shot students and effective conversion with fewer distillation tokens.
IMPACT: Improves the efficiency and quality of long-context inference models by simplifying conversion from standard Transformers.