PulseAugur / Brief

last 24h
[1/1] 222 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity

    Researchers have developed a statistical theory that frames multi-head attention (MHA) as an ensemble of Nadaraya-Watson kernel regression estimators. Under this framework, variance reduction in MHA depends on how decorrelated the outputs of different attention heads are, not merely on the number of heads. The authors introduce the Head Diversity Index (HDI) to measure this decorrelation and derive an optimal head-dimension allocation strategy, which suggests a new architectural scaling law: the optimal per-head dimension grows logarithmically with training set size.
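
    The two ideas in the summary can be illustrated with a minimal sketch. It is not the paper's method: the summary gives neither the HDI definition nor the allocation rule, so neither is reproduced here. The function names `nadaraya_watson` and `ensemble_variance` are hypothetical, the Gaussian kernel is chosen only for illustration, and the variance formula is the standard result for averaging equally correlated estimators.

```python
import math


def nadaraya_watson(query, keys, values, bandwidth=1.0):
    """Nadaraya-Watson estimate at a scalar `query` (illustrative only).

    The weights are a softmax over kernel scores, which is the same
    normalised-weight form used by an attention head. The Gaussian
    kernel here is an assumption made for the sketch.
    """
    scores = [-((query - k) ** 2) / (2.0 * bandwidth ** 2) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def ensemble_variance(sigma2, rho, num_heads):
    """Variance of the mean of `num_heads` estimators.

    Assumes every estimator has variance `sigma2` and every pair has
    the same correlation `rho` (a simplifying assumption):
        Var(mean) = sigma2 * (rho + (1 - rho) / num_heads)
    """
    return sigma2 * (rho + (1.0 - rho) / num_heads)


if __name__ == "__main__":
    # One "head": a kernel-weighted average of the values.
    print(nadaraya_watson(0.0, keys=[0.0, 0.0], values=[1.0, 3.0]))

    # Averaging more heads helps only to the extent they are decorrelated.
    for rho in (0.0, 0.5, 1.0):
        print(rho, [ensemble_variance(1.0, rho, h) for h in (1, 4, 16)])
```

    Under these assumptions, fully correlated heads (rho = 1) give no variance reduction however many are added, uncorrelated heads (rho = 0) reduce variance as 1/H, and in between the variance cannot fall below rho times the single-head variance. This is the sense in which decorrelation, and not head count alone, drives the benefit.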

    IMPACT: Provides a theoretical basis for understanding and optimizing attention mechanisms in large language models.