Brief · PulseAugur

RESEARCH · arXiv cs.LG English(EN) · 20h · [2 sources]

Complementary Attention Head Pruning for Efficient Transformers

Researchers have introduced Complementary Attention Head Pruning (CAHP), a post-hoc pruning framework for making Transformer models more efficient. Existing methods often rely on unstable gradient-based rankings or manual tuning; CAHP instead treats head selection as a global graph-theoretic problem. It combines graph-based clustering with information-theoretic measures to identify a diverse, topologically sound subset of attention heads, and it determines the number of heads to keep in each layer automatically. In evaluations on the SST-5 and MNLI benchmarks, CAHP outperforms the other methods, especially at high compression, because it preserves critical intermediate-layer heads rather than only those near the output.
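The brief does not spell out CAHP's actual procedure, so the following is only a minimal sketch of the general idea of graph-based, information-theoretic head selection, not the authors' algorithm. It assumes each head is summarized by one averaged attention distribution, uses Jensen-Shannon divergence as the similarity measure, links heads whose divergence falls below a hypothetical `threshold`, and keeps one medoid per connected component. The names `js_divergence`, `select_heads`, and `example_heads` are illustrative.

```python
import math
from collections import defaultdict


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0 here.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def select_heads(heads, threshold=0.1):
    """Pick one representative head per cluster of near-duplicate heads.

    heads: dict mapping (layer, head) -> attention distribution (assumed input).
    threshold: hypothetical redundancy cutoff on the JS divergence.
    Returns the sorted list of (layer, head) keys to keep.
    """
    keys = sorted(heads)

    # Redundancy graph over all heads in all layers (a global view).
    adjacency = defaultdict(set)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if js_divergence(heads[a], heads[b]) < threshold:
                adjacency[a].add(b)
                adjacency[b].add(a)

    # Connected components stand in for the clustering step.
    seen, kept = set(), []
    for start in keys:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)

        # Keep the medoid: smallest total divergence to its cluster,
        # with ties broken by the (layer, head) key.
        def cost(h):
            return sum(js_divergence(heads[h], heads[o]) for o in component)

        kept.append(min(component, key=lambda h: (cost(h), h)))
    return sorted(kept)


# Toy data: two near-duplicate pairs and one distinct head.
example_heads = {
    (0, 0): [0.70, 0.20, 0.10],
    (0, 1): [0.68, 0.22, 0.10],
    (0, 2): [0.10, 0.10, 0.80],
    (1, 0): [0.10, 0.80, 0.10],
    (1, 1): [0.12, 0.78, 0.10],
}
kept_heads = select_heads(example_heads)
```

In this toy setup the number of heads kept per layer is not fixed in advance; it falls out of how many clusters have their representative in that layer, which loosely mirrors the automatic per-layer budget described above.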

IMPACT This method could make large Transformer models practical to deploy in resource-constrained environments.

Yaniv Livertovsky
Complementary Attention Head Pruning
transformer
MNLI
arXiv
Hugging Face