Complementary Attention Head Pruning for Efficient Transformers
Researchers have introduced Complementary Attention Head Pruning (CAHP), a post-hoc framework for making Transformer models more efficient. Unlike existing methods, which often rely on unstable gradient-based rankings or manual tuning, CAHP treats head selection as a global graph-theoretic problem. It uses graph-based clustering and information-theoretic measures to identify a diverse, topologically sound subset of attention heads, and it determines the number of heads to keep in each layer automatically. Evaluations on the SST-5 and MNLI benchmarks show that CAHP outperforms other pruning methods, especially at high compression rates, by preserving critical intermediate-layer heads rather than only those near the output.
AI IMPACT: This method could enable the deployment of large Transformer models in resource-constrained environments, expanding their applicability.
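
To make the idea of graph-based selection of complementary heads more concrete, here is a minimal Python sketch. It is not the CAHP algorithm, whose details the summary above does not give. As stand-ins, it assumes that each head is summarized by a vector of activations, that redundancy is measured by absolute Pearson correlation, that clusters are the connected components of a thresholded similarity graph, and that variance serves as a crude proxy for how informative a head is. The function name select_complementary_heads, the threshold value, and the synthetic data are all hypothetical.

```python
import numpy as np


def select_complementary_heads(acts, threshold=0.8):
    """Keep one representative head per cluster of redundant heads.

    acts: array of shape (n_heads, n_samples); row h holds a scalar summary
          of head h's output on each sample (a hypothetical input format).
    threshold: absolute correlation at or above which two heads are linked.
    Returns (kept_head_indices, cluster_label_per_head).
    """
    n_heads = acts.shape[0]

    # Similarity graph: heads are nodes, edges join strongly correlated heads.
    similarity = np.abs(np.corrcoef(acts))
    adjacency = similarity >= threshold

    # Cluster the graph by finding connected components with a depth-first search.
    labels = [-1] * n_heads
    n_clusters = 0
    for start in range(n_heads):
        if labels[start] != -1:
            continue
        labels[start] = n_clusters
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in np.flatnonzero(adjacency[node]):
                if labels[neighbor] == -1:
                    labels[neighbor] = n_clusters
                    stack.append(int(neighbor))
        n_clusters += 1

    # Within each cluster, keep the head with the largest variance
    # (an assumed stand-in for an information-theoretic score).
    variances = acts.var(axis=1)
    kept = []
    for cluster in range(n_clusters):
        members = [h for h in range(n_heads) if labels[h] == cluster]
        kept.append(max(members, key=lambda h: variances[h]))
    return sorted(kept), labels


# Synthetic example: heads 0/1 are near-duplicates, as are heads 2/3;
# head 4 is independent of the rest.
rng = np.random.default_rng(0)
a, b, c = rng.standard_normal((3, 2000))
noise = rng.standard_normal((2, 2000))
acts = np.stack([2.0 * a, a + 0.05 * noise[0], b, 3.0 * b + 0.05 * noise[1], c])

kept, labels = select_complementary_heads(acts)
print(kept)    # [0, 3, 4]
print(labels)  # [0, 0, 1, 1, 2]
```

In this sketch the number of heads kept is not set by hand: it falls out of the number of clusters, which loosely mirrors the summary's point that the per-layer head count is determined automatically. A real implementation would need a principled similarity measure and clustering method in place of the assumptions above.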