Brief · PulseAugur

RESEARCH · Hugging Face Daily Papers English(EN) · 3d · [3 sources]

Chiaroscuro Attention: Spending Compute in the Dark

Researchers have introduced CHIAR-Former, a 4-layer transformer that allocates compute by routing tokens dynamically. Instead of applying self-attention uniformly, it uses each token's spectral entropy to send the token to one of three operators: DCT spectral mixing, RBF kernel mixing, or full self-attention. On WikiText-103, a large-scale naturalistic text benchmark, the model is reported to improve perplexity by 45% while using 62.5% fewer attention FLOPs than a standard transformer.
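The routing idea can be illustrated with a small sketch. This is not the authors' implementation: the brief does not say how spectral entropy is computed, what the thresholds are, or which entropy range maps to which operator. The sketch assumes spectral entropy means the Shannon entropy of a token vector's normalized DCT power spectrum, and it assumes low-entropy tokens go to the cheapest operator. The function names (`dct2`, `spectral_entropy`, `route`) and the thresholds `low` and `high` are hypothetical.

```python
import math


def dct2(x):
    """Unnormalized DCT-II of a real-valued sequence (direct O(n^2) form)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def spectral_entropy(vec):
    """Shannon entropy of the normalized DCT power spectrum, scaled to [0, 1].

    Assumption: this is one plausible reading of "token spectral entropy";
    the brief does not give the exact definition.
    """
    power = [c * c for c in dct2(vec)]
    total = sum(power)
    if total == 0.0:
        return 0.0  # an all-zero vector carries no spectral energy
    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(vec))  # divide by the maximum entropy, log(n)


def route(vec, low=0.4, high=0.8):
    """Pick an operator name for one token vector.

    Assumption: the thresholds and the low-to-cheap mapping are illustrative only.
    """
    h = spectral_entropy(vec)
    if h < low:
        return "dct"        # energy concentrated in few frequencies
    if h < high:
        return "rbf"        # intermediate case
    return "attention"      # energy spread across frequencies


if __name__ == "__main__":
    flat_token = [1.0] * 8            # all energy in the DC coefficient
    spiky_token = [1.0] + [0.0] * 7   # energy spread over many coefficients
    for name, token in [("flat", flat_token), ("spiky", spiky_token)]:
        print(name, round(spectral_entropy(token), 3), route(token))
```

Under these assumptions, only tokens whose spectrum is spread out pay for full self-attention, which is the mechanism the brief credits for the reduction in attention FLOPs.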

IMPACT Introduces a token-routing method that reduces attention compute for transformers on large text datasets.

IMDB
WikiText-103
WikiText-2
ListOps
Chiaroscuro Attention
CHIAR-Former
RBF kernel mixing
DCT spectral mixing
self-attention
IMDB sentiment classification