Chiaroscuro Attention: Spending Compute in the Dark
Researchers have developed CHIAR-Former, a 4-layer transformer that saves compute by routing tokens dynamically. Instead of applying self-attention uniformly, CHIAR-Former measures each token's spectral entropy and sends the token to one of three operators: DCT spectral mixing, RBF kernel mixing, or full self-attention. On large-scale naturalistic text, the approach improves perplexity on WikiText-103 by 45% while using 62.5% fewer attention FLOPs than a standard transformer.
AI IMPACT: Introduces a method to significantly reduce computational cost for transformers on large text datasets.
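To make the routing idea concrete, here is a minimal Python sketch of entropy-based token routing. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say how spectral entropy is computed, which entropy range maps to which operator, or what thresholds are used. The sketch assumes the entropy is the normalized Shannon entropy of a token embedding's FFT power spectrum, and that low-entropy tokens go to DCT mixing, mid-entropy tokens to RBF mixing, and high-entropy tokens to full attention. The names `spectral_entropy`, `route_tokens`, `low`, and `high` are hypothetical.

```python
import numpy as np

# Operator order is an assumption for illustration only.
OPERATORS = ("dct", "rbf", "attention")


def spectral_entropy(x):
    """Normalized Shannon entropy of each row's FFT power spectrum, in [0, 1]."""
    power = np.abs(np.fft.rfft(x, axis=-1)) ** 2
    total = power.sum(axis=-1, keepdims=True)
    p = power / np.maximum(total, 1e-12)      # guard against all-zero rows
    logp = np.log(np.where(p > 0, p, 1.0))    # treat 0 * log(0) as 0
    h = -(p * logp).sum(axis=-1)
    return h / np.log(power.shape[-1])        # divide by the maximum entropy


def route_tokens(x, low=0.5, high=0.8):
    """Return an index into OPERATORS for every token (row) of x.

    The thresholds `low` and `high` are hypothetical, not taken from the paper.
    """
    h = spectral_entropy(x)
    # 0 if h < low, 1 if low <= h < high, 2 if h >= high
    return np.digitize(h, [low, high])


d = 64
n = np.arange(d)
tone = np.cos(2 * np.pi * 4 * n / d)          # one frequency: very low entropy
mix = sum(np.cos(2 * np.pi * k * n / d) for k in range(1, 9))  # eight frequencies
impulse = np.zeros(d)
impulse[0] = 1.0                              # flat spectrum: maximum entropy
tokens = np.stack([tone, mix, impulse])

routes = route_tokens(tokens)
print([OPERATORS[i] for i in routes])
```

With these toy inputs and thresholds, the single-tone token falls in the lowest band, the eight-tone mixture in the middle band, and the impulse in the highest band. In a real model the routed groups would then be processed by the three operators separately, so that only the tokens sent to full attention pay its quadratic cost.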