Chiaroscuro Attention optimizes transformer compute with dynamic token routing

By PulseAugur Editorial · [3 sources] · 2026-06-06 00:00

Researchers have proposed CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that reduces compute by routing tokens dynamically. Instead of applying self-attention uniformly at every layer and token, CHIAR-Former uses each token's spectral entropy to route it to one of three operators: DCT spectral mixing, RBF kernel mixing, or full self-attention. On WikiText-103, a large-scale naturalistic text benchmark, the model is reported to achieve a 45% perplexity improvement with 62.5% fewer attention FLOPs than a standard transformer.
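
To make the routing idea concrete, here is a minimal Python sketch of entropy-based token routing. It is an illustration under stated assumptions, not the authors' implementation. The sources do not specify how spectral entropy is computed or how routing decisions are made, so the sketch assumes the entropy is taken over the DCT-II power spectrum of each token's feature vector, that low-entropy tokens go to the cheapest operator and high-entropy tokens to full self-attention, and that two fixed thresholds separate the routes. The function names and the threshold values 0.4 and 0.7 are hypothetical.

```python
import numpy as np

def spectral_entropy(x, eps=1e-12):
    """Normalized Shannon entropy of each token's DCT-II power spectrum.

    x has shape (num_tokens, d). Returns values in [0, 1] with shape (num_tokens,).
    Assumption: the paper's exact entropy definition may differ from this one.
    """
    x = np.asarray(x, dtype=np.float64)
    d = x.shape[-1]
    n = np.arange(d)
    # Unnormalized DCT-II basis: basis[k, n] = cos(pi * (n + 0.5) * k / d)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / d)
    power = (x @ basis.T) ** 2
    p = power / (power.sum(axis=-1, keepdims=True) + eps)
    h = -(p * np.log(p + eps)).sum(axis=-1) / np.log(d)
    return np.clip(h, 0.0, 1.0)

def route_tokens(x, low=0.4, high=0.7):
    """Assign each token a route id from its spectral entropy.

    0 = DCT spectral mixing, 1 = RBF kernel mixing, 2 = full self-attention.
    The thresholds and the entropy-to-operator ordering are hypothetical.
    """
    return np.digitize(spectral_entropy(x), [low, high])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(8, 16))  # 8 toy tokens with 16 features each
    print(route_tokens(tokens))
```

In this sketch a token whose energy sits in a few DCT coefficients gets a low entropy and a cheap operator, while a token with a broad spectrum is sent to self-attention. Any attention FLOP savings would then depend on how many tokens fall below the upper threshold.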

IMPACT Introduces a method to significantly reduce computational cost for transformers on large text datasets.

RANK_REASON The cluster contains a research paper detailing a novel model architecture and its performance evaluation.

AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

arXiv cs.AI TIER_1 English(EN) · Prateek Kumar Sikdar · 2026-06-09 04:00

Chiaroscuro Attention: Spending Compute in the Dark

arXiv:2606.08327v1 (cross-listed). Abstract: Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer…

arXiv cs.AI TIER_1 English(EN) · Prateek Kumar Sikdar · 2026-06-06 20:38

Chiaroscuro Attention: Spending Compute in the Dark

Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-06 00:00

Chiaroscuro Attention: Spending Compute in the Dark

CHIAR-Former uses spectral entropy-based routing to dynamically select between DCT, RBF, and self-attention operators, achieving improved efficiency on large text datasets while maintaining performance through hybrid attention mechanisms.
