This paper investigates the function of "sinks" and diagonal patterns within transformer attention mechanisms. The authors analyze the geometric conditions required for sinks to exist and demonstrate that they are equivalent to hard attention switches. The study also refines the understanding of how sinks prevent oversmoothing, showing that dense attention can be smoother than sparse attention under certain conditions. Finally, it compares the representational cost of sinks versus diagonal patterns, explaining why pretrained transformers favor sinks.
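To make the two patterns concrete, here is a minimal sketch (not taken from the paper) of how a "sink" versus a diagonal pattern looks in softmax attention weights. The sequence length, logit scale, and the choice of token 0 as the sink are illustrative assumptions.

```python
# Hypothetical illustration: an attention "sink" vs. a diagonal pattern.
# Token 0 plays the sink role by receiving a large logit from every query,
# so most attention mass collapses onto it; the diagonal pattern instead
# concentrates each query's mass on its own position.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n = 5  # sequence length (illustrative)
rng = np.random.default_rng(0)
logits = rng.normal(scale=0.1, size=(n, n))  # small "content" scores

# Sink pattern: every query assigns a large score to token 0.
sink_attn = softmax(logits + 4.0 * np.eye(n)[:, [0] * n].T * 0 + np.c_[np.full(n, 4.0), np.zeros((n, n - 1))])

# Diagonal pattern: each query assigns a large score to its own position.
diag_attn = softmax(logits + 4.0 * np.eye(n))

print("sink pattern (column 0 dominates):\n", sink_attn.round(2))
print("diagonal pattern (identity-like):\n", diag_attn.round(2))
```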
Impact: Provides theoretical insights into transformer architecture, potentially informing future model design and optimization.