Do Transformers Need Three Projections? Systematic Study of QKV Variants
Researchers have systematically studied variants of the Transformer's query, key, and value (QKV) projections with the aim of reducing memory usage. They found that sharing projections, particularly in the Q-K=V variant, where keys and values share a single projection, can substantially shrink the KV cache with minimal impact on performance. These projection-sharing techniques can be combined with existing head-sharing methods such as grouped-query attention (GQA) and multi-query attention (MQA) for further cache reductions, making on-device inference more feasible.
AI IMPACT: Projection sharing shrinks the KV cache, lowering inference memory requirements and making on-device deployment more practical.
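To make the cache arithmetic concrete, here is a minimal back-of-the-envelope sketch. It is not taken from the study: it assumes a conventional cache that stores one key tensor and one value tensor per layer, and that a K=V variant needs to store only one of them. The function name `kv_cache_bytes` and the model dimensions (32 layers, 32 heads, head size 128, 4096-token context, 2-byte elements) are hypothetical values chosen for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2, shared_kv=False):
    """Estimate KV cache size in bytes (illustrative assumption, not the paper's code).

    shared_kv=True models a K=V variant: keys and values are the same tensor,
    so only one tensor per layer is cached instead of two.
    """
    tensors_per_layer = 1 if shared_kv else 2
    return (tensors_per_layer * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model dimensions, chosen only to make the arithmetic easy.
dims = dict(n_layers=32, head_dim=128, seq_len=4096)

# Head sharing changes the number of cached KV heads:
# MHA keeps all 32, GQA (here) keeps 8 groups, MQA keeps 1.
kv_heads = {"MHA": 32, "GQA": 8, "MQA": 1}

baseline = kv_cache_bytes(n_kv_heads=kv_heads["MHA"], **dims)

for name, heads in kv_heads.items():
    for shared in (False, True):
        size = kv_cache_bytes(n_kv_heads=heads, shared_kv=shared, **dims)
        label = f"{name} + K=V" if shared else name
        print(f"{label:10s} {size / 2**20:8.1f} MiB  ({baseline // size}x smaller)")
```

Under these assumptions the two savings multiply: head sharing divides the cache by the ratio of query heads to KV heads, and sharing the key and value projection halves whatever remains. Actual savings depend on the model and on how a given variant is implemented.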