research · [3 sources] · 2026-05-19 10:53

OScaR framework slashes LLM KV cache memory, boosts speed

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 3 sources

Researchers have developed OScaR, a framework for compressing the Key-Value (KV) cache in Large Language Models (LLMs). Compression matters because long-context reasoning and multi-modal workloads keep increasing the cache's memory demands. OScaR targets a limitation of existing per-channel quantization methods, token norm imbalance, and mitigates it with two techniques, Canalized Rotation and Omni-Token Scaling, achieving near-lossless performance even at INT2 quantization. Reported gains include up to a 3.0x decoding speedup and a 5.3x reduction in memory footprint.
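To make the token norm imbalance problem concrete, here is a minimal sketch in plain Python. It is not OScaR's Canalized Rotation or Omni-Token Scaling, whose details the summary does not give. It assumes synthetic keys with a single high-norm token, a generic asymmetric per-channel INT2 quantizer, and simple per-token L2 normalization as a stand-in remedy. All function names and parameters are hypothetical.

```python
import math
import random

LEVELS = 4  # INT2 gives 2**2 = 4 quantization levels per value


def quantize_per_channel(rows):
    """Asymmetric uniform INT2 quantization: one (min, step) pair per channel (column)."""
    n_channels = len(rows[0])
    lo = [min(r[c] for r in rows) for c in range(n_channels)]
    hi = [max(r[c] for r in rows) for c in range(n_channels)]
    # Fall back to a step of 1.0 for a constant channel, which would give a zero step.
    step = [(hi[c] - lo[c]) / (LEVELS - 1) or 1.0 for c in range(n_channels)]
    codes = [
        [min(LEVELS - 1, max(0, round((r[c] - lo[c]) / step[c]))) for c in range(n_channels)]
        for r in rows
    ]
    return codes, lo, step


def dequantize(codes, lo, step):
    return [[q * step[c] + lo[c] for c, q in enumerate(row)] for row in codes]


def l2(v):
    return math.sqrt(sum(x * x for x in v))


def mean_relative_error(original, restored):
    """Average over tokens of ||k - k_hat|| / ||k||."""
    errs = [
        l2([a - b for a, b in zip(k, k_hat)]) / l2(k)
        for k, k_hat in zip(original, restored)
    ]
    return sum(errs) / len(errs)


def make_keys(n_tokens=64, n_channels=16, outlier_gain=50.0, seed=0):
    """Synthetic keys (an assumption): all tokens are small except token 0."""
    rng = random.Random(seed)
    keys = [[rng.uniform(-1.0, 1.0) for _ in range(n_channels)] for _ in range(n_tokens)]
    keys[0] = [outlier_gain * x for x in keys[0]]
    return keys


def plain_roundtrip(keys):
    codes, lo, step = quantize_per_channel(keys)
    return dequantize(codes, lo, step)


def token_scaled_roundtrip(keys):
    """Divide each token by its L2 norm, quantize per channel, then restore the norm."""
    norms = [l2(k) for k in keys]
    unit = [[x / n for x in k] for k, n in zip(keys, norms)]
    restored = plain_roundtrip(unit)
    return [[x * n for x in k_hat] for k_hat, n in zip(restored, norms)]


keys = make_keys()
err_plain = mean_relative_error(keys, plain_roundtrip(keys))
err_scaled = mean_relative_error(keys, token_scaled_roundtrip(keys))
print(f"per-channel only:       {err_plain:.3f}")
print(f"with per-token scaling: {err_scaled:.3f}")
```

In this toy setup the single high-norm token stretches every channel's range, so the ordinary tokens collapse onto one or two of the four INT2 levels. Normalizing each token first keeps the ranges tight, at the cost of storing one extra scale per token.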

IMPACT Enables more efficient deployment of LLMs with long contexts and multi-modal capabilities by reducing memory bottlenecks.

RANK_REASON The cluster contains a research paper detailing a new method for optimizing LLM KV cache quantization.

paper
infra

COVERAGE [3]

arXiv cs.AI TIER_1 · Shimon Vainer · 2026-05-20 14:19

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytical…

arXiv cs.CL TIER_1 · Ngai Wong · 2026-05-19 10:53

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic chan…

Towards AI TIER_1 · Armin Norouzi, Ph.D · 2026-05-19 22:01

KV Cache Internals: How Transformers Avoid Recomputing Attention
