
Nvidia, NYU, and Together AI advance KV cache compression and throughput

By the PulseAugur editorial team · [1 source] · 2026-06-18 10:52

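Researchers from Nvidia and New York University have developed TurboQuant, a KV cache compression method that achieves the theoretical optimum at 3-4 bits. Meanwhile, Together AI's OSCAR system delivers an 8x throughput gain through attention-aware rotation. Apple's EpiCache addresses a separate problem, and the three techniques have proven complementary rather than competing.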

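Impact: These advances in KV cache compression and throughput optimization could make AI model inference faster and more efficient, lowering compute costs.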

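Ranking rationale: This cluster describes new research in AI infrastructure, focused on KV cache compression and throughput optimization techniques. [lever_c_demoted from research: ic=1 ai=1.0]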

Read on Mastodon — sigmoid.social →

Infrastructure

AI-generated summary · Google Gemini · from 1 source. How we write summaries →

Sources [1]

Mastodon — sigmoid.social TIER_1 English(EN) · [email protected] · 2026-06-18 10:52


Nvidia and NYU's TurboQuant achieves theoretical optimal KV cache compression at 3-4 bits, while Together AI's OSCAR delivers 8x throughput gains through attention-aware rotation. Apple's EpiCache handles a distinct problem. The three approaches prove more complementary than comp…

