Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection

New Qrita algorithm improves LLM sampling efficiency

By the PulseAugur editorial team · [1 source] · 2026-05-27 04:00

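Researchers have developed Qrita, a new algorithm designed to make Top-k and Top-p sampling in large language models more efficient. By combining Gaussian-based sigma truncation with a quaternary pivot search, Qrita substantially reduces the search space and memory usage while guaranteeing deterministic output. The method has been integrated into vLLM as the default sampler and delivers 1.4x higher serving throughput than existing high-performance LLM execution engines.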
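The two ideas named in the summary can be illustrated with a small CPU sketch. This is not the authors' implementation or the vLLM sampler: the function names `top_k_threshold` and `kth_largest_by_pivots`, the `margin` parameter, and the bucket-based narrowing are all assumptions made for illustration. The sketch assumes the logits are roughly Gaussian, discards everything below a mean-plus-z-sigma cutoff, then locates the k-th largest survivor by repeatedly splitting the value range into four with three pivots instead of sorting.

```python
from statistics import NormalDist, fmean, pstdev


def kth_largest_by_pivots(cand, need):
    """Return the need-th largest value in cand without a full sort.

    Hypothetical helper: each round splits [min, max] into four value
    ranges using three pivots and keeps only the range holding the answer.
    """
    cand = list(cand)
    while True:
        lo, hi = min(cand), max(cand)
        if lo == hi:  # only tied values remain
            return lo
        step = (hi - lo) / 4.0
        p1, p2, p3 = lo + step, lo + 2 * step, lo + 3 * step
        buckets = (  # ordered from highest values to lowest
            [v for v in cand if v >= p3],
            [v for v in cand if p2 <= v < p3],
            [v for v in cand if p1 <= v < p2],
            [v for v in cand if v < p1],
        )
        for bucket in buckets:
            if need <= len(bucket):
                if len(bucket) == len(cand):
                    # No progress (values too close for float pivots):
                    # fall back to sorting the small remainder.
                    return sorted(cand, reverse=True)[need - 1]
                cand = bucket
                break
            need -= len(bucket)  # everything in this bucket ranks higher


def top_k_threshold(logits, k, margin=0.5):
    """Return the k-th largest logit; keep entries >= it for Top-k.

    `margin` (an assumed tuning knob) loosens the Gaussian cutoff so that
    at least k values usually survive the truncation step.
    """
    n = len(logits)
    if not 1 <= k <= n:
        raise ValueError("k must be in [1, len(logits)]")
    if k == n:
        return min(logits)
    mu, sigma = fmean(logits), pstdev(logits)
    # z-score above which a fraction k/n of a Gaussian is expected to lie
    z = NormalDist().inv_cdf(1.0 - k / n)
    cutoff = mu + (z - margin) * sigma
    survivors = [v for v in logits if v >= cutoff]
    if len(survivors) < k:  # truncation was too aggressive: use all logits
        survivors = logits
    return kth_largest_by_pivots(survivors, k)


logits = [2.0, -1.0, 0.5, 3.0, 0.0]
threshold = top_k_threshold(logits, 2)
kept = [v for v in logits if v >= threshold]  # the Top-2 logits
```

Because every step is a comparison or a count over the same values, the result does not depend on random pivots or iteration order, which is one simple way to obtain the deterministic output the summary mentions. A Top-p variant would search for a probability threshold in the same way, narrowing on cumulative mass rather than on a count.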

Impact: Faster LLM inference and a smaller memory footprint, which may lower operating costs.

Ranking rationale: This cluster contains a research paper describing a new algorithm for LLM sampling.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Jongseok Park, Sunga Kim, Alvin Cheung, Ion Stoica · 2026-05-27 04:00

Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection

arXiv:2602.01518v2. Abstract: Despite their importance in model sampling, efficient implementation of Top-k and Top-p algorithms for large vocabularies remains a significant challenge. Existing approaches often rely on sorting, which incurs significant compu…
