Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection
Researchers have developed Qrita, an algorithm designed to make Top-k and Top-p sampling in large language models more efficient. By combining Gaussian-based sigma-truncation with a quaternary pivot search, Qrita reduces both the search space and memory usage while keeping outputs deterministic. The method has been integrated into vLLM as the default sampler and offers up to a 1.4x improvement in serving throughput compared to existing high-performance LLM execution engines.
AI IMPACT: Improves LLM inference speed and reduces memory footprint, potentially lowering operational costs.
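
To make the two ideas named in the summary more concrete, here is a minimal pure-Python sketch of how a Top-k cutoff could be found without sorting the whole vocabulary. It is an illustration under stated assumptions, not Qrita's actual implementation or vLLM's code: the function names `topk_threshold` and `kth_largest_pivot`, the `margin` safety factor, and the `small` fallback size are all hypothetical, and the logits are assumed to be finite. The first step uses a Gaussian estimate (mean plus a multiple of the standard deviation) to discard most of the vocabulary; the second step narrows the remaining candidates with three pivots that split the value range into four buckets.

```python
from statistics import NormalDist, fmean, pstdev


def kth_largest_pivot(values, k, small=32):
    """Return the k-th largest value (1-indexed) via 4-way pivot narrowing.

    Illustrative only; `small` is a hypothetical size below which we just sort.
    """
    cand = list(values)
    if not 1 <= k <= len(cand):
        raise ValueError("k must be between 1 and len(values)")
    need = k  # rank of the target inside the current candidate set
    while len(cand) > small:
        lo, hi = min(cand), max(cand)
        if lo == hi:  # every remaining candidate is tied
            return lo
        span = hi - lo
        # Three pivots split [lo, hi] into four value buckets.
        p1, p2, p3 = lo + 0.25 * span, lo + 0.5 * span, lo + 0.75 * span
        c3 = sum(v >= p3 for v in cand)
        c2 = sum(v >= p2 for v in cand)
        c1 = sum(v >= p1 for v in cand)
        # Keep only the bucket that contains the target rank.
        if need <= c3:
            cand = [v for v in cand if v >= p3]
        elif need <= c2:
            cand = [v for v in cand if p2 <= v < p3]
            need -= c3
        elif need <= c1:
            cand = [v for v in cand if p1 <= v < p2]
            need -= c2
        else:
            cand = [v for v in cand if v < p1]
            need -= c1
    # Few candidates left: an exact sort is cheap and deterministic.
    return sorted(cand, reverse=True)[need - 1]


def topk_threshold(logits, k, margin=2.0):
    """Return the smallest logit that survives Top-k filtering.

    `margin` is a hypothetical safety factor: the Gaussian cutoff is chosen
    so that roughly margin * k values are expected to lie above it.
    """
    x = list(logits)
    if not 1 <= k <= len(x):
        raise ValueError("k must be between 1 and len(logits)")
    cand = x
    frac = margin * k / len(x)
    if frac < 1.0:
        # Sigma-truncation: treat logits as roughly Gaussian, cut the lower tail.
        z = NormalDist().inv_cdf(1.0 - frac)
        cutoff = fmean(x) + z * pstdev(x)
        trimmed = [v for v in x if v >= cutoff]
        # If the Gaussian guess cut too deep, fall back to the full list.
        if len(trimmed) >= k:
            cand = trimmed
    return kth_largest_pivot(cand, k)
```

Calling `topk_threshold(logits, 50)` returns a cutoff value, and keeping every logit greater than or equal to it reproduces an ordinary Top-k filter (ties at the cutoff are all kept). Because the truncated set is only used when it still holds at least `k` values, the result matches a full sort even when the logits are far from Gaussian; the estimate only affects how much work is skipped. The sketch covers Top-k only; Top-p is not shown.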