Brief · PulseAugur

TOOL · arXiv cs.AI · 5d

Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

Researchers have introduced Quant.npu, a fully static quantization framework for running large language models efficiently on mobile Neural Processing Units (NPUs). It targets the incompatibility between existing dynamic quantization techniques and NPU hardware by making both the quantization parameters and the rotation matrices learnable. Quant.npu also adds a tailored initialization strategy and a two-stage optimization pipeline to stabilize training and adapt to diverse activation profiles. The reported result is up to 15.1% lower inference latency, with accuracy comparable to current state-of-the-art approaches.
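For readers unfamiliar with the terms, the difference between dynamic and static quantization, and the role of a rotation matrix, can be sketched in a few lines of NumPy. This is an illustrative toy, not Quant.npu's implementation: the helper names (`quantize`, `dynamic_scale`, `static_scale`, `random_rotation`, `make_batch`) are hypothetical, the rotation is random rather than learned, and the scale comes from simple max calibration rather than training.

```python
import numpy as np

def quantize(x, scale, bits=8):
    """Symmetric uniform quantization: round to an integer grid, clip, dequantize."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def dynamic_scale(x, bits=8):
    """Dynamic quantization: the scale is recomputed from each input at run time."""
    qmax = 2 ** (bits - 1) - 1
    return np.abs(x).max() / qmax

def static_scale(calibration_batches, bits=8):
    """Static quantization: one scale is fixed offline and reused for every input."""
    qmax = 2 ** (bits - 1) - 1
    return max(np.abs(b).max() for b in calibration_batches) / qmax

def random_rotation(dim, rng):
    """Random orthogonal matrix via QR (assumption: stands in for a learned rotation)."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def make_batch(rng, rows=8, dim=64):
    """Toy activations with one outlier channel, a common difficulty for quantization."""
    x = rng.standard_normal((rows, dim))
    x[:, 3] *= 50.0
    return x

rng = np.random.default_rng(0)
dim = 64
x = make_batch(rng, dim=dim)
w = rng.standard_normal((dim, dim))
r = random_rotation(dim, rng)

# Because r is orthogonal, (x r)(r^T w) equals x w in exact arithmetic,
# so the rotation changes what gets quantized without changing the layer output.
exact = x @ w
rotated = (x @ r) @ (r.T @ w)

# Static path: the scale is chosen once from calibration data, then frozen.
calibration = [make_batch(rng, dim=dim) @ r for _ in range(4)]
s_static = static_scale(calibration)
y_static = quantize(x @ r, s_static) @ (r.T @ w)

# Dynamic path: the scale depends on the current input, which needs run-time
# reductions over the activations.
s_dynamic = dynamic_scale(x)
y_dynamic = quantize(x, s_dynamic) @ w

err_static = np.abs(y_static - exact).mean()
err_dynamic = np.abs(y_dynamic - exact).mean()
```

In this sketch the rotation spreads the outlier channel across all dimensions, which lowers the peak activation magnitude and lets a single fixed scale cover the range. The brief says only that Quant.npu learns its quantization parameters and rotations; how it does so is not described here.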

IMPACT Enables more efficient deployment of large language models on mobile devices, potentially improving user experience and expanding on-device AI capabilities.

Large language models
Neural Processing Units
Quant.npu