Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite
Researchers have developed an energy-efficient Retrieval-Augmented Generation (RAG) pipeline that runs entirely on a mobile Neural Processing Unit (NPU), specifically the Qualcomm Hexagon NPU in the Snapdragon X Elite. The system is reported to significantly outperform CPU and GPU baselines on speed, latency, and energy consumption for both indexing and query processing. Evaluations indicate that the NPU-accelerated pipeline achieves answer quality comparable to the CPU and GPU baselines, suggesting a viable path toward private, low-latency, and sustainable on-device AI applications.
IMPACT Could enable practical, private, low-latency AI applications on edge devices while keeping answer quality comparable to CPU and GPU baselines.