Google AI has developed a method to accelerate on-device large language models (LLMs) such as Gemini Nano and Gemma, particularly on Google Pixel phones. The technique, called Multi-Token Prediction (MTP), retrofits a drafting head onto an existing, frozen model. The head lets the model generate several tokens per step instead of one token at a time, which significantly improves inference speed and energy efficiency without requiring a separate, memory-intensive drafter model.
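The draft-then-check idea behind a drafting head can be illustrated with a small toy. This is only a sketch under stated assumptions, not Google's implementation: the summary does not describe how drafted tokens are validated, so the verification loop below follows the general speculative-decoding pattern, and every name (`base_next_token`, `draft_head`, `generate_with_drafts`, `generate_greedy`) is hypothetical. The "models" are plain deterministic functions standing in for neural networks.

```python
from typing import List, Tuple

VOCAB_SIZE = 50  # toy vocabulary size (assumption for illustration)


def base_next_token(context: List[int]) -> int:
    """Stand-in for the frozen base model: a deterministic greedy next token."""
    return (sum(context) * 31 + len(context)) % VOCAB_SIZE


def draft_head(context: List[int], k: int) -> List[int]:
    """Stand-in for a drafting head: proposes k tokens at once.

    A real head would be a small, cheap network. This toy reuses the base
    function and injects an occasional wrong guess so rejection is exercised.
    """
    ctx = list(context)
    draft = []
    for _ in range(k):
        guess = base_next_token(ctx)
        if len(ctx) % 4 == 3:  # deliberately wrong at some positions
            guess = (guess + 1) % VOCAB_SIZE
        draft.append(guess)
        ctx.append(guess)
    return draft


def generate_greedy(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one base-model step per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(base_next_token(tokens))
    return tokens


def generate_with_drafts(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Draft k tokens, keep the prefix the base model agrees with.

    Returns the generated tokens and the number of verification rounds.
    In a real system each round would be one batched base-model pass.
    """
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < n_new:
        draft = draft_head(tokens, k)
        rounds += 1
        ctx = list(tokens)
        for guess in draft:
            target = base_next_token(ctx)
            ctx.append(target)  # always keep the base model's token
            if guess != target:
                break  # first mismatch: discard the rest of the draft
        else:
            ctx.append(base_next_token(ctx))  # full match earns one extra token
        tokens = ctx
    return tokens[: len(prompt) + n_new], rounds


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast, rounds = generate_with_drafts(prompt, n_new=20, k=4)
    slow = generate_greedy(prompt, n_new=20)
    print("same output:", fast == slow)
    print("verification rounds:", rounds, "vs greedy steps:", 20)
```

Because every kept token is the one the base function would have chosen, the toy produces the same sequence as token-by-token decoding while needing fewer rounds whenever the drafts are mostly right.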
IMPACT This method significantly enhances the speed and efficiency of on-device AI features, potentially accelerating the adoption of advanced LLM capabilities on mobile platforms.
RANK_REASON The item describes a new method for accelerating LLMs on edge devices, detailing architectural changes and their benefits, which falls under research.
Read on Google AI / Research →
- Confident Adaptive Language Modeling
- EAGLE
- Gemini Nano
- Gemma
- Google AI
- Google Pixel
- Multi Token Prediction
- Pixel 10
- Pixel 9
AI-generated summary · Google Gemini · from 1 source.