PulseAugur

Google AI accelerates on-device LLMs with new Multi-Token Prediction method

Google AI has developed a method to accelerate on-device Large Language Models (LLMs) such as Gemini Nano and Gemma, particularly on Google Pixel phones. The technique, called Multi-Token Prediction (MTP), retrofits a drafting head onto an existing, frozen model. The model can then produce several tokens per step instead of one at a time, which improves inference speed and energy efficiency without requiring a separate, memory-intensive drafter model.
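The draft-then-check idea can be illustrated with a toy sketch. The summary does not describe how drafted tokens are validated, so this assumes a standard draft-and-verify loop in which the frozen model checks the drafted tokens and keeps the longest correct prefix. Every name here (`base_next`, `draft`, `generate_with_draft`) is hypothetical, and both "models" are simple arithmetic stand-ins, not Gemini Nano or Gemma.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for illustration)


def base_next(context: List[int]) -> int:
    """Toy stand-in for the frozen base model's greedy next token."""
    return (31 * sum(context) + 7 * len(context)) % VOCAB


def draft(context: List[int], k: int) -> List[int]:
    """Toy stand-in for a drafting head: propose k tokens, sometimes wrongly."""
    ctx = list(context)
    proposed = []
    for _ in range(k):
        tok = base_next(ctx)
        if tok % 5 == 0:  # deliberately inject drafting errors
            tok = (tok + 1) % VOCAB
        proposed.append(tok)
        ctx.append(tok)
    return proposed


def generate_baseline(prompt: List[int], n: int) -> Tuple[List[int], int]:
    """One token per base-model pass."""
    out = list(prompt)
    for _ in range(n):
        out.append(base_next(out))
    return out[len(prompt):], n


def generate_with_draft(prompt: List[int], n: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k tokens, then verify them; count one verification pass per round."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        proposed = draft(out, k)
        passes += 1  # a real verifier would score all k positions in one batched pass
        for tok in proposed:
            expected = base_next(out)
            out.append(expected)  # always keep the base model's own token
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
    return out[len(prompt):][:n], passes


base_out, base_passes = generate_baseline([1, 2, 3], 12)
mtp_out, mtp_passes = generate_with_draft([1, 2, 3], 12, k=4)
print(base_passes, mtp_passes, mtp_out == base_out)
```

In this sketch the output matches one-token-at-a-time decoding exactly, because the base model's token is always the one kept. Any saving comes only from needing fewer verification rounds when the draft is often right.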

IMPACT This method significantly enhances the speed and efficiency of on-device AI features, potentially accelerating the adoption of advanced LLM capabilities on mobile platforms.

RANK_REASON The item describes a new method for accelerating LLMs on edge devices, detailing architectural changes and their benefits, which falls under research.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. Google AI / Research TIER_1 English(EN)

    Accelerating Gemini Nano models on Pixel with frozen Multi-Token Prediction

    Machine Intelligence