PulseAugur

HEAPr algorithm precisely prunes LLM experts, cutting memory needs

Researchers have developed HEAPr, a pruning algorithm designed to reduce the memory footprint of Mixture-of-Experts (MoE) large language models. Whereas previous methods prune entire experts, HEAPr decomposes each expert into smaller atomic units ("atomic experts") and prunes at that finer granularity. It relies on Hessian-based, second-order information computed in the output space of the atomic experts, which substantially reduces computational complexity and enables more precise compression with minimal accuracy loss. Experiments on models such as DeepSeek MoE and Qwen MoE show that HEAPr achieves nearly lossless compression at pruning ratios of up to 25%, while reducing FLOPs by a similar margin.
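The summary does not give HEAPr's exact scoring rule, so the following is only a minimal Python sketch of the general idea of output-space, second-order atomic pruning. It assumes (our assumption, not confirmed by the summary) that an "atomic expert" is one intermediate neuron of a SwiGLU-style expert, that is, one row of the gate and up projections plus one column of the down projection. It also substitutes a simple per-token empirical-Fisher approximation (H ≈ g gᵀ, where g is the loss gradient with respect to the expert output) for whatever Hessian estimate the paper actually uses. The function names `atom_scores` and `select_atoms_to_keep` and the parameter `prune_ratio` are hypothetical.

```python
import math
import random


def silu(z: float) -> float:
    # SiLU(z) = z * sigmoid(z); the tanh form of sigmoid avoids exp overflow.
    return z * 0.5 * (1.0 + math.tanh(0.5 * z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def atom_activation(x, w_gate_row, w_up_row):
    """Scalar activation of one atomic expert (assumed: one SwiGLU neuron)."""
    return silu(dot(w_gate_row, x)) * dot(w_up_row, x)


def atom_scores(xs, grads, w_gate, w_up, w_down_cols):
    """Estimated loss increase from removing each atomic expert.

    xs:          calibration inputs to the expert (d_model-length vectors)
    grads:       dL/d(expert output) for the same tokens (d_model-length)
    w_gate/w_up: d_ff rows, each of length d_model
    w_down_cols: d_ff columns of W_down, each of length d_model
    """
    scores = []
    for i in range(len(w_gate)):
        total = 0.0
        for x, g in zip(xs, grads):
            a = atom_activation(x, w_gate[i], w_up[i])
            # Removing atom i perturbs the expert output by
            # delta = -a * w_down_cols[i]. With the Fisher approximation
            # H ~ g g^T, the quadratic term 0.5 * delta^T H delta
            # becomes 0.5 * (g . delta)^2.
            total += 0.5 * (a * dot(g, w_down_cols[i])) ** 2
        scores.append(total / len(xs))
    return scores


def select_atoms_to_keep(scores, prune_ratio):
    """Drop the lowest-scoring fraction of atoms; return kept indices."""
    n_prune = int(len(scores) * prune_ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[n_prune:])


if __name__ == "__main__":
    # Hypothetical toy expert with random weights and calibration data.
    random.seed(0)
    d_model, d_ff, n_tokens = 4, 8, 16

    def rand_vec():
        return [random.gauss(0.0, 1.0) for _ in range(d_model)]

    xs = [rand_vec() for _ in range(n_tokens)]
    grads = [rand_vec() for _ in range(n_tokens)]
    w_gate = [rand_vec() for _ in range(d_ff)]
    w_up = [rand_vec() for _ in range(d_ff)]
    w_down_cols = [rand_vec() for _ in range(d_ff)]

    scores = atom_scores(xs, grads, w_gate, w_up, w_down_cols)
    print("kept atoms:", select_atoms_to_keep(scores, prune_ratio=0.25))
```

Because each atom's contribution is a rank-one term in the output, the score only needs d_model-sized vectors per token rather than a Hessian over all expert parameters. This is the intuition behind working in output space. Removing 25% of the atoms also removes roughly 25% of that expert's FFN multiply-adds per routed token.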

IMPACT Enables more efficient deployment of large MoE models by reducing memory requirements without significant performance degradation.

RANK_REASON This is a research paper describing a new algorithm for pruning LLMs.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang

    HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

    arXiv:2509.22299v3 Announce Type: replace-cross Abstract: Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory …