PulseAugur

New methods accelerate LLMs via efficient sparsification, quantization, and compression

Researchers have developed several new methods for compressing and optimizing large language models (LLMs) to improve efficiency and reduce computational cost. SparseForge targets efficient semi-structured sparsification by annealing Hessian-guided soft sparsity masks, achieving high accuracy with significantly fewer retraining tokens. FASQ introduces flexible accelerated subspace quantization, enabling continuous compression levels without calibration data and outperforming existing methods in both accuracy and speed on commodity GPUs. CoSpaDi uses calibration-guided sparse dictionary learning for structured decomposition, improving the accuracy-compression trade-off. A companion line of work provides a theoretical analysis and an efficient algorithm for compressing activations, cutting the memory cost of LLM training. Finally, SplitZip offers ultra-fast lossless KV-cache compression for disaggregated LLM serving, significantly speeding up data transfer between prefill and decode workers.

Summary written by gemini-2.5-flash-lite from 5 sources. How we write summaries →

IMPACT These advances in LLM compression and optimization could lead to more efficient deployment of large models on less powerful hardware, lower training-memory requirements, and faster inference.

RANK_REASON Multiple research papers published on arXiv detailing novel methods for LLM compression and optimization.

Read on arXiv cs.CL →

COVERAGE [5]

  1. arXiv cs.LG TIER_1 · Liu Hanzuo, Chaofan Lin, Weixuan Sun, Yulong Wang, Key, Rayying, Mingyu Gao ·

    SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

    arXiv:2605.06402v1 Announce Type: new Abstract: Semi-structured sparsity provides a practical path to accelerate large language models (LLMs) with native hardware support, but post-training semi-structured pruning often suffers from substantial quality degradation due to strong structural coupling. Existing methods rely on lar…
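
    For orientation, a minimal NumPy sketch of the 2:4 semi-structured pattern the paper targets (two of every four consecutive weights zeroed, the layout NVIDIA sparse tensor cores accelerate natively). The mask selection here is plain magnitude pruning, not the paper's Hessian-guided soft-mask annealing:

    import numpy as np

    def prune_2_of_4(w: np.ndarray) -> np.ndarray:
        """Zero the 2 smallest-magnitude weights in every group of 4."""
        groups = w.reshape(-1, 4)                          # view weights in groups of 4
        keep = np.argsort(np.abs(groups), axis=1)[:, 2:]   # 2 largest |w| per group
        mask = np.zeros_like(groups)
        np.put_along_axis(mask, keep, 1.0, axis=1)         # 1 where a weight survives
        return (groups * mask).reshape(w.shape)

    w = np.random.randn(8, 8).astype(np.float32)
    w_sparse = prune_2_of_4(w)
    assert (w_sparse.reshape(-1, 4) != 0).sum(axis=1).max() <= 2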

  2. arXiv cs.LG TIER_1 · Ye Qiao, Yian Wang, Zhiheng Chen, Hyoukjun Kwon, Sitao Huang ·

    FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression

    arXiv:2605.04084v1 Announce Type: new Abstract: Compressing large language models (LLMs) for deployment on commodity GPUs remains challenging: conventional scalar quantization is limited to fixed bit-widths (e.g., 8/4/3-bit), offers only a few discrete compression points, and typ…
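
    To make the "fixed bit-widths" limitation concrete, here is a sketch of ordinary symmetric scalar quantization: each bit-width gives one discrete compression point, which is exactly the trade-off curve a continuous subspace method like FASQ aims to fill in. This does not reproduce FASQ itself:

    import numpy as np

    def quantize_symmetric(w: np.ndarray, bits: int):
        """Per-tensor symmetric scalar quantization to `bits` bits."""
        qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8-bit
        scale = np.abs(w).max() / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        return q, scale                         # dequantize with q * scale

    w = np.random.randn(4096).astype(np.float32)
    for bits in (8, 4, 3):                      # only these discrete points exist
        q, s = quantize_symmetric(w, bits)
        err = np.abs(q * s - w).mean()
        print(f"{bits}-bit: mean abs error {err:.4f}")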

  3. arXiv cs.CL TIER_1 · Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis ·

    CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

    arXiv:2509.22075v5 Announce Type: replace Abstract: Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally effi…
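
    As a toy illustration of the underlying idea (sparse dictionary learning in place of a shared low-rank subspace), one can factor a weight matrix with scikit-learn. CoSpaDi's calibration-guided objective and structured sparsity are not reproduced here, and all sizes below are made up:

    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    rng = np.random.default_rng(0)
    W = rng.standard_normal((128, 64))          # stand-in for one weight matrix

    # Learn W ~= C @ D with a sparse code C: each row of W picks its own
    # few dictionary atoms, unlike a low-rank factorization where every
    # row must live in the same shared subspace.
    dl = DictionaryLearning(n_components=32, transform_algorithm="omp",
                            transform_n_nonzero_coefs=4, random_state=0)
    C = dl.fit_transform(W)                     # sparse codes, shape (128, 32)
    D = dl.components_                          # dictionary atoms, shape (32, 64)
    rel_err = np.linalg.norm(W - C @ D) / np.linalg.norm(W)
    print(f"relative reconstruction error: {rel_err:.3f}")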

  4. arXiv cs.LG TIER_1 · Wen-Da Wei, Han-Bin Fang, Yang-Di Liu, Jiang-Xin Shi, James Kwok, Yu-Feng Li ·

    Activation Compression in LLMs: Theoretical Analysis and Efficient Algorithm

    arXiv:2605.01255v1 Announce Type: new Abstract: Training large language models (LLMs) is highly memory-intensive, as training must store not only weights and optimizer states but also intermediate activations for backpropagation. While existing memory-efficient methods largely fo…
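
    One common realization of activation compression, sketched below with a hypothetical int8 scheme: the forward pass quantizes the activation it must save for backpropagation, and the backward pass dequantizes it. The paper's actual algorithm and its error analysis are more sophisticated:

    import torch

    class Int8SavedLinear(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, weight):
            # Save the activation in int8 instead of fp32/fp16.
            scale = x.abs().amax() / 127 + 1e-12
            x_q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
            ctx.save_for_backward(x_q, weight)
            ctx.scale = scale
            return x @ weight.t()

        @staticmethod
        def backward(ctx, grad_out):
            x_q, weight = ctx.saved_tensors
            x = x_q.float() * ctx.scale                 # dequantize for grad_w
            return grad_out @ weight, grad_out.t() @ x

    x = torch.randn(16, 32, requires_grad=True)
    w = torch.randn(8, 32, requires_grad=True)
    Int8SavedLinear.apply(x, w).sum().backward()        # grads via compressed x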

  5. arXiv cs.LG TIER_1 · Yipin Guo, Siddharth Joshi ·

    SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving

    arXiv:2605.01708v1 Announce Type: cross Abstract: Contemporary systems serving large language models (LLMs) have adopted prefill-decode disaggregation to better load-balance between the compute-bound prefill phase and the memory-bound decode phase. Under this design, prefill work…
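
    A generic sketch of the operation being optimized: losslessly compressing a KV-cache block on the prefill worker and reconstructing it bit-exactly on the decode worker. zlib here is only a stand-in; SplitZip's actual codec and layout are what the paper contributes:

    import zlib
    import numpy as np

    def pack_kv(kv: np.ndarray) -> bytes:
        return zlib.compress(kv.tobytes(), level=1)     # fastest setting

    def unpack_kv(blob: bytes, shape, dtype) -> np.ndarray:
        return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)

    # Toy KV block: (layers, heads, seq_len, head_dim) in fp16. Random data
    # barely compresses; real KV tensors expose redundancy (e.g. clustered
    # exponent bits) that purpose-built codecs exploit.
    kv = np.random.randn(4, 8, 128, 64).astype(np.float16)
    blob = pack_kv(kv)
    assert np.array_equal(kv, unpack_kv(blob, kv.shape, kv.dtype))  # lossless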