PulseAugur
New methods tackle LLM quantization for improved efficiency and accuracy

Researchers have developed several new methods that improve the efficiency of large language models (LLMs) through quantization. OSAQ suppresses weight outliers by exploiting a low-rank Hessian property, enabling accurate low-bit weight-only quantization. BWLA introduces a framework for 1-bit weight quantization combined with low-bit activations, achieving significant inference speedups. AGoQ targets memory-efficient distributed training, pairing layer-aware activation quantization with 8-bit gradient storage to reduce memory usage and improve training speed.
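All three methods refine the same post-training primitive: mapping floating-point weights onto a small integer grid with a per-group scale. As a point of reference, here is a minimal NumPy sketch of plain per-group round-to-nearest 4-bit weight quantization, the baseline this line of work improves on; the function names are illustrative and none of this implements OSAQ, BWLA, or AGoQ themselves.

```python
import numpy as np

def quantize_weights_rtn(w: np.ndarray, bits: int = 4, group_size: int = 128):
    """Per-group round-to-nearest (RTN) weight quantization.

    w: 2-D weight matrix (out_features, in_features).
    Returns integer codes, per-group scales, and per-group zero-points.
    """
    qmax = 2 ** bits - 1                       # 15 levels above zero for 4-bit, asymmetric grid
    out_f, in_f = w.shape
    assert in_f % group_size == 0, "in_features must be divisible by group_size"

    groups = w.reshape(out_f, in_f // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / qmax             # one scale per group
    scale = np.where(scale == 0, 1e-8, scale)  # guard against constant groups
    zero = np.round(-w_min / scale)

    q = np.clip(np.round(groups / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize_weights(q, scale, zero, shape):
    """Reconstruct an approximate float matrix from codes, scales, and zero-points."""
    return ((q.astype(np.float32) - zero) * scale).reshape(shape)

# Example: quantize a random layer and measure reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 512)).astype(np.float32)
q, s, z = quantize_weights_rtn(w, bits=4, group_size=128)
w_hat = dequantize_weights(q, s, z, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```

A single outlier stretches a group's min-max range and inflates its scale, which is the failure mode OSAQ's outlier self-absorption targets. Further sketches of the binarization, gradient-quantization, and low-rank-correction ideas from the coverage list below follow that list.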

Summary written by gemini-2.5-flash-lite from 8 sources.

IMPACT These advancements in LLM quantization promise to significantly reduce computational costs and memory requirements, enabling wider deployment and faster inference for large models.

RANK_REASON Multiple arXiv papers introduce novel techniques for LLM quantization, focusing on efficiency and accuracy improvements.

Read on arXiv cs.AI →

COVERAGE [8]

  1. arXiv cs.LG TIER_1 · Zhikai Li, Zhen Dong, Xuewen Liu, Jing Zhang, Qingyi Gu ·

    OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

    arXiv:2605.04738v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities. However, their massive parameter scale leads to significant resource consumption and latency during inference. Post-training weight-only quantization offers a p…

  2. arXiv cs.LG TIER_1 · Zhixiong Zhao, Zukang Xu, Dawei Yang ·

    BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs

    arXiv:2605.00422v1 Announce Type: new Abstract: Large language models (LLMs) have driven major progress in NLP, yet their substantial memory and compute demands still hinder practical deployment. Binarization can compress weights to 1 bit, fundamentally lowering compute and bandw…

  3. arXiv cs.CL TIER_1 · Wenxiang Lin, Juntao Huang, Luhan Zhang, Laili Li, Xiang Bao, Mengyang Zhang, Bing Wang, Shaohuai Shi ·

    AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

    arXiv:2605.00539v1 Announce Type: new Abstract: Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet, current approaches are ineffective for 4-bit activations and 8-bit gradients, which would easily cause slow converge…

  4. arXiv cs.CL TIER_1 · Shaohuai Shi ·

    AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

    Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet, current approaches are ineffective for 4-bit activations and 8-bit gradients, which would easily cause slow convergence or accuracy loss. To address this, we introd…

  5. arXiv cs.AI TIER_1 · Dawei Yang ·

    BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs

    Large language models (LLMs) have driven major progress in NLP, yet their substantial memory and compute demands still hinder practical deployment. Binarization can compress weights to 1 bit, fundamentally lowering compute and bandwidth cost. However, existing methods cannot addr…

  6. arXiv cs.AI TIER_1 · Selim An, Il hong Suh, Yeseong Kim ·

    GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs

    arXiv:2603.25385v2 Announce Type: replace-cross Abstract: Quantization techniques such as BitsAndBytes, AWQ, and GPTQ are widely used as a standard method in deploying large language models but often degrades accuracy when using low-bit representations, e.g., 4 bits. Low-rank cor…

  7. arXiv cs.CV TIER_1 · YiFeng Wang, Zhun Sun, Keisuke Sakaguchi ·

    Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization

    arXiv:2605.00140v1 Announce Type: cross Abstract: We present Activation Residual Hessian Quantization (ARHQ), a post-training weight splitting method designed to mitigate error propagation in low-bit activation-weight quantization. By constructing an input-side residual Hessian f…

  8. arXiv cs.CV TIER_1 · Keisuke Sakaguchi ·

    Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization

    We present Activation Residual Hessian Quantization (ARHQ), a post-training weight splitting method designed to mitigate error propagation in low-bit activation-weight quantization. By constructing an input-side residual Hessian from activation quantization residuals (G_x), ARHQ …
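For context on the W1AX setting in the BWLA entries (items 2 and 5), the following sketch shows the classic 1-bit weight scheme: each output channel keeps only the signs of its weights plus a single scale equal to their mean absolute value, the closed-form choice from XNOR-Net-style binarization. This is a generic illustration under those assumptions, not BWLA's method.

```python
import numpy as np

def binarize_weights(w: np.ndarray):
    """1-bit weight quantization: per-output-channel sign plus scale.

    Uses the standard closed-form choice alpha = mean(|w|) per row,
    which minimizes ||w - alpha * sign(w)||^2.
    """
    alpha = np.abs(w).mean(axis=1, keepdims=True)        # (out_features, 1) scales
    b = np.where(w >= 0, 1.0, -1.0).astype(np.float32)   # {-1, +1} codes
    return b, alpha

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 512)).astype(np.float32)
b, alpha = binarize_weights(w)
w_hat = alpha * b
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```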
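The AGoQ entries (items 3 and 4) keep gradients in 8 bits to cut training memory. A minimal form of that idea is symmetric per-tensor int8 quantization of a gradient tensor before it is stored or communicated, sketched below; AGoQ's layer-aware activation quantization and convergence safeguards are not reproduced here.

```python
import numpy as np

def quantize_grad_int8(g: np.ndarray):
    """Symmetric per-tensor int8 quantization of a gradient tensor."""
    scale = np.abs(g).max() / 127.0
    scale = max(scale, 1e-12)            # avoid division by zero for all-zero grads
    q = np.clip(np.round(g / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_grad(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
g = rng.normal(scale=1e-3, size=(1024, 1024)).astype(np.float32)
q, s = quantize_grad_int8(g)
g_hat = dequantize_grad(q, s)
print("bytes fp32:", g.nbytes, "bytes int8:", q.nbytes)
print("max abs error:", np.abs(g - g_hat).max())
```

Storing the codes instead of fp32 gradients cuts that buffer to a quarter of its size, which is the memory saving the AGoQ abstract describes for distributed training.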
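OSAQ (item 1) and GlowQ (item 6) both lean on low-rank structure to repair quantization error. The generic version of that idea, sketched below with assumed names and an arbitrary rank, is to quantize the weight matrix and then fit a rank-r correction to the residual with a truncated SVD, so the layer stores the quantized weights plus two thin factors; neither paper's actual algorithm is reproduced here.

```python
import numpy as np

def low_rank_corrected_quant(w: np.ndarray, rank: int = 16, bits: int = 4):
    """Quantize w, then approximate the residual with a rank-`rank` factorization."""
    # Plain symmetric per-tensor RTN quantization of the weights.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    w_q = np.clip(np.round(w / scale), -qmax, qmax) * scale

    # Truncated SVD of the residual gives the best rank-r correction in Frobenius norm.
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]          # (out_features, rank)
    b = vt[:rank, :]                    # (rank, in_features)
    return w_q, a, b

rng = np.random.default_rng(2)
w = rng.normal(size=(256, 512)).astype(np.float32)
w_q, a, b = low_rank_corrected_quant(w)
print("error, quant only:      ", np.linalg.norm(w - w_q))
print("error, quant + low-rank:", np.linalg.norm(w - (w_q + a @ b)))
```

Because the truncated SVD is the best rank-r approximation of the residual in Frobenius norm, the corrected layer is never worse than plain quantization, at the cost of storing two extra thin matrices per weight.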