PulseAugur

New methods tackle LLM quantization for improved efficiency and accuracy

Several recent arXiv papers propose new quantization methods to make large language models (LLMs) more efficient. OSAQ suppresses weight outliers by exploiting a low-rank property of the Hessian, targeting accurate low-bit weight-only quantization. BWLA is a post-training framework that pairs 1-bit weights with low-bit activations and reports significant inference speedups. AGoQ targets memory-efficient distributed training, combining layer-aware activation quantization with 8-bit gradient storage to reduce memory usage and improve training speed.
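As background for the terms in the summary, here is a minimal Python sketch of plain round-to-nearest weight quantization and sign-based 1-bit binarization. It is not the OSAQ, BWLA, or AGoQ algorithm. The function names, the toy weight values, and the choice of a single shared scale are all assumptions made for illustration. The sketch only shows why a single weight outlier can hurt low-bit accuracy, which is the general problem outlier-suppression methods address.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one shared scale (illustrative)."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # largest weight maps to qmax
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def binarize(weights):
    """1-bit weights: keep only the sign, with a shared scale (mean absolute value)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, alpha


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy weights (hypothetical values), with and without one large outlier.
weights = [0.12, -0.07, 0.03, -0.15, 0.09, 0.01, -0.02, 0.05]
with_outlier = weights + [2.0]

codes, scale = quantize_symmetric(weights, 4)
err_clean = mse(weights, dequantize(codes, scale))

codes_o, scale_o = quantize_symmetric(with_outlier, 4)
# Compare error on the same eight ordinary weights only.
err_outlier = mse(with_outlier[:-1], dequantize(codes_o, scale_o)[:-1])

print("4-bit error without outlier:", err_clean)
print("4-bit error with outlier:   ", err_outlier)
```

In this toy setup the outlier stretches the shared scale, so most ordinary weights round to zero and their reconstruction error rises. Real methods use per-channel or per-group scales and more elaborate corrections; the sketch keeps a single scale only to make the effect visible.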

Impact: These advances in LLM quantization promise to significantly reduce computational cost and memory requirements, enabling wider deployment and faster inference for large models.

Ranking rationale: Multiple arXiv papers introduce novel techniques for LLM quantization, focusing on efficiency and accuracy improvements.

AI-generated summary · Google Gemini · from 8 sources.

Sources [8]

  1. arXiv cs.LG TIER_1 English(EN) · Zhikai Li, Zhen Dong, Xuewen Liu, Jing Zhang, Qingyi Gu ·

    OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

    arXiv:2605.04738v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities. However, their massive parameter scale leads to significant resource consumption and latency during inference. Post-training weight-only quantization offers a p…

  2. arXiv cs.LG TIER_1 English(EN) · Zhixiong Zhao, Zukang Xu, Dawei Yang ·

    BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs

    arXiv:2605.00422v1 Announce Type: new Abstract: Large language models (LLMs) have driven major progress in NLP, yet their substantial memory and compute demands still hinder practical deployment. Binarization can compress weights to 1 bit, fundamentally lowering compute and bandw…

  3. arXiv cs.CL TIER_1 English(EN) · Wenxiang Lin, Juntao Huang, Luhan Zhang, Laili Li, Xiang Bao, Mengyang Zhang, Bing Wang, Shaohuai Shi ·

    AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

    arXiv:2605.00539v1 Announce Type: new Abstract: Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet, current approaches are ineffective for 4-bit activations and 8-bit gradients, which would easily cause slow converge…

  4. arXiv cs.CL TIER_1 English(EN) · Shaohuai Shi ·

    AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

    Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet, current approaches are ineffective for 4-bit activations and 8-bit gradients, which would easily cause slow convergence or accuracy loss. To address this, we introd…

  5. arXiv cs.AI TIER_1 English(EN) · Dawei Yang ·

    BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs

    Large language models (LLMs) have driven major progress in NLP, yet their substantial memory and compute demands still hinder practical deployment. Binarization can compress weights to 1 bit, fundamentally lowering compute and bandwidth cost. However, existing methods cannot addr…

  6. arXiv cs.AI TIER_1 English(EN) · Selim An, Il hong Suh, Yeseong Kim ·

    GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs

    arXiv:2603.25385v2 Announce Type: replace-cross Abstract: Quantization techniques such as BitsAndBytes, AWQ, and GPTQ are widely used as a standard method in deploying large language models but often degrades accuracy when using low-bit representations, e.g., 4 bits. Low-rank cor…

  7. arXiv cs.CV TIER_1 English(EN) · YiFeng Wang, Zhun Sun, Keisuke Sakaguchi ·

    Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization

    arXiv:2605.00140v1 Announce Type: cross Abstract: We present Activation Residual Hessian Quantization (ARHQ), a post-training weight splitting method designed to mitigate error propagation in low-bit activation-weight quantization. By constructing an input-side residual Hessian f…

  8. arXiv cs.CV TIER_1 English(EN) · Keisuke Sakaguchi ·

    Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization

    We present Activation Residual Hessian Quantization (ARHQ), a post-training weight splitting method designed to mitigate error propagation in low-bit activation-weight quantization. By constructing an input-side residual Hessian from activation quantization residuals (G_x), ARHQ …