Researchers have developed several new methods to improve the efficiency of large language models (LLMs) through quantization. OSAQ focuses on suppressing weight outliers using a low-rank Hessian property for accurate low-bit weight-only quantization. BWLA introduces a framework for 1-bit weight quantization alongside low-bit activations, achieving significant inference speedups. AGoQ targets memory-efficient distributed training by employing layer-aware activation quantization and 8-bit gradient storage, reducing memory usage and improving training speed.
AI summary written by gemini-2.5-flash-lite from 8 sources.
IMPACT These advancements in LLM quantization promise to significantly reduce computational costs and memory requirements, enabling wider deployment and faster inference for large models.
RANK_REASON Multiple arXiv papers introduce novel techniques for LLM quantization, focusing on efficiency and accuracy improvements.
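To make the underlying idea concrete, the sketch below shows generic symmetric per-channel low-bit weight-only quantization (quantize to signed integers with one scale per output row, then dequantize). This is an illustrative assumption of how such schemes typically work, not the specific algorithm from OSAQ, BWLA, or AGoQ; the function names and the 4-bit setting are hypothetical.

```python
# Minimal sketch of symmetric per-channel low-bit weight-only quantization.
# Generic illustration only; not the method of any paper named above.
import numpy as np

def quantize_weights(w: np.ndarray, bits: int = 4):
    """Quantize a 2-D weight matrix to signed integers, one scale per output row."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_weights(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix from integers and per-row scales."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 16)).astype(np.float32)
    q, s = quantize_weights(w, bits=4)
    w_hat = dequantize_weights(q, s)
    print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```

Methods like those summarized above differ mainly in how they reduce the reconstruction error this naive rounding incurs, for example by handling weight outliers before quantization or by extending the scheme to activations and gradients.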