New Methods Improve LLM Quantization Efficiency and Accuracy

By the PulseAugur Editorial Team · [48 sources] · 2026-05-21 22:23

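Researchers have developed several new methods to improve the efficiency and accuracy of large language model (LLM) quantization. These techniques aim to reduce the memory footprint and compute cost of LLMs so that they are easier to deploy on resource-constrained devices. The innovations include calibration-free bit allocation for Mixture-of-Experts (MoE) models, outlier injection that exploits quantization vulnerabilities, and a hardware-friendly mixed-precision quantization framework.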

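Impact: These advances in LLM quantization could significantly lower deployment costs and make LLMs accessible to a wider range of applications and hardware.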

Ranking rationale: Multiple arXiv papers were published detailing new methods for LLM quantization.

arXiv
GEMQ
MoE-LLMs
Mixture-of-Experts Large Language Models
FP8
InfoQuant
INT4
INT8
LLaMA
MoBiQuant
NeUQI
Qwen
ReSpinQuant
WINDQuant
AlphaQ
EmaQ
EmaQ-LT
GGUF
GPTQ
LLaMA-2-7B
LLaMA-3.1-8B
LLM
Mixture-of-Experts (MoE)
OASIS
Qwen1.5-MoE
WaterSIC

AI-generated summary · Google Gemini · from 48 sources.

Sources [48]

arXiv cs.AI TIER_1 English(EN) · Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Xing Hu, Zhe Jiang, Dawei Yang · 2026-06-12 04:00

TWLA: Ternary Weights and Low-Bit Activations for Large Language Models via Post-Training Quantization

arXiv:2606.13054v1 Announce Type: cross Abstract: Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant …
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 08:37

TWLA: Ternary Weights and Low-Bit Activations for Large Language Models via Post-Training Quantization

Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity.…
arXiv cs.AI TIER_1 English(EN) · Patrik Czakó, Gábor Kertész, Sándor Szénási · 2026-06-10 04:00

Trainable Smooth Rotation Transforms with Learned Channel Scales for LLM Quantization

arXiv:2606.09927v1 Announce Type: cross Abstract: Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activation quantization remains difficult because outlier-dominated channels lead to large quantiza…
arXiv cs.CL TIER_1 English(EN) · Haoyu Wang, Haiyan Zhao, Xingyu Yu, Zhangyang Yao, Xu Han, Zhiyuan Liu, Maosong Sun · 2026-06-10 04:00

UniSVQ: Unified Scalar-Vector Quantization at 2 Bits

arXiv:2606.10520v1 Announce Type: new Abstract: Post-training quantization at the 2-bit level enables low-cost deployment and inference acceleration for large language models (LLMs). Scalar quantization (SQ) and vector quantization (VQ) are two primary quantization methods, howev…
arXiv cs.AI TIER_1 English(EN) · Juan Amboage, Pablo Monteagudo-Lago, Ian Colbert, Giuseppe Franco, Nicholas Fraser · 2026-06-10 04:00

Optimal Post-Training Quantization Scales and How to Find Them

arXiv:2606.10890v1 Announce Type: cross Abstract: Post-training quantization (PTQ) compresses large language models by mapping weights to low-bit representations. The scaling factor that defines the quantization grid is typically chosen using simple, data-free heuristics. In this…
arXiv cs.AI TIER_1 English(EN) · Nicholas Fraser · 2026-06-09 14:03

Optimal Post-Training Quantization Scales and How to Find Them

Post-training quantization (PTQ) compresses large language models by mapping weights to low-bit representations. The scaling factor that defines the quantization grid is typically chosen using simple, data-free heuristics. In this work, we present PiSO (Piecewise Scale Optimizati…
arXiv cs.CL TIER_1 English(EN) · Maosong Sun · 2026-06-09 07:50

UniSVQ: Unified Scalar-Vector Quantization at 2 Bits

Post-training quantization at the 2-bit level enables low-cost deployment and inference acceleration for large language models (LLMs). Scalar quantization (SQ) and vector quantization (VQ) are two primary quantization methods, however, the former suffers from significant performa…
arXiv cs.AI TIER_1 English(EN) · Li Lin, Xiaojun Wan · 2026-06-09 04:00

ScaleSweep: Accurate NVFP4 Post-Training Quantization for LLMs via Block-Scale Initialization

arXiv:2606.07618v1 Announce Type: cross Abstract: NVFP4 is a recently introduced hardware-supported FP4 format that improves the fidelity of 4-bit quantization through fine-grained block scales. However, existing NVFP4 scale initialization methods still primarily rely on AbsMax i…
arXiv cs.AI TIER_1 English(EN) · Haoqi Wang, Lorenz K. Mueller, Jiawei Zhuang, Mathieu Salzmann, Lukas Cavigelli · 2026-06-08 04:00

OffQ: Taming Structured Outliers in LLM Quantization with Offsets

arXiv:2606.07116v1 Announce Type: cross Abstract: Low-bit quantization has been widely adopted to accelerate the inference of large language models (LLMs) by significantly reducing computational cost and memory usage. However, activation outliers pose a major challenge to effecti…
arXiv cs.CL TIER_1 English(EN) · Beshr IslamBouli, David Jin · 2026-06-08 04:00

AAAC: Activation-Aware Adaptive Codebooks for 4-Bit LLM Weight Quantization

arXiv:2605.08692v2 Announce Type: replace-cross Abstract: Post-training weight-only quantization to 4 bits is widely used to reduce the memory and compute costs of large language model inference. Existing PTQ methods, such as AWQ and GPTQ, improve how weights are mapped onto a fi…
arXiv cs.AI TIER_1 English(EN) · Haoyu Huang, Linlin Yang, Sheng Xu, Boyu Liu, Guodong Guo, Zhongqian Fu, Hang Zhou, Baochang Zhang · 2026-06-08 04:00

FAIR-Calib: Frontier-Uncertainty-Aware Recalibration for Post-Training Quantization of Diffusion Large Language Models

arXiv:2606.06547v1 Announce Type: cross Abstract: Diffusion Large Language Models (dLLMs) refine tokens iteratively but commit them irreversibly, leading to a "stability lag" where early decisions remain fragile even after being written. We reveal that Post-Training Quantization …
arXiv cs.AI TIER_1 English(EN) · Rayyan Abdalla, Amir Hussein, Min Wu, Dinesh Manocha · 2026-06-06 04:00

Minimizing the Hidden Cost of Large Language Models: Graph-Guided Ultra-Low-Bit Quantization

arXiv:2606.05429v1 Announce Type: new Abstract: Post-training quantization (PTQ) is critical for the efficient deployment of large language models (LLMs). Recent ultra-low-bit PTQ methods rely on rigid weight-saliency assumptions or position heuristics, introducing substantial hi…
arXiv cs.CL TIER_1 English(EN) · Lukas Cavigelli · 2026-06-05 10:11

OffQ: Taming Structured Outliers in LLM Quantization with Offsets

Low-bit quantization has been widely adopted to accelerate the inference of large language models (LLMs) by significantly reducing computational cost and memory usage. However, activation outliers pose a major challenge to effective quantization, often leading to notable performa…
arXiv cs.CL TIER_1 English(EN) · Zihan Chen, Bike Xie, Jundong Li, Cong Shen · 2026-06-05 04:00

Channel-Wise Mixed-Precision Quantization for Large Language Models

arXiv:2410.13056v4 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large …
arXiv cs.AI TIER_1 English(EN) · Xiaohua Zhan, Kazuki Egashira, Robin Staab, Mark Vero, Martin Vechev · 2026-06-04 04:00

Widening the Gap: Exploiting LLM Quantization via Outlier Injection

arXiv:2605.15152v2 Announce Type: replace-cross Abstract: LLM quantization has become essential for memory-efficient deployment. Recent work has shown that quantization schemes can pose critical security risks: an adversary may release a model that appears benign in full precisio…
arXiv cs.LG TIER_1 English(EN) · Wanqi Yang, Yuexiao Ma, Alexander Conzelmann, Xiawu Zheng, Michael W. Mahoney, T. Konstantin Rusch, Shiwei Liu · 2026-06-04 04:00

AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Model Quantization

arXiv:2606.04980v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) architectures scale model capacity through sparse expert activation, but their deployment remains memory-bound because all expert weights must reside in memory. Mixed-precision quantization can substantially…
arXiv cs.LG TIER_1 English(EN) · Xin Nie, Haicheng Zhang, Liang Dong, Beining Feng, Jinhong Weng, Guiling Sun · 2026-06-04 04:00

SFMP: Fine-Grained, Hardware-Friendly, and Search-Free Mixed-Precision Quantization for Large Language Models

arXiv:2602.01027v2 Announce Type: replace Abstract: Mixed-precision quantization is a promising approach for compressing large language models under tight memory budgets. However, existing mixed-precision methods typically suffer from one of two limitations: they either rely on e…
arXiv cs.LG TIER_1 English(EN) · Chin-Yuan Yeh, Ting-An Chen, De-Nian Yang, Ming-Syan Chen · 2026-06-04 04:00

Towards Multi-Domain and Long-Tailed Quantization via Feature Alignment and Scaling

arXiv:2606.04920v1 Announce Type: new Abstract: Quantizing deep neural networks is essential for efficient inference on resource-constrained devices. However, most existing methods are designed for single-domain and class-balanced data, leaving practical settings with domain shif…
arXiv cs.LG TIER_1 English(EN) · Shiwei Liu · 2026-06-03 15:03

AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Model Quantization

Mixture-of-Experts (MoE) architectures scale model capacity through sparse expert activation, but their deployment remains memory-bound because all expert weights must reside in memory. Mixed-precision quantization can substantially reduce this footprint by assigning different bi…
arXiv cs.LG TIER_1 English(EN) · Ming-Syan Chen · 2026-06-03 14:16

Towards Multi-Domain and Long-Tailed Quantization via Feature Alignment and Scaling

Quantizing deep neural networks is essential for efficient inference on resource-constrained devices. However, most existing methods are designed for single-domain and class-balanced data, leaving practical settings with domain shifts or severe class imbalance underexplored. We a…
arXiv cs.LG TIER_1 English(EN) · Egor Lifar, Semyon Savkin, Or Ordentlich, Yury Polyanskiy · 2026-06-03 04:00

WaterSIC: Information-Theoretically (Near) Optimal Linear Layer Quantization

arXiv:2603.04956v2 Announce Type: replace Abstract: This paper considers the problem of converting a given dense linear layer to low precision. The tradeoff between compressed length and output discrepancy is analyzed information theoretically (IT). It is shown that a popular GPT…
arXiv cs.LG TIER_1 English(EN) · Chi-Wei Huang, Chia-Chi Tsai · 2026-06-03 04:00

Qift: Zero-Free W2 Post-Training Quantization Friendly to Rotated W2A4/KV4 LLM Inference

arXiv:2606.02823v1 Announce Type: new Abstract: Two-bit weight quantization is attractive for memory-efficient LLM inference, but the standard W2 level set {-2,-1,0,+1} often collapses under aggressive W2A4/KV4 settings. We study the scalar level-set geometry of two-bit weights i…
arXiv cs.LG TIER_1 English(EN) · Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, Hai Li · 2026-06-03 04:00

OASIS: Accelerating LLM Inference with Outlier-Aware Lookup-Table-Based Dual-Side Quantized GEMM

arXiv:2507.23035v4 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated impressive capabilities across a wide range of applications, but demand substantial memory and compute resources during inference. Existing quantization methods expose a trade-off b…
arXiv cs.AI TIER_1 English(EN) · Jiayu Zhao, Zihan Teng, Minhao Fan, Tianrui Ma, Wentao Ren, Song Chen, Weichen Liu · 2026-06-02 04:00

BitsMoE: Efficient Spectral-Energy-Guided Bit Allocation for MoE LLM Quantization

arXiv:2606.00079v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) large language models reduce per-token computation through sparse expert activation, but their deployment remains memory-intensive because all expert weights must be kept resident in memory. Existing MoE c…
arXiv cs.LG TIER_1 English(EN) · Jiale Chen, Vage Egiazarian, Roberto L. Castro, Torsten Hoefler, Dan Alistarh · 2026-06-02 04:00

WUSH: Near-Optimal Adaptive Transforms for LLM Quantization

arXiv:2512.00956v3 Announce Type: replace Abstract: Quantizing LLM weights and activations is a standard approach for efficient deployment, but a few extreme outliers can stretch the dynamic range and amplify low-bit quantization errors. Prior transform-based mitigations (e.g., H…
arXiv cs.LG TIER_1 English(EN) · Halyun Jeong, Jack Xin, Penghang Yin · 2026-06-02 04:00

Beyond Discreteness: Sample-Complexity Analysis of the Straight-Through Estimator for 1-Bit Quantization

arXiv:2505.18113v2 Announce Type: replace Abstract: Training quantized neural networks requires addressing the non-differentiable and discrete nature of the underlying optimization problem. To tackle this challenge, the straight-through estimator (STE) has become the most widely …
arXiv cs.CL TIER_1 English(EN) · Li Lin, Xinyu Hu, Xiaojun Wan · 2026-06-01 04:00

NeUQI: Near-Optimal Uniform Quantization Parameter Initialization for Low-Bit LLMs

arXiv:2505.17595v4 Announce Type: replace-cross Abstract: Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and infere…
arXiv cs.AI TIER_1 English(EN) · Artur Zagitov, Gleb Molodtsov, Aleksandr Beznosikov · 2026-05-29 04:00

HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization

arXiv:2605.29843v1 Announce Type: cross Abstract: Post-training quantization (PTQ) is essential for deploying LLMs under memory and bandwidth constraints. However, extreme low-bit quantization remains highly sensitive to activation outliers and anisotropic weight curvature. Exist…
arXiv cs.LG TIER_1 English(EN) · Zexin Zhuang, Yanhang Li, Zhichao Fan · 2026-05-29 04:00

Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit

arXiv:2605.28873v1 Announce Type: new Abstract: This is a planning-method note with an unpaired pilot audit. We adapt the classical paired-binary sample-size calculation (Miettinen, 1968) to quantization benchmarks, giving a conservative minimum detectable effect (MDE) bound $\de…
arXiv cs.AI TIER_1 English(EN) · Suyoung Kim, Sunghyun Wee, Hyeonjin Kim, Kyomin Hwang, Hyunho Lee, Nojun Kwak · 2026-05-29 04:00

ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

arXiv:2604.11080v2 Announce Type: replace-cross Abstract: Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficien…
arXiv cs.AI TIER_1 English(EN) · Jung Hyun Lee, June Yong Yang, Jungwook Choi, Eunho Yang · 2026-05-29 04:00

LFQ: Final-Block-Aware Quantization for Improving the Generation Quality of Low-Bit Quantized LLMs

arXiv:2605.29756v1 Announce Type: new Abstract: As large language models continue to scale, low-bit weight-only post-training quantization (PTQ) offers a practical solution to their memory-efficient deployment. Although block-wise PTQ is capable of matching the full-precision (FP…
arXiv cs.CL TIER_1 English(EN) · Preetam Sharma, Kacper Dobek · 2026-05-27 04:00

QAM-W: Joint 2D Codebook Quantization of LLM Weights via Hadamard Rotation and Activation-Aware Scaling

arXiv:2605.26339v1 Announce Type: cross Abstract: Scalar post-training quantizers discard pairwise coordinate structure within weight rows. We introduce QAM-W (Quadrature Amplitude Modulation for Weights), a codec that recovers this structure: each row is L2-normalized, block-Had…
arXiv cs.LG TIER_1 English(EN) · Phong Nam Huu Nguyen, Khoi M. Le, Cong-Duy T Nguyen, Anh Tuan Luu, Thong Thanh Nguyen, Tho Quan · 2026-05-27 04:00

WINDQuant: Weight-Aware Neural Decisions for Global Mixed-Precision LLM Quantization

arXiv:2605.26660v1 Announce Type: new Abstract: Quantization is an effective approach to reduce the memory footprint and inference cost of large language models (LLMs), yet maintaining performance in the ultra-low-bit regime remains challenging. Existing post-training methods oft…
arXiv cs.AI TIER_1 English(EN) · Ke Li, Dong An, Xiaoling Zang, Can Ye, Liang Xie, Qibo Qiu, Chen Shen, Xiaofei He, Wenxiao Wang · 2026-05-27 04:00

InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

arXiv:2605.26175v1 Announce Type: cross Abstract: Low-bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but that their distributions are often poorly matched to …
arXiv cs.AI TIER_1 English(EN) · Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh · 2026-05-27 04:00

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

arXiv:2411.02355v4 Announce Type: replace-cross Abstract: Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empir…
arXiv cs.AI TIER_1 English(EN) · Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, KhayTze Peong, Kang Eun Jeon, Jong Hwan Ko, Yiran Chen, Huanrui Yang · 2026-05-26 04:00

MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Any-Precision LLMs

arXiv:2602.20191v2 Announce Type: replace-cross Abstract: Dynamic runtime latency and memory constraints necessitate flexible large language model (LLM) deployment, where an LLM can be inferred with various quantization precisions based on available computational resources. Recen…
arXiv cs.CL TIER_1 English(EN) · Jianing Deng, Song Wang, Dongwei Wang, Zijie Liu, Tianlong Chen, Huanrui Yang, Jingtong Hu · 2026-05-25 04:00

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

arXiv:2605.23078v1 Announce Type: cross Abstract: Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-…
arXiv cs.CL TIER_1 English(EN) · Jingtong Hu · 2026-05-21 22:23

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the …
arXiv cs.CV TIER_1 English(EN) · Hao Lu, Yongxin Guo, Onur Koyun, Zhengjie Zhu, Abbas Alili, Metin N. Gurcan · 2026-06-11 04:00

NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization

arXiv:2606.11363v1 Announce Type: new Abstract: Vector quantization is central to modern generative modeling pipelines, but large-codebook VQ models often suffer from codebook collapse. We identify encoder drift as a key driver of this failure: as the encoder moves the latent dis…
arXiv stat.ML TIER_1 English(EN) · Hanyang Li, Jianhao Ma, Ying Cui · 2026-06-09 04:00

Understanding Quantization-Aware Training: Gradients at Quantized Weights Are Biased Toward Low-Loss Basins

arXiv:2606.09012v1 Announce Type: cross Abstract: Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, while quantization-aware training (QAT) incorporates quantization into the training loop. Although PTQ is…
arXiv stat.ML TIER_1 English(EN) · Ying Cui · 2026-06-08 04:21

Understanding Quantization-Aware Training: Gradients at Quantized Weights Are Biased Toward Low-Loss Basins

Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, while quantization-aware training (QAT) incorporates quantization into the training loop. Although PTQ is efficient and often accurate at moderate bitwidth…
Towards AI TIER_1 English(EN) · The Dev Loop · 2026-06-05 23:01

How LLM Quantization Works: INT8, INT4, GPTQ, and AWQ Explained

https://pub.towardsai.net/how-llm-quantization-works-int8-int4-gptq-and-awq-explained-172e1a76b347?source=rss----98111c9905da---4
r/LocalLLaMA TIER_1 English(EN) · /u/we_are_mammals · 2026-06-12 02:02

Some contrived tests comparing the accuracy of different Gemma and Qwen quants

I mostly ran these tests for myself, because the published KLD numbers are hard to interpret, and you cannot compare 9B-Q4 vs 4B-Q8, for example. But I'm happy to share the results with anyone interested: Test 1 …
dev.to — LLM tag TIER_1 Italiano(IT) · Chinaski · 2026-06-11 07:22

Requantizing a Local Model, 14x Faster

Where a 2-bit model spends its bits, and why trying answers used to cost eighty minutes …
r/LocalLLaMA TIER_1 Italiano(IT) · /u/Any-Chipmunk5480 · 2026-06-07 05:57

Dense vs MoE quantization resilience

Which one is more resilient to quantization? Especially at 4-bit? My experience: I tried gemma4 26b a4b with Ud-q5_k_xl quant and I got a loop around 45k context. At 6-bit the looping issue is fixed. (Llamacpp default sample settings) …
r/LocalLLaMA TIER_1 English(EN) · /u/rerri · 2026-06-05 16:11

Gemma 4 with quantization-aware training

https://www.reddit.com/r/LocalLLaMA/comments/1txpeo0/gemma_4_with_quantizationaware_training/
r/LocalLLaMA TIER_1 English(EN) · /u/_cpatonn · 2026-06-04 20:18

cyankiwi AWQ 4-bit: 26.05 update, NVFP4 + FP8 Dynamic quantization and benchmarks across Qwen3.6 4-bit quants

https://www.reddit.com/r/LocalLLaMA/comments/1twz9ur/cyankiwi_awq_4bit_2605_update_nvfp4_fp8_dynamic/
r/LocalLLaMA TIER_1 English(EN) · /u/bobaburger · 2026-05-29 17:53

Qwen3.6-27B Quantization Benchmark

https://www.reddit.com/r/LocalLLaMA/comments/1tr9vzn/qwen3627b_quantization_benchmark/
