PulseAugur
Introducing AutoRound: Intel’s Advanced Quantization for LLMs and VLMs

Hugging Face introduces advanced quantization techniques for efficient LLMs

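Researchers are developing advanced quantization techniques to make large language models (LLMs) more efficient. New methods such as AutoRound, LATMiX, and GSQ aim to reduce model size and compute requirements, enabling deployment on less powerful hardware. These methods focus on optimizing how model weights and activations are represented at lower bit widths, and some report accuracy comparable to higher-precision models. Innovations include new calibration strategies for post-training quantization (PTQ), learnable affine transformations for microscaling quantization, and mixed-precision schemes designed to keep models certifiably robust.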
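As a point of reference for what "representing weights at lower bit widths" means, here is a minimal sketch of plain symmetric round-to-nearest (RTN) quantization with a single per-tensor scale. It is an assumed baseline for illustration only: AutoRound, LATMiX, and GSQ each refine this step in their own way, and none of them is reproduced here. The weight values are made up, and the helper names are hypothetical.

```python
def quantize_symmetric(weights, bits):
    """Symmetric round-to-nearest (RTN) quantization with one per-tensor scale.

    Maps each float to a signed integer in [-qmax, qmax], qmax = 2**(bits-1) - 1.
    The scale comes from the largest magnitude, i.e. the simplest max-abs calibration.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # round() is Python's round-half-to-even; the clamp guards against float drift
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate float values."""
    return [v * scale for v in q]


def mse(a, b):
    """Mean squared reconstruction error."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    w = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]  # made-up weights
    for bits in (8, 4, 2):
        q, scale = quantize_symmetric(w, bits)
        print(f"{bits}-bit codes={q} mse={mse(w, dequantize(q, scale)):.6f}")
```

With `bits=2` the symmetric grid collapses to the ternary codes {-1, 0, 1}, about log2(3) ≈ 1.58 bits of information per weight, and the reconstruction error grows accordingly.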

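Impact: Enables more efficient deployment of LLMs on resource-constrained devices, potentially lowering inference costs and improving accessibility.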
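To make the memory side of this claim concrete, here is rough arithmetic for weight storage alone. The 7-billion-parameter model size is a hypothetical example, and the per-group scale overhead (one 16-bit scale per group of 128 weights) is an assumed, commonly used layout rather than a detail taken from any of the sources; activations, the KV cache, and runtime overhead are ignored.

```python
def weight_memory_gib(num_params, bits, group_size=None, scale_bits=16):
    """Approximate weight storage in GiB.

    Counts only the quantized weights plus, optionally, one scale of
    `scale_bits` bits per `group_size` weights (an assumed layout).
    Activations, KV cache, and framework overhead are ignored.
    """
    effective_bits = bits
    if group_size:
        effective_bits += scale_bits / group_size  # amortized cost of per-group scales
    return num_params * effective_bits / 8 / 2**30  # bits -> bytes -> GiB


if __name__ == "__main__":
    n = 7e9  # hypothetical 7B-parameter model
    print(f"fp16/bf16    : {weight_memory_gib(n, 16):6.2f} GiB")
    print(f"int8         : {weight_memory_gib(n, 8):6.2f} GiB")
    print(f"4-bit, g=128 : {weight_memory_gib(n, 4, group_size=128):6.2f} GiB")
    print(f"2-bit, g=128 : {weight_memory_gib(n, 2, group_size=128):6.2f} GiB")
```

Under these assumptions the weights shrink from about 13.0 GiB at 16 bits to about 3.4 GiB at 4 bits with group scales, which is the kind of reduction that moves a model from data-center GPUs toward consumer hardware.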

Ranking rationale: Multiple arXiv papers and Hugging Face blog posts detail new research and tools for LLM quantization.


AI-generated summary · Google Gemini · from 16 sources.


Sources [16]

  1. Hugging Face Blog TIER_1 English(EN) ·

    Introducing AutoRound: Intel’s Advanced Quantization for LLMs and VLMs

  2. Hugging Face Blog TIER_1 English(EN) ·

    Fine-tuning LLMs to 1.58bit: extreme quantization made easy

  3. Hugging Face Blog TIER_1 English(EN) ·

    Quanto: a PyTorch quantization backend for Optimum

  4. Hugging Face Blog TIER_1 English(EN) ·

    Overview of natively supported quantization schemes in 🤗 Transformers

  5. Hugging Face Blog TIER_1 English(EN) ·

    Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

  6. arXiv cs.LG TIER_1 English(EN) · Soheil Kolouri ·

    ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs

    Large language models (LLMs) are costly to deploy due to their large memory footprint and high inference cost. Weight-activation quantization can reduce these costs, but low-bit activation quantization remains difficult because activation outliers induce large quantization error.…

  7. arXiv cs.AI TIER_1 English(EN) · Wenshuo Wang ·

    LLMs Should Not Yet Be Credited with Decision Explanation

    arXiv:2605.01164v1 · This position paper argues that LLMs should not yet be credited with decision explanation. This matters because recent work increasingly treats accurate behavioral prediction, plausible rationales, and outcome-conditioned reasoning …

  8. arXiv cs.LG TIER_1 English(EN) · Joy Bose ·

    Spiking Sequence Machines and Transformers

    arXiv:2605.00662v1 · Sequence learning reduces to similarity-based retrieval over a temporally indexed representation space, a constraint on any sequence model, not a property of a specific architecture. We show that a spiking Sparse Distributed Memor…

  9. arXiv cs.LG TIER_1 English(EN) · Joy Bose ·

    Spiking Sequence Machines and Transformers

    Sequence learning reduces to similarity-based retrieval over a temporally indexed representation space, a constraint on any sequence model, not a property of a specific architecture. We show that a spiking Sparse Distributed Memory sequence machine (2007) and the transformer (201…

  10. arXiv cs.LG TIER_1 English(EN) · Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma ·

    Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels

    arXiv:2604.24008v1 · Post-Training Quantization (PTQ) compresses large language models to low bit-widths using a small calibration set, and its quality depends strongly on which samples are chosen. We identify a failure mode in which calibration samples…

  11. arXiv cs.CL TIER_1 English(EN) · Ofir Gordon, Lior Dikstein, Arnon Netzer, Idan Achituve, Hai Victor Habi ·

    LATMiX: Learnable Affine Transformations for Microscaling Quantization of LLMs

    arXiv:2602.17681v2 · Post-training quantization (PTQ) is a widely used approach for reducing the memory and compute costs of large language models (LLMs). Recent studies have shown that applying invertible transformations to activations can si…

  12. arXiv cs.CL TIER_1 English(EN) · Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram Vishwanath, Muriel Medard ·

    MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression

    arXiv:2410.21548v3 · Large language models have drastically changed the prospects of AI by introducing technologies for more complex natural language processing. However, current methodologies to train such LLMs require extensive resources including…

  13. Hugging Face Daily Papers TIER_1 English(EN) ·

    GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

    Weight quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into two sets of methods: simple scalar quantization techniques, s…

  14. arXiv cs.CV TIER_1 English(EN) · Yuchen Yang, Yifan Zhao, Shubham Ugare, Gagandeep Singh, Sasa Misailovic ·

    ARQ: A Mixed-Precision Quantization Framework for Accurate and Certifiably Robust DNNs

    arXiv:2410.24214v3 · Mixed precision quantization has become an important technique for optimizing the execution of deep neural networks (DNNs). Certified robustness, which provides provable guarantees about a model's ability to withstand diff…

  15. arXiv cs.CV TIER_1 English(EN) · Róisín Luo, Alexandru Drimbarean, James McDermott, Colm O'Riordan ·

    Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization

    arXiv:2408.00923v2 · This paper explores a novel paradigm in low-bit (i.e. 4-bits or lower) quantization, differing from existing state-of-the-art methods, by framing optimal quantization as an architecture search problem within convolutional neural…

  16. Mastodon (sigmoid.social) TIER_1 English(EN) ·

    TurboQuant: A First-Principles Walkthrough

    A brisk, brilliantly coded tutorial on vector quantisation: how far you can push compression on model KV caches and embeddings without breaking what matters. The interactive slider(...) #ai #javascript #ml #quantization #tutorial #…
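Sources 6 (ConQuR) and 11 (LATMiX) both start from the observation that activation outliers inflate low-bit quantization error and that invertible transforms applied to activations before quantization can reduce it. The sketch below illustrates only that underlying idea, using a fixed orthonormal Hadamard rotation; it is not ConQuR's optimized rotations or LATMiX's learnable affine transformations. The activation vector is made-up data, and all function names are hypothetical.

```python
import math


def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform (Sylvester order).

    len(v) must be a power of two; runs in O(n log n).
    """
    v = list(v)
    n = len(v)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b  # butterfly
        h *= 2
    return v


def hadamard_rotate(v):
    """Orthonormal rotation H / sqrt(n); applying it twice returns the input."""
    s = math.sqrt(len(v))
    return [x / s for x in fwht(v)]


def fake_quant(v, bits=4):
    """Symmetric per-tensor round-to-nearest quantize-then-dequantize."""
    qmax = 2 ** (bits - 1) - 1
    m = max(abs(x) for x in v)
    scale = m / qmax if m > 0 else 1.0
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in v]


def mse(a, b):
    """Mean squared reconstruction error."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Made-up activation vector: the last channel is an outlier.
x = [0.5, -1.0, 0.8, -0.3, 1.2, -0.7, 0.4, 12.0]

plain_err = mse(x, fake_quant(x))
# Rotate, quantize in the rotated basis, then rotate back (H/sqrt(n) is its own inverse).
rotated_err = mse(x, hadamard_rotate(fake_quant(hadamard_rotate(x))))

print(f"4-bit MSE without rotation:      {plain_err:.4f}")
print(f"4-bit MSE with Hadamard rotation: {rotated_err:.4f}")
```

Without the rotation, the outlier sets a per-tensor step of 12/7, so most small channels round to zero. After the rotation the outlier's magnitude is spread over all eight coordinates, the step shrinks, and the error drops from roughly 0.30 to roughly 0.08 on this input. Because the rotation is orthonormal, error measured in the rotated basis equals error in the original basis.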