PulseAugur

New AI Research Focuses on Model Efficiency via Quantization and Token Pruning

Researchers are developing new methods to make AI models more efficient through quantization and token pruning. One approach, PeRQ, improves post-training quantization by redistributing activation mass before rotation, which yields significant accuracy gains for models such as Llama3 1B. Another method, OccamToken, prunes visual tokens in vision-language models (VLMs) using register-anchored relative evidence testing, reducing the token count while preserving accuracy. Clark Hash is a stateless codec for compact neural embedding storage that cuts space requirements by 32x with minimal accuracy loss. JacQuant is a quantization-aware training framework that learns Jacobian surrogates to stabilize and accelerate training, achieving higher accuracy than traditional methods for ultra-low-bit LLM quantization.
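As a rough illustration of what a stateless embedding codec of this kind might look like, the sketch below follows the steps named in the Clark Hash abstract quoted in the coverage list (normalize, sparse signed Johnson-Lindenstrauss projection, clip, fixed-width scalar code). It is not the paper's implementation: the dimensions, sparsity, clip range, bit width, seed, and every function name are hypothetical, and the sizes were picked only so that the storage arithmetic works out to 32x.

```python
import math
import random

# All constants below are assumptions for illustration, not values from the paper.
D_IN, D_OUT = 256, 64         # input and projected dimensions
NNZ, BITS, CLIP = 8, 4, 3.0   # nonzeros per projection row, code width, clip range
SEED = 0

def make_projection(d_in=D_IN, d_out=D_OUT, nnz=NNZ, seed=SEED):
    """Rebuild the sparse signed projection from a seed; nothing is trained or stored."""
    rng = random.Random(seed)
    rows = []
    for _ in range(d_out):
        cols = rng.sample(range(d_in), nnz)
        rows.append([(c, rng.choice((-1.0, 1.0))) for c in cols])
    return rows

def encode(x, rows):
    """Normalize, project, clip, and map each coordinate to a BITS-wide integer code."""
    norm = math.sqrt(sum(v * v for v in x))  # assumes x is not the zero vector
    # Rescale so projected coordinates have roughly unit variance before clipping.
    scale = math.sqrt(len(x) / len(rows[0])) / norm
    levels = 2 ** BITS - 1
    codes = []
    for row in rows:
        y = scale * sum(sign * x[c] for c, sign in row)
        y = max(-CLIP, min(CLIP, y))
        codes.append(round((y + CLIP) / (2 * CLIP) * levels))
    return codes

def decode(codes):
    """Map integer codes back to approximate projected values in [-CLIP, CLIP]."""
    levels = 2 ** BITS - 1
    return [q / levels * 2 * CLIP - CLIP for q in codes]

def compression_ratio(d_in=D_IN, d_out=D_OUT, bits=BITS):
    """Bits of a float32 vector divided by bits of the (ideally bit-packed) code."""
    return (d_in * 32) / (d_out * bits)

rng = random.Random(1)
vector = [rng.gauss(0.0, 1.0) for _ in range(D_IN)]
codes = encode(vector, make_projection())
print(len(codes), compression_ratio())
```

Because the projection is regenerated from the seed, the only thing stored per vector is the list of codes. The sketch keeps each code as a Python integer; a real codec would pack them into bytes, which is what the ratio above assumes.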

IMPACT These advancements in quantization and token pruning promise more efficient AI models, enabling wider deployment and reducing computational costs.

RANK_REASON The cluster consists of multiple arXiv papers detailing novel research in AI model optimization techniques.


AI-generated summary · Google Gemini · from 9 sources.

COVERAGE [9]

  1. arXiv cs.AI TIER_1 English(EN) · Sai Sanjeet, Ian Colbert, Pablo Monteagudo-Lago, Giuseppe Franco, Yaman Umuroglu, Nicholas J. Fraser

    Pushing the Limits of Block Rotations in Post-Training Quantization

    arXiv:2601.22347v2 Announce Type: replace-cross Abstract: Recent post-training quantization (PTQ) methods have adopted block rotations to diffuse outliers prior to rounding. While this reduces the overhead of online full-vector rotations, the effect of block structure on outlier …

  2. arXiv cs.AI TIER_1 English(EN) · Geng Li, Guohao Chen, Ting Chen, Shilin Shan, Kuangji Zuo, Bofan Lyu, Tuo An, Gen Li, Jianfei Yang

    OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

    arXiv:2605.29657v1 Announce Type: cross Abstract: Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assi…

  3. arXiv cs.AI TIER_1 English(EN) · Stanislav Kirdey, Clark Labs Inc

    Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

    arXiv:2605.28034v1 Announce Type: new Abstract: Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-…

  4. arXiv cs.AI TIER_1 English(EN) · Zhanfeng Feng, Shuai Guo, Xin Di, Long Peng, Yang Cao, Zhengjun Zha

    Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

    arXiv:2605.26628v1 Announce Type: new Abstract: This report describes Tail-Aware HiFloat4, our submission to the low-bit text-to-video generation quantization challenge. Our method adapts the public ViDiT-Q post-training quantization pipeline to Wan2.2 under the HiFloat4 numerica…

  5. Hugging Face Daily Papers TIER_1 English(EN)

    Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

    Clark Hash is a compact, stateless codec that reduces neural embedding storage size by 32x through deterministic sparse projections and scalar quantization while maintaining high similarity accuracy.

  6. Hugging Face Daily Papers TIER_1 English(EN)

    Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

    This report describes Tail-Aware HiFloat4, our submission to the low-bit text-to-video generation quantization challenge. Our method adapts the public ViDiT-Q post-training quantization pipeline to Wan2.2 under the HiFloat4 numerical format. We quantize the main linear layers in …

  7. arXiv cs.LG TIER_1 English(EN) · Kai Yi, Vignesh Vivekraja, Harshit Khaitan, Steven Li

    JacQuant: STE-Free Quantization-Aware Training via Learned Jacobian Surrogates

    arXiv:2605.25469v1 Announce Type: new Abstract: Quantization-aware training (QAT) is widely deployed but typically relies on the Straight-Through Estimator (STE), which passes gradients through non-differentiable quantizers by fiat. This often makes training brittle near bin boun…

  8. Hugging Face Daily Papers TIER_1 English(EN)

    JacQuant: STE-Free Quantization-Aware Training via Learned Jacobian Surrogates

    Quantization-aware training (QAT) is widely deployed but typically relies on the Straight-Through Estimator (STE), which passes gradients through non-differentiable quantizers by fiat. This often makes training brittle near bin boundaries and weakly aligned with the actual behavi…

  9. r/StableDiffusion TIER_2 English(EN) · /u/AgeNo5351

    A Wan 2.2 post-training Quant . 1 model instead of high + low

    https://www.reddit.com/r/StableDiffusion/comments/1tpcm59/a_wan_22_posttraining_quant_1_model_instead_of/
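Entry 1 in the coverage list describes block rotations that diffuse outliers before rounding. A minimal sketch of that general idea follows, assuming a normalized Hadamard matrix as the rotation and a plain max-abs round-to-nearest quantizer. Both choices, the toy input, and all function names are illustrative assumptions, not the method from the paper.

```python
import math

def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [r + r for r in h] + [r + [-v for v in r] for r in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in r] for r in h]

def block_rotate(x, block):
    """Apply the same block-by-block rotation to each consecutive block of x.

    len(x) must be a multiple of `block`. The normalized Sylvester matrix is
    symmetric and orthogonal, so applying this function twice restores x.
    """
    h = hadamard(block)
    out = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        out.extend(sum(h[i][j] * chunk[j] for j in range(block))
                   for i in range(block))
    return out

def fake_quant(x, bits=4):
    """Symmetric round-to-nearest with one max-abs scale (an assumed quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    return [round(v / scale) * scale for v in x]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Toy input: small values plus one injected outlier.
x = [0.1 * ((i % 5) - 2) for i in range(16)]
x[3] = 8.0

for block in (4, 16):
    rotated = block_rotate(x, block)
    # Quantize in the rotated space, then rotate back to compare with x.
    restored = block_rotate(fake_quant(rotated), block)
    print(block, max(abs(v) for v in rotated), mse(x, restored))
```

On this toy vector the rotation spreads the outlier's magnitude across its block, so the peak value that sets the quantization scale drops, and a block of 16 (a full-vector rotation here) lowers it further than a block of 4. Whether that translates into lower end-to-end error depends on the data and the quantizer, which this sketch does not attempt to settle.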