New research explores advanced compression techniques for AI models

By PulseAugur Editorial · [24 sources] · 2026-06-01 04:00

Researchers are exploring novel methods for compressing large models and datasets to improve efficiency. Papers discuss unifying dataset pruning and distillation, bootstrapped tokenization for image generation, and activation-informed low-rank compression for LLMs and VLMs. Other work focuses on generic triple-latent sequence models, theoretical aspects of prediction under imperfect compression, and jointly optimizing architectural and quantization choices for LLM compression.

IMPACT Advances in compression techniques could significantly reduce deployment costs and increase the accessibility of large AI models.

RANK_REASON Multiple arXiv papers detailing new methods and theoretical analyses for AI model and data compression.

AI-generated summary · Google Gemini · from 24 sources.

COVERAGE [24]

arXiv cs.AI TIER_1 English(EN) · Hoang-Loc La, Truong-Thanh Le, Amir Taherkordi, Phuong Hoai Ha · 2026-06-09 04:00

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

arXiv:2606.07819v1 Announce Type: new Abstract: Recently, the efficiency of Large Language Models (LLMs) deployment has become a critical concern in practical applications. While post-training quantization (PTQ) and structural pruning are established techniques for reducing memor…
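For readers unfamiliar with post-training quantization, the sketch below shows its simplest form: symmetric round-to-nearest quantization with a single scale per tensor. It is a generic illustration, not the joint pruning and mixed-precision method of the paper above; the function names and weight values are invented for the example.

```python
def quantize_symmetric(weights, num_bits=8):
    """Round-to-nearest symmetric quantization with one scale per tensor."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Map each weight to the nearest integer code, clamped to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floating-point weights from integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]      # made-up values
codes, scale = quantize_symmetric(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

With round-to-nearest, the per-weight error is at most half a quantization step (`scale / 2`), which is why a few large outlier weights, by inflating the scale, can hurt accuracy for every other weight in the tensor.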
arXiv cs.CL TIER_1 English(EN) · Ernests Lavrinovics, Marco Letizia, Roy Janco, Shai Segal, Johannes Bjerva, Maurizio Pierini · 2026-06-08 04:00

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

arXiv:2606.07098v1 Announce Type: new Abstract: We present SigmaScale, a method for learning auxiliary scaling matrices $S$ to aid truncated Singular Value Decomposition (SVD) based Large Language Model (LLM) compression. Instead of deriving scaling matrices analytically, SigmaSc…
arXiv cs.AI TIER_1 English(EN) · Liangji Zhu, Sanjay Ranka, Anand Rangarajan · 2026-06-06 04:00

Residual Modeling for High-Fidelity Learned Compression of Scientific Data

arXiv:2606.05389v1 Announce Type: new Abstract: Lossy compression is essential for massive spatiotemporal data from scientific simulations. Learned compressors can achieve high compression ratios at moderate accuracy targets, but their aggregate reconstruction losses do not guara…
arXiv cs.AI TIER_1 English(EN) · Rui Wang, Yan Zhao, Li Song, Zhengxue Cheng · 2026-06-06 04:00

LLMCodec: Adapting Video Codecs for Efficient Weight Compression of Large Language Models

arXiv:2606.05861v1 Announce Type: cross Abstract: The rapid development of large language models (LLMs) has led to remarkable advances in natural language processing. However, the increasing scale of these models introduces substantial challenges in terms of storage, transmission,…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-05 09:48

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

SigmaScale learns auxiliary scaling matrices to improve truncated SVD-based LLM compression by adapting to individual weight structures through activation-aware transformations.
arXiv cs.CL TIER_1 English(EN) · Maurizio Pierini · 2026-06-05 09:48

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

We present SigmaScale, a method for learning auxiliary scaling matrices $S$ to aid truncated Singular Value Decomposition (SVD) based Large Language Model (LLM) compression. Instead of deriving scaling matrices analytically, SigmaScale optimizes two sets of vectors that define di…
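A rough sketch of why truncated SVD saves memory: a dense `rows × cols` weight matrix is replaced by two rank-`rank` factors, so storage drops from `rows * cols` to `rank * (rows + cols)` parameters. The helper names and the 4096 × 4096 layer size are assumptions for illustration, the singular values are assumed to be folded into one factor, and the learned scaling matrices described above are not modeled here.

```python
def low_rank_params(rows, cols, rank):
    """Parameters stored by a rank-`rank` factorization (rows x rank, rank x cols)."""
    return rank * (rows + cols)


def compression_ratio(rows, cols, rank):
    """Dense parameter count divided by factored parameter count."""
    return (rows * cols) / low_rank_params(rows, cols, rank)


def break_even_rank(rows, cols):
    """Largest rank that still stores strictly fewer parameters than the dense matrix."""
    # Requires rank * (rows + cols) < rows * cols.
    return (rows * cols - 1) // (rows + cols)


# Hypothetical square projection layer.
ratio = compression_ratio(4096, 4096, 512)
limit = break_even_rank(4096, 4096)
```

For a square 4096 × 4096 layer, the factorization only saves memory below rank 2048, which is one reason low-rank methods care so much about how fast accuracy degrades as the rank is reduced.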
arXiv cs.CL TIER_1 English(EN) · Liu Xiao · 2026-06-05 04:00

Generic Triple-Latent Compression with Gated Associative Retrieval

arXiv:2606.05175v1 Announce Type: new Abstract: We study generic triple-latent sequence models that maintain a running token state and compressed pair-memory pathway to capture higher-order token interactions without benchmark-specific parsing. The triple-latent family improves a…
arXiv cs.LG TIER_1 English(EN) · Lingao Xiao, Songhua Liu, Yang He, Xinchao Wang · 2026-06-05 04:00

Unifying Dataset Pruning and Distillation for Efficient Large-scale Compression

arXiv:2502.06434v2 Announce Type: replace-cross Abstract: Dataset pruning (DP) and dataset distillation (DD) fundamentally differ in their outputs: DP selects original image subsets, while DD generates synthetic images. Recently, DD's increasing reliance on original images sugges…
arXiv cs.LG TIER_1 English(EN) · Haozhe Chi, Jinghan Li, Hao Jiang, Wu Sheng, Yi Ma, Jing Wang, Yadong Mu · 2026-06-05 04:00

Balancing Image Compression and Generation with Bootstrapped Tokenization

arXiv:2606.05552v1 Announce Type: new Abstract: Despite progress in image tokenization, standard methods encode redundant information by mixing all granularities within each token, thus redundancy persists between tokens. The mix of information of different granularity also compl…
arXiv cs.CL TIER_1 English(EN) · Ryan Solgi, Parsa Madinei, Jiayi Tian, Rupak Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang · 2026-06-05 04:00

Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM

arXiv:2510.05544v2 Announce Type: replace Abstract: Large language models (LLM) and vision-language models (VLM) have achieved state-of-the-art performance, but they impose significant memory and computing challenges in deployment. We present a novel low-rank compression framewor…
arXiv cs.AI TIER_1 English(EN) · Hoang-Loc La, Truong-Thanh Le, Amir Taherkordi, Phuong Hoai Ha · 2026-06-04 04:00

LLM Compression with Jointly Optimizing Architectural and Quantization choices

arXiv:2606.04063v1 Announce Type: cross Abstract: Deploying large language models (LLMs) is challenging due to their significant memory and computational requirements. While some methods address this by developing small or tiny language models from scratch, these approaches deman…
arXiv cs.LG TIER_1 English(EN) · Qian Li, Xinyu Mao, Shang-Hua Teng, Guangxu Yang · 2026-06-04 04:00

Prediction Under Imperfect Compression: A Theory of Approximate MDL

arXiv:2606.04834v1 Announce Type: new Abstract: Minimum Description Length (MDL) formalizes the principle of Occam's razor by optimizing the total description length: $L(\mathrm{model})+L(\mathrm{data} \ | \ \mathrm{model})$. For sequential prediction, the MDL method repeatedly s…
arXiv cs.LG TIER_1 English(EN) · Guangxu Yang · 2026-06-03 13:03

Prediction Under Imperfect Compression: A Theory of Approximate MDL

Minimum Description Length (MDL) formalizes the principle of Occam's razor by optimizing the total description length: $L(\mathrm{model})+L(\mathrm{data} \ | \ \mathrm{model})$. For sequential prediction, the MDL method repeatedly selects a model with a minimum objective score of…
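The two-part objective $L(\mathrm{model})+L(\mathrm{data} \ | \ \mathrm{model})$ quoted above can be made concrete with a toy example. The sketch below scores a few Bernoulli models of a 0/1 sequence and picks the one with the shortest total code length. The candidate models and their model costs in bits are made up for illustration, and the sketch covers only the classical objective, not the paper's sequential or approximate setting.

```python
import math


def data_bits(bits, p_one):
    """L(data | model): code length of a 0/1 sequence under a Bernoulli(p_one) model."""
    return -sum(math.log2(p_one if b == 1 else 1.0 - p_one) for b in bits)


def mdl_select(bits, candidates):
    """Return (total_bits, name) for the candidate minimizing L(model) + L(data | model)."""
    scored = [(model_bits + data_bits(bits, p_one), name)
              for name, p_one, model_bits in candidates]
    return min(scored)


# (name, probability of a 1, assumed cost in bits to describe the model)
candidates = [
    ("fair coin", 0.5, 1.0),
    ("biased 0.75", 0.75, 3.0),
    ("biased 0.9", 0.9, 3.0),
]

sequence = [1] * 18 + [0] * 2
total_bits, best = mdl_select(sequence, candidates)
short_total, short_best = mdl_select([1, 0], candidates)
```

On the 20-symbol sequence the more expensive biased model wins because it shortens the data term by more than its extra model cost; on the two-symbol sequence the cheap fair-coin model wins, which is the Occam's razor trade-off the objective encodes.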
arXiv cs.CL TIER_1 English(EN) · Justice Owusu Agyemang, Jerry John Kponyo, Kwame Opuni-Boachie Obour Agyekum, Francisca Adoma Acheampong, Kwame Agyeman-Prempeh Agyekum, James Dzisi Gadze · 2026-06-03 04:00

Entropy Gate: Entropy Quenching for Near-Lossless Token Compression in LLM Pipelines

arXiv:2606.03739v1 Announce Type: new Abstract: LLM pipelines waste substantial token budgets on low-information content: repeated context, verbose responses, and redundant boilerplate. We introduce Entropy Gate, a token compression framework applying entropy quenching, a ther…
arXiv cs.AI TIER_1 English(EN) · Artur Zagitov, Alexander Miasnikov, Maxim Krutikov, Vladimir Aletov, Gleb Molodtsov, Nail Bashirov, Artem Tsedenov, Aleksandr Beznosikov · 2026-06-03 04:00

Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression

arXiv:2606.03465v1 Announce Type: cross Abstract: Post-training compression is essential for deploying large language models (LLMs) under tight resource constraints. Tensor decompositions have emerged as a promising direction, offering compact parameterizations well suited to Tra…
arXiv cs.CL TIER_1 English(EN) · James Dzisi Gadze · 2026-06-02 14:55

Entropy Gate: Entropy Quenching for Near-Lossless Token Compression in LLM Pipelines

LLM pipelines waste substantial token budgets on low-information content: repeated context, verbose responses, and redundant boilerplate. We introduce Entropy Gate, a token compression framework applying entropy quenching, a thermodynamic process that progressively freezes out…
arXiv cs.LG TIER_1 English(EN) · Aleksandr Beznosikov · 2026-06-02 10:45

Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression

Post-training compression is essential for deploying large language models (LLMs) under tight resource constraints. Tensor decompositions have emerged as a promising direction, offering compact parameterizations well suited to Transformer weight structures. However, existing stud…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 10:45

Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression

Post-training compression is essential for deploying large language models (LLMs) under tight resource constraints. Tensor decompositions have emerged as a promising direction, offering compact parameterizations well suited to Transformer weight structures. However, existing stud…
arXiv cs.AI TIER_1 English(EN) · Yujia Tong, Yuxi Wang, Yunyang Wan, Tian Zhang, Junhao Dong, Jingling Yuan · 2026-06-02 04:00

Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

arXiv:2606.01850v1 Announce Type: new Abstract: Model compression techniques such as quantization and pruning are widely used to reduce the deployment cost of large language models (LLMs), with existing evaluations focusing almost exclusively on accuracy preservation. However, in…
arXiv cs.LG TIER_1 English(EN) · Wneya Yu, Chao Zhang, Li Wang, Samson Lasaulce, Merouane Debbah · 2026-06-02 04:00

ProjQ: Project-and-Quantize for Adapter-Aware LLM Compression

arXiv:2606.00494v1 Announce Type: new Abstract: Post-Training Quantization (PTQ) and Low-Rank Adaptation (LoRA) constitute the standard pipeline for efficient Large Language Model (LLM) deployment. However, applying them sequentially poses a problem: PTQ often leaves behind rando…
arXiv cs.AI TIER_1 English(EN) · Elia Cunegatti, Marcus Vukojevic, Erik Nielsen, Giovanni Iacca · 2026-06-02 04:00

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

arXiv:2606.02559v1 Announce Type: cross Abstract: Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-l…
arXiv cs.AI TIER_1 English(EN) · Giovanni Iacca · 2026-06-01 17:52

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argu…
arXiv cs.LG TIER_1 English(EN) · Snigdha Chandan Khilar · 2026-06-01 04:00

Cross-Layer Subspace Coupling for LLM Compression: A Unifying Framework and Its Empirical Limits

arXiv:2605.30836v1 Announce Type: new Abstract: Recent SVD-based compression methods for large language models, such as SVD-LLM and Basis Sharing, can be unified under one optimization problem. While mathematical proofs and tests on Pythia models show this unified approach improves we…
r/LocalLLaMA TIER_1 English(EN) · /u/RudeChocolate9217 · 2026-06-05 02:38

proveKV – Honest 36× lossless (vs f32, 18x vs fp16) KV‑cache compression for LLMs (zero PPL regression)

I’m sharing a new open‑source repo that demonstrates a reproducible KV‑cache compression technique.
- Result: 36× lossless / 68× lossy memory reduction vs. f32‑raw KV cache on SmolLM2‑1.7B + WikiText‑2 (0% ΔPPL).
- Transpare…
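The two headline ratios in the post title (36× vs f32, 18× vs fp16) are consistent with each other, since an fp16 element takes half the bytes of an f32 element. The short sketch below checks only that arithmetic; it says nothing about whether the claimed compression is achieved. The helper names are hypothetical.

```python
BYTES_PER_ELEMENT = {"f32": 4, "fp16": 2}


def ratio_against(baseline, known_ratio, known_baseline="f32"):
    """Convert a compression ratio quoted against one dtype to another baseline dtype."""
    return known_ratio * BYTES_PER_ELEMENT[baseline] / BYTES_PER_ELEMENT[known_baseline]


def effective_bits_per_element(ratio, baseline="f32"):
    """Average bits stored per cache element implied by a compression ratio."""
    return 8 * BYTES_PER_ELEMENT[baseline] / ratio


fp16_ratio = ratio_against("fp16", 36)          # ratio quoted vs f32, restated vs fp16
lossless_bits = effective_bits_per_element(36)  # implied average bits per element
```

Restated this way, a 36× reduction against f32 implies under one bit per cache element on average, which is a useful sanity check to keep in mind when reading the repository's methodology.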
