Brief · PulseAugur

RESEARCH · arXiv cs.LG English(EN) · 2w · [38 sources]

Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit

Researchers have developed several new methods to improve the efficiency and accuracy of quantizing large language models (LLMs). These techniques aim to reduce the memory footprint and computational cost of LLMs, making them more accessible for deployment on resource-constrained devices. Innovations include calibration-free bit allocation for Mixture-of-Experts (MoE) models, outlier injection to exploit quantization vulnerabilities, and hardware-friendly mixed-precision quantization frameworks.

IMPACT: These advances in LLM quantization could significantly lower deployment costs and make the models usable across a wider range of applications and hardware.

arXiv
MoE-LLMs
GEMQ
Mixture-of-Experts Large Language Models
ReSpinQuant
NeUQI
MoBiQuant
WINDQuant
InfoQuant
FP8
INT8
Qwen
INT4
LLaMA
WaterSIC
LLM
GPTQ
GGUF
LLaMA-2-7B
Mixture-of-Experts (MoE)
Qwen1.5-MoE
EmaQ
EmaQ-LT
AlphaQ
OASIS
LLaMA-3.1-8B