A Reddit user on r/LocalLLaMA has proposed a way to shrink quantized large language model files by storing indices into a table of scale values instead of the scale values themselves. The technique was demonstrated on Qwen 3.5 2B and Qwen 3.6 27B models using Q4_0 quantization and could save approximately 318 MB on the Qwen 3.6 27B model. The user walked through the arithmetic: replacing the 16-bit scale stored for each block of 32 weights with an 11-bit index saves 5 bits per block, which adds up across the whole model, with potential further savings in the token embeddings.
AI IMPACT Potential for reduced storage requirements for quantized LLMs, making them more accessible on local hardware.
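The arithmetic behind the scale-index idea can be sketched in a few lines of Python. This is a minimal illustration, not the Reddit user's code: the function names, the single shared lookup table, and the block count in the usage note are all assumptions made for the example.

```python
def index_bits_needed(unique_scales: int) -> int:
    """Smallest index width (in bits) that can address `unique_scales` table entries."""
    return max(1, (unique_scales - 1).bit_length())


def build_scale_table(scales):
    """Deduplicate per-block scales into a lookup table plus one index per block.

    Assumption: one table shared by all blocks; the original post may organize it differently.
    """
    table = sorted(set(scales))
    position = {value: i for i, value in enumerate(table)}
    indices = [position[value] for value in scales]
    return table, indices


def bytes_saved(n_blocks: int, scale_bits: int = 16, index_bits: int = 11,
                table_entries: int = 0) -> float:
    """Bytes reclaimed by swapping each per-block scale for an index.

    The lookup table itself costs `table_entries * scale_bits` bits, which is subtracted.
    """
    saved_bits = n_blocks * (scale_bits - index_bits) - table_entries * scale_bits
    return saved_bits / 8


if __name__ == "__main__":
    # Hypothetical block count, chosen only to show the calculation.
    print(bytes_saved(1_000_000, table_entries=2048))
```

An 11-bit index can address at most 2,048 distinct scale values, so the approach only works if the model's blocks use no more than that many distinct scales; `index_bits_needed` makes that dependency explicit.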
RANK_REASON User-generated technical analysis and proposed optimization for LLM quantization.
AI-generated summary · Google Gemini · from 1 source.