Activation Aware Quantization

PulseAugur coverage of Activation Aware Quantization — every cluster mentioning Activation Aware Quantization across labs, papers, and developer communities, ranked by signal.

Total: 16 over 90d

Releases: 0 over 90d

Papers: 8 over 90d

TIER MIX · 90D

research 6
tool 8
commentary 2

SENTIMENT · 30D

6 days with sentiment data

RECENT · PAGE 1/1 · 16 TOTAL

TOOL · CL_148005 · Jul 17 · 04:00

Quantization impacts code generation models differently, study finds

A new study investigates the impact of various quantization methods on the performance of large code generation models when run on resource-constrained hardware. Researchers evaluated six state-of-the-art techniques, in…
TOOL · CL_133678 · Jul 9 · 07:02

Quantization shrinks LLMs by 75% for local use, balancing size and quality

Quantization is a crucial technique for making large language models usable on consumer hardware by reducing their size and memory requirements. This process involves representing model parameters with fewer bits, such …
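The 75% figure in the headline is consistent with simple bit-width arithmetic. The sketch below is illustrative only: the 16-bit baseline, the 4-bit target, and the 7-billion-parameter model are assumptions (the summary above is truncated and does not state them), and the estimate ignores per-group scales, zero-points, and any layers left unquantized.

```python
# Rough weight-memory arithmetic for quantized models.
# Assumptions (not from the source): 16-bit baseline, 4-bit target,
# and a hypothetical 7B-parameter model. Metadata overhead is ignored.

def weight_bytes(num_params: int, bits_per_param: float) -> float:
    """Bytes needed to store num_params weights at the given bit-width."""
    return num_params * bits_per_param / 8


def size_reduction(from_bits: float, to_bits: float) -> float:
    """Fraction of weight storage saved when moving between bit-widths."""
    return 1 - to_bits / from_bits


HYPOTHETICAL_PARAMS = 7_000_000_000  # illustrative model size

fp16_gb = weight_bytes(HYPOTHETICAL_PARAMS, 16) / 1e9
int4_gb = weight_bytes(HYPOTHETICAL_PARAMS, 4) / 1e9
saved = size_reduction(16, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
print(f"reduction:      {saved:.0%}")
```

Under these assumptions the weights drop from about 14 GB to about 3.5 GB. Real quantized files are somewhat larger because formats store extra scaling metadata alongside the low-bit weights.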
COMMENTARY · CL_130187 · Jul 7 · 13:01

Self-hosting LLMs shifts cost to continuous evaluation

Self-hosting open-weight large language models shifts the primary cost from API usage to the ongoing effort of model evaluation. Quantization, a common technique to reduce model size for local use, can subtly degrade pe…
TOOL · CL_122053 · Jul 2 · 13:31

Optimizing SLM Serving: AWQ, GPTQ, GGUF, and Dynamic LoRA

This article explores optimizing the serving of small language models (SLMs) for enterprise environments, focusing on reducing latency, increasing concurrency, and minimizing costs. It compares three quantization format…
TOOL · CL_115676 · Jun 29 · 04:00

OpenPangu LLM quantization on Ascend NPUs shows 8-bit is lossless, 4-bit degrades 1B model

A new study investigates the effectiveness of various post-training quantization methods for the OpenPangu large language models when deployed on Ascend NPUs. Researchers found that 8-bit weight-only quantization is nea…
TOOL · CL_110111 · Jun 24 · 21:23

GLM-5.2 speculative decode runs on 4x DGX GB10 cluster

A user successfully implemented GLM-5.2 with MTP speculative decoding on a 4x DGX GB10 cluster, achieving approximately 9.4 tokens/second. This involved reconstructing missing build modifications from public kernels and…
COMMENTARY · CL_86313 · Jun 11 · 22:29

User seeks help optimizing Qwen 3.5 9B inference on MI50 GPU

A user is seeking assistance with configuring the Qwen 3.5 9B model for optimal local inference on a MI50 32GB GPU. They are experiencing slow speeds, below 1 token per second, while using a specific vLLM fork. The user…
TOOL · CL_84316 · Jun 11 · 01:13

LLM Quantization Formats: GGUF, GPTQ, AWQ, and NF4 Compared

The article compares four major LLM weight quantization formats: GGUF, GPTQ, AWQ, and NF4. Quantization is crucial for reducing model size to fit within limited hardware constraints, such as consumer GPUs or unified mem…
RESEARCH · CL_50600 · May 25 · 14:06

New research explores quantization benefits for transformer models

Two new research papers explore methods to improve the efficiency of transformer models, particularly for deployment on edge devices. The first paper introduces OrpQuant, a framework for multiplier-free, power-of-two qu…
RESEARCH · CL_48868 · May 21 · 22:23

New methods enhance LLM quantization for efficiency and accuracy

Researchers have developed several new methods to improve the efficiency and accuracy of quantizing large language models (LLMs). These techniques aim to reduce the memory footprint and computational cost of LLMs, makin…
TOOL · CL_27223 · May 11 · 21:34

ExLlamaV3, Unsloth Qwen, and Phi3 agent see major local AI updates

This week's local AI news highlights significant updates to the ExLlamaV3 inference library, enhancing efficiency for running quantized Llama models on consumer GPUs. Additionally, new GGUF-quantized versions of Qwen 3.…
RESEARCH · CL_23571 · May 8 · 21:34

Local AI tools boost LLM speeds with new prediction and decoding techniques

Recent updates in the local AI community are enhancing inference speeds and providing practical benchmarks for open-weight models. The llama.cpp project now supports Multi-Token Prediction (MTP), which has shown a 40% s…
RESEARCH · CL_15961 · May 5 · 04:00

New methods accelerate LLMs via efficient sparsification, quantization, and compression

Researchers have developed several new methods for compressing and optimizing large language models (LLMs) to improve efficiency and reduce computational costs. SparseForge focuses on efficient semi-structured sparsific…
RESEARCH · CL_14463 · Apr 27 · 04:00

New research explores LLM security, efficiency, and training optimization

Researchers are developing novel methods to enhance the efficiency and security of Large Language Models (LLMs). One approach, "Widening the Gap," exploits outlier injection to compromise LLM quantization, demonstrating…
RESEARCH · CL_01274 · May 24 · 00:00

Hugging Face introduces advanced quantization techniques for efficient LLMs

Researchers are developing advanced quantization techniques to make large language models (LLMs) more efficient. New methods like AutoRound, LATMiX, and GSQ aim to reduce model size and computational requirements, enabl…
RESEARCH · CL_01035 · Jan 18 · 00:00

Optimizing Transformer Inference: Techniques for Faster, Cheaper Large Models

Large transformer models present significant inference challenges due to their substantial memory footprint and computation costs, which scale quadratically with input length. Researchers and practitioners are exploring…
