PulseAugur

Stateful Transformers boost streaming inference; Intel releases AutoRound quantization toolkit

A new paper introduces a stateful transformer inference engine that speeds up streaming workloads by maintaining a persistent KV cache, so accumulated context is not prefilled again on every query. The paper reports query latency that is independent of accumulated context size and up to a 5.9x speedup over existing engines on market-data benchmarks. Separately, Intel has released AutoRound, a quantization toolkit for LLMs and VLMs that targets high accuracy at ultra-low bit widths (2-4 bits), offers broad hardware compatibility, and integrates with frameworks such as vLLM and Transformers.
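To make the persistent-cache idea concrete, here is a minimal sketch in plain Python. It is not the paper's engine: `encode`, `attend`, `StatefulEngine`, `request_driven_query`, and the `CALLS` counter are hypothetical names, and `encode` is a stand-in for the per-token work a real model would do. It only shows that caching keys and values at ingest time removes the repeated prefill; the attention step here still scans the cache, so it does not reproduce the context-independent latency the paper reports.

```python
import math

# Counts how often the (stand-in) per-token encoding work is performed.
CALLS = {"encode": 0}


def encode(token):
    """Hypothetical stand-in for the per-token key/value computation."""
    CALLS["encode"] += 1
    key = [token, 1.0]
    value = [2.0 * token, -token]
    return key, value


def attend(q, keys, values):
    """Single-head scaled dot-product attention over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def request_driven_query(stream, q):
    """Baseline: re-encode the whole context on every query (O(n) prefill)."""
    pairs = [encode(t) for t in stream]
    return attend(q, [k for k, _ in pairs], [v for _, v in pairs])


class StatefulEngine:
    """Keeps keys and values across queries; each token is encoded once."""

    def __init__(self):
        self.keys = []
        self.values = []

    def ingest(self, token):
        # Called as data arrives, independent of any query.
        key, value = encode(token)
        self.keys.append(key)
        self.values.append(value)

    def query(self, q):
        # No encode() calls here: the cache already holds the context.
        return attend(q, self.keys, self.values)
```

In this toy setup both paths return the same attention output, but only the request-driven path calls `encode` again for every token on each query.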

IMPACT New inference techniques and quantization methods reduce computational costs, potentially enabling wider deployment of large models.

RANK_REASON The cluster contains an academic paper detailing a new inference technique and a software toolkit for model quantization.

AI-generated summary · Google Gemini · from 4 sources.

COVERAGE [4]

  1. arXiv cs.LG TIER_1 English(EN) · Victor Norgren ·

    Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

    Conventional transformer inference engines are request-driven, paying an O(n) prefill cost on every query. In streaming workloads, where data arrives continuously and queries probe an ever-growing context, this cost is prohibitive. We introduce a data-driven computational model c…

  2. Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] ·

    Advanced Quantization Algorithm for LLMs https://github.com/intel/auto-round #HackerNews #AdvancedQuantization #LLMs #MachineLearning #AI #Research #Intel

  3. Mastodon — mastodon.social TIER_1 English(EN) · [email protected] ·

    Advanced Quantization Algorithm for LLMs https://github.com/intel/auto-round #HackerNews #Tech #AI

  4. Mastodon — mastodon.social TIER_1 English(EN) · rmathew ·

    An excellent introduction to #quantization used for #LLMs 👌🏽: “Quantization From The Ground Up”, Sam Rose, Ngrok (https://ngrok.com/blog/quantization). On HN: https://news.ycombinator.com/item?id=47519295 #AI #Math #FloatingPoint #NumericalAnalysis #Numbers #NeuralNe…
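For readers new to the quantization items above, the following sketch shows the basic idea of storing weights as low-bit integers. It is plain symmetric round-to-nearest quantization, which is an assumption chosen for illustration and not AutoRound's algorithm or API; `quantize_rtn` and `dequantize` are hypothetical helper names.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization of a list of floats.

    Illustrative baseline only; this is not how AutoRound chooses roundings.
    """
    qmax = 2 ** (bits - 1) - 1  # largest signed level, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Map each weight to the nearest integer level, clamped to the signed range.
    levels = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return levels, scale


def dequantize(levels, scale):
    """Recover approximate float weights from integer levels."""
    return [scale * q for q in levels]
```

With 4 bits each weight is stored as one of 16 integer levels plus a shared scale, and the reconstruction error per weight is at most half the scale.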