A new paper introduces a stateful transformer inference engine that speeds up processing of streaming data by keeping a persistent KV cache across queries. With this approach, query latency is independent of accumulated context size, and the engine achieves up to a 5.9x speedup over existing engines on market-data benchmarks. Separately, Intel has released AutoRound, a quantization toolkit for LLMs and VLMs that targets high accuracy at ultra-low bit widths (2-4 bits) with broad hardware compatibility, and integrates with popular frameworks such as vLLM and Transformers.
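To make the persistent KV cache idea concrete, here is a minimal sketch of a toy single-head attention that keeps its keys and values between calls, so each streamed token is ingested once instead of being reprocessed on every query. The names `StatefulAttention` and `stateless_query` are hypothetical and chosen for illustration; this is not the paper's engine and does not attempt to reproduce its latency result.

```python
import math


class StatefulAttention:
    """Toy single-head attention that keeps keys/values across calls."""

    def __init__(self):
        self.keys = []    # one key vector per token seen so far
        self.values = []  # one value vector per token seen so far

    def append(self, key, value):
        # Ingest one streamed token without touching earlier entries.
        self.keys.append(key)
        self.values.append(value)

    def query(self, q):
        # Softmax attention of q over every cached token.
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in self.keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(self.values[0])
        return [
            sum(w * v[i] for w, v in zip(weights, self.values)) / total
            for i in range(dim)
        ]


def stateless_query(history, q):
    # Baseline for comparison: rebuild the cache from scratch on every query.
    fresh = StatefulAttention()
    for key, value in history:
        fresh.append(key, value)
    return fresh.query(q)
```

Both paths return the same attention output; the difference is that the stateful object pays the ingestion cost once per token, while the stateless baseline pays it again for the whole history on every query.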
Impact: New inference techniques and quantization methods reduce computational costs, potentially enabling wider deployment of large models.
Ranking rationale: The cluster contains an academic paper detailing a new inference technique and a software toolkit for model quantization.
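As a rough illustration of why low bit widths cut costs, the sketch below applies plain symmetric round-to-nearest quantization to one group of weights and computes weight storage at a given bit width. This is the simple baseline, not AutoRound's algorithm; the function names and the 600-million-parameter figure in the usage note are assumptions made for the example.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1  # largest positive code, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest code and clamp to the signed range.
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    # Map integer codes back to approximate floating-point weights.
    return [c * scale for c in codes]


def weight_bytes(num_params, bits):
    # Storage for the weights alone, ignoring scales and other metadata.
    return num_params * bits / 8
```

For a hypothetical model with 600 million parameters, `weight_bytes(600_000_000, 16)` versus `weight_bytes(600_000_000, 4)` shows a fourfold reduction in weight storage, before accounting for per-group scales.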
- AutoRound
- GGUF
- Intel
- LLMs
- Ngrok
- Qwen
- Qwen3-0.6B
- Sam Rose
- SGLang
- SignRoundV1
- SignRoundV2
- Transformers
- vLLM
- LLM-Compressor
- Qwen/Qwen3-0.6B
- Stateful Transformers
AI-generated summary · Google Gemini · from 4 sources.