Brief

last 24h

[6/6] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

RESEARCH · arXiv cs.LG English(EN) · 3d · [2 sources]

Approaching I/O-optimality for Approximate Attention

Researchers have developed a technique that reduces the I/O complexity of attention in large language models, that is, the volume of data transferred between fast and slow memory, which is a critical factor in model efficiency. The approach achieves almost-linear I/O cost in the input size, compared with the quadratic cost of existing methods, and is inspired by recent approximate attention frameworks.

IMPACT Reduces computational overhead for attention, potentially enabling larger models or faster inference.
- FlashAttention
- Alman and Song
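
The fast/slow memory idea in this item can be illustrated with a minimal sketch of exact blockwise attention using an online softmax, the tiling idea associated with FlashAttention (tagged above). This is not the approximate method from the paper. The function names and the `block_size` parameter are hypothetical, and plain Python lists stand in for GPU memory tiles.

```python
import math

def naive_attention(q, keys, values):
    """Reference: compute every score for one query, then softmax-average the values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def blockwise_attention(q, keys, values, block_size=2):
    """Same result, but only one block of keys/values is touched at a time.

    The running max and running sum let earlier partial results be rescaled,
    so the full score vector never has to be held at once.
    """
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]   # "load" one block
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point error for any block size. The blockwise version only changes how much data is live at once, which is the quantity an I/O-complexity analysis counts.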
RESEARCH · arXiv cs.CL English(EN) · 1w · [11 sources]

Dynamic Chunking for Diffusion Language Models

Researchers are exploring new methods to improve the efficiency and scalability of diffusion language models (DLMs) for generating long text sequences. One approach, Block Approximate Sparse Attention (BA-Att), accelerates attention computation by downsampling the attention space, achieving significant speedups while maintaining near-full-attention performance. Another, Dynamic Chunking Diffusion Models (DCDM), replaces fixed positional blocks with content-defined semantic chunks to better capture sequence structure. In addition, continuous diffusion models such as RePlaid demonstrate competitive performance against discrete DLMs, suggesting they are a viable and scalable alternative.

IMPACT New techniques promise faster and more scalable text generation from diffusion models, potentially enabling longer and more coherent outputs.
RESEARCH · Hugging Face Daily Papers English(EN) · 2w · [5 sources]

Variational Linear Attention: Stable Associative Memory for Long-Context Transformers

Researchers are developing new attention mechanisms to handle increasingly long contexts in large language models. One approach, Runtime-Certified Bounded-Error Quantized Attention, uses tiered KV caches to compress memory while guaranteeing fallback to exact attention, preserving quality on tasks such as language modeling and retrieval. Another method, DashAttention, uses differentiable sparse hierarchical attention to adaptively select relevant tokens, achieving high sparsity with accuracy comparable to full attention and improving on existing hierarchical methods. Variational Linear Attention (VLA) reframes linear attention as a regularized least-squares problem, limiting state norm growth and improving associative recall accuracy while also achieving significant speedups.

IMPACT These advancements in attention mechanisms promise to significantly improve the efficiency and capability of LLMs in processing and understanding long contexts.
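
To make the "regularized least-squares" framing concrete, here is a minimal sketch contrasting the plain additive linear-attention write with a single gradient step on a ridge-style objective. It is a generic illustration under stated assumptions, not the VLA algorithm from the paper. All function names, the learning rate `lr`, and `weight_decay` are hypothetical.

```python
import math

def read(state, k):
    """Recall from the associative memory: state^T k."""
    return [sum(state[i][j] * k[i] for i in range(len(k)))
            for j in range(len(state[0]))]

def additive_write(state, k, v):
    """Plain linear attention update: S <- S + k v^T (norm can grow without bound)."""
    return [[state[i][j] + k[i] * v[j] for j in range(len(v))]
            for i in range(len(k))]

def regularized_write(state, k, v, lr=1.0, weight_decay=0.01):
    """One gradient step on 0.5*||S^T k - v||^2 + 0.5*weight_decay*||S||_F^2."""
    err = [p - t for p, t in zip(read(state, k), v)]  # prediction error for this key
    return [[state[i][j] - lr * (k[i] * err[j] + weight_decay * state[i][j])
             for j in range(len(v))]
            for i in range(len(k))]

def frobenius_norm(state):
    return math.sqrt(sum(x * x for row in state for x in row))

def write_repeatedly(write_fn, k, v, steps):
    """Store the same (key, value) pair `steps` times, starting from a zero state."""
    state = [[0.0] * len(v) for _ in k]
    for _ in range(steps):
        state = write_fn(state, k, v)
    return state
```

Writing the same unit-norm key several times shows the difference: the additive state scales with the number of writes, so recall returns a multiple of the stored value, while the error-driven update settles near a fixed point whose recall stays close to the value.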
TOOL · Together AI blog English(EN) · 1mo

Inside the Together AI kernels team

The Together AI kernels team, including researchers Dan Fu and Tri Dao, developed FlashAttention, a software layer that significantly improves GPU performance for AI models. The breakthrough, achieved by applying database system principles to GPU memory movement, delivered 2–3x speedups and challenged the notion that transformer attention was already fully optimized. The team's subsequent work, including the ThunderKittens library, aims to accelerate kernel development for new hardware such as NVIDIA's Blackwell GPUs, addressing the software-hardware gap in AI infrastructure.

IMPACT Optimizes AI inference and training by bridging the software-hardware gap, potentially lowering costs and improving responsiveness.
- ThunderKittens
- NVIDIA
- Stanford
- Together AI
- Andrej Karpathy
- Tesla
- GPU
- FlashAttention
- Tri Dao
- Dan Fu
SIGNIFICANT · Together AI blog English(EN) · 4mo · [7 sources]

Optimizing inference speed and costs: Lessons learned from large-scale deployments

Together AI has launched a brand refresh, emphasizing its role as an "AI Native Cloud" designed for builders of AI-native applications. The company is focusing on making inference efficient and cost-effective, a critical factor for AI products that scale rapidly. It is integrating advanced research, such as adaptive speculative decoding and quantization techniques, into its platform to improve performance and reduce costs for customers like Cursor and Decagon.

IMPACT Together AI's focus on optimizing inference infrastructure and costs is crucial for the economic viability and scalability of AI-native applications.
RESEARCH · Hugging Face Daily Papers English(EN) · 12mo · [86 sources]

Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

Researchers have developed several new tools and frameworks to improve the efficiency and accuracy of large language model (LLM) operations. Charon and Frontier are simulators designed to predict LLM training and inference performance with high accuracy, aiding optimization efforts. FT-Dojo provides a benchmark environment for autonomous LLM fine-tuning, while rePIRL offers an inverse-RL-inspired framework for learning process reward models. Additionally, PALS targets power-aware LLM serving for Mixture-of-Experts models, and LlamaWeb enables memory-efficient LLM inference in web browsers using WebGPU.

IMPACT New simulators and frameworks promise more efficient, accurate, and power-aware LLM operations, potentially accelerating research and deployment.
- PagedAttention
- LLMs
- FlashAttention
- Llama-2-7B
- A100 GPU
- Nested WAIT
- LLM
- Asteria
- KVDrive
- Sarathi-Serve
- SCICONVBENCH
- vLLM
- A100
- Orca
- FasterTransformer
- TIDE
- LLaDA2.0-flash
- POPE benchmark
- DeepSeek-R1-Distill-7B
- V* benchmark
- LLaDA2.0-mini
- LLMEval-Logic
- FT-Dojo
- LlamaWeb
- FT-Agent
- WebGPU
- llama.cpp
- arXiv
- rePIRL
- Frontier
- PALS
- Charon
