Together AI optimizes long-context inference with new sparse attention

By PulseAugur Editorial · [1 source] · 2026-06-11 16:30

Together AI has built an inference system for serving long-context models efficiently. Its kernel and inference teams built KV-block-major sparse attention, integrated MSA with a paged KV cache, optimized decode index scoring, and moved multimodal preprocessing into a Rust gateway so that it runs before requests reach GPU workers. According to Together, M3's architecture makes long-context inference more efficient, but serving it at production scale required this systems work.
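The source does not publish Together's kernels, so the following is only a minimal sketch of the general idea behind block-sparse attention over a paged KV cache, not Together's implementation. It assumes one plausible reading of the terms: the cache is split into fixed-size blocks stored in non-contiguous pages, each block gets a cheap index score for the current query, and attention runs only over the top-scoring blocks, iterating block by block. The page layout, the mean-key scoring rule, and every name here (`pages`, `block_table`, `block_scores`, `select_blocks`, `sparse_attention`) are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_scores(query, key_blocks):
    """Score each KV block by the query's dot product with the block's mean key.
    (Assumed scoring rule; the source does not say how index scores are computed.)"""
    scores = []
    for keys in key_blocks:
        dim = len(keys[0])
        mean_key = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]
        scores.append(dot(query, mean_key))
    return scores

def select_blocks(scores, top_k):
    """Return the indices of the top_k highest-scoring blocks, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def sparse_attention(query, key_blocks, value_blocks, top_k):
    """Softmax attention restricted to the selected KV blocks."""
    chosen = select_blocks(block_scores(query, key_blocks), top_k)
    scale = 1.0 / math.sqrt(len(query))
    logits, values = [], []
    for b in chosen:  # outer loop walks KV blocks, inner loop walks tokens in a block
        for k, v in zip(key_blocks[b], value_blocks[b]):
            logits.append(dot(query, k) * scale)
            values.append(v)
    peak = max(logits)  # subtract the max logit for a numerically stable softmax
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, chosen

# Hypothetical paged cache: physical page id -> (keys, values), two tokens per page.
pages = {
    7: ([[1.0, 0.0], [0.9, 0.1]], [[1.0, 0.0], [1.0, 0.0]]),
    3: ([[0.0, 1.0], [0.1, 0.9]], [[0.0, 1.0], [0.0, 1.0]]),
    5: ([[-1.0, 0.0], [-0.9, -0.1]], [[-1.0, 0.0], [-1.0, 0.0]]),
}
block_table = [7, 3, 5]  # logical block index -> physical page id
key_blocks = [pages[p][0] for p in block_table]
value_blocks = [pages[p][1] for p in block_table]

output, chosen = sparse_attention([1.0, 0.0], key_blocks, value_blocks, top_k=1)
print(chosen, output)
```

With `top_k` equal to the number of blocks, the sketch reduces to ordinary dense attention. The saving in a real system comes from reading only the selected pages during decode.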

IMPACT Makes long-context inference more efficient to serve in production, which could widen adoption of long-context models.

RANK_REASON The cluster describes inference-system optimizations for long-context models, which fall under research and infrastructure improvements.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

X — Together (inference / OSS) TIER_1 English(EN) · togethercompute · 2026-06-11 16:30

M3’s architecture makes long-context inference more efficient. Serving it at production scale required systems work.

Together’s kernel and inference teams built KV-block-major sparse attention, integrated MSA with paged KV cache, optimized decode index scoring, and moved …

