M3’s architecture makes long-context inference more efficient. Serving it at production scale required systems work.

Together AI optimizes long-context inference with new sparse attention

By the PulseAugur editorial team · [1 source] · 2026-06-11 16:30

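Together AI has built a new inference system designed to serve long-context models efficiently. Its approach uses KV-block-major sparse attention and moves multimodal preprocessing into a Rust gateway, so that work is done before requests reach the GPU workers. This systems work is essential for serving models with extended context windows at production scale.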
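To make the idea of block-level sparse attention concrete, here is a minimal pure-Python sketch of a single decode step. It is a generic illustration, not Together's kernel: the source does not describe the scoring rule, the block size, or what "KV-block-major" means in detail. The mean-key scoring heuristic and the names `block_scores`, `select_blocks`, and `sparse_decode_attention` are assumptions introduced here.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def block_scores(query, key_blocks):
    """Score each KV block by the query's dot product with the block's mean key.

    The mean-key heuristic is an assumption for illustration only.
    """
    scores = []
    for block in key_blocks:
        dim = len(block[0])
        mean_key = [sum(k[d] for k in block) / len(block) for d in range(dim)]
        scores.append(dot(query, mean_key))
    return scores


def select_blocks(scores, top_k):
    """Return the indices of the top_k highest-scoring blocks, in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def sparse_decode_attention(query, key_blocks, value_blocks, top_k):
    """Attend only over the tokens inside the selected KV blocks."""
    chosen = select_blocks(block_scores(query, key_blocks), top_k)
    scale = 1.0 / math.sqrt(len(query))
    logits, values = [], []
    for b in chosen:  # iterate block by block, then over tokens in the block
        for k, v in zip(key_blocks[b], value_blocks[b]):
            logits.append(dot(query, k) * scale)
            values.append(v)
    # Numerically stable softmax over the selected tokens only.
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, chosen
```

With `top_k` equal to the number of blocks, the function reduces to ordinary dense attention over the whole cache. Smaller values trade accuracy for less memory traffic, which is the general motivation for sparse attention in long-context decoding.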

Impact: Optimizes inference for long-context models, which may encourage broader adoption of advanced AI capabilities.

Ranking rationale: This cluster describes technical optimizations to inference systems for long-context models, which falls under research and infrastructure improvements.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

X — Together (inference / OSS) TIER_1 English(EN) · togethercompute · 2026-06-11 16:30

M3’s architecture makes long-context inference more efficient. Serving it at production scale required systems work. Together’s kernel and inference teams built KV-block-major sparse attention, integrated MSA with paged KV cache, optimized decode index scoring, and moved