Moonshot AI has released FlashKDA, an open-source implementation of Kimi Delta Attention. This new kernel achieves up to 2.5 times faster inference speeds on NVIDIA H200 GPUs. It is built using CUTLASS and optimized for variable-length batching, allowing for seamless integration into existing deep learning frameworks. AI
影响 Accelerates inference for attention-based models on high-end GPUs, potentially lowering costs and increasing throughput.
排序理由 Open-source release of a specialized kernel for attention mechanisms.
在 Mastodon — mastodon.social 阅读 →
AI 生成摘要 · Google Gemini · 来自 2 个来源。 我们如何撰写摘要 →