Brief · PulseAugur

AdapTive

Together AI has introduced ATLAS, an adaptive-learning system for speculative decoding that improves LLM inference performance without manual tuning. Unlike standard or custom speculators, ATLAS continuously learns from runtime usage and evolving workloads to optimize token drafting in real time. Together AI reports speeds of up to 500 tokens per second (TPS) on DeepSeek-V3.1 and up to 460 TPS on Kimi-K2, which it says outperforms even specialized hardware such as Groq.

IMPACT Accelerates LLM inference speed and reduces costs by dynamically optimizing speculative decoding.

SIGNIFICANT · Together AI blog English(EN) · 9mo · [3 sources]

Together AI delivers fastest inference for the top open-source models

Together AI has launched Dedicated Container Inference, a service designed to optimize the deployment and performance of custom generative media models. The platform handles orchestration tasks such as autoscaling, queuing, and traffic isolation, so teams can focus on their model logic. Some customers have seen inference run up to 2.6x faster on the service. Separately, Together AI has announced improvements to its inference platform, achieving up to 2x faster serverless inference for top open-source models by using next-generation GPU hardware and optimized kernels.

IMPACT Accelerates deployment and inference for custom and open-source AI models, potentially lowering costs and increasing accessibility for specialized AI applications.