PulseAugur / Brief

last 24h
[50/64] 222 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. AdapTive

    Together AI has introduced ATLAS, an adaptive-learning system for speculative decoding that improves LLM inference performance without manual tuning. Unlike standard or custom speculators, ATLAS continuously learns from runtime usage and evolving workloads to optimize token drafting in real time. The system reaches up to 500 tokens per second (TPS) on DeepSeek-V3.1 and 460 TPS on Kimi-K2, outperforming even specialized hardware like Groq.

    IMPACT Accelerates LLM inference speed and reduces costs by dynamically optimizing speculative decoding.
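
For readers unfamiliar with the draft-and-verify loop that speculators plug into, here is a minimal sketch of plain greedy speculative decoding. It is not ATLAS and shows no adaptive learning. The names `speculative_decode`, `target_model`, and `draft_model` are hypothetical, and the "models" are toy integer functions chosen only so the loop can run.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding; output matches decoding with `target` alone."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1) Draft k tokens with the cheap model.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) Verify: keep drafted tokens while the target agrees.
        #    (A real engine scores all k positions in one batched forward pass.)
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(out + proposal[:i]) != tok:
                break
            accepted += 1
        out.extend(proposal[:accepted])
        # 3) The target always contributes one token: the correction after a
        #    mismatch, or a bonus token when all k drafts were accepted.
        out.append(target(out))
    return out[:goal]


# Toy stand-ins (assumptions): the target counts upward modulo 10,
# and the draft is wrong whenever the true next token is 7.
def target_model(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[Token]) -> Token:
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 7 else nxt


result = speculative_decode(target_model, draft_model, prompt=[0], max_new=12)
```

The speedup in practice depends on how often drafted tokens are accepted, which is the quantity an adaptive speculator would try to raise.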

  2. Announcing General Availability of Together Instant Clusters, offering ready to use, self

    Together AI has launched Together Instant Clusters, a new service providing readily available, self-service GPU clusters for AI development and deployment. This offering aims to simplify the complex process of setting up multi-node GPU infrastructure, allowing users to provision clusters with hundreds of GPUs in minutes via API, CLI, or console. The service includes pre-configured components for distributed training and inference, supporting NVIDIA's latest GPU architectures and high-performance networking solutions.

    IMPACT Simplifies GPU cluster provisioning, enabling faster experimentation and deployment for AI workloads.

  3. Improved Batch Inference API: Enhanced UI, Expanded Model Support, and 3000× Rate Limit Increase

    Together AI has upgraded its Batch Inference API, introducing a more user-friendly interface and expanding model compatibility to include all serverless and private deployment models. The update raises rate limits 3000x, from 10 million to 30 billion enqueued tokens per model per user, enabling much larger-scale data processing. These enhancements aim to make high-throughput workloads more cost-effective and accessible, with batch jobs typically priced at 50% of the real-time API rate for most serverless models.

    IMPACT Enables more cost-effective and scalable processing for large AI workloads like synthetic data generation and model evaluation.
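
The figures in this item can be checked with a few lines of arithmetic. This is a small sketch using only the numbers quoted above; the function names and the example real-time cost are hypothetical.

```python
OLD_LIMIT_TOKENS = 10_000_000        # 10 million enqueued tokens (quoted above)
NEW_LIMIT_TOKENS = 30_000_000_000    # 30 billion enqueued tokens (quoted above)
BATCH_PRICE_FRACTION = 0.5           # "typically 50%" of the real-time rate


def limit_increase(old_limit: int, new_limit: int) -> int:
    """Return the integer multiple by which the enqueued-token limit grew."""
    return new_limit // old_limit


def batch_cost(realtime_cost_usd: float,
               fraction: float = BATCH_PRICE_FRACTION) -> float:
    """Estimate batch cost from a real-time cost, assuming the typical discount."""
    return realtime_cost_usd * fraction


increase = limit_increase(OLD_LIMIT_TOKENS, NEW_LIMIT_TOKENS)
example_cost = batch_cost(100.0)  # hypothetical 100 USD real-time workload
```

The 50% figure is described as typical for most serverless models, so actual pricing for a given model may differ.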

  4. FlashAttention

    Together AI has released FlashAttention-3 and FlashAttention-4, significant upgrades to its GPU-accelerated attention mechanism for large language models. FlashAttention-3, designed for Hopper GPUs, achieves up to 75% utilization and a 1.5-2x speedup over its predecessor by exploiting Hopper hardware features such as asynchronous Tensor Cores and the Tensor Memory Accelerator (TMA), and by supporting FP8 precision. FlashAttention-4, optimized for Blackwell GPUs, further enhances performance by pipelining computations and addressing bottlenecks in transcendental functions and memory traffic, reaching 71% utilization and offering substantial speedups over existing libraries.

    IMPACT These optimized attention mechanisms promise significantly faster LLM training and inference, enabling longer context windows and more efficient GPU utilization.

  5. 600+ new voices powered by MiniMax Speech 2.8 Turbo are now on Together AI @togethercompute 🎙️✨

    MiniMax AI has released over 600 new voices through its Speech 2.8 Turbo model. These voices are now accessible on the Together AI platform. This expansion aims to provide a wider range of synthetic speech options.

    IMPACT Expands the availability of synthetic voice options for developers and users on the Together AI platform.

  6. Introducing Qwen3.7-Max from @Alibaba_Qwen, Qwen’s flagship model for the agent era with 1M context and leading performance across agentic coding, reasoning, an

    Together AI is now offering access to Alibaba's Qwen3.7-Max model, a flagship offering designed for the agent era. The model has a 1 million token context window and demonstrates leading performance in areas such as agentic coding, reasoning, and long-horizon autonomy. Users can now run Qwen3.7-Max on Together's Serverless Inference platform for production-scale applications.

    IMPACT Provides access to a new frontier model with advanced agentic capabilities and a large context window.

  7. RT @vipulved: PSA: Just added a thousand H100s and H200s to Together on-demand GPU clusters and Dedicated Endpoints: https://t.co/fr7yzZpPP8

    Together AI has significantly expanded its GPU capacity by adding one thousand NVIDIA H100 and H200 instances. These powerful GPUs are now available through Together's on-demand GPU clusters and dedicated endpoint services. This expansion aims to provide more robust infrastructure for AI inference and open-source model development.

    IMPACT Increases availability of high-end GPUs for AI inference and OSS model development.

  8. Violin: An open-source video translation skill that breaks language barriers

    Together AI has launched Violin, an open-source video translation tool designed to make online video content accessible across language barriers. The system utilizes advanced AI, including speech recognition, large language models, and speech synthesis, to provide high-quality translations. Violin also features interactive capabilities like a content-aware chat assistant and personalized voice selection, aiming to broaden the reach of video content globally.

    IMPACT Enhances accessibility of video content globally by leveraging multiple AI models for translation and interaction.

  9. Congrats to the @cursor_ai team on Composer 2.5 — a huge milestone for agentic coding models.

    Together AI has partnered with Cursor AI to launch Composer 2.5, a significant advancement for agentic coding models. This new version is noted for its speed and quality, pushing the boundaries of what coding agents can achieve.

    IMPACT Enhances capabilities for AI-powered coding assistants, potentially improving developer productivity.

  10. Together AI and Pearl Research Labs Team Up to Reduce the Cost of AI Inference

    Together AI has partnered with Pearl Research Labs to use blockchain technology to lower the cost of AI inference. The collaboration introduces a new inference endpoint for the Gemma-4-31B-it-pearl model, offering a discount of over 25% by offsetting costs with cryptocurrency emissions from the Pearl Network. The Pearl Network uses a Proof of Useful Work mechanism, in which GPU computations for AI tasks simultaneously generate a cryptocurrency called PRL, with the aim of reducing the price per token.

    IMPACT This partnership aims to reduce AI inference costs by integrating cryptocurrency generation, potentially impacting the unit economics for AI operators.

  11. Gemma-4-31B-it-Pearl supports 32K context, configurable thinking, function calling, and JSON mode.

    Together AI has released Gemma-4-31B-it-Pearl, an open-source model with enhanced capabilities. This model supports a 32K context window, configurable thinking processes, function calling, and JSON mode. It marks Together AI's initial offering powered by Pearl, with plans to broaden their Pearl-based product line in the future.

    IMPACT Provides developers with a new open-source model featuring advanced capabilities like extended context windows and function calling.

  12. RT @ZainHasan6: As inference workloads dominate, what if these matmuls could also perform useful work and generate beneficial byproducts!?…

    Pearl Research Labs has announced its first major enterprise partnership with Together AI, focusing on optimizing inference workloads. This collaboration aims to transform hyperscalers' inference capital expenditures into a more efficient model. The partnership highlights the growing importance of inference compute and its energy consumption within the AI landscape.

    IMPACT This partnership addresses the growing challenge of optimizing inference workloads, which are becoming a major consumer of compute and energy in AI.

  13. Introducing Gemma-4-31B-it-Pearl on Together AI, Pearl Research Labs’ instruction-tuned checkpoint of Gemma 4 31B powered by @prlnet Proof of Useful Work protoc

    Together AI has released Gemma-4-31B-it-Pearl, an instruction-tuned model based on Gemma 4 31B. This model integrates the Pearl Network's Proof of Useful Work protocol, which generates proofs from existing matrix multiplications during training and inference. Users can access this model via a serverless inference endpoint on Together AI, with a discount on costs.

    IMPACT Provides a new inference endpoint for a specialized model, potentially lowering costs through its Proof of Useful Work mechanism.

  14. Together AI STT models now hold the top two spots for transcription speed on the @ArtificialAnlys Speech to Text leaderboard.

    Together AI's speech-to-text models have achieved the top two positions on the Artificial Analysis leaderboard for transcription speed. The NVIDIA Parakeet TDT 0.6B V3 model, running on Together AI, is currently ranked first, processing 303 seconds of audio for every second of computation.

    IMPACT Sets new SOTA on transcription speed benchmarks, potentially improving efficiency for voice AI applications.
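
To put the quoted speed factor in concrete terms, the sketch below converts it into wall-clock compute time for a recording. It assumes the factor holds uniformly across audio lengths, which the item does not state; the function name is hypothetical.

```python
SPEED_FACTOR = 303.0  # seconds of audio per second of computation (quoted above)


def transcription_seconds(audio_seconds: float,
                          speed_factor: float = SPEED_FACTOR) -> float:
    """Estimate compute time to transcribe audio at a constant speed factor."""
    return audio_seconds / speed_factor


one_hour = transcription_seconds(3600.0)  # roughly 11.9 seconds of compute
```

This ignores network transfer, queuing, and any per-request overhead, so end-to-end latency would be higher.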

  15. Introducing voice finder — a new tool to quickly find the right voice for your app from over 600+ voices

    Together AI has launched Voice Finder, a new tool designed to help developers quickly select the most suitable voice for their applications from a catalog of over 600 options. The tool allows users to search for voices by describing their desired characteristics or by uploading an audio sample for comparison. Voice Finder categorizes each voice across more than 15 attributes, including pitch, accent, and emotion, to streamline the selection process for voice agents.

    IMPACT Simplifies voice selection for developers building voice agents, potentially accelerating deployment.

  16. Serving DeepSeek-V4: why million-token context is an inference systems problem

    Together AI has detailed the architectural innovations behind DeepSeek-V4's ability to handle a 1 million token context window. The model employs a hybrid attention design that compresses context before storing it in the KV cache, significantly reducing memory pressure. This architectural shift transforms the challenge of long-context inference from a model capability into an inference systems problem, requiring optimized serving engines to manage cache layouts and batching effectively.

    IMPACT DeepSeek-V4's architectural innovations enable practical long-context inference, pushing the boundaries of what's possible for AI applications requiring extensive context.
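
To illustrate why KV cache size dominates at million-token context, here is a rough sizing sketch using the standard per-sequence formula for an uncompressed attention cache (keys plus values for every layer and KV head). The layer count, head count, head dimension, and the 4x compression ratio are hypothetical placeholders, not DeepSeek-V4's actual configuration.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    """Uncompressed KV cache size for one sequence: keys + values per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit cache.
full = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      seq_len=1_000_000, bytes_per_elem=2)

# Hypothetical 4x compression applied before the cache is stored.
compressed = full // 4

full_gb = full / 1e9              # about 131 GB for a single sequence
compressed_gb = compressed / 1e9  # about 33 GB under the assumed ratio
```

Even with these modest assumed dimensions, one uncompressed million-token sequence exceeds the memory of a single accelerator, which is why cache layout and batching become serving-engine concerns.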

  17. Deploy and inference any model from HuggingFace

    Together AI has launched a new feature allowing developers to deploy and run any model from Hugging Face using their Dedicated Container Inference (DCI) infrastructure. This is facilitated by an agent-based CLI tool called Goose, which automates the complex setup process, including inference server configuration and container generation. The system aims to significantly reduce the lag time between a model's release and its practical use, as demonstrated by the rapid deployment of Netflix's void-model.

    IMPACT Accelerates the adoption of new AI models by drastically reducing deployment complexity and time.

  18. Announcing Together AI and Adaption Partnership

    Together AI has partnered with Adaption, a company co-founded by former Cohere and Google DeepMind leaders Sara Hooker and Sudip Roy. This collaboration integrates Adaption's data optimization tools with Together AI's fine-tuning infrastructure. The partnership aims to streamline the process for users to create high-quality, fine-tuned open-source models by improving dataset quality and simplifying the experimentation and deployment workflow.

    IMPACT Streamlines the creation of specialized open-source models by enhancing data quality and fine-tuning workflows.

  19. From 732 bytes to nowhere: shutting down Copy Fail in production

    Together AI has detailed its rapid response to a critical Linux kernel vulnerability, dubbed Copy Fail (CVE-2026-31431), which allows unprivileged local users to gain root access. The company treated the issue as a fleet-level emergency, disabling the vulnerable crypto socket interface across its infrastructure within hours. They also implemented a temporary kernel hardening step by unloading the vulnerable module and removing it from the module path, preventing its re-activation until stable upstream patches could be rolled out and tested.

    IMPACT Mitigation of a critical kernel vulnerability protects AI infrastructure from compromise, ensuring the stability and security of AI workloads.

  20. DeepSeek-V4 Pro now available on Together AI

    DeepSeek-V4 Pro, a large Mixture-of-Experts model with 1.6 trillion parameters, is now accessible on the Together AI platform. This model is designed for long-context reasoning, supporting up to a 512K-token context window in its initial Together AI deployment, with plans for a 1M-token context window. It features controllable reasoning modes to optimize for speed or depth and offers specialized pricing for cached input tokens to reduce costs on repeated queries.

    IMPACT Enables new applications requiring reasoning over extensive datasets, potentially lowering costs for repeated long-context queries.

  21. Heading to #MLSys2026? Come unwind with the Together AI team at Inference After Dark. Drinks, bites, shuffleboard, and a room full of researchers and AI-

    Together AI is hosting an event called "Inference After Dark" during the MLSys 2026 conference. The event will take place on Tuesday, May 19th, from 7:30 PM to 10:00 PM at Tavern Hall in Bellevue, WA. It is intended as a social gathering for researchers and AI-native builders.

  22. Parcae: Doing more with fewer parameters using stable looped models

    Together AI has introduced Parcae, a novel stable architecture for looped language models. This new design allows models to achieve the quality of larger Transformers while using significantly fewer parameters, by increasing recurrence rather than solely scaling data. Parcae demonstrates improved stability over previous looped models and establishes the first scaling laws for this type of architecture, suggesting a more efficient frontier for training memory-constrained on-device models.

    IMPACT Introduces a more parameter-efficient model architecture, potentially enabling higher quality on-device AI with reduced memory footprints.

  23. Wan 2.7 video model suite now available on Together AI

    Together AI has launched the Wan 2.7 model suite, offering advanced video generation and editing capabilities. This suite includes text-to-video generation and will soon expand to image-to-video, reference-to-video, and video editing functionalities. The models provide users with greater creative control through features like audio-driven generation, frame-level conditioning, and reference inputs, all accessible via a unified API on the Together AI platform.

    IMPACT Enhances creative control and workflow integration for AI video generation and editing tasks.

  24. Inside the Together AI kernels team

    The Together AI kernels team, including researchers Dan Fu and Tri Dao, developed FlashAttention, a software layer that significantly optimizes GPU performance for AI models. This breakthrough, achieved by applying database system principles to GPU memory movement, resulted in 2-3x speedups, challenging the notion that transformer attention was already fully optimized. The team's subsequent work, including the ThunderKittens library, aims to accelerate kernel development for new hardware like NVIDIA's Blackwell GPUs, addressing the critical software-hardware gap in AI infrastructure.

    IMPACT Optimizes AI inference and training by bridging the software-hardware gap, potentially lowering costs and improving responsiveness.

  25. Plan, divide, and conquer: How weak models excel at long context tasks

    Researchers at Together AI have developed a "Divide and Conquer" framework that enables smaller language models to effectively handle long context tasks. Their study, presented at ICLR 2026, demonstrates that by breaking down large inputs into smaller chunks and assigning them to multiple, less powerful models, performance can match or even surpass that of a single, large model like GPT-4o. This approach mitigates issues like model confusion and task-specific noise, leading to more efficient and cost-effective processing of extensive documents or codebases.

    IMPACT Enables cost-effective and efficient processing of long documents and codebases by smaller LLMs.
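
The chunk-then-aggregate pattern described in this item can be sketched generically as below. This is not the authors' implementation: the planning step named in the title is not modeled, and `split_into_chunks`, `divide_and_conquer`, `count_errors`, and `total` are hypothetical names, with a word-counting stub standing in for the small-model workers.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def split_into_chunks(text: str, chunk_words: int) -> List[str]:
    """Split text into consecutive chunks of at most `chunk_words` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def divide_and_conquer(text: str,
                       worker: Callable[[str], T],
                       manager: Callable[[List[T]], T],
                       chunk_words: int = 200) -> T:
    """Run `worker` on each chunk, then let `manager` merge the partial results."""
    partials = [worker(chunk) for chunk in split_into_chunks(text, chunk_words)]
    return manager(partials)


# Stub worker and manager (assumptions): in a real system each would be
# a call to a language model with a task-specific prompt.
def count_errors(chunk: str) -> int:
    return chunk.split().count("error")


def total(partials: List[int]) -> int:
    return sum(partials)


document = "ok error " * 500  # 1000 words, 500 of them "error"
answer = divide_and_conquer(document, count_errors, total, chunk_words=200)
```

Counting is an easy case because partial results merge exactly; tasks whose answers depend on information spread across chunks need a more careful aggregation step.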

  26. Together AI expands fine-tuning service with tool calling, reasoning, and vision support

    Together AI has enhanced its fine-tuning service to better support advanced AI workflows. The update adds native support for tool-calling, reasoning, and vision-language model fine-tuning, addressing common issues like unreliable tool execution and degraded reasoning in complex interactions. These improvements aim to increase iteration speed and accuracy for AI teams building agentic applications, with enhanced throughput and larger dataset handling for models up to 1T parameters.

    IMPACT Enables more reliable and efficient fine-tuning of AI agents, potentially accelerating the development of complex AI applications.

  27. Mamba-3

    Together AI has released Mamba-3, a new state space model (SSM) prioritizing inference efficiency over training speed. This model features a more expressive recurrence formula, complex-valued state tracking, and a multi-input, multi-output (MIMO) variant that enhances accuracy without sacrificing decoding speed. Mamba-3 SISO has demonstrated superior performance in prefill and decode latency compared to previous Mamba versions and even the Llama-3.2-1B Transformer model at the 1.5B parameter scale. The team has also open-sourced the model's kernels, developed collaboratively with researchers from Carnegie Mellon University, Princeton University, and Cartesia AI.

    IMPACT Sets a new benchmark for inference efficiency in state space models, potentially influencing future LLM architectures and deployment strategies.

  28. Together AI Brings NVIDIA Nemotron 3 to Developers on Day 0

    Together AI has launched NVIDIA's Nemotron 3 models, including the multimodal Nano Omni and the large-context Super, on its platform. Nemotron 3 Nano Omni, a 30B parameter model, reasons across video, images, audio, and language simultaneously, making it suited to agentic applications. Nemotron 3 Super, a 120B parameter model, has a 1 million token context window and uses multi-token prediction for efficient handling of complex reasoning and long-context tasks. Both models are open-weights and optimized for production-scale inference on Together AI's managed infrastructure.

    IMPACT Accelerates development of multimodal and long-context AI applications by providing access to advanced, open-weight models on optimized infrastructure.

  29. New in Together GPU Clusters: Autoscaling, observability, and self-healing

    Together AI has enhanced its GPU clusters with new features aimed at improving efficiency and manageability for AI-native teams. The platform now supports multi-tenancy, allowing different teams to share compute resources securely and independently. Key additions include autoscaling for elastic capacity, robust observability tools, and self-healing capabilities to reduce downtime and operational overhead.

    IMPACT These infrastructure improvements enable AI teams to manage compute resources more efficiently, potentially reducing costs and accelerating development cycles.

  30. KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

    Multiple research papers published in May 2026 introduce novel techniques to optimize the Key-Value (KV) cache in large language models, addressing memory and latency bottlenecks. These methods include offloading KV cache to object storage like S3 (ObjectCache), employing advanced compression strategies like three-way token routing (VECTOR), and using auxiliary models for selective KV cache recomputation (CacheClip). Other approaches focus on hardware-aware quantization (InnerQ, OCTOPUS) and service-aware adaptive compression (KVServe) to improve efficiency and reduce decode latency, especially for long-context inference and retrieval-augmented generation (RAG) systems.

    IMPACT These advancements in KV cache optimization promise to significantly improve the efficiency and speed of long-context LLM inference, making advanced AI applications more practical and cost-effective.

  31. How speech models fail where it matters the most and what to do about it

    Researchers at Together AI have found that current state-of-the-art speech recognition models fail often on street names, with an average transcription error rate of 39%, and that non-native English speakers are 18% more likely to be misunderstood. This inaccuracy can lead to substantial real-world consequences, such as increased travel time and costs for services like ride-sharing. The study suggests that a synthetic data generation technique called "cross-lingual style transfer" can improve transcription accuracy by up to 60% with minimal training data.

    IMPACT Speech recognition systems need improvement for real-world applications, especially for diverse linguistic groups, to avoid costly errors.

  32. Fine

    Together AI has enhanced its fine-tuning platform to support a wider array of large language models, including recent releases from DeepSeek, Qwen, and Meta, alongside OpenAI's gpt-oss. The platform now offers expanded context lengths, up to 131k tokens for some models, at no additional cost, facilitating tasks like long-document processing and complex code editing. Separately, Together AI researchers have explored LLM behavior using minimal, topic-neutral prompts to uncover inherent model preferences, finding that GPT-OSS favors programming and math, Llama leans literary, DeepSeek often produces religious content, and Qwen tends toward multiple-choice questions.

    IMPACT Together AI's platform updates enable developers to fine-tune a broader range of large models with extended context, potentially lowering costs and improving performance on complex tasks.

  33. Rime Arcana V3 Turbo and Rime Arcana V3 now available on Together AI

    Together AI has launched two new Rime models, V3 Turbo and V3, designed for natural code-switching in voice agents. V3 Turbo offers English-Spanish switching with a time-to-first-audio of approximately 120ms on dedicated endpoints, maintaining conversational flow and prosody. The V3 model supports switching across 11 languages, providing a unified solution for multilingual customer interactions without the need for separate language-specific models.

    IMPACT Enables more natural and efficient multilingual voice agent interactions, potentially reducing costs for high-volume deployments.

  34. DSGym: A holistic framework for evaluating and training data science agents

    Researchers have introduced DSGym, a new framework designed to standardize the evaluation and training of data science agents. This system addresses limitations in current benchmarks by providing a unified API and self-contained execution environments, ensuring fair comparisons and enabling agents to utilize underlying data. DSGym integrates existing benchmarks and introduces new datasets for bioinformatics and machine learning competitions, demonstrating its utility by training a 4B parameter model to state-of-the-art performance among open-source agents.

    IMPACT Standardizes evaluation and training for data science agents, potentially accelerating development and improving performance.

  35. Optimizing inference speed and costs: Lessons learned from large-scale deployments

    Together AI has launched a brand refresh, emphasizing its role as an "AI Native Cloud" designed for builders of AI-native applications. The company is focusing on optimizing inference for efficiency and cost-effectiveness, a critical factor for AI products that scale rapidly. They are integrating advanced research, such as adaptive speculative decoding and quantization techniques, into their platform to improve performance and reduce costs for customers like Cursor and Decagon.

    IMPACT Together AI's focus on optimizing inference infrastructure and costs is crucial for the economic viability and scalability of AI-native applications.

  36. Learn how Cursor partnered with Together AI to deliver real-time, low-latency inference at scale

    Cursor, an AI-powered coding platform, has partnered with Together AI to optimize its real-time inference capabilities. This collaboration focuses on achieving low-latency responses within the editor's feedback loop, which is crucial for the AI's predictive and refactoring features. The partnership leverages NVIDIA's Blackwell architecture, specifically the GB200 NVL72, to enhance performance and reduce response times for developers.

    IMPACT Enables faster, more responsive AI coding assistance by optimizing inference infrastructure, potentially improving developer productivity.

  37. Inside multi-node training: How to scale model training across GPU clusters

    Training large foundation models necessitates distributing the workload across numerous GPUs housed in multiple interconnected machines, a process known as multi-node training. This approach is essential for handling models with billions or trillions of parameters that exceed the memory capacity of single servers and would otherwise take months to train. Effective multi-node training relies on sophisticated parallelism strategies, high-speed network interconnects, and robust fault tolerance mechanisms to ensure efficient computation and progress.

    IMPACT Explains the critical infrastructure and techniques required to train massive AI models, enabling faster iteration and development.

  38. How to choose the right open model for production

    Choosing the right open-source AI model for production requires careful consideration of factors like transparency, adaptability, and control. While proprietary models offer tiered options, open models allow for deeper customization and ownership. However, the terms of a model's license, such as Apache-2.0 or MIT, must be strictly adhered to for commercial use, and model size should correlate with the capability tier of comparable closed models.

    IMPACT Provides guidance for AI operators on selecting and implementing open-source models effectively.

  39. MiniMax Speech 2.6 Turbo now available natively on Together AI

    Together AI has released MiniMax Speech 2.8 Turbo, an enterprise text-to-speech model designed for natural-sounding voice agents. This new model offers significant improvements in prosody, includes sound tags for vocal cues like laughter and sighs, and boasts high-fidelity voice cloning capabilities. It also provides end-to-end generation in under 250 milliseconds and is now available on Together AI's dedicated infrastructure, alongside over 600 new voices.

    IMPACT Enhances the naturalness and expressiveness of AI voice agents, potentially improving user interaction in applications.

  40. Rime voice models now available on Together AI

    Together AI has integrated Rime's enterprise-grade voice models, Arcana v2 and Mist v2, into its platform. Arcana v2 offers expressive, conversational voices trained on real customer interactions, while Mist v2 provides deterministic pronunciation control for high-volume applications. These models are designed to improve the reliability and naturalness of AI-powered voice agents, reducing latency and enhancing customer trust by ensuring consistent pronunciation and a more human-like conversational flow.

    IMPACT Enhances AI voice agent capabilities by providing more natural and controllable speech synthesis, potentially improving customer experience in voice-based applications.

  41. Research POV: Yes, AGI Can Happen – A Computational Perspective

    Together AI's VP of Kernels, Dan Fu, argues that the pursuit of AGI is not hitting a hardware wall. He posits that current AI systems are significantly underutilizing existing hardware, with training runs often achieving only 20% Model FLOPs Utilization (MFU) and inference in the single digits. Fu suggests that advancements in software-hardware co-design and innovations like FP4 training could unlock substantial performance gains, and that future compute power from next-generation hardware has yet to be fully integrated.

    IMPACT Argues that significant performance gains are achievable through software-hardware co-design, potentially accelerating AGI development.
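
As a rough illustration of what a 20% MFU figure means, the sketch below uses the common approximation that training a dense Transformer costs about 6 FLOPs per parameter per token (attention FLOPs ignored). The parameter count, throughput, and peak hardware rate are hypothetical values chosen to land on 20%; they do not come from the article.

```python
def training_mfu(n_params: float, tokens_per_second: float,
                 peak_flops_per_second: float) -> float:
    """Approximate Model FLOPs Utilization for dense Transformer training."""
    # ~6 FLOPs per parameter per token: forward pass plus backward pass.
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical run: 10B parameters, 100k tokens/s across the whole cluster,
# and an aggregate hardware peak of 3e16 FLOP/s.
mfu = training_mfu(n_params=1e10, tokens_per_second=1e5,
                   peak_flops_per_second=3e16)
```

Under this definition, doubling throughput on the same hardware doubles MFU, which is the headroom the argument above points to.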

  42. nvidia/Nemotron-Labs-Diffusion-14B

    NVIDIA has released the Nemotron-Labs Diffusion family of language models, available in 3B, 8B, and 14B parameter sizes. These models uniquely support autoregressive (AR), diffusion, and self-speculation decoding modes within a single architecture, offering significant speed-ups. By generating tokens in parallel blocks rather than sequentially, Nemotron-Labs Diffusion achieves up to 6.4x higher throughput than traditional AR models, while maintaining or improving accuracy. This breakthrough addresses the memory-bandwidth bottleneck inherent in AR models, making them more efficient for production deployments and agentic systems.

    IMPACT Accelerates AI inference by breaking the sequential token generation bottleneck, enabling more efficient and cost-effective production deployments.

  43. Announcing Together Python SDK v2.0

    Together AI has published a release candidate of its new Python SDK, version 2.0. The updated SDK is built with a modern, type-safe architecture using OpenAPI specifications and Stainless, aiming for improved performance and easier maintenance. It replaces the legacy v1 SDK and introduces new features like Instant Clusters beta APIs, while also offering better type safety and editor support for developers.

    IMPACT Improves developer experience and efficiency for interacting with Together AI's services.

  44. How to run TorchForge reinforcement learning pipelines in the Together AI Native Cloud

    Together AI is enhancing its cloud platform to support advanced reinforcement learning (RL) pipelines, integrating TorchForge and Monarch for distributed training. The platform now offers low-latency GPU communication and heterogeneous scheduling for mixed CPU/GPU workloads, crucial for complex RL tasks. New integrations with Together CodeSandbox and Code Interpreter allow RL agents to interact with tools and execute code, expanding their capabilities beyond traditional game-playing scenarios.

    IMPACT Enhances infrastructure for complex AI training, enabling more sophisticated RL applications and tool integration.

  45. Introducing AutoJudge: Streamlined inference acceleration via automated dataset curation

    Researchers at Together AI have developed AutoJudge, a novel method to accelerate large language model inference. This technique automates the curation of task-specific datasets, enabling lossy speculative decoding without manual annotation. AutoJudge identifies critical tokens that impact downstream quality, achieving up to a 2x speedup over standard speculative decoding with minimal accuracy loss. AI

    IMPACT Accelerates LLM inference by automating dataset curation for speculative decoding, potentially reducing operational costs.
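
    The difference between standard and lossy speculative decoding can be shown with a toy verification step. This is a minimal sketch under stated assumptions, not Together AI's AutoJudge implementation: tokens are plain strings, verification is greedy, and `is_important` is a hypothetical hand-written stand-in for the kind of learned judgment the summary describes. Here `target[i]` is assumed to be the token the large model would pick given the draft prefix up to position `i`.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical signature: (position, draft_token, target_token) -> bool
ImportanceFn = Callable[[int, str, str], bool]


def verify_draft(
    draft: Sequence[str],
    target: Sequence[str],
    is_important: Optional[ImportanceFn] = None,
) -> List[str]:
    """Return the tokens kept after one verification pass.

    With is_important=None this is strict greedy verification: the first
    mismatch is replaced by the target token and the rest of the draft is
    discarded. With a classifier, mismatches judged unimportant are
    tolerated and the draft token is kept (the lossy variant).
    """
    kept: List[str] = []
    for i, (d, t) in enumerate(zip(draft, target)):
        if d == t:
            kept.append(d)  # draft agrees with target
        elif is_important is not None and not is_important(i, d, t):
            kept.append(d)  # tolerated mismatch: keep the cheap draft token
        else:
            kept.append(t)  # important mismatch: take the correction and stop
            break
    return kept


def digits_matter(_pos: int, d: str, t: str) -> bool:
    """Toy rule (assumption): only mismatches involving numbers are important."""
    return d.isdigit() or t.isdigit()


if __name__ == "__main__":
    draft = ["The", "answer", "is", "roughly", "42"]
    target = ["The", "answer", "is", "about", "41"]
    print(verify_draft(draft, target))                 # strict: stops at "about"
    print(verify_draft(draft, target, digits_matter))  # lossy: keeps "roughly", fixes "42"
```

    In the strict case three draft tokens survive before the first correction. In the lossy case four survive, because the wording mismatch is tolerated while the numeric mismatch is still corrected.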

  46. Announcing the fastest inference for realtime voice AI agents

    Together AI has launched a unified platform for building real-time voice agents, integrating speech-to-text (STT), large language models (LLM), and text-to-speech (TTS) within a single cloud environment. This co-location aims to reduce latency to under 500ms and simplify deployment by eliminating inter-vendor network hops. The platform now natively hosts models like Deepgram for STT and Cartesia Sonic-3 for TTS, offering developers more choice and a streamlined experience for production-ready voice applications. AI

    IMPACT Accelerates development of real-time conversational AI applications by simplifying infrastructure and reducing latency.

  47. LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

    Several recent research papers explore the internal mechanisms and reasoning capabilities of Large Reasoning Models (LRMs). One paper, since withdrawn, proposed Entropy-Gradient Inversion and a related optimization technique (CorR-PO) to correlate token entropy with logit gradients for improved reasoning. Another withdrawn paper, LambdaPO, aimed to enhance reinforcement learning alignment by re-conceptualizing advantage estimation for finer-grained preference signals. A third paper introduced Convex Compositional Energy Minimization (CCEM) to address non-convexity in compositional reasoning models, enabling transfer to larger problem instances. Finally, a study on the "hidden critique ability" in LRMs identified a "critique vector" that can improve error detection and self-correction without additional training. AI

    IMPACT New research explores methods to improve LLM reasoning, instruction following, and self-correction capabilities, potentially leading to more reliable and controllable AI systems.

  48. Expanding Together AI Model Library into multimedia generation with 40+ new image and video models

    Together AI has expanded its platform to include advanced multimedia generation capabilities, integrating over 40 new image and video models. This move aims to simplify development by offering a unified API for text, image, and video generation, eliminating the need for developers to manage multiple providers. The platform now hosts models like FLUX.2 for consistent character and product image generation, alongside video models from major players such as OpenAI, Google, and ByteDance. AI

    IMPACT Consolidates generative media tools, potentially reducing friction for developers building AI-native applications.

  49. Announcing the Together AI Startup Accelerator, purpose-built for AI Native Apps

    Together AI has launched a startup accelerator program for companies building AI-native applications. The accelerator will provide selected startups with platform credits, engineering expertise, go-to-market support, and access to a venture capital network, with the aim of helping them scale their AI-native apps on Together AI's platform. Early participants include Corridor.dev and PlayerZero, which are already using the program's resources. AI

    IMPACT Provides resources and support for startups building AI-native applications, potentially accelerating innovation in the AI ecosystem.

  50. Together AI welcomes Mahadev Konar as SVP for Infrastructure Engineering

    Together AI has appointed Mahadev Konar as its new SVP of Infrastructure Engineering to bolster its GPU cloud services. Konar, a key figure in Apache Hadoop's development and formerly VP of Infrastructure at Instacart, will lead efforts to enhance the reliability, performance, and scalability of Together AI's platform. The company aims to provide AI-native startups with a robust infrastructure, enabling them to focus on product development rather than managing complex GPU environments. AI

    IMPACT Strengthens Together AI's infrastructure capabilities, potentially improving scalability and reliability for AI startups using their platform.