Brief

last 24h

[36/36] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · Medium — MLOps tag English(EN) · 2h

How to Detect GPU Waste in a Kubernetes Cluster

This article discusses how to identify and address GPU waste within Kubernetes clusters, a problem that often goes unnoticed due to seemingly healthy utilization metrics. It highlights that inefficient GPU usage can occur even when overall cluster utilization appears normal. The piece aims to provide methods for detecting these hidden inefficiencies. AI

IMPACT Provides guidance for optimizing AI/ML infrastructure costs and efficiency.
- Kubernetes
- GPU
RESEARCH · arXiv cs.CL Deutsch(DE) · 3d · [3 sources]

FastKernels: Benchmarking GPU Kernel Generation in Production

Researchers have introduced FastKernels, a new benchmark designed to better evaluate GPU kernel generation agents used in production LLM inference. Existing benchmarks are misaligned with real-world systems, leading agents to produce kernels that perform poorly outside of testing environments. FastKernels aims to bridge this gap by serving as a production-grade inference framework that mirrors real-world deployment needs and covers the vast majority of HuggingFace Transformers architectures. AI

IMPACT Addresses a critical bottleneck in LLM inference by improving the alignment of GPU kernel generation benchmarks with production systems.
- FastKernels
- GPU kernel generation
- vLLM
- SGLang
- AI inference
- LLM
- GPU
TOOL · Modal blog English(EN) · 3d

How we achieved truly serverless GPUs

Modal has developed a system to achieve truly serverless GPUs for AI inference, addressing the challenge of rapidly scaling resources to meet variable demand. Their approach involves maintaining cloud buffers of idle GPUs, a custom filesystem for lazy container image serving, and efficient checkpoint/restore mechanisms for both CPU and GPU processes. This engineering effort, developed over five years, reduces AI inference replica scaling time from tens of minutes to mere seconds, aiming to maximize GPU Allocation Utilization. AI

IMPACT Enables faster, more efficient scaling of AI inference workloads, potentially lowering costs and improving resource utilization.
- xAI
- AWS
- Modal
- SGLang
- Marc Brooker
- AI inference
TOOL · Modal blog English(EN) · 3d

Autoscaling Autoresearch: Give your agents elastic GPUs on Modal

Modal has introduced an autoscaling feature for GPUs designed to support AI research agents. This new capability allows agents to dynamically provision and release compute resources as needed, addressing the challenge of managing unpredictable research workloads and associated costs. The system demonstrated its effectiveness by completing OpenAI's Parameter Golf challenge significantly faster than a traditional workstation, highlighting its efficiency and cost-effectiveness for AI-driven research. AI

IMPACT Enables more efficient and cost-effective use of compute for AI research, potentially accelerating development cycles.
RESEARCH · 36氪 (36Kr) 中文(ZH) · 1d

First quarter AI financing exceeds 110 billion yuan, domestic large models see a surge in funding

In the first quarter, the AI sector saw over 110 billion yuan in funding, with domestic large language models experiencing a significant surge. Companies like Moonshot AI and Jieyue Xingchen secured over 30 billion yuan in May alone, while embodied intelligence startups also attracted substantial investment. A significant portion of this funding is being directed towards research and development, GPU procurement, and talent acquisition, leading to rapid technological iteration and reduced inference costs. AI

IMPACT This funding surge indicates accelerated development and commercialization of AI technologies, particularly large language models and embodied intelligence, potentially driving faster iteration cycles and wider adoption.
RESEARCH · 36氪 (36Kr) 中文(ZH) · 3d

The Wireless Revolution of AI Intelligent Imaging Under the Computing Power Wave | 2026 AI Partner · Beijing Yizhuang AI+ Industry Conference

Shenmou, led by Yang Zuoxing, is developing ultra-low-power chip designs to free cameras from wires, envisioning a future with billions of smart visual terminals. Their first-generation chip achieves one-third the industry's power consumption, while the second generation reaches one-tenth, enabling all-weather smart cameras powered by a single watt of solar energy. Yang predicts a massive increase in camera demand, from hundreds of millions annually to potentially 100 billion by 2045, to feed real-time data into world-scale AI models. AI

IMPACT Enables massive scaling of real-world data input for AI models, potentially reducing hardware costs and expanding AI applications.
- 36Kr
- TSMC
- CUDA
- Groq
- GPU
- Yang Zuoxing
- Shenmou
- Nvidia
- AI
- DeepSeek
- Samsung
SIGNIFICANT · Mastodon — fosstodon.org Polski(PL) · 3d · [3 sources]

Governor Gavin Newsom signed unprecedented legislation protecting California's labor market from the effects of automation. The document includes, among other things, the introduction of

Anker is entering the processor market with its new Thus chip, which uses compute-in-memory architecture to deliver 150x more AI processing power for its upcoming Soundcore headphones. Meanwhile, Poland is vying for a Baltic AI GigaFactory within the new EU InvestAI fund, facing open competition for billions in funding and GPU resources. In California, Governor Gavin Newsom has signed legislation aimed at protecting the job market from automation, including proposals for universal basic income and machine-favoring tax reforms. AI

IMPACT New AI hardware, EU funding competition, and state-level automation policy signal diverse industry developments.
- California
- Gavin Newsom
- GPU
- Anker
- Thus
- Soundcore
- Poland
- InvestAI
- Baltic AI GigaFactory
RESEARCH · The Register — AI English(EN) · 1w

Uncle Sam's next big supercomputer might use something more exotic than GPUs

The US government is exploring alternative hardware for its next major supercomputer, potentially moving beyond traditional GPUs. This exploration is driven by the accelerating adoption of AI and the associated security challenges. Researchers are concerned that large AI companies may be subverting regulations, similar to past practices in the tobacco and oil industries, prioritizing industry interests over public concerns. AI

IMPACT Exploration of alternative hardware for supercomputing could impact AI development infrastructure, while regulatory concerns highlight potential shifts in AI governance.
- AI
- US
- GPU
TOOL · Mastodon — mastodon.social English(EN) · 4d

There is a new technique to speed up token generation called MTP. It predicts several future tokens, then the main model verifies them in parallel. There is a c

A new method called MTP (Multi-Token Prediction) has been developed to accelerate token generation in AI models. This technique involves predicting multiple future tokens simultaneously and then having the main model verify them in parallel. However, MTP requires a significant increase in VRAM, which can lead to slower generation or reduced context size on GPUs with limited memory. The technique does not appear to reduce model hallucinations. AI

IMPACT This technique could speed up AI inference but requires more VRAM, potentially limiting its use on consumer hardware.
- GPU
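
Sketch (not from the source post): the draft-then-verify loop described above can be illustrated with a toy greedy version. Here `draft_next` is a hypothetical stand-in for the cheap multi-token predictor and `target_next` for the main model; both names, the greedy acceptance rule, and the sequential verification loop are assumptions made for illustration. A real system would verify all drafted positions in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]


def draft_and_verify(context: List[Token], draft_next: NextFn,
                     target_next: NextFn, k: int = 4) -> List[Token]:
    """Return the tokens emitted by one draft-then-verify step (greedy)."""
    # 1. Draft k future tokens with the cheap predictor.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # 2. Verify each drafted position against the main model.
    #    (Done in a loop here; conceptually this is one parallel pass.)
    accepted: List[Token] = []
    for i in range(k):
        expected = target_next(context + draft[:i])
        if draft[i] != expected:
            # First mismatch: keep the main model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(draft[i])

    # 3. Every draft token matched, so the main model adds one bonus token.
    accepted.append(target_next(context + draft))
    return accepted


# Toy models: the main model counts upward mod 10; the draft model
# makes a mistake whenever the last token is 3.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(draft_and_verify([0], toy_draft, toy_target, k=4))  # [1, 2, 3, 4]
print(draft_and_verify([5], toy_draft, toy_target, k=2))  # [6, 7, 8]
```

The output always matches what the main model alone would have produced; the gain is that several tokens can be emitted per verification pass, at the cost of the extra memory the summary mentions.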
SIGNIFICANT · Mastodon — mastodon.social English(EN) · 5d · [2 sources]

Times of India | Who wins and who loses when AI comes for the workplace. AI is rapidly resh

Intercontinental Exchange, the parent company of the New York Stock Exchange, is planning to launch futures contracts for computing power, specifically focusing on GPUs. This initiative, in partnership with Ornn, aims to create a market for AI-driven technology demand, pending regulatory approval. Meanwhile, AI is transforming the workplace by automating routine tasks, leading to significant tech job cuts and threatening roles like data-entry clerks and customer service agents. The impact is disproportionately felt by younger workers, while demand increases for jobs requiring human judgment, creativity, and interpersonal skills. AI

IMPACT Establishes a financial market for AI computing power and highlights AI's role in job displacement and the creation of new skill demands.
TOOL · dev.to — LLM tag English(EN) · 5d

vLLM in Production: Ranked Configuration Decisions, Failure Modes, and the Architecture That Makes Them Work

This article provides a guide for optimizing vLLM deployments, focusing on three critical configuration decisions that impact performance and cost. It details how static KV cache allocation can lead to GPU out-of-memory errors and emphasizes the importance of selecting the right serving framework, managing memory budgets for KV cache versus model weights, and configuring batching strategies like chunked prefill and prefix caching. The guide also outlines common failure modes and offers architectural insights for effective vLLM operation. AI

IMPACT Provides crucial operational insights for efficiently deploying and managing large language models using vLLM.
COMMENTARY · Anyscale blog English(EN) · 3d

Architecting Data Pipelines for Multimodal Datasets at Scale

Anyscale's blog post details challenges in scaling multimodal AI data pipelines, where preprocessing often starves GPUs, leading to underutilization. The article explains that traditional staged batch execution, which involves writing intermediate data to storage between preprocessing and training, is inefficient due to significant I/O costs and delays. It proposes a disaggregated streaming architecture using Ray Data to directly stream preprocessed data from a dedicated preprocessing fleet to GPU workers, bypassing storage bottlenecks and improving GPU utilization. AI

IMPACT Provides architectural guidance for optimizing AI training and inference infrastructure, particularly for multimodal datasets.
- Spark
- Anyscale
- MapReduce
- multimodal AI
- Ray Data
TOOL · Mastodon — fosstodon.org Polski(PL) · 6d · [3 sources]

Manus introduces Scheduled Tasks 2.0, turning ordinary reminders into intelligent agents that remember context and self-update applications.

Manus has launched Scheduled Tasks 2.0, transforming basic reminders into intelligent agents capable of maintaining context and autonomously updating web applications. Meanwhile, the United Arab Emirates aims to generate 40% of Dubai's GDP from AI by 2031 through widespread automation of legislation and public services. In Poland, only 9% of companies possess the necessary GPU infrastructure for advanced AI, leading to a common bottleneck in pilot project implementation. AI

IMPACT AI adoption is being shaped by new agentic tools, ambitious national automation goals, and significant infrastructure limitations in various regions.
COMMENTARY · dev.to — LLM tag English(EN) · 2d

CPU vs GPU inference in llama.cpp isn’t just about speed — it’s about real-world constraints. In many local AI deployments, consistency and availability matter more than peak performance. Great breakdown of the tradeoffs in local LLM inference. #LLM

This article explores the practical differences between CPU and GPU inference for large language models (LLMs) using the llama.cpp framework. It highlights that while GPUs offer superior speed, CPUs can be a viable alternative when factors like consistency, availability, and resource constraints are more critical for local deployments. The piece provides a detailed analysis of the trade-offs involved in choosing between these hardware options for running LLMs. AI

IMPACT Provides practical guidance for operators on hardware choices for local LLM deployments, impacting cost and performance considerations.
- llama.cpp
- GPU
- CPU
- Maxim Saplin
COMMENTARY · The Register — AI English(EN) · 6d

Baidu says the quiet part out loud – you can’t build AI infrastructure, so clouds can cash in

Baidu's CFO stated that building AI infrastructure is prohibitively difficult for most companies, which lets cloud providers capitalize on the gap. The difficulty stems from the high cost and complexity of AI hardware, particularly GPUs. Consequently, cloud services offering GPU rentals are positioned to achieve structurally higher profit margins than traditional CPU-based cloud offerings. AI

IMPACT Cloud providers are poised to capture higher margins as the complexity of building AI infrastructure deters direct investment by many companies.
- cloud providers
- Baidu
COMMENTARY · 36氪 (36Kr) 中文(ZH) · 4d

From Computing Power to Value: Infrastructure Reconstruction and New Engine for Industrial Growth in the AI Era | 2026 AI Partner · Beijing Yizhuang AI+ Industry Conference

The AI industry is shifting its focus from model parameters to computational efficiency, with "token economics" emerging as a new value unit. This transition is driving demand for "token factories" – intelligent computing centers optimized for inference, which is projected to consume significantly more power than training. Beijing Yingbo Digital Technology Co., Ltd. positions itself as a full-stack builder of these token factories, offering integrated solutions from planning to delivery and flexible billing models. AI

IMPACT Highlights the shift towards inference optimization and the rise of token economics, impacting infrastructure providers and AI service pricing.
COMMENTARY · Medium — MLOps tag English(EN) · 5d

The GPU Is the New Database

The article posits that GPUs are becoming the new databases, drawing parallels to the early days of database management. Just as teams fumbled through early database adoption, they are now navigating the complexities of large-scale GPU deployment. This shift signifies a fundamental change in how data is processed and stored, with GPUs taking on roles previously held by traditional database systems. AI

IMPACT Suggests a fundamental shift in data processing infrastructure, impacting how AI models are trained and deployed.
- GPU
- database
COMMENTARY · 雷峰网 (Leiphone) 中文(ZH) · 4d

SenseTime Guoxiang Capital Partner Li Yang: GPU Valuations Double, RISC-V Takes Center Stage, How Can Capital Lock in Certainty?

Li Yang, a partner at SenseTime Guoxiang Capital, discusses the AI chip investment landscape, emphasizing that product definition and future use cases are more critical than technology alone. He highlights the shift from cloud GPUs to edge AI chips and the rise of RISC-V, noting that successful investments depend on identifying genuine market needs and long-term trends. Li shares insights from their investment in DapuStor (大普微), a server SSD manufacturer, which succeeded by focusing on a complete product offering to meet the demand for domestic alternatives in servers and data centers. AI

IMPACT Provides insights into investment strategies for AI hardware, guiding future capital allocation in the sector.
COMMENTARY · Mastodon — sigmoid.social 한국어(KO) · 3d

The current AI pricing was always going to go away. The existing fixed pricing for AI services has become unsustainable due to rising AI inference costs and surging usage. Soaring prices for GPUs and High Bandwidth Memory (HBM), along with increased power and cooling costs, have significantly driven up supply-side costs.

The existing fixed pricing models for AI services are becoming unsustainable due to rising inference costs and increased usage. Surging prices for GPUs and High Bandwidth Memory (HBM), coupled with higher power and cooling expenses, are pushing AI companies to raise prices to offset losses. Future AI products will likely focus on cost-effective use cases and adopt flexible pricing structures like API call-based billing, credit systems, or hybrid models to manage cost fluctuations. AI

IMPACT AI service providers must adapt pricing to manage rising hardware and operational costs, potentially impacting adoption and profitability.
- GPU
COMMENTARY · Mastodon — mastodon.social English(EN) · 5d

Higher Prices Could Slow the AI Building Boom https://www.wsj.com/finance/stocks/higher-prices-could-slow-the-ai-building-boom-627f6c2c?mod=rss_markets_main # A

The escalating costs of AI development, particularly for advanced hardware like GPUs, are beginning to strain the rapid expansion of the AI industry. This price surge, driven by high demand and limited supply, could potentially decelerate the pace of innovation and deployment of new AI technologies. Companies are facing increased operational expenses, which may lead to a more cautious approach to investment and growth in the sector. AI

IMPACT Rising hardware costs may force AI companies to optimize resource allocation and seek more efficient development strategies.
- AI
- Wall Street Journal
TOOL · arXiv cs.LG English(EN) · 3d

WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

Researchers have developed WarmServe, a new system designed to improve the efficiency of serving multiple large language models (LLMs) on shared GPU clusters. WarmServe utilizes a one-for-many GPU prewarming strategy, proactively loading model parameters based on predicted workload patterns. This approach aims to reduce the time-to-first-token (TTFT) degradation often seen in multi-LLM serving systems. Evaluations indicate WarmServe can significantly decrease tail TTFT and increase request throughput compared to existing methods. AI

IMPACT Optimizes LLM serving infrastructure, potentially reducing latency and increasing throughput for deployed models.
- LLM
- GPU
- Chiheng Lou
- WarmServe
TOOL · arXiv cs.LG English(EN) · 3d

FlashSinkhorn: IO-Aware Entropic Optimal Transport on GPU

Researchers have developed FlashSinkhorn, a new GPU-accelerated solver for entropic optimal transport (EOT) that significantly reduces memory input/output operations. By rewriting stabilized log-domain Sinkhorn updates to mimic the normalization process in transformer attention, FlashSinkhorn enables fused kernels that stream data through on-chip SRAM. This approach achieves substantial speedups, up to 32x for forward passes and 161x end-to-end, compared to existing methods on A100 GPUs for tasks like point-cloud OT. AI

IMPACT This IO-aware solver could accelerate various machine learning applications that rely on optimal transport, potentially improving efficiency and scalability.
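
Sketch (not the paper's fused kernel): the stabilized log-domain Sinkhorn update mentioned above can be written in plain Python to show why it resembles attention normalization. Each update is a row-wise or column-wise log-sum-exp, the same reduction that a softmax normalizer performs. The function names, the tiny 2x2 problem, and the choice of `eps` are assumptions for illustration only.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def logsumexp(xs: List[float]) -> float:
    """Numerically stable log(sum(exp(x))), as used in softmax."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sinkhorn_log(a: List[float], b: List[float], C: Matrix,
                 eps: float = 0.5, iters: int = 200
                 ) -> Tuple[List[float], List[float], Matrix]:
    """Entropic OT between histograms a and b with cost matrix C.

    Returns dual potentials f, g and the plan
    P[i][j] = a[i] * b[j] * exp((f[i] + g[j] - C[i][j]) / eps).
    """
    n, m = len(a), len(b)
    f = [0.0] * n
    g = [0.0] * m
    for _ in range(iters):
        # Row update: one log-sum-exp per source point.
        f = [-eps * logsumexp([math.log(b[j]) + (g[j] - C[i][j]) / eps
                               for j in range(m)]) for i in range(n)]
        # Column update: one log-sum-exp per target point.
        g = [-eps * logsumexp([math.log(a[i]) + (f[i] - C[i][j]) / eps
                               for i in range(n)]) for j in range(m)]
    P = [[a[i] * b[j] * math.exp((f[i] + g[j] - C[i][j]) / eps)
          for j in range(m)] for i in range(n)]
    return f, g, P


a = [0.5, 0.5]
b = [0.25, 0.75]
C = [[0.0, 1.0], [1.0, 0.0]]
f, g, P = sinkhorn_log(a, b, C)
row_sums = [sum(row) for row in P]                                # close to a
col_sums = [sum(P[i][j] for i in range(2)) for j in range(2)]     # close to b
```

This version materializes every intermediate list in memory; per the summary, the paper's contribution is avoiding that traffic by streaming tiles through on-chip SRAM.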
RESEARCH · arXiv stat.ML English(EN) · 4d · [2 sources]

From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal $k$-Sparse GLMs

Researchers have developed a new CPU-GPU framework to accelerate optimization problems with discrete variables, which have historically been challenging for GPUs. This framework processes branch and bound nodes in batches on GPUs, overcoming issues of sequential processing and data movement. Experiments demonstrate significant speedups and the ability to collect the full Rashomon set for further statistical analysis. AI

IMPACT Enables faster and more comprehensive analysis of complex models, potentially improving downstream AI applications.
TOOL · arXiv cs.AI English(EN) · 6d

GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

Researchers have developed GEM, a framework designed to optimize the mapping of experts to GPUs in Mixture-of-Experts (MoE) AI models. This new approach accounts for variability in GPU performance, aiming to reduce inference latency by strategically placing experts. GEM's strategy involves distributing experts to ensure GPUs finish processing layers concurrently, thereby mitigating slowdowns caused by slower GPUs or overloaded experts. Experiments indicate that GEM can improve end-to-end latency by an average of 7.9%, with some cases showing improvements up to 16.5%. AI

IMPACT Optimizes MoE model inference, potentially reducing latency and improving efficiency for large-scale AI deployments.
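
Sketch (a generic greedy heuristic, not GEM's algorithm): the idea of placing experts so that GPUs of different speeds finish a layer at about the same time can be illustrated with longest-processing-time-first scheduling. The expert loads, the relative GPU speeds, and the function names below are hypothetical, and the sketch ignores constraints a real MoE system would have, such as a fixed number of experts per GPU.

```python
from typing import Dict, List


def map_experts(expert_load: Dict[str, float],
                gpu_speed: Dict[str, float]) -> Dict[str, List[str]]:
    """Greedily assign experts (heaviest first) to the GPU that would
    finish earliest, where finish time = assigned load / relative speed."""
    placement: Dict[str, List[str]] = {g: [] for g in gpu_speed}
    work: Dict[str, float] = {g: 0.0 for g in gpu_speed}
    heaviest_first = sorted(expert_load.items(),
                            key=lambda kv: kv[1], reverse=True)
    for expert, load in heaviest_first:
        best = min(gpu_speed,
                   key=lambda g: (work[g] + load) / gpu_speed[g])
        placement[best].append(expert)
        work[best] += load
    return placement


def layer_latency(placement: Dict[str, List[str]],
                  expert_load: Dict[str, float],
                  gpu_speed: Dict[str, float]) -> float:
    """The layer finishes when the slowest GPU finishes."""
    return max(sum(expert_load[e] for e in experts) / gpu_speed[g]
               for g, experts in placement.items())


loads = {"e0": 8.0, "e1": 6.0, "e2": 4.0, "e3": 2.0}
speeds = {"gpu0": 1.0, "gpu1": 0.5}   # gpu1 runs at half speed

aware = map_experts(loads, speeds)
naive = {"gpu0": ["e0", "e2"], "gpu1": ["e1", "e3"]}  # round-robin

print(aware)                                # {'gpu0': ['e0', 'e2', 'e3'], 'gpu1': ['e1']}
print(layer_latency(aware, loads, speeds))  # 14.0
print(layer_latency(naive, loads, speeds))  # 16.0
```

In this toy case the speed-aware placement shortens the layer from 16.0 to 14.0 time units by giving the slower GPU less work.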
RESEARCH · arXiv stat.ML English(EN) · 6d · [2 sources]

Understanding Deterioration Random Effects for Causal Discovery in Infrastructure Management

Researchers have developed a new framework for causal discovery in infrastructure management, focusing on pump equipment deterioration. This method combines Bayesian hierarchical hazard modeling with causal discovery to identify operational patterns that influence varying deterioration rates. The study analyzed 112 pumps and found significant heterogeneity, with one group showing causal effects 400 times larger than another, highlighting the need for distinct management approaches. AI

IMPACT Introduces a novel framework for heterogeneity-aware predictive maintenance in infrastructure, potentially improving asset management strategies.
SIGNIFICANT · 36氪 (36Kr) 中文(ZH) · 5d · [7 sources]

EU reaches provisional agreement on implementation plan for EU-US trade agreement

Alibaba's cloud division is facing scrutiny over its AI strategy, with investors closely monitoring its token revenue growth as a key indicator of future profitability. While AI compute sales offer high revenue, they yield low profit margins, prompting a shift towards a "model-as-a-service" (MaaS) approach. Despite initial concerns about multimodal capabilities and competitive pacing, Alibaba has accelerated its MaaS efforts with new product launches and internal restructuring, aiming to capture higher-margin revenue and deeper customer integration. AI

IMPACT Alibaba's strategic shift to high-margin AI token revenue signals a broader industry trend towards monetizing AI capabilities beyond raw compute.
- European Parliament
- EU-US trade agreement
- OpenAI
- EU
- US
- Alibaba
- H200
- Alibaba Cloud
- GPU
- HappyHorse
- MaaS
RESEARCH · Medium — MLOps tag English(EN) · 6d · [4 sources]

Your LLM Server Is Wasting 80% of Its GPU Memory — Here’s How vLLM Fixes That

Large language models (LLMs) face a significant bottleneck in serving efficiency due to the memory demands of KV cache, which stores intermediate attention calculations. This KV cache, essential for enabling faster responses and handling longer context windows, can consume up to 80% of GPU memory. Innovations like vLLM's PagedAttention, inspired by operating system memory management, are addressing this by optimizing KV cache storage and reducing memory fragmentation, leading to substantial improvements in inference throughput. AI

IMPACT Optimizing KV cache and memory usage is crucial for reducing LLM serving costs and improving inference speed, enabling wider adoption of AI applications.
- GPT-4
- Claude
- LLM
- KV cache
- vLLM
- GPU
- PagedAttention
- Llama-2-7b-hf
- Llama-2
- LLMs
- Tensormesh
- SemiAnalysis
- dev.to
- Medium
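
Sketch (back-of-the-envelope arithmetic, not taken from the article): the KV cache footprint and the waste from static reservation can be estimated as below. The Llama-2-7B shape values (32 layers, 32 KV heads, head dimension 128, fp16) come from the published model configuration; the 2048-token reservation, the 100-token request, and the 16-token block size are assumptions chosen for illustration.

```python
import math


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies.

    The leading factor 2 covers one key and one value vector per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def wasted_slots_static(tokens_used: int, max_len: int) -> int:
    """Static allocation reserves max_len token slots up front."""
    return max_len - tokens_used


def wasted_slots_paged(tokens_used: int, block_size: int = 16) -> int:
    """Paged allocation reserves whole blocks only as tokens arrive."""
    blocks = math.ceil(tokens_used / block_size)
    return blocks * block_size - tokens_used


LLAMA2_7B = dict(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2)

per_token = kv_cache_bytes_per_token(**LLAMA2_7B)
print(per_token)                      # 524288 bytes, i.e. 0.5 MiB per token
print(4096 * per_token / 2**30)       # 2.0 GiB for one 4096-token sequence

print(wasted_slots_static(100, max_len=2048))   # 1948 slots reserved but unused
print(wasted_slots_paged(100, block_size=16))   # 12 slots unused
```

The gap between the last two numbers is the kind of internal fragmentation that block-based KV cache allocation is meant to reduce.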
RESEARCH · Hugging Face Daily Papers English(EN) · 5d · [2 sources]

Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

Researchers have developed Mahjax, a new GPU-accelerated simulator for the complex game of Riichi Mahjong, implemented in JAX. This tool is designed to facilitate reinforcement learning research, particularly for agents learning from scratch rather than relying on human play data. Mahjax achieves high throughput, processing up to 2 million steps per second on multiple GPUs, and has been validated for training agents to improve their performance. AI

IMPACT Enables large-scale reinforcement learning research for complex games, potentially leading to more general AI decision-making capabilities.
RESEARCH · dev.to — LLM tag English(EN) · 1w · [6 sources]

Designing Nvidia-Grade Ising Quantum AI Models for Robust Qubit Calibration

Nvidia has released open-source Ising quantum AI models designed to automate and improve the calibration of quantum processors. These models, which include a vision-language model for proposing calibration actions and CNNs for error correction decoding, are intended to be integrated into existing quantum control stacks. By treating calibration as an AI inference problem, similar to how LLMs are deployed, Nvidia aims to enhance the speed, accuracy, and robustness of quantum hardware operations, while also emphasizing the need for governance and security protocols. AI

IMPACT Enables more robust and automated calibration for quantum hardware, potentially accelerating quantum computing development.
- Nvidia
- LLM
- Cadence
- GPU
- AI Act
- Ising
- Quantum AI
- Qibo
- Qibocal
- ChipStack AI Super Agent
- Qibolab
- Ubuntu Inference Snaps
- CUDA-Q
RESEARCH · Lobsters — AI tag English(EN) · 3d · [3 sources]

Dissecting ThunderKittens, anatomy of a compact DSL for high-performance AI kernels

A new article details ThunderKittens, a compact domain-specific language (DSL) developed at Stanford's Hazy Research Lab for creating high-performance AI kernels. The DSL aims to strike a balance between research productivity and hardware efficiency by abstracting repetitive GPU programming tasks like tile layouts and memory allocation. This allows developers to maintain close reasoning about data movement and scheduling while still enabling performance optimization for modern AI workloads on hardware like NVIDIA's Hopper and Blackwell architectures. AI

IMPACT Enables more efficient AI model training and inference by optimizing low-level GPU kernel performance.
- AI
- Stanford
- FlashAttention-2
- Hopper
- PyTorch
- CUDA
- GPU
- Blackwell
- Hazy Research Lab
- Triton
- ThunderKittens
- NVIDIA
RESEARCH · dev.to — LLM tag English(EN) · 3d · [4 sources]

Stop paying for idle GPUs in your CI: batching LLM eval jobs

The integration of Large Language Models (LLMs) into professional workflows is shifting from experimental use to essential tooling, emphasizing collaboration rather than automation. However, the reliability of these LLM providers is becoming a critical concern, with frequent outages necessitating robust fallback mechanisms. To address this, open-source solutions like Bifrost are emerging to manage adaptive model routing and fallback logic at the gateway tier, ensuring application uptime even during provider incidents. Concurrently, optimizing the cost of LLM evaluations within CI/CD pipelines is crucial, as batching jobs and implementing tiered testing strategies can significantly reduce GPU expenditure. AI

IMPACT Emerging infrastructure solutions are crucial for maintaining application uptime and reducing operational costs as LLM adoption grows.
- LiteLLM
- OpenAI
- Claude
- LLM
- GPU
- Llama 3.1 8B Instruct
- Bifrost
- Maxim AI
- ChatGPT
- Llama
SIGNIFICANT · Mastodon — fosstodon.org English(EN) · 3w · [2 sources]

Datavault AI Secures $120M Funding for Nationwide GPU Network Expansion Datavault AI gets $120 million from Scilex Holding to build a GPU network in 100 US citi

Datavault AI has secured $120 million in funding from Scilex Holding to establish a nationwide GPU network. This initiative aims to provide increased computing power for companies engaged in artificial intelligence development. The network will be deployed across 100 cities in the United States, focusing on edge computing infrastructure. AI

IMPACT Expands access to critical GPU resources, potentially accelerating AI development and deployment for businesses.
- AI
- GPU
- Datavault AI
- Scilex Holding
- Edge Computing
COMMENTARY · Mastodon — sigmoid.social English(EN) · 3w · [12 sources]

Critical Minerals AI Supply Chain: Who Controls the Future Six chokepoints control every GPU, HBM chip, and data center cooling system. China processes 90% of r

Six critical chokepoints in the AI supply chain, from raw materials to finished chips, are dominated by China. The country processes 90% of rare earths, highlighting its significant control over the production of GPUs, HBM chips, and data center cooling systems essential for AI development. AI

IMPACT Highlights geopolitical risks and resource dependencies in AI hardware production, potentially impacting future development and accessibility.
TOOL · Anyscale blog English(EN) · 1mo

Announcing DP Group Fault Tolerance for vLLM WideEP Deployments with Ray Serve LLM

Anyscale has introduced data-parallel (DP) group fault tolerance for vLLM WideEP deployments served with Ray Serve LLM. The enhancement addresses the challenges of deploying large Mixture-of-Experts (MoE) models, which are sharded across multiple GPUs. When a single GPU within a DP group fails, the system can now identify and restart the entire group of GPUs that forms it, preventing the whole deployment from becoming unavailable. AI

IMPACT Enhances the reliability and operational efficiency of serving large, complex Mixture-of-Experts models, which are becoming increasingly common.
TOOL · Together AI blog English(EN) · 1mo

Inside the Together AI kernels team

The Together AI kernels team, including researchers Dan Fu and Tri Dao, developed FlashAttention, a software layer that significantly optimizes GPU performance for AI models. This breakthrough, achieved by applying database system principles to GPU memory movement, resulted in 2-3x speedups, challenging the notion that transformer attention was already fully optimized. The team's subsequent work, including the ThunderKittens library, aims to accelerate kernel development for new hardware like NVIDIA's Blackwell GPUs, addressing the critical software-hardware gap in AI infrastructure. AI

IMPACT Optimizes AI inference and training by bridging the software-hardware gap, potentially lowering costs and improving responsiveness.
- NVIDIA
- Stanford
- Together AI
- Andrej Karpathy
- Tesla
- GPU
- FlashAttention
- ThunderKittens
- Tri Dao
- Dan Fu
TOOL · Together AI blog English(EN) · 4mo

Inside multi-node training: How to scale model training across GPU clusters

Training large foundation models necessitates distributing the workload across numerous GPUs housed in multiple interconnected machines, a process known as multi-node training. This approach is essential for handling models with billions or trillions of parameters that exceed the memory capacity of single servers and would otherwise take months to train. Effective multi-node training relies on sophisticated parallelism strategies, high-speed network interconnects, and robust fault tolerance mechanisms to ensure efficient computation and progress. AI

IMPACT Explains the critical infrastructure and techniques required to train massive AI models, enabling faster iteration and development.
- Together AI
- GPU
- foundation models
- Qwen2.5-72B
- NVLink
- InfiniBand
- B300 GPU