Brief

last 24h

[4/4] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
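
The brief names the four scoring factors but not how they are combined. As a rough illustration only, one way to turn them into a 0-100 score is a weighted sum of the three signals multiplied by an exponential time decay. The weights, the 48-hour half-life, and the function names below are all hypothetical and not taken from the feed.

```python
# Hypothetical weights: the brief lists the factors but not their relative importance.
WEIGHTS = {"authority": 0.4, "cluster_strength": 0.3, "headline_signal": 0.3}
HALF_LIFE_HOURS = 48.0  # assumed half-life for the time-decay factor


def clamp01(value: float) -> float:
    """Clamp a signal into the [0, 1] range."""
    return min(max(value, 0.0), 1.0)


def time_decay(age_hours: float, half_life_hours: float = HALF_LIFE_HOURS) -> float:
    """Exponential decay: 1.0 for a fresh story, 0.5 after one half-life."""
    return 0.5 ** (max(age_hours, 0.0) / half_life_hours)


def score(authority: float, cluster_strength: float,
          headline_signal: float, age_hours: float) -> float:
    """Combine three signals in [0, 1] with time decay into a 0-100 score."""
    signals = {
        "authority": authority,
        "cluster_strength": cluster_strength,
        "headline_signal": headline_signal,
    }
    base = sum(WEIGHTS[name] * clamp01(value) for name, value in signals.items())
    return round(100.0 * base * time_decay(age_hours), 1)
```

Under these assumptions a story with perfect signals scores 100 when fresh and loses half its score every 48 hours.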

TOOL · Modal blog English (EN) · 3d

How we achieved truly serverless GPUs

Modal describes the system behind its serverless GPUs for AI inference, built to scale capacity quickly as demand fluctuates. The approach combines three pieces: cloud buffers of idle GPUs, a custom filesystem that serves container images lazily, and checkpoint/restore for both CPU and GPU processes. Developed over five years, the system cuts the time to scale up an inference replica from tens of minutes to seconds, with the goal of maximizing GPU Allocation Utilization.

IMPACT Enables faster, more efficient scaling of AI inference workloads, potentially lowering costs and improving resource utilization.
- Marc Brooker
- xAI
- AWS
- Modal
- SGLang
- AI inference

RESEARCH · Together AI blog English (EN) · 3d · [2 sources]

FlashAttention

Together AI has released FlashAttention-3 and FlashAttention-4, major upgrades to the FlashAttention GPU attention kernel for large language models. FlashAttention-3, designed for Hopper GPUs, achieves up to 75% utilization and a 1.5-2x speedup over FlashAttention-2 by exploiting the asynchrony of Hopper's Tensor Cores and Tensor Memory Accelerator (TMA), and it adds FP8 support. FlashAttention-4, optimized for Blackwell GPUs, pipelines computation and addresses bottlenecks in transcendental functions and memory traffic, reaching 71% utilization and substantial speedups over existing libraries.
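
For background that the summary does not spell out: the FlashAttention family processes keys and values block by block and keeps a running ("online") softmax, so the full score matrix is never materialized. The sketch below shows that idea for a single query vector in plain Python. It is an illustration only, not the released kernels, and it models none of the Hopper or Blackwell features mentioned above; the function names and the `block_size` parameter are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, computed one key/value block at a time with an online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf          # largest score seen so far
    running_sum = 0.0                # softmax denominator, relative to running_max
    acc = [0.0] * len(values[0])     # unnormalized output, relative to running_max
    for start in range(0, len(keys), block_size):
        block_k = keys[start:start + block_size]
        block_v = values[start:start + block_size]
        scores = [dot(q, k) * scale for k in block_k]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (factor is 0.0 on the first block).
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, block_v):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same output up to floating-point rounding; the tiled version only ever holds one block of scores at a time, which is what makes long contexts practical in limited on-chip memory.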

IMPACT These optimized attention mechanisms promise significantly faster LLM training and inference, enabling longer context windows and more efficient GPU utilization.

RESEARCH · X - SemiAnalysis English (EN) · 6d

AMD ALERT 🚀 MI355 is now 40% cheaper than B200 on GLM5 architecture for Single Node serving FP8 14 weeks after the initial launch of GLM5 on both non-MTP &

According to SemiAnalysis, AMD's MI355 accelerator is now 40% cheaper than Nvidia's B200 for single-node FP8 serving on the GLM5 architecture. The figure is reported 14 weeks after the initial launch of GLM5 and covers non-MTP serving plus a second configuration that the truncated headline does not name.

IMPACT This pricing shift could significantly impact enterprise AI infrastructure choices, favoring AMD for GLM5 deployments.
- AMD
- GLM5
- Nvidia
- MI355

SIGNIFICANT · Mastodon - mastodon.social Japanese (JA) · 5d · [6 sources]

Cohere releases Command A+, an MoE multimodal model built for agentic tasks: a high-performance open-source model that enterprises can deploy in their own environments https://fed.brid.gy/r/https://gigazine.net/news/20260522-cohere-command-a-p

Cohere has released Command A+, an open-source multimodal AI model designed for enterprise use and agentic tasks. The model integrates reasoning, vision, and multilingual capabilities, supports 48 languages, and offers significant improvements in speed and efficiency over previous versions. Command A+ is available on Hugging Face with several quantization options, including W4A4 (4-bit weights and 4-bit activations), which drastically reduces the serving footprint with minimal performance loss and makes the model suitable for on-premises deployment.

IMPACT Accelerates enterprise adoption of advanced AI agents by providing a powerful, efficient, and customizable open-source model.
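
To make the W4A4 idea concrete, here is a minimal sketch of symmetric 4-bit quantization applied to both weights and activations, with the dot product accumulated in integers and rescaled once at the end. The per-tensor scaling and the function names are assumptions for illustration; the summary does not say which quantization scheme Cohere actually uses.

```python
INT4_MIN, INT4_MAX = -8, 7  # range of a signed 4-bit integer


def quantize_int4(values):
    """Symmetric per-tensor quantization: map floats to integer codes in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [max(INT4_MIN, min(INT4_MAX, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


def w4a4_dot(weights, activations):
    """Dot product with 4-bit weights and 4-bit activations (hypothetical helper)."""
    w_codes, w_scale = quantize_int4(weights)
    a_codes, a_scale = quantize_int4(activations)
    int_acc = sum(w * a for w, a in zip(w_codes, a_codes))  # integer-only accumulation
    return int_acc * w_scale * a_scale
```

Each value is stored in 4 bits instead of 16, which is where the smaller serving footprint comes from; the price is a rounding error of at most half a quantization step per value.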
