Wafer.ai has successfully deployed GLM5.2 on AMD MI355X hardware, achieving a throughput of 2626 tokens/second/node and 213 tokens/second for single-stream inference. This deployment offers a cost advantage, with MI355X GPUs being approximately 2.75 times cheaper than NVIDIA's Blackwell B300. The optimization involved quantizing GLM5.2 to MXFP4 using AMD Quark and employing the sglang inference framework, with specific modifications to enable speculative decoding on ROCm. AI
IMPACT Accelerates adoption of cost-effective inference solutions, potentially lowering the barrier to entry for deploying large language models.
RANK_REASON The cluster details a cost-effective deployment of a frontier model on alternative hardware, highlighting a significant industry trend in optimizing AI inference costs.
Read on Hacker News — AI stories ≥50 points →
AI-generated summary · Google Gemini · from 5 sources. How we write summaries →