PulseAugur

DeepSeek V4 replaces cuBLAS with faster custom kernels

DeepSeek has developed a custom kernel stack, DeepGEMM plus TileLang, that not only matches but surpasses the performance of NVIDIA's cuBLAS. The custom implementation also achieves bitwise determinism and batch invariance, fixing the non-deterministic outputs produced by common workload-balancing scheduling strategies such as splitK or split-KV. Because floating-point addition is not associative, the kernels accumulate partial results in a fixed order, so identical inputs always yield identical bits for debugging, training, and inference.
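The determinism issue the summary describes stems from floating-point addition not being associative: the order in which partial sums are combined changes the output bits. A minimal, illustrative demo of that effect (the values are chosen for the example, not taken from DeepSeek's code):

```cuda
// float_order.cu -- illustrative only; shows why accumulation order matters.
#include <cstdio>

int main() {
    // 1e8f is far larger than 1.0f relative to float precision (the ulp at 1e8 is 8),
    // so whether the small term is added before or after the cancellation decides
    // whether it survives in the result.
    float a = 1e8f, b = -1e8f, c = 1.0f;

    float left  = (a + b) + c;  // cancellation first, then +1  -> 1.0
    float right = a + (b + c);  // 1.0 is absorbed into -1e8    -> 0.0

    printf("(a+b)+c = %g\n", left);
    printf("a+(b+c) = %g\n", right);

    // A GPU kernel that lets the hardware scheduler decide this order (for example,
    // atomics across thread blocks) can therefore return different bits for
    // identical inputs from run to run.
    return 0;
}
```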

Summary written by gemini-2.5-flash-lite from 4 sources.

IMPACT DeepSeek's custom kernel stack offers a potential performance advantage over standard libraries, which could influence future AI infrastructure development and optimization strategies.

RANK_REASON The cluster details a technical innovation in custom kernel development for AI model training, including performance benchmarks and technical explanations, which aligns with research-level disclosure.

Read on X — SemiAnalysis →

COVERAGE [4]

  1. X — SemiAnalysis TIER_1 · SemiAnalysis_

    So why doesn't replacing cuBLAS cost them performance? Because DeepSeek's custom kernel stack (DeepGEMM + TileLang) is actually faster. Pre-compiled cuBLAS leaves real wins on the table: build-your-own with many more tile sizes available for better SM occupancy, JIT compilation…

  2. X — SemiAnalysis TIER_1 · SemiAnalysis_

    DeepSeek V4's answer: end-to-end bitwise deterministic and batch invariant kernels. They replaced all uses of cuBLAS with their own custom implementations in DeepGEMM. Instead of using atomics to accumulate partial results in split-reduction workloads, they write partials to…

  3. X — SemiAnalysis TIER_1 · SemiAnalysis_

    Common workload-balancing scheduling strategies such as splitK or split-KV give nondeterministic output bits for the same inputs. This makes debugging training failures nearly impossible -- you can't reproduce a loss spike. For inference, some kernels can give different results…

  4. X — SemiAnalysis TIER_1 · SemiAnalysis_

    Floating point math is not associative! And many of the highest performance kernels split the workload among SMs and accumulate partial results in a nondeterministic order. Many AI labs just accept this, or pay a huge performance penalty for determinism. DeepSeek decided to do ht…
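Item 1 above attributes part of the speedup to JIT compilation over a much wider menu of tile sizes, picked for SM occupancy. Below is a minimal sketch of how candidate tile shapes could be compared with CUDA's occupancy API; the kernel stub, tile shapes, and names are hypothetical, not DeepGEMM's actual selection logic.

```cuda
// tile_occupancy.cu -- a minimal sketch, not DeepGEMM's tile-selection heuristic.
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for a GEMM tile kernel: the shared-memory staging buffers are what limit
// how many blocks of a given tile shape can be resident on one SM at a time.
template <int TILE_M, int TILE_N, int TILE_K>
__global__ void gemm_tile_stub(const float* A, const float* B, float* C) {
    __shared__ float a_tile[TILE_M][TILE_K];
    __shared__ float b_tile[TILE_K][TILE_N];
    if (threadIdx.x == 0) {
        // Real tile math omitted; touch the buffers so they are not optimized away.
        a_tile[0][0] = A[0];
        b_tile[0][0] = B[0];
        C[0] = a_tile[0][0] + b_tile[0][0];
    }
}

template <int TILE_M, int TILE_N, int TILE_K>
void report_occupancy(int block_size) {
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, gemm_tile_stub<TILE_M, TILE_N, TILE_K>, block_size, 0);
    printf("tile %dx%dx%d, %d threads/block -> %d resident blocks per SM\n",
           TILE_M, TILE_N, TILE_K, block_size, blocks_per_sm);
}

int main() {
    // A JIT path can sweep far more shapes than these; three candidates shown for brevity.
    report_occupancy<128, 128, 32>(256);
    report_occupancy< 64, 128, 32>(256);
    report_occupancy< 64,  64, 32>(128);
    return 0;
}
```

Items 2-4 describe the determinism fix itself: rather than accumulating split partials with atomics, whose ordering the hardware scheduler decides, each split writes its partial to its own workspace slot and a second kernel reduces the slots in a fixed order. A simplified sketch of that pattern, again with illustrative names and sizes rather than DeepSeek's kernels:

```cuda
// deterministic_split_sum.cu -- a simplified sketch of the pattern, not DeepSeek's code.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kSplits = 8;        // hypothetical split count along the reduced dimension
constexpr int kChunk  = 1 << 14;  // elements each split reduces (one thread per split,
                                  // purely to keep the sketch short)

// Nondeterministic pattern: every split atomically adds its partial into *out.
// The order of the float atomicAdds depends on block scheduling, so output bits
// can differ between runs on identical inputs.
__global__ void split_sum_atomic(const float* x, float* out) {
    const float* chunk = x + blockIdx.x * kChunk;
    float partial = 0.0f;
    for (int i = 0; i < kChunk; ++i) partial += chunk[i];
    atomicAdd(out, partial);
}

// Deterministic pattern, step 1: each split writes its partial to its own slot.
__global__ void split_sum_partials(const float* x, float* workspace) {
    const float* chunk = x + blockIdx.x * kChunk;
    float partial = 0.0f;
    for (int i = 0; i < kChunk; ++i) partial += chunk[i];
    workspace[blockIdx.x] = partial;
}

// Step 2: one thread combines the slots in a fixed order, so the same inputs always
// produce the same output bits regardless of how the blocks were scheduled.
__global__ void reduce_partials(const float* workspace, float* out) {
    float acc = 0.0f;
    for (int s = 0; s < kSplits; ++s) acc += workspace[s];
    *out = acc;
}

int main() {
    const int n = kSplits * kChunk;
    float *x, *workspace, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&workspace, kSplits * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f / (1.0f + i);  // values where order matters

    *out = 0.0f;
    split_sum_atomic<<<kSplits, 1>>>(x, out);
    cudaDeviceSynchronize();
    printf("atomic sum        = %.9g  (accumulation order scheduler-dependent)\n", *out);

    split_sum_partials<<<kSplits, 1>>>(x, workspace);
    reduce_partials<<<1, 1>>>(workspace, out);
    cudaDeviceSynchronize();
    printf("deterministic sum = %.9g  (slots reduced in fixed order)\n", *out);

    cudaFree(x); cudaFree(workspace); cudaFree(out);
    return 0;
}
```

The atomic variant can legitimately print different bits on different runs with the same input; the two-pass variant always reduces the slots in index order, which is the bitwise-deterministic behavior the cluster highlights.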