PulseAugur

DeepSeek V4 replaces cuBLAS with faster custom kernels

DeepSeek has developed a custom kernel stack, DeepGEMM plus TileLang, that not only matches but surpasses the performance of NVIDIA's cuBLAS. The custom implementation also achieves bitwise determinism and batch invariance, fixing the non-deterministic outputs produced by common workload-balancing scheduling strategies such as splitK or split-KV. Because floating-point addition is not associative, the kernels accumulate partial results in a fixed order, so identical inputs always yield identical bits for debugging, training, and inference.
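The determinism issue the summary describes stems from floating-point addition not being associative: the order in which partial sums are combined changes the output bits. A minimal, illustrative demo of that effect (the values are chosen for the example, not taken from DeepSeek's code):

```cuda
// float_order.cu -- illustrative only; shows why accumulation order matters.
#include <cstdio>

int main() {
    // 1e8f is far larger than 1.0f relative to float precision (the ulp at 1e8 is 8),
    // so whether the small term is added before or after the cancellation decides
    // whether it survives in the result.
    float a = 1e8f, b = -1e8f, c = 1.0f;

    float left  = (a + b) + c;  // cancellation first, then +1  -> 1.0
    float right = a + (b + c);  // 1.0 is absorbed into -1e8    -> 0.0

    printf("(a+b)+c = %g\n", left);
    printf("a+(b+c) = %g\n", right);

    // A GPU kernel that lets the hardware scheduler decide this order (for example,
    // atomics across thread blocks) can therefore return different bits for
    // identical inputs from run to run.
    return 0;
}
```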

Summary written by gemini-2.5-flash-lite from 4 sources.

IMPACT DeepSeek's custom kernel stack offers a potential performance advantage over standard libraries, which could influence future AI infrastructure development and optimization strategies.

RANK_REASON The cluster details a technical innovation in custom kernel development for AI model training, including performance benchmarks and technical explanations, which aligns with research-level disclosure.

Read on X — SemiAnalysis →

COVERAGE [4]

  1. X — SemiAnalysis TIER_1 · SemiAnalysis_

    So why doesn't replacing cuBLAS cost them performance? Because DeepSeek's custom kernel stack (DeepGEMM + TileLang) is actually faster. Pre-compiled cuBLAS leaves real wins on the table: build-your-own with many more tile sizes available for better SM occupancy, JIT compilation…

  2. X — SemiAnalysis TIER_1 · SemiAnalysis_

    DeepSeek V4's answer: end-to-end bitwise deterministic and batch invariant kernels. They replaced all uses of cuBLAS with their own custom implementations in DeepGEMM. Instead of using atomics to accumulate partial results in split-reduction workloads, they write partials to…

  3. X — SemiAnalysis TIER_1 · SemiAnalysis_

    Common workload-balancing scheduling strategies such as splitK or split-KV give nondeterministic output bits for the same inputs. This makes debugging training failures nearly impossible -- you can't reproduce a loss spike. For inference, some kernels can give different results…

  4. X — SemiAnalysis TIER_1 · SemiAnalysis_

    Floating point math is not associative! And many of the highest performance kernels split the workload among SMs and accumulate partial results in a nondeterministic order. Many AI labs just accept this, or pay a huge performance penalty for determinism. DeepSeek decided to do ht…
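Item 1 above attributes part of the speedup to JIT compilation over a much wider menu of tile sizes, picked for SM occupancy. Below is a minimal sketch of how candidate tile shapes could be compared with CUDA's occupancy API; the kernel stub, tile shapes, and names are hypothetical, not DeepGEMM's actual selection logic.

```cuda
// tile_occupancy.cu -- a minimal sketch, not DeepGEMM's tile-selection heuristic.
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for a GEMM tile kernel: the shared-memory staging buffers are what limit
// how many blocks of a given tile shape can be resident on one SM at a time.
template <int TILE_M, int TILE_N, int TILE_K>
__global__ void gemm_tile_stub(const float* A, const float* B, float* C) {
    __shared__ float a_tile[TILE_M][TILE_K];
    __shared__ float b_tile[TILE_K][TILE_N];
    if (threadIdx.x == 0) {
        // Real tile math omitted; touch the buffers so they are not optimized away.
        a_tile[0][0] = A[0];
        b_tile[0][0] = B[0];
        C[0] = a_tile[0][0] + b_tile[0][0];
    }
}

template <int TILE_M, int TILE_N, int TILE_K>
void report_occupancy(int block_size) {
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, gemm_tile_stub<TILE_M, TILE_N, TILE_K>, block_size, 0);
    printf("tile %dx%dx%d, %d threads/block -> %d resident blocks per SM\n",
           TILE_M, TILE_N, TILE_K, block_size, blocks_per_sm);
}

int main() {
    // A JIT path can sweep far more shapes than these; three candidates shown for brevity.
    report_occupancy<128, 128, 32>(256);
    report_occupancy< 64, 128, 32>(256);
    report_occupancy< 64,  64, 32>(128);
    return 0;
}
```

Items 2-4 describe the determinism fix itself: rather than accumulating split partials with atomics, whose ordering the hardware scheduler decides, each split writes its partial to its own workspace slot and a second kernel reduces the slots in a fixed order. A simplified sketch of that pattern, again with illustrative names and sizes rather than DeepSeek's kernels:

```cuda
// deterministic_split_sum.cu -- a simplified sketch of the pattern, not DeepSeek's code.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kSplits = 8;        // hypothetical split count along the reduced dimension
constexpr int kChunk  = 1 << 14;  // elements each split reduces (one thread per split,
                                  // purely to keep the sketch short)

// Nondeterministic pattern: every split atomically adds its partial into *out.
// The order of the float atomicAdds depends on block scheduling, so output bits
// can differ between runs on identical inputs.
__global__ void split_sum_atomic(const float* x, float* out) {
    const float* chunk = x + blockIdx.x * kChunk;
    float partial = 0.0f;
    for (int i = 0; i < kChunk; ++i) partial += chunk[i];
    atomicAdd(out, partial);
}

// Deterministic pattern, step 1: each split writes its partial to its own slot.
__global__ void split_sum_partials(const float* x, float* workspace) {
    const float* chunk = x + blockIdx.x * kChunk;
    float partial = 0.0f;
    for (int i = 0; i < kChunk; ++i) partial += chunk[i];
    workspace[blockIdx.x] = partial;
}

// Step 2: one thread combines the slots in a fixed order, so the same inputs always
// produce the same output bits regardless of how the blocks were scheduled.
__global__ void reduce_partials(const float* workspace, float* out) {
    float acc = 0.0f;
    for (int s = 0; s < kSplits; ++s) acc += workspace[s];
    *out = acc;
}

int main() {
    const int n = kSplits * kChunk;
    float *x, *workspace, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&workspace, kSplits * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f / (1.0f + i);  // values where order matters

    *out = 0.0f;
    split_sum_atomic<<<kSplits, 1>>>(x, out);
    cudaDeviceSynchronize();
    printf("atomic sum        = %.9g  (accumulation order scheduler-dependent)\n", *out);

    split_sum_partials<<<kSplits, 1>>>(x, workspace);
    reduce_partials<<<1, 1>>>(workspace, out);
    cudaDeviceSynchronize();
    printf("deterministic sum = %.9g  (slots reduced in fixed order)\n", *out);

    cudaFree(x); cudaFree(workspace); cudaFree(out);
    return 0;
}
```

The atomic variant can legitimately print different bits on different runs with the same input; the two-pass variant always reduces the slots in index order, which is the bitwise-deterministic behavior the cluster highlights.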