GEMM
PulseAugur coverage of GEMM — every cluster mentioning GEMM across labs, papers, and developer communities, ranked by signal.
-
New book details modern GPU programming for AI workloads
A new book titled "Modern GPU Programming for MLSys" aims to demystify high-performance GPU kernel development for machine learning systems. The book, originating from Carnegie Mellon University's Machine Learning Syste…
-
Apple M4 Max GPU's tensor compute path emulated, not accelerated
Researchers have reverse-engineered the Metal 4.1 tensor compute path on Apple's M4 Max GPU, revealing that the fp8 matmul2d operation is emulated rather than hardware-accelerated. This means the operation runs on the G…
-
New framework enhances LLM-generated Verilog with feedback and skill evolution
Researchers have developed Verilog-Evolve, a framework for improving Verilog code generation with large language models. This system moves beyond isolated sampling and functional checking by incorpor…
-
TileLang simplifies GPU kernel writing with Python interface
A new programming language called TileLang aims to simplify GPU kernel development by offering a middle ground between high-level frameworks like Triton and low-level libraries like CUTLASS. TileLang allows developers to …
-
Sakana AI, NVIDIA unveil TwELL for faster LLM training and inference
Researchers from Sakana AI and NVIDIA have developed TwELL, a method that speeds up large language model (LLM) training and inference. By targeting the feedforward layers, which are computationally intensive, Tw…
-
Tempus framework offers scalable, resource-efficient GEMM for edge AI
Researchers have developed Tempus, a new framework designed to optimize General Matrix Multiplication (GEMM) for edge AI deployments on AMD Versal SoCs. Unlike existing spatial scaling methods that fail on resource-cons…