ENTITY Triton

PulseAugur coverage of Triton — every cluster mentioning Triton across labs, papers, and developer communities, ranked by signal.

Total · 30d

20

20 over 90d

Releases · 30d

0

0 over 90d

Papers · 30d

9

9 over 90d

SENTIMENT · 30D

8 days with sentiment data

RECENT · PAGE 1/1 · 20 TOTAL

TOOL · CL_111511 · Jun 24 · 23:03

TileMaxSim kernel boosts GPU retrieval model speed by 220x

Researchers have developed TileMaxSim, a new IO-aware kernel for GPUs designed to significantly accelerate the MaxSim scoring process used in multi-vector retrieval models like ColBERT. Existing implementations are inef…
RESEARCH · CL_104433 · Jun 22 · 22:49

Apache TVM launches TIRx compiler for evolving ML kernels and hardware

Apache TVM has launched TIRx, an open-source compiler stack designed for machine learning kernels and evolving hardware. This new system allows for hardware-native DSLs and compilation to GPUs and specialized AI acceler…
SIGNIFICANT · CL_96603 · Jun 17 · 10:16

Sunmmio tapes out 3D TokenPU chip, boosting China's AI compute power

Sunmmio has officially taped out its 3D TokenPU chip, the A4E, designed for large model inference. This marks a significant step for China's domestic AI chip industry, utilizing a 3D hybrid stacking architecture to addr…
TOOL · CL_96296 · Jun 17 · 07:14

AMD user seeks Triton/Sage Attention integration for ComfyUI

A user is seeking assistance with integrating Triton and Sage Attention into ComfyUI on a Windows 11 system with an AMD Radeon 8050S GPU. They are encountering errors related to the 'triton' module not being found, whic…
RESEARCH · CL_93361 · Jun 16 · 04:00

LLMs struggle with GPU kernel generation; new research offers solutions

Two new research papers explore the challenges of generating correct GPU kernels using large language models (LLMs). The first paper, "The Correctness Illusion in LLM-Generated GPU Kernels," identifies that existing ben…
RESEARCH · CL_93380 · Jun 15 · 09:58

daVinci-kernel uses RL to optimize GPU kernels with evolving skill library

Researchers have developed daVinci-kernel, a novel reinforcement learning framework designed to optimize GPU kernels. This system co-evolves skill selection, summarization, and utilization, employing three agents that s…
TOOL · CL_91640 · Jun 15 · 09:16

Flash-KMeans accelerates GPU k-means clustering over 200x

Researchers from UC Berkeley and UT Austin have developed Flash-KMeans, an open-source library that significantly accelerates the k-means clustering algorithm for modern AI pipelines. By optimizing data movement on GPUs…
RESEARCH · CL_81952 · Jun 9 · 00:00

Flash-GMM kernel speeds up GMM clustering 20x, enables larger datasets

Researchers have developed Flash-GMM, a new fused Triton kernel designed for efficient Gaussian Mixture Model (GMM) computations on GPUs. This kernel significantly reduces memory requirements by avoiding the materializa…
RESEARCH · CL_72140 · Jun 5 · 01:58

Build Your Own LLM workshop released on YouTube

A YouTube workshop is available for individuals interested in building their own large language models without prior math or ML experience. The workshop covers fundamental concepts like neural networks and transformer a…
RESEARCH · CL_63956 · Jun 1 · 15:00

Majestic Labs unveils Prometheus server with 128TB memory

AI startup Majestic Labs is developing a new server called Prometheus, designed to overcome the limitations of current AI hardware by significantly increasing memory capacity. The server will feature up to 128 terabytes…
TOOL · CL_54717 · May 27 · 12:58

Triton MoE kernel achieves high performance on AMD, NVIDIA

A new fused Mixture-of-Experts (MoE) dispatch kernel, written entirely in Triton, achieves 89-131% of the performance of Stanford's MegaBlocks library. This kernel notably runs on AMD MI300X hardware without any code mo…
TOOL · CL_51969 · May 26 · 08:50

TileLang simplifies GPU kernel writing with Python interface

A new programming language called TileLang aims to simplify GPU kernel development by offering a middle ground between high-level frameworks like Triton and low-level control like CUTLASS. TileLang allows developers to …
RESEARCH · CL_44358 · May 22 · 15:59

Together AI releases FlashAttention-3 and -4 for faster LLM processing

Together AI has released FlashAttention-3 and FlashAttention-4, significant upgrades to their GPU-accelerated attention mechanism for large language models. FlashAttention-3, designed for Hopper GPUs, achieves up to 75%…
RESEARCH · CL_43418 · May 22 · 05:38

Stanford's ThunderKittens DSL optimizes AI kernel performance

A new article details ThunderKittens, a compact domain-specific language (DSL) developed at Stanford's Hazy Research Lab for creating high-performance AI kernels. The DSL aims to strike a balance between research produc…
RESEARCH · CL_31391 · May 14 · 09:51

Moore Threads rallies open-source AI dev community for MUSA GPU ecosystem

Chinese GPU maker Moore Threads has convened a meetup focused on integrating its MUSA architecture with key open-source large model inference frameworks like SGLang. The event brought together core developers from proje…
RESEARCH · CL_30131 · May 13 · 15:24

New framework optimizes LLM inference energy use on multi-GPU systems

Researchers have developed EnergyLens, a framework designed to optimize the energy consumption of large language models (LLMs) during inference on multi-GPU systems. This tool addresses the challenge of predicting and r…
RESEARCH · CL_20462 · May 6 · 14:18

New benchmark reveals LLM-generated GPU kernels struggle with correctness and efficiency

A new benchmark called KernelBench-X has been developed to evaluate the capabilities of large language models in generating GPU kernels. The benchmark, which covers 176 tasks across 15 categories, reveals that task stru…
RESEARCH · CL_08388 · Apr 29 · 02:03

Triton language now runs efficiently on Huawei Ascend NPUs

A new compilation framework, Triton-Ascend 3.2.0, has been released to enable the Triton programming language to run efficiently on Huawei's Ascend hardware. This framework simplifies operator development by automating …
SIGNIFICANT · CL_07248 · Apr 28 · 06:16

Behind the DeepSeek V4 launch-day adaptation: why does Ascend refuse to build a CUDA compatibility layer?

Huawei's Ascend AI accelerators are forging a unique path by eschewing CUDA compatibility to build an independent ecosystem. This strategy focuses on deep architectural changes in their latest Ascend 950 chips to addres…
RESEARCH · CL_06527 · Apr 28 · 04:00

New methods QFlash and ELSA boost Vision Transformer attention efficiency

Researchers have developed two new methods to improve the efficiency of attention mechanisms in vision transformers. QFlash focuses on enabling integer-only operations for FlashAttention, achieving significant speedups …