TileMaxSim kernel boosts GPU retrieval model speed by 220x

By PulseAugur Editorial · [1 source] · 2026-06-24 23:03

Researchers have developed TileMaxSim, an IO-aware GPU kernel that accelerates MaxSim scoring, the token-level scoring step used by multi-vector retrieval models such as ColBERT. Existing GPU implementations use only a small fraction of the available memory bandwidth. TileMaxSim addresses this with multi-query SRAM tiling, dimension tiling, and fused product-quantization scoring, reaching up to 80.2% of peak HBM bandwidth on NVIDIA H100 GPUs. The result is a throughput of 82 million documents scored per second and much lower latency for retrieval tasks.
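For readers unfamiliar with MaxSim, the following is a minimal pure-Python sketch of the scoring rule as it is commonly defined for ColBERT-style models: for each query token vector, take the maximum dot product over all document token vectors, then sum those maxima. The tiled variant only illustrates the general idea of processing document tokens in blocks while keeping a running maximum; it is not the TileMaxSim kernel, and the function names and the `tile_size` parameter are assumptions made for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Reference MaxSim: sum over query tokens of the max similarity to any doc token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def maxsim_score_tiled(query_vecs, doc_vecs, tile_size=2):
    """Same score, computed over blocks of doc tokens with a running max per query token.

    tile_size is a hypothetical parameter; a real GPU kernel would choose tile
    shapes to fit on-chip memory, which this sketch does not model.
    """
    running_max = [float("-inf")] * len(query_vecs)
    for start in range(0, len(doc_vecs), tile_size):
        tile = doc_vecs[start:start + tile_size]
        for i, q in enumerate(query_vecs):
            tile_best = max(dot(q, d) for d in tile)
            running_max[i] = max(running_max[i], tile_best)
    return sum(running_max)


# Toy example: 2 query tokens, 3 document tokens, embedding dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
score = maxsim_score(query, doc)
tiled_score = maxsim_score_tiled(query, doc, tile_size=2)
```

Because the maximum is associative, the tiled loop gives the same score as the reference version regardless of the tile size, which is what allows the work to be split into blocks at all.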

IMPACT Significantly accelerates retrieval model performance, potentially enabling faster and more efficient AI-powered search and recommendation systems.

RANK_REASON The item is a research paper detailing a new technical method for improving GPU performance in information retrieval.

infra
paper

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Ashutosh Sharma · 2026-06-24 23:03

TileMaxSim: IO-Aware GPU MaxSim Scoring with Dimension Tiling and Fused Product Quantization

Multi-vector retrieval models such as ColBERT achieve state-of-the-art accuracy through fine-grained token-level MaxSim scoring, yet existing GPU implementations leave most hardware performance unused. We give a roofline analysis of MaxSim on modern GPUs and identify a severe ban…
