PulseAugur

TileMaxSim kernel boosts GPU retrieval model speed by 220x

Researchers have developed TileMaxSim, an IO-aware GPU kernel that accelerates MaxSim scoring, the token-level scoring step used by multi-vector retrieval models such as ColBERT. Existing GPU implementations use only a small fraction of the available bandwidth. TileMaxSim combines multi-query SRAM tiling, dimension tiling, and fused product-quantization scoring to reach up to 80.2% of peak HBM bandwidth on NVIDIA H100 GPUs. At that rate the kernel scores 82 million documents per second, sharply reducing latency for retrieval tasks.
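For readers unfamiliar with MaxSim, the following is a minimal pure-Python sketch of the scoring rule: each query token is matched to its best-scoring document token, and those maxima are summed. It is not the TileMaxSim kernel. The function names (`dot`, `maxsim`, `maxsim_tiled`) and the toy vectors are hypothetical, and the tiled variant only illustrates, under our own assumptions, why document tokens can be processed in small blocks with a running maximum without changing the result.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query, doc):
    """Reference MaxSim: sum over query tokens of the best document-token match."""
    return sum(max(dot(q, d) for d in doc) for q in query)


def maxsim_tiled(query, doc, tile=2):
    """Same score, visiting document tokens in fixed-size tiles.

    Illustrative assumption: a running maximum per query token is all the
    state that must be kept between tiles, so each tile can be loaded,
    scored, and discarded.
    """
    best = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]  # one tile of document tokens
        for i, q in enumerate(query):
            for d in block:
                s = dot(q, d)
                if s > best[i]:
                    best[i] = s
    return sum(best)


# Toy example: 2 query tokens, 3 document tokens, embedding dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

score = maxsim(query, doc)
tiled_score = maxsim_tiled(query, doc, tile=2)
print(score, tiled_score)
```

Both functions return the same score on this input, which is the property a tiled GPU kernel depends on: the maximum over all document tokens equals the maximum of the per-tile maxima.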

IMPACT Speeds up scoring in multi-vector retrieval models, potentially enabling faster and more efficient AI-powered search and recommendation systems.

RANK_REASON The item is a research paper detailing a new technical method for improving GPU performance in information retrieval.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Ashutosh Sharma

    TileMaxSim: IO-Aware GPU MaxSim Scoring with Dimension Tiling and Fused Product Quantization

    Multi-vector retrieval models such as ColBERT achieve state-of-the-art accuracy through fine-grained token-level MaxSim scoring, yet existing GPU implementations leave most hardware performance unused. We give a roofline analysis of MaxSim on modern GPUs and identify a severe ban…