Entity embeddings lead in high-cardinality fraud detection benchmarks

By PulseAugur Editorial · [2 sources] · 2026-07-01 05:51

A new research paper compares categorical encoding methods for high-cardinality fraud detection. The study tested seven encoders on the IEEE-CIS fraud benchmark dataset, evaluating each with LightGBM and CatBoost learners. Entity embeddings achieved the highest AUC-ROC score, closely followed by CatBoost, and significantly outperformed tier group encoding. On AUC-PR, however, CatBoost led, so no single encoder dominated both metrics. The research suggests that the advantage of entity embeddings comes from their ability to capture joint multi-column representations.
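A toy example helps show how one model can lead on AUC-ROC while another leads on AUC-PR. The sketch below is an illustration only, not the paper's code or data: the labels and the two score lists (`scores_a`, `scores_b`) are made up, and average precision is used here as a common stand-in for AUC-PR, since the summary does not say how the paper computed it.

```python
# Illustrative only: hypothetical labels and scores, standard library only.

def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@k taken at the rank k of each positive.

    Assumes no tied scores, which holds for the toy data below.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# 2 fraud cases among 10 transactions (hypothetical).
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Model A ranks both positives just below one false alarm.
scores_a = [0.80, 0.70, 0.90, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
# Model B ranks one positive first but buries the other at rank 5.
scores_b = [0.95, 0.50, 0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05]

for name, s in [("A", scores_a), ("B", scores_b)]:
    print(f"{name}: AUC-ROC={auc_roc(y, s):.4f}  AP={average_precision(y, s):.4f}")
```

On this toy data, model A scores higher on AUC-ROC (0.8750 vs 0.8125) while model B scores higher on average precision (0.7000 vs 0.5833). Precision-based metrics reward getting the very top of the ranking right, which matters more when positives are rare, as in a dataset with 3.5% positives.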

IMPACT The comparison can guide the choice of categorical encoder for fraud detection models, potentially improving accuracy in financial applications.

RANK_REASON Academic paper detailing a new methodology and benchmark results.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.LG TIER_1 English(EN) · Xiao Han, Jingjing Liu, Moxuan Zheng, Zhen Zhang, Chenyu Wu · 2026-07-02 04:00

Interpretable vs Learned Encoders for High-Cardinality Fraud Detection

arXiv:2607.00477v1. Abstract: A total of seven categorical encoding methods were tested on the IEEE-CIS fraud benchmark dataset (590,540 records, 3.5% positives, 8 high-cardinality columns). The encoders were evaluated using a stratified 5-fold cross-validation …

arXiv cs.LG TIER_1 English(EN) · Chenyu Wu · 2026-07-01 05:51

Interpretable vs Learned Encoders for High-Cardinality Fraud Detection

A total of seven categorical encoding methods were tested on the IEEE-CIS fraud benchmark dataset (590,540 records, 3.5% positives, 8 high-cardinality columns). The encoders were evaluated using a stratified 5-fold cross-validation (CV) with three repetitions. Five of the encoder…
