MarginGate: Margin-Gated Verification for Batch-Invariant Decoding
A new paper introduces MarginGate, a method for keeping large language model decoding reproducible even when inference runs in the faster BF16 format. It targets a subtle failure mode in which the order of requests in a batch can change which tokens are emitted for the same prompt. MarginGate re-checks only the low-margin decoding steps, where numerical error is most likely to change the selected token, so it incurs less overhead than verifying every step in FP32.
IMPACT Makes LLM outputs more reproducible, which matters for debugging, evaluations, and auditing.
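
The gating idea in the summary can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes that "margin" means the gap between the two largest logits, uses NumPy `float16` as a stand-in for BF16 (which NumPy does not provide), and treats the names `top2_margin`, `gated_argmax`, and `threshold` as hypothetical. It also assumes the float32 recheck is itself deterministic.

```python
import numpy as np


def top2_margin(logits):
    """Gap between the largest and second-largest logit (assumed margin definition)."""
    as_f32 = np.asarray(logits, dtype=np.float32)
    top2 = np.partition(as_f32, -2)[-2:]  # [second largest, largest]
    return float(top2[1] - top2[0])


def gated_argmax(hidden, weights, threshold):
    """Pick a token greedily; recheck in float32 only when the margin is small.

    hidden:    1-D array of shape (d,)
    weights:   2-D array of shape (d, vocab)
    threshold: hypothetical margin cutoff; the summary does not say how it is chosen
    Returns (token_id, rechecked).
    """
    # Fast path: low-precision projection (float16 stands in for BF16 here).
    fast = hidden.astype(np.float16) @ weights.astype(np.float16)
    if top2_margin(fast) >= threshold:
        return int(np.argmax(fast)), False

    # Slow path: the top candidates are too close to trust, so recompute in float32.
    precise = hidden.astype(np.float32) @ weights.astype(np.float32)
    return int(np.argmax(precise)), True


if __name__ == "__main__":
    # Clear winner: the fast path is trusted.
    h = np.array([1.0, 0.0])
    w = np.array([[5.0, 1.0, 0.0], [0.0, 0.0, 0.0]])
    print(gated_argmax(h, w, threshold=0.5))

    # Near tie: 1.0001 is not representable in float16, so the step is rechecked.
    h = np.array([1.0])
    w = np.array([[1.0, 1.0001, 0.0]])
    print(gated_argmax(h, w, threshold=0.5))
```

In the near-tie case the low-precision logits cannot distinguish the two leading candidates, which is the kind of step the gate is meant to catch. Steps with a wide margin skip the extra float32 work entirely, which is where the overhead saving would come from.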