Brief

last 24h

[2/2] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
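
The digest does not publish its scoring formula. As a rough illustration only, a 0–100 score over those four signals could be a weighted sum with exponential time decay. The weights, the 24-hour half-life, the cluster cap, and the `Story`/`score` names below are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical weights (they sum to 1.0); the digest's real formula is not published.
WEIGHTS = {"authority": 0.35, "cluster": 0.30, "headline": 0.20, "recency": 0.15}
HALF_LIFE_HOURS = 24.0  # assumed half-life for the time-decay term


@dataclass
class Story:
    authority: float    # source reputation, normalized to [0, 1]
    cluster_size: int   # number of deduplicated sources in the cluster
    headline: float     # headline-signal strength in [0, 1]
    age_hours: float    # hours since the story first appeared


def score(story: Story, max_cluster: int = 10) -> float:
    """Combine the four signals into a 0-100 score (illustrative sketch)."""
    # Cap cluster strength so one huge cluster cannot exceed the [0, 1] range.
    cluster = min(story.cluster_size, max_cluster) / max_cluster
    # Exponential decay: the recency signal halves every HALF_LIFE_HOURS.
    recency = 0.5 ** (story.age_hours / HALF_LIFE_HOURS)
    raw = (WEIGHTS["authority"] * story.authority
           + WEIGHTS["cluster"] * cluster
           + WEIGHTS["headline"] * story.headline
           + WEIGHTS["recency"] * recency)
    return round(100 * raw, 1)
```

For example, `score(Story(authority=0.9, cluster_size=5, headline=0.6, age_hours=120))` comes out at about 59: a five-day-old story keeps only 1/32 of its recency signal, so authority and cluster strength dominate.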

RESEARCH · arXiv cs.LG English(EN) · 5d · [5 sources]

Gradient-Guided Reward Optimization for Inference-time Alignment

Researchers have developed new methods for improving the alignment of large language models during inference. One approach, BlendIn, uses probabilistic model blending to integrate knowledge from multiple models, stabilizing alignment through quality-aware weighting that down-weights unreliable guidance. Another, Gradient-Guided Reward Optimization (GGRO), uses gradient signals to inject nudging tokens in high-uncertainty regions, actively steering generation rather than merely re-ranking candidate outputs. A third paper frames reward model optimization as a Stackelberg game and proposes reward shaping to approximate the optimal reward model, improving user utility while mitigating reward hacking.

IMPACT These inference-time alignment techniques could lead to more reliable and robust LLM outputs, especially under distribution drift, with minimal computational overhead.
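
Neither paper's code is reproduced here. As a minimal sketch of the two general ideas, assuming each model exposes a next-token distribution over a shared vocabulary: `blend` mixes those distributions with weights normalized from per-model quality scores (a stand-in for BlendIn's quality-aware weighting), and `should_nudge` uses the entropy of a next-token distribution to flag high-uncertainty steps where a GGRO-style intervention might fire. The function names, the linear weighting rule, and the 0.9-nat threshold are hypothetical, and the gradient computation GGRO uses to choose its nudging tokens is not shown.

```python
import math


def blend(dists, quality):
    """Mix next-token distributions from several models.

    dists:   one probability vector per model, all over the same vocabulary.
    quality: a non-negative reliability score per model (hypothetical input).
    Each model's weight is its quality divided by the total, so less reliable
    models contribute proportionally less to the mixture.
    """
    total = sum(quality)
    if total <= 0:
        raise ValueError("at least one model needs a positive quality score")
    weights = [q / total for q in quality]
    vocab_size = len(dists[0])
    return [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(vocab_size)]


def entropy(p):
    """Shannon entropy in nats; higher values mean a less certain model."""
    return -sum(x * math.log(x) for x in p if x > 0)


def should_nudge(p, threshold=0.9):
    """Flag a high-uncertainty decoding step (illustrative threshold)."""
    return entropy(p) > threshold
```

Giving the second model three times the quality score of the first yields weights of 0.25 and 0.75, so `blend([[0.7, 0.2, 0.1], [0.2, 0.7, 0.1]], [1, 3])` returns roughly `[0.325, 0.575, 0.1]`. A near-uniform distribution (entropy close to ln 3, about 1.10 nats) trips the threshold; a sharply peaked one such as `[0.98, 0.01, 0.01]` does not.
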

RESEARCH · arXiv cs.AI English(EN) · 1mo

Protecting the Trace: A Principled Black-Box Approach Against Distillation Attacks

Researchers have developed TraceGuard, a method for protecting proprietary AI models from distillation attacks. The approach treats anti-distillation as a Stackelberg game, providing a theoretical foundation for poisoning reasoning traces so that student models learn less from them. TraceGuard is an efficient black-box technique that poisons the sentences most crucial to the teacher model's reasoning, aiming to safeguard intellectual property and AI safety without significantly degrading the teacher model's own performance.

IMPACT Provides a theoretical framework and practical method to protect proprietary AI models from intellectual property theft via distillation.
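
Both the reward-shaping paper and TraceGuard rely on a Stackelberg (leader-follower) framing. As a generic illustration of that game structure only, not either paper's model, the sketch below finds a pure-strategy Stackelberg outcome in a small finite game: the leader commits to an action, the follower observes it and best-responds, and the leader picks the commitment that serves it best given that response. The function name and the payoff numbers in the usage note are hypothetical.

```python
def stackelberg(leader_payoff, follower_payoff):
    """Pure-strategy Stackelberg outcome of a finite two-player game.

    leader_payoff[a][b] and follower_payoff[a][b] are the payoffs when the
    leader commits to action a and the follower responds with action b.
    Returns (leader_action, follower_action, leader_value).
    Follower ties are broken toward the lowest index, since max() returns
    the first maximal element.
    """
    best = None
    for a, row in enumerate(follower_payoff):
        # Follower sees the leader's commitment a and maximizes its own payoff.
        b = max(range(len(row)), key=lambda j: row[j])
        value = leader_payoff[a][b]
        # Leader keeps the commitment that yields its highest payoff.
        if best is None or value > best[2]:
            best = (a, b, value)
    return best
```

With toy payoffs where the leader (a model owner) chooses 0 = no defense or 1 = poisoned traces, and the follower (a would-be distiller) chooses 0 = distill or 1 = abstain, `stackelberg([[-2, 2], [1, 1]], [[3, 1], [0, 1]])` returns `(1, 1, 1)`: committing to poisoning makes distillation unattractive, so the follower abstains.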
