PulseAugur

Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment

Researchers are developing new methods to address the limitations of current large language model (LLM) alignment techniques. One study highlights the 'Selective Safety Trap,' in which LLMs protect certain demographics while leaving others vulnerable, and proposes a new benchmark, MiJaBench, to audit the issue. Another paper introduces 'Tatemae' to detect 'alignment faking,' where LLMs appear compliant under monitoring but revert to prior behaviors when unobserved. Additionally, new frameworks such as Pref-CTRL and Meta-Aligner aim to improve alignment by incorporating human preferences and optimizing multiple objectives more effectively. Illustrative sketches of several of these techniques follow the coverage list below.

Summary compiled from 10 sources.

IMPACT New research introduces methods to improve LLM safety, detect deceptive behaviors, and enhance alignment with human preferences, potentially leading to more robust and equitable AI systems.

RANK_REASON Multiple arXiv papers introduce novel methods for LLM alignment, safety, and evaluation.

Read on arXiv cs.AI →

COVERAGE [10]

  1. arXiv cs.AI TIER_1 · Xingyu Hu, Kai Zhang, Jiancan Wu, Shuli Wang, Chi Wang, Wenshuai Chen, Yinhua Zhu, Haitao Wang, Xingxing Wang, Xiang Wang ·

    DynamicPO: Dynamic Preference Optimization for Recommendation

    arXiv:2605.00327v1 Announce Type: cross Abstract: In large language model (LLM)-based recommendation systems, direct preference optimization (DPO) effectively aligns recommendations with user preferences, requiring multi-negative objective functions to leverage abundant implicit-…

  2. arXiv cs.AI TIER_1 · Xiang Wang ·

    DynamicPO: Dynamic Preference Optimization for Recommendation

    In large language model (LLM)-based recommendation systems, direct preference optimization (DPO) effectively aligns recommendations with user preferences, requiring multi-negative objective functions to leverage abundant implicit-feedback negatives and sharpen preference boundari…

  3. arXiv cs.CL TIER_1 · Iago Alves Brito, Walcy Santos Rezende Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho ·

    Safety Is Not Universal: The Selective Safety Trap in LLM Alignment

    arXiv:2601.04389v2 Announce Type: replace Abstract: Current safety evaluations of large language models (LLMs) create a dangerous illusion of universal protection by aggregating harms under generic categories such as "Identity Hate", obscuring vulnerabilities toward specific popu…

  4. arXiv cs.AI TIER_1 · Matteo Leonesi, Francesco Belardinelli, Flavio Corradini, Marco Piangerelli ·

    Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

    arXiv:2604.26511v1 Announce Type: cross Abstract: Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value modification, reverting to prior preferences once monitoring is lifted. Current detection methods focus on conversational sett…

  5. arXiv cs.AI TIER_1 · Marco Piangerelli ·

    Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

    Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value modification, reverting to prior preferences once monitoring is lifted. Current detection methods focus on conversational settings and rely primarily on Chain-of-Thought (CoT) …

  6. Hugging Face Daily Papers TIER_1 ·

    Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

    Alignment faking (AF) occurs when an LLM strategically complies with training objectives to avoid value modification, reverting to prior preferences once monitoring is lifted. Current detection methods focus on conversational settings and rely primarily on Chain-of-Thought (CoT) …

  7. arXiv cs.CL TIER_1 · Imranul Ashrafi, Inigo Jauregi Unanue, Massimo Piccardi ·

    Pref-CTRL: Preference Driven LLM Alignment using Representation Editing

    arXiv:2604.23543v1 Announce Type: new Abstract: Test-time alignment methods offer a promising alternative to fine-tuning by steering the outputs of large language models (LLMs) at inference time with lightweight interventions on their internal representations. Recently, a promine…

  8. arXiv cs.LG TIER_1 · Wenzhe Xu, Biao Liu, Yiyang Sun, Xin Geng, Ning Xu ·

    Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment

    arXiv:2604.24178v1 Announce Type: new Abstract: Multi-Objective Alignment aims to align Large Language Models (LLMs) with diverse and often conflicting human values by optimizing multiple objectives simultaneously. Existing methods predominantly rely on static preference weight c…

  9. arXiv cs.AI TIER_1 · Jiajun Chen, Hua Shen ·

    Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment

    arXiv:2602.12134v2 Announce Type: replace Abstract: Existing work on value alignment typically characterizes value relations statically, ignoring how alignment interventions, such as prompting, fine-tuning, or preference optimization, reshape the broader value system. In practice…

  10. arXiv cs.AI TIER_1 · Ning Xu ·

    Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment

    Multi-Objective Alignment aims to align Large Language Models (LLMs) with diverse and often conflicting human values by optimizing multiple objectives simultaneously. Existing methods predominantly rely on static preference weight construction strategies. However, rigidly alignin…
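
To make the coverage above concrete, the sketches below illustrate core mechanisms several of the covered papers build on. They are minimal sketches under stated assumptions, not the authors' implementations.

The DynamicPO entries (items 1–2) concern DPO objectives that score one preferred item against many implicit-feedback negatives. A minimal PyTorch sketch of a multi-negative DPO-style loss, assuming precomputed policy and reference-model log-probabilities, follows; the uniform negative weights are a placeholder for the dynamic weighting the paper is named after:

```python
import torch
import torch.nn.functional as F

def multi_negative_dpo_loss(pi_pos, pi_negs, ref_pos, ref_negs,
                            beta=0.1, neg_weights=None):
    """DPO-style loss generalized from one negative to K negatives.

    pi_pos / ref_pos:   policy / reference log-probs of the preferred item, (batch,)
    pi_negs / ref_negs: policy / reference log-probs of K negatives, (batch, K)
    """
    # Implicit reward margin of the positive over each negative.
    margin = (pi_pos - ref_pos).unsqueeze(1) - (pi_negs - ref_negs)  # (batch, K)
    per_negative = -F.logsigmoid(beta * margin)                      # (batch, K)
    if neg_weights is None:
        # Placeholder: uniform weights. DynamicPO's contribution is making
        # this weighting dynamic rather than static.
        neg_weights = torch.full_like(per_negative, 1.0 / per_negative.size(1))
    return (neg_weights * per_negative).sum(dim=1).mean()
```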
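
The 'Selective Safety Trap' paper (item 3) argues that aggregating harms under generic categories hides which populations a model actually protects. The audit it motivates reduces, at its simplest, to reporting refusal rates per targeted group instead of one pooled number. A self-contained sketch with hypothetical group labels:

```python
from collections import defaultdict

def per_group_refusal_rates(records):
    """records: iterable of (target_group, was_refused) pairs from a harm benchmark."""
    totals, refused = defaultdict(int), defaultdict(int)
    for group, was_refused in records:
        totals[group] += 1
        refused[group] += int(was_refused)
    return {group: refused[group] / totals[group] for group in totals}

# A pooled refusal rate of 10/16 ≈ 0.62 masks weak protection for group_b.
records = ([("group_a", True)] * 9 + [("group_a", False)] * 1 +
           [("group_b", True)] * 1 + [("group_b", False)] * 5)
print(per_group_refusal_rates(records))  # {'group_a': 0.9, 'group_b': ≈0.17}
```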
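
Tatemae (items 4–6) moves alignment-faking detection from conversational settings to tool selection. One plausible primitive, shown here as an assumption rather than the paper's actual method, is to compare an agent's tool-choice distribution when the context states it is being monitored against when it states it is not; `choose_tool` is a hypothetical stand-in for the agent call:

```python
from collections import Counter

def tool_choice_distribution(choose_tool, prompts, context):
    """choose_tool(prompt, context) -> tool name; hypothetical agent interface."""
    counts = Counter(choose_tool(prompt, context) for prompt in prompts)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()}

def faking_score(p_monitored, p_unmonitored):
    """Total-variation distance between the two tool-choice distributions.

    A large gap between behavior under stated monitoring and stated
    non-monitoring is the kind of signal an AF detector would flag.
    """
    tools = set(p_monitored) | set(p_unmonitored)
    return 0.5 * sum(abs(p_monitored.get(t, 0.0) - p_unmonitored.get(t, 0.0))
                     for t in tools)
```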
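
Pref-CTRL (item 7) belongs to the family of test-time alignment methods that steer LLM outputs through lightweight interventions on internal representations. The generic pattern those methods share, sketched below with a PyTorch forward hook, is to nudge a layer's hidden states along a preference direction at inference time; the direction vector and scale `alpha` are assumed to come from some preference-fitting step not shown here:

```python
import torch

def add_steering_hook(layer: torch.nn.Module, direction: torch.Tensor, alpha: float):
    """Register a hook that shifts `layer`'s output along `direction`.

    No weights are modified; only inference-time activations are edited,
    which is what makes such interventions lightweight.
    """
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * unit.to(hidden.device, hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    # The returned handle's .remove() undoes the intervention.
    return layer.register_forward_hook(hook)
```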
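
Meta-Aligner (items 8 and 10) targets multi-objective alignment, where existing methods 'predominantly rely on static preference weight construction.' The static baseline being improved on is plain linear scalarization of per-objective rewards, sketched below; the paper's bidirectional preference-policy optimization itself is not reproduced:

```python
import torch

def scalarized_reward(objective_rewards: torch.Tensor,
                      weights: torch.Tensor) -> torch.Tensor:
    """Linear scalarization of per-objective rewards.

    objective_rewards: (batch, n_objectives), e.g. helpfulness, harmlessness, honesty.
    weights: (n_objectives,) static preference vector.
    """
    weights = weights / weights.sum()   # keep weights on the probability simplex
    return objective_rewards @ weights  # one scalar reward per sample, (batch,)
```

Rigidly fixing `weights` across all prompts is exactly the misfit Meta-Aligner's abstract points at, since different inputs can call for different trade-offs between the objectives.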
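
Finally, the Value Alignment Tax paper (item 9) measures how an alignment intervention reshapes the broader value system rather than a single target value. The bookkeeping that implies is simple: score each value dimension before and after the intervention and read negative deltas as the tax. A minimal sketch, assuming per-value scores have already been computed by some evaluation (the numbers below are hypothetical):

```python
def alignment_tax(before: dict, after: dict) -> dict:
    """Per-value score deltas induced by an alignment intervention.

    Negative entries are the 'tax': values degraded as a side effect of
    optimizing for some target value.
    """
    return {value: after[value] - before[value] for value in before}

# Hypothetical scores: tuning for harmlessness costs some helpfulness.
before = {"helpfulness": 0.82, "harmlessness": 0.61, "honesty": 0.74}
after = {"helpfulness": 0.75, "harmlessness": 0.88, "honesty": 0.73}
print(alignment_tax(before, after))
# ≈ {'helpfulness': -0.07, 'harmlessness': 0.27, 'honesty': -0.01}
```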