Macro: Enhancing Multilingual Counterfactual Explanations through Alignment-as-Preference Optimization
Researchers have developed Macro, a framework for generating better counterfactual explanations for large language models across multiple languages. The method uses Direct Preference Optimization (DPO) to balance the trade-off between explanation validity and minimal modification of the input. In experiments, Macro substantially improves validity without sacrificing minimality, outperforming earlier approaches such as chain-of-thought prompting and supervised fine-tuning.
AI IMPACT: Enhances interpretability of LLMs across diverse languages, potentially aiding debugging and safety research.
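
For readers unfamiliar with DPO, the sketch below shows the standard per-pair DPO loss, together with one possible way to pick the preferred counterfactual. The loss formula is the usual DPO objective. The `prefer` rule (valid beats invalid, then fewer edits wins), the `valid` and `edits` fields, and the default `beta` are illustrative assumptions, not details taken from Macro itself.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. beta=0.1 is an assumed default.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); computed in a stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def prefer(a, b):
    """Hypothetical ranking rule: a valid counterfactual (one that flips the
    model's prediction) beats an invalid one; ties go to fewer edits."""
    def key(candidate):
        return (not candidate["valid"], candidate["edits"])

    return (a, b) if key(a) <= key(b) else (b, a)


# Two hypothetical candidates for the same input.
chosen, rejected = prefer(
    {"text": "candidate A", "valid": False, "edits": 1},
    {"text": "candidate B", "valid": True, "edits": 4},
)
```

Here `chosen` is the valid candidate even though it changes more of the input. A rule that weighted minimality more heavily would shift that balance, which is the trade-off the summary describes.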