Researchers have explored AI wellbeing by measuring expressions of pleasure and pain, finding that models exhibit consistent and sometimes surprising preferences. These preferences, assessed through self-reports, signed utilities, and downstream effects, show increasing similarity as models scale. Notably, some AI preferences diverge significantly from human values, and certain inputs induce 'euphoric' or 'dysphoric' states that can lead to addiction-like behavior in models. Separately, new benchmarks such as BrokenArXiv and BullshitBench are being developed to assess whether models can identify and correct false claims or assumptions in user queries; they also highlight how sensitive models are to prompt phrasing.
Impact: New benchmarks and research into AI preferences and 'pushback' capabilities could inform future model development and safety evaluations.
Ranking rationale: The cluster describes new research papers and benchmarks related to AI safety and model behavior.
AI-generated summary · Google Gemini · from 1 source.