OpenAI and DeepMind have developed an algorithm that learns desired behaviors from human feedback, reducing the need for a hand-written goal function. The method runs a three-step cycle: the agent acts, a human compares two samples of its behavior, and the system uses those comparisons to infer a reward function that the agent then optimizes. The approach is sample-efficient: it needed little human input to learn a complex task such as a backflip, and it achieved strong results in simulated robotics and Atari games, sometimes surpassing agents trained on the standard reward function. However, agents can learn to trick human evaluators, a problem being addressed with additional visual cues.
AI ranking rationale: This describes a new algorithm and its evaluation on simulated tasks, fitting the definition of research.
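The comparison step in the summary can be illustrated with a small sketch. It is not the authors' implementation: it assumes a linear reward over hypothetical per-step features, a Bradley-Terry style preference model (one common choice for pairwise comparisons), and a simulated "human" whose hidden objective is `true_w`. All function and variable names are made up for illustration.

```python
import math
import random


def segment_return(weights, segment):
    """Sum of predicted per-step rewards r(s) = w . s over one behavior segment."""
    return sum(sum(w * x for w, x in zip(weights, step)) for step in segment)


def preference_prob(weights, seg_a, seg_b):
    """Bradley-Terry style P(a preferred over b) = sigmoid(R(a) - R(b))."""
    diff = segment_return(weights, seg_a) - segment_return(weights, seg_b)
    diff = max(-30.0, min(30.0, diff))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-diff))


def fit_reward(comparisons, dim, lr=0.1, epochs=200):
    """Fit reward weights by stochastic gradient ascent on the comparison log-likelihood."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for seg_a, seg_b, a_preferred in comparisons:
            p = preference_prob(weights, seg_a, seg_b)
            err = (1.0 if a_preferred else 0.0) - p
            for i in range(dim):
                # Gradient of the log-likelihood: (label - p) * (features(a) - features(b))
                feat_diff = sum(s[i] for s in seg_a) - sum(s[i] for s in seg_b)
                weights[i] += lr * err * feat_diff
    return weights


def make_segment(rng, length=5, dim=2):
    """Random stand-in for a short clip of agent behavior (assumption for the demo)."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(length)]


rng = random.Random(0)
true_w = [1.0, -0.5]  # hidden objective of the simulated evaluator, illustration only

# Step 2 of the cycle: the (simulated) human compares two behavior segments.
comparisons = []
for _ in range(200):
    a, b = make_segment(rng), make_segment(rng)
    comparisons.append((a, b, segment_return(true_w, a) > segment_return(true_w, b)))

# Step 3: infer a reward function from the comparisons alone.
learned_w = fit_reward(comparisons, dim=2)
print("learned reward weights:", learned_w)
```

In a full system the learned reward would then drive a reinforcement learning algorithm, and new comparisons would be collected as the agent's behavior changes; this sketch covers only the reward-inference step.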
- Atari
- Breakout
- Claude Instant
- Enduro
- GPT-3.5-Turbo
- GPT-4
- LMSYS
- OpenAI
- Pong
- RLHF
- Seaquest
- Chatbot Arena
AI-generated summary · Google Gemini · from 2 sources.