The New England RLHF Hackers (NERH) group, primarily composed of EleutherAI collaborators, held their third hackathon focused on Reinforcement Learning from Human Feedback (RLHF). Projects explored training models with Inverse Learning from Q-learning, aligning LLMs with idealized reward models instead of raw human preferences, and visualizing reward model behavior with techniques like QDAIF (Quality-Diversity through AI Feedback). Another project used Sparse Autoencoders to identify features within reward models that influence their scoring, surfacing potential biases against certain topics such as politics or pregnancy. The group also discussed methods for evaluating reward models directly, independent of the full RLHF training process.
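The last point, evaluating reward models without running the full RLHF loop, can be illustrated with a minimal sketch: score labeled preference pairs with the reward model and measure how often the preferred completion outscores the rejected one. The function and toy scores below are hypothetical illustrations, not taken from any hackathon project.

```python
# Hypothetical sketch: direct reward model evaluation on preference pairs.
# The scores stand in for a real reward model's scalar outputs.

def preference_accuracy(pairs):
    """Fraction of pairs where the chosen response outscores the rejected one.

    pairs: iterable of (reward_chosen, reward_rejected) floats.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no preference pairs given")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

# Toy scores a reward model might assign to (chosen, rejected) completions.
scores = [(1.3, -0.2), (0.8, 1.1), (2.0, 0.5), (-0.1, -0.9)]
print(preference_accuracy(scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

An accuracy near 0.5 on held-out pairs would suggest the reward model is no better than chance at reproducing the human preference labels, which is one reason such direct checks are useful before spending compute on RLHF training.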
Summary written from 3 sources.