A user on Reddit's r/MachineLearning subreddit is seeking advice on how to fine-tune a large language model (LLM) for open-ended mathematical problems, specifically proof-based tasks. The user notes that standard reinforcement learning (RL) methods that use the correctness of a final answer as the reward are insufficient here, because a proof has no single final answer to check. They are considering the MathNet dataset as training data and are looking for fine-tuning techniques beyond supervised fine-tuning (SFT) and standard RL algorithms such as GRPO and PPO, since no clear reward function is available.
AI IMPACT Discusses the challenges of adapting LLMs to proof-based reasoning tasks, highlighting the need for fine-tuning approaches beyond final-answer-rewarded RL.
RANK_REASON User-generated question seeking technical advice on LLM fine-tuning, not a formal release or research paper.
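The reward problem raised in the question can be made concrete with a minimal sketch. It assumes completions mark their final answer with `\boxed{...}`, a common convention that the post does not confirm; the function names `extract_final_answer` and `outcome_reward` are hypothetical and do not come from the post or from any library.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a completion, if any.

    Assumption: final answers are marked with \\boxed{...}.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def outcome_reward(completion: str, reference: str) -> float:
    """Hypothetical outcome reward: 1.0 on an exact final-answer match, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


# A computational problem has a final answer that can be checked.
computed = outcome_reward("2 + 2 = \\boxed{4}", "4")

# A proof has no final answer to extract, so a valid argument and a
# flawed one receive the same reward.
valid_proof = "Let n = 2k. Then n^2 = 4k^2 = 2(2k^2), so n^2 is even."
flawed_proof = "n^2 is even because squares of even numbers look even."
reference_proof = "If n = 2k then n^2 = 2(2k^2), hence n^2 is even."
valid_score = outcome_reward(valid_proof, reference_proof)
flawed_score = outcome_reward(flawed_proof, reference_proof)

print(computed, valid_score, flawed_score)
```

Because both proofs score the same, an algorithm that learns from reward differences between sampled completions gets no signal to prefer the valid argument, which is the gap the user describes.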
AI-generated summary · Google Gemini · from 1 source.