RLHF - How It Really Works

RLHF (Reinforcement Learning from Human Feedback) is a critical technique for aligning large language models with human preferences. Here’s how it really works.

The Core Problem

Language models trained on massive text corpora learn to predict the next token, but this doesn’t necessarily align with what humans find helpful, harmless, or accurate. RLHF addresses this by incorporating human preferences into the training process.

The RLHF Pipeline

Step 1: Supervised Fine-Tuning (SFT) First, a base language model is fine-tuned on a dataset of human demonstrations. This creates a “policy” model that can generate responses in the desired format.
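The SFT objective is ordinary next-token cross-entropy on the human demonstrations. As a minimal illustration (toy per-token probabilities standing in for a real model):

```python
import math

def sft_loss(token_probs):
    """Mean negative log-likelihood of the demonstration tokens.

    token_probs: probabilities the model assigns to each token of a
    human-written demonstration (toy values here, not a real model).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A demonstration the model already fits well yields a lower loss,
# so gradient descent pulls the model toward the demonstrated behavior.
assert sft_loss([0.9, 0.8, 0.95]) < sft_loss([0.2, 0.1, 0.3])
```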

Step 2: Reward Model Training A separate reward model is trained to predict human preferences. This involves:

  • Collecting human comparisons (e.g., “response A is better than response B”)
  • Training a model to predict which response humans would prefer
  • Using ranking or pairwise comparison data
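A reward model trained on pairwise comparisons typically uses a Bradley-Terry style loss: the probability that the chosen response beats the rejected one is the sigmoid of their score difference. A minimal sketch with scalar scores standing in for the reward model's outputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_pair_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss: -log P(chosen beats rejected)."""
    return -math.log(sigmoid(score_chosen - score_rejected))

# Training pushes the chosen score above the rejected one:
# a larger margin means a smaller loss.
assert reward_pair_loss(2.0, 0.0) < reward_pair_loss(0.5, 0.0)
```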

Step 3: Reinforcement Learning Optimization The policy model is optimized using the reward model as a proxy for human feedback:

  • Generate multiple responses to prompts
  • Use the reward model to score each response
  • Update the policy to maximize expected reward using algorithms like PPO (Proximal Policy Optimization)
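In practice the optimized reward usually also includes a KL penalty toward the SFT reference model, so the policy cannot drift arbitrarily far in search of reward-model exploits, and PPO clips the policy ratio to bound each update. A sketch of both pieces (scalar toy values; real implementations operate on token-level log-probabilities):

```python
def shaped_reward(rm_score, kl_to_ref, beta=0.1):
    """Reward-model score minus a KL penalty to the SFT reference policy."""
    return rm_score - beta * kl_to_ref

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Ratios far from 1 are clipped, bounding the size of a policy update.
assert ppo_clipped_objective(2.0, 1.0) == ppo_clipped_objective(1.2, 1.0)
```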

Key Techniques

Preference Modeling Approaches:

  • Pairwise comparisons: Humans rank two responses, training the reward model to prefer one over the other
  • Ranking-based methods: Multiple responses ranked by quality
  • Absolute ratings: Direct quality scores (less common due to inconsistency)

RL Algorithms:

  • PPO (Proximal Policy Optimization): Most common; clips each policy update to stay close to the current policy, which stabilizes training
  • REINFORCE: Simpler but less stable
  • DPO (Direct Preference Optimization): Newer approach that avoids training a separate reward model
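DPO folds the reward model into the policy itself: it directly increases the policy's margin (relative to the reference model) on chosen over rejected responses. A minimal sketch of the loss, with sequence log-probabilities as toy scalars:

```python
import math

def dpo_loss(pi_logp_c, pi_logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    """DPO loss: -log sigmoid(beta * (margin_chosen - margin_rejected)),
    where each margin is the policy log-prob minus the reference log-prob."""
    margin = beta * ((pi_logp_c - ref_logp_c) - (pi_logp_r - ref_logp_r))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Raising the chosen response's log-prob relative to the reference
# lowers the loss; no separate reward model is needed.
assert dpo_loss(-1.0, -5.0, -2.0, -2.0) < dpo_loss(-2.0, -2.0, -2.0, -2.0)
```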

Challenges and Limitations

Data Efficiency:

  • RLHF requires significant human feedback data
  • Off-policy data (reusing human evaluations collected from earlier model versions) improves data efficiency, but those preferences can become stale or misaligned if the policy changes significantly

Handling Hallucinations Post-Training:

  • Instructive Feedback Loops: Hallucinations are often mitigated post-training by fine-tuning the LLM on curated, verified datasets, or by adding negative samples (incorrect facts) and training the model to recognize and penalize them.
  • Contradiction Detection: Models can be fine-tuned with contradiction labels: human evaluators supply false statements, and the LLM is trained to flag contradictions or decline to answer confidently, improving reliability.
  • Fact-Based Reward Functions: One promising direction is using fact-verifying models as part of the reward function. The LLM’s outputs are compared against known facts, and deviations are penalized.
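One way to realize a fact-based reward is to subtract a penalty for every extracted claim a verifier rejects. The verifier interface below is hypothetical; a real system would use a trained fact-checking model or retrieval against a knowledge base:

```python
def fact_shaped_reward(rm_score, claims, verify, penalty=0.5):
    """Subtract a fixed penalty per claim the verifier flags as false.

    `verify` is a hypothetical callable: claim -> bool (True if supported).
    """
    num_false = sum(1 for claim in claims if not verify(claim))
    return rm_score - penalty * num_false

# Toy verifier backed by a tiny knowledge base (illustrative only).
known_true = {"Water boils at 100 C at sea level", "Paris is in France"}
reward = fact_shaped_reward(
    1.0,
    ["Paris is in France", "Paris is in Spain"],
    verify=lambda claim: claim in known_true,
)
assert reward == 0.5  # one false claim, penalized by 0.5
```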

Future Directions in Post-Training:

  • Curriculum RLHF: The model is gradually introduced to increasingly difficult or nuanced feedback, which can improve learning stability.
  • Multi-Agent Feedback Systems: Using multiple LLMs as evaluators instead of humans is gaining traction for initial preference labeling. Ensemble feedback from different models can help create more robust reward signals before human refinement.
  • Combining Reinforcement and Supervised Learning: Integrating reinforcement objectives with supervised learning (hybrid training) lets post-training benefit from both direct human instruction and dynamic adaptation, improving overall alignment quality.
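The ensemble idea above can be sketched simply: aggregate preference scores from several evaluator models, optionally down-weighting responses where the evaluators disagree. The aggregation scheme here is an assumption, not a fixed recipe:

```python
def ensemble_reward(scores, disagreement_weight=0.5):
    """Mean of per-evaluator scores, penalized by their spread.

    scores: preference scores for one response from several evaluator
    models (toy scalars; the penalty scheme is an illustrative choice).
    """
    mean = sum(scores) / len(scores)
    spread = max(scores) - min(scores)
    return mean - disagreement_weight * spread

# Agreement among evaluators yields a higher effective reward than
# the same mean score with high disagreement.
assert ensemble_reward([0.8, 0.8, 0.8]) > ensemble_reward([0.2, 0.8, 1.4])
```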

Critical Notes on RLHF

  • RLHF mitigates the most frequent mistakes but does not fix LLMs’ inherent tendency to hallucinate.
  • Relying on humans is inherently difficult: human preferences are noisy and hard to reproduce, and the learned preference model adds another layer of approximation error.
  • Reward hacking is a common RL problem: the policy learns to exploit flaws in the reward model rather than genuinely improving.
  • Chatbots are rewarded for responses that seem helpful, regardless of truthfulness.
  • Constitutional AI: An alternative approach that uses LLMs to provide feedback and then applies RL from AI feedback, reducing reliance on human annotators.

RLHF represents a significant step forward in aligning language models, but it’s not a panacea. Understanding its mechanisms and limitations is crucial for building better AI systems.