RLHF Explained: How Human Feedback Steers Reinforcement Learning
A clear, practical guide to RLHF—how human preferences train models, the pipeline, pitfalls, and modern variants like DPO and RLAIF.
Overview
Reinforcement Learning from Human Feedback (RLHF) is a training paradigm that teaches models to act in ways people prefer by learning from human judgments rather than solely from static datasets or hand‑crafted reward functions. In practice, RLHF wraps a base model with an additional loop: humans compare model outputs, a reward model learns to predict those preferences, and the model is optimized to produce responses that the reward model scores highly, while staying close to the original model's behavior so that its existing capabilities are preserved.
This article explains how RLHF works end to end, why it matters, where it fails, and how modern variants like DPO and RLAIF fit in.
Why RLHF exists
Classical reinforcement learning requires a numerical reward function. But for tasks like helpful conversation, safe code suggestions, or nuanced writing, specifying a reward mathematically is impractical. RLHF replaces hand‑written rewards with human preference signals:
- We ask people which of two model outputs is better for a given prompt.
- We generalize those judgments through a reward model so we don’t need human input for every output.
- We optimize the policy (the model) to maximize predicted human approval.
The result is a system aligned to practical, fuzzy objectives—helpfulness, harmlessness, and honesty—without needing to encode them explicitly.
The RLHF pipeline at a glance
- Supervised fine‑tuning (SFT)
  - Start with a pretrained model.
  - Fine‑tune it on high‑quality, instruction‑following demonstrations. This creates a reasonable starting policy.
- Preference data collection
  - For each prompt, sample multiple candidate responses from the SFT policy.
  - Human raters compare pairs (A vs B) using a rubric (e.g., correctness, safety, clarity) and choose a preferred answer (or tie).
- Reward modeling
  - Train a reward model r(x, y) that scores a response y to prompt x, fitting it so that preferred outputs receive higher scores than rejected ones.
- Policy optimization
  - Use reinforcement learning (commonly PPO) to adjust the policy to produce outputs with higher predicted reward, while penalizing divergence from the SFT/reference model (a KL penalty).
- Evaluation and iteration
  - Measure gains in helpfulness, safety, and faithfulness via human evals and automated tests; refine data, rubric, and training.
Stage 1: Supervised fine‑tuning (SFT)
SFT anchors the policy in the desired task manifold. It reduces the burden on RL by giving the model examples of desirable behavior. Without SFT, RL steps can wander or over‑optimize on spurious reward model signals, harming coherence or factuality. Good SFT data also improves sample efficiency for the preference and RL phases.
Practical tips:
- Curate diverse instructions and high‑quality, fully worked responses.
- Deduplicate, filter toxicity and PII, and use strong style guides.
- Keep an SFT reference checkpoint fixed to measure and control policy drift during RL.
Stage 2: Preference data and reward modeling
Human preference data conventionally consists of tuples (x, y_w, y_l) where y_w is the “winner” response and y_l is the “loser.” We then train a reward model r_θ to satisfy r_θ(x, y_w) > r_θ(x, y_l). A common loss is the Bradley–Terry / logistic pairwise loss:
- Minimize: −log σ(r_θ(x, y_w) − r_θ(x, y_l)), where σ is the logistic sigmoid.
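As a rough illustration of the pairwise loss above, the sketch below evaluates it for scalar reward scores. The function names are illustrative, and a real reward model would compute this over batches of tensors with automatic differentiation rather than Python floats.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_loss(r_w: float, r_l: float) -> float:
    """Bradley-Terry loss for one (winner, loser) pair of reward scores."""
    return -log_sigmoid(r_w - r_l)


def preference_probability(r_w: float, r_l: float) -> float:
    """Modelled probability that the 'winner' response is preferred."""
    return math.exp(log_sigmoid(r_w - r_l))


if __name__ == "__main__":
    # Equal scores: the model is indifferent, so the loss is log(2).
    print(pairwise_loss(0.0, 0.0))
    # A larger margin in the right direction gives a smaller loss.
    print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

Only the difference between the two scores matters, so the loss leaves the absolute scale of the reward unconstrained. That is one reason the regularization point below is worth attention.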
Key details:
- Calibration: allow “tie” and “uncertain” labels to reduce label noise; optionally weight examples by rater confidence.
- Regularization: prevent reward inflation by constraining magnitude or normalizing per token.
- Coverage: ensure prompts span real user tasks and known failure cases (adversarial questions, safety edge cases).
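One possible way to fold ties and rater confidence into the pairwise loss is to use a soft target instead of a hard winner. The sketch below assumes a tie is encoded as a 0.5 target and confidence as a multiplicative weight. Both are illustrative conventions, not the only options.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def soft_pairwise_loss(r_a: float, r_b: float,
                       p_a: float = 1.0, weight: float = 1.0) -> float:
    """Cross-entropy between a soft label and the Bradley-Terry prediction.

    p_a is the target probability that response A is preferred:
    1.0 means A wins, 0.0 means B wins, 0.5 encodes a tie (assumed convention).
    weight scales the example, e.g. by rater confidence (assumed convention).
    """
    if not 0.0 <= p_a <= 1.0:
        raise ValueError("p_a must lie in [0, 1]")
    delta = r_a - r_b
    return -weight * (p_a * log_sigmoid(delta) + (1.0 - p_a) * log_sigmoid(-delta))


if __name__ == "__main__":
    # A tie is best explained by equal scores, so the loss grows with the gap.
    print(soft_pairwise_loss(0.0, 0.0, p_a=0.5))
    print(soft_pairwise_loss(1.0, 0.0, p_a=0.5))
    # A low-confidence label contributes half as much.
    print(soft_pairwise_loss(1.0, 0.0, p_a=1.0, weight=0.5))
```

With p_a = 1.0 and weight = 1.0 this reduces to the standard pairwise loss.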
Quality control for labeling:
- Clear rubrics with examples of good/bad answers.
- Inter‑rater agreement checks and continuous rater feedback.
- Active learning to focus labeling budget on prompts where the model is uncertain or frequently wrong.
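Inter‑rater agreement can be checked in several ways. As one hedged example, the sketch below computes Cohen's kappa for two raters who labeled the same comparisons. The label values ("A", "B", "tie") are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each rater labeled independently at their own base rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    rater_1 = ["A", "A", "B", "tie", "B", "A"]
    rater_2 = ["A", "B", "B", "tie", "B", "A"]
    print(round(cohens_kappa(rater_1, rater_2), 3))
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance. Where to set a threshold for revisiting the rubric is a judgment call.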
Stage 3: Policy optimization (PPO, DPO, and friends)
Historically, PPO has been the workhorse for RLHF because it is comparatively stable and supports explicit control of deviation from a reference policy via a KL penalty.
- Objective (schematically): maximize E[r(x, y)] − β KL(π(·|x) || π_ref(·|x))
- β trades off reward-seeking vs. staying close to the SFT behavior to preserve quality and avoid reward hacking.
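In token‑level implementations, a common convention is to apply the KL penalty at every generated token and add the reward model's score at the final token. The sketch below assumes that convention and uses the log‑probability difference of the sampled tokens as the per‑token KL estimate. The function name and inputs are illustrative.

```python
def kl_shaped_rewards(sequence_reward, logp_policy, logp_ref, beta=0.1):
    """Return one shaped reward per generated token.

    logp_policy / logp_ref: log-probabilities of the sampled tokens under
    the current policy and the frozen reference model.
    """
    if len(logp_policy) != len(logp_ref) or not logp_policy:
        raise ValueError("need two non-empty log-prob lists of equal length")
    # Per-token penalty: -beta * (log pi(y_t) - log pi_ref(y_t)).
    shaped = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    # The reward model scores the whole response, so credit it at the last token.
    shaped[-1] += sequence_reward
    return shaped


if __name__ == "__main__":
    rewards = kl_shaped_rewards(
        sequence_reward=1.0,
        logp_policy=[-1.0, -2.0, -0.5],
        logp_ref=[-1.5, -2.0, -0.7],
        beta=0.1,
    )
    print(rewards)
    # The total equals r(x, y) minus beta times the summed log-ratio.
    print(sum(rewards))
```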
Implementation notes:
- Optimize at the token level with per‑token advantages.
- Normalize advantages and clip policy ratios (PPO’s core stabilization trick).
- Use entropy bonuses to maintain exploration.
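The normalization and clipping steps above can be sketched on plain Python lists. This is a minimal illustration of PPO's clipped surrogate, not a full trainer: it omits the value loss, the entropy bonus, and minibatching.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale advantages to zero mean and (roughly) unit variance."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    return [(a - mean) / (math.sqrt(var) + eps) for a in advantages]


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over tokens."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes the incentive to push the ratio
        # outside the clip range in the direction that raises the objective.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


if __name__ == "__main__":
    adv = normalize_advantages([0.5, 2.0, -1.0])
    print(ppo_clip_loss([-0.9, -1.8, -0.4], [-1.0, -2.0, -0.5], adv))
```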
Direct Preference Optimization (DPO) is a popular alternative that bypasses both the explicit reward model and the RL loop. It fits the policy directly to pairwise preferences, with a coefficient β that implicitly controls how far the policy may move from the reference. DPO tends to be simpler to implement and can be more sample‑efficient. Related techniques include IPO, KTO, and ORPO; each reshapes the objective around preference data without online RL rollouts, though they differ in detail (KTO learns from unpaired good/bad labels rather than pairs, and ORPO drops the reference model).
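For intuition, the DPO loss for a single preference pair can be written in a few lines. The sketch assumes each input is the summed log‑probability of a whole response under the policy or the frozen reference. The argument names are illustrative.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (winner, loser) pair of sequence log-probabilities."""
    # Implicit reward of a response: beta * (log pi(y|x) - log pi_ref(y|x)).
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


if __name__ == "__main__":
    # Policy identical to the reference: zero margin, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy shifts probability toward the winner: the loss drops.
    print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

The loss has the same logistic form as the reward model's pairwise loss, with the policy‑to‑reference log‑ratio standing in for the reward score.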
When to choose what:
- PPO: when you need fine‑grained control, explicit reward shaping, or compatibility with existing RL infra.
- DPO/IPO/KTO/ORPO: when you want a simpler pipeline, less hyperparameter tuning, and strong results on instruction following from offline preference data (pairwise for DPO, IPO, and ORPO; unpaired good/bad labels for KTO).
Controlling drift: KL and reference models
The KL penalty ensures the trained model stays near a stable reference (often the SFT checkpoint). This guards against:
- Reward over‑optimization that harms truthfulness or style.
- Distribution shift away from known‑good behaviors.
Tuning strategies:
- Target‑KL control: adapt β so the observed KL matches a desired range.
- Per‑prompt or per‑domain β: higher penalties for safety‑critical domains.
- Early stopping based on eval score vs. KL growth.
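Target‑KL control is often implemented as a small proportional controller. The sketch below is one such scheme under assumed defaults (the gain and the error clip are illustrative, not recommended settings): β rises when the observed KL exceeds the target and falls when it is below.

```python
def adjust_beta(beta, observed_kl, target_kl, gain=0.1, max_error=0.2):
    """Proportional update of the KL coefficient toward a target KL."""
    if target_kl <= 0:
        raise ValueError("target_kl must be positive")
    # Relative error, clipped so that one noisy batch cannot swing beta too far.
    error = observed_kl / target_kl - 1.0
    error = max(-max_error, min(max_error, error))
    return beta * (1.0 + gain * error)


if __name__ == "__main__":
    beta = 0.1
    # Hypothetical per-step KL measurements against a target of 6.0.
    for observed in [12.0, 9.0, 6.0, 4.0]:
        beta = adjust_beta(beta, observed, target_kl=6.0)
        print(round(beta, 5))
```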
Evaluation: does RLHF actually help?
Evaluate across three layers:
- Capability and utility
  - Human A/B testing on real prompts (win rate, Elo).
  - Task metrics where available (e.g., code tests passed, math accuracy).
- Truthfulness and robustness
  - Hallucination benchmarks; citation‑required tasks; adversarial question sets.
- Safety and policy compliance
  - Toxicity, bias and fairness probes; prompt injection and jailbreaking tests; privacy and data‑leak checks.
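The win rate and Elo figures mentioned above are simple to compute from A/B judgments. The sketch below counts a tie as half a win and uses the standard Elo update. The K‑factor of 32 and the starting ratings are assumptions for illustration.

```python
def win_rate(wins, losses, ties=0):
    """Share of comparisons won, counting each tie as half a win."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("no comparisons recorded")
    return (wins + 0.5 * ties) / total


def elo_expected(rating_a, rating_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    delta = k * (score_a - elo_expected(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    print(win_rate(wins=60, losses=30, ties=10))
    print(elo_update(1500.0, 1500.0, score_a=1.0))
```

Elo ratings depend on the order in which comparisons are processed, so for small evaluation sets it can help to shuffle and average over several passes.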
Also measure operational signals:
- Reward model generalization gap (train vs. held‑out pairwise data).
- KL divergence vs. reference across domains.
- Response length, refusal rates, and latency.
Common failure modes and mitigations
- Reward hacking: the policy learns to please the reward model’s quirks rather than real human preferences.
  - Mitigate with stronger KL constraints, reward model ensembling, periodic human re‑labeling, and counterexamples.
- Over‑refusal or excessive safety conservatism: the model refuses benign requests.
  - Balance the rubric, add “assist safely” demonstrations, and diversify preference data.
- Hallucinations that slip past the reward model: good‑sounding but false answers.
  - Add verifiability rubrics, require citations, and integrate tool‑use or retrieval to ground responses.
- Bias amplification: reflecting rater or dataset biases.
  - Diversify raters, audit by demographics and topic, and introduce fairness constraints or counterfactual data.
- Mode collapse/verbosity drift: outputs become longer or stylistically narrow.
  - Penalize length explicitly, track per‑domain KL, and add style‑diverse SFT data.
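Two of the mitigations above, reward model ensembling and an explicit length penalty, can be sketched as small reward transforms. Both formulas are illustrative choices, not standard definitions: the ensemble score is discounted by the members' disagreement, and the length penalty is linear beyond a token budget.

```python
import statistics


def conservative_reward(ensemble_scores, k=1.0):
    """Mean ensemble score minus k times the spread across members.

    Responses that only some reward models like (a possible sign of
    reward hacking) are scored lower than ones all members agree on.
    """
    if not ensemble_scores:
        raise ValueError("need at least one score")
    return statistics.fmean(ensemble_scores) - k * statistics.pstdev(ensemble_scores)


def length_penalized_reward(reward, num_tokens, budget, alpha=0.01):
    """Subtract alpha per token beyond the budget (hypothetical linear penalty)."""
    return reward - alpha * max(0, num_tokens - budget)


if __name__ == "__main__":
    print(conservative_reward([1.0, 1.0, 1.0]))  # full agreement
    print(conservative_reward([0.0, 2.0]))       # same mean, high disagreement
    print(length_penalized_reward(1.0, num_tokens=150, budget=100))
```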
Variants and extensions
- RLAIF (Reinforcement Learning from AI Feedback): use high‑quality teacher or judge models to generate synthetic preferences, reserving human time for difficult or safety‑critical cases.
- Constitutional AI: replace much of the human comparison work with a set of principles (a “constitution”) and an AI judge to critique and revise outputs, then optionally fine‑tune with human spot checks.
- Critique‑and‑revise loops: the model generates an answer, a separate module generates critiques, and the model revises accordingly, with preferences gathered over critiques or final outputs.
- Multi‑objective RLHF: separate reward heads for helpfulness, safety, and faithfulness with tunable weights; or condition the policy on a “preference vector.”
- Tool‑aware RLHF: integrate retrieval, code execution, or calculators in the loop; evaluate both final answers and tool traces.
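As a small illustration of the multi‑objective idea, separate reward heads can be combined with a weighted sum. The head names and weights below are hypothetical. Real systems may use non‑linear combinations or hard constraints for safety instead.

```python
def combine_rewards(head_scores, weights):
    """Weighted sum of per-objective reward scores."""
    missing = set(weights) - set(head_scores)
    if missing:
        raise KeyError(f"no score for objectives: {sorted(missing)}")
    return sum(weights[name] * head_scores[name] for name in weights)


if __name__ == "__main__":
    scores = {"helpfulness": 0.8, "safety": 1.0, "faithfulness": 0.5}
    # Changing the weights re-targets the policy without retraining the heads.
    balanced = {"helpfulness": 0.5, "safety": 0.25, "faithfulness": 0.25}
    safety_first = {"helpfulness": 0.2, "safety": 0.6, "faithfulness": 0.2}
    print(combine_rewards(scores, balanced))
    print(combine_rewards(scores, safety_first))
```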
Data strategy: the real bottleneck
RLHF quality is largely determined by data:
- Instruction coverage: broad prompts that reflect real users and edge cases.
- Preference diversity: comparisons that target failure regions discovered through red teaming and active learning.
- Continual refresh: as users and contexts change, keep collecting fresh comparisons and retrain reward models.
Invest in rater training:
- Clear, concrete rubrics with domain examples.
- Calibrated practice rounds and feedback on disagreements.
- Periodic audits for leakage, privacy, and bias.
Minimal working sketch (for intuition)
# Pseudo-code: train a reward model, then improve the policy with PPO-like steps.
# Helper names (load_model, freeze, batch, gae, adjust_beta, ...) are placeholders,
# not a real library API; responses are assumed to be equal-length (no padding).
import torch
import torch.nn.functional as F

# 1) Supervised fine-tuning (SFT): assume an SFT checkpoint already exists
policy = load_model("sft_policy.ckpt")
policy_ref = freeze(copy_model(policy))  # fixed reference for the KL penalty

# 2) Preference dataset: list of (prompt, winner, loser)
prefs = load_pairwise_dataset()

# 3) Train reward model with pairwise loss
reward_model = init_reward_model()
for epoch in range(RM_EPOCHS):
    for x, y_w, y_l in batch(prefs):
        rw = reward_model(x, y_w)
        rl = reward_model(x, y_l)
        loss = -F.logsigmoid(rw - rl).mean()  # Bradley–Terry
        loss.backward(); opt_rm.step(); opt_rm.zero_grad()

# 4) Policy optimization loop
for step in range(RL_STEPS):
    # sample rollouts; nothing in this block needs gradients
    batch_prompts = sample_prompts()
    with torch.no_grad():
        responses, logp_old = policy.generate_with_logprobs(batch_prompts)
        logp_ref = policy_ref.logprobs(batch_prompts, responses)
        rewards = reward_model(batch_prompts, responses)  # one score per response
        kl = logp_old - logp_ref                          # per-token KL estimate
        shaped_reward = -beta * kl                        # KL penalty at every token
        shaped_reward[:, -1] += rewards                   # sequence reward on the last token
        # compute advantages/returns
        adv = gae(shaped_reward)

    # PPO update (gradients flow through logp and entropy only)
    logp, entropy = policy.logprobs_and_entropy(batch_prompts, responses)
    ratio = torch.exp(logp - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    loss_policy = -torch.min(ratio * adv, clipped * adv).mean()
    loss_ent = -ent_coef * entropy.mean()  # entropy of the full token distribution
    (loss_policy + loss_ent).backward(); opt_pol.step(); opt_pol.zero_grad()

    # optional: adapt beta to maintain target KL
    beta = adjust_beta(beta, target_kl, observed_kl=kl.sum(dim=-1).mean())
This sketch omits many engineering details—tokenization, batching by length, mixed precision, reward normalization, and evaluation—but it reflects the core loop.
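The advantage step in the sketch above calls a gae helper without defining it. One plausible reading is Generalized Advantage Estimation, shown below for a single response on plain Python lists. It assumes per‑token value estimates from a critic are available and that the value after the final token is zero. The defaults for gamma and lam are illustrative.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one finished trajectory.

    rewards[t]: shaped reward received at token t
    values[t]:  critic's value estimate before emitting token t
    Returns (advantages, returns), each with one entry per token.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # value after the final token (episode has ended)
    running = 0.0
    # Work backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    # Reward arrives only at the last token, as with a sequence-level reward model.
    adv, ret = gae(rewards=[0.0, 0.0, 1.0], values=[0.2, 0.4, 0.7])
    print([round(a, 4) for a in adv])
    print([round(r, 4) for r in ret])
```

Setting lam to 0 reduces this to one‑step TD residuals (low variance, more bias from the critic), while lam of 1 gives Monte Carlo returns minus the value baseline.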
Tooling and infrastructure
- Dataset versioning: track SFT, preference, and safety sets separately.
- Training telemetry: monitor loss curves, reward model AUC, KL, and output length.
- Evaluation harness: nightly A/B tests, jailbreak suites, and domain‑specific checklists.
- Red teaming: mix automated adversaries and human experts; feed failures back into preference data.
When (not) to use RLHF
Use RLHF when:
- Objectives are fuzzy or multi‑criteria (helpfulness, safety, tone).
- You can afford iterative human or AI judging to sculpt behavior.
Consider alternatives when:
- You have clear, programmatic rewards (e.g., games, simulations).
- You only need localized corrections—then SFT or instruction tuning may suffice.
- Latency or cost prohibits rollout‑based optimization—then DPO‑style methods can be simpler and cheaper.
Ethics, safety, and governance
RLHF aligns models to the values expressed in its data and rubrics. That means governance choices—who the raters are, which principles guide decisions, and how failures are handled—directly shape user experience. Good practice includes transparency about rubrics, bias audits, opt‑out and privacy protections for users, and continuous measurement of downstream impacts.
Key takeaways
- RLHF replaces hand‑coded rewards with learned preferences, enabling practical alignment on complex human values.
- The trio of SFT, reward modeling, and policy optimization—with KL control—defines the standard pipeline.
- Data quality and evaluation discipline matter more than exotic algorithms.
- Simpler preference‑direct methods (DPO/IPO/KTO/ORPO) increasingly deliver strong results with less complexity.
- Continual feedback, red teaming, and governance are essential for safe, useful systems.