
RLHF (Reinforcement Learning from Human Feedback)

Training method that uses human preference rankings to align model outputs with human values.

Last updated: April 26, 2026

Definition

RLHF is the alignment technique that turned GPT-3, a raw text predictor, into ChatGPT, a helpful assistant. The process has three stages: humans rank model outputs from best to worst; those rankings train a reward model; and the reward model then fine-tunes the LLM via reinforcement learning. RLHF is what makes models follow instructions, refuse harmful requests, and feel "polite." It is also the source of some characteristic failure modes: over-refusal, sycophancy, and mode collapse. As an application engineer you don't run RLHF yourself, but understanding it explains many model quirks.
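The reward-modeling stage described above can be sketched in a few lines. This is a toy illustration, not a real training pipeline: the "outputs" are hypothetical feature vectors rather than LLM generations, and the reward model is linear. The core idea is real, though: fit the model with the Bradley-Terry loss, `-log(sigmoid(r(chosen) - r(rejected)))`, so that human-preferred outputs score higher.

```python
import math
import random

random.seed(0)

def reward(w, x):
    """Linear reward model: score an output's feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical human-preference data: (chosen_features, rejected_features).
# A human annotator said the first output in each pair was better.
pairs = [
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.8, 0.3], [0.3, 1.0]),
]

w = [0.0, 0.0]  # reward-model weights
lr = 0.5
for _ in range(200):
    for chosen, rejected in pairs:
        # Gradient ascent on log sigmoid(r(chosen) - r(rejected)),
        # i.e. descent on the Bradley-Terry preference loss.
        p = sigmoid(reward(w, chosen) - reward(w, rejected))
        for i in range(len(w)):
            w[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])

# After fitting, the reward model ranks the human-preferred output higher.
for chosen, rejected in pairs:
    assert reward(w, chosen) > reward(w, rejected)
```

In real RLHF this learned reward then drives the RL stage (typically PPO with a KL penalty against the original model), which is what actually changes the LLM's behavior.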

When To Use

You don't. RLHF happens during model training. But knowing it exists explains why models behave the way they do.


Building with RLHF (Reinforcement Learning from Human Feedback)?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.