
RLHF



RLHF (Reinforcement Learning from Human Feedback) is a technique for aligning LLMs: after pre-training, the model is fine-tuned using comparisons made by humans between multiple possible responses, so that it adopts the desired behaviours (helpful, honest, harmless).
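To make the "comparisons made by humans" concrete, here is a minimal, hypothetical example of the kind of preference record an annotator might produce; the field names and contents are purely illustrative, not taken from any specific dataset:

```python
# A hypothetical human-preference record used for RLHF
# (field names and values are illustrative only):
preference_example = {
    "prompt": "Explain what a closure is in JavaScript.",
    "response_a": "A closure is a function that keeps access to variables "
                  "from the scope in which it was defined...",
    "response_b": "idk, just google it",
    "preferred": "response_a",  # the human judgement the reward model learns from
}
```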

The process typically has three steps: supervised fine-tuning on human demonstrations, training a reward model that learns human preferences, and then optimising the LLM against that reward model with a reinforcement-learning algorithm such as PPO. DPO is a popular alternative to the last two steps: it skips the explicit reward model and optimises the LLM directly on the preference data. A sketch of the reward-model step is given below.
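As a rough illustration of the second step, the reward model is commonly trained with a pairwise (Bradley-Terry style) loss that pushes the score of the human-preferred response above the rejected one. The sketch below assumes PyTorch and a reward model that already outputs a scalar score per response; it is a minimal illustration, not a full training loop:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: maximise the margin between the scalar
    reward of the human-preferred ("chosen") response and the rejected one.
    Both tensors have shape (batch,)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up scores a reward model might assign:
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.5, 1.1])
print(reward_model_loss(chosen, rejected).item())
```

Once trained, this reward model provides the scalar signal that PPO maximises when fine-tuning the LLM, usually alongside a KL penalty that keeps the policy close to the supervised fine-tuned model.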

RLHF is what made ChatGPT usable and took LLMs from the lab to the mainstream. Anthropic proposed a variant, Constitutional AI (a form of RLAIF, Reinforcement Learning from AI Feedback), in which part of the feedback is produced by models following an explicit set of principles rather than by humans.
