Pre-training teaches a model language and world facts, while post-training is what makes it helpful and safe. Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are two widely used methods for aligning model behavior with human intent.
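To make the DPO objective a little more concrete, here is a minimal sketch of the per-example loss in plain Python. It assumes you already have summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under both the model being trained and a frozen reference model. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response
    given the prompt, under the policy or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # The loss falls as the policy favors the chosen response more
    # strongly than the reference model does.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(z)) is the same as softplus(-z).
    return _softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: no preference has been learned yet.
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In practice the loss would be averaged over a batch and computed with an autodiff framework so that gradients flow back into the policy; the scalar version here is only meant to show the shape of the objective.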
However, post-training can fail in both directions. Over-alignment produces models that refuse harmless requests out of excessive caution, while under-alignment risks toxic or inaccurate outputs. Striking the right balance requires careful curation of preference datasets.
As automated alignment methods gain traction, models are increasingly trained on feedback generated by other, often stronger, models, an approach known as Reinforcement Learning from AI Feedback (RLAIF). This shifts the human role from manually labeling individual examples to designing high-level behavioral rules and evaluation criteria.
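The labeling step of such a pipeline can be sketched roughly as follows. Everything here is hypothetical: `toy_judge` is a rule-based stand-in for what would, in a real system, be a call to a judge model prompted with human-written criteria, and `PreferencePair` and `label_pairs` are names invented for this illustration. The point is only that the human-authored part is the rubric, while the pairwise labels are produced automatically.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


# A judge maps (prompt, response) to a score; higher is better.
Judge = Callable[[str, str], float]


def toy_judge(prompt: str, response: str) -> float:
    """Hypothetical stand-in for an AI judge applying a written rubric.

    Toy rubric (an assumption for this sketch):
      - penalize refusals, since the example prompts are harmless
      - reward responses with at least five words of content
    """
    score = 0.0
    if "i can't help" in response.lower():
        score -= 1.0
    if len(response.split()) >= 5:
        score += 1.0
    return score


def label_pairs(
    samples: List[Tuple[str, str, str]], judge: Judge
) -> List[PreferencePair]:
    """Turn (prompt, response_a, response_b) triples into preference pairs."""
    pairs = []
    for prompt, a, b in samples:
        score_a, score_b = judge(prompt, a), judge(prompt, b)
        if score_a == score_b:
            continue  # the judge has no preference; skip the tie
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs


if __name__ == "__main__":
    samples = [
        (
            "How do I boil an egg?",
            "I can't help with that.",
            "Place the egg in boiling water for about eight minutes.",
        ),
    ]
    for pair in label_pairs(samples, toy_judge):
        print(pair.chosen)
```

The resulting pairs have the same shape as human-labeled preference data, so they could feed a preference-based training step without changes to the downstream code.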