Video accessible from your Account page after purchase.
More Than 4 Hours of Video Instruction
Learn the post-training techniques that make modern LLMs safe, aligned, and capable of complex reasoning.
Overview
Pretraining gives LLMs capability but not judgment. In Reinforcement Learning for LLM Alignment and Reasoning, youll learn how reinforcement learning techniques like DPO and GRPO shape model behavior, safety, and reasoning, and how to build the evaluation and governance systems that keep alignment on track.
Learn How To
Who Should Take This Course
Developers, data scientists, and ML engineers who are fine-tuning or deploying LLMs and want to improve their safety, helpfulness, and reasoning capabilities.
Course Requirements
Lesson Descriptions
Lesson 1: Foundations of Alignment
Lesson 1 sets the stage by defining what alignment is, explaining why modern LLMs need it, and exploring what can go wrong when models are not aligned properly. Sinan breaks down the four major forms of alignment: instructional, behavior, style, and value. He presents concrete examples of each, including how models can become too agreeable, too cautious, or outright unsafe. Sinan introduces the core problem in alignment: pretraining does not produce alignment. The lesson then turns to the three stages of modern LLM development: pretraining, supervised post-training, and preference post-training. You will explore how each phase shapes a models capability and potential risks. Finally, Sinan highlights calibration as a key foundation of trustworthy reasoning and previews how post-training techniques fit into a broader alignment pipeline.
Lesson 2: Reinforcement Learning Basics for LLMs
Lesson 2 goes under the hood of reinforcement learning, explaining how it can shape model behavior through trial and error. The lesson maps traditional reinforcement learning concepts such as agent, environments, actions, rewards, and exploration on to the world of large language models and breaks down the reinforcement loop that powers modern alignment. Sinan also covers why exploration models matter for LLMs, what happens when you explore too little or even too much, and how sampling can control and affect training stability. From there he discusses the classic algorithm that anchored early reinforcement learning from human feedback (RLHF) efforts and proximal policy optimization (PPO), walking through how it works at a conceptual level. By the end of the lesson, you will have the foundation to understand how modern RLHF emerged, how it will evolve beyond PPO, and how that evolution will unlock better alignment and better reasoning.
Lesson 3: Building a Reward Model
Lesson 3 dives into the key ingredient that makes reinforcement learning LLM models work at scale: reward model. The lesson covers what a reward model is, how it is trained, and why it is essential to capture things like helpfulness, harmfulness, groundedness, tone, and value alignment. Sinan explores how pair-wise comparisons and the Bradley-Terry laws can help create a reward model and help it learn which outputs you prefer and which ones you dont. Reward models learn preferences, not truth. Next comes evaluating reward models with measures like pair-wise accuracy and Kendalls T to ensure that these reward models are reliable before using them in a reinforcement loop. By the end of this lesson, you will have a clear understanding of how reward models function and how they influence everything that follows in the alignment process.
Lesson 4: Modern Post-Training
In Lesson 4, Sinan looks at how reinforcement learning for alignment has evolved. He starts by examining why PPO, the algorithm behind early RLHF systems like InstructGPT, became difficult to scale and maintain. From there he explores two major families of modern post-training techniques. The first is offline preference optimization, which includes methods like direct preference optimization (DPO). These approaches remove the need for reward modeling and online sampling loops, making alignment a little bit simpler and far more stable. The second family is online lightweight reinforcement learning, including reinforce leave-one-out (RLOO) and group relative policy optimization (GRPO). These bring exploration back into the mix, but with far less complexity than PPO. Sinan dives into DPO and GRPO in particular, comparing when each of these techniques works best and when they help models align more efficiently. The lesson will help give you the mental model for choosing the right method, depending on your goals, whether you care about safety, style, or deep reasoning performance.
Lesson 5: Advanced Post-Training for Reasoning
In Lesson 5, the emphasis is on why reasoning-heavy tasks require a different post-training method. Sinan looks at why supervised fine-tuning (DPO, PPO) struggles with long chains of logic and why outcome-only preferences are not going to be enough for tasks that need multi-step planning and verification. From there Sinan introduces group relative policy optimization (GRPO), a new reinforcement learning approach designed specifically for reasoning chains. Sinan breaks down how GRPO uses groups of candidate reasoning chains, how verifiers score those chains, and how models learn how to produce better reasoning trajectories over time. By the end of the lesson you will understand how GRPO bridges the gap between alignment and logical consistency and when to use it over simpler preference optimizers.
Lesson 6: Future Directions, Safety, and Governance
In Lesson 6 Sinan shifts from alignment and reasoning to evaluation, safety, and organizational readiness. He explores safety and robustness benchmarks, including toxicity tests, truthfulness evaluations, and hazard category assessments, and how to interpret them. He then walks you through how to build a practical safety evaluation harness and create automated tests to monitor model drift and emerging risks. Then he introduces governance frameworks like NISTs AI RMF and discusses how to implement model registries, evaluation gates, kill switches, and lightweight change controls, so alignment is treated not as a late step but as an ongoing process. Finally, to wrap up the exploration of alignment techniques Sinan puts everything together into a full constitutional AI pipeline. Sinan walks you through how to design a small constitution of safety principles, generate self-critiques, revise unsafe responses, and then use supervised fine-tuning and DPO to consolidate that behavior.
About Pearson Video Training
Pearson publishes expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. These professional and personal technology videos feature world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, Pearson IT Certification, and Que. Topics include IT Certification, Network Security, Cisco Technology, Programming, Web Development, Mobile Development, and more. Learn more about Pearson Video training at http://www.informit.com/video.
Video Lessons are available for download for offline viewing within the streaming format. Look for the green arrow in each lesson.
Lesson 1: Foundations of Alignment
1.1 Alignment goals and failure modes
1.2 Framing the alignment problem for LLMs
1.3 Post-training overview: SFT, preference optimization, RLHF
Lesson 2: Reinforcement Learning Basics for LLMs
2.1 Policies, rewards, and exploration
2.2 Common algorithms used in post-training
2.3 PPO in action using Flan-T5
Lesson 3: Building a Reward Model
3.1 Foundations of reward modeling
3.2 Training a reward model
Lesson 4: Modern Post-Training
4.1 Why RLHF evolved
4.2 Direct preference optimization (DPO)
4.3 Reinforce leave-one-out (RLOO)
4.4 RLHF evals: Preference tests, safety screens, regression checks
Lesson 5: Advanced Post-training for Reasoning
5.1 Why reasoning needs a different post-training method
5.2 Group relative policy optimization (GRPO)
5.3 Case study: Inducing reasoning in LLMs with GRPO
Lesson 6: Future Directions, Safety, and Governance
6.1 Practical safety and robustness benchmarks
6.2 Building a simple safety evaluation harness
6.3 Governance and risk frameworks
6.4 Joint safety evaluations and red teaming
6.5 Case study: Constitutional AI plus safety evaluation
