5 Hours of Video Instruction
Ensure the delivery of reliable, resilient, scalable, and efficient cloud services through the combined power of AI, Lean, and Reliability Engineering.
Overview:
This video course teaches engineering strategies for promoting chaos engineering practices, observability and monitoring techniques, disaster recovery exercises, reliability metrics, fast data-driven decision-making, and the application of Gen AI/LLMs.
Participants will learn how to increase the reliability and scalability of their systems in the cloud, improve the efficiency of their operations, and gain valuable skills to enable faster incident response. They will also learn to automate operations to improve time to restore and time to detect to the greatest possible extent using modern cloud services, AI/LLMs, and best-in-class tools. The course will help participants understand how operational agility, lean principles, and chaos experimentation can foster a culture of continuous improvement built on collaboration and knowledge sharing. Given the lack of literature and established frameworks in this domain, learners will benefit from practical, domain-specific approaches and examples they can apply directly within their organizations and teams.
Check out Mariya and Carlos's book Reliability Engineering in the Cloud: Strategies and Practices for AI-Powered Cloud-Based Systems (Addison-Wesley, 2025) for an even deeper dive.
Learn How To:
Who Should Take This Course:
Lesson Descriptions:
Lesson 1: How to Design, Build, Operate, and Stress Test Highly Reliable Systems
Lesson 1 explores the foundational principles of resilience, reliability, and engineering excellence, and how these principles shape high-performing engineering teams and system architectures. You will gain a deep understanding of what it means to build systems that are not only functional but also robust, fault-tolerant, and scalable. You will learn how to design and build resilient and reliable systems and how to test the resilience of applications. You will also learn how to respond to and mitigate potential issues and how to leverage Artificial Intelligence (AI) and Large Language Models (LLMs) to enhance system observability, diagnostics, and incident resolution.
Lesson 2: Defining Engineering Strategies for Building Resilient, Available, and Scalable Systems
This lesson explores key principles like scalability, resilience, automation, and high availability and how they impact your system architecture and operations. You will learn how to make strategic decisions related to disaster recovery, strategic management, and building a culture of operational excellence. This lesson is focused on defining engineering strategies for building resilient, available, and scalable systems.
Lesson 3: The Power of Artificial Intelligence, Value Streams, and a Culture of Innovation in Cloud Reliability Engineering (CRE)
This lesson explores key principles and foundational definitions of AI, as well as the implementation of value streams and the aspects of a culture of innovation that are needed to formalize Cloud Reliability Engineering (CRE) strategies. Concepts covered include understanding foundational AI components; applying ML/Gen AI to CRE strategy; incorporating value streams and the CRE strategy; and fostering a culture of innovation through leadership, ownership, and fast decision-making.
Lesson 4: Leveraging Observability, Monitoring, and Reliability Metrics
This lesson explores key principles like scalability, resilience, automation, and high availability, and examines how they impact your system architecture and operations. You will learn how to make strategic decisions around disaster recovery and incident management, and how to build a culture of operational excellence. This lesson focuses on leveraging observability, monitoring, and reliability metrics. Concepts covered include defining observability and monitoring; deploying a 10-step process to create effective monitoring; surveying monitoring and alerting tools from leading cloud providers; recognizing and proactively mitigating known service disruptions; and establishing objectives and key results (OKRs).
Lesson 5: CRE Tooling and Chaos Engineering
Proper tooling is essential in CRE to maintain the reliability, availability, and performance of cloud-based systems. Automation helps streamline recovery operations. It reduces manual interventions for testing scenarios and ensures that teams can proactively respond to issues. Cloud providers opt for many automation tools that embrace these principles and techniques. This lesson reviews such tools and discusses why they are important in CRE. Concepts covered include distributing load with autoscaling and load balancing; enabling automatic failovers for high availability; implementing continuous deployments with rollback strategies; and leveraging chaos engineering for resilience testing.
Lesson 6: Incident Response for Fast Recovery
This lesson focuses on incident response best practices. Concepts covered include understanding incident response; implementing a structured approach to incident response and CRE tools; understanding incident handling in CRE; defining time-to-detect (TTD) and time-to-recover (TTR); and understanding playbooks and runbooks and the difference between them.
Lesson 7: Operational Excellence and Change Management
This lesson focuses on the foundational elements of operational excellence within CRE and how effective change management supports long-term success. Concepts covered include defining operational excellence in CRE; identifying processes, people, and tools for operational excellence; establishing key performance indicators to support your strategy; understanding root cause analysis (RCA) and the correction of error (CoE) form; and identifying the tools for operational excellence assessments that your cloud service providers supply.
About Pearson Video Training:
Pearson publishes expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. These professional and personal technology videos feature world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, Pearson IT Certification, Sams, and Que. Topics include IT Certification, Network Security, Cisco Technology, Programming, Web Development, Mobile Development, and more. Learn more about Pearson Video Training at http://www.informit.com/video.
Video Lessons are available for download for offline viewing within the streaming format. Look for the green arrow in each lesson.
Introduction
Lesson 1: How to Design, Build, Operate, and Stress Test Highly Reliable Systems
1.1 Defining Resilience, Reliability, Engineering, and Engineering Excellence
1.2 Ensuring Engineering Excellence in the Cloud: Why Your Business Can't Succeed Without It
1.3 Understanding How to Design and Build Resilient and Reliable Systems
1.4 Understanding How to Test the Resilience of Your Applications
1.5 Responding to and Mitigating Potential Issues
1.6 Understanding How to Leverage Artificial Intelligence (AI) and Large Language Models (LLMs)
1.7 Lesson 1 Review and an Exercise
Lesson 2: Defining Engineering Strategies for Building Resilient, Available, and Scalable Systems
2.1 Understanding the Foundational Concepts of Reliability, Such as Fault Tolerance, High Availability, Scalability, and Recovery
2.2 Choosing Between Alternatives for Uptime and Architecture Design
2.3 Implementing Service Level Objectives (SLO) and Service Level Indicators (SLI) as Performance Measurements
2.4 Exploring Immutable Infrastructure, Containerization, and Event-Driven Architecture
2.5 Validating Application and Infrastructure Resilience with Chaos Engineering and Other Modern Techniques
2.6 Lesson 2 Review and an Exercise
Lesson 3: The Power of Artificial Intelligence, Value Streams, and a Culture of Innovation in Cloud Reliability Engineering (CRE)
3.1 Understanding Foundational AI Components
3.2 Applying ML/GenAI to CRE
3.3 Incorporating Value Streams and the CRE Strategy
3.4 Fostering a Culture of Innovation: Leadership, Ownership, and Fast Decision-Making
3.5 Lesson 3 Review and an Exercise
Lesson 4: Leveraging Observability, Monitoring, and Reliability Metrics
4.1 Defining Observability and Monitoring
4.2 Deploying a 10-Step Process to Create Effective Monitoring
4.3 Surveying Monitoring and Alerting Tools from Leading Cloud Providers
4.4 Recognizing and Proactively Mitigating Known Service Disruptions
4.5 Establishing Objectives and Key Results (OKRs)
4.6 Lesson 4 Review and an Exercise
Lesson 5: CRE Tooling and Chaos Engineering
5.1 Distributing Load with Autoscaling and Load Balancing
5.2 Enabling Automatic Failovers for High Availability
5.3 Implementing Continuous Deployments with Rollback Strategies
5.4 Leveraging Chaos Engineering for Resilience Testing
5.5 Lesson 5 Review and an Exercise
Lesson 6: Incident Response for Fast Recovery
6.1 Understanding Incident Response Foundational Concepts
6.2 Implementing a Structured Approach to Incident Response and CRE Tools
6.3 Understanding Incident Handling in CRE
6.4 Defining Time-to-Detect (TTD) and Time-to-Recover (TTR)
6.5 Understanding Playbooks and Runbooks
6.6 Lesson 6 Review and an Exercise
Lesson 7: Operational Excellence and Change Management
7.1 Defining Operational Excellence in CRE
7.2 Identifying Processes, People, and Tools for Operational Excellence
7.3 Establishing Key Performance Indicators
7.4 Understanding Root Cause Analysis (RCA) and Correction of Error (CoE) Form
7.5 Identifying Tools for Operational Excellence Assessments
7.6 Lesson 7 Review and an Exercise
Summary