Principle 4: Optimize for Resilience
As Allspaw points out, there are two fundamental approaches to designing a system. You can optimize for mean time between failures (MTBF), or for mean time to restore service (MTRS). For example, a BMW is optimized for MTBF, whereas a Jeep is optimized for MTRS (see Figure 6). You pay more for a BMW up front, because failure is rare, but when it happens, fixing that car is going to cost you. Meanwhile, Jeeps notoriously break down all the time, but it's possible to disassemble and reassemble one in under four minutes.
Figure 6 Two different approaches to system design.
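One common way to compare the two approaches is steady-state availability, MTBF / (MTBF + MTRS). The sketch below is illustrative only: the function name and the hour figures are assumptions chosen for the example, not measurements of any real car or system.

```python
def availability(mtbf_hours: float, mtrs_hours: float) -> float:
    """Fraction of time a service is up, given mean time between
    failures (MTBF) and mean time to restore service (MTRS)."""
    if mtbf_hours <= 0 or mtrs_hours < 0:
        raise ValueError("MTBF must be positive and MTRS non-negative")
    return mtbf_hours / (mtbf_hours + mtrs_hours)


# Hypothetical "BMW-style" system: fails rarely, slow to repair.
rare_but_slow = availability(mtbf_hours=1000, mtrs_hours=10)

# Hypothetical "Jeep-style" system: fails often, very fast to restore.
frequent_but_fast = availability(mtbf_hours=100, mtrs_hours=0.1)

print(f"rare but slow:     {rare_but_slow:.4f}")
print(f"frequent but fast: {frequent_but_fast:.4f}")
```

With these assumed numbers, the system that fails ten times as often still comes out ahead (roughly 0.9990 versus 0.9901), because its restore time is a hundred times shorter.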
Like a Jeep, it should be possible to provision a running production system from bare-metal hardware (or via a virtualization API) to a baseline ("known good") state in a predictable time. You should be able to do this in a fully automated way by using configuration information stored in version control and known good packages (in ITIL-world, these come from your definitive media library).
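As a rough sketch of what "restore to a baseline" could mean in practice, the following compares a version-controlled baseline manifest with what is currently installed and produces the actions needed to converge. The manifest format, the `plan_restore` name, and the package versions are all hypothetical; real configuration-management tools do considerably more.

```python
def plan_restore(baseline: dict[str, str],
                 current: dict[str, str]) -> list[tuple[str, str]]:
    """Return the (action, target) steps that bring `current` back to `baseline`.

    Both arguments map package name -> version. The baseline is assumed to
    come from version control; the packages from a trusted repository.
    """
    actions = []
    # Install anything missing or at the wrong version.
    for name in sorted(baseline):
        if current.get(name) != baseline[name]:
            actions.append(("install", f"{name}={baseline[name]}"))
    # Remove anything that is not part of the known good state.
    for name in sorted(set(current) - set(baseline)):
        actions.append(("remove", name))
    return actions


# Hypothetical example data.
baseline = {"nginx": "1.24.0", "openssl": "3.0.13"}
drifted = {"nginx": "1.22.1", "telnet": "0.17"}

for action, target in plan_restore(baseline, drifted):
    print(action, target)
```

Because the plan is derived entirely from the baseline, running it against a machine that already matches yields no actions, which is what makes the restore repeatable.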
This ability to restore your system to a baseline state in a predictable time is vital not just when a deployment goes wrong, but also as part of your disaster-recovery strategy. When Netflix moved its infrastructure to Amazon Web Services, building resilience into the system was so important that the developers created a system called "Chaos Monkey," which randomly killed parts of the infrastructure. Chaos Monkey was essential both to verify that the system worked effectively in degraded mode (a key attribute of resilient systems) and to test Netflix's automated monitoring and provisioning systems.
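The idea can be illustrated with a toy simulation. This is not Netflix's implementation: the `Cluster` class, its methods, and the instance names are assumptions made up for the sketch, and nothing here touches real infrastructure.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """In-memory stand-in for a pool of identical service instances."""
    desired: int
    instances: set = field(default_factory=set)
    next_id: int = 0

    def provision(self) -> None:
        # Stand-in for automated provisioning: top up to the desired count.
        while len(self.instances) < self.desired:
            self.instances.add(f"node-{self.next_id}")
            self.next_id += 1

    def is_serving(self) -> bool:
        # Degraded mode: the service stays up while any instance survives.
        return len(self.instances) > 0


def chaos_round(cluster: Cluster, rng: random.Random) -> str:
    """Kill one randomly chosen instance and return its name."""
    victim = rng.choice(sorted(cluster.instances))
    cluster.instances.remove(victim)
    return victim


def run_experiment(rounds: int = 5, desired: int = 3, seed: int = 0) -> bool:
    """Return True if the cluster survived every kill and healed afterwards."""
    rng = random.Random(seed)
    cluster = Cluster(desired=desired)
    cluster.provision()
    for _ in range(rounds):
        chaos_round(cluster, rng)
        if not cluster.is_serving():      # did we survive in degraded mode?
            return False
        cluster.provision()               # did provisioning restore capacity?
        if len(cluster.instances) != desired:
            return False
    return True


print(run_experiment())
```

The two checks inside the loop mirror the two purposes described above: the first asserts that the service keeps working while degraded, and the second asserts that automated provisioning returns it to full capacity.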
The biggest enemy in creating resilient systems is what the Visible Ops Handbook (Kevin Behr, Gene Kim, and George Spafford; Information Technology Process Institute, 2004) calls "works of art": components of your system that have been hand-crafted over the years and which, if they failed, would be impossible to reproduce in a predictable time. When dealing with such components, you must find a way to create a copy of that component by using virtualization technology, both for testing purposes and so that you can create new instances of it in the event of a disaster.
But the most important element in creating resilient systems is human, as Richard I. Cook's short and excellent paper "How Complex Systems Fail" points out. This is one of the reasons that the DevOps movement focuses so much on culture. When a service goes down, it's imperative both that everyone knows what procedures to follow to diagnose the problem and get the system up and running again, and also that all the roles and skills necessary to perform these tasks be available and able to work together well. Training and effective collaboration are key here; these issues are discussed at greater length in John Allspaw and Jesse Robbins' book Web Operations: Keeping the Data on Time (O'Reilly, 2010).