
Conclusions


At many of my early gigs, any change you wanted to put out had to include a rollback procedure in case the change went wrong. Often rollback meant redeployment of the previous version, along with rolling back database changes—processes that (rather like restoring from backups) had not been tested nearly as well as the deployment process, if at all.

One of the concepts introduced by ITIL is remediation, defined as "recovery to a known state after a failed change or release." The patterns and practices described here provide a way to remediate in a low-risk way—perhaps by changing a router setting or switching off a problematic feature—without resorting to rollback to a previous version of your system.
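As a minimal sketch of what "switching off a problematic feature" can look like in code, the following uses an in-memory feature toggle. Every name here (FeatureToggles, new_recommendations, and so on) is hypothetical and chosen for illustration; a real system would more likely read its flags from configuration or a dedicated service.

```python
# Hypothetical feature-toggle sketch; all names are illustrative assumptions,
# not taken from the article or from any particular library.

class FeatureToggles:
    """In-memory toggle store standing in for a config file or flag service."""

    def __init__(self, initial=None):
        self._flags = dict(initial or {})

    def is_enabled(self, name):
        # Unknown features default to off, so a missing flag is the safe state.
        return self._flags.get(name, False)

    def disable(self, name):
        # Remediation step: turn the feature off without redeploying.
        self._flags[name] = False


def legacy_recommendations(user_id):
    # Known-good code path that stays deployed alongside the new one.
    return ["bestsellers"]


def new_recommendations(user_id):
    # Newly released code path guarded by the toggle.
    return [f"personalized-for-{user_id}"]


def recommendations(user_id, toggles):
    # The toggle check is the single switch point between the two paths.
    if toggles.is_enabled("new_recommendations"):
        return new_recommendations(user_id)
    return legacy_recommendations(user_id)


toggles = FeatureToggles({"new_recommendations": True})
print(recommendations(42, toggles))   # new code path

toggles.disable("new_recommendations")
print(recommendations(42, toggles))   # back to the known-good path
```

Because both code paths remain deployed, recovering to a known state in this sketch is a change to a flag value rather than a redeployment of the previous version.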

With these techniques, you can dramatically reduce the risk of releasing to users. However, they come with an added development cost and require some upfront planning, so you pay a certain amount in advance to achieve this lowered risk. These kinds of costs are often hard to justify, partly because people tend to undervalue a reward that lies some way in the future (a behavioral bias known as temporal discounting). This is one of the reasons why reducing batch size, and thus decreasing lead time, is important: you also get feedback much sooner on the benefits of changing your delivery process, which increases motivation.

These practices also depend on having good foundations in place—effective monitoring, comprehensive configuration management, a deployment pipeline, and an automated deployment process. If you're lacking in any of these areas, you'll need to address them as part of implementing a more reliable release process.

Operations teams often resist change, on the basis that any change carries risk. While this is true, it doesn't follow that we should attempt to reduce the frequency of changes, since this in turn leads to high-risk "big bang" deployments. Instead, create more stable and reliable services by building resilience into systems and working to minimize and mitigate the risk of each individual change.

Thanks to Max Lincoln, Mark Needham, Ilias Bartolini, Peter Gillard-Moss, and Joanne Molesky for feedback on an earlier version of this article. Thanks also to Martin Fowler and John Allspaw for permission to reproduce their diagrams.
