Four Principles of Low-Risk Software Releases
One key goal of continuous deployment is to reduce the risk of releasing software. Counter-intuitively, increased throughput and increased production stability are not a zero-sum game, and effective continuous deployment actually reduces the risk of any individual release. In the course of teaching continuous delivery and talking to people who implement it, I've come to see that "doing it right" can be reduced to four principles:
- Low-risk releases are incremental.
- Decouple deployment and release.
- Focus on reducing batch size.
- Optimize for resilience.
Principle 1: Low-Risk Releases Are Incremental
Any organization of reasonable maturity will have production systems composed of several interlinked components or services, with dependencies between those components. For example, my application might depend on some static content, a database, and some services provided by other systems. Upgrading all of those components in one big-bang release is the highest-risk way to roll out new functionality.
Instead, deploy components independently, in a side-by-side configuration wherever possible, as shown in Figure 1. For example, if you need to roll out new static content, don't overwrite the old content. Instead, deploy that content in a new directory so it's accessible via a different URI, before you deploy the new version of the application that requires it.
Figure 1 Upgrading incrementally.
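As an illustration of the static-content example, here is a minimal Python sketch that copies each release of static assets into its own versioned directory and refuses to overwrite a version that already exists. The function name `deploy_static_content`, the directory layout, and the `/static/<version>/` URI scheme are all hypothetical choices made for this sketch, not part of any particular tool.

```python
import shutil
import tempfile
from pathlib import Path


def deploy_static_content(source_dir: Path, static_root: Path, version: str) -> str:
    """Copy a new release of static assets into its own versioned directory.

    Existing versions are never overwritten, so the running application keeps
    serving the old content while the new release is rolled out.
    """
    target = static_root / version
    if target.exists():
        # Refuse to overwrite: each deployed version is treated as immutable.
        raise FileExistsError(f"static content for version {version!r} already deployed")
    # copytree creates the target (and any missing parent directories).
    shutil.copytree(source_dir, target)
    # Hypothetical URI scheme: /static/<version>/...
    return f"/static/{version}/"


# Usage sketch: deploy two versions side by side in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    build = Path(tmp) / "build"
    build.mkdir()
    (build / "app.css").write_text("body { color: black; }")
    static_root = Path(tmp) / "static"
    uri_v1 = deploy_static_content(build, static_root, "v1")
    uri_v2 = deploy_static_content(build, static_root, "v2")
    deployed = sorted(p.name for p in static_root.iterdir())
    try:
        deploy_static_content(build, static_root, "v1")
        overwrite_refused = False
    except FileExistsError:
        overwrite_refused = True
```

Because `v1` is still on disk after `v2` arrives, the old application version can keep referencing its own URIs until it is retired.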
Database changes can also be rolled out incrementally. Even organizations like Flickr, which deploy multiple times a day, don't roll out database changes that frequently. Instead, they use the expand/contract pattern. The rule is that you never change existing objects all at once. Instead, divide the changes into reversible steps:
- Before the release goes out, run an expand script that adds to the database the new objects required by the new release.
- Release the new version of the app, which writes to the new objects, but reads from the old objects if necessary so as to migrate data "lazily." If you need to roll back at this point, you can do so without having to roll back the database changes.
- Finally, once the new version of the app is stable and you're sure you won't need to roll back, apply the contract script to finish migrating any old data and remove any old objects.
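The three steps above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under assumed names: the old release keeps rows in a hypothetical `customer` table, and the new release uses a hypothetical `client` table. The expand step only adds the new table, the new app version writes to `client` and falls back to `customer` on reads (migrating lazily), and the contract step copies the remaining rows and drops the old table.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    # Expand: add the object the new release needs, leaving the old one untouched.
    conn.execute("CREATE TABLE IF NOT EXISTS client (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.commit()


def save_client(conn: sqlite3.Connection, client_id: int, name: str) -> None:
    # The new app version writes only to the new table.
    conn.execute("INSERT OR REPLACE INTO client (id, name) VALUES (?, ?)", (client_id, name))
    conn.commit()


def load_client(conn: sqlite3.Connection, client_id: int):
    # Read from the new table first.
    row = conn.execute("SELECT name FROM client WHERE id = ?", (client_id,)).fetchone()
    if row is not None:
        return row[0]
    # Fall back to the old table and migrate the row "lazily" on first read.
    row = conn.execute("SELECT name FROM customer WHERE id = ?", (client_id,)).fetchone()
    if row is None:
        return None
    save_client(conn, client_id, row[0])
    return row[0]


def contract(conn: sqlite3.Connection) -> None:
    # Contract: migrate rows the new version never touched, then remove the old table.
    conn.execute(
        "INSERT INTO client (id, name) "
        "SELECT id, name FROM customer WHERE id NOT IN (SELECT id FROM client)"
    )
    conn.execute("DROP TABLE customer")
    conn.commit()


# Usage sketch: the old release stored its rows in "customer".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.commit()

expand(conn)                        # before the release goes out
lazily_read = load_client(conn, 1)  # new version running; row 1 migrated on first read
migrated_before_contract = conn.execute("SELECT COUNT(*) FROM client").fetchone()[0]
contract(conn)                      # once rollback is no longer needed
tables = {r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
all_clients = conn.execute("SELECT id, name FROM client ORDER BY id").fetchall()
```

Until `contract` runs, the `customer` table is unchanged, so the old app version can be redeployed without touching the schema. In this sketch, though, rows created only by the new version live solely in `client`, so a rolled-back version would not see them; whether that matters depends on your data.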
Similarly, if the new version of your application requires a new version of some service, you should have the new version of that service up, running, and tested before you deploy the new version of your app that depends on it. One way to do this is to write the new version of your service so that it can handle clients that expect the old version. (How easy this is depends a lot on your platform and design.) If this is impossible, you'll need to be able to run multiple versions of that service side by side. Either way, your service needs to be able to support older clients. For example, when accessing Amazon's EC2 API over HTTP, you must specify the API version number to use. When Amazon releases a new version of the API, the old versions carry on working.
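One way to let a single service support older clients is to require every request to name the API version it expects and to keep the handlers for old versions registered. The Python sketch below is loosely modeled on that idea; the `Version` and `InstanceId` parameter names, the version strings, and both response shapes are hypothetical and are not Amazon's actual API.

```python
from typing import Callable, Dict


# Hypothetical handlers for two versions of the same "describe instance" operation.
def describe_instance_v1(instance_id: str) -> dict:
    # Original response shape that older clients were built against.
    return {"id": instance_id, "state": "running"}


def describe_instance_v2(instance_id: str) -> dict:
    # Newer response shape: the state becomes a nested object.
    return {"instanceId": instance_id, "state": {"name": "running"}}


# Old versions stay registered, so existing clients keep working after a release.
HANDLERS: Dict[str, Callable[[str], dict]] = {
    "2020-01-01": describe_instance_v1,
    "2021-06-01": describe_instance_v2,
}


def handle_request(params: Dict[str, str]) -> dict:
    # Clients must state which API version they expect.
    version = params.get("Version")
    if version is None:
        raise ValueError("missing required Version parameter")
    handler = HANDLERS.get(version)
    if handler is None:
        raise ValueError(f"unsupported API version: {version}")
    return handler(params["InstanceId"])
```

A client pinned to `2020-01-01` gets the same response shape it always did, while newer clients opt in to `2021-06-01` when they are ready.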
Designing services to support clients that expect older versions comes with costs, most seriously in maintenance and compatibility testing. But it means that the consumers of your service can upgrade at their convenience, while you can get on with developing new functionality. And if the consumers need to roll back to an older version of their app that requires an older version of your service, they can do that.
Of course, you must consider lots of edge cases when using these techniques, and they require careful planning and some extra development work, but ultimately they're just applications of the branch-by-abstraction pattern.
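For readers unfamiliar with branch by abstraction, the core move is to put an abstraction in front of the thing being changed, build the new implementation behind it, and switch callers over with a single setting that can be flipped back. The Python sketch below is a minimal illustration of that shape; `ContentStore`, both implementations, and the `use_new` switch are hypothetical names chosen to echo the static-content example, not an API from the source.

```python
from abc import ABC, abstractmethod


class ContentStore(ABC):
    """Abstraction layer that callers depend on instead of a concrete implementation."""

    @abstractmethod
    def url_for(self, asset: str) -> str: ...


class LegacyContentStore(ContentStore):
    # Existing behavior: unversioned static URIs.
    def url_for(self, asset: str) -> str:
        return f"/static/{asset}"


class VersionedContentStore(ContentStore):
    # New behavior, built behind the same abstraction.
    def __init__(self, version: str) -> None:
        self.version = version

    def url_for(self, asset: str) -> str:
        return f"/static/{self.version}/{asset}"


def make_store(use_new: bool) -> ContentStore:
    # A single hypothetical switch selects the implementation; flipping it back is the rollback.
    return VersionedContentStore("v2") if use_new else LegacyContentStore()


def render_stylesheet_link(store: ContentStore) -> str:
    # Calling code only knows about the abstraction, so it is unchanged by the swap.
    return f'<link rel="stylesheet" href="{store.url_for("app.css")}">'
```

Once the new implementation is stable everywhere, the legacy class (and possibly the switch) can be deleted, much like the contract step for the database.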
Finally, how do we release the new version of the application incrementally? This is the purpose of the blue-green deployment pattern. Using this pattern, we deploy the new version of the application side by side with the old version. To cut over to the new version (and to roll back to the old version), we change the load balancer or router setting (see Figure 2).
Figure 2 Blue-green deployment.
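The mechanics of blue-green deployment fit in a few lines. In this minimal Python sketch, a hypothetical `Router` class stands in for the load balancer: it tracks which of two environments ("blue" or "green") is live, new versions go to the idle one, and both cut-over and rollback are the same switch. A real setup would change a load balancer or DNS setting instead of a field on an object.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Router:
    """Hypothetical stand-in for a load balancer that sends all traffic to one environment."""

    environments: Dict[str, str]  # color -> application version deployed there
    live: str = "blue"

    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> str:
        # Install the new version on whichever environment is not taking traffic.
        target = self.idle()
        self.environments[target] = version
        return target

    def switch(self) -> None:
        # Cut-over and rollback are the same operation: point traffic at the other side.
        self.live = self.idle()

    def serving(self) -> str:
        return self.environments[self.live]


# Usage sketch.
router = Router(environments={"blue": "1.0", "green": "1.0"})
router.deploy_to_idle("1.1")  # green now runs 1.1 but receives no traffic
router.switch()               # cut over: green is live
serving_after_cutover = router.serving()
router.switch()               # roll back: blue, still running 1.0, is live again
serving_after_rollback = router.serving()
```

The old version is never torn down during the release, which is what makes the rollback a configuration change rather than a redeployment.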
A variation on blue-green deployment, applicable when running a cluster of servers, is canary releasing. With this pattern, rather than upgrading a whole cluster to the latest version all at once, you do it incrementally. For example, as described in an excellent talk by Facebook's release manager, Chuck Rossi, Facebook pushes new builds to production in three phases (see Figure 3):
- First the build goes to A1, a small set of production boxes to which only employees are routed.
- If the A1 deployment looks good, the build goes to A2, a "couple of thousand" boxes to which only a small percentage of users are routed.
- A1 and A2 are like canaries in a coal mine: if a problem is discovered at these stages, the build goes no further. Only when no problems occur is the build promoted to A.
Figure 3 Facebook's three phases for pushing new builds to production.
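The promotion logic behind a phased canary release can be sketched as a loop that deploys to each group in turn and stops at the first group that looks unhealthy. In this minimal Python sketch, `deploy` and `looks_good` are hypothetical callbacks you would wire to real tooling and monitoring, and the phase labels (including "final" for the last phase) are illustrative only.

```python
from typing import Callable, List, Tuple


def canary_release(build: str,
                   phases: List[str],
                   deploy: Callable[[str, str], None],
                   looks_good: Callable[[str, str], bool]) -> Tuple[List[str], bool]:
    """Promote a build through successively larger groups of servers.

    Returns the phases the build reached and whether it was fully promoted.
    """
    reached: List[str] = []
    for phase in phases:
        deploy(build, phase)
        reached.append(phase)
        if not looks_good(build, phase):
            # Like a canary in a coal mine: stop here, the build goes no further.
            return reached, False
    return reached, True


# Usage sketch with fake callbacks; the phase labels are hypothetical.
PHASES = ["A1", "A2", "final"]
deployments: List[Tuple[str, str]] = []


def fake_deploy(build: str, phase: str) -> None:
    deployments.append((build, phase))


def fake_health(build: str, phase: str) -> bool:
    # Pretend the bad build only shows problems once real users see it in A2.
    return not (build == "bad-build" and phase == "A2")


good = canary_release("good-build", PHASES, fake_deploy, fake_health)
bad = canary_release("bad-build", PHASES, fake_deploy, fake_health)
```

Here the bad build reaches A1 and A2 but never the final phase, mirroring the rule that a problem found in an earlier phase stops promotion.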
An interesting extension of this technique is the cluster immune system. Developed by the engineers at IMVU, this system monitors business metrics as a new version is being rolled out through a canary releasing system. It automatically rolls back the deployment if any parameters exceed tolerance limits, emailing everyone who checked in since the last deployment so that they can fix the problem.
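The decision at the heart of a cluster immune system can be sketched as a tolerance check over business metrics. This minimal Python sketch is not IMVU's implementation: `Tolerance`, `check_rollout`, and the metric names in the usage lines are hypothetical, a missing metric is treated as a breach (a conservative choice made here), and the actual rollback and email sending are left to the caller.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Tolerance:
    low: float
    high: float


def check_rollout(metrics: Dict[str, float],
                  tolerances: Dict[str, Tolerance],
                  committers_since_last_deploy: List[str]) -> Tuple[bool, List[str], List[str]]:
    """Decide whether to roll back a deployment in progress.

    Returns (roll_back, breached_metrics, people_to_notify).
    """
    breached = [
        name for name, tol in tolerances.items()
        # A metric with no reading is treated as out of tolerance.
        if name not in metrics or not (tol.low <= metrics[name] <= tol.high)
    ]
    if not breached:
        return False, [], []
    # Notify everyone who checked in since the last deployment (deduplicated, order kept).
    notify = list(dict.fromkeys(committers_since_last_deploy))
    return True, breached, notify


# Usage sketch with hypothetical business metrics.
limits = {"signups_per_min": Tolerance(40, 200), "checkout_error_rate": Tolerance(0.0, 0.02)}
healthy = check_rollout({"signups_per_min": 95, "checkout_error_rate": 0.01}, limits, ["alice"])
unhealthy = check_rollout({"signups_per_min": 12, "checkout_error_rate": 0.01},
                          limits, ["alice", "bob", "alice"])
```

Run repeatedly while the canary phases progress, a check like this turns "does the new build hurt the business?" into an automatic rollback trigger rather than a manual judgment.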