Scalable Internet Architectures: Working in Mission-Critical Environments
Mission-critical is a commonly abused term. Some think it describes any architecture that they run; others believe it is a term for "systems that launch spacecraft." For the purpose of further discussion, we will equate mission-critical systems with business-critical systems. Business-critical is easy to define: Each business can simply choose what it believes to be vital to its operations.
Perhaps the most important issue to address from a technical perspective is determining which aspects of a technical infrastructure are critical to the mission. Note the word is aspects and not components. This isn't solely about equipment and software; it is also about policies and procedures. Without going into painful detail, we will touch on five key aspects of mission-critical environments:
- High availability (HA)
- Monitoring
- Software management
- Overcomplication
- Optimization
To effectively manage and maintain any sizable mission-critical environment, these aspects must be mastered. Mission-critical architectures are typically managed either by a few focused teams or by a few multidisciplinary teams. I prefer the latter because knowledge and standards tend to be contagious, and all five aspects are easier to master when the expertise of every individual on a multidisciplinary team is aggregated. Although it is not essential that every participant be an expert in all of these areas, it is essential that they be wholly competent in at least one and always cognizant of the others.
Being mindful of the overall architecture is important. As an application developer, if you habitually ignore the monitoring systems, you are likely to make invalid assumptions resulting in decisions that negatively impact the business.
High Availability
Because high availability tops the previous list, the first question in your mind might be: "What about load balancing?" The two are often conflated, but the criticality of an environment has absolutely nothing to do with its scale. Load balancing attempts to combine multiple resources to effectively handle higher loads, and is therefore entirely a matter of scale. Throughout this book, we will work to unlearn the misunderstood relationship between high availability and load balancing.
When discussing mission-critical systems, the first thing that comes to mind should be high availability. Without it, a simple system failure could cause a service interruption, and a service interruption isn't acceptable in a mission-critical environment. Perhaps it is the first thing that comes to mind due to the rate at which things seem to break in many production environments. The point of this discussion is to understand that although high availability is a necessity, it certainly won't save you from ignorance or idiocy.
High availability, from a technical perspective, is simply taking a single "service" and ensuring that the failure of one of its components will not result in an outage. All too often, high availability is considered only at the machine level: one machine acts as a failover for another.
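The arithmetic behind machine-level redundancy is standard reliability math and is worth seeing once: components that must all be up multiply their availabilities, while redundant components multiply their unavailabilities. Here is a minimal Python sketch, using illustrative availability figures of my own choosing rather than anything measured:

```python
def series(*availabilities: float) -> float:
    """Availability of a service whose components must all be up
    (single points of failure chained in series)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def redundant(*availabilities: float) -> float:
    """Availability of a component group where any one member being
    up keeps the service up (a failover pair, for example)."""
    overlap = 1.0  # probability that every member is down at once
    for a in availabilities:
        overlap *= 1.0 - a
    return 1.0 - overlap

# Illustrative figure, not a measurement: each machine is up 99% of the time.
machine = 0.99
print(f"single machine:    {machine:.4%}")                            # 99.0000%
print(f"failover pair:     {redundant(machine, machine):.4%}")        # 99.9900%
print(f"3-tier, no backup: {series(machine, machine, machine):.4%}")  # 97.0299%
```

Pairing two 99% machines yields a 99.99% service, while chaining three of them as single points of failure drags the service down to roughly 97%. Redundancy at the component level is how the per-service numbers climb.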
However, that is not the business goal. The business goal is to ensure that the services provided by the business are functional and accessible 100% of the time. Goals are nice, and it is always good to have a few unachievable goals in life to keep your accomplishments in perspective. Building a system that guarantees 100% uptime is an impossibility. A relatively useful but deceptive measurement that was widely popular during the dot-com era was the n nines measurement. Everyone wanted an architecture with five nines availability, which meant it was functioning and accessible 99.999% of the time.
Let's do a little math to see what this really means and why a healthy bit of perspective can make an unreasonable technical goal reasonable. Five nines availability means that of the (60 seconds/minute * 60 minutes/hour * 24 hours/day * 365 days/year =) 31,536,000 seconds in a year you must be up (99.999% * 31,536,000 seconds =) 31,535,684.64 seconds. This leaves an allowable (31,536,000 - 31,535,684.64 =) 315.36 seconds of unavailability. That's just slightly more than 5 minutes of downtime in an entire year.
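The same arithmetic generalizes to any number of nines, which makes the trade-off easy to tabulate. A short Python sketch (the function name and the 365-day year are my own framing):

```python
def downtime_budget(nines: int) -> float:
    """Allowable downtime, in seconds per 365-day year, for an
    availability target of the given number of nines."""
    seconds_per_year = 60 * 60 * 24 * 365  # 31,536,000
    return seconds_per_year * 10 ** -nines

for n in range(2, 6):
    print(f"{n} nines: {downtime_budget(n):>9.2f} seconds/year")
# 2 nines: 315360.00  (~3.65 days)
# 3 nines:  31536.00  (~8.76 hours)
# 4 nines:   3153.60  (~52.6 minutes)
# 5 nines:    315.36  (just over 5 minutes)
```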
Now, in all fairness, there are different perspectives on what it means to be available. Take online banking, for example. It is absolutely vital that I be able to access my account online and transfer money to pay bills. However, being the night owl that I am, I constantly try to access my bank account at 3 a.m., and at least twice per month it is unavailable with a message regarding "planned maintenance." I believe that my bank has a daily maintenance window between 2 a.m. and 5 a.m. that it uses every so often. Although this may seem like cheating, most large production environments define high availability as the absence of unplanned outages. So, what may be considered cheating could also be viewed as smart, responsible, and controlled. Planned maintenance windows (regardless of whether they go unused) provide an opportunity to perform proactive maintenance, which reduces the risk of unexpected outages outside those windows.
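Measured against that definition, availability is computed over the time outside planned maintenance windows, and only unplanned outages count against the budget. Here is a minimal sketch of that bookkeeping; the outage log is hypothetical, shaped loosely like my bank's schedule:

```python
def measured_availability(period_seconds: float,
                          outages: list[tuple[float, bool]]) -> float:
    """Availability counting only unplanned outages; planned
    maintenance time is excluded from the measurement period."""
    planned = sum(s for s, is_planned in outages if is_planned)
    unplanned = sum(s for s, is_planned in outages if not is_planned)
    measured = period_seconds - planned
    return (measured - unplanned) / measured

year = 31_536_000  # seconds in a 365-day year
# Hypothetical log: twice-monthly three-hour planned windows used in
# full, plus a single ten-minute unplanned outage.
log = [(3 * 3600, True)] * 24 + [(600.0, False)]
print(f"{measured_availability(year, log):.5%}")  # 99.99808%
```

Seventy-two hours of planned maintenance, yet the measured figure still sits near five nines, which is precisely why controlled maintenance windows are considered responsible rather than cheating.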