- 3.1 Making Design Decisions
- 3.2 Design Concepts: The Building Blocks for Creating Structures
- 3.3 Design Concepts to Support Performance
- 3.4 Design Concepts to Support Availability
- 3.5 Design Concepts to Support Modifiability
- 3.6 Design Concepts to Support Security
- 3.7 Design Concepts to Support Integrability
- 3.8 Summary
- 3.9 Further Reading
- 3.10 Discussion Questions
3.4 Design Concepts to Support Availability
Availability is a system property. A highly available system complies with its specification: it is ready to carry out its tasks whenever you need it to. A failure occurs when the system no longer delivers a service consistent with its specification. A fault has the potential to cause a failure. Availability encompasses the ability of a system to mask or repair faults so they do not become failures. It builds on the quality attribute of reliability by adding the notion of recovery (repair). The goal is to minimize service outage time by mitigating faults.
3.4.1 Availability Tactics
Availability tactics enable a system to endure faults: they keep faults from becoming failures, or at least bound a fault’s effects and enable repair. The cause of a failure is a fault, which may be internal or external to the system. Faults can be prevented, tolerated, removed, or forecast. Through these actions, a system becomes “resilient” to faults.
Figure 3.5 shows the availability tactics categorization. There are three major categories of availability tactics: Detect Faults, Recover from Faults, and Prevent Faults. It is difficult to envision a system that could achieve high availability if it does not address all of these categories.
FIGURE 3.5 Availability tactics categorization
When designing for availability, we are concerned with how faults are detected, how frequently they occur, what happens when they occur, how long a system may be out of operation, how faults or failures can be prevented, and which notifications are required when a failure occurs. Each of these concerns can be addressed architecturally, through the appropriate selection of availability tactics.
Within the Detect Faults category, the tactics are:
- Monitor: a component used to monitor the state of health of other parts of the system. A system monitor can detect failure or congestion in the network or other shared resources, such as from a denial-of-service attack.
- Ping/echo: an asynchronous request/response message pair exchanged between nodes, used to determine reachability and the round-trip delay through the associated network path.
- Heartbeat: a periodic message exchange between a system monitor and a process being monitored (a minimal sketch follows this list).
- Timestamp: used to detect incorrect sequences of events, primarily in distributed message-passing systems.
- Condition monitoring: checks conditions in a process or device, or validates assumptions made during the design.
- Sanity checking: checks the validity or reasonableness of a component’s operations or outputs; typically based on knowledge of the internal design, the state of the system, or the nature of the information under scrutiny.
- Voting: checks that replicated components are producing the same results. Comes in various flavors: replication, functional redundancy, and analytic redundancy.
- Exception detection: detection of a system condition that alters the normal flow of execution (e.g., system exception, parameter fence, parameter typing, timeout).
- Self-test: a procedure by which a component tests itself for correct operation.
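To make the heartbeat tactic concrete, here is a minimal, illustrative Python sketch (the class and method names are hypothetical, not drawn from any particular framework): a monitor records the most recent heartbeat from each monitored process and reports any process whose heartbeat is overdue.

```python
# Minimal heartbeat-detection sketch (illustrative only; all names are hypothetical).
import time


class HeartbeatMonitor:
    def __init__(self, interval_s: float, misses_allowed: int = 3):
        self.interval_s = interval_s          # expected time between heartbeats
        self.misses_allowed = misses_allowed  # tolerated missed beats before a fault
        self.last_beat: dict[str, float] = {}

    def record_heartbeat(self, process_id: str) -> None:
        """Called whenever a heartbeat message arrives from a monitored process."""
        self.last_beat[process_id] = time.monotonic()

    def faulty_processes(self) -> list[str]:
        """Return processes whose heartbeats are overdue (suspected faults)."""
        deadline = self.interval_s * self.misses_allowed
        now = time.monotonic()
        return [pid for pid, t in self.last_beat.items() if now - t > deadline]


# Usage: the availability logic would poll the monitor periodically.
monitor = HeartbeatMonitor(interval_s=1.0)
monitor.record_heartbeat("billing-service")
print(monitor.faulty_processes())  # [] while heartbeats are fresh
```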
Within the Recover from Faults category, there are two subcategories: “Preparation and Repair” and “Reintroduction”. Let us first look at the tactics for “Preparation and Repair”:
- Redundant spare: a configuration in which one or more duplicate components can step in and take over the work if the primary component fails. (This tactic is at the heart of the hot spare, warm spare, and cold spare patterns, which differ primarily in how up-to-date the backup component is at the time of its takeover, as we will discuss in Section 3.4.2.)
- Rollback: revert to a previous known good state, referred to as the “rollback line”.
- Exception handling: dealing with the exception by reporting it or handling it, potentially masking the fault by correcting the cause of the exception and retrying.
- Software upgrade: in-service upgrades to executable code images in a non-service-affecting manner.
- Retry: where a failure is transient, retrying the operation may lead to success (see the sketch after this list).
- Ignore faulty behavior: ignoring messages sent from a source when it is determined that those messages are spurious.
- Graceful degradation: maintains the most critical system functions in the presence of component failures, dropping less critical functions.
- Reconfiguration: reassigning responsibilities to the resources that remain functioning, while maintaining as much functionality as possible.
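As a rough illustration of how retry and graceful degradation can compose, here is a hypothetical Python sketch (the function names, attempt limits, and backoff values are assumptions, not taken from the text): the operation is retried a bounded number of times, and if it keeps failing the caller falls back to a degraded but still useful result.

```python
# Illustrative sketch of the retry tactic with a graceful-degradation fallback.
import time


def call_with_retry(operation, attempts: int = 3, delay_s: float = 0.5, fallback=None):
    """Invoke `operation`; on a transient failure, retry up to `attempts` times.

    If every attempt fails and a fallback is given, return the degraded fallback
    result instead of failing outright (graceful degradation).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError as exc:              # treat timeouts as transient faults
            last_error = exc
            time.sleep(delay_s * (attempt + 1))  # back off before the next attempt
    if fallback is not None:
        return fallback()
    raise last_error


def flaky_lookup():
    raise TimeoutError("backend did not respond")  # simulate a fault that never clears


# Usage: after three failed attempts, fall back to a cached (possibly stale) value.
print(call_with_retry(flaky_lookup, fallback=lambda: "cached-value"))
```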
Within the “Reintroduction” subcategory, the tactics are:
- Shadow: running a previously failed or upgraded component in “shadow mode” prior to reverting it to an active role.
- State resynchronization: a partner to the redundant spare tactic, in which state is sent from active components to standby components.
- Escalating restart: recover from faults by varying the granularity of the component(s) restarted (sketched after this list).
- Nonstop forwarding: functionality is split into supervisory and data components. If the supervisor fails, the data component continues operating (e.g., a router keeps forwarding packets along known routes) while the supervisory protocol information is recovered.
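The sketch below illustrates the idea behind escalating restart. It is purely hypothetical: the restart levels and their behavior are placeholders, and a real system would restart actual components rather than print messages.

```python
# Illustrative sketch of the escalating-restart tactic: recovery is attempted at
# progressively coarser granularity until one level clears the fault.
def restart_component(fault):
    print(f"restarting the component affected by: {fault}")
    return False   # pretend the component-level restart did not clear the fault


def restart_subsystem(fault):
    print(f"restarting the subsystem affected by: {fault}")
    return True    # pretend the subsystem-level restart succeeded


def restart_whole_system(fault):
    print("restarting the whole system")
    return True


RESTART_LEVELS = [restart_component, restart_subsystem, restart_whole_system]


def escalating_restart(fault) -> bool:
    """Try each restart level in order; stop as soon as one clears the fault."""
    for restart in RESTART_LEVELS:
        if restart(fault):
            return True
    return False   # even a full restart did not recover; escalate to operators


escalating_restart("stuck worker thread")
```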
The final category of availability tactics is Prevent Faults. The tactics within this category are:
- Removal from service: temporarily placing a system component in an out-of-service state to mitigate potential failures.
- Transactions: bundling state updates so that messages exchanged between components are atomic, consistent, isolated, and durable (an illustrative sketch follows this list).
- Predictive model: monitoring the state of a process to ensure that the system is operating properly, and taking corrective action when conditions predict likely future faults.
- Exception prevention: keeping system exceptions from occurring, by masking a fault or by using mechanisms such as smart pointers, abstract data types, and wrappers.
- Increase competence set: designing a component to handle more cases (that is, more kinds of faults) as part of its normal operation.
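To illustrate the transactions tactic, here is a small sketch using Python’s standard sqlite3 module (the accounts table and the transfer are an invented example): two related state updates are bundled so that they commit or roll back as a single atomic unit, leaving no partial state behind if a fault occurs partway through.

```python
# Illustrative sketch: bundling state updates into one atomic transaction.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])

try:
    with conn:  # the connection commits on success and rolls back on an exception
        conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # the transfer either happened completely or not at all

print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'alice': 60, 'bob': 40}
```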
3.4.2 Availability Patterns
A number of redundancy patterns are commonly used to achieve high availability. In this section, we will look at them as a group, as this helps us understand their strengths and weaknesses and the scope of the design space around this issue. These patterns deploy a group of active components and a group of redundant spare components. In general, the greater the level of redundancy, the higher the availability, but also the greater the cost and complexity.
The benefit of using a redundant spare is that the system continues to function correctly with only a brief delay after a failure. The alternative is a system that stops functioning correctly (or altogether) until the failed component is repaired.
All of these patterns incur additional cost and complexity from providing a spare. The tradeoff among the three alternatives is the time to recover from a failure versus the runtime cost incurred to keep a spare up-to-date. A hot spare carries the highest cost but leads to the fastest recovery time, for example.
3.4.2.1 Hot Spare (Active Redundancy)
In the hot spare configuration, the active components and their redundant spares all receive and process identical inputs in parallel, allowing the spares to maintain state synchronized with the active components. Because a redundant spare possesses state identical to that of the active component, it can take over from a failed component in a matter of milliseconds.
3.4.2.2 Warm Spare (Passive Redundancy)
In the warm spare variant, only the members of the active group process input. One of their duties is to provide the redundant spares with periodic state updates. Because the spares’ state is only loosely synchronized with that of the active components, the redundant components are referred to as warm spares.
Passive redundancy achieves a balance between the more highly available but more compute-intensive (and expensive) hot spare pattern and the less available but less complex (and cheaper) cold spare pattern.
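The following minimal sketch illustrates the warm spare idea under assumed names (ActiveComponent, WarmSpare, and the checkpoint interval are all hypothetical): the active component periodically checkpoints its state to the spare, so on failover the spare resumes from the last checkpoint and only loses the work done since then.

```python
# Illustrative warm-spare (passive redundancy) sketch with periodic checkpointing.
class WarmSpare:
    def __init__(self):
        self.checkpoint = None

    def receive_checkpoint(self, state):
        self.checkpoint = dict(state)          # store a copy of the active's state

    def take_over(self):
        # Recovery starts from the last checkpoint, not the live state, which is
        # why warm-spare failover is slower than hot-spare failover.
        return dict(self.checkpoint or {})


class ActiveComponent:
    def __init__(self, spare, checkpoint_every: int = 10):
        self.state = {"processed": 0}
        self.spare = spare
        self.checkpoint_every = checkpoint_every

    def process(self, _request):
        self.state["processed"] += 1
        if self.state["processed"] % self.checkpoint_every == 0:
            self.spare.receive_checkpoint(self.state)   # periodic state update


spare = WarmSpare()
active = ActiveComponent(spare)
for request in range(25):
    active.process(request)
print(active.state, spare.take_over())  # the spare lags: {'processed': 25} vs {'processed': 20}
```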
3.4.2.3 Cold Spare (Spare)
In the cold spare configuration, spares remain out of service until a failure occurs, at which point a power-on-reset is initiated on the spare prior to it being placed in service. Due to its poor recovery performance, and hence its high MTTR (mean time to recovery), this pattern is poorly suited to systems with high availability requirements.
3.4.2.4 Tri-Modular Redundancy
Tri-modular redundancy (TMR) is a widely used pattern that combines the voting tactic with the hot spare pattern; it employs three identical components (or, more generally, N identical components in N-modular redundancy). In the example shown in Figure 3.6, each component receives the same inputs and forwards its output to the voter. If the voter detects any inconsistency, it reports a fault. The voter must also decide which output to use; typical choices are letting the majority rule or choosing a computed average of the outputs. To avoid ties in the voting, N is usually set to an odd number.
FIGURE 3.6 Tri-modular redundancy pattern
The TMR pattern is simple to understand and implement. It is independent of what might be causing disparate results and is only concerned with making a choice so that the system can continue to function.
As with the other redundancy patterns, there is a tradeoff between the level of replication, which raises the cost, and the resulting availability. The statistical likelihood of two or more components failing simultaneously is small, and three components is a common sweet spot between availability and cost.
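The sketch below illustrates the majority-voting decision at the heart of TMR. The replica functions are stand-ins for three identical components, and one of them is deliberately made faulty; the voter names and structure are assumptions for illustration.

```python
# Minimal majority-voter sketch for the TMR pattern.
from collections import Counter


def replica_a(x): return x * 2
def replica_b(x): return x * 2
def replica_c(x): return x * 2 + 1   # simulate a faulty replica


def vote(outputs):
    """Return (majority_value, fault_detected) for a list of replica outputs."""
    value, count = Counter(outputs).most_common(1)[0]
    return value, count < len(outputs)   # any disagreement is reported as a fault


outputs = [replica(21) for replica in (replica_a, replica_b, replica_c)]
result, fault = vote(outputs)
print(result, fault)   # 42 True -- the majority value is used, and the fault is reported
```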
3.4.2.5 Circuit Breaker
The Circuit Breaker pattern is sometimes seen as a performance pattern, but it actually addresses both performance and availability concerns. (As this example shows, the label on a pattern is less important than the response measures that it helps you to achieve.) This pattern does not use redundancy, but it is important for achieving high availability in networked contexts, such as microservice architectures. A common availability tactic in networked systems is retry: in the event of a timeout or fault when invoking a service, an invoker may try again (and again and again). A circuit breaker keeps the invoker from retrying indefinitely, waiting for a response that may never come; it breaks the endless retry cycle when it deems that the system is dealing with a fault. Until the circuit is “reset”, subsequent invocations return immediately without requesting the service.
One important benefit of this pattern is that it removes the retry policy (how many attempts to allow before declaring a failure) from individual components. The circuit breaker, in conjunction with software that listens to it and initiates recovery, prevents cascading failures.
But care must be taken in choosing the timeout (or retry) values. If the timeout is too long, then unnecessary latency is added. If the timeout is too short, then the circuit breaker will trip when it does not need to, which can lower the availability and performance of services.
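Here is a minimal circuit-breaker sketch illustrating the behavior described above. The thresholds, timeouts, and class name are hypothetical and not taken from any particular library: after a run of consecutive failures the breaker “opens” and callers fail fast, and after a reset timeout it lets a single trial call through to probe whether the service has recovered.

```python
# Illustrative circuit-breaker sketch (assumed thresholds and timeouts).
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed (normal operation)

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast without calling the service")
            self.opened_at = None            # half-open: allow one trial invocation
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                    # a success resets the breaker
        return result


breaker = CircuitBreaker()
# Usage (hypothetical remote call):
# status = breaker.call(lambda: remote_service.get_status())
```

The choice of failure_threshold and reset_timeout_s is exactly the tuning concern noted above: too long a reset timeout adds unnecessary latency to recovery, while too short a timeout trips the breaker when it does not need to.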