Failures in Complex Systems
This section discusses failure modes and effects in complex systems. From this discussion you can gain an appreciation of how complex systems can fail in complex ways. Before you can design a system to recover from failures, you must understand how systems fail.
Failures are the primary focus of the systems architect designing highly available (HA) systems. Understanding the probability, causes, effects, detection, and recovery of failures is critical to building successful HA systems. The professional HA expert has many years of study and experience with a large variety of esoteric systems and tools that are used to design HA systems. The average systems architect is not likely to have such tools or experience but will be required to design such systems. Fortunately, much of the detailed engineering work is already done by vendors, such as Sun Microsystems, who offer integrated HA systems.
A typical systems design project is initially concerned with defining what the system is supposed to do. The systems architect designing highly available clusters must also be able to concentrate on what the system is not supposed to do. This is known as testing for unwanted modes, which can occur as a result of integrating components that individually perform properly but may not perform together as expected. The latter can be much more difficult and time consuming than the former, especially during functional testing. Typical functional tests attempt to show that a system does what it is supposed to do. However, it is just as important, and more difficult, to attempt to show that a system does not do what it is not supposed to do.
A defect is anything that, when exercised, prevents something from functioning in the manner in which it was intended. The defect can, for example, be due to design, manufacture, or misuse and can take the form of a badly designed, incorrectly manufactured, or damaged hardware or software component. An error usually results from the defect being exercised, and if not corrected, may result in a failure.
Examples of defects include:
Hardware factory defect A pin in a connector is not soldered to a wire correctly, resulting in data loss when exercised.
Hardware field defect A damaged pin no longer provides a connection, resulting in data loss when exercised.
Software field defect An inadvertently corrupted executable file can cause an application to crash.
An error occurs when a component exhibits unintended behavior and can be a consequence of:
A defect being exercised
A component being used outside of its intended operational parameters
Some other cause, for example a random, though anticipated, environmental effect
A fault is usually a defect, but possibly an imprecise error, and should be qualified. Fault may be synonymous with bug in the context of software faults, but need not be, as in the case of a page fault.
Highly available computer systems are not systems that never fail. They experience, more or less, the same failure rates on a per component basis as any other systems. The difference between these types of systems is how they respond to failures. You can divide the basic process of responding to failures into five phases.
FIGURE 1-1 shows the five phases of failure response:
Fault detection to determine that a fault has occurred
Fault isolation to determine the source of the fault and the component or field-replaceable unit (FRU) that must be repaired
Fault correction, if possible, in the case of automatically recoverable components, such as error checking and correction (ECC) memory
Failure containment so that the fault does not propagate to other components
System reconfiguration so you can repair the faulty component
Figure 1-1. HA System Failure Response
Fault detection is an important part of highly available systems. Although it may seem simple and straightforward, it is perhaps the most complex part of a cluster. Fault detection in a cluster is an open problem, one for which not all solutions are known. The Sun Cluster strategy for solving this problem is to rely on industry-standard interfaces between cluster components. These interfaces have built-in fault detection and error reporting. However, it is unlikely that all failure modes of all components and their interactions are known.
While this may sound serious, it is not so bad when understood in the context of the cluster. For example, consider unshielded twisted pair (UTP), 10BASE-T Ethernet interfaces. Two classes of failures can affect the interface: physical and logical.
These errors can be further classified or assigned according to the four layers of the TCP/IP stack, but for the purpose of this discussion, the classification of physical and logical is sufficient. Physical failures are a bounded set. They are often detected by the network interface card (NIC). However, not all physical failures can be detected by a single NIC, nor can all physical failures be simulated by simply removing a cable.
Knowing how the system software detects and handles error conditions as they occur is important to a systems architect. If the network fails in some way, the software should be able to isolate the failure to a specific component.
For example, TABLE 1-1 lists some common 10BASE-T Ethernet failure modes. As you can see from this example, there are many potential failure modes. Some of these failure modes are not easily detected.
Table 1-1. Common 10BASE-T Ethernet Failure Modes
|Failure mode||Type||Detected by||Detectable|
|Cable unplugged||Physical||NIC||Yes, unless Signal Quality Error (SQE) is enabled|
|Cable wired in reverse polarity||Physical||NIC||Yes|
|Cable too long||Physical||NIC (in some cases only)||Difficult, because the error may range from no link (with SQE disabled) to a high bit error rate (BER) that must be detected by logical tests|
|Cable receive pair wiring failure||Physical||NIC||Yes, unless SQE is enabled|
|Cable transmit pair wiring failure||Physical||Remote device||Yes, unless SQE is enabled|
|Electromagnetic interference (EMI)||Physical||NIC (in some cases only)||Difficult, because the errors may be intermittent, with the only detection being changes in the BER|
|Duplicate medium access control (MAC) address||Logical||Solaris operating environment||Yes|
|Duplicate IP address||Logical||Solaris operating environment||Yes|
|Incorrect IP network address||Logical|| ||Not automatically detectable for the general case|
|No response from remote host||Logical||Sun Cluster software||Sun Cluster software uses a series of progressive tests to try to establish connection to the remote host|
You may be tempted to simulate physical or even logical network errors by disconnecting cables, but this does not simulate all possible failure modes of the physical network interface. Full physical fault simulation for networks can be a complicated endeavor.
Probes are tools or software that you can use to detect most system faults, including latent faults. You can also use probes to gather information and improve fault detection. Hardware designs use probes for measuring environmental conditions such as temperature and power. Software probes query service response or act like end users completing transactions.
You must put probes at the end points to effectively measure end-to-end service level. For example, if a user community at a remote site requires access to a service, a probe system must be installed at the remote site to measure the service and its connection to the end users. It is not uncommon for a large number of probes to exist in an enterprise that provides mission-critical services to a geographically distributed user base. Collecting the probe status at the operations control center that supports the system is desirable. However, if the communications link between the probe and operations control center is down, the probe must be able to collect and store status information for later retrieval. For more information on probes, see Failure Detection.
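The store-and-forward behavior described above can be sketched as follows. This is a minimal illustration, not a description of any Sun product: the class name BufferingProbe, the check_service and send callbacks, and the max_buffered limit are all hypothetical names chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Tuple

Sample = Tuple[float, bool]  # (timestamp, service_ok)


@dataclass
class BufferingProbe:
    """Hypothetical probe that keeps samples while the control center is unreachable."""

    check_service: Callable[[], bool]  # True if the service answered correctly
    send: Callable[[Sample], bool]     # True if the control center received the sample
    max_buffered: int = 1000           # assumed cap on locally stored samples
    backlog: Deque[Sample] = field(default_factory=deque)

    def poll(self, now: float) -> None:
        """Measure the service once, then try to deliver everything stored so far."""
        self.backlog.append((now, self.check_service()))
        while len(self.backlog) > self.max_buffered:
            self.backlog.popleft()  # drop the oldest sample when the buffer is full
        self.flush()

    def flush(self) -> None:
        """Deliver stored samples in order; stop at the first delivery failure."""
        while self.backlog:
            if not self.send(self.backlog[0]):
                return  # link still down, keep the samples for later retrieval
            self.backlog.popleft()
```

Samples are delivered oldest first, so the operations control center sees the status history in the order it was measured once the link returns.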
Complex services require complex probes to inquire about all capabilities of the service. This complexity produces opportunity for defects in the probe itself. You cannot rely on a faulty probe to deliver an accurate status of the service.
To detect latent faults in hardware, you can use special test software such as the Sun Management Center Hardware Diagnostic Suite (HWDS) software. The HWDS allows you to perform tests that exercise the hardware components of a system at scheduled intervals. Because these tests consume some system resources, they are done infrequently. Detected errors are treated as normal errors and reported to the system by the normal error reporting mechanisms.
The most obvious types of latent faults are those that exist but are not detected during the testing process. These include software bugs that are not simulated by software testing, and hardware tests that do not provide adequate coverage of possible faults.
Fault isolation is the process of determining, from the available data, which component caused a failure. Once the faulty component is identified or isolated, it can be reset or replaced with a functioning component.
The term fault isolation is sometimes used as a synonym for fault containment, which is the process of preventing the spread of a failure from one component to others.
For analysis of potential modes of failure, it is common to divide a system into a set of disjoint fault isolation zones. Each error or failure must be attributed to exactly one of these zones. For example, a FRU or an application process can represent a fault isolation zone.
When recovering from a failure, the system can reset or replace numerous components with a single action. For example, a Sun Quad FastEthernet™ card has four network interfaces located on one physical card. Because recovery work is performed on all components in a fault recovery zone, replacing the Sun Quad FastEthernet card affects all four network interfaces on the card.
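One way to picture the relationship between components and zones is a simple mapping from each component to the single zone that contains it. The sketch below is illustrative only; the inventory, the zone names, and the function affected_by_replacing are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical inventory: each component belongs to exactly one zone (here, a FRU),
# so the zones are disjoint by construction.
ZONE_OF: Dict[str, str] = {
    "qfe0": "quad-fastethernet-card-0",
    "qfe1": "quad-fastethernet-card-0",
    "qfe2": "quad-fastethernet-card-0",
    "qfe3": "quad-fastethernet-card-0",
    "hme0": "system-board-0",
}


def affected_by_replacing(component: str) -> List[str]:
    """Return every component that shares a zone with the given component."""
    zone = ZONE_OF[component]
    return sorted(name for name, z in ZONE_OF.items() if z == zone)
```

In this sketch, asking which components are affected by replacing the card that holds qfe2 returns all four interfaces on that card, while hme0 is affected only by its own zone.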
Fault reporting notifies components and humans that a fault has occurred. Good fault reporting with clear, unambiguous, and concise information goes a long way toward improving system serviceability.
Berkeley Software Distribution (BSD) introduced the syslogd daemon as a general-purpose message logging service. This daemon is very flexible and network aware, making it a popular interface for logging messages. Typically, the default syslogd configuration is not sufficient for complex systems or reporting structures. However, correctly configured, syslogd can efficiently distribute messages from a number of systems to centralized monitoring systems. The logger command provides a user-level interface for generating syslogd messages and is very useful for systems administration shell scripts.
Not all fault reports should be presented to system operators with the same priority. To do so would make appropriate prioritized responses difficult, particularly if the operator was inexperienced. For example, media defects in magnetic tape are common and expected. A tape drive reports all media defects it encounters, but may only send a message to the operator when a tape has exceeded a threshold of errors that says that the tape must be replaced. The tape drive continues to accumulate the faults to compare with the threshold, but not every fault generates a message for the operator.
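The tape example can be sketched as a counter that records every defect but produces an operator message only once, when the threshold is first exceeded. The class name, the message text, and the threshold value used in the usage note are assumptions for illustration, not the behavior of any particular tape drive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TapeDefectCounter:
    """Hypothetical per-tape counter with a single operator notification."""

    threshold: int          # assumed number of defects a tape may accumulate
    defects: int = 0
    notified: bool = False

    def record_defect(self) -> Optional[str]:
        """Count one media defect; return an operator message only when needed."""
        self.defects += 1
        if self.defects > self.threshold and not self.notified:
            self.notified = True
            return (
                f"replace tape: {self.defects} media defects "
                f"exceed threshold {self.threshold}"
            )
        return None  # defect is accumulated but not shown to the operator
```

With a threshold of 3, the first three calls return nothing, the fourth returns the replacement message, and later calls keep counting silently.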
Faults can be classified in terms of correctability and detectability. Correctable faults are faults that can be corrected internally by a component and that are transparent to other components (faults inside the black box). Recoverable faults, a superset of correctable faults, include faults that can be recovered through some other method such as retrying transactions, rerouting through an alternate path, or using an alternate primary. Regardless of the recovery method, correctable faults are faults that do not result in unavailability or loss of data. Uncorrectable errors do result in unavailability or data loss. Unavailability is usually measured over a discrete time period and can vary widely depending on the SLAs with the end users.
Reported correctable (RC) errors are of little consequence to the operator. Ideally, all RC errors should have soft-error rate discrimination algorithms applied to determine whether the rate is excessive. An excessive rate may require the system to be serviced.
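A soft-error rate discrimination algorithm can be as simple as counting RC errors inside a sliding time window. The sketch below shows that one approach under stated assumptions: the class name, the window length, and the error limit are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque


class SoftErrorRateMonitor:
    """Hypothetical sliding-window check for an excessive correctable-error rate."""

    def __init__(self, max_errors: int, window_seconds: float) -> None:
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._times: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one RC error; return True if the rate in the window is excessive."""
        self._times.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Forget errors that have aged out of the window.
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) > self.max_errors
```

A True result corresponds to the case in which the system may need service; isolated errors spread over a long period never trip the check.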
Error correction is the action taken by a component to correct an error condition without exposing other components to the error. Error correction is often done at the hardware level by ECC memory or data path correction, tape write operation retries, magnetic disk defect management, and so forth. The difference between a correctable error and a fault that requires reconfiguration is that other components in the system are shielded from the error and are not involved in its correction.
Reported uncorrectable (RU) errors notify the operators that something is wrong and give the service organization some idea of what to fix.
Silent correctable (SC) errors cannot have rate-discrimination algorithms applied because the system receives no report of the event. If the rate of an SC error is excessive because something has broken, no one ever finds out.
Silent uncorrectable (SU) errors are neither reported nor recoverable. Such errors are typically detected some time after they occur. For example, a bank customer discovers a mistake while verifying a checking account balance at the end of the month. The error occurred some time before its eventual discovery. Fortunately, most banks have extensive auditing capabilities and processes that ultimately account for such errors.
TABLE 1-2 shows some examples of reported and correctable errors.
Table 1-2. Reported and Correctable Errors
| ||Correctable||Uncorrectable|
|Reported||RC: DRAM ECC error, dropped TCP/IP packet||RU: Kernel panic, serial port parity error|
|Silent||SC: Processor branch prediction table error||SU: Undetected data corruption|
For additional details on detectable faults, see Fault Detection and Failure Detection.
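The four classes discussed above follow from two yes-or-no questions: was the error reported, and was it corrected? A minimal sketch of that classification, with hypothetical names ErrorClass and classify, looks like this:

```python
from enum import Enum


class ErrorClass(Enum):
    RC = "reported correctable"
    RU = "reported uncorrectable"
    SC = "silent correctable"
    SU = "silent uncorrectable"


def classify(reported: bool, correctable: bool) -> ErrorClass:
    """Map the two properties of an error onto the RC, RU, SC, and SU classes."""
    if reported:
        return ErrorClass.RC if correctable else ErrorClass.RU
    return ErrorClass.SC if correctable else ErrorClass.SU
```

For example, a DRAM ECC error that is logged and corrected classifies as RC, while undetected data corruption classifies as SU.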
Fault containment is the ability to contain the effects and prevent the propagation of an error or failure, usually due to some boundary. For clusters, computing nodes are often the fault containment boundary. The assumption is that the node halts, as a direct consequence of the failure or by a failfast or failstop, before it has a chance to do significant I/O and propagate the fault.
Fault containment can also be undertaken proactively, for example, through failure fencing. The concept of failure fencing is closely related to failstop. One way to ensure that a faulty component cannot propagate errors is to prevent it from accessing other components or data. Disk Fencing describes the details of how the Sun Cluster software products use this failure fencing technique.
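The idea behind failure fencing can be sketched with a toy model in which a shared store refuses writes from any node that has been fenced. This is only a conceptual illustration and does not represent how Sun Cluster software implements disk fencing; the class FencedStore and its methods are hypothetical.

```python
from typing import Dict, Iterable, Set


class FencedStore:
    """Toy shared store that rejects access from fenced nodes."""

    def __init__(self, members: Iterable[str]) -> None:
        self._allowed: Set[str] = set(members)
        self.data: Dict[str, str] = {}

    def fence(self, node: str) -> None:
        """Revoke a node's access so it can no longer propagate errors."""
        self._allowed.discard(node)

    def write(self, node: str, key: str, value: str) -> bool:
        """Apply the write only if the node has not been fenced."""
        if node not in self._allowed:
            return False
        self.data[key] = value
        return True
```

Once a suspect node is fenced, its later writes are rejected and existing data is left untouched, while surviving nodes continue to work normally.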
Fault propagation occurs when a fault in one component causes a fault in another component. This propagation can occur when two components share a common component; this is known as a common mode fault. For example, a SCSI-2 bus is often used for low-cost, shared storage in clusters. The SCSI-2 bus represents a shared component that can propagate faults. If a disk or host failure hangs the SCSI-2 bus (interferes with the bus arbitration), the fault is propagated to all targets and hosts on the bus. Similarly, before the development of UTP Ethernet (10BASE-T, 100BASE-T), many implementations of Ethernet networks used coaxial cable (10BASE2). Such a network is subject to node faults that interfere with the arbitration or transmission of data and can propagate to all nodes on the network through the shared coaxial cable.
Another form of fault propagation occurs when incorrect data is stored and replicated. For example, mirrored disks are synchronized closely. Bad data written to one disk is likely to be propagated to the mirror. Sources of bad data may include operator error, undetected read faults in a read-modify-write operation, and undetected synchronization faults.
Operator error can be difficult to predict and prevent. You can prevent operator errors from propagating throughout the system by implementing a time delay between the application of changes on the primary and remote site. If this time delay is large enough to ensure detection of operator error, changes on the remote site can be prevented so that the fault is contained at the primary site. This containment prevents the fault from propagating to the remote site. For details, see High Availability Versus Disaster Recovery.
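The time-delay technique can be sketched as a queue of changes that the remote site applies only after a fixed delay, with the option to discard pending changes once an operator error is detected. The class DelayedReplica, its method names, and the delay values in the usage note are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List, Tuple


class DelayedReplica:
    """Hypothetical remote site that applies primary-site changes after a delay."""

    def __init__(self, delay_seconds: float) -> None:
        self.delay_seconds = delay_seconds
        self.pending: Deque[Tuple[float, str]] = deque()  # (timestamp, change)
        self.applied: List[str] = []

    def submit(self, timestamp: float, change: str) -> None:
        """Queue a change that was made on the primary at the given time."""
        self.pending.append((timestamp, change))

    def apply_due(self, now: float) -> None:
        """Apply, in order, every change that is at least delay_seconds old."""
        while self.pending and now - self.pending[0][0] >= self.delay_seconds:
            _, change = self.pending.popleft()
            self.applied.append(change)

    def cancel_from(self, timestamp: float) -> int:
        """Discard pending changes made at or after timestamp; return how many."""
        kept = deque(item for item in self.pending if item[0] < timestamp)
        discarded = len(self.pending) - len(kept)
        self.pending = kept
        return discarded
```

With a one-hour delay, an erroneous change that is detected within the hour can be canceled before it ever reaches the remote site, so the fault stays contained at the primary.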
Reconfiguration Around Faults
Reconfiguring the system around faults is a technique commonly employed in clusters. A faulty cluster node causes reconfiguration of the cluster to remove the faulty node from the cluster. Any cluster-aware services that were resident on the faulty node are started on one or more surviving nodes in the cluster. Reconfiguration around faults can be a complicated process. The paragraphs that follow examine this process in more detail.
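At its simplest, reconfiguration removes the faulty node from the membership and restarts its services on the survivors. The sketch below shows only that core step. The function name, the node and service names, and the least-loaded placement policy are assumptions for illustration and do not describe the Sun Cluster reconfiguration algorithm.

```python
from typing import Dict, List


def reconfigure(placement: Dict[str, List[str]], faulty: str) -> Dict[str, List[str]]:
    """Return a new placement without the faulty node.

    Services from the faulty node are restarted on the surviving node that
    currently hosts the fewest services (an assumed policy for this sketch).
    """
    survivors = {node: list(svcs) for node, svcs in placement.items() if node != faulty}
    if not survivors:
        raise RuntimeError("no surviving nodes to host services")
    for service in placement.get(faulty, []):
        target = min(sorted(survivors), key=lambda node: len(survivors[node]))
        survivors[target].append(service)
    return survivors
```

The original placement is left unmodified, which makes it easy to compare the cluster state before and after the reconfiguration.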
A number of features in the Solaris operating environment allow system reconfiguration around faults without requiring clustering software. These features are:
Dynamic reconfiguration (DR)
Internet protocol multipathing (IPMP)
I/O multipathing (Sun StorEdge™ Traffic Manager)
Alternate pathing (AP)
DR attaches and detaches system components to an active Solaris operating environment system without causing an outage. Thus, DR is often used for servicing components. Note that DR does not include the fault detection, isolation, containment, or reconfiguration capabilities available in Sun Cluster software.
IPMP automatically reconfigures around failed network connections.
I/O multipathing balances loads across host bus adapters (HBAs). This feature, which is also known as MPxIO, was implemented in the Solaris 8 operating environment with kernel patch 108528-07 for SPARC®-based systems and 108529-07 for Intel-based systems.
AP reconfigures around failed network and storage paths. AP is somewhat limited in capability, and its use is discouraged in favor of IPMP and the Sun StorEdge Traffic Manager.
Future plans include tighter alignment and integration between these Solaris operating environment features and Sun Cluster software.
Fault prediction is the process of observing a component over time to predict when a fault is likely. Fault prediction works best when the component includes a consumable subcomponent or a subcomponent that has known decay properties. For example, an automobile computer knows the amount of fuel in the fuel tank and the instantaneous consumption rate. The computer can predict when the fuel tank will be empty, a state that would cause a fault condition. This information is displayed to the driver, who can take corrective action.
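The fuel example reduces to dividing the remaining quantity of a consumable by its current consumption rate. A minimal sketch, with hypothetical function names, units, and warning margin, is:

```python
def hours_until_empty(fuel_liters: float, liters_per_hour: float) -> float:
    """Predict the time remaining before the tank is empty at the current rate."""
    if liters_per_hour <= 0:
        return float("inf")  # nothing is being consumed, so no fault is predicted
    return fuel_liters / liters_per_hour


def should_warn(fuel_liters: float, liters_per_hour: float, warn_hours: float = 1.0) -> bool:
    """Return True when the predicted fault is within the assumed warning margin."""
    return hours_until_empty(fuel_liters, liters_per_hour) <= warn_hours
```

The same pattern applies to any consumable or decaying subcomponent for which both the remaining capacity and the rate of use can be observed.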
Practical fault prediction in computer systems today focuses primarily on storage media. Magnetic media, in particular, has behavior that can be used to predict when the ability to store and retrieve data will fall out of tolerance and result in a read failure in the future. For disks, this information is reported to the Solaris operating environment as soft errors. These errors, along with predictive failure information, can be examined using the iostat(1M) or kstat(1M) command.
Unfortunately, a large number of unpredictable faults can occur in computer systems. Software bugs make software prone to unpredictable faults.