Computer systems use timeouts to help maintain state information. Because digital computers are discrete systems, you must use some mechanism to create an event and to detect that an expected event did not complete. You can use timeouts for cluster heartbeats, for arbitration cycles that determine whether a processor has hung, and so forth.
It is tempting to reduce timeout values to detect component failures sooner. However, timeout values that are too small destabilize the system, and you must avoid them. Ensuring that the system remains stable when timeouts are changed requires due diligence and detailed engineering analysis.
Timeouts are also opportunities for defects. In particular, the stability of a system can be undermined if you design a series of nested timeouts incorrectly. FIGURE 3 is a nested timing diagram example.
FIGURE 3 Nested Timing Diagram Example
In this example, component A has a state change at time t1 that causes a state change in component B at time t2. Component B then has a state change at time t3 that causes a state change in component A at t4. This timing diagram example is analogous to a host, A, issuing a read data command to a disk drive, B. This system has deterministic, sequential behavior. However, if a failure occurs in component B between t2 and t3, the state change of component A at t4 will never occur. Component A will hang. If a timeout is implemented, component A can detect the failure of component B. FIGURE 4 shows the failure of B and the point at which A times out.
FIGURE 4 Nested Timing Diagram With Timeout
In this case, A implements an internal timeout, Ato. Component A has a state change at time t1, starts the timeout counter, Ato, at t2, and causes a state change in B at t3. At time t4, component B fails, never to recover. At time t5, the Ato timeout expires, causing A to change its state. Because the timeout Ato is part of component A, A knows that an error condition occurred in component B.
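The timeout behavior described above can be sketched in Python. This is only a minimal illustration under stated assumptions: the function names component_a and component_b, the queue-based messaging, and the timeout values are hypothetical and do not come from the original design.

```python
import queue
import threading


def component_b(requests, replies, fail):
    """Hypothetical component B: serves one request unless it has failed."""
    request = requests.get()            # state change in B
    if fail:
        return                          # B fails and never replies
    replies.put(f"done:{request}")      # B completes its service


def component_a(timeout_s, b_fails):
    """Hypothetical component A: guards a request to B with the timeout Ato."""
    requests, replies = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=component_b, args=(requests, replies, b_fails), daemon=True
    )
    worker.start()
    requests.put("read")                # A triggers the state change in B
    try:
        # Ato: wait at most timeout_s seconds for B's reply.
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        # Ato expired, so A concludes that B has failed.
        return "timeout"


healthy_result = component_a(timeout_s=2.0, b_fails=False)
failed_result = component_a(timeout_s=0.1, b_fails=True)
print(healthy_result, failed_result)
```

Without the timeout argument, the call to replies.get would block forever when B fails, which corresponds to component A hanging.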
FIGURE 5 shows a stable system implementing timeouts. In this case, the component timeout of A is greater than the service time of component B.
Timeout for Ato = (t6 - t2) > (t4 - t3)
FIGURE 5 Stable System With Timeout
FIGURE 6 shows an unstable system implementing timeouts. In this case, the timeout value of component A is less than the service time of component B.
Timeout for Ato = (t4 - t2) < (t6 - t3)
FIGURE 6 Unstable System With Timeout
The system is unstable because it detects false errors. The action taken by component A, because of the presumed failure of component B, determines the stability of the overall system.
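The comparison behind the stable and unstable cases can be written as a small check. The sketch below assumes integer time stamps chosen only for illustration (t2 = 2, t3 = 3, t4 = 4, t6 = 6); the function name classify_timeout is hypothetical.

```python
def classify_timeout(ato_start, ato_expire, b_start, b_done):
    """Compare the length of A's timeout with B's service time."""
    timeout = ato_expire - ato_start    # length of Ato
    service_time = b_done - b_start     # time B needs to complete
    if timeout > service_time:
        return "stable"
    return "false error"


# Stable case: Ato runs from t2 to t6, B serves from t3 to t4.
stable_case = classify_timeout(ato_start=2, ato_expire=6, b_start=3, b_done=4)

# Unstable case: Ato runs from t2 to t4, B serves from t3 to t6.
unstable_case = classify_timeout(ato_start=2, ato_expire=4, b_start=3, b_done=6)

print(stable_case, unstable_case)
```

In the unstable case, A reports a failure even though B is healthy and would have completed at t6.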
The stability problems inherent in complex computer systems using timeouts are difficult to predict or prove mathematically because of the multivariate and nonlinear nature of these systems.
Timeouts that are too long increase the time to detect errors. Timeouts that are too short generate false error conditions. False error conditions cause unnecessary failovers.
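This trade-off can be made concrete by evaluating candidate timeout values against observed service times. The sketch below is illustrative only: the service times are invented sample values in milliseconds, and evaluate_timeout is a hypothetical helper.

```python
def evaluate_timeout(timeout_ms, service_times_ms):
    """Return the detection time and false error rate for one timeout value."""
    # A healthy request is misreported as failed when the timeout
    # is not longer than the request's service time.
    false_errors = sum(1 for s in service_times_ms if s >= timeout_ms)
    return {
        "detection_time_ms": timeout_ms,
        "false_error_rate": false_errors / len(service_times_ms),
    }


# Hypothetical service times of a healthy component, in milliseconds.
observed = [8, 10, 12, 15, 40]

too_short = evaluate_timeout(5, observed)
moderate = evaluate_timeout(20, observed)
too_long = evaluate_timeout(50, observed)

for result in (too_short, moderate, too_long):
    print(result)
```

With these sample values, the shortest timeout flags every healthy request as a failure, while the longest never raises a false error but takes ten times as long to detect a real one.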
Timeouts are a source of systems integration errors. For components that use timeouts, the default timeout values are set according to the expected use of each component. When the components are combined to build a larger system, these timeouts can cause system instability. You must understand the component timeouts and their effect on event detection.
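One way to review component timeouts during integration is to list them from the outermost layer to the innermost and flag any outer timeout that is not longer than the timeout it encloses. The sketch below assumes this simple nesting rule; the layer names and values are hypothetical, not defaults of any real product.

```python
def find_timeout_inversions(layers):
    """Return (outer, inner) pairs where the outer timeout is too short.

    layers is a list of (name, timeout_seconds), outermost layer first.
    """
    problems = []
    for (outer, outer_t), (inner, inner_t) in zip(layers, layers[1:]):
        # The outer layer can expire before the inner layer has
        # finished its own error detection.
        if outer_t <= inner_t:
            problems.append((outer, inner))
    return problems


# Hypothetical default timeouts, in seconds, of three nested layers.
layers = [
    ("cluster heartbeat", 10),
    ("I/O driver", 60),
    ("disk", 30),
]

inversions = find_timeout_inversions(layers)
print(inversions)
```

A check like this finds only the simplest kind of conflict. Retries, queuing delays, and variable service times also affect whether a combination of timeouts is stable.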