
Dependable Systems

Fault-tolerant DC was an active research field during the last two decades of the 20th century and continues to be in the current era. Once the domain of mainframe systems, dependability in NDC systems is a natural result of global competitive pressures. Dependability in any system can be defined as the ability of the system to ensure that it (and the services it may deliver) can be relied upon within certain measurable parameters, the definition of which depends on the context of deployment. Generic concepts such as reliability, availability, scalability (RAS), and security define the characteristics of dependable NDC systems. Measures such as mean time between failures (MTBF) are traditionally used to evaluate the reliability of such systems.
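To make the MTBF measure concrete, the standard steady-state relation between availability, MTBF, and mean time to repair (MTTR) can be sketched as follows. The numeric figures are illustrative assumptions, not values from the text.

```python
# Steady-state availability from MTBF and MTTR:
#   availability = MTBF / (MTBF + MTTR)
# The example figures below are illustrative, not from the text.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability given mean time between failures
    and mean time to repair, both in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system that fails on average once every 10,000 hours and
# takes 1 hour to repair:
a = availability(10_000.0, 1.0)
print(f"{a:.6f}")  # prints 0.999900 -- roughly "four nines"
```

Note how the relation rewards both longer MTBF and shorter MTTR; restart-based recovery techniques such as those below improve availability chiefly by driving MTTR down.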

As global dependence on NDC continues to increase, the probability of crises rooted in network and system failures also increases. While the consequences of these failures are often petty inconvenience (my pager stopped working), the probability that a key application failure could give rise to large economic perturbations, or even loss of life, also increases. As more NDC applications become the norm, failures too become more distributed. Dependable systems must engender trust from many perspectives if NDC is to continue enriching human activities without introducing equally large measures of risk.

Dependable NDC systems require dependable hardware, which is beyond the scope of this book. A bigger part of the equation, however, is NDC software. A brief discussion of software dependability is germane at this juncture.

Below are two examples of the many fault-tolerant software approaches applicable to NDC application development. When used with other well-engineered development processes and components, these techniques will serve to provide more dependable NDC systems software going forward.

Checkpoint-Restart Technique

While discussions of dependable NDC software date back to the earliest experiences with networked computing,[15] a growing body of research in this category parallels the growth of the Internet over the same period. An excellent summary of the state of software fault tolerance relevant to this era was published in 2000 by Wilfredo Torres-Pomales of the NASA Langley Research Center in Hampton, Virginia.[16] Torres-Pomales cited a number of general approaches to software fault tolerance, many of which are applicable to NDC, including Single-Version Software Fault Tolerance techniques (that is, redundancy applied to a single version of a piece of software, designed to detect and recover from faults). The most common example of this approach cited by Torres-Pomales is the checkpoint-and-restart mechanism pictured in Figure 3.5.[17]

Figure 3.5. Single-version, checkpoint-restart technique

Most software faults (after development has been completed) are unanticipated and usually depend on state. Faults of this type often behave similarly to spurious hardware faults in that they may appear, do their damage, and then disappear leaving no vapor trail. In such cases, restarting the module is often the best strategy for successful completion of its task, one that has several advantages and is general enough to be used at multiple levels in an NDC system or environment. A restart can be dynamic or static, depending on context: a static restart brings the module to a predetermined state; a dynamic one may use dynamically created checkpoints at fixed intervals or at certain key points during execution. All this depends on error detection, of course, for which several applicable techniques are likewise available.

Recovery-Blocks Technique

Multiversion software fault tolerance techniques are, as the name implies, based on the use of two or more variants of a piece of software (executed either in sequence or in parallel), the assumption being that components built differently (by different designers using different approaches, tools, and so on) will fail differently. So if one version fails with a given input, an alternative version should provide appropriate output.

One example Torres-Pomales cites is a "Recovery Blocks" technique, which shares some attributes with Byzantine agreements discussed later.[18] The Recovery-Blocks technique combines the basics of checkpoint and restart with multiple versions of a given component; if an error is detected during processing in one variant, a different version executes. As shown in Figure 3.6, a checkpoint is created before execution, and error detection in a given module can occur at various checkpoints along the way, rather than through an output-only test.

Figure 3.6. Recovery Blocks technique

Although most of the time the primary version will execute successfully, the Recovery-Blocks technique also allows alternative variants to process in parallel (perhaps to lesser accuracy, depending on the CPU resources available) in order to ensure overall performance when the application requires it.
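The sequential form of the technique can be sketched as follows: checkpoint the input, run the primary variant, apply an acceptance test, and fall back to an independently built alternate if the test fails. The variant functions and the acceptance test here are illustrative assumptions, not code from the text.

```python
# Sequential recovery-blocks sketch: checkpoint, run a variant,
# apply an acceptance test, fall back to the next variant on failure.
import copy

def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

def primary_sort(xs):
    """Primary variant: deliberately buggy -- it drops duplicate elements."""
    return sorted(set(xs))

def alternate_sort(xs):
    """Alternate variant, built differently: a simple insertion sort."""
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

def acceptance_test(original, result):
    # Accept only if the output is a sorted permutation of the input.
    return is_sorted(result) and sorted(original) == sorted(result)

def recovery_block(data, variants):
    checkpoint = copy.deepcopy(data)  # checkpoint created before execution
    for variant in variants:
        result = variant(copy.deepcopy(checkpoint))  # each variant starts fresh
        if acceptance_test(checkpoint, result):
            return result  # first variant to pass the acceptance test wins
    raise RuntimeError("all variants failed the acceptance test")

print(recovery_block([3, 1, 3, 2], [primary_sort, alternate_sort]))  # [1, 2, 3, 3]
```

On the sample input the primary's output fails the acceptance test (a duplicate is lost), so the alternate runs from the same checkpoint and its result is accepted; with intermediate checkpoints, detection could likewise occur mid-execution rather than through an output-only test.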
