

Availability

A system with 100% availability is always up and never down. At present, the 100% threshold is attainable only in theory, in part because of unpredictable random errors, natural disasters, required maintenance, upgrades, and of course 'hard' and 'soft' errors. Systems typically crash and go down because of these last two kinds of errors.
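To make the percentages concrete, the short sketch below converts an availability figure into the downtime it implies per year. It assumes the common definition of availability as uptime divided by total observed time; the function names and the 8,760-hour year are illustrative choices, not something the chapter specifies.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (an assumption)


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the observed time during which the system was up."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


def downtime_hours_per_year(avail: float) -> float:
    """Yearly downtime implied by an availability fraction between 0 and 1."""
    if not 0.0 <= avail <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - avail) * HOURS_PER_YEAR


# Print the yearly downtime budget for a few availability targets.
for pct in (99.0, 99.9, 99.99, 100.0):
    hours = downtime_hours_per_year(pct / 100)
    print(f"{pct}% available -> {hours:.2f} hours down per year")
```

Under these assumptions, 99.9% availability still leaves roughly 8.76 hours of downtime per year, which shows how much a single maintenance window or hardware failure can consume.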

Hard Errors and Soft Errors

A 'hard' error, as the name implies, is one in which something on the system physically changes: a piece of hardware fails, freezes, or burns up. Examples include a power supply that shorts out, a cooling fan that seizes up, or a component that short-circuits. Hard errors also include outside interference that results in downtime, such as a lightning strike whose power surge knocks out equipment.

A 'soft' error is the type described in the previous section on reliability. Here a cosmic ray or electrical noise on a bus unintentionally and randomly flips a data bit. The parity and error-correcting code (ECC) circuits of the Itanium processor handle these errors as they are detected: parity reveals that a bit has changed, and ECC can restore the original value, keeping the user's computing environment safe in spite of these unexpected events. Because the on-chip parity checks and ECC act on the corrupted data regardless of its cause, they cover bit errors produced by hard faults as well as by soft errors.
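The idea of correcting a flipped bit can be illustrated with a textbook Hamming(7,4) code. This is a minimal sketch of the general technique only; it is not the ECC scheme used in the Itanium processor, which the text does not describe, and the function names are hypothetical.

```python
def encode(data_bits):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1 to 7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # check bit covering positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # check bit covering positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # check bit covering positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def correct(codeword):
    """Return (repaired codeword, error position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome names the flipped position
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return c, position


def decode(codeword):
    """Extract the 4 data bits from a (repaired) codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


original = [1, 0, 1, 1]
stored = encode(original)
stored[4] ^= 1  # simulate a soft error: one bit flips in storage
repaired, where = correct(stored)
print("flipped position:", where)
print("data recovered:", decode(repaired) == original)
```

A code like this repairs any single flipped bit per codeword. A plain parity bit, by contrast, can only signal that an odd number of bits changed, which is why detection and correction are usually discussed as separate capabilities.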

Software Errors

Software errors, as the name implies, are errors that occur in the software. They typically occur when a program encounters code or data that confuses it or that the programmer did not anticipate, so the program no longer knows where it is going or what it should do next.

These errors can originate in faulty hardware (a hard error leading to a software error) or in programming mistakes. For example, a program might try to branch to a memory address that falls in the middle of a block of data or in unused memory.

Ideally, the system should catch these software errors without having to stop and wait for an operator or some other intervention. The best result is that the system has a place to go to 'recover' from the error instead of crashing. One example is allowing the program to jump to a new address. From there it can back up and examine what happened, keep executing at a slower speed, or enter a routine that produces a graceful recovery. It may even go into another routine that 'branches' around the area with the problem.
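The recovery idea can be sketched at the application level. In the example below, a bad branch target is modeled as a request for a routine that does not exist; instead of crashing, the program records what happened and takes a safe path around the problem. The handler table, the function names, and the choice of passing the value through unchanged are all illustrative assumptions, not a description of how any particular processor or operating system recovers.

```python
# Hypothetical table of valid "branch targets" for this sketch.
HANDLERS = {
    "double": lambda x: x * 2,
    "negate": lambda x: -x,
}


def dispatch(target, value):
    """Jump to the routine named by target; an unknown target raises LookupError."""
    if target not in HANDLERS:
        raise LookupError(f"no routine at target {target!r}")
    return HANDLERS[target](value)


def run_with_recovery(target, value, log):
    """Try the requested routine; on a bad target, log it and branch around it."""
    try:
        return dispatch(target, value)
    except LookupError as err:
        log.append(str(err))  # keep a record so the fault can be examined later
        return value          # safe path: skip the missing routine, keep running


events = []
print(run_with_recovery("double", 21, events))  # valid target, runs normally
print(run_with_recovery("bogus", 21, events))   # invalid target, recovers
print(events)
```

The design choice worth noting is that the failure is both contained and recorded: the caller keeps running, and the log preserves enough detail to investigate the fault afterward without operator intervention at the moment it occurs.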
