
IT Management Reference Guide

Jul 9, 2004


Achieving high availability does not happen by accident. Careful planning, clever design, flawless execution, and reliable support are just some of the characteristics required to keep critical systems up and operating for months on end. In this third segment of my four-part series on improving high availability, I describe several characteristics I believe are essential for achieving maximum availability. I refer to these attributes as the Seven Rs of High Availability.

The Seven Rs of High Availability

The goal of all availability process owners is to maximize the uptime of the various online systems for which they are responsible; in essence, to make them completely fault tolerant. Constraints inside and outside the IT environment make this challenge close to impossible. Budget limitations, component failures, faulty code, human error, flawed design, natural disasters, and unforeseen business shifts such as mergers, downturns, and political changes are just some of the factors working against that elusive goal of 100% availability, the ultimate expression of high availability.

Several approaches can be taken to maximize availability without breaking the bank. Each of these approaches starts with the same letter, so we refer to them as the seven Rs of high availability (see Figure 9).

Figure 9. The Seven Rs of High Availability

Let's begin with redundancy. Manufacturers have been designing this into their products for years in the form of redundant power supplies, multiple processors, segmented memory, and redundant disks. Redundancy can also refer to entire server systems running in a hot standby mode. Infrastructure analysts can take a similar approach by configuring disk controllers, tape controllers, and servers with dual paths, splitting network loads over dual lines, and providing alternate control consoles. In short, eliminate as many as possible of the single points of failure that could disrupt service availability.
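To put a rough number on what redundancy buys, here is a minimal sketch of the standard availability arithmetic for components in parallel and in series. It assumes failures are independent, which real systems only approximate (shared power, shared code, and shared operators all correlate failures). The function names and the 99% figure are illustrative, not taken from any product.

```python
from typing import Iterable


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when any one suffices.

    Assumes independent failures: the group is down only if every copy is down.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1.0 - (1.0 - component_availability) ** copies


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# Illustrative figures only: one 99%-available server versus a redundant pair.
single = parallel_availability(0.99, 1)
pair = parallel_availability(0.99, 2)
print(f"single: {single:.4%}  redundant pair: {pair:.4%}")
```

The serial function shows the other side of the argument: every non-redundant component in the path multiplies its own unavailability into the service, which is why single points of failure matter so much.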

The next three approaches—reputation, reliability, and repairability—are closely related. Reputation refers to the track record of key suppliers. Reliability pertains to the dependability of the components and the coding that go into their products. Repairability is a measure of how quickly and easily suppliers can fix or replace failing parts. We will look at each of these a bit more closely.

The reputation of key suppliers of servers, disk storage systems, database management systems, and network hardware and software plays a principal role in striving for high availability. It is always best to go with the best. Reputations can be verified in several ways. Percent of market share is one measure. Reports from industry analysts and Wall Street are another. Track record in the field is a third. Customer references can be especially useful when it comes to confirming such factors as cost, service, quality of the product, training of service personnel, and trustworthiness.

The reliability of the hardware and software can also be verified from customer references and industry analysts. Beyond that, you should consider performing what we call an empirical component reliability analysis. Figure 10 lists the steps involved. An analysis of problem logs should reveal any unusual patterns of failure and should be studied by supplier, product, using department, time and day of failures, frequency of failures, and time to repair. Suppliers often keep onsite repair logs that can be perused to conduct a similar analysis.

Figure 10. Empirical Methods for Component Reliability Analysis
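The problem-log analysis described above can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the record fields mirror the dimensions named in the text (supplier, product, using department, time and day, time to repair), while the class name, function name, and sample entries are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ProblemRecord:
    """One entry from a problem log (field names are assumptions)."""
    supplier: str
    product: str
    department: str
    weekday: str
    hour: int
    repair_hours: float


def summarize(records, key):
    """Group records by one attribute; report failure count and mean repair time."""
    groups = defaultdict(list)
    for record in records:
        groups[getattr(record, key)].append(record.repair_hours)
    return {
        value: {"failures": len(hours), "mean_repair_hours": mean(hours)}
        for value, hours in groups.items()
    }


# Invented sample data, for illustration only.
SAMPLE_LOG = [
    ProblemRecord("Acme", "router-x", "Finance", "Mon", 8, 1.0),
    ProblemRecord("Acme", "router-x", "Finance", "Tue", 8, 3.0),
    ProblemRecord("Bolt", "disk-array-q", "Sales", "Fri", 14, 5.0),
]

for dimension in ("supplier", "product", "department", "weekday"):
    print(dimension, summarize(SAMPLE_LOG, dimension))
```

Running the same summary over each dimension in turn is what surfaces the unusual patterns: a product that fails often, a department that sees most of the outages, or a time of day when failures cluster.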

Feedback from operations personnel can often be candid and revealing as to how components are truly performing. This can especially be the case for offsite operators. For example, they may be doing numerous resets on a particular network component every morning prior to start-up, but they may not bother to log it since it always comes up. Similar conversations with various support personnel such as systems administrators, network administrators, and database administrators may elicit similar revelations. You might think that feedback from suppliers' repair personnel could be biased, but in my experience they can be just as candid and revealing about the true reliability of their products as the people using them. This then becomes another valuable source of information for evaluating component reliability, as is comparing experiences with other shops. Shops that are closely aligned with your own in terms of platforms, configurations, services offered, and customers can be especially helpful. Reports from reputable industry analysts can also be used to predict component reliability.

Repairability is the relative ease with which service technicians can repair or replace failing components. Two common metrics used to evaluate this trait are how long it takes to do the actual repair and how often the repair work needs to be repeated. In more sophisticated systems, repairs can be handled from remote diagnostic centers where failures are detected and circumvented, and arrangements are made for permanent resolution with little or no involvement of operations personnel.
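The two repairability metrics mentioned above, time to repair and how often a repair has to be redone, are simple to compute once repairs are logged. The sketch below is one possible way to do it; the `Repair` record, its `is_repeat` flag, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repair:
    """One completed repair (hypothetical record layout)."""
    component_id: str
    repair_hours: float
    is_repeat: bool  # True if this visit redid an earlier repair of the same fault


def repairability(repairs):
    """Return (mean repair hours, fraction of repairs that were repeats)."""
    if not repairs:
        raise ValueError("no repairs to analyze")
    total = len(repairs)
    mean_hours = sum(r.repair_hours for r in repairs) / total
    repeat_rate = sum(1 for r in repairs if r.is_repeat) / total
    return mean_hours, repeat_rate


# Invented sample data, for illustration only.
SAMPLE_REPAIRS = [
    Repair("ctrl-01", 2.0, False),
    Repair("ctrl-01", 4.0, True),
    Repair("nic-07", 3.0, False),
    Repair("psu-12", 3.0, False),
]

mean_hours, repeat_rate = repairability(SAMPLE_REPAIRS)
print(f"mean repair time: {mean_hours:.1f} h, repeat rate: {repeat_rate:.0%}")
```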

The next characteristic of high availability is recoverability. This refers to the ability to overcome a momentary failure in such a way that there is no impact on end-user availability. It could be as small as a portion of main memory recovering from a single-bit memory error, or as large as an entire server system switching over to its standby system with no loss of data or transactions. Recoverability also includes retries of attempted reads and writes out to disk or tape, as well as the retrying of transmissions down network lines.
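The retry behavior described above is usually built into hardware, drivers, and protocols, but the same idea applies in application code. The following is a minimal sketch of a bounded retry wrapper, assuming the operation signals a transient failure by raising `OSError`; the function name, parameters, and defaults are illustrative choices.

```python
import time


def with_retries(operation, attempts=3, delay_seconds=0.0, retry_on=(OSError,)):
    """Call `operation` until it succeeds or `attempts` tries are used up.

    Only exceptions listed in `retry_on` are retried; the last failure is
    re-raised so the caller still sees an unrecoverable error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


# Usage: a stand-in for a read that fails transiently twice, then succeeds.
state = {"calls": 0}


def flaky_read():
    state["calls"] += 1
    if state["calls"] < 3:
        raise OSError("transient I/O error")
    return b"payload"


print(with_retries(flaky_read, attempts=3))
```

Bounding the number of attempts matters: a retry that never gives up turns a momentary failure into a hang, which end users experience as an outage anyway.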

Responsiveness is the sense of urgency all people involved with high availability need to exhibit. This includes having well-trained suppliers and in-house support personnel who can respond to problems quickly and efficiently. It also pertains to how quickly the automated recovery of resources such as disks or servers can be enacted.

The final characteristic of high availability is robustness, which describes the overall design of the availability process. A robust process will be able to withstand a variety of forces—both internal and external—that could easily disrupt and undermine availability in a weaker environment. Robustness puts a high premium on documentation and training to withstand technical changes as they relate to platforms, products, services, and customers; personnel changes as they relate to turnover, expansion, and rotation; and business changes as they relate to new direction, acquisitions, and mergers.

In part four of this series I will discuss techniques for assessing and measuring the quality and robustness of an infrastructure's availability process.

References

Schiesser, Rich, IT Systems Management, Prentice Hall, 200
