IT Management Reference Guide

Jul 9, 2004



In the first piece of this four-part series on improving high availability, I offered and explained a definition of availability and presented some key terms used with this process. In this second part I compare and contrast additional terms associated with this process, discuss desirable traits of an availability process owner, and present some effective methods for measuring availability.

Differentiating Slow Response from Downtime

Slow response can infuriate users and frustrate infrastructure specialists. The growth of a database, traffic on the network, contention for disk volumes, or the disabling of processors or portions of main memory in servers can all contribute to response time slowdowns. Each of these conditions requires analysis and resolution by infrastructure professionals. Users are understandably unaware of these root causes and sometimes interpret extremely slow response as downtime of their system. The threshold of time at which this interpretation occurs varies from user to user. It does not matter to users whether the problem is due to slowly responding software (slow response) or malfunctioning hardware (downtime). What does matter to them is that transactions respond quickly and consistently.

But slow response is different from downtime, and the root cause of these problems does matter a great deal to infrastructure analysts and administrators. They are charged with identifying, correcting, and permanently resolving the root causes of these service disruptions. Knowing which type of problem they face determines the course of action taken to resolve it. Slow response is usually a performance and tuning issue involving different personnel, different processes, and different process owners than those involved with downtime, which is an availability issue.

Differentiating Availability from High Availability

The primary difference between availability and high availability is that the latter is designed to tolerate virtually no downtime. All online computer systems are intended to maximize availability, or to minimize downtime, as much as possible. In high-availability environments, a number of design considerations are employed to make online systems as fault tolerant as possible. I refer to these considerations as the seven Rs of high availability and discuss them later in this series.

Desired Traits of an Availability Process Owner

As I mentioned previously, the most robust infrastructures select a single individual to be the process owner of availability. Some shops refer to this person as the availability manager. In some instances it is the operations manager; in others it is a strong technical lead in technical support. Regardless of who these individuals are, or to whom they report, they should be knowledgeable in a variety of areas, including systems, networks, databases, and facilities, and they must be able to think and act tactically. A slightly less critical, but still desirable, trait of an ideal candidate for availability process owner is knowledge of software and hardware configurations, backup systems, and desktop hardware and software.

Methods for Measuring Availability

The percentage of system availability is a very common measurement. It is found in almost all service level agreements and is calculated by dividing the amount of actual time a system was available by the total time it was scheduled to be up. For example, suppose an online system is scheduled to be up from 6:00 a.m. to midnight Monday through Friday and from 7:00 a.m. to 5:00 p.m. on Saturday. The total time it is scheduled to be up is (18 x 5) + 10 = 100 hours per week. When online systems first began being used for critical business processing in the 1970s, online availability rates between 90% and 95% were common, expected, and reluctantly accepted. In our example, that would mean the system was up 90–95 hours per week or, more significantly, down for 5–10 hours per week and 20–40 hours per month.
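The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `availability_pct` and the sample figure of 5 hours of downtime are invented for the example and do not come from any particular monitoring tool.

```python
def availability_pct(scheduled_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of scheduled time."""
    if scheduled_hours <= 0:
        raise ValueError("scheduled_hours must be positive")
    if not 0 <= downtime_hours <= scheduled_hours:
        raise ValueError("downtime_hours must be between 0 and scheduled_hours")
    actual_hours = scheduled_hours - downtime_hours
    return 100.0 * actual_hours / scheduled_hours


# Schedule from the example: 18 hours on each of 5 weekdays, plus 10 hours on Saturday.
scheduled = 18 * 5 + 10  # 100 hours per week

# Hypothetical week with 5 hours of downtime (an assumed figure).
print(f"{availability_pct(scheduled, downtime_hours=5):.1f}%")  # prints 95.0%
```

Note that the denominator is scheduled time, not calendar time, so planned off-hours do not count against the result.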

Customers quickly realized that 10 hours a week of downtime was unacceptable and began negotiating service levels of 98% and even 99% guaranteed availability. As companies expanded worldwide and 24/7 systems became prevalent, the 99% level was questioned. Systems needing to operate around the clock were scheduled for 168 hours of uptime per week. At 99% availability, these systems were down, on average, approximately 1.7 hours per week. Infrastructure groups began targeting 99.9% uptime as their goal for availability for critical business systems. This target allowed for just over 10 minutes of downtime per week, but even this was not acceptable for systems such as worldwide email or an e-commerce Web site.
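Turning the formula around gives the downtime that a target percentage permits. The sketch below is a rough illustration of that arithmetic for a 24/7 system; the function name `allowed_downtime_minutes` is an assumption made for this example.

```python
def allowed_downtime_minutes(scheduled_hours: float, target_pct: float) -> float:
    """Minutes of downtime permitted in a period while still meeting target_pct."""
    if not 0 <= target_pct <= 100:
        raise ValueError("target_pct must be between 0 and 100")
    return scheduled_hours * 60 * (100.0 - target_pct) / 100.0


HOURS_PER_WEEK_24X7 = 24 * 7  # 168 scheduled hours for an around-the-clock system

for target in (99.0, 99.9):
    minutes = allowed_downtime_minutes(HOURS_PER_WEEK_24X7, target)
    print(f"target {target}%: about {minutes:.1f} minutes per week")
```

For the two targets discussed above, this works out to roughly 100.8 minutes (about 1.7 hours) and roughly 10.1 minutes per week.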

So the question becomes: Is the percentage of scheduled service delivered really the best measure of quality and of availability? An incident at Federal Express several years ago involving the measurement of service delivery illustrates some points that could apply to the IT industry. FedEx had built its reputation on guaranteed overnight delivery. For many years its principal slogan was:

When it positively, absolutely has to be there overnight, Federal Express.

FedEx guaranteed a package or letter would arrive on time, at the correct address, and in the proper condition. One of its key measurements of service delivery was the percentage of time that this guarantee was met. Early on, the initial goals of 99% and later 99.9% were easily met. The number of letters and packages it handled on a nightly basis was steadily growing from a few thousand to over 10,000, and fewer than 10 items were lost or delivered improperly.

A funny thing happened as the growth of their company started to explode in the 1980s. The target goal of 99.9% was not adjusted as the number of items handled daily started approaching one million. This meant that 1,000 packages or letters could be lost or improperly delivered every night and their service metric would still be met. One proposal to address this was to increase the target goal to 99.99%, but this goal could have been met while still allowing 100 items a night to be mishandled. A new set of deceptively simple measurements was established in which the number of items lost, damaged, delivered late, and delivered to the wrong address was tracked nightly regardless of the total number of objects handled.

The new set of measurements offered several benefits. Because the measurements were not tied to percentages, they gave more visibility to the actual number of delivery errors occurring nightly. This helped in planning for anticipated customer calls, recovery efforts, and adjustments to revenue. Because incidents were broken into four subcategories (lost, damaged, delivered late, and delivered to the wrong address), each type could be tracked separately as well as looked at in total. Finally, by analyzing trends, patterns, and relationships, managers could pinpoint problem areas and recommend corrective actions.
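A count-based measurement of this kind is simple to express in code. The following Python sketch tallies one night's incidents by the four categories named above (lost, damaged, late, wrong address). The category labels, the function name `nightly_counts`, and the sample incident list are all hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

# Labels for the four categories named in the text (the spellings are assumptions).
CATEGORIES = ("lost", "damaged", "late", "wrong_address")


def nightly_counts(incidents):
    """Count incidents per category for one night; categories not seen count as 0."""
    counts = Counter(incidents)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return {category: counts[category] for category in CATEGORIES}


# Hypothetical incident log for a single night (invented data).
night = ["late", "late", "lost", "wrong_address", "late", "damaged"]

counts = nightly_counts(night)
total = sum(counts.values())
print(counts)  # {'lost': 1, 'damaged': 1, 'late': 3, 'wrong_address': 1}
print(total)   # 6
```

The totals do not depend on how many items were handled that night, which is the point of the measurement: the absolute number of errors stays visible no matter how large the volume grows.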

In many ways, this experience with service delivery metrics at Federal Express relates closely to availability metrics in IT infrastructures. A small start-up shop may initially offer online services only on weekdays for 10 hours a day and target 99% availability. The 1% against the 50 scheduled hours allows for 0.5 hour of downtime per week. If the company grows to the point of offering similar online services 24/7 with 99% availability, the allowable downtime grows to approximately 1.7 hours per week, more than three times as much, even though the percentage target is unchanged.

A better approach is to track the quantity of downtime occurring on a daily, weekly, and monthly basis. As was the case with FedEx, infrastructure personnel can pinpoint and proactively correct problem areas by analyzing the trends, patterns, and relationships of these downtimes. Robust infrastructures also track several of the major components comprising an online system. The areas most commonly measured are the server environment, the disk storage environment, databases, and networks.
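One way to track downtime quantities by component and by period is sketched below in Python. The record layout, the function name `downtime_by`, and every outage figure are assumptions invented for illustration; a real shop would draw these records from its own problem-management or monitoring data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical outage records: (day, component, minutes of downtime).
outages = [
    (date(2004, 7, 5), "server", 12),
    (date(2004, 7, 5), "network", 30),
    (date(2004, 7, 7), "database", 8),
    (date(2004, 7, 13), "network", 45),
    (date(2004, 8, 2), "disk storage", 20),
]


def downtime_by(records, period_key):
    """Sum downtime minutes for each (period, component) pair."""
    totals = defaultdict(int)
    for day, component, minutes in records:
        totals[(period_key(day), component)] += minutes
    return dict(totals)


def iso_week(day):
    """Return (ISO year, ISO week number) for a date."""
    return tuple(day.isocalendar())[:2]


daily = downtime_by(outages, lambda d: d)
weekly = downtime_by(outages, iso_week)
monthly = downtime_by(outages, lambda d: (d.year, d.month))

# Network downtime in July 2004 across both recorded outages.
print(monthly[((2004, 7), "network")])  # 75
```

Keeping the component in the key makes it possible to see, for example, that the network accounts for most of a month's downtime, which is the kind of pattern the trend analysis described above is meant to surface.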

The tendency of many service suppliers to measure their availability in percentages of uptime is sometimes referred to as the rule of nines. Nines are continually added to the target availability goal, as shown in Table 1. The table shows how the weekly hours and minutes of allowable downtime change for our example of the online system with 100 scheduled weekly hours, and how the number of allowable undelivered items changes for our FedEx example.

Table 1. Rule of Nines Availability Percentage

Number of Nines	Percentage of Availability	Weekly Hours Down	Weekly Minutes Down	Daily Packages Not Delivered (out of 10K)	Daily Packages Not Delivered (out of 1M)
1	90.000%	10.000	600.00	1,000.0	100,000.0
2	99.000%	1.000	60.00	100.0	10,000.0
3	99.900%	0.100	6.00	10.0	1,000.0
4	99.990%	0.010	0.60	1.0	100.0
5	99.999%	0.001	0.06	0.1	10.0
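The rows of Table 1 follow directly from the number of nines: each added nine divides the allowable failure fraction by ten. The Python sketch below recomputes the table under that reading; the function name `rule_of_nines_row` and the dictionary keys are assumptions made for this example.

```python
SCHEDULED_HOURS_PER_WEEK = 100  # the online system from the earlier example


def rule_of_nines_row(nines: int) -> dict:
    """Compute one row of the rule-of-nines table for a given number of nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailable = 10.0 ** -nines  # fraction of time (or of items) allowed to fail
    hours_down = SCHEDULED_HOURS_PER_WEEK * unavailable
    return {
        "nines": nines,
        "availability_pct": 100.0 * (1.0 - unavailable),
        "weekly_hours_down": hours_down,
        "weekly_minutes_down": hours_down * 60,
        "missed_of_10k": 10_000 * unavailable,
        "missed_of_1m": 1_000_000 * unavailable,
    }


for n in range(1, 6):
    row = rule_of_nines_row(n)
    print(
        f"{row['nines']}\t{row['availability_pct']:.3f}%\t"
        f"{row['weekly_hours_down']:.3f}\t{row['weekly_minutes_down']:.2f}\t"
        f"{row['missed_of_10k']:,.1f}\t{row['missed_of_1m']:,.1f}"
    )
```

Changing `SCHEDULED_HOURS_PER_WEEK` to 168 shows the same targets for a 24/7 system.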

In part three of this series I will discuss characteristics that I feel are essential for obtaining maximum availability. They all coincidentally happen to start with the same letter, leading me to refer to them as the seven Rs of high availability.

References

Schiesser, Rich, IT Systems Management, Prentice Hall, 2002
