Step 1: Launching a Phase 0 HA Assessment
The hardest part of getting a Phase 0 HA assessment started is rounding up the right resources to pull it off well. This effort is so critical to your company’s existence that you are going to want to use your best folks. In addition, timing is everything. Ideally, you should launch a Phase 0 HA assessment before you have gotten too far down the path on a new system’s development. If the system is already in place, put all the attention you can on completing this assessment as accurately and completely as possible.
Resources for a Phase 0 HA Assessment
For a Phase 0 HA assessment, you need to assemble two or three resources (professionals) with the ability to properly understand and capture the technical components of your environment, along with the business drivers behind the application being assessed. Again, these should be some of the best folks you have. If your best folks don’t have enough bandwidth to take this on, then get outside help; don’t settle for less-skilled employees. The time and budget that this assessment requires are small compared to the far-reaching impact of its results. These are the types of people and their skill sets you should include in the Phase 0 assessment:
A system architect/data architect (SA/DA)—You want someone with both extensive system design and data design experience who will be able to understand the hardware, software, and database aspects of high availability.
A very senior business analyst (SBA)—This person must be completely versed in development methodologies and the business requirements that are being targeted by the application (and by the assessment).
A part-time senior technical lead (STL)—You want a software engineer type with good overall system development skills so that he or she can help in assessing the coding standards that are being followed, the completeness of the system testing tasks, and the general software configuration that has been (or will be) implemented.
The Phase 0 HA Assessment Tasks
After you have assembled a team, you can start on the assessment, which is broken down into several tasks that will yield the different critical pieces of information needed to determine the correct high availability solution. Some tasks are used when you are assessing existing systems; these tasks might not apply to a system that is brand new.
The vast majority of Phase 0 HA assessments are conducted for existing systems. This suggests that most folks are retrofitting their applications to be more highly available after they have been implemented. Of course, it would have been best to identify and analyze the high availability requirements of an application during development in the first place.
A few of the tasks described here may not be needed to determine the correct HA solution. However, I have included them for the sake of completeness, and they often help form a more complete picture of the environment and processing that is being implemented. Remember, this type of assessment becomes a valuable depiction of what you were trying to achieve based on what you were being asked to support. Salient points within each task are outlined as well. Let’s dig into these tasks:
Task 1—Describe the current state of the application. This involves the following points:
Data (data usage and physical implementation)
Process (business processes being supported)
Technology (hardware/software platform/configuration)
Backup/recovery procedures
Standards/guidelines used
Testing/QA process employed
Service level agreement (SLA) currently defined
Level of expertise of personnel administering system
Level of expertise of personnel developing/testing system
Task 2—Describe the future state of the application. This involves the following points:
Data (data usage and physical implementation, data volume growth, data resilience)
Process (business processes being supported, expanded functionality anticipated, and application resilience)
Technology (hardware/software platform/configuration, new technology being acquired)
Backup/recovery procedures being planned
Standards/guidelines used or being enhanced
Testing/QA process being changed or enhanced
SLA desired from here on out
Level of expertise of personnel administering system (planned training and hiring)
Level of expertise of personnel developing/testing system (planned training and hiring)
Task 3—Describe the unplanned downtime reasons at different intervals (past 7 days, past month, past quarter, past 6 months, past year).
Task 4—Describe the planned downtime reasons at different intervals (past 7 days, past month, past quarter, past 6 months, past year).
Task 5—Calculate the availability percentage across different time intervals (past 7 days, past month, past quarter, past 6 months, past year). (Refer to Chapter 1 for this complete calculation.)
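As a rough illustration of Tasks 3 through 5, the following sketch totals planned and unplanned downtime over each lookback window and derives an availability percentage from it. It assumes the common definition of availability as (total hours - downtime hours) / total hours, which may differ in detail from the Chapter 1 calculation. The `Outage` record layout, the function names, and the day counts used for "month," "quarter," and so on are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """One downtime event (hypothetical record layout for this sketch)."""
    start: datetime
    end: datetime
    planned: bool   # True for planned downtime (Task 4), False for unplanned (Task 3)
    reason: str


# Lookback windows named in Tasks 3-5; the day counts are approximations.
WINDOWS = {
    "past 7 days": 7,
    "past month": 30,
    "past quarter": 91,
    "past 6 months": 182,
    "past year": 365,
}


def downtime_hours(outages, window_start, window_end, planned=None):
    """Sum the outage hours that overlap the window.

    Pass planned=True or planned=False to count only one kind of downtime;
    leave it as None to count both.
    """
    total = timedelta()
    for outage in outages:
        if planned is not None and outage.planned != planned:
            continue
        overlap_start = max(outage.start, window_start)
        overlap_end = min(outage.end, window_end)
        if overlap_end > overlap_start:
            total += overlap_end - overlap_start
    return total.total_seconds() / 3600


def availability_pct(outages, now, days):
    """Assumed formula: (total hours - downtime hours) / total hours * 100."""
    total_hours = days * 24
    down = downtime_hours(outages, now - timedelta(days=days), now)
    return (total_hours - down) / total_hours * 100


def availability_report(outages, now):
    """Availability percentage for every lookback window."""
    return {label: availability_pct(outages, now, days)
            for label, days in WINDOWS.items()}
```

For example, a single 2-hour outage inside the past 7 days (168 hours) works out to roughly 98.81% availability for that window. Counting planned downtime against availability is a choice made in this sketch; if your SLA excludes planned maintenance, pass only the unplanned outages.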
Task 6—Calculate the loss due to downtime. This involves the following points:
Revenue loss (per hour of unavailability)—For example, in an online order entry system, look at any peak order entry hour and calculate the total order amounts for that peak hour. This will be your revenue loss per hour value.
Productivity dollar loss (per hour of unavailability)—For example, in an internal financial data warehouse that is used for executive decision support, calculate the length of time that this data mart/warehouse was not available within the past month or two and multiply this by the number of executives/managers who were supposed to be querying it during that period. This is the “productivity effect.” Multiply this by the average salary of these execs/managers, expressed as an hourly rate so that the units match, to get a rough estimate of productivity dollar loss. This estimate does not account for the bad business decisions they might have made without their data mart/warehouse available, or for the dollar loss of those decisions. Calculating a productivity dollar loss might be a bit aggressive for this assessment, but there needs to be something to measure against and to help justify the return on investment. For applications that are not productivity applications, this value is not calculated.
Goodwill dollar loss (in terms of customers lost per hour of unavailability)—It’s extremely important to include this component. Goodwill loss can be measured by taking the average number of customers for a period of time (such as last month’s online order customer average) and comparing it with a period of processing following a system failure (where there was a significant amount of downtime). Chances are that there was a drop-off in customers, and that drop-off can be attributed to goodwill loss (that is, online customers didn’t come back to you; they went to the competition). You must then take that percentage drop-off (for example, 2%) and multiply it by the peak order amount averages for the defined period. Treat the resulting per-period loss as a recurring overhead value that should be included in the ROI calculation for every month.
The loss due to downtime might be difficult to calculate, but it will help in any justification process for the purchase of HA-enabling products and in the measurement of ROI.
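The three loss components in Task 6 reduce to simple arithmetic, which the following sketch spells out. It is a minimal illustration, not a definitive costing model: the function names and parameters are hypothetical, and the productivity calculation assumes that salary is expressed as an hourly rate.

```python
def revenue_loss_per_hour(peak_hour_order_amounts):
    """Revenue loss per hour: total order amounts taken during one peak hour."""
    return sum(peak_hour_order_amounts)


def productivity_loss(hours_unavailable, affected_people, avg_hourly_salary):
    """Productivity dollar loss.

    hours_unavailable * affected_people is the "productivity effect";
    multiplying by an hourly salary rate (an assumption of this sketch)
    turns it into dollars.
    """
    return hours_unavailable * affected_people * avg_hourly_salary


def goodwill_loss(baseline_customers, post_failure_customers, peak_order_amount_avg):
    """Goodwill dollar loss for one period.

    The drop-off is the fraction of customers who did not return after a
    failure, compared with the baseline period. It is clamped at zero so
    that customer growth is not reported as a negative loss.
    """
    if baseline_customers <= 0:
        raise ValueError("baseline_customers must be positive")
    drop_off = max(
        0.0, (baseline_customers - post_failure_customers) / baseline_customers
    )
    return drop_off * peak_order_amount_avg
```

With made-up figures, 4 hours of unavailability affecting 12 managers at $90 per hour gives a productivity loss of $4,320, and a drop from 5,000 to 4,900 customers (2%) against a $400,000 peak order amount average gives a goodwill loss of about $8,000 per period.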
Once you have completed these tasks, you are ready to move on to step 2: gauging the HA primary variables.