
Step 3—Determining the Optimal HA Solution

Once you have completed step 1 and step 2, the hard part is over: you probably have enough information to make a very good high availability solution choice. If the information from these two steps is spotty or incomplete, the decision you make will be somewhat suspect, but it may still be good enough to get you most of what you are trying to achieve.

Step 3 will draw on a formal deterministic approach that takes the assessment results and gauge information and will yield the right HA solution for your requirements.

A Hybrid High Availability Selection Method

There are many potential selection methods that could be used to help in selecting the best HA solution. There are scoring methods, decision-tree based methods, and simple estimation methods. We like a hybrid decision-tree based method we have evolved over the past few years that will use the primary variable answers to guide you to an HA solution.

With any selection method for determining an HA solution, there will be several possible high availability answers, one of which will be that NO high availability solution is needed. The general cost and administrative complexity of each solution are also described. As new HA solutions are identified in the industry (or by a particular vendor, such as Microsoft), this list can be expanded. But, for now, this book will focus on the following (as seen in Figure 3.2):

  • Disk methods—Disk mirroring, RAID, and so on. Characteristics: High $ cost and low administration complexity.

  • Other hardware—Redundant power supplies, fans, CPUs, and so on (many of which are hot-swappable). Characteristics: High $ cost and low administration complexity.

  • Cluster services—Microsoft Cluster Services allows two (or more) systems to fail-over to each other in a passive/active or active/active mode. Characteristics: High $ cost and moderate administration complexity.

  • SQL clustering—Fully enables a Microsoft SQL Server instance to fail-over to another system using MSCS (SQL Server 2000 is cluster aware). Characteristics: High $ cost and moderate administration complexity.

  • Data replication—Primarily using "transactional replication," which redundantly distributes transactions from one SQL Server database to another with very little delay (near real time). Both SQL Server databases (on different servers) can be used for workload balancing as well. Some limitations exist but are very manageable. Characteristics: Moderate $ cost and moderate administration complexity.

  • Log shipping—The direct application of SQL Server database transaction log entries to a warm standby copy of the same SQL Server database. There are some limitations with this technique. Characteristics: Moderate $ cost and low administration complexity.

  • Distributed transactions—Application controlled methods via programming and distributed transaction techniques (potentially using MS-DTC, and two-phase commit approaches) to redundantly create and manage data in alternate (redundant) locations. Any location becomes available if the other should fail. Characteristics: Low $ cost and high administration complexity.

  • NO high availability solution needed.
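The ratings in this list can be captured in a small lookup structure. The following is only a sketch: the `HAOption` class, its field names, and the `options_with` helper are illustrative assumptions, and the ratings are transcribed directly from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HAOption:
    name: str
    cost: str        # relative $ cost: "low", "moderate", or "high"
    complexity: str  # relative administration complexity


# Ratings transcribed from the list above; names are illustrative only.
HA_OPTIONS = [
    HAOption("Disk methods", "high", "low"),
    HAOption("Other hardware", "high", "low"),
    HAOption("Cluster services", "high", "moderate"),
    HAOption("SQL clustering", "high", "moderate"),
    HAOption("Data replication", "moderate", "moderate"),
    HAOption("Log shipping", "moderate", "low"),
    HAOption("Distributed transactions", "low", "high"),
]


def options_with(cost=None, complexity=None):
    """Return option names matching the given ratings (None means 'any')."""
    return [
        o.name
        for o in HA_OPTIONS
        if (cost is None or o.cost == cost)
        and (complexity is None or o.complexity == complexity)
    ]
```

For example, `options_with(cost="moderate")` returns the two moderate-cost options, data replication and log shipping.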

Figure 3.2 also illustrates the options that are typically valid together or by themselves. Any options that are diagonally shaded can be used together (disk methods + other HW + MSCS + SQL clustering, and so on). Crosshatched shaded intersections indicate options that must be done together (MSCS + SQL clustering), and vertical-lined intersections indicate that these are NOT typically done together (log shipping is typically not done with data replication, and so on).

Figure 3.2 Valid high availability options.

As has already been pointed out, some of these possible solutions actually include the other (for example, SQL clustering is built on top of MSCS). This will be factored into the results of the selection.
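The two pairing rules called out in the text can be checked mechanically. This sketch assumes one reading of them: SQL clustering requires MSCS (because it is built on top of it), and log shipping is not typically combined with data replication. The full matrix lives in Figure 3.2 and is not reproduced here, and the names `REQUIRES`, `DISCOURAGED`, and `check_combination` are hypothetical.

```python
# Only the two pairing rules named in the text are modeled here.
REQUIRES = {"SQL clustering": {"Cluster services"}}   # must be done together
DISCOURAGED = [{"Log shipping", "Data replication"}]  # not typically combined


def check_combination(selected):
    """Return a list of warnings for a proposed set of HA option names."""
    selected = set(selected)
    warnings = []
    for option, needed in REQUIRES.items():
        missing = needed - selected
        if option in selected and missing:
            warnings.append(f"{option} requires {', '.join(sorted(missing))}")
    for pair in DISCOURAGED:
        if pair <= selected:
            warnings.append(
                f"{' + '.join(sorted(pair))} are not typically used together"
            )
    return warnings
```

An empty list means the combination breaks neither rule; it does not mean the combination is the right answer for a given set of requirements.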

The Decision-Tree Approach for Choosing an HA Solution

The decision-tree approach takes the high availability information gathered in the Phase 0 assessment and traverses a particular path (decision-tree) to an appropriate HA solution. In this case, we have chosen a hybrid decision-tree technique that uses Nassi-Schneiderman charts, which are well suited to depicting complex questions and yield very specific results. We won't be using all of the Nassi-Schneiderman chart techniques, only the Condition/Question part. As Figure 3.3 shows, a Nassi-Schneiderman chart will be in the form of

  • Condition/Question—For which you need to decide an answer.

  • Cases—Any number of known cases (answers) that the question might have (Case A, Case B...Case n).

  • Action/Result—The specific result or action to be followed depending on the case chosen (Result A, Result B...Result n).

Each question will always be considered in context of all questions answered before it. You are essentially navigating your way down a complex tree structure that will yield a definitive HA solution.
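The Condition/Question, Cases, and Action/Result structure maps naturally onto a small data type. The sketch below is an illustration only: the `Question` class and `traverse` function are hypothetical names, and the example includes only the two availability cases that appear in this chapter's scenarios, not the full set of cases from the chart.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    cases: dict  # case letter -> result/action label

    def answer(self, case):
        """Return the result for the chosen case, or raise if it is unknown."""
        if case not in self.cases:
            raise ValueError(f"unknown case {case!r} for: {self.text}")
        return self.cases[case]


# Only cases D and E are shown in the scenarios; other cases are omitted.
availability = Question(
    "What percentage of availability must your application have?",
    {"D": "High availability goal", "E": "Extreme availability goal"},
)


def traverse(questions, chosen_cases):
    """Walk the questions in order, carrying every earlier result forward."""
    path = []
    for question, case in zip(questions, chosen_cases):
        path.append((question.text, case, question.answer(case)))
    return path
```

Because `traverse` returns the whole path rather than just the last result, each later question can be read in the context of every answer before it.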

The questions will be ordered in a way so that they will clearly flesh out specific needs and push you in a specific high availability direction. Figure 3.4 illustrates an example of a question put into the Nassi-Schneiderman construct. The question is "What percentage of availability must your application have?" (for its scheduled time of operation).

If you have completed enough of the Phase 0 assessment, the answer to this question should be easy to come up with. This will also be a good audit or validation of your Phase 0 assessment.

Figure 3.3 Hybrid decision-tree using Nassi-Schneiderman charts.

Figure 3.4 Nassi-Schneiderman example question.

In the normal course of events, we will start with the most critical aspects of high availability first. Then, depending on the answer to each question, we proceed down a specific path to a new question. As each high availability characteristic is considered, the path (actions followed) will lead you to a specific HA solution. The series of questions that need to be answered is taken from the HA Primary Variables Gauge but is expanded slightly to make the questions conditional in nature. These are

  1. What percentage of time must the application remain up during its scheduled time of operation? (The goal!)

  2. How much tolerance does the end-user have when the system is not available (planned or unplanned unavailability)?

  3. What is the per hour cost of downtime for this application?

  4. How long does it take to get the application back online following a failure (of any kind)? (Worst case!)

  5. How much of the application is distributed and will require some type of synchronization with other nodes before all nodes are considered to be 100% available?

  6. How much data inconsistency can be tolerated in favor of having the application available?

  7. How often is scheduled maintenance required for this application (and environment)?

  8. How important is high performance and scalability?

  9. How important is it for the application to keep its current connection alive with the end-user?

  10. What is the estimated cost of a possible high availability solution? What is your budget?

Design Note

One other important factor that may come into play is the timeline you have for making an application highly available. If the timeline is very short, then your solution may exclude costs as a barrier and may not even consider hardware solutions that take months to order and install. So, it would be more than appropriate to expand the Primary Variables Gauge to include this question (and any others) as well. This particular question could be "What is your timeline for making your application highly available?"

However, this book will assume that you have a reasonable amount of time to properly assess your application.

It is also assumed that if you have written (or are planning to write) an application that will be "cluster aware," you can leverage MSCS. This would be considered to be an implementation of "application" clustering (an application that is cluster aware). As mentioned before, SQL Server is a cluster aware program. However, we don't consider it to be "application" clustering in the strictest sense; it is "database" clustering.

Scenario 1: Application Service Provider (ASP) Assessment

To drive home the decision-tree method, we will proceed down a complete path (decision-tree) for the application service provider (ASP) business scenario (Scenario #1). We will answer the questions based on an already completed Phase 0 HA assessment for it. As you recall, Scenario #1 centers on a very real ASP and their operating model. This ASP houses (and develops) numerous global, web-based online order entry systems for several major beauty and health product companies in the world. Their customer base is truly global (as the earth turns, the user base accessing the system shifts with it). They are headquartered in California and this ASP guarantees 99.5% uptime to their customers. In this case, the customers are sales associates and their sales managers. If the ASP achieves these guarantees, they get significant bonuses; if they fall below certain thresholds, they are liable for specific penalties. The processing mix of activity is approximately 65% online order entry and approximately 35% reporting.

Availability:

  • 24 hours per day

  • 7 days per week

  • 365 days per year

Planned Downtime: .25% (less than 1%)

Unplanned Downtime: .25% (less than 1%) will be tolerable
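To make these percentages concrete, they can be converted into hours per year. This is a minimal sketch of the arithmetic; the function names are illustrative, and the inputs are the 24x365 schedule and the two .25% figures stated above.

```python
def scheduled_hours(hours_per_day, days_per_year):
    """Total hours the application is scheduled to be available in a year."""
    return hours_per_day * days_per_year


def downtime_hours(scheduled, percent):
    """Hours of downtime represented by a percentage of scheduled time."""
    return scheduled * percent / 100.0


asp_scheduled = scheduled_hours(24, 365)               # 8,760 hours per year
asp_planned = downtime_hours(asp_scheduled, 0.25)      # about 21.9 hours
asp_unplanned = downtime_hours(asp_scheduled, 0.25)    # about 21.9 hours
asp_uptime_percent = 100.0 - 0.25 - 0.25               # 99.5, the guaranteed uptime
```

In other words, the ASP's 99.5% guarantee leaves under two days per year for planned and unplanned downtime combined.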

Figure 3.5 depicts the first three questions in the decision tree and their corresponding responses (actions). Remember, these are cumulative. Each new question carries along the responses of the preceding questions. Your responses, taken together, determine the HA solution that best fits. Let's proceed through the ASP business scenario depiction to give you a feel of how this works.

HA Assessment (Decision-Tree):

  1. What percentage of time must the application remain up during its scheduled time of operation? (The goal!)

  2. Response: E: 99.5% -> Extreme availability goal.

  3. How much tolerance does the end-user have when the system is not available (planned or unplanned unavailability)?

  4. Response: E: Very low tolerance of downtime -> Extremely Critical.

  5. What is the per hour cost of downtime for this application?

  6. Response: D: $15K/hour cost of downtime -> High Cost.

    Figure 3.5 Decision-tree: ASP questions 1–3.

    Remember, all questions are additive. So, after going through just three questions, we see that this ASP business scenario has a pretty high cost per hour when it is not available (a .5% per hour cost [total gross revenues of $3 billion]). Coupled with high uptime goals and extremely low end-user tolerance for downtime (because of the nature of the ASP business), this cost will drive the application to a particular type of HA solution very quickly. We could easily stop now and jump to an HA solution of maximum hardware redundancy, RAID, MSCS, and SQL clustering in order to fulfill the HA requirement goals; however, several aspects of the requirements remain, such as distributed data processing and the budget available for the HA solution, that could easily change this outcome. You should always complete the entire set of questions for clarity, consistency, completeness, and cost justification purposes.

    Figure 3.6 forges ahead with the next set of questions and answers.

    Figure 3.6 Decision-tree: ASP questions 4–6.

  7. How long does it take to get the application back online following a failure (of any kind)? (Worst case!)

  8. Response: C: Average -> Standard amount of time to recover (to get back online). Accomplished via standard DB recovery mechanisms (incremental transaction log dumps done every 15 minutes). Faster recovery times would be beneficial but data integrity is of huge importance.

  9. How much of the application is distributed and will require some type of synchronization with other nodes before all nodes are considered to be 100% available?

  10. Response: A: None -> There aren't any components of this application that are distributed (non-distributed). This simplifies the data synchronization aspects to consider, but does not necessarily mean that a distributed HA solution won't better serve the overall application. If the application has a heavy reporting component, some type of data replication architecture could serve it well. This will be addressed in the performance/scalability question later.

  11. How much data inconsistency can be tolerated in favor of having the application available?

  12. Response: B: A little -> A high degree of data consistency must be maintained at all times. This gives little room for any HA option that would sacrifice data consistency in favor of availability.

    For systems with primarily static data, complete images of the application databases could be kept at numerous locations for instantaneous access whenever users need the data, with little danger of data inconsistency (if administered properly). For systems with a lot of data volatility, the answer to this one question may well dictate the HA option to use. Very often the HA option best suited for high data consistency needs is SQL clustering and log shipping, coupled with RAID at the disk subsystem level.

    Another short pause in our path to an HA solution finds us not having to support a complex distributed environment, but having to make sure we keep our data consistent as much as possible. In addition, we can plan on typical recovery times to get the application back on line in case of failures (this was probably stated this way in the ASP's service level agreement). However, if a faster recovery mechanism is possible, it should be considered because it will have a direct impact on the total amount of unplanned downtime, and could potentially allow the ASP to get some uptime bonuses (that might also be in the SLA). Now, let's venture into the next set of questions as illustrated in Figure 3.7. These focus on planned downtime, performance, and application connectivity perception.

  13. How often is scheduled maintenance required for this application (and environment)?

  14. Response: C: Average -> A reasonable amount of downtime will occur to service operating system patches/upgrades, hardware changes/swapping, and application patches/upgrades. For 24x7 systems, this will cut into the availability time directly. For systems with downtime windows, the pressure is much less in this area.

    Figure 3.7 Decision-tree: ASP questions 7–9.

  15. How important is high performance and scalability?

  16. Response: D: Very much -> The ASP considers all of its applications to be high-performance systems that must meet strict performance numbers and be able to scale to support a large number of users. These performance thresholds would be spelled out clearly in the service level agreement. Any HA solution must therefore be a scalable solution as well.

  17. How important is it for the application to keep its current connection alive with the end-user?

  18. Response: B: Somewhat -> At the very least, the ability to establish a new connection to the application within a short amount of time will be required. No client connection fail-over is required. This was partially made possible by the overall transactional approach of "optimistic concurrency" used by their applications. This approach puts much less pressure on holding a connection for long periods of time (to hold/lock rows).

    As you can see in Figure 3.8, the estimated cost of a potential HA solution would be between $100K and $250K. Budget for HA should be estimated to be a couple of full days' worth of downtime cost. For our ASP example, this would be roughly $720K (48 hours at $15K per hour). The ROI calculation will show how quickly this will be recovered.

    A bit later in this chapter, we will work through a complete ROI calculation so that you can fully understand where these values come from.

    Figure 3.8 Decision-tree: ASP question 10 and HA solution.

  19. What is the estimated cost of a possible high availability solution? What is your budget?

  20. Response: C: $100K <= C$ < $250K -> This is a moderate amount of cost for potentially a huge amount of benefit. They are estimating

    • Five new four-way servers with 4GB RAM at $30K per server

    • Ten MS Windows 2000 Advanced Server licenses

    • Five shared SCSI disk systems with RAID 10 (50 drives)

    • Five days of additional training costs for personnel

    • No new SQL Server licenses because of the plan to operate in an active/passive clustering mode
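The budget rule of thumb from this walk-through (a couple of full days' worth of downtime cost) and the one priced line item in the estimate can be checked with a few lines of arithmetic. This sketch assumes "a couple of full days" means two 24-hour days, and it prices only the servers, because the text gives no dollar figures for the licenses, disk systems, or training; the names are illustrative.

```python
HOURS_PER_DAY = 24


def ha_budget_estimate(cost_per_hour, days=2):
    """Rule-of-thumb HA budget: a couple of full days of downtime cost."""
    return cost_per_hour * HOURS_PER_DAY * days


asp_budget = ha_budget_estimate(15_000)    # 720,000 at $15K/hour for 48 hours

# Only the servers carry a price in the estimate above.
server_cost = 5 * 30_000                   # 150,000 for five four-way servers
servers_in_case_c = 100_000 <= server_cost < 250_000   # True: within the case C range
```

The priced servers alone land inside the $100K to $250K range chosen in the response, well below the rule-of-thumb budget of $720K.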

The HA Solution for Scenario #1

Figure 3.8 also shows the final selection of hardware redundancy, shared disk RAID arrays, MSCS, and SQL clustering as the best fitting HA solution (all four, together). There is little doubt about the needs being met well by this particular set of HA solutions. It clearly meets all of the most significant requirements of uptime, tolerance, performance, and costs. The lack of distributed data or data synchronization pointed this away from distributed transaction techniques such as data replication or distributed applications. Log shipping might have helped but is not transparent enough to the application. Their SLA allows for brief amounts of downtime to service all OS, hardware, and application upgrades. Figure 3.9 shows the live HA solution technical architecture. Budget allowed for a larger amount of hardware redundancy to be utilized.

Once this HA solution was put into place, the ASP achieved nearly five 9s for extended periods of time (exceeding their original goal of 99.5% uptime). One additional note: the ASP also spreads out its risk to further reduce downtime caused by application and shared hardware failures. They put at most two to three applications on a particular clustered solution (refer back to Figure 2.11 in Chapter 2, "Microsoft High Availability Options," for a more complete depiction of this risk mitigation approach).
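To see how far apart 99.5% and five 9s really are, it helps to convert each into allowed downtime per year. This is a small sketch of that arithmetic for a 24x365 schedule; the function name is illustrative, and leap years are ignored.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted at a given availability level."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


goal_minutes = allowed_downtime_minutes(99.5)          # 2,628 minutes (43.8 hours)
five_nines_minutes = allowed_downtime_minutes(99.999)  # a little over 5 minutes
```

A 99.5% goal tolerates almost 44 hours of downtime per year, whereas five 9s tolerates only about five minutes.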

Figure 3.9 ASP HA solution technical architecture.

If you aren't quite getting the idea of how the decision-tree approach works, it can also be illustrated in a slightly different way. Figure 3.10 depicts an abbreviated bubble chart technique of this decision-tree path traversal.

Figure 3.10 Bubble chart decision-tree path traversal.

Remember, each question takes into context all questions before it. The result is a specific HA solution that best meets your business requirements.
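The ASP path can also be written down as plain data, which makes it easy to compare with other scenarios. In this sketch the case letters, labels, and resulting solution are transcribed from the Scenario 1 walk-through; the short topic names and the helper functions are illustrative, and no selection rules are implemented (those live in the full decision-tree).

```python
# ASP responses from the Scenario 1 walk-through: (topic, case, label).
ASP_PATH = [
    ("Uptime goal", "E", "99.5%: extreme availability goal"),
    ("End-user tolerance", "E", "Very low tolerance: extremely critical"),
    ("Cost of downtime", "D", "$15K/hour: high cost"),
    ("Time to recover", "C", "Average"),
    ("Distribution", "A", "None"),
    ("Data inconsistency tolerated", "B", "A little"),
    ("Scheduled maintenance", "C", "Average"),
    ("Performance and scalability", "D", "Very much"),
    ("Connection persistence", "B", "Somewhat"),
    ("Estimated cost", "C", "$100K to under $250K"),
]

# Outcome reported in the text for this path (Figure 3.8).
ASP_SOLUTION = ["Hardware redundancy", "Shared disk RAID arrays", "MSCS", "SQL clustering"]


def path_signature(path):
    """Collapse a path into its case letters, e.g. for comparing scenarios."""
    return "-".join(case for _, case, _ in path)


def describe(path, solution):
    """Format a path and its solution as a numbered, readable summary."""
    lines = [
        f"{i}. {topic}: {case} ({label})"
        for i, (topic, case, label) in enumerate(path, start=1)
    ]
    lines.append("Solution: " + " + ".join(solution))
    return "\n".join(lines)
```

Calling `print(describe(ASP_PATH, ASP_SOLUTION))` lists the ten answers in order followed by the four-part solution.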

Design Note

A full decision-tree explosion (Complete HA Decision-Tree) that has all questions and paths defined is available in MS Excel document form on the Sams Publishing website.

In addition, a blank Nassi-Schneiderman chart and an HA Primary Variables Gauge are also available in a single PowerPoint document.

Scenario 2: Worldwide Sales and Marketing (Brand Promotion) Assessment

As you recall, this scenario is about a major chip manufacturer that has created a highly successful promotion and branding program, which results in billions of advertising dollars being rebated back to their worldwide sales channel partners. These sales channel partners must enter their complete advertisements (newspaper, radio, TV, other) and be measured against ad compliance and logo usage and placements. If a sales channel partner is in compliance, they will receive up to 50% of the cost of their advertisement back from this chip manufacturer. There are three major advertising regions: Far East, Europe, and North America. Any advertisements outside of these three are lumped into an "Other Regions" bucket. Each region produces a huge daily load of new advertisement information that is processed instantaneously for compliance. Each major region only deals with that region's advertisements, but receives the compliance rules and compliance judgment from the chip manufacturer's headquarters. Application mix is approximately 75% online entry of advertisement events and 25% management and compliance reporting.

Availability:

  • 24 hours per day

  • 7 days a week

  • 365 days a year

Planned Downtime: 3%

Unplanned Downtime: 2% will be tolerable

HA Assessment (Decision-Tree):

  1. What percentage of time must the application remain up during its scheduled time of operation? (The goal!)

  2. Response: D: 95.0% -> High availability goal. This, however, is not a super critical application in terms of keeping the company running (like an order entry system would be).

  3. How much tolerance does the end-user have when the system is not available (planned or unplanned unavailability)?

  4. Response: C: Medium tolerance of downtime -> Standard criticality.

  5. What is the per hour cost of downtime for this application?

  6. Response: B: $5K/hour cost of downtime -> Low cost.

    As we can see so far, this sales and marketing application is nice to have available, but it can tolerate some downtime without hurting the company very much. Sales are not lost; work just gets backed up a bit. In addition, the cost of downtime is reasonably low at $5K/hr. This is roughly the rate at which advertisement reimbursements take place.

  7. How long does it take to get the application back online following a failure (of any kind)? (Worst case!)

  8. Response: C: Average -> Standard amount of time to recover (to get back online). Accomplished via standard DB recovery mechanisms (incremental transaction log dumps done every 15 minutes).

  9. How much of the application is distributed and will require some type of synchronization with other nodes before all nodes are considered to be 100% available?

  10. Response: D: High Distribution -> This is a global application that relies on data being created and maintained at headquarters from around the world (OLTP activity) but must also support heavy regional reporting (reporting activity) that doesn't interfere with the performance of the OLTP activity.

  11. How much data inconsistency can be tolerated in favor of having the application available?

  12. Response: B: A little -> A high degree of data consistency must be maintained at all times. This gives little room for any HA option that would sacrifice data consistency in favor of availability. This is regionally sensitive, in that when data is being updated by Europe, the Far East doesn't need to get their data updates right away.

  13. How often is scheduled maintenance required for this application (and environment)?

  14. Response: C: Average -> A reasonable amount of downtime will occur to service operating system patches/upgrades, hardware changes/swapping, and application patches/upgrades.

  15. How important is high performance and scalability?

  16. Response: D: Very much -> Performance (and scalability) are very important for this application. Ideally, an overall approach of separating the OLTP activity from the reporting activity will pay big dividends towards this.

  17. How important is it for the application to keep its current connection alive with the end-user?

  18. Response: B: Somewhat -> At the very least, the ability to establish a new connection to the application within a short amount of time will be required. No client connection fail-over is required. In fact, for the worst case scenario of the headquarters database becoming unavailable, the OLTP activity could easily be shifted to any other full copy of the database that is being used for reporting and is being kept current (as will be seen in the HA solution for this scenario).

  19. What is the estimated cost of a possible high availability solution? What is your budget?

  20. Response: B: $10K <= C$ < $100K -> This is a pretty low amount of cost for potentially a huge amount of benefit. They are estimating

    • Three new two-way servers with 4GB RAM at $10K per server

    • Three new MS Windows 2000 Server licenses

    • Three SCSI disk systems with RAID 10 (15 drives)

    • Two days of additional training costs for personnel

    • Four new SQL Server licenses (remote distributor, three subscribers)

The HA Solution for Scenario #2

Figure 3.11 shows that a basic hardware/disk redundancy approach on each server should be used along with SQL Server's robust "transactional" data replication implementation to create three regional reporting images of the primary marketing database (MktgDB). These distributed copies alleviate the major reporting burden against the OLTP (primary database) and can also serve as a warm standby copy of the database in the event of a major database problem at headquarters. Overall, this distributed architecture is easy to maintain and keep in sync and is highly scalable, as seen in Figure 3.12.

Figure 3.11 Sales/marketing decision-tree summary + HA solution.

Figure 3.12 Sales/marketing HA solution technical architecture.

After building this HA solution, the uptime goal was achieved most of the time. Occasionally, there were some delays in resyncing the data at each regional site (subscribers). But, in general, the users were extremely happy with performance and availability.

Scenario 3: Investment Portfolio Management Assessment

This investment portfolio management application will be housed in a major server farm in the heart of the world's financial center: NY, NY. Serving North American customers only, this application provides the ability to do full trading of stocks and options in all financial markets (U.S. and international) along with full portfolio holdings assessment, historical performance, and holdings valuation. Primary users are investment managers for their large customers. Stock purchasing/selling comprises 90% of the daytime activity, with massive assessment, historical performance, and valuation reporting done after the markets have closed. Three major peaks occur each weekday that are driven by the three major trading markets of the world (United States, Europe, and the Far East). The weekends are filled with long-range planning reporting and front-loading stock trades for the coming week.

Availability:

  • 20 hours per day

  • 7 days per week

  • 365 days per year

Planned Downtime: 4%

Unplanned Downtime: 1% will be tolerable

HA Assessment (Decision-Tree):

  1. What percentage of time must the application remain up during its scheduled time of operation? (The goal!)

  2. Response: D: 95.0% -> High availability goal. This particular financial institution (one of the largest on the planet) tends to allow for a small percentage of "built-in" downtime (planned or unplanned). A smaller, more nimble financial institution would probably have slightly more aggressive uptime goals (like five 9s). Time is money, you know.

  3. How much tolerance does the end-user have when the system is not available (planned or unplanned unavailability)?

  4. Response: D: Low tolerance of downtime -> High criticality due to market timings (selling and buying stocks within market windows).

  5. What is the per hour cost of downtime for this application?

  6. Response: E: $150K/hour cost of downtime -> Very high cost. However, this is the worst-case scenario. When the markets are closed, the cost of downtime is marginal.

  7. How long does it take to get the application back online following a failure (of any kind)? (Worst case!)

  8. Response: E: Very short recovery -> This application's time to recover should be a very short amount of time (to get back online).

  9. How much of the application is distributed and will require some type of synchronization with other nodes before all nodes are considered to be 100% available?

  10. Response: C: Medium distribution -> This moderately distributed application has a large OLTP processing requirement and a large report processing requirement.

  11. How much data inconsistency can be tolerated in favor of having the application available?

  12. Response: A: Very little -> A very high degree of data consistency must be maintained at all times. This is financial data.

  13. How often is scheduled maintenance required for this application (and environment)?

  14. Response: C: Average -> A reasonable amount of downtime will occur to service operating system patches/upgrades, hardware changes/swapping, and application patches/upgrades.

  15. How important is high performance and scalability?

  16. Response: D: Very much -> Performance (and scalability) are very important for this application.

  17. How important is it for the application to keep its current connection alive with the end-user?

  18. Response: B: Somewhat -> At the very least, the ability to establish a new connection to the application within a short amount of time will be required. No client connection fail-over is required.

  19. What is the estimated cost of a possible high availability solution? What is your budget?

  20. Response: C: $100K <= C$ < $250K -> This is a moderate amount of cost for potentially a huge amount of benefit. They are estimating

    • Four new four-way servers with 8GB RAM at $50K per server

    • Four MS Windows 2000 Advanced Server licenses

    • Two shared SCSI disk systems with RAID 10 (30 drives)

    • Twelve days of additional training costs for personnel

    • Four new SQL Server licenses

    From the budget point-of-view, they had budgeted $1.25 million for all HA costs. A solid HA solution won't even approach those numbers.

Figure 3.13 shows the overall summary of the decision-tree results along with the HA solution for the portfolio management scenario.

Figure 3.13 Portfolio management decision-tree summary + HA solution.

The HA Solution for Scenario #3

As identified in Figure 3.13, we opt for the basic hardware/disk redundancy approach on each server, add on the MS Cluster Services and SQL Clustering for the primary database, then use data replication to offload the reporting load (and risk) to a secondary "reporting" server. There is now plenty of risk mitigation with this technical architecture, but it is not that difficult to maintain (as seen in Figure 3.14).

Once this HA solution was put together, it exceeded the high availability goals on a regular basis. Great performance has also resulted due to the splitting out of the OLTP from the reporting (very often a solid design approach).

Figure 3.14 Portfolio management HA solution technical architecture.

Scenario 4: Call Before You Dig Assessment

The last scenario is the Tri-State Underground Construction Call Center. This application will determine within 6 inches the likelihood of hitting any underground gas mains, water mains, electrical wiring, phone lines, or cables that might be present on a proposed dig site for construction. Law requires that a call be placed to this center to determine whether or not it is safe to dig and identify the exact location of any underground hazard BEFORE any digging has started. This is a "life at risk" classified application and must be available very nearly 100% of the time during common construction workdays (Monday through Saturday). Each year more than 25 people are killed nationwide digging into unknown underground hazards. Application mix is 95% query only with 5% devoted to updating images, geo-spatial values, and various pipe and cable location information provided by the regional utility companies.

Availability:

  • 15 hours per day (5:00 a.m.–8:00 p.m.)

  • 6 days per week (closed on Sunday)

  • 312 days per year

Planned Downtime: 0%

Unplanned Downtime: .5% (less than 1%) will be tolerable
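Because the four scenarios have different scheduled windows, the same downtime percentage translates into very different numbers of hours. The sketch below puts the availability figures stated for each scenario side by side; the dictionary layout and function name are illustrative assumptions.

```python
# (hours per day, days per year, planned downtime %, unplanned downtime %)
# as stated in each scenario's availability summary.
SCENARIOS = {
    "ASP": (24, 365, 0.25, 0.25),
    "Sales and marketing": (24, 365, 3.0, 2.0),
    "Portfolio management": (20, 365, 4.0, 1.0),
    "Call before you dig": (15, 312, 0.0, 0.5),
}


def summarize(name):
    """Return scheduled hours, uptime goal %, and tolerated unplanned hours."""
    hours_per_day, days_per_year, planned, unplanned = SCENARIOS[name]
    scheduled = hours_per_day * days_per_year
    return {
        "scheduled_hours": scheduled,
        "uptime_goal_percent": 100.0 - planned - unplanned,
        "unplanned_hours": scheduled * unplanned / 100.0,
    }


for scenario in SCENARIOS:
    print(scenario, summarize(scenario))
```

The call center is scheduled for only 4,680 hours per year, so its .5% unplanned allowance works out to roughly 23 hours, close to the ASP's allowance even though the ASP runs around the clock.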

HA Assessment (Decision-Tree):

  1. What percentage of time must the application remain up during its scheduled time of operation? (The goal!)

  2. Response: E: 99.5% -> Extreme availability goal. This is a "life critical" application. Literally, someone may get killed if information cannot be obtained from this system during its planned time of operation.

  3. How much tolerance does the end-user have when the system is not available (planned or unplanned unavailability)?

  4. Response: E: Very low tolerance of downtime -> In other words, this has extreme criticality from the end-user's point-of-view.

  5. What is the per hour cost of downtime for this application?

  6. Response: A: $2K/hour cost of downtime -> Very low dollar cost. Very high life cost. This one question is very deceiving. There is no limit to the cost of "loss of life." However, we must go with the original dollar costing approach. So, bear with us on this one. Hopefully, the outcome will be the same.

  7. How long does it take to get the application back online following a failure (of any kind)? (Worst case!)

  8. Response: E: Very short recovery -> This application's time to recover should be a very short amount of time (to get back online).

  9. How much of the application is distributed and will require some type of synchronization with other nodes before all nodes are considered to be 100% available?

  10. Response: A: None -> There is no data distribution requirement for this application.

  11. How much data inconsistency can be tolerated in favor of having the application available?

  12. Response: A: Very little -> A very high degree of data consistency must be maintained at all times. This data must be extremely accurate and up to date because incorrect information can be life-threatening.

  13. How often is scheduled maintenance required for this application (and environment)?

  14. Response: C: Average -> A reasonable amount of downtime will occur to service operating system patches/upgrades, hardware changes/swapping, and application patches/upgrades. This system has a planned time of operation of 15 hours per day, 6 days per week, 312 days per year, which leaves plenty of off-hours time for this type of maintenance. Thus 0% planned downtime but an average amount of scheduled maintenance.

  15. How important is high performance and scalability?

  16. Response: C: Moderate performance -> Performance (and scalability) isn't paramount for this application. The accuracy and availability of the information are most important.

  17. How important is it for the application to keep its current connection alive with the end-user?

  18. Response: B: Somewhat -> At the very least, the ability to establish a new connection to the application within a short amount of time will be required. No client connection fail-over is required.

  19. What is the estimated cost of a possible high availability solution? What is your budget?

  20. Response: B: $10K <= C$ < $100K -> This is a pretty low amount of cost.

    They are estimating

    • Three new four-way servers with 4GB RAM at $30K per server

    • Three new MS Windows 2000 Advanced Server licenses

    • One shared SCSI disk system with RAID 5 (10 drives)—this is primarily a read-only system (95% reads)

    • One SCSI disk system RAID 5 (5 drives)

    • Five days of additional training costs for personnel

    • Three new SQL Server licenses

    Budget for the whole HA solution is somewhat limited to under $100K as well.
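One way to keep an assessment like this on file is to record the ten responses in a small data structure. The sketch below only captures the answers given above and does not implement the decision tree itself. The key names and the helper function are hypothetical labels introduced for illustration.

```python
# Scenario #4 responses, in question order. Key names are illustrative only;
# each value is (response letter, short description from the assessment).
SCENARIO_4_ANSWERS = {
    "uptime_goal": ("E", "99.5%, extreme availability goal"),
    "end_user_tolerance": ("E", "very low tolerance of downtime"),
    "cost_of_downtime": ("A", "$2K/hour"),
    "recovery_time": ("E", "very short recovery"),
    "distributed_sync": ("A", "none"),
    "data_inconsistency_tolerance": ("A", "very little"),
    "scheduled_maintenance": ("C", "average"),
    "performance_scalability": ("C", "moderate performance"),
    "connection_persistence": ("B", "somewhat"),
    "solution_budget": ("B", "$10K <= C$ < $100K"),
}


def variables_rated(letter, answers=SCENARIO_4_ANSWERS):
    """Return the variable names whose response letter equals `letter`."""
    return [name for name, (rating, _desc) in answers.items() if rating == letter]


for name, (rating, desc) in SCENARIO_4_ANSWERS.items():
    print(f"{name:30s} {rating}  {desc}")

print("Rated E:", variables_rated("E"))
```

Listing the variables rated E makes it easy to see that the uptime goal, end-user tolerance, and recovery time are the answers pushing this application toward an extreme solution, even though the dollar cost of downtime is rated A.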

The HA Solution for Scenario #4

Figure 3.15 summarizes the decision-tree answers for the Call Before You Dig application. As you have seen, this is a very critical system during its planned hours of operation, but it has low performance goals and a low dollar cost of downtime (when it is down). Regardless, it is highly desirable for this system to be up and running as much as possible. The HA solution that best fits this particular application's needs is a combination that yields maximum redundancy (hardware, disk, and database) with the additional insurance policy of maintaining a hot standby server (via log shipping) in case the whole SQL cluster configuration fails. This is a pretty extreme attempt at always having a valid application up and running, which is warranted by the loss-of-life risk this application carries.

Figure 3.15 Call before you dig decision-tree summary + HA solution.

After building this HA solution, the uptime goal was achieved easily. In fact, after three months, the log shipping configuration was disabled. Two days after the log shipping was disabled, the whole SQL cluster configuration failed (Murphy's law). The log shipping was rebuilt and this configuration has remained in place since then. Performance has been exceptional and this application continuously achieves its availability goals! Figure 3.16 shows the technical HA solution employed.

Figure 3.16 Call before you dig HA solution technical architecture.

Cost Justification of a Selected High Availability Solution

As was described earlier, it might be necessary for you to cost justify the high availability solution that you are about to go out and build. This is assuming that money doesn't grow on trees in your organization or that the cost of downtime isn't a huge dollar amount per hour. If you are like most organizations, any new change to a system or application must be evaluated on its value to the organization and a calculation of how soon it will pay for itself must be done. That's what ROI (Return on Investment) calculations serve to provide—the cost justification behind a proposed solution.

ROI Calculation: As was stated earlier, ROI can be calculated by adding up the incremental costs (or estimates) of the new HA solution and comparing them against the complete cost of downtime for a period of time (I suggest this be calculated across a one-year time period). We will use the ASP business (Scenario #1) as the basis for our ROI calculation. As you might also recall, we had estimated the costs to be in the range of $100K to $250K, which included

  • Five new four-way servers with 4GB RAM at $30K per server

  • Ten MS Windows 2000 Advanced Server licenses

  • Five shared SCSI disk systems with RAID 10 (50 drives)

  • Five days of additional training costs for personnel

  • No new SQL Server licenses because of the plan to operate in an active/passive clustering mode

Okay, the incremental costs are

  1. Maintenance Cost (for a one year period):

    • $20K (estimate)—System admin personnel cost (additional time for training of these personnel)

    • $35K (estimate)—Software licensing cost (of additional HA components)

  2. Hardware Cost:

    • $100K hardware cost (of additional HW in the new HA solution)

  3. Deployment/Assessment Cost:

    • $20K deployment cost (develop, test, QA, production implementation of the solution)

    • $10K HA assessment cost (be bold and go ahead and throw the cost of the assessment into this to be a complete ROI calculation)

  4. Downtime Cost (for a one year period):

    • If you kept track of last year's downtime record, use this number; otherwise produce an estimate of planned and unplanned downtime for this calculation. We estimated the cost of downtime/hour to be $15K/hour.

    • Planned downtime cost (revenue loss cost) = Planned downtime hours x cost of hourly downtime to the company:

      1. .25% x 8760 hours in a year = 21.9 hours of planned downtime

      2. 21.9 hours x $15K/hr = $328,500/year cost of planned downtime.

    • Unplanned downtime cost (revenue loss cost) = Unplanned downtime hours x cost of hourly downtime to the company:

      1. .25% x 8760 hours in a year = 21.9 hours of unplanned downtime

      2. 21.9 hours x $15K/hr = $328,500/year cost of unplanned downtime.

ROI totals:

  • Total of the incremental costs = $185,000 (for the year)

  • Total of downtime cost = $657,000 (for the year)

Incremental cost is 0.28 of the downtime cost for one year ($185,000 / $657,000). In other words, the investment in the HA solution will pay for itself in 0.28 of a year, or about 3.4 months!

In reality, most companies will have achieved the ROI within 6 to 9 months of the first year.
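The ROI arithmetic above can be reproduced with a short script. This is a minimal sketch using only the estimates listed in this section. The function and variable names are illustrative, and the payback period is simply the ratio of incremental cost to yearly downtime cost, as in the text.

```python
HOURS_PER_YEAR = 8760          # 365 days x 24 hours
COST_PER_HOUR = 15_000         # estimated cost of downtime per hour

# Incremental costs of the new HA solution (one-year estimates from the text).
incremental_costs = {
    "system_admin_personnel": 20_000,
    "software_licensing": 35_000,
    "hardware": 100_000,
    "deployment": 20_000,
    "ha_assessment": 10_000,
}


def downtime_cost(downtime_pct, cost_per_hour=COST_PER_HOUR, hours=HOURS_PER_YEAR):
    """Yearly cost of downtime for a given downtime percentage."""
    return downtime_pct / 100 * hours * cost_per_hour


total_incremental = sum(incremental_costs.values())
planned = downtime_cost(0.25)      # 21.9 hours of planned downtime
unplanned = downtime_cost(0.25)    # 21.9 hours of unplanned downtime
total_downtime = planned + unplanned

payback_fraction = total_incremental / total_downtime   # fraction of a year
payback_months = payback_fraction * 12

print(f"Total incremental cost: ${total_incremental:,.0f}")
print(f"Total downtime cost:    ${total_downtime:,.0f}")
print(f"Payback: {payback_fraction:.2f} of a year, about {payback_months:.1f} months")
```

With these inputs the incremental costs total $185,000 and the downtime cost totals $657,000 (2 x $328,500), which gives a payback of about 0.28 of a year, or roughly 3.4 months. Swapping in your own downtime record and cost estimates is a matter of changing the constants at the top.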

Adding HA Elements to Your Development Methodology

Most of the high availability elements that were identified in the Phase 0 HA assessment process and the Primary Variables Gauge can be cleanly added (extended) to your company's current system development life cycle. By adding the HA-oriented elements to your standard development methodology, you ensure that this information is captured and that new applications can readily be targeted to the correct technology solution. Figure 3.17 highlights the high availability tasks that could be added to a typical waterfall development methodology. As you can see, HA starts early in the assessment phase and is present all the way through the implementation phase. Think of this as extending your development capability. It helps ensure that all your applications are properly evaluated and designed against their high availability needs, if they have any.

Figure 3.17 Development methodology with high availability built in.
