This chapter is from the book

Step 3: Determining the Optimal HA Solution

Once you have completed step 1 and step 2, the hard part is over, and you probably have enough information to make a very good high availability solution choice. If your information from these two steps was spotty and incomplete, the decision you make will probably be somewhat suspect. But, in general, it may be good enough to get you mostly what you are trying to achieve.

Step 3 draws on a formal deterministic approach that combines the assessment results and gauge information to yield the right HA solution for your requirements.

A Hybrid High Availability Selection Method

Many potential selection methods could be used to help in selecting the best HA solution. There are scoring methods, decision-tree methods, and simple estimation methods. I like a hybrid decision-tree method that uses the primary variable answers to guide you to an HA solution.

With any selection method for determining an HA solution, you will get several possible high availability answers, one of which will be that no high availability solution is needed. The general cost and administrative complexity of each solution are also described. As new HA solutions are identified in the industry (or by a particular vendor, such as Microsoft), this list can be expanded. But, for now, this book focuses on the following (see Figure 3.2):

  • Disk methods—Disk methods include disk mirroring, RAID, and so on. Characteristics: Medium cost and low administration complexity.

  • Other hardware—Other hardware includes redundant power supplies, fans, CPUs, and so on (many of which are hot swappable). Characteristics: Medium cost and low administration complexity.

  • Failover clustering—Windows Server Failover Clustering allows two (or more) systems to fail over to each other in a passive/active or active/active mode. Characteristics: Medium cost and moderate administration complexity.

  • SQL clustering—SQL clustering fully enables a Microsoft SQL Server instance to fail over to another system using WSFC (because SQL Server 2016 is cluster aware). Characteristics: Medium cost and moderate administration complexity.

  • AlwaysOn availability groups (AVG)—AlwaysOn provides database-level failover and availability by utilizing a data redundancy approach. Primary and up to eight secondary replicas are created and synchronized to form a robust high availability configuration and a work offloading option. Characteristics: High cost (due to Microsoft licensing for required Enterprise Edition) and moderate administration complexity.

  • Data replication—You can use transactional replication to redundantly distribute transactions from one SQL Server database to another instantaneously. Both SQL Server databases (on different servers) can be used for workload balancing as well. Some limitations exist but are very manageable. Characteristics: Low cost and moderate administration complexity.

  • Log shipping—You can directly apply SQL Server database transaction log entries to a warm standby copy of the same SQL Server database. There are some limitations with this technique. Characteristics: Low cost and low administration complexity.

  • DB Snapshots—You can allow database-level creation of point-in-time data access for reporting users, mass update protection, and testing advantages. This is often used in conjunction with database mirroring to make the mirror available for read-only access for other pools of end users. Characteristics: Low cost and moderate administration complexity.

  • Microsoft Azure availability groups—This is an extension of the on-premises capability of availability groups for failover and secondary replicas to the cloud (Microsoft Azure). Characteristics: Medium cost and moderate administration complexity.

  • Microsoft Azure Stretch Database—You can move, at the database and table levels, less-accessed (eligible) data to Microsoft Azure remote storage (remote endpoint/remote data). Characteristics: Medium cost and moderate administration complexity.

  • Microsoft Azure SQL database—You can create both standard and advanced cloud-based SQL databases that allow for database backups to be geo-distributed and secondary SQL databases to be created as secondary failovers and usable replicas. Characteristics: Moderate cost and moderate administration complexity.

  • No high availability solution needed
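The cost and complexity ratings in this list lend themselves to a small lookup table. The following is a minimal sketch, not part of the book's method: the `HAOption` class, the `COST_RANK` ordering, and the `options_within` helper are hypothetical names introduced only for illustration, and the ratings are copied from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HAOption:
    name: str
    cost: str        # "low", "medium", or "high", as rated in the list
    complexity: str  # administration complexity: "low" or "moderate"


# Ratings copied from the list above. The list rates Azure SQL database
# as "moderate" cost; this sketch treats that as "medium".
HA_OPTIONS = [
    HAOption("Disk methods", "medium", "low"),
    HAOption("Other hardware", "medium", "low"),
    HAOption("Failover clustering", "medium", "moderate"),
    HAOption("SQL clustering", "medium", "moderate"),
    HAOption("AlwaysOn availability groups", "high", "moderate"),
    HAOption("Data replication", "low", "moderate"),
    HAOption("Log shipping", "low", "low"),
    HAOption("DB Snapshots", "low", "moderate"),
    HAOption("Microsoft Azure availability groups", "medium", "moderate"),
    HAOption("Microsoft Azure Stretch Database", "medium", "moderate"),
    HAOption("Microsoft Azure SQL database", "medium", "moderate"),
]

# Hypothetical ordering used only to compare cost ratings.
COST_RANK = {"low": 0, "medium": 1, "high": 2}


def options_within(max_cost: str) -> list[str]:
    """Return the names of options whose cost rating does not exceed max_cost."""
    limit = COST_RANK[max_cost]
    return [o.name for o in HA_OPTIONS if COST_RANK[o.cost] <= limit]


low_cost = options_within("low")
print(low_cost)
```

A filter like this only narrows the field by cost; the decision tree later in this section is what actually selects a solution.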

Figure 3.2 illustrates the options that are typically valid together or by themselves. An X at the intersection of two options means that they are often used together; one either needs to be built on top of the other or can be added to it for a more effective high availability solution. For example, you could use disk methods, other HW, failover clustering, and SQL clustering.


Figure 3.2 Valid high availability options and combinations.

As pointed out earlier, some of these possible solutions actually include others (for example, SQL clustering is built on top of failover clustering). This needs to be factored into the results of the selection.
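One way to factor this in is to expand any selected option into the options it is built on. The sketch below is an illustration under stated assumptions: the `BUILT_ON` map and the `expand` function are hypothetical, and the map records only the one dependency named in the text (SQL clustering on failover clustering).

```python
# Dependency stated in the text: SQL clustering is built on failover clustering.
# Other entries could be added as further dependencies are confirmed.
BUILT_ON = {
    "SQL clustering": ["Failover clustering"],
}


def expand(selection: list[str]) -> list[str]:
    """Return the selection with every underlying option listed before its dependent."""
    result: list[str] = []

    def visit(option: str) -> None:
        # Add the options this one is built on first, then the option itself.
        for base in BUILT_ON.get(option, []):
            visit(base)
        if option not in result:
            result.append(option)

    for option in selection:
        visit(option)
    return result


expanded = expand(["Disk methods", "SQL clustering"])
print(expanded)
```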

The Decision-Tree Approach for Choosing an HA Solution

The decision-tree approach involves taking the high availability information garnered in the Phase 0 assessment and traversing down a particular path (decision tree) to an appropriate HA solution. In this case, I have chosen a hybrid decision-tree technique that uses Nassi-Shneiderman charts, which fit well with depicting complex questions and yield very specific results. (I don’t show the use of all the Nassi-Shneiderman chart techniques; only the conditional/question part.) As Figure 3.3 shows, a Nassi-Shneiderman chart includes the following:

  • Condition/question—For which you need to decide an answer.

  • Cases—Any number of known cases (answers) that the question might have (Case A, Case B, …, Case n).

  • Action/result—The specific result or action to be followed, depending on the case chosen (Result A, Result B, …, Result n).

Each question is considered in the context of all the questions answered before it. You are essentially navigating your way down a complex tree structure that will yield a definitive HA solution.

The questions are ordered so that they will clearly flesh out specific needs and push you in a specific high availability direction. Figure 3.4 illustrates an example of a question put into the Nassi-Shneiderman construct. The question is “What percentage of availability must your application have (for its scheduled time of operation)?” If you have completed enough of the Phase 0 assessment, answering this question should be easy. Answering this question is also a good audit or validation of your Phase 0 assessment.
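The condition/case/result structure can be sketched as a small data type. This is an illustrative sketch only: the `Case` and `Question` classes are hypothetical, and only cases D (95.0%) and E (99.95%) are filled in because those are the two that appear in the scenarios later in this section. Treating each percentage as the lower bound of its case is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    label: str        # case letter, for example "D" or "E"
    threshold: float  # assumed lower bound (availability percentage) for the case
    result: str       # action/result attached to the case


@dataclass(frozen=True)
class Question:
    text: str
    cases: tuple[Case, ...]  # ordered from highest threshold to lowest

    def decide(self, value: float) -> Case:
        """Return the first case whose threshold the value meets."""
        for case in self.cases:
            if value >= case.threshold:
                return case
        raise ValueError(f"No case defined for value {value}")


availability_question = Question(
    text=("What percentage of availability must your application have "
          "(for its scheduled time of operation)?"),
    cases=(
        Case("E", 99.95, "Extreme availability goal"),
        Case("D", 95.0, "High availability goal"),
        # Cases A through C are not spelled out in this section, so they are omitted.
    ),
)

chosen = availability_question.decide(99.95)
print(chosen.label, chosen.result)
```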


Figure 3.3 Hybrid decision tree using Nassi-Shneiderman charts.


Figure 3.4 Nassi-Shneiderman question example.

In the normal course of events, you start with the most critical aspects of high availability. Then, depending on the answer to each question, you proceed down a specific path with a new question. As each high availability characteristic is considered, the path (actions followed) leads you to a specific HA solution. The series of questions that needs to be answered is taken from the HA primary variables gauge but is expanded slightly to make the questions conditional in nature. They are as follows:

  • What percentage of time must the application remain up during its scheduled time of operation? (The goal!)

  • How much tolerance does the end user have when the system is not available (planned or unplanned unavailability)?

  • What is the per-hour cost of downtime for this application?

  • How long does it take to get the application back online following a failure (of any kind)? (Worst case!)

  • How much of the application is distributed and will require some type of synchronization with other nodes before all nodes are considered to be 100% available?

  • How much data inconsistency can be tolerated in favor of having the application available?

  • How often is scheduled maintenance required for this application (and environment)?

  • How important are high performance and scalability?

  • How important is it for the application to keep its current connection alive with the end user?

  • What is the estimated cost of a possible high availability solution? What is the budget?
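Because the questions are asked in a fixed order and every one of them should be answered, the traversal can be sketched as an ordered list plus a completeness check. The short keys in `QUESTIONS` and the `walk` function are hypothetical names chosen for this sketch; the response letters in the example are the ones given for the ASP scenario later in this section.

```python
# Hypothetical short keys for the ten questions, in the order listed above.
QUESTIONS = [
    "availability_goal",
    "downtime_tolerance",
    "cost_per_hour",
    "recovery_time",
    "distribution",
    "data_inconsistency",
    "scheduled_maintenance",
    "performance_scalability",
    "connection_persistence",
    "solution_cost",
]


def walk(responses: dict[str, str]) -> list[tuple[str, str]]:
    """Return the (question, response) path in question order.

    Raises ValueError if any question is unanswered, since the full set
    of questions is meant to be completed before choosing a solution.
    """
    missing = [q for q in QUESTIONS if q not in responses]
    if missing:
        raise ValueError(f"Unanswered questions: {missing}")
    return [(q, responses[q]) for q in QUESTIONS]


# Response letters from the ASP scenario (questions 1 through 10).
asp_responses = dict(zip(QUESTIONS, ["E", "E", "D", "C", "A", "B", "C", "D", "B", "C"]))
path = walk(asp_responses)
path_code = "".join(letter for _, letter in path)
print(path_code)
```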

It is also assumed that if you have written (or are planning to write) an application that will be cluster aware, you can leverage WSFC. This would be considered to be an implementation of application clustering (that is, an application that is cluster aware). As mentioned earlier, SQL Server is a cluster-aware program. However, I don’t consider it to be application clustering in the strictest sense; it is, rather, database clustering.

Scenario 1: Application Service Provider (ASP) Assessment

To drive home the decision-tree method, this section walks you through a complete path (decision tree) for the application service provider (ASP) business scenario first mentioned in Chapter 1. This section shows how to answer the questions based on an already completed Phase 0 HA assessment for the ASP. As you recall, this scenario involves a very real ASP and its operating model. This ASP houses and develops numerous global, web-based online order entry systems for several major beauty and health products companies around the world. Its customer base is truly global. The company is headquartered in California, and this ASP guarantees 99.95% uptime to its customers. In this case, the customers are sales associates and their sales managers. If the ASP achieves these guarantees, it gets significant bonuses; if it falls below certain thresholds, it is liable for specific penalties. The processing mix of activity is approximately 65% online order entry and approximately 35% reporting.

Availability:

  • 24 hours per day

  • 7 days per week

  • 365 days per year

  • Planned downtime: 0.25% (less than 1%)

  • Unplanned downtime: 0.25% (less than 1%) will be tolerable
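To make these percentages concrete, the following sketch converts them into hours per year for a 24×7×365 schedule. The `downtime_hours` function is a hypothetical helper written for this illustration.

```python
# Scenario 1 schedule: 24 hours per day, 365 days per year.
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def downtime_hours(scheduled_hours: float, percent: float) -> float:
    """Hours of downtime implied by a percentage of scheduled time."""
    return scheduled_hours * percent / 100


scheduled = HOURS_PER_DAY * DAYS_PER_YEAR     # scheduled hours per year
planned = downtime_hours(scheduled, 0.25)     # planned downtime allowance
unplanned = downtime_hours(scheduled, 0.25)   # unplanned downtime allowance

print(f"Scheduled hours per year: {scheduled}")
print(f"Planned downtime allowance: {planned:.1f} hours")
print(f"Unplanned downtime allowance: {unplanned:.1f} hours")
```

Each 0.25% allowance works out to a little under 22 hours per year, which is a useful number to keep in mind when reading the responses that follow.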

Figure 3.5 shows the first three questions in the decision tree and their corresponding responses (actions). Remember that the questions are cumulative. Each new question carries along the responses of the preceding questions. The responses, taken together, determine the HA solution that best fits. The following pages proceed through the ASP business scenario depiction to give you a feel for how this works.

HA assessment (decision tree):

  1. What percentage of time must the application remain up during its scheduled time of operation? (The goal!)

    Response: E: 99.95%—Extreme availability goal.

  2. How much tolerance does the end user have when the system is not available (planned or unplanned unavailability)?

    Response: E: Very low tolerance of downtime—Extremely critical.

  3. What is the per-hour cost of downtime for this application?

    Response: D: $15k/hour cost of downtime—High cost.


    Figure 3.5 Decision tree for the ASP, questions 1–3.

    Remember that all questions are additive. After going through just three questions, you can see that this ASP business scenario has a pretty high cost per hour when it is not available (a 0.5% per-hour cost [total gross revenues of $3 billion]). This, coupled with high uptime goals and extremely low end-user tolerance for downtime (because of the nature of the ASP business), will drive this application to a particular type of HA solution very quickly. You could easily just stop now and jump to an HA solution of maximum hardware redundancy, RAID, WSFC, and SQL clustering in order to fulfill the HA requirement goals; however, there are still several aspects of the requirement, such as distributed data processing requirements and budget available for the HA solution, that could easily change this outcome. You should always complete the entire set of questions for clarity, consistency, completeness, and cost justification purposes.

    Figure 3.6 forges ahead with the next set of questions and answers.


    Figure 3.6 Decision tree for the ASP, questions 4–6.

  4. How long does it take to get the application back online following a failure (of any kind)? (Worst case!)

    Response: C: Average—Standard amount of time to recover (that is, to get back online). Accomplished via standard DB recovery mechanisms (incremental transaction log dumps done every 15 minutes). Faster recovery times would be beneficial, but data integrity is of huge importance.

  5. How much of the application is distributed and will require some type of synchronization with other nodes before all nodes are considered to be 100% available?

    Response: A: None—There aren’t any components of this application that are distributed (nondistributed). This simplifies the data synchronization aspects to consider but does not necessarily mean that a distributed HA solution won’t better serve the overall application. If the application has a heavy reporting component, some type of data replication architecture could serve it well. This will be addressed in the performance/scalability question later.

  6. How much data inconsistency can be tolerated in favor of having the application available?

    Response: B: A little—A high degree of data consistency must be maintained at all times. This gives little room for any HA option that would sacrifice data consistency in favor of availability.

    For systems with primarily static data, complete images of the application databases could be kept at numerous locations for instantaneous access any time they are needed, with little danger of data inconsistency (if administered properly). For systems with a lot of data volatility, the answer on this one question may well dictate the HA option to use. Very often the HA option best suited for high data consistency needs is SQL clustering and log shipping, coupled with RAID at the disk subsystem level.

    Another short pause in the path to an HA solution finds you not having to support a complex distributed environment but having to make sure you keep the data consistent as much as possible. In addition, you can plan on typical recovery times to get the application back online in case of failures. (The ASP’s service level agreement probably specified this.) However, if a faster recovery mechanism is possible, it should be considered because it will have a direct impact on the total amount of unplanned downtime and could potentially allow the ASP to get some uptime bonuses (that might also be in the SLA).

    Now, let’s venture into the next set of questions, as illustrated in Figure 3.7. These focus on planned downtime, performance, and application connectivity perception.

  7. How often is scheduled maintenance required for this application (and environment)?

    Response: C: Average—A reasonable amount of downtime is needed to service operating system patches/upgrades, hardware changes/swapping, and application patches/upgrades. For 24×7 systems, this will cut into the availability time directly. For systems with downtime windows, the pressure is much less in this area.


    Figure 3.7 Decision tree for the ASP, questions 7–9.

  8. How important are high performance and scalability?

    Response: D: Very much—The ASP considers all of its applications to be high-performance systems that must meet strict performance numbers and must be able to scale to support a large number of users. These performance thresholds would be spelled out clearly in the service level agreement. Any HA solution must therefore be a scalable solution as well.

  9. How important is it for the application to keep its current connection alive with the end user?

    Response: B: Somewhat—At the very least, the ability to establish a new connection to the application within a short amount of time will be required. No client connection failover is required. This was partially made possible by the overall transactional approach of “optimistic concurrency” used by the applications. This approach puts much less pressure on holding a connection for long periods of time (to hold/lock rows).

    As you can see in Figure 3.8, the estimated cost of a potential HA solution would be between $100k and $250k. Budget for HA should be estimated to be a couple of full days’ worth of downtime cost. For the ASP example, this would be roughly $720k. The ROI calculation will show how quickly this will be recovered.


    Figure 3.8 Decision tree for the ASP, question 10 and HA solution.

  10. What is the estimated cost of a possible high availability solution? What is the budget?

    Response: C: $100k <= C$ < $250k—This is a moderate amount of cost for potentially a huge amount of benefit. These estimates are involved:

    • Five new multi-core servers with 64GB RAM at $30k per server

    • Five Microsoft Windows 2012 licenses

    • Five shared SCSI disk systems with RAID 10 (50 drives)

    • Five days of additional training costs for personnel

    • Five SQL Server Enterprise Edition licenses
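The budget figure for the ASP follows directly from the per-hour cost of downtime in question 3. The sketch below works through that arithmetic and adds a simple payback illustration; it is not the book's ROI calculation. Reading "a couple of full days" as two 24-hour days is an assumption, and the variable names are hypothetical.

```python
COST_PER_HOUR = 15_000   # per-hour cost of downtime from question 3
BUDGET_DAYS = 2          # "a couple of full days", read as two 24-hour days (assumption)

# Rule of thumb from the text: budget is a couple of full days of downtime cost.
budget = BUDGET_DAYS * 24 * COST_PER_HOUR

# Estimated cost range for response C in question 10.
estimate_low, estimate_high = 100_000, 250_000

# Hours of avoided downtime needed to cover the estimate (simple payback).
payback_hours_low = estimate_low / COST_PER_HOUR
payback_hours_high = estimate_high / COST_PER_HOUR

print(f"HA budget: ${budget:,}")
print(f"Estimate stays within budget: {estimate_high < budget}")
print(f"Payback after {payback_hours_low:.1f} to {payback_hours_high:.1f} hours of avoided downtime")
```

Under these assumptions, even the top of the estimated range is recovered after well under a day of avoided downtime.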

The HA Solution for Scenario 1

Figure 3.8 shows the final selection of hardware redundancy, shared disk RAID arrays, failover clustering, SQL clustering for the primary server instance, a secondary replica in synchronous mode for instantaneous failover, and two more asynchronous secondary replicas for heavy load balancing of online read-only access and reporting needs. There is little doubt about the needs being met well by this particular set of HA solutions. It clearly meets all of the most significant requirements of uptime, tolerance, performance, and costs. The ASP’s SLA allows for brief amounts of downtime to service all OS, hardware, and application upgrades, but due to the availability group configuration, all these are handled with rolling updates and zero downtime for the applications. Figure 3.9 shows the live HA solution technical architecture, with a budget allowing for a larger amount of hardware redundancy to be utilized.

The ASP actually put this HA solution in place and then achieved nearly five 9s for extended periods of time (exceeding its original goals of 99.95% uptime). One additional note is that the ASP also employs a risk-spreading strategy to further reduce downtime created from application and shared hardware failures. It will put at most two to three applications on a particular clustered solution. (Refer to Figure 2.10 in Chapter 2 for a more complete depiction of this risk mitigation approach.)
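To see how far "nearly five 9s" is from the original 99.95% goal, the following sketch converts each availability percentage into allowed downtime per year on a 24×7×365 schedule. The `allowed_downtime_minutes` function is a hypothetical helper for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # minutes in a 24x7x365 schedule


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at a given availability percentage."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


goal = allowed_downtime_minutes(99.95)       # the ASP's guaranteed uptime
achieved = allowed_downtime_minutes(99.999)  # five 9s

print(f"99.95% allows about {goal:.0f} minutes of downtime per year")
print(f"99.999% allows about {achieved:.1f} minutes of downtime per year")
```

The goal permits a bit over four hours of downtime per year, whereas five 9s permits only about five minutes.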


Figure 3.9 ASP HA solution technical architecture.

The decision-tree approach can also be illustrated in a slightly different way. Figure 3.10 shows an abbreviated bubble chart technique of this decision-tree path traversal.


Figure 3.10 Bubble chart decision-tree path traversal.

Remember that each question takes into context all questions before it. The result is a specific HA solution that best meets your business requirements.

Scenario 2: Worldwide Sales and Marketing (Brand Promotion) Assessment

Recall that Scenario 2 features a major chip manufacturer that has created a highly successful promotion and branding program, which results in billions of advertising dollars being rebated back to the company’s worldwide sales channel partners. These sales channel partners must enter their complete advertisements (newspaper, radio, TV, other) and be measured against ad compliance and logo usage and placements. If a sales channel partner is in compliance, it will receive up to 50% of the cost of its advertisement back from this chip manufacturer. There are three major advertising regions: Far East, Europe, and North America. Any other advertisements outside these first three are lumped into an “Other Regions” bucket. Each region produces a huge daily load of new advertisement information that is processed instantaneously for compliance. Each major region deals only with that region’s advertisements but receives the compliance rules and compliance judgment from the chip manufacturer’s headquarters. Application mix is approximately 75% online entry of advertisement events and 25% management and compliance reporting.

Availability:

  • 24 hours per day

  • 7 days a week

  • 365 days a year

  • Planned downtime: 3%

  • Unplanned downtime: 2% will be tolerable

HA assessment (decision tree):

  1. What percentage of time must the application remain up during its scheduled time of operation? (The goal!)

    Response: D: 95.0%—High availability goal. This, however, is not a super-critical application in terms of keeping the company running (as an order entry system would be).

  2. How much tolerance does the end user have when the system is not available (planned or unplanned unavailability)?

    Response: C: Medium tolerance of downtime—Standard criticality.

  3. What is the per-hour cost of downtime for this application?

    Response: B: $5k/hour cost of downtime—Low cost.

    As you can see so far, this sales and marketing application is nice to have available, but it can tolerate some downtime without hurting the company very much. Sales are not lost; work just gets backed up a bit. In addition, the cost of downtime is reasonably low, at $5k/hr. This is roughly the cost of advertisement staff being unable to work on marketing materials.

  4. How long does it take to get the application back online following a failure (of any kind)? (Worst case!)

    Response: C: Average—Standard amount of time to recover (that is, to get back online). Accomplished via standard DB recovery mechanisms (incremental transaction log dumps done every 15 minutes).

  5. How much of the application is distributed and will require some type of synchronization with other nodes before all nodes are considered to be 100% available?

    Response: D: High distribution—This is a global application that relies on data being created and maintained at headquarters from around the world (OLTP activity) but must also support heavy regional reporting (reporting activity) that doesn’t interfere with the performance of the OLTP activity.

  6. How much data inconsistency can be tolerated in favor of having the application available?

    Response: B: A little—A high degree of data consistency must be maintained at all times. This gives little room for any HA option that would sacrifice data consistency in favor of availability. This is regionally sensitive, in that when data is being updated by Europe, the Far East doesn’t need to get data updates right away.

  7. How often is scheduled maintenance required for this application (and environment)?

    Response: C: Average—A reasonable amount of downtime will occur to service operating system patches/upgrades, hardware changes/swapping, and application patches/upgrades.

  8. How important are high performance and scalability?

    Response: D: Very much—Performance (and scalability) are very important for this application. Ideally, an overall approach of separating the OLTP activity from the reporting activity will pay big dividends toward this.

  9. How important is it for the application to keep its current connection alive with the end user?

    Response: B: Somewhat—At the very least, the ability to establish a new connection to the application within a short amount of time will be required. No client connection failover is required. In fact, for the worst-case scenario of the headquarters database becoming unavailable, the OLTP activity could easily be shifted to any other full copy of the database that is being used for reporting and is being kept current (as will be seen in the HA solution for this scenario).

  10. What is the estimated cost of a possible high availability solution? What is the budget?

    Response: B: $10k <= C$ < $100k—This is a pretty low amount of cost for potentially a huge amount of benefit. These estimates are involved:

    • Five new multi-core servers with 32GB RAM at $10k per server

    • Five new Microsoft Windows 2012 Server licenses

    • Five SCSI disk systems with RAID 10 (25 drives)

    • Two days of additional training costs for personnel

    • Five new SQL Server licenses (remote distributor, three subscribers)
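The cost responses in question 10 are bands rather than exact figures. The sketch below classifies a cost into the two bands that appear in this section (B and C); the other bands are not spelled out here, so they are left out rather than guessed. The `COST_BANDS` table and `cost_band` function are hypothetical names, and the example inputs are the server line items only, because the remaining items are not priced in the text.

```python
from typing import Optional

# Bands quoted in the responses: B is $10k <= cost < $100k, C is $100k <= cost < $250k.
COST_BANDS = {
    "B": (10_000, 100_000),
    "C": (100_000, 250_000),
}


def cost_band(cost: float) -> Optional[str]:
    """Return the band letter for a cost, or None if it falls outside the known bands."""
    for label, (low, high) in COST_BANDS.items():
        if low <= cost < high:
            return label
    return None


# Server line items only (other items in the estimates are not priced in the text).
sales_marketing_servers = 5 * 10_000   # Scenario 2: five servers at $10k each
asp_servers = 5 * 30_000               # Scenario 1: five servers at $30k each

print(cost_band(sales_marketing_servers))
print(cost_band(asp_servers))
```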

The HA Solution for Scenario 2

Figure 3.11 shows that a basic hardware/disk redundancy approach on each server should be used, along with SQL Server’s robust “transactional” data replication implementation, to create three regional reporting images of the primary marketing database (MktgDB). These distributed copies will try to alleviate the major reporting burden against the OLTP (primary) database and also can serve as a warm standby copy of the database in the event of a major database problem at headquarters. Overall, this distributed architecture is easy to maintain and keep in sync and is highly scalable, as shown in Figure 3.12.


Figure 3.11 Sales/marketing decision-tree summary plus HA solution.


Figure 3.12 Sales/marketing HA solution technical architecture.

After building this HA solution, the uptime goal was achieved most of the time. Occasionally, there were some delays in resyncing the data at each regional site (subscribers). But, in general, the users were extremely happy with performance, availability, and minimal costs.

Scenario 3: Investment Portfolio Management Assessment

An investment portfolio management application will be housed in a major server farm in the heart of the world’s financial center: New York. Serving North American customers only, this application provides the ability to do full trading of stocks and options in all financial markets (United States and international), along with full portfolio holdings assessment, historical performance, and holdings valuation. Primary users are investment managers for their large customers. Stock purchasing and selling comprise 90% of the daytime activity, and massive assessment, historical performance, and valuation reporting occur after the markets have closed. Three major peaks occur each weekday that are driven by the three major trading markets of the world (United States, Europe, and the Far East). During the weekends, the application is used for the long-range planning reporting and front-loading stock trades for the coming week.

Availability:

  • 20 hours per day

  • 7 days per week

  • 365 days per year

  • Planned downtime: 4%

  • Unplanned downtime: 1% will be tolerable

HA assessment (decision tree):

  1. What percentage of time must the application remain up during its scheduled time of operation? (The goal!)

    Response: D: 95.0%—High availability goal. This particular financial institution (one of the largest on the planet) tends to allow for a small percentage of “built-in” downtime (planned or unplanned). A smaller, more nimble financial institution would probably have slightly more aggressive uptime goals (for example, five 9s). Time is money, you know.

  2. How much tolerance does the end user have when the system is not available (planned or unplanned unavailability)?

    Response: D: Low tolerance of downtime—High criticality due to market timings (selling and buying stocks within market windows).

  3. What is the per-hour cost of downtime for this application?

    Response: E: $150k/hour cost of downtime—Very high cost. However, this is the worst-case scenario. When the markets are closed, the cost of downtime is marginal.

  4. How long does it take to get the application back online following a failure (of any kind)? (Worst case!)

    Response: E: Very short recovery—This application’s time to recover should be a very short amount of time (that is, to get back online).

  5. How much of the application is distributed and will require some type of synchronization with other nodes before all nodes are considered to be 100% available?

    Response: C: Medium distribution—This moderately distributed application has a large OLTP requirement and a large report processing requirement.

  6. How much data inconsistency can be tolerated in favor of having the application available?

    Response: A: Very little—A very high degree of data consistency must be maintained at all times because this is financial data.

  7. How often is scheduled maintenance required for this application (and environment)?

    Response: C: Average—A reasonable amount of downtime will occur to service operating system patches/upgrades, hardware changes/swapping, and application patches/upgrades.

  8. How important are high performance and scalability?

    Response: D: Very much—Performance (and scalability) are very important for this application.

  9. How important is it for the application to keep its current connection alive with the end user?

    Response: B: Somewhat—At the very least, the ability to establish a new connection to the application within a short amount of time will be required. No client connection failover is required.

  10. What is the estimated cost of a possible high availability solution? What is the budget?

    Response: C: $100k <= C$ < $250k—This is a moderate amount of cost for potentially a huge amount of benefit. These estimates are involved:

    • Two new multi-core servers with 64GB RAM at $50k per server

    • Three Microsoft Windows 2012 Server licenses

    • Three SQL Server Enterprise Edition licenses

    • $1,000/month Microsoft Azure IaaS fees

    • Two shared SCSI disk systems with RAID 10 (30 drives)

    • Twelve days of additional training costs for personnel

    The company budgeted $1.25 million for all HA costs. A solid HA solution won’t even approach that number; it comes in well under the budgeted amount.

    Figure 3.13 shows the overall summary of the decision-tree results, along with the HA solution for the portfolio management scenario.


Figure 3.13 Portfolio management decision-tree summary plus HA solution.

The HA Solution for Scenario 3

As identified in Figure 3.13, the company opted for the basic hardware/disk redundancy approach on each server, with an AlwaysOn availability group asynchronous secondary replica for failover, offloading of DB backups, and a reporting workload on both the local secondary replica and the Microsoft Azure secondary replica. In fact, most of the reporting was eventually directed to the Microsoft Azure secondary replica; data was only about 10 seconds behind the primary replica, on average. There is now plenty of risk mitigation with this technical architecture, and it is not difficult to maintain (as shown in Figure 3.14).

Once this HA solution was put together, it exceeded the high availability goals on a regular basis. Great performance has also resulted due to the splitting out of the OLTP from the reporting; this is very often a solid design approach.


Figure 3.14 Portfolio management HA solution technical architecture.

Scenario 4: Call Before You Dig Assessment

Recall from Chapter 1 that at a tri-state underground construction call center, an application must determine within 6 inches the likelihood of hitting any underground gas mains, water mains, electrical wiring, phone lines, or cables that might be present on a proposed dig site for construction. Law requires that a call be placed to this center to determine whether it is safe to dig and identify the exact location of any underground hazard before any digging has started. This application is classified as “life at risk” and must be available very nearly 100% of the time during common construction workdays (Monday through Saturday). Each year more than 25 people are killed nationwide digging into unknown underground hazards. The application mix is 95% query only with 5% devoted to updating images, geospatial values, and various pipe and cable location information provided by the regional utility companies.

Availability:

  • 15 hours per day (5:00 a.m.–8:00 p.m.)

  • 6 days per week (closed on Sunday)

  • 312 days per year

  • Planned downtime: 0%

  • Unplanned downtime: 0.5% (less than 1%) will be tolerable
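The availability figures above can be turned into an hours-per-year downtime budget. The following is a minimal sketch of that arithmetic; the function and constant names are illustrative assumptions, not part of the original assessment.

```python
# Scheduled operating window, taken from the availability list above.
HOURS_PER_DAY = 15      # 5:00 a.m. to 8:00 p.m.
DAYS_PER_YEAR = 312     # 6 days per week x 52 weeks


def scheduled_hours(hours_per_day: int, days_per_year: int) -> int:
    """Total hours per year the application is expected to be up."""
    return hours_per_day * days_per_year


def downtime_budget_hours(total_hours: float, downtime_pct: float) -> float:
    """Hours of downtime allowed for a given downtime percentage."""
    return total_hours * downtime_pct / 100.0


total = scheduled_hours(HOURS_PER_DAY, DAYS_PER_YEAR)
tolerable = downtime_budget_hours(total, 0.5)   # 0.5% unplanned downtime

print(f"Scheduled hours per year: {total}")
print(f"Tolerable unplanned downtime: {tolerable:.1f} hours per year")
```

Under these figures the application is scheduled for 4,680 hours per year, and a 0.5% unplanned downtime allowance works out to about 23.4 hours per year.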

HA assessment (decision tree):

  1. What percentage of time must the application remain up during its scheduled time of operation? (The goal!)

    Response: E: 99.95%—Extreme availability goal. This is a “life critical” application. Someone may be killed if information cannot be obtained from this system during its planned time of operation.

  2. How much tolerance does the end user have when the system is not available (planned or unplanned unavailability)?

    Response: E: Very low tolerance of downtime—In other words, this has extreme criticality from the end user’s point of view.

  3. What is the per-hour cost of downtime for this application?

    Response: A: $2k/hour cost of downtime—Very low dollar cost. Very high life cost. This one question is very deceiving. There is no limit to the cost of loss of life. However, you must go with the original dollar costing approach. So, bear with me on this one. Hopefully, the outcome will be the same.

  4. How long does it take to get the application back online following a failure (of any kind)? (Worst case!)

    Response: E: Very short recovery—The time needed to get this application back online should be very short.

  5. How much of the application is distributed and will require some type of synchronization with other nodes before all nodes are considered to be 100% available?

    Response: A: None—There is no data distribution requirement for this application.

  6. How much data inconsistency can be tolerated in favor of having the application available?

    Response: A: Very little—A very high degree of data consistency must be maintained at all times. This data must be extremely accurate and up to date because incorrect information can be life-threatening.

  7. How often is scheduled maintenance required for this application (and environment)?

    Response: C: Average—A reasonable amount of downtime will occur to service operating system patches/upgrades, hardware changes/swapping, and application patches/upgrades. This system has a planned time of operation of 15×6×312, yielding plenty of time for this type of maintenance. Thus, it will have 0% planned downtime but an average amount of scheduled maintenance.

  8. How important are high performance and scalability?

    Response: C: Moderate performance—Performance and scalability aren’t paramount for this application. The accuracy and availability of the information are most important.

  9. How important is it for the application to keep its current connection alive with the end user?

    Response: B: Somewhat—At the very least, the ability to establish a new connection to the application within a short amount of time will be required. No client connection failover is required.

  10. What is the estimated cost of a possible high availability solution? What is the budget?

    Response: B: $10k <= C$ < $100k—This is a pretty low cost. These estimates are involved:

    • Four new multi-core servers with 64GB RAM at $30k per server

    • Four new Microsoft Windows 2012 Server licenses

    • One shared SCSI disk system with RAID 5 (10 drives)—this is primarily a read-only system (95% reads)

    • One SCSI disk system RAID 5 (5 drives)

    • $750/month Microsoft Azure IaaS fees

    • Five days of additional training costs for personnel

    • Four new SQL Server Enterprise Edition licenses

    The budget for the whole HA solution is somewhat limited to under $100k as well.
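To make the answer to question 3 concrete, the per-hour cost of downtime can be combined with the scheduled hours of operation and an availability percentage. This is a rough sketch under the figures stated in the assessment; the helper name `yearly_downtime_cost` is hypothetical and used only for illustration.

```python
# Figures taken from the assessment answers; names are illustrative only.
SCHEDULED_HOURS = 15 * 312    # planned hours of operation per year
COST_PER_HOUR = 2_000         # question 3: $2k/hour cost of downtime


def yearly_downtime_cost(availability_pct: float) -> float:
    """Dollar cost per year of the downtime implied by an availability level."""
    downtime_hours = SCHEDULED_HOURS * (100.0 - availability_pct) / 100.0
    return downtime_hours * COST_PER_HOUR


cost_at_goal = yearly_downtime_cost(99.95)       # question 1 availability goal
cost_at_tolerance = yearly_downtime_cost(99.5)   # 0.5% unplanned downtime

print(f"Downtime cost at 99.95%: ${cost_at_goal:,.0f} per year")
print(f"Downtime cost at 99.5%:  ${cost_at_tolerance:,.0f} per year")
```

With these inputs the yearly dollar exposure is roughly $4,680 at the 99.95% goal and roughly $46,800 at the 0.5% tolerance. Both are modest amounts, which illustrates why the dollar answer to question 3 understates the real stakes of a life-at-risk application.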

The HA Solution for Scenario 4

Figure 3.15 summarizes the decision-tree answers for the call before you dig application. As you have seen, this is a very critical system during its planned hours of operation, but it has moderate performance goals and a low dollar cost of downtime. Regardless, it is highly desirable for this system to be up and running as much as possible. The HA solution that best fits this application’s needs is a combination that yields maximum redundancy (hardware, disk, and database), with the additional insurance policy of a SQL clustered primary that is part of an AlwaysOn availability group. That availability group has a failover secondary replica and a Microsoft Azure secondary replica in case the whole SQL cluster configuration fails. This is a fairly extreme effort to always have a valid application up and running, which the life-at-risk nature of this application justifies.

Once this HA solution was built, the uptime goal was achieved easily. Performance has been exceptional, and the application consistently achieves its availability goals. Figure 3.16 shows the technical HA solution employed.

Figure 3.15 Call before digging decision-tree summary plus HA solution.

Figure 3.16 Call before you dig HA solution technical architecture.
