Designing Enterprise Solutions with Sun Cluster 3.0
This chapter addresses the following topics:
- The need for a business to have a highly available, clustered system
- System failures that influence business decisions
- Factors to consider when designing a clustered system
- Failure modes specific to clusters, synchronization, and arbitration
To understand why you are designing a clustered system, you must first understand the business need for such a system. Your understanding of the complex system failures that can occur in such systems will influence the decision to use a clustered system and will also help you design a system to handle such failures. You must also consider issues such as data synchronization, arbitration, caching, timing, and clustered system failures (split brain, multiple instances, and amnesia) as you design your clustered system.
Once you are familiar with all the building blocks, issues, and features that enable you to design an entire clustered system, you can analyze the solutions that the Sun Cluster 3.0 software offers to see how that software meets your enterprise business needs and your backup, restore, and recovery requirements.
The sections in this chapter are:
- Business Reasons for Clustered Systems
- Failures in Complex Systems
- Data Synchronization
- Arbitration Schemes
- Data Caches
- Timeouts
- Failures in Clustered Systems
- Summary
Business Reasons for Clustered Systems
Businesses build clusters of computers to improve performance or availability. Some products and technologies can improve both. However, much clustering activity driving the computer industry today is focused on improving service availability.
Downtime is a critical problem for an increasing number of computer users. Computers have not become less reliable, but users now insist on greater degrees of availability. As more businesses depend on computing as the backbone of their operation, around-the-clock availability of services becomes more critical.
Downtime can translate into lost money for businesses, potentially in large amounts. Large enterprise customers are not the only ones to feel this pinch. The demands for mission-critical computing have reached the workgroup, and even the desktop. No one today can afford downtime. Even the downtime required for system maintenance is under pressure: computer users want systems to remain operational while system administrators perform maintenance tasks.
Businesses implement clusters for availability when the potential cost of downtime is greater than the incremental cost of the cluster. The potential cost of downtime can be difficult to predict accurately. To help predict this cost, you can use risk assessment.
Risk Assessment
Risk assessment is the process of determining what the consequences are when an event, such as a failure, occurs. For many businesses, the business processes themselves are as complex as the computer systems they rely on, which significantly complicates the systems architect's risk assessment. It may be easier to make a generic risk assessment in which the business risk is expressed as a cost. Nevertheless, justifying the cost of a clustered system is often difficult unless you can show that the cost of implementing and supporting the cluster is offset by a reduction in the cost of downtime. Because the former can be measured in real dollars while the latter depends on a multivariate situation with many probability functions, many people find it easier to relate to some percentage of uptime.
Clusters attempt to decrease the probability that a fault will cause a service outage, but they cannot prevent it. They do, however, limit the maximum service outage time by providing a host on which to recover from the fault.
Computations justifying the cost of a cluster must not assume zero possibility of a system outage. Prospect theory is useful for communicating this to end users: telling them that the system has a 99 percent chance of avoiding a loss is preferable to telling them it has a 1 percent chance of loss. For design purposes, however, the systems architect must carefully consider that 1 percent chance of loss. After you assess the risks of downtime, you can make a more realistic cost estimate.
Cost Estimation
Ultimately, everything a business does can be attributed to cost. Given infinite funds and time (and time is money), perfect systems could be built and operated. Unfortunately, most real systems have both funding and time constraints.
Nonrecurring expenses include hardware and software acquisition costs, operator training, software development, and so forth. Normally, these costs are not expected to recur. The nonrecurring hardware cost of purchasing a cluster is obviously greater than that of an equivalent single system. Software costs vary somewhat: there is the cost of the cluster software and any agents required, and an additional cost may be incurred as a result of the licensing agreements for other software. In some cases, a software vendor may require the purchase of a license for each node in the cluster. Other software vendors may have more flexible licensing, such as per-user licenses.
Recurring costs include ongoing maintenance contracts, consumable goods, power, network connection fees, environmental conditioning, support personnel, and floor space costs.
Almost all system designs must be justified in economic terms. Simply put, is the profit generated by the system greater than its cost? For systems that do not consider downtime, economic justification tends to be a fairly straightforward calculation.
$$P_{lifetime} = R_{lifetime} - C_{downtime} - C_{nonrecurring} - C_{recurring}$$

where:

$P_{lifetime}$ is the profit over the lifetime of the system.
$R_{lifetime}$ is the revenue generated by the system over its lifetime.
$C_{downtime}$ is the cost of any downtime.
$C_{nonrecurring}$ is the cost of nonrecurring expenses.
$C_{recurring}$ is the cost of any recurring expenses.
During system design, these costs tend to be difficult to predict accurately. On well-designed systems, however, they are readily measurable.
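As a rough illustration, the short sketch below plugs hypothetical figures into this calculation. The dollar amounts are assumptions chosen purely for the example, not figures from this chapter.

```python
# Hypothetical lifetime profit calculation for a clustered system.
# All figures are illustrative assumptions, not real costs or revenues.

r_lifetime = 12_000_000      # revenue generated over the system lifetime ($)
c_downtime = 250_000         # estimated cost of downtime over the lifetime ($)
c_nonrecurring = 1_500_000   # hardware, software, training, development ($)
c_recurring = 400_000        # maintenance, power, support staff, floor space ($)

# P_lifetime = R_lifetime - C_downtime - C_nonrecurring - C_recurring
p_lifetime = r_lifetime - c_downtime - c_nonrecurring - c_recurring

print(f"Lifetime profit: ${p_lifetime:,}")   # Lifetime profit: $9,850,000
```

Running the same arithmetic twice, once for a clustered configuration and once for an equivalent single system, gives a simple way to compare the incremental cost of the cluster against the expected reduction in the cost of downtime.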
The cost of downtime is often described in terms of the profit of uptime:

$$C_{downtime} = t_{down} \times \frac{P_{uptime}}{t_{up}}$$

where:

$C_{downtime}$ is the cost of downtime.
$t_{down}$ is the duration of the outage.
$P_{uptime}$ is the profit made during $t_{up}$.
$t_{up}$ is the time the system had been up.
For most purposes, this equation suffices. What is not accounted for in this equation is the opportunity cost. If a web site has competitors and is down, a customer is likely to go to one of the competing web sites. This defection represents an opportunity loss that is difficult to quantify.
The pitfall in using such an equation is that $P_{uptime}$ is likely to be a function of time. For example, a factory that operates using one shift makes a profit only during the shift hours. During the hours that the factory is not operating, $P_{uptime}$ is zero, and consequently $C_{downtime}$ is zero:

$$P_{uptime}(t) = \begin{cases} P_{nominal}, & \text{when } t \text{ is during the work hours} \\ 0, & \text{all other times} \end{cases}$$
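As a minimal sketch of this effect, the following code assumes a single 09:00 to 17:00 shift and a hypothetical nominal profit rate; both figures are made up for illustration.

```python
# Downtime cost with a time-dependent profit rate.
# The shift hours and nominal profit rate are illustrative assumptions.

P_NOMINAL_PER_HOUR = 10_000        # profit rate during working hours ($/hour)
SHIFT_START, SHIFT_END = 9, 17     # single shift: 09:00 to 17:00

def p_uptime(hour_of_day: int) -> int:
    """Profit rate at a given hour: P_nominal during the shift, zero otherwise."""
    return P_NOMINAL_PER_HOUR if SHIFT_START <= hour_of_day < SHIFT_END else 0

def downtime_cost(outage_start_hour: int, duration_hours: int) -> int:
    """Sum the profit rate over each hour the system is down."""
    return sum(p_uptime((outage_start_hour + h) % 24) for h in range(duration_hours))

# A two-hour outage during the shift costs money; the same outage at night costs nothing.
print(downtime_cost(10, 2))   # 20000
print(downtime_cost(22, 2))   # 0
```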
Another way to show the real cost of system downtime is to weight the cost according to the impact on the business. For example, a system that supports a call center might choose impacted user minutes (IUM), instead of a dollar value, to represent the cost of downtime. If 1,000 users are affected by an outage for 5 minutes, the IUM value is 1,000 users times 5 minutes, or 5,000 IUMs. This approach has the advantage of being an easily measured metric. The number of logged-in users and the duration of any outage are readily measurable quantities. A service level agreement (SLA) that specifies the service level as IUMs can be negotiated. IUMs can then be translated into a dollar value by the accountants.
Another advantage of using IUMs is that the service provided to the users is measured, rather than the availability of the system components. SLAs can also be negotiated on the basis of service availability, but it becomes difficult to account for the transfer of the service to a secondary site. IUMs can be readily transferred to secondary sites because the measurement is not based in any way on the system components.
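The IUM metric is also easy to compute and track automatically. The sketch below uses made-up outage records and an assumed SLA budget to show how logged-in user counts and outage durations combine into the metric.

```python
# Impacted user minutes (IUMs): users affected by an outage times its duration.
# The outage records and the SLA budget below are made-up illustrative values.

outages = [
    {"users_logged_in": 1000, "minutes_down": 5},   # 5,000 IUMs (the example in the text)
    {"users_logged_in": 250,  "minutes_down": 12},  # 3,000 IUMs
]

iums = sum(o["users_logged_in"] * o["minutes_down"] for o in outages)

SLA_IUM_BUDGET = 10_000   # assumed IUM budget negotiated in the SLA for the period

print(f"Total IUMs this period: {iums}")                       # 8000
print("Within SLA" if iums <= SLA_IUM_BUDGET else "SLA breached")
```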