Web Farms: Availability and Scalability
In this chapter

- Understanding Availability
- Understanding Scalability
- Scaling a Web Farm
- Summary
Availability and scalability are the two most important concepts to understand when planning a Web farm using .NET enterprise servers. The success of an online business depends on the site's ability to meet the needs of a customer the first time that customer browses the site. If a customer's initial experience consists of "site unavailable" messages and slow response times, confidence is lost, no money changes hands, and return visits are unlikely. While not the direct concern of the administrator, any business owner will eventually demand that the site be highly available and perform well. Success in this area requires a thorough understanding of how the .NET platform handles scalability and availability.
The goals of this chapter are to help you
- Understand the concept of availability as it relates to a Web farm. Concepts introduced include the availability rating as it relates to the network, servers, and applications in a Web farm. Using the availability rating as a metric for success in a Web farm is also covered.
- Understand the concept of scalability as it relates to a Web farm. How scaling helps handle increased user load is covered, along with using scale to become a Web farm.
- Understand the techniques used in scaling a Web farm to handle increased user load. Network load balancing, hardware load balancing, component load balancing, and Windows Server clusters are all introduced as means to achieve scalability and availability in a Web farm.
Understanding Availability
When a user browses to a Web site and encounters any error or unexpected message, this results in a perception problem. This is equivalent to a customer walking into a store and finding all the employees in disarray and nothing getting done. Offline businesses have expected hours for business operations. For online businesses, operating hours are 24 hours a day. The expectation is for an online business to accept money whenever a transaction is executed. This capability to accept transactions 24 hours a day is called availability.
When a site is unavailable, it is said to have taken an availability "hit." When a site takes an availability hit, it is said to be "down." It is helpful to divide availability hits into two categories: scheduled and unscheduled downtime.
Unscheduled downtime is any time that site availability is affected except during scheduled maintenance, upgrades, new software releases, or other planned outages.
The Dot9s Scale
Availability is the percentage of a given time period during which a site is up, with unscheduled downtime counted against it. This is referred to as the "dot9s" scale. A site is given an availability rating such as .9, which is "one nine," or .9999, which is "four nines." A "four nines" site is 99.99% available. The formula for calculating this percentage is 100 - (Unscheduled Downtime/Period of Time) x 100. For example, if a site is down for one hour in a day, the availability of that site for that day is 100 - (1/24) x 100 = 95.833%. If a site is down one hour in a week, the availability is 100 - (60/10,080) x 100 = 99.405%.
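The formula is simple enough to sketch as a small helper; the function below is illustrative, not part of any product discussed in this chapter. Downtime and period must be expressed in the same units.

```python
def availability(downtime, period):
    """Availability percentage: 100 - (unscheduled downtime / period) * 100.

    Both arguments must use the same unit (hours, minutes, ...).
    """
    return 100 - (downtime / period) * 100

# One hour of downtime in a 24-hour day:
print(round(availability(1, 24), 3))            # -> 95.833
# Sixty minutes of downtime in a week (10,080 minutes):
print(round(availability(60, 7 * 24 * 60), 3))  # -> 99.405
```

Note that the period chosen matters: the same one hour of downtime scores 95.833% against a single day but 99.405% against a week.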
Is 100% Availability Realistic?
For a Web farm to obtain 100% availability, it cannot have unscheduled downtime. This should always be the goal of any Web farm, but it is unlikely. From the availability formula, a "six nines" site can have only 31.5 seconds of unscheduled downtime in a year, barely long enough to replace a network cable! This availability comes at considerable cost in redundant systems, and most businesses accept one or two nines as the goal. Table 3.1 shows some availability measurements.
Table 3.1 Availability Measurements
| Percent Available | Downtime Per Year |
| --- | --- |
| 90% | 36.5 Days |
| 95% | 18.25 Days |
| 98% | 7.3 Days |
| 99% | 3.65 Days |
| 99.9% | 8.76 Hours |
| 99.99% | 52.56 Minutes |
| 99.999% | 315 Seconds |
| 99.9999% | 31.5 Seconds |
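The downtime figures in Table 3.1 follow directly from the availability formula. A quick sketch to reproduce them, assuming a non-leap year of 365 days:

```python
YEAR_SECONDS = 365 * 24 * 3600  # 31,536,000 seconds in a non-leap year

for pct in (90, 95, 98, 99, 99.9, 99.99, 99.999, 99.9999):
    down = (100 - pct) / 100 * YEAR_SECONDS  # allowed downtime in seconds
    if down >= 86400:
        print(f"{pct}%  ->  {down / 86400:g} days")
    elif down >= 3600:
        print(f"{pct}%  ->  {down / 3600:g} hours")
    elif down >= 60:
        print(f"{pct}%  ->  {down / 60:g} minutes")
    else:
        print(f"{pct}%  ->  {down:g} seconds")
```

Each added nine cuts the allowed downtime by a factor of ten, which is why the cost of redundancy climbs so steeply toward the bottom of the table.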
Assessing a Web Farm's Availability
Availability measurements are useful only if an organization uses them properly. A successful strategy for considering availability statistics is to break the measurements into categories. These categories will differ for each Web farm, but basic categories include server availability, application availability, and network availability.
Because these categories are largely independent, each measurement can be considered separately. This separation provides hard numbers for deciding where money and resources should be spent to improve availability. If network availability is low, it doesn't make sense to invest time and effort in solving application availability problems.
All complex systems fail. Any failure can affect a Web farm's availability rating if the failure has no contingency. Realistic availability goals are achievable through redundancy. Redundancy is the first line of defense for any site failure. In hardware systems, redundancy comes from complete copies of critical components. For example, a system with a redundant CPU has a spare CPU that is not used unless the primary CPU fails. If this failure occurs, the spare CPU takes over, enabling the system to continue processing without missing an instruction. With software systems like SQL Server 2000, redundancy comes from a cluster of servers, each capable of owning the database of the others. Redundancy eliminates single points of failure in complex systems. Each availability area has different ways to solve this problem.
Monitoring is the final step in assessing the availability of a Web farm. Without monitoring, it is not possible to determine when a component fails. Each area of a Web farm has different methodologies for error reporting.
Understanding Network Availability
The majority of network systems in a Web farm are physical. When building a network for high availability, redundancy must be considered at every level. Every point in the network should have a backup. From the connection, to the Internet, to the routers that move traffic throughout the farm, each level must be considered. Not every system in a Web farm network has to be redundant, however, and a cost-benefit analysis is the best way to determine at which points to build in redundancy.
For example, Foo.com has a front-end router that handles traffic from the Internet. If this router fails, the site is down. However, router failures are rare, and Foo.com decides to take a calculated risk and not keep a cold spare sitting unused; a four-hour response-time agreement with the router manufacturer makes that downtime acceptable if it occurs. However, Foo.com decides that redundant connections to the Internet are important, so it spends the extra money to maintain two connections to the Internet at all times. Each business will make these decisions differently.
The most common area of failure in any network system is at the cable level. It is usually cost prohibitive to have redundant network cables between every point in a network. Luckily, it is relatively easy to diagnose and replace a faulty cable in a network.
Monitoring the network for failures is best handled with vendor-specific tools. Tools exist today that take the pulse of the entire network, ensuring that each connection is functioning correctly, and alert an administrator when a failure occurs.
NOTE
More information for building Web farm networks is found in Chapter 4, "Planning a Web Farm Network."
Understanding Server Availability
Single points of failure in server hardware include CPUs, hard drives, motherboards, and network cards. Hardware redundancy is purely a cost issue. Available on the market today for a price are servers that have three levels of redundancy for every internal system. Some businesses will invest considerable dollars to achieve complete hardware redundancy.
With .NET enterprise servers it is possible to create redundant configurations without the need for extreme hardware redundancy. The .NET platform eliminates single points of failure by providing simple software solutions so that multiple servers can handle the same task. With .NET, an administrator can designate two or more servers to provide redundancy and increased availability at the server level. This availability creates a natural pathway to scalability for a Web farm.
Monitoring server availability is best accomplished with the server vendor's tools. The monitoring software available for a hardware platform should be a factor in the decision to standardize on that platform. These centralized monitoring tools can inform an administrator when any critical component of a server fails, from the CPU to the power supply. Without such tools, monitoring for hardware failures is a crapshoot. If the drivers for a piece of hardware log errors to the NT event log, then tools like Microsoft Health Monitor can provide an alert. Watching the NT event log with Health Monitor is a way to alert on failures in redundant software systems like Windows Cluster Server and network load balancing.
NOTE
More information on Health Monitor is available in Chapter 14, "Monitoring a Web Farm with Application Center 2000."
Understanding Application Availability
A more subtle aspect of building highly available systems is application availability. Application availability measures how the functions and features of a specific application perform throughout the Web site lifecycle. By gauging application availability, actual uptime is measurable. If any functional portion of a site, such as credit card authorization, must function to complete a transaction, then that portion's availability affects the overall site availability measurement. Even if the Web site itself is successfully providing content, if a transaction fails at any point, the site is unavailable.
By considering application availability, a new type of thinking emerges for application design. While load-balancing techniques help eliminate single points of failure in physical systems, software single points of failure are a more difficult problem to solve. With the credit card example, the application must be robust enough to continue the transaction either by deferring credit card authorization or switching to a different credit card service. Availability planning for credit card processing must consider redundant connections to a lending institution and load-balanced redundancy for credit card processing servers, and it must also provide service-level redundancy.
Building software systems that have this level of redundancy is a unique challenge. Each application will have different requirements. However, at the fundamental design level there are a few key points to consider when orchestrating an application availability solution.
- Create software systems with well-defined boundaries. In the credit card example, it should be easy to tell when an application has entered the credit card processing engine. This enables an application to drop in a different system as long as the boundaries into that system look the same.
- When an application data path leaves the boundaries of a Web farm, that path is a good candidate for application availability planning. For instance, if an application makes a WAN connection to an external bank, this connection is by definition not under the control and management of the IT staff that manages the Web farm. In situations like this, alternative mechanisms from the Network layer to the Application layer should be thought through to provide the highest level of redundancy. This may mean purchasing a second WAN connection of lesser speed and cost and having an agreement for credit card processing with a second bank.
- For transaction-oriented systems, build into the architecture a way to fall back to a batch-oriented processing engine. In the credit card example, the same information would be gathered to process a credit card authorization; it just would not happen in real time. The functionality of the application would be reduced in batch mode, but at least the transaction could be completed at a later time. It is much better to switch to batch-oriented processing than to tell a customer to come back later. Later may never come.
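The points above can be illustrated with a rough sketch. All names here are hypothetical, not from any real payment library: a primary authorization service is tried first, then a secondary provider, and finally the order is deferred to a batch queue rather than rejecting the customer.

```python
from typing import Callable

# Hypothetical authorization callables. Each takes an order dict, returns
# True on successful authorization, and raises on failure. In a real farm
# these would wrap calls to external credit card services.
def authorize_with(service: Callable[[dict], bool], order: dict) -> bool:
    """Invoke one authorization service, treating any exception as failure."""
    try:
        return service(order)
    except Exception:
        return False

def process_payment(order: dict, primary, secondary, batch_queue: list) -> str:
    """Try real-time authorization; defer to batch processing as a last resort."""
    if authorize_with(primary, order):
        return "authorized"
    if authorize_with(secondary, order):   # agreement with a second institution
        return "authorized"
    batch_queue.append(order)              # complete the transaction later
    return "deferred"
```

Because every caller goes through `process_payment`, the boundary into the credit card engine stays fixed; swapping in a different provider only changes the callables passed in, which is the well-defined-boundary idea in miniature.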
The most complicated monitoring problem is application availability monitoring. Many applications rely on customer complaints to reveal when critical systems fail, and most application monitoring efforts are custom built. Delivery schedules for critical application components should include time to build and implement the appropriate monitoring. In some cases, tools can generate replay scripts that simulate a user on a Web site; these can test the functionality of a site and report when errors occur. Even so, for all but the simplest sites, such tools will likely demand a full-time resource to manage and maintain the scripts.
Understanding Scheduled Downtime
Scheduled downtime should be the only reason a site becomes unavailable. Whenever a site has a scheduled release, hardware upgrade, planned ISP outage, or other required downtime scenario, this is scheduled downtime. Consider these downtimes as a separate measurement from unscheduled downtime. While it is important to improve this availability number, the goal should stay within what is reasonable in today's Web farms. New .NET Enterprise technologies, like Application Center 2000, will help improve the scheduled availability rating. Most of the improvements are to be made by improving the process for releasing new features into the production Web farm.
Measuring Overall Availability
When measuring overall availability, a business should consider total downtime as a separate measurement from total unscheduled downtime. Each area of availability is calculated separately to help direct where efforts to improve availability should be made. Keep a running total of the network, server, application, unscheduled, and scheduled availabilities. Combine the network, server, and application availability to create the unscheduled availability quotient. Combine all the availability ratings to create the overall availability rating. Table 3.2 has the availability ratings and goals for Foo.com, based on a six-month period.
Table 3.2 Foo.com's Availability Goals for Six Months
| Availability Type | Availability Goal | Downtime | Availability Measurement |
| --- | --- | --- | --- |
| Network | 99.9 | 1 hour | 99.97% |
| Server | 99.9 | 5 hours | 99.88% |
| Application | 99.9 | 10 hours | 99.77% |
| Scheduled | 99 | 70 hours | 98.39% |
| Unscheduled | 99.9 | 17 hours | 99.61% |
| Overall | 99 | 87 hours | 98% |
Foo.com does a good job of preventing unscheduled downtime: it is only 0.3% away from its stated unscheduled availability goal. However, that 0.3% represents approximately 12 hours of downtime. To achieve the goal, application availability problems need to be addressed. To improve overall availability, Foo.com needs to reduce the time it takes to perform scheduled maintenance tasks; it must make up 43 hours to reach its overall availability goal of 99%.
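The measurements in Table 3.2 can be reproduced from the downtime hours, taking the six-month period as roughly half of an 8,760-hour year. The computed values agree with the table to within rounding:

```python
PERIOD_HOURS = 365 * 24 / 2  # six months, approximately 4,380 hours

def availability_pct(downtime_hours):
    """Percent availability over the six-month period."""
    return 100 - downtime_hours / PERIOD_HOURS * 100

# Downtime hours for each availability category from Table 3.2.
for area, down in [("Network", 1), ("Server", 5), ("Application", 10),
                   ("Scheduled", 70), ("Unscheduled", 17), ("Overall", 87)]:
    print(f"{area:<12} {down:>3} hours down -> {availability_pct(down):.2f}%")
```

Running numbers like these for each category is what makes the separate availability measurements actionable: the 70 hours of scheduled downtime clearly dominates Foo.com's 87-hour total.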