Java

Capacity Assessment

Last updated Jul 15, 2005.

Many people throw around the terms capacity assessment, capacity planning, trending, and forecasting without really understanding what they mean. "Capacity planning," as commonly used, usually refers to that time when applications cease satisfying service-level agreements, and the speaker is forced to buy more hardware.

I have long preached that additional hardware is usually not the right solution (although if you have enough funds it is an effective one). By proactively implementing a systematic methodology to learn the capacity of your environment, you can avoid this reactive troubleshooting and make educated decisions that are specific to your environment, before problems affect your end-users.

A capacity assessment is more than a load test. You need the following components before you are ready to perform a capacity assessment:

With all of these capabilities in hand, it is time to start loading your application. Configure your load generator to generate your expected usage in a reasonable amount of time. That could be as short as 10 minutes, or an hour or more, depending on the user behavior that you have observed in your production environment. While you are increasing load to the expected usage, capture the response times of your service requests, and evaluate them against their service-level agreements.

Once you reach your expected user load, it is time to determine the size of the steps that you want to monitor. The size of a step is the measurable increase in user load between sampling intervals; it defines the granularity, and therefore the accuracy, of your capacity assessment. For example, imagine that your expected user load is 1,000 users; you might define a step as 25 or 50 users. Pick a time interval in which to increase steps, and record the response times of your service requests at these intervals.
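
As a minimal sketch of that stepping, the following builds the list of user-load levels to sample, starting at the expected load. The class name, the method name, and the 1,250-user ceiling are illustrative assumptions and do not come from any particular load-generation tool.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: builds the user-load levels to sample once the expected load is reached. */
final class StepSchedule {

    /**
     * Returns the load levels from expectedLoad up to maxLoad (inclusive),
     * increasing by stepSize users at each sampling interval.
     */
    static List<Integer> loadLevels(int expectedLoad, int stepSize, int maxLoad) {
        if (stepSize <= 0) {
            throw new IllegalArgumentException("stepSize must be positive");
        }
        List<Integer> levels = new ArrayList<>();
        for (int load = expectedLoad; load <= maxLoad; load += stepSize) {
            levels.add(load);
        }
        return levels;
    }

    public static void main(String[] args) {
        // Expected load of 1,000 users, stepping by 50 up to a hypothetical ceiling of 1,250.
        System.out.println(loadLevels(1000, 50, 1250));
    }
}
```

A smaller step gives a finer-grained answer about where capacity lies, at the cost of a longer test.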

Continue this pattern for each service request until the response time of each exceeds its SLA. Note this time and start recording response times at a tighter interval.

The purpose of tightening the sampling interval is to better identify how a service request degrades after it has reached its capacity. From these degrading figures, we want to plot the response times to determine the order of the degradation: is it linear, exponential, or worse?
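
One rough way to judge the order of the degradation, short of plotting it, is to compare how well a straight line fits the raw response times against how well it fits their logarithms. The sketch below is a heuristic under that assumption, not a statistical test; it only distinguishes linear from exponential, and the class name and sample values are hypothetical.

```java
/** Sketch: guesses whether response times degrade linearly or exponentially with load. */
final class DegradationOrder {

    /** R^2 of a least-squares line through (x, y); equals the squared Pearson correlation. */
    static double rSquared(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXX = 0, sumYY = 0, sumXY = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXX += x[i] * x[i];
            sumYY += y[i] * y[i];
            sumXY += x[i] * y[i];
        }
        double cov = n * sumXY - sumX * sumY;
        double varX = n * sumXX - sumX * sumX;
        double varY = n * sumYY - sumY * sumY;
        return (cov * cov) / (varX * varY);
    }

    /**
     * Compares a straight-line fit of response time against user load with a
     * straight-line fit of log(response time) against user load.
     * Whichever fits better names the pattern. Response times must be positive.
     */
    static String classify(double[] users, double[] responseMs) {
        double[] logResponse = new double[responseMs.length];
        for (int i = 0; i < responseMs.length; i++) {
            logResponse[i] = Math.log(responseMs[i]);
        }
        double linearFit = rSquared(users, responseMs);
        double exponentialFit = rSquared(users, logResponse);
        return exponentialFit > linearFit ? "exponential" : "linear";
    }

    public static void main(String[] args) {
        double[] users = {1500, 1600, 1700, 1800, 1900};
        double[] doubling = {100, 200, 400, 800, 1600}; // hypothetical samples in ms
        System.out.println(classify(users, doubling));
    }
}
```

The tighter the sampling interval after the SLA is missed, the more points this comparison has to work with.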

Understand the implications of missing our service-level agreements. For example, if we miss our service-level agreement at 1,500 users, but only increase our response time by 50% over the next 500 users, is that better than if our response times triple every 100 users and then the entire application server crashes at 1,800 users? This helps us understand and mitigate the risk of changes in user behavior.
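
To put numbers on that comparison: in the first scenario the response time at 2,000 users is 1.5 times what it was when the SLA was missed, while in the second it is 3 x 3 x 3 = 27 times by 1,800 users. The sketch below works this out, assuming the gradual scenario grows linearly between measurements; the class and method names are hypothetical.

```java
/** Sketch: expresses the two degradation scenarios as multiples of the response time at 1,500 users. */
final class DegradationScenarios {

    static final int SLA_MISSED_AT = 1500; // user load at which the SLA is first missed

    /** Gradual scenario: response time grows 50% per 500 extra users (linear growth assumed). */
    static double gradualMultiplier(int users) {
        return 1.0 + 0.5 * (users - SLA_MISSED_AT) / 500.0;
    }

    /** Steep scenario: response time triples for every 100 extra users. */
    static double triplingMultiplier(int users) {
        return Math.pow(3.0, (users - SLA_MISSED_AT) / 100.0);
    }

    public static void main(String[] args) {
        System.out.println("Gradual at 2,000 users:  " + gradualMultiplier(2000));
        System.out.println("Tripling at 1,800 users: " + triplingMultiplier(1800));
    }
}
```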

For each service request, we compile this information and note the capacity of our application at the lowest common denominator: the service request that first consistently misses its service-level agreement. The capacity analysis report follows that figure with a section that describes the behavior of the degrading application. From this report, business owners can determine when they require additional resources.
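
The "lowest common denominator" can be sketched as follows. Treating "consistently" as a fixed number of consecutive samples over the SLA is an assumption made for illustration, as are the request names and sample values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: finds the load at which the first service request consistently misses its SLA. */
final class CapacityFinder {

    /**
     * Returns the load at the start of the first run of `consecutive` samples
     * whose response time exceeds slaSec, or -1 if there is no such run.
     */
    static int firstConsistentMiss(int[] loads, double[] responseSec, double slaSec, int consecutive) {
        int run = 0;
        for (int i = 0; i < loads.length; i++) {
            run = responseSec[i] > slaSec ? run + 1 : 0;
            if (run == consecutive) {
                return loads[i - consecutive + 1];
            }
        }
        return -1;
    }

    /** Application capacity: the smallest first-miss load across all service requests, or -1. */
    static int applicationCapacity(Map<String, Integer> firstMissByRequest) {
        int capacity = -1;
        for (int load : firstMissByRequest.values()) {
            if (load >= 0 && (capacity < 0 || load < capacity)) {
                capacity = load;
            }
        }
        return capacity;
    }

    public static void main(String[] args) {
        int[] loads = {1400, 1450, 1500, 1550, 1600};
        Map<String, Integer> firstMiss = new LinkedHashMap<>();
        // Hypothetical requests with SLAs of 2.0 and 3.0 seconds.
        firstMiss.put("login", firstConsistentMiss(loads, new double[] {1.8, 2.1, 1.9, 2.2, 2.5}, 2.0, 2));
        firstMiss.put("search", firstConsistentMiss(loads, new double[] {2.0, 2.4, 3.2, 3.3, 3.6}, 3.0, 2));
        System.out.println(firstMiss + " -> capacity " + applicationCapacity(firstMiss));
    }
}
```

Requiring a run of misses keeps a single slow sample from being reported as the capacity of the application.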

While this is going on, you need to monitor the utilization of your application server and operating system resources. You need to know the utilization of your thread pools, heap, JDBC connection pools, other back-end resource connection pools (e.g. JCA and JMS), pools and caches, as well as CPU, physical memory, disk I/O, and network activity.
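
Part of this can be sampled from inside the JVM with the standard java.lang.management API, as in the sketch below. It covers only heap, live threads, and system load average; thread pools, JDBC pools, and the other container-managed resources are exposed through application-server-specific interfaces that are not shown here. The class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;

/** Sketch: samples a few JVM-level utilization figures through the platform MXBeans. */
final class JvmResourceSnapshot {

    /** Heap currently in use, in bytes. */
    static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** Heap utilization between 0.0 and 1.0, or -1.0 if the maximum heap size is undefined. */
    static double heapUtilization() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getMax() > 0 ? (double) heap.getUsed() / heap.getMax() : -1.0;
    }

    /** Number of live threads in this JVM, daemon and non-daemon. */
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    /** One-minute system load average per processor, or -1.0 where the platform does not report it. */
    static double loadPerProcessor() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();
        return load < 0 ? -1.0 : load / os.getAvailableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("heap utilization:   " + heapUtilization());
        System.out.println("live threads:       " + liveThreads());
        System.out.println("load per processor: " + loadPerProcessor());
    }
}
```

Recording these figures at the same intervals as the response times lets you line the two up afterwards.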

Figure 70. The relationship between user load, service request response time, and resource utilization

Figure 70 relates the user load with service request response time and resource utilization. As you can see, as user load increases, response time increases slowly, and resource utilization increases almost linearly. This is because the more work you ask your application to do, the more resources it needs.

Once the resource utilization is close to 100%, however, an interesting thing happens: response time degrades along an exponential curve. This point in the capacity assessment is referred to as the saturation point. The saturation point is the point where all performance criteria are abandoned and utter panic ensues. Your goal in performing a capacity assessment is to ensure that you know where this point is, and that you never reach it: you will tune the system or add additional hardware well before this load occurs.
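
Given utilization samples recorded at each load step, a first estimate of the saturation point is the lowest load at which any monitored resource comes close to 100%. The sketch below assumes "close to" means a 95% threshold, which is an arbitrary choice for illustration; the class name and sample figures are also hypothetical.

```java
/** Sketch: estimates the saturation point from utilization samples taken at each load step. */
final class SaturationPoint {

    /**
     * utilization[r][i] is the utilization (0.0 to 1.0) of resource r at loads[i].
     * Returns the first load at which any resource reaches the threshold, or -1 if none does.
     */
    static int find(int[] loads, double[][] utilization, double threshold) {
        for (int i = 0; i < loads.length; i++) {
            for (double[] resource : utilization) {
                if (resource[i] >= threshold) {
                    return loads[i];
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] loads = {1000, 1250, 1500, 1750, 2000};
        double[] cpu = {0.45, 0.58, 0.70, 0.83, 0.97};
        double[] threadPool = {0.40, 0.55, 0.72, 0.96, 1.00};
        // 0.95 is an assumed cut-off for "close to 100%".
        System.out.println(find(loads, new double[][] {cpu, threadPool}, 0.95));
    }
}
```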

A formal capacity assessment report, therefore, includes the following:

After you gather all the data and identify all of the key points (e.g. met SLA, exceeded SLA, degradation pattern, saturation point), it is time to perform deeper analysis and generate recommendations. Try to classify your application into one of the following categories: extremely under-utilized, under-utilized, nearing capacity, over-utilized, or extremely over-utilized.

In the extremely under-utilized system, you may consider reducing hardware and servers to conserve licensing costs. This determination can only be made after interviews with application business owners to determine if the additional capacity is required.

In the under-utilized system, you can sleep well at night. Your environment can support any reasonable additional load that it might receive, but it is not so under-utilized that you would want to cut resources.

For the nearing-capacity system, you should spend some hard time with the application business owner to determine expected changes in user behavior, forecasted changes in usage patterns, planned promotions, etc. to decide if additional resources are required.

In the over-utilized system, you need more resources. But at this point, it is still the decision of the application business owner: how badly is the application missing its SLAs? What is the degradation pattern? Is the current application behavior acceptable? Are projected changes in usage patterns going to significantly degrade application performance?

In the extremely over-utilized system, you have undoubtedly received user complaints and are in that state of utter panic that I alluded to earlier. You need significant tuning and possibly additional resources such as hardware to save your users!
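
The five categories above could be derived mechanically from the measured figures, as in the sketch below. The cut-offs of 50% and 75% of capacity are assumptions chosen purely for illustration; the text does not prescribe thresholds, and the final decision still belongs to the application business owner.

```java
/** Sketch: maps current load, capacity, and saturation point onto the five categories. */
final class UtilizationClassifier {

    enum Category {
        EXTREMELY_UNDER_UTILIZED,
        UNDER_UTILIZED,
        NEARING_CAPACITY,
        OVER_UTILIZED,
        EXTREMELY_OVER_UTILIZED
    }

    /**
     * currentLoad:    production user load today
     * capacityLoad:   load at which the first service request misses its SLA
     * saturationLoad: load at which resource utilization approaches 100%
     */
    static Category classify(int currentLoad, int capacityLoad, int saturationLoad) {
        if (currentLoad >= saturationLoad) {
            return Category.EXTREMELY_OVER_UTILIZED;
        }
        if (currentLoad >= capacityLoad) {
            return Category.OVER_UTILIZED;
        }
        double ratio = (double) currentLoad / capacityLoad;
        if (ratio >= 0.75) { // assumed threshold
            return Category.NEARING_CAPACITY;
        }
        if (ratio >= 0.50) { // assumed threshold
            return Category.UNDER_UTILIZED;
        }
        return Category.EXTREMELY_UNDER_UTILIZED;
    }

    public static void main(String[] args) {
        // Hypothetical figures: capacity at 1,500 users, saturation at 1,800 users.
        System.out.println(classify(1300, 1500, 1800));
    }
}
```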

Performing a capacity assessment on your environment should be required for all of your application deployments and should occur at the end of significant application iterations. A proper capacity assessment captures the performance of your application at the current load, the capacity of your system (when the first service request exceeds its service-level agreement), the degradation patterns of your service requests, and the saturation point of your environment. From this information you can draw conclusions that trigger discussions about modifying your environment. Without a capacity assessment you are running blind, hoping that your applications won't fall over with the next promotion or during the holiday season. Urge your management to permit this exercise; the calm nights of sleep you will receive as a result will more than make up for the argument!