Benchmarking Enterprise and Cloud Systems Performance
- Nov 20, 2013
- There are lies, damn lies and then there are performance measures.
- —Anon et al., “A Measure of Transaction Processing Power” [Anon 85]
Benchmarking tests performance in a controlled manner, allowing choices to be compared and performance limits to be understood—before they are encountered in production. These limits may be system resources, software limits in a virtualized environment (cloud computing), or limits in the target application. Previous chapters have explored these components, describing the types of limits present and the tools used to analyze them.
Previous chapters have also introduced tools for micro-benchmarking, which investigate limits using simple artificial workloads. Other types of benchmarking include client workload simulations, which attempt to replicate a client usage pattern, and trace replays. Whichever type you use, it’s important to analyze the benchmark so that you can confirm what is being measured. Benchmarks tell you only how fast the system can run the benchmark; it’s up to you to understand the result and determine how it applies to your environment.
This chapter discusses benchmarking in general, providing advice and methodologies to help you avoid common mistakes and accurately test your systems. This is also useful background when you need to interpret the results from others, including vendor and industry benchmarks.
This section describes benchmarking activities, the attributes of effective benchmarks, and common mistakes, summarized as the “sins of benchmarking.”

12.1.1. Activities
Benchmarking may be performed for the following reasons:
- System design: comparing different systems, system components, or applications. For commercial products, benchmarking may provide data to aid a purchase decision, specifically the price/performance ratio of the available options. In some cases, results from published industry benchmarks can be used, which avoids the need for customers to execute the benchmarks themselves.
- Tuning: testing tunable parameters and configuration options, to identify those that are worth further investigation with the production workload.
- Development: for both non-regression testing and limit investigations during product development. Non-regression testing may be an automated battery of performance tests that run regularly, so that any performance regression can be discovered early and quickly matched to the product change. For limit investigations, benchmarking can be used to drive products to their limit during development, in order to identify where engineering effort is best spent to improve product performance.
- Capacity planning: determining system and application limits for capacity planning, either to provide data for modeling performance, or to find capacity limits directly.
- Troubleshooting: to verify that components can still operate at maximum performance, for example, testing maximum network throughput between hosts to check whether there may be a network issue.
- Marketing: determining maximum product performance for use by marketing (also called benchmarketing).
In enterprise environments, benchmarking during proof of concepts can be an important exercise before investing in expensive hardware and may be a process that lasts several weeks. This includes the time to ship, rack, and cable systems, and then to install operating systems before testing.
In cloud computing environments, resources are available on demand, without an expensive initial investment in hardware. These environments still, however, require some investment when choosing which application programming language to use, and which database, web server, and load balancer to run. Some of these choices can be difficult to change down the road. Benchmarking can be performed to investigate how well these choices can scale when required. The cloud computing model also makes benchmarking easy: a large-scale environment can be created in minutes, used for a benchmark run, and then destroyed, all at very little cost.
12.1.2. Effective Benchmarking
Benchmarking is surprisingly difficult to do well, with many opportunities for mistakes and oversights. As summarized by the paper “A Nine Year Study of File System and Storage Benchmarking” [Traeger 08]:
- In this article we survey 415 file system and storage benchmarks from 106 recent papers. We found that most popular benchmarks are flawed and many research papers do not provide a clear indication of true performance.
The paper also makes recommendations for what should be done; in particular, benchmark evaluations should explain what was tested and why, and they should perform some analysis of the system’s expected behavior.
The essence of a good benchmark has also been summarized as follows [Smaalders 06]:
- Repeatable: to facilitate comparisons
- Observable: so that performance can be analyzed and understood
- Portable: to allow benchmarking on competitors and across different product releases
- Easily presented: so that everyone can understand the results
- Realistic: so that measurements reflect customer-experienced realities
- Runnable: so that developers can quickly test changes
Another characteristic must be added when comparing different systems with the intent to purchase: the price/performance ratio. The price can be quantified as the five-year capital cost of the equipment [Anon 85].
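As a quick illustration, the price/performance ratio can be computed as the five-year capital cost divided by the measured throughput. A minimal sketch, with hypothetical costs and transaction rates:

```python
# Sketch: price/performance as a comparison metric, per [Anon 85].
# The costs and throughput figures below are hypothetical.

def price_performance(five_year_cost_usd: float, throughput_tps: float) -> float:
    """Dollars per transaction per second (lower is better)."""
    return five_year_cost_usd / throughput_tps

system_a = price_performance(500_000, 10_000)  # $50 per TPS
system_b = price_performance(300_000, 5_000)   # $60 per TPS
print(f"A: ${system_a:.2f}/TPS  B: ${system_b:.2f}/TPS")
```

Note that the cheaper system (B) has the worse price/performance ratio here: absolute price and price/performance can rank systems differently.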
Effective benchmarking is also about how you apply the benchmark: the analysis and the conclusions drawn.
When using benchmarks, you need to understand:
- What is being tested
- What the limiting factor or factors are
- Any perturbations that might affect the results
- What conclusions may be drawn from the results
These needs require a deep understanding of what the benchmark software is doing, how the system is responding, and how the results relate to the destination environment.
Given a benchmark tool and access to the system that runs it, these needs are best served by performance analysis of the system while the benchmark is running. A common mistake is to have junior staff execute the benchmarks, then to bring in performance experts to explain the results after the benchmark has completed. It is best to engage the performance experts during the benchmark so they can analyze the system while it is still running. This may include drill-down analysis to explain and quantify the limiting factor.
The following is an interesting example of analysis:
- As an experiment to investigate the performance of the resulting TCP/IP implementation, we transmitted 4 Megabytes of data between two user processes on different machines. The transfer was partitioned into 1024 byte records and encapsulated in 1068 byte Ethernet packets. Sending the data from our 11/750 to our 11/780 through TCP/IP takes 28 seconds. This includes all the time needed to set up and tear down the connections, for an user-user throughput of 1.2 Megabaud. During this time the 11/750 is CPU saturated, but the 11/780 has about 30% idle time. The time spent in the system processing the data is spread out among handling for the Ethernet (20%), IP packet processing (10%), TCP processing (30%), checksumming (25%), and user system call handling (15%), with no single part of the handling dominating the time in the system.
This describes checking the limiting factors (“the 11/750 is CPU saturated”), then explains details of the kernel components causing them. As an aside, being able to perform this analysis and summarize kernel time so neatly is an unusual skill today, even with advanced tools such as DTrace. This quote is from Bill Joy, while he was developing the original BSD TCP/IP stack in 1981!
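The quoted figures can be sanity-checked with simple arithmetic (assuming “4 Megabytes” means 4 × 1,048,576 bytes, and reading megabaud as Mbit/s):

```python
# Sanity check of the quoted result: 4 Mbytes transferred in 28 seconds
# should come out near the stated 1.2 megabaud (Mbit/s).
bytes_sent = 4 * 1024 * 1024
seconds = 28
throughput_mbit = bytes_sent * 8 / seconds / 1e6  # throughput in Mbit/s
print(f"{throughput_mbit:.2f} Mbit/s")  # ~1.20, matching the quote
```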
Apart from using a given benchmark tool, you may find it more effective to develop your own custom benchmark software, or at least custom load generators. These can be kept short, focusing on only what is needed for your test, making them quick to analyze and debug.
In some cases you don’t have access to the benchmark tool or the system, as when reading benchmark results from others. Consider the previous items based on the materials available, and, in addition, ask, What is the system environment? How is it configured? You may be permitted to ask the vendor to answer these questions as well. See Section 12.4, Benchmark Questions, for more vendor questions.
12.1.3. Benchmarking Sins
The following sections provide a quick checklist of specific issues to avoid, and how to avoid them. Section 12.3, Methodology, describes how to perform benchmarking.
Casual Benchmarking

To do benchmarking well is not a fire-and-forget activity. Benchmark tools provide numbers, but those numbers may not reflect what you think, and your conclusions about them may therefore be bogus.
- Casual benchmarking: you benchmark A, but actually measure B, and conclude you’ve measured C.
Benchmarking well requires rigor to check what is actually measured and an understanding of what was tested to form valid conclusions.
For example, many tools claim or imply that they measure disk performance but actually test file system performance. The difference between these two can be orders of magnitude, as file systems employ caching and buffering to substitute disk I/O with memory I/O. Even though the benchmark tool may be functioning correctly and testing the file system, your conclusions about the disks will be wildly incorrect.
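A minimal sketch of how this happens: a naive read test of a recently written file may be served entirely from the OS page cache, never touching disk. The file path and sizes here are arbitrary, for illustration only:

```python
# Sketch: why a naive "disk read" test can measure the file system cache
# instead. The file is written and then read twice; since the OS caches it,
# these reads may never touch disk at all. A real disk test must bypass the
# cache (e.g., direct I/O) or use a working set larger than main memory.
import os, tempfile, time

path = os.path.join(tempfile.mkdtemp(), "testfile")
with open(path, "wb") as f:
    f.write(os.urandom(16 * 1024 * 1024))  # 16 Mbytes of data

def timed_read(p):
    start = time.perf_counter()
    with open(p, "rb") as f:
        while f.read(1024 * 1024):  # read in 1 Mbyte chunks until EOF
            pass
    return time.perf_counter() - start

first, second = timed_read(path), timed_read(path)
print(f"first read: {first:.4f}s, reread: {second:.4f}s (likely both cached)")
```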
Understanding benchmarks is particularly difficult for the beginner, who has no instinct for whether numbers are suspicious or not. If you bought a thermometer that showed the temperature of the room you’re in as 1,000 degrees Fahrenheit, you’d immediately know that something was amiss. The same isn’t true of benchmarks, which produce numbers that are probably unfamiliar to you.
Blind Faith

It may be tempting to believe that a popular benchmarking tool is trustworthy, especially if it is open source and has been around for a long time. The misconception that popularity equals validity is known as argumentum ad populum logic (Latin for “appeal to the people”).
Analyzing the benchmarks you’re using is time-consuming and requires expertise to perform properly. And, for a popular benchmark, it may seem wasteful to analyze what surely must be valid.
The problem isn’t even necessarily with the benchmark software—although bugs do happen—but with the interpretation of the benchmark’s results.
Numbers without Analysis
Bare benchmark results, provided with no analytical details, can be a sign that the author is inexperienced and has assumed that the benchmark results are trustworthy and final. Often, this is just the beginning of an investigation, and one that finds the results were wrong or confusing.
Every benchmark number should be accompanied by a description of the limit encountered and the analysis performed. I’ve summarized the risk this way:
- If you’ve spent less than a week studying a benchmark result, it’s probably wrong.
Much of this book focuses on analyzing performance, which should be carried out during benchmarking. In cases where you don’t have time for careful analysis, it is a good idea to list the assumptions that you haven’t had time to check and include them with the results, for example:
- Assuming the benchmark tool isn’t buggy
- Assuming the disk I/O test actually measures disk I/O
- Assuming the benchmark tool drove disk I/O to its limit, as intended
- Assuming this type of disk I/O is relevant for this application
This can become a to-do list, if the benchmark result is later deemed important enough to spend more effort on.
Complex Benchmark Tools
It is important that the benchmark tool not hinder benchmark analysis by its own complexity. Ideally, the program is open source so that it can be studied, and short enough that it can be read and understood quickly.
For micro-benchmarks, it is recommended to pick those written in the C programming language. For client simulation benchmarks, it is recommended to use the same programming language as the client, to minimize differences.
A common problem is one of benchmarking the benchmark—where the result reported is limited by the benchmark software itself. Complex benchmark suites can make this difficult to identify, due to the sheer volume of code to comprehend and analyze.
Testing the Wrong Thing
While there are numerous benchmark tools available to test a variety of workloads, many of them may not be relevant for the target application.
For example, a common mistake is to test disk performance—based on the availability of disk benchmark tools—even though the target environment workload is expected to run entirely out of file system cache and not be related to disk I/O.
Similarly, an engineering team developing a product may standardize on a particular benchmark and spend all its performance efforts improving performance as measured by that benchmark. If it doesn’t actually resemble customer workloads, however, the engineering effort will optimize for the wrong behavior [Smaalders 06].
A benchmark may have tested an appropriate workload once upon a time but hasn’t been updated for years and so is now testing the wrong thing. The article “Eulogy for a Benchmark” describes how a version of the SPEC SFS industry benchmark, commonly cited during the 2000s, was based on a customer usage study from 1986.
Ignoring Errors

Just because a benchmark tool produces a result doesn’t mean the result reflects a successful test. Some—or even all—of the requests may have resulted in an error. While this issue is covered by the previous sins, this one in particular is so common that it’s worth singling out.
I was reminded of this during a recent benchmark of web server performance. Those running the test reported that the average latency of the web server was too high for their needs: over one second, on average. Some quick analysis determined what went wrong: the web server did nothing at all during the test, as all requests were blocked by a firewall. All requests. The latency shown was the time it took for the benchmark client to time-out and error.
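A minimal sketch of checking for this: report the error rate alongside latency, and compute latency from successful requests separately. The result log format here is hypothetical:

```python
# Sketch: never report latency without the error rate. The request log
# format here is hypothetical: (status, latency_seconds) pairs, where a
# timed-out request is recorded with the client time-out of 1 second.
results = [
    ("ok", 0.012), ("ok", 0.011), ("error", 1.000),
    ("ok", 0.013), ("error", 1.000),
]

oks = [lat for status, lat in results if status == "ok"]
errors = [lat for status, lat in results if status != "ok"]

error_rate = len(errors) / len(results)
avg_all = sum(lat for _, lat in results) / len(results)  # skewed by time-outs
avg_ok = sum(oks) / len(oks)                             # successes only

print(f"error rate: {error_rate:.0%}")
print(f"avg latency (all): {avg_all:.3f}s, (successes only): {avg_ok:.3f}s")
```

With a 40% error rate in this example, the all-requests average is dominated by time-outs and says little about server performance.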
Ignoring Variance

Benchmark tools, especially micro-benchmarks, often apply a steady and consistent workload, based on the average of a series of real-world measurements, such as those taken at different times of day or over a measurement interval. For example, a disk workload may be found to have average rates of 500 reads/s and 50 writes/s. A benchmark tool may then either simulate those rates, or simulate the 10:1 read/write ratio so that higher rates can be tested.
This approach ignores variance: the rate of operations may be variable. The types of operations may also vary, and some types may occur orthogonally. For example, writes may be applied in bursts every 10 s (asynchronous write-back data flushing), whereas synchronous reads are steady. Bursts of writes may cause real issues in production, such as by queueing the reads, but are not simulated if the benchmark applies steady average rates.
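To make this concrete: a steady workload and a bursty workload can have identical average rates. This sketch contrasts 50 writes/s applied steadily with 500-write bursts every 10 seconds (the rates are hypothetical):

```python
# Sketch: two write workloads with the same average rate but very different
# behavior. A benchmark applying only the steady schedule would never
# reproduce the queueing caused by the bursts.
steady = [50] * 60                                       # writes/s, per second
bursty = [500 if t % 10 == 0 else 0 for t in range(60)]  # write-back flushes

print(f"total writes: steady={sum(steady)}, bursty={sum(bursty)}")  # equal
print(f"peak rate: steady={max(steady)}/s, bursty={max(bursty)}/s") # 10x apart
```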
Ignoring Perturbations

Consider what external perturbations may be affecting results. Will a timed system activity, such as a system backup, execute during the benchmark run? For the cloud, a perturbation may be caused by unseen tenants on the same system.
A common strategy for ironing out perturbations is to make the benchmark runs longer—minutes instead of seconds. As a rule, the duration of a benchmark should not be shorter than one second. Short tests might be unusually perturbed by device interrupts (pinning the thread while performing interrupt service routines), kernel CPU scheduling decisions (waiting before migrating queued threads to preserve CPU affinity), and CPU cache warmth effects. Try running the benchmark test several times and examining the standard deviation. This should be as small as possible, to ensure repeatability.
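The run-to-run variation can be quantified with the coefficient of variation (standard deviation divided by the mean); the run times below are hypothetical, and the 5% threshold is an illustrative rule of thumb, not a standard:

```python
# Sketch: running a benchmark several times and checking run-to-run variance.
# A high coefficient of variation suggests perturbations; the run times
# (seconds) below are hypothetical, with one perturbed run.
from statistics import mean, stdev

runs = [61.2, 59.8, 60.5, 74.9, 60.1]

cv = stdev(runs) / mean(runs)  # coefficient of variation
print(f"mean: {mean(runs):.1f}s  stdev: {stdev(runs):.1f}s  CoV: {cv:.1%}")
if cv > 0.05:  # illustrative threshold
    print("high variance: investigate perturbations before trusting the result")
```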
Also collect data so that perturbations, if present, can be studied. This might include collecting the distribution of operation latency—not just the total runtime for the benchmark—so that outliers can be seen and their details recorded.
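For example, recording per-operation latencies makes an outlier obvious where the mean alone would hide it. The values and the crude nearest-rank percentile function below are illustrative:

```python
# Sketch: per-operation latencies (hypothetical, in milliseconds) expose an
# outlier that the mean would obscure. percentile() is a crude nearest-rank
# implementation, for illustration only.
latencies = [1.1, 1.2, 1.0, 1.3, 1.1, 1.2, 250.0, 1.1, 1.0, 1.2]

def percentile(data, p):
    s = sorted(data)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

print(f"mean: {sum(latencies) / len(latencies):.1f} ms")  # skewed by outlier
print(f"p50: {percentile(latencies, 50):.1f} ms, max: {max(latencies):.1f} ms")
```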
Changing Multiple Factors
When comparing benchmark results from two tests, be careful to understand all the factors that are different between the two.
For example, if two hosts are benchmarked over the network, is the network between them identical? What if one host was more hops away, over a slower network, or over a more congested network? Any such extra factors could make the benchmark result bogus.
In the cloud, benchmarks are sometimes performed by creating instances, testing them, and then destroying them. This creates the potential for many unseen factors: instances may be created on faster or slower systems, or on systems with higher load and contention from other tenants. It is recommended to test multiple instances and take the average (or better, record the distribution) to avoid outliers caused by testing one unusually fast or slow system.
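A sketch of this approach, comparing the mean and median across several hypothetical instances, one of which is unusually slow:

```python
# Sketch: benchmarking several cloud instances of the same type and
# recording the distribution, not a single sample. The throughputs (MB/s)
# are hypothetical; one instance suffers contention from other tenants.
from statistics import mean, median

per_instance = [412, 405, 398, 181, 409]

print(f"mean: {mean(per_instance):.0f} MB/s, median: {median(per_instance):.0f} MB/s")
print(f"range: {min(per_instance)}-{max(per_instance)} MB/s")
```

Here the median is closer to typical performance than the mean, which is dragged down by the one slow instance; recording the full range also shows the outlier directly.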
Benchmarking the Competition
Your marketing department would like benchmark results showing how your product beats the competition. This is usually a bad idea, for reasons I’m about to explain.
When customers pick a product, they don’t use it for 5 minutes; they use it for months. During that time, they analyze and tune the product for performance, perhaps shaking out the worst issues in the first few weeks.
You don’t have a few weeks to spend analyzing and tuning your competitor. In the time available, you can only gather untuned—and therefore unrealistic—results. The customers of your competitor—the target of this marketing activity—may well see that you’ve posted untuned results, so your company loses credibility with the very people it was trying to impress.
If you must benchmark the competition, you’ll want to spend serious time tuning their product. Analyze performance using the techniques described in earlier chapters. Also search for best practices, customer forums, and bug databases. You may even want to bring in outside expertise to tune the system. Then make the same effort for your own company before you finally perform head-to-head benchmarks.
Friendly Fire

When benchmarking your own products, make every effort to ensure that the top-performing system and configuration have been tested, and that the system has been driven to its true limit. Share the results with the engineering team before publication; they may spot configuration items that you have missed. And if you are on the engineering team, be on the lookout for benchmark efforts—either from your company or from contracted third parties—and help them out.
Consider this hypothetical situation: An engineering team has worked hard to develop a high-performing product. Key to its performance is a new technology that they have developed that has yet to be documented. For the product launch, a benchmark team has been asked to provide the numbers. They don’t understand the new technology (it isn’t documented), they misconfigure it, and then they publish numbers that undersell the product.
Sometimes the system may be configured correctly but simply hasn’t been pushed to its limit. Ask the question, What is the bottleneck for this benchmark? This may be a physical resource, such as CPUs, disks, or an interconnect, that has been driven to 100% and can be identified using analysis. See Section 12.3.2, Active Benchmarking.
Another friendly fire issue arises when benchmarking older versions of the software that have performance issues fixed in later releases, or when testing on whatever limited equipment happens to be available, producing a result that is not the best possible (as may be expected of a company benchmark).
Misleading Benchmarks

Misleading benchmark results are common in the industry. They are often the result of either unintentionally limited information about what the benchmark actually measures or deliberately omitted information. Often the benchmark result is technically correct but is then misrepresented to the customer.
Consider this hypothetical situation: A vendor achieves a fantastic result by building a custom product that is prohibitively expensive and would never be sold to an actual customer. The price is not disclosed with the benchmark result, which focuses on non–price/performance metrics. The marketing department liberally shares an ambiguous summary of the result (“We are 2x faster!”), associating it in customers’ minds with either the company in general or a product line. This is a case of omitting details in order to favorably misrepresent products. While it may not be cheating—the numbers are not fake—it is lying by omission.
Such vendor benchmarks may still be useful for you as upper bounds for performance. They are values that you should not expect to exceed (with an exception for cases of friendly fire).
Consider this different hypothetical situation: A marketing department has a budget to spend on a campaign and wants a good benchmark result to use. They engage several third parties to benchmark their product and pick the best result from the group. These third parties are not picked for their expertise; they are picked to deliver a fast and inexpensive result. In fact, non-expertise might be considered advantageous: the greater the results deviate from reality, the better. Ideally one of them deviates greatly in a positive direction!
When using vendor results, be careful to check the fine print for what system was tested, what disk types were used and how many, what network interfaces were used and in which configuration, and other factors. For specifics to be wary of, see Section 12.4, Benchmark Questions.
Benchmark Specials

A type of sneaky activity—which in the eyes of some is considered a sin and thus prohibited—is the development of benchmark specials. This is when the vendor studies a popular or industry benchmark, and then engineers the product so that it scores well, while disregarding actual customer performance. This is also called optimizing for the benchmark.
The notion of benchmark specials became known in 1993 with the TPC-A benchmark, as described on the Transaction Processing Performance Council (TPC) history page:
- The Standish Group, a Massachusetts-based consulting firm, charged that Oracle had added a special option (discrete transactions) to its database software, with the sole purpose of inflating Oracle’s TPC-A results. The Standish Group claimed that Oracle had “violated the spirit of the TPC” because the discrete transaction option was something a typical customer wouldn’t use and was, therefore, a benchmark special. Oracle vehemently rejected the accusation, stating, with some justification, that they had followed the letter of the law in the benchmark specifications. Oracle argued that since benchmark specials, much less the spirit of the TPC, were not addressed in the TPC benchmark specifications, it was unfair to accuse them of violating anything.
TPC added an anti-benchmark special clause:
- All “benchmark special” implementations that improve benchmark results but not real-world performance or pricing, are prohibited.
As TPC is focused on price/performance, another strategy to inflate numbers can be to base them on special pricing—deep discounts that no customer would actually get. Like special software changes, the result doesn’t match reality when a real customer purchases the system. TPC has addressed this in its price requirements:
- TPC specifications require that the total price must be within 2% of the price a customer would pay for the configuration.
While these examples may help explain the notion of benchmark specials, TPC addressed them in its specifications many years ago, and you shouldn’t necessarily expect them today.
Cheating

The last sin of benchmarking is cheating: sharing fake results. Fortunately, this is either rare or nonexistent; I’ve not seen a case of purely made-up numbers being shared, even in the most bloodthirsty of benchmarking battles.