3.3 Interpretation and Analysis
After the measurements have been made, the large volume of collected data must be reviewed. Some performance tools present the data as tables of numbers; others summarize the data in graphical form. Regardless of the presentation method, interpreting the data is complex. One must first understand what the metrics mean. Then, it is necessary to know how to interpret the numbers; in other words, which values indicate a performance problem. This is no easy task! Rule #1 comes into play here. A good value or a bad value for a given performance metric depends upon many factors.
From Bob's Consulting Log: I was involved in a sizing exercise where we had to determine what system was needed to support the application running 125 users with a response time of less than 5 seconds. We started out with a basic system and looked at the load average metric, which was 250. We also saw that the response time was well above 5 seconds (some transactions took 5 minutes!), and we were only simulating 75 users.
After upgrading the CPU to a faster one, we re-ran the test, and the load average dropped to 125. We were now able to simulate all 125 users, but response time was still unsatisfactory. Finally, we upgraded to the most powerful processor currently available. Now, the load average was 75. Most people would cringe at that number, saying response time should be terrible. However, all transactions completed in under 5 seconds for all 125 users. The moral is: Don't be scared by large numbers. What really matters is that you meet the performance requirements of the application.
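The moral of this account can be sketched as a check against the stated requirement rather than against the raw metric. The following is a minimal illustration only: the function name `meets_requirement` and the individual response times are hypothetical; only the load averages and the 5-second limit come from the account above.

```python
def meets_requirement(response_times_s, max_response_s=5.0):
    """Return True if every measured response time is under the limit."""
    return all(t < max_response_s for t in response_times_s)


# Hypothetical measurements for the first and last test runs.
# Only the load averages and the 5-second limit come from the text.
runs = [
    {"load_average": 250, "response_times_s": [12.0, 300.0, 8.5]},
    {"load_average": 75, "response_times_s": [1.2, 3.8, 4.9]},
]

for run in runs:
    ok = meets_requirement(run["response_times_s"])
    verdict = "meets" if ok else "misses"
    print(f"load average {run['load_average']}: {verdict} the requirement")
```

The load average never enters the decision; it is reported only for context. The verdict depends solely on whether the application's response-time requirement is satisfied.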
The values that can be considered good or bad for a given metric are discussed in depth in the tuning section for each major system resource.
Some of the general factors that affect setting rules of thumb for performance metrics are:
Type of system: multi-user or workstation
Type of application: interactive or batch, compute- or I/O-intensive
Application architecture: single system or client/server, multi-tiered, parallel system
Speed of the CPU
Type of disk drives
Type of networking
3.3.1 Multi-User versus Workstation
A multi-user system experiences many more context switches than a workstation normally does. Context switches consume some of the CPU resource. Additionally, the high number of users typically causes a lot of I/O, which puts demands on the CPU and disk resources. Workstations that are used for graphics-intensive applications typically have high user CPU utilization numbers (which are normal) but lower context switch rates, since there is only one user. Applications on a multi-user system usually cause random patterns of disk I/O. Workstation applications often cause sequential patterns of disk I/O. So, these factors affect the optimal values for the CPU and disk metrics.
3.3.2 Interactive versus Batch, Compute-Intensive versus I/O-Intensive
Workstations can support highly interactive applications, for example, X/Windows or Motif applications that require a lot of interaction. These can be business applications, such as customer service applications that provide several windows into different databases. Alternatively, technical applications such as CAD programs support interactive input to draw the part on the screen. Compute-bound applications on a workstation sometimes act like batch applications on a multi-user system. Batch applications consume a large amount of CPU, as do compute-intensive applications. Interactive applications cause many more context switches and more system CPU utilization than do batch or compute-intensive applications. Batch applications use less memory than highly interactive applications. Compute-intensive applications can touch more pages of memory than individual I/O-intensive applications. The optimal values for the CPU and memory metrics are affected by these factors.
3.3.3 Application Architecture
An application may be architected in several ways. It can be monolithic, that is, one large program that runs entirely on a single system. Parallel applications are designed to make better use of the components of a single computer system (especially a Symmetric Multi-Processing (SMP) system) to improve throughput or response time. In contrast, an application can be distributed, as in a multi-tiered client/server environment. In this case, parallel processing does not necessarily provide the same benefits. For these reasons, understanding the architecture is necessary before deciding what measurements to make and how to interpret the data.
3.3.4 Performance of the CPU
The higher performing the CPU, the greater the CPU resource that is available. Applications can get more work done during the time-slice they are allocated, and may be able to satisfy their current need for the CPU resource completely, rather than having to wait for another turn. This factor affects the run queue metric.
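The effect on the run queue can be illustrated with a small calculation. This is a simplified sketch, not a model of any particular scheduler: the function `time_slices_needed`, the abstract "units of work," and all of the figures are assumptions chosen for illustration.

```python
def time_slices_needed(work_units, units_per_ms, time_slice_ms):
    """Return how many time slices a CPU burst needs to complete.

    work_units    -- size of the CPU burst, in abstract units of work
    units_per_ms  -- work the CPU completes per millisecond (its speed)
    time_slice_ms -- length of one time slice, in milliseconds
    """
    work_per_slice = units_per_ms * time_slice_ms
    # Integer ceiling division: a partly used slice still counts as a slice.
    return -(-work_units // work_per_slice)


# Hypothetical figures, chosen only to show the effect of a faster CPU.
burst = 5000          # units of work the process needs right now
time_slice_ms = 100   # assumed time-slice length

for label, speed in (("slower CPU", 10), ("faster CPU", 50)):
    slices = time_slices_needed(burst, speed, time_slice_ms)
    print(f"{label}: {slices} time slice(s)")
```

With these assumed figures, the process on the slower CPU needs five time slices and returns to the run queue four times before it finishes, while on the faster CPU it finishes within its first time slice and never has to wait for another turn.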
3.3.5 Type of Disks
Newer-technology disk drives are faster, have higher capacity, and can support more I/Os per second than older disks can. The number of disk channels also affects how many disk I/Os per second can be considered optimal. Caches on disks can dramatically improve the overall response times of disk I/Os.
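To see why the number of I/Os per second that is reasonable depends on the drive, consider a common simplified model for random I/O on a rotating disk: the service time is the average seek time plus the average rotational latency (half a revolution) plus the transfer time. The sketch below applies that model; the function name `random_iops` and the drive figures are hypothetical, and the model deliberately ignores caches, queuing, and channel contention.

```python
def random_iops(avg_seek_ms, rpm, transfer_ms=0.0):
    """Estimate random I/Os per second for one rotating disk.

    Simplified model: service time = seek + half a rotation + transfer.
    Caches, request queuing, and channel limits are not modeled.
    """
    avg_rotational_ms = 0.5 * 60_000 / rpm   # half a revolution, in ms
    service_ms = avg_seek_ms + avg_rotational_ms + transfer_ms
    return 1000.0 / service_ms


# Hypothetical drives, for illustration only.
older = random_iops(avg_seek_ms=8.5, rpm=7200)
newer = random_iops(avg_seek_ms=3.5, rpm=15000)

print(f"older drive: about {older:.0f} random I/Os per second")
print(f"newer drive: about {newer:.0f} random I/Os per second")
```

An I/O rate that would saturate the older drive in this example is well within the capability of the newer one, which is why a single rule-of-thumb value cannot apply to both.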
3.3.6 Type of Networks
Networks continue to get faster as well. High-speed networks can move a lot of data between servers with high bandwidth and low latency. Network bandwidth has increased far more over the years than the bandwidth of disks. Improvements in link throughput and connectivity technologies such as Dense Wavelength Division Multiplexing (DWDM) fiber links allow for high speed and low latency between systems separated by miles.
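Bandwidth and latency affect different kinds of traffic differently, which is one reason the type of network changes how the numbers should be read. A first-order estimate of the time to move data is the latency plus the data size divided by the bandwidth. The sketch below uses that estimate; the function name `transfer_time_s` and all link figures are assumptions for illustration, and protocol overhead is ignored.

```python
def transfer_time_s(size_bytes, bandwidth_bits_per_s, latency_s):
    """First-order transfer time: latency plus serialization time.

    Protocol overhead, congestion, and retransmission are not modeled.
    """
    return latency_s + (size_bytes * 8) / bandwidth_bits_per_s


# Hypothetical links and payloads, for illustration only.
latency_s = 0.001                  # assumed 1 ms latency on both links
links = {"1 Gbit/s": 1e9, "10 Gbit/s": 1e10}
payloads = {"1 KB message": 1_000, "1 GB file": 1_000_000_000}

for link_name, bandwidth in links.items():
    for payload_name, size in payloads.items():
        t = transfer_time_s(size, bandwidth, latency_s)
        print(f"{payload_name} over {link_name}: {t:.6f} s")
```

In this example the large file benefits almost tenfold from the faster link, while the small message barely changes because latency dominates its transfer time.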