Capacity Planning as a Performance Tuning Tool—Case Study for a Very Large Database Environment
- Analysis and High-Level Observations
- Resolving CPU and I/O Bottlenecks Through Modeling and Capacity Planning
- I/O Infrastructure Performance Improvement Methodology
- Data Tables
This article discusses the performance and scalability impact of severe CPU and I/O bottlenecks in a very large database (over 20 terabytes). It describes the methodologies used to collect performance data in a production environment, and explains how to evaluate and analyze the memory, CPU, network, I/O, and Oracle database in a production server by using the following tools:
- Solaris Operating Environment (Solaris OE) Standard UNIX tools
- Oracle STATSPACK performance evaluation software from Oracle Corporation
- Trace Normal Form (TNF)
- TeamQuest Model software from TeamQuest Corporation
- VERITAS Tool VxBench from VERITAS Corporation
The article is intended for use by intermediate to advanced performance tuning experts, database administrators, and TeamQuest specialists. It assumes that the reader has a basic understanding of performance analysis tools and capacity planning.
The article addresses the following topics:
- Analysis and High-Level Observations
- Resolving CPU and I/O Bottlenecks Through Modeling and Capacity Planning
- I/O Infrastructure Performance Improvement Methodology
The article walks chronologically through the "what-if" analyses performed with the TeamQuest modeling software to resolve the CPU and I/O bottlenecks, to project the database server's performance and scalability, and to simulate the effects of performance tuning. It also provides a detailed analysis of the findings, and discusses the theoretical analyses and solutions.
Finally, it provides an in-depth discussion and analysis of the solution: resolving the I/O and CPU bottlenecks by balancing the I/O across the existing controllers and adding new controllers.
The first part of the article presents the results of the performance analysis of the CPU, I/O, and Oracle database using the tools previously stated. The second part describes the CPU and I/O tuning and capacity planning methodology, and its results. Finally, the article provides conclusions, recommendations, and a methodology for improving I/O infrastructure performance.
The performance analysis, tuning, and capacity planning methods described in this article can be applied to any server in a production environment. Performance analysis and capacity planning are a continuous effort. When the application or the environment changes, as a result of a performance optimization for instance, the performance analysis has to be revisited and the capacity planning model recalibrated. For a system operating at the upper limit of its capacity, performance optimization is a continuous search for the next resource constraint.
The performance analysis methodology starts with an analysis of the top five system resources utilized during the peak period and the percentage of utilization associated with each one. In this case study, the I/O controllers and CPUs topped out at roughly 80 percent utilization, and the disk drives reached their peak at 70-to-80 percent utilization. Once these thresholds were reached, response times degraded rapidly (depending on the workloads, more than one workload may be depicted).
TeamQuest performance tools were used to produce the performance analysis and capacity planning results. The TeamQuest Framework component was installed on the systems to be monitored; it implements the system workload definitions and collects detailed performance data. The TeamQuest View component allows real-time and historical analysis of the performance data collected on any number of systems on the network.
Analysis and High-Level Observations
In this case study, the primary focus was on I/O and CPU utilization, because the server was observed to be neither memory nor network bound. CPU utilization was high during the peak period, sometimes above the 80 percent threshold considered acceptable to avoid impacting overall system performance. The critical component of CPU utilization was I/O wait, which accounted for about 50 percent of CPU utilization during the peak period. This corresponds to the time the system spends (wastes) managing the I/O queue and waiting for I/O to complete. CPU time in user mode reached only 6-to-8 percent during some peak periods.
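The classification applied in this paragraph can be expressed as a small helper. This is a hypothetical sketch (the function name and thresholds are illustrative, with the 80 percent total-utilization threshold taken from the text):

```python
def cpu_breakdown_flags(usr, sys, wio, busy_threshold=80.0, wio_threshold=40.0):
    """Classify a CPU sample (percentages) the way the analysis above does.

    Returns flags: 'cpu-saturated' when total utilization (user + system +
    wait-I/O) reaches the 80% threshold, and 'io-bound' when wait-I/O alone
    dominates the total.
    """
    total = usr + sys + wio
    flags = []
    if total >= busy_threshold:
        flags.append("cpu-saturated")
    if wio >= wio_threshold:
        flags.append("io-bound")
    return flags

# A peak-period sample like the case study's: ~80% busy, ~50 points of it
# wait-I/O, only single-digit user time.
print(cpu_breakdown_flags(usr=7, sys=23, wio=50))  # ['cpu-saturated', 'io-bound']
```

Seeing both flags together is the signature described here: the CPUs look busy, but much of that "business" is bookkeeping for a saturated I/O subsystem.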
The following paragraphs analyze the performance data for the top five system resources obtained with VERITAS, TeamQuest, the Solaris OE standard tools, TNF, and Oracle STATSPACK, along with preliminary data for Oracle.
VERITAS Analysis and Observation
TeamQuest reported 1,300 I/O operations per second (IOPS) on the system during the peak period. Tests with VERITAS VxBench indicate that the I/O subsystem has the capacity to sustain more than 11,000 IOPS. These are logical IOPS: what the operating system sees as requests reach the controllers. Although the two measurements differ in nature, in both scenarios the data was collected at the operating system level.
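A quick arithmetic check (using the two figures above) makes the point explicit: the observed peak load is only a small fraction of what the I/O subsystem can sustain, so the raw hardware capacity is not the constraint.

```python
observed_iops = 1_300    # TeamQuest, peak period
capacity_iops = 11_000   # VxBench measured sustainable minimum

headroom = capacity_iops - observed_iops
utilization = observed_iops / capacity_iops
print(f"I/O subsystem at {utilization:.0%} of measured capacity "
      f"({headroom:,} IOPS of headroom)")
```

This is what points the investigation toward I/O distribution and queueing rather than aggregate throughput.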
TeamQuest Analysis and Observations
The TeamQuest analysis and observations concern the CPU and disk utilization.
The following charts show a CPU utilization of about 50-to-60 percent during working hours (10:00 a.m. to 4:00 p.m.). A small peak occurs during the night, possibly related to batch processing.
The graphs and the tables in "Data Tables" on page 35 indicate a significant proportion of wait I/O time. Typically, this indicates poor disk responsiveness in single-processor systems. The disks were unable to service all of the transaction requests, and an I/O request queue formed. While this fully utilizes the disks, it forces the operating system to spend considerable time managing the queue and waiting for I/O to complete before additional processing can take place.
Disk drives reach their peak at 70-to-80 percent utilization. The values shown in the graph indicate very high device utilization.
Solaris OE Tools Analysis and Observations
The "%busy" column in TABLE 4 in "Data Tables" on page 35 shows the disk utilization. Disks that are utilized over 65 percent should be considered a concern, and any disk utilized over 90 percent represents a serious performance bottleneck.
However, disk utilization (%busy) alone is not an indicator of poor disk performance. You should always correlate %busy with the service time in milliseconds (ms). The rule of thumb is that a disk is a bottleneck only when both %busy and the disk service time are high (over 50 ms) for a long period of time.
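The rule of thumb can be written as a small filter over iostat-style samples. This is a sketch under stated assumptions: the disk names, thresholds other than the 50 ms figure above, and the sample format are illustrative, and real `iostat -x` output would need parsing first.

```python
def bottleneck_disks(samples, busy_pct=65.0, svc_ms=50.0, min_intervals=3):
    """Return disks whose %busy and service time are BOTH high, sustained.

    `samples` maps a disk name to a list of (pct_busy, svc_time_ms) pairs
    taken over consecutive intervals. A disk is flagged only when both
    metrics exceed their thresholds for at least `min_intervals`
    consecutive samples -- high %busy alone is not enough, per the rule
    of thumb above.
    """
    flagged = []
    for disk, history in samples.items():
        streak = 0
        for busy, svc in history:
            streak = streak + 1 if busy > busy_pct and svc > svc_ms else 0
            if streak >= min_intervals:
                flagged.append(disk)
                break
    return flagged

samples = {
    "c1t0d0": [(92, 120), (95, 180), (90, 95), (88, 140)],  # busy AND slow
    "c2t1d0": [(98, 8), (97, 6), (99, 7), (96, 9)],         # busy but fast
}
print(bottleneck_disks(samples))  # ['c1t0d0']
```

Note that `c2t1d0` is over 95 percent busy yet is not flagged: its service times are healthy, so its high utilization reflects throughput, not queueing.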
Oracle Analysis and Observation
The following table lists the data collected for the top wait events by using the Oracle STATSPACK tool (values not recoverable are shown as dashes):

| Wait Event | Waits | Wait Time (cs) | % Total Wait Time |
| --- | --- | --- | --- |
| db file sequential read | 1,501,970 | 2,913,058 | > 99 |
| db file scattered read | — | — | — |
| db file parallel read | — | — | — |
| control file parallel write | — | — | — |
It is important to understand the significance of the wait event and the average wait time (ms) per wait. The focus should be on the wait event db file sequential read, which constitutes the majority of the waits (greater than 99 percent) in this case; it is the top wait event. The average wait time per wait is about 19.4 ms.
Oracle uses centiseconds (cs) for measuring time. One cs is 1/100 of a second, so 2,913,058 cs for a total of 1,501,970 wait events gives an average of about 19.4 ms per wait.
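The conversion can be checked directly with the figures above:

```python
total_wait_cs = 2_913_058   # total wait time for db file sequential read, cs
wait_events = 1_501_970     # number of waits

# 1 cs = 1/100 s = 10 ms, so multiply the per-event average by 10.
avg_wait_ms = total_wait_cs / wait_events * 10
print(f"average wait: {avg_wait_ms:.1f} ms per wait")  # average wait: 19.4 ms per wait
```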
Also, because the results are accumulated, the calculated average looks good and can mask wait events that are really bad during peak time.
TNF Analysis and Observation
TNF analysis further confirmed the preceding results. In addition, very high queue depth and response times were observed. For TNF output data, refer to TABLE 4 in "Data Tables" on page 35.