
Capacity Planning as a Performance Tuning Tool—Case Study for a Very Large Database Environment

This article discusses the performance and scalability impact due to severe CPU and I/O bottlenecks in a very large database (over 20 terabytes). It describes the methodologies used to collect performance data in a production environment, and explains how to evaluate and analyze the memory, CPU, network, I/O, and Oracle database in a production server by using the following tools:

  • Solaris™ Operating Environment (Solaris OE)—Standard UNIX™ tools

  • Oracle STATSPACK performance evaluation software from Oracle Corporation

  • Trace Normal Form (TNF)

  • TeamQuest Model software from TeamQuest Corporation

  • VERITAS VxBench tool from VERITAS Corporation

The article is intended for use by intermediate to advanced performance tuning experts, database administrators, and TeamQuest specialists. It assumes that the reader has a basic understanding of performance analysis tools and capacity planning.

The article addresses the following topics:

  • Analysis and High-Level Observations

  • Resolving CPU and I/O Bottlenecks Through Modeling and Capacity Planning

  • Conclusions

  • Recommendations

  • I/O Infrastructure Performance Improvement Methodology

  • Data Tables

The article describes, in chronological order, the "what-if" analyses performed with the TeamQuest modeling software to resolve the CPU and I/O bottlenecks, to project database server performance and scalability, and to simulate the effects of performance tuning. It also provides a detailed analysis of the findings and discusses the theoretical analyses and solutions.

Finally, it provides an in-depth discussion and analysis of the solution, that is, how to resolve the I/O and CPU bottlenecks by balancing the I/O across the existing controllers and adding new controllers.

The first part of the article presents the results of the performance analysis with respect to the CPU, I/O, and Oracle database, using the tools previously stated. The second part describes the CPU and I/O tuning and the capacity planning methodology, and its results. Finally, the article provides conclusions, recommendations, and the methodology for improving I/O infrastructure performance.

The performance analysis, tuning, and capacity planning methods described in this article can be applied to any server in a production environment. Performance analysis and capacity planning is a continuous effort. When the application or the environment changes, as a result of a performance optimization for instance, the performance analysis has to be revisited and the capacity planning model recalibrated. For a system operating at the upper limit of its capacity, performance optimization is a continuous search for the next resource constraint.

The performance analysis methodology starts with an analysis of the top five system resources utilized during the peak period and the percentage of utilization associated with each one. In this case study, the I/O controllers and CPUs topped out at roughly 80 percent utilization, and the disk drives reached their peak at 70-to-80 percent utilization. Once these thresholds were reached, response times degraded rapidly (depending on the workload definitions, more than one workload may be depicted).
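
Why response times degrade so sharply near these thresholds follows from basic queueing behavior. The short Python sketch below is only an illustration (it is not part of the study's measurements, and it assumes a single-server M/M/1 approximation with a 10 ms service time) of how response time inflates as utilization approaches the 70-to-80 percent range:

    # Illustration of response-time inflation at high utilization.
    # Assumes an M/M/1 queue: R = S / (1 - U), where S is the service time
    # and U the utilization. The 10 ms service time is an assumed value.
    SERVICE_TIME_MS = 10.0

    def response_time_ms(utilization, service_ms=SERVICE_TIME_MS):
        """Expected response time (service plus queueing) of an M/M/1 server."""
        if not 0.0 <= utilization < 1.0:
            raise ValueError("utilization must be in [0, 1)")
        return service_ms / (1.0 - utilization)

    for u in (0.50, 0.70, 0.80, 0.90, 0.95):
        print(f"{u:.0%} busy -> {response_time_ms(u):6.1f} ms")
    # 50% busy -> 20 ms, 80% -> 50 ms, 95% -> 200 ms: the curve turns
    # sharply upward past roughly 70-to-80 percent utilization.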

TeamQuest performance tools were used to provide the performance analysis and capacity planning results. The TeamQuest Framework component was installed on the systems to be monitored. This component implements the system workload definitions and collects detailed performance data. The TeamQuest View component allows real-time and historical analysis of the performance data collected on any number of systems on the network.
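
As a rough illustration of the workload idea (a minimal Python sketch only; it is not how the TeamQuest Framework is implemented, and the process-to-workload rules and sample values are hypothetical), per-process samples can be mapped to named workloads and rolled up per collection interval:

    # Hypothetical workload rollup: assign per-process CPU samples to named
    # workloads and sum them, in the spirit of workload-based collection.
    from collections import defaultdict

    WORKLOAD_RULES = {            # assumed mapping: command name -> workload
        "oracle": "database",
        "tnslsnr": "database",
        "cron": "batch",
    }

    samples = [                   # hypothetical (command, cpu_percent) pairs
        ("oracle", 42.0), ("oracle", 18.5), ("tnslsnr", 1.2),
        ("cron", 3.0), ("sshd", 0.4),
    ]

    usage = defaultdict(float)
    for command, cpu in samples:
        usage[WORKLOAD_RULES.get(command, "other")] += cpu

    for workload, cpu in sorted(usage.items(), key=lambda kv: -kv[1]):
        print(f"{workload:<10s} {cpu:5.1f}% CPU")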

Analysis and High-Level Observations

In this case study, the primary focus was on I/O and CPU utilization, as it was observed that the server was neither memory nor network bound. CPU utilization was high during the peak period, sometimes above the 80 percent threshold considered acceptable to avoid impacting overall system performance. The critical component of CPU utilization was I/O wait, which accounted for about 50 percent of CPU utilization during the peak period. This corresponds to the time the system uses (wastes) managing the I/O queue and waiting for I/O. CPU wait in user mode reached 6-to-8 percent during some peak periods.
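
As a rough sketch of how the I/O wait share of CPU utilization can be derived from standard accounting data (the sample values are hypothetical, arranged as the %usr, %sys, %wio, and %idle columns that sar -u reports on the Solaris OE):

    # Minimal sketch: compute what fraction of non-idle CPU time is wait I/O
    # from sar -u style samples. The sample values below are hypothetical.
    SAR_SAMPLES = [
        "10:00:00 25 15 40 20",   # timestamp %usr %sys %wio %idle
        "11:00:00 28 14 42 16",
        "12:00:00 22 12 34 32",
    ]

    for line in SAR_SAMPLES:
        ts, usr, sys_, wio, _idle = line.split()
        usr, sys_, wio = float(usr), float(sys_), float(wio)
        busy = usr + sys_ + wio                 # total non-idle CPU time
        print(f"{ts}: {busy:.0f}% busy, wait I/O is {wio / busy:.0%} of that")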

The following sections analyze and show the performance data obtained with the VERITAS, TeamQuest, Solaris OE standard, TNF, and Oracle STATSPACK tools for the top five system resources, along with preliminary data for Oracle.

VERITAS Analysis and Observation

TeamQuest reported 1,300 I/O operations per second (IOPS) on the system during the peak period. Tests with VERITAS VxBench indicated that the I/O subsystem has the capacity to sustain more than 11,000 IOPS. These are logical IOPS, that is, what the operating system sees as the requests hit the controllers. Although there are differences between these two measurements, in both scenarios the data was collected at the operating system level.
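
A back-of-the-envelope comparison of the two figures (a sketch using only the numbers quoted above) shows why aggregate headroom can be misleading when the load is not spread evenly across the controllers, the imbalance addressed later in the article:

    # Compare the observed peak IOPS with the measured I/O subsystem capacity.
    # Both figures come from the text above.
    observed_iops = 1_300      # peak logical IOPS reported by TeamQuest
    capacity_iops = 11_000     # sustained logical IOPS measured with VxBench

    aggregate_utilization = observed_iops / capacity_iops
    print(f"aggregate I/O subsystem utilization: {aggregate_utilization:.0%}")
    # Roughly 12 percent in aggregate, even though individual controllers were
    # observed near 80 percent busy: the load was concentrated, not balanced.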

TeamQuest Analysis and Observations

The TeamQuest analysis and observations concern the CPU and disk utilization.

CPU

The following charts show a CPU utilization of about 50-to-60 percent during the working hours (10:00 a.m. to 4:00 p.m.). A small peak occurs during the night, possibly related to some batch processing.

Figure 1

The graphs and the tables in "Data Tables" indicate a significant proportion of wait I/O time. Typically, this indicates poor disk responsiveness in single-processor systems. The disks were unable to service all of the transaction requests, and an I/O request queue formed. While this operation fully utilizes the disks, it forces the operating system to spend a lot of time managing the queue and waiting for I/O to complete before additional processing can take place.
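
The relationship between throughput, service time, utilization, and queueing can be made concrete with the utilization law (U = X x S) and a single-server approximation of the number of outstanding requests; the device names and per-disk figures in this Python sketch are hypothetical:

    # Sketch: utilization law U = X * S (throughput times service time), plus
    # an M/M/1 estimate of requests at the device, U / (1 - U).
    # The device names and per-disk numbers are hypothetical.
    disks = {                      # device: (IOPS, service time in seconds)
        "c1t0d0": (150, 0.0050),
        "c1t1d0": (160, 0.0055),
        "c2t0d0": (40, 0.0040),
    }

    for dev, (iops, svc_s) in disks.items():
        util = iops * svc_s
        outstanding = util / (1.0 - util) if util < 1.0 else float("inf")
        print(f"{dev}: {util:.0%} busy, ~{outstanding:.1f} requests queued or in service")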

Disks

Disk drives reach their peak at 70-to-80 percent utilization. The values shown in the graph indicate very high device utilization.

Solaris OE Tools Analysis and Observations

The %busy column in TABLE 4 in "Data Tables" shows the disk utilization. Disks that are utilized over 65 percent should be considered a concern, and any disk utilized over 90 percent represents a serious performance bottleneck.

However, disk utilization (%busy) alone is not an indicator of bad disk performance. You should always correlate %busy with the disk service time in milliseconds (ms). The rule of thumb is that a disk becomes a concern when %busy is high and the disk service time is also high (over 50 ms) for a long period of time.
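
The following Python sketch shows how that rule of thumb might be applied over a monitoring interval; the thresholds come from the text above, while the device names and readings are hypothetical and simplified rather than raw iostat output:

    # Flag disks matching the rule of thumb: %busy high AND service time high
    # (over 50 ms) for a sustained period. Thresholds follow the article;
    # the per-disk samples are hypothetical.
    BUSY_THRESHOLD = 65.0      # percent busy worth investigating
    SVC_THRESHOLD_MS = 50.0    # service time considered high

    samples = {                # device: [(percent_busy, service_time_ms), ...]
        "c1t0d0": [(82, 68), (79, 72), (85, 75)],
        "c1t1d0": [(40, 12), (35, 10), (42, 11)],
    }

    for dev, readings in samples.items():
        sustained = all(busy > BUSY_THRESHOLD and svc > SVC_THRESHOLD_MS
                        for busy, svc in readings)
        if sustained:
            print(f"{dev}: sustained high %busy and service time, bottleneck candidate")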

Oracle Analysis and Observation

The following table lists the data collected for the top five wait events by using the Oracle STATSPACK tool.

Event                           Waits        Wait Time (cs)    %Total Wait Time
db file sequential read         1,501,970    2,913,058         99.71
db file scattered read          2,346        5,449             .19
db file parallel read           400          1,286             .04
control file parallel write     198          471               .02
latch free                      364          380               .01

It is important to understand the significance of the wait event and the average wait time (ms) per wait. The focus should be on the wait event db file sequential read, which constitutes the majority of the waits (greater than 99 percent); in this case, it is the top wait event. The average wait time per wait is approximately 19.4 ms.

Oracle uses centiseconds (cs) for measuring time. One cs is 1/100 of a second (10 ms), so 2,913,058 cs for a total of 1,501,970 wait events gives an average of approximately 19.4 ms per wait.
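
The conversion is easy to check (a worked example using only the STATSPACK figures from the table above):

    # Worked check of the STATSPACK numbers: Oracle reports wait times in
    # centiseconds (1 cs = 1/100 second = 10 ms).
    waits = 1_501_970
    wait_time_cs = 2_913_058

    avg_wait_cs = wait_time_cs / waits     # about 1.94 cs per wait
    avg_wait_ms = avg_wait_cs * 10.0       # about 19.4 ms per wait
    print(f"average db file sequential read wait: {avg_wait_ms:.1f} ms")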

Also, because the results are accumulated, a calculated average that looks good can hide wait events that are really bad during peak periods.

TNF Analysis and Observation

TNF analysis further confirmed the preceding results. In addition, very high queue depths and response times were observed. For the TNF output data, refer to TABLE 4 in "Data Tables".
