- Performance and Disease
- Business Requirements
- Medical Analogues
- Lab Tests and Record Keeping
- Traps and Pitfalls
- Where Does the Time Go?
- Diagnostic Strategies
- Selected Tools and Techniques
- Third-Party URLs
- About the Author
- Ordering Sun Documents
- Accessing Sun Documentation Online
Where Does the Time Go?
Performance issues are all about time. Successful performance sleuths start from the premise that "You can always tell where the time is going." Armed with a broad collection of tools, they first determine where the time goes, then drill down on why it goes there. Knowing where the time can go is essential to accurately diagnosing where it is actually going. Here, we break it down, starting with CPU usage.
At a high level of analysis, CPU cycles can be separated into various broad categories. These categories are not mutually exclusive.
The application code is the main logic of a program, which could be expressed in languages as diverse as C, C++, Java™, SQL, Perl, or ABAP.
Interpreter code is the implementation of the engine executing the application code, such as a Java runtime environment, a database engine, or a Perl interpreter.
The efficiency of system-call implementations is clearly in the domain of the operating system vendor, but unnecessary system calls might originate from a variety of places.
Whether library calls involve system-supplied libraries or application libraries, the fact that they are contained in libraries offers a chance for adding instrumentation.
At a lower level of analysis, CPU cycles are often attributed as:
User cycles are CPU cycles consumed in user mode. These include time spent in some system calls whose logic does not require switching to kernel mode.
System cycles are CPU cycles consumed in kernel mode. These include time spent in the OS kernel on behalf of system calls from processes, as well as on internal kernel operations such as servicing hardware events.
Prior to the release of the Solaris 9 OS, time spent in interrupt handlers was attributed as neither system nor user time. Whether or not interrupt processing is observable, it will certainly account for some percentage of actual CPU usage. The servicing of some interface cards, such as Gigabit Ethernet cards, can consume a large proportion of a CPU.
It is noteworthy that CPU usage attribution in the Solaris OS is not 100 percent accurate. The quality of time accounting in the Solaris OS is under continuous improvement and will vary between software releases.
Some low-level phenomena in a system can help explain CPU consumption in the previously described categories. Analysis of these events is largely in the domain of gurus, but tools are evolving that can digest this information into metrics that can be used by a broader range of analysts.
The Solaris 9 OS introduces the trapstat(1M) tool for reporting on time spent in low-level service routines, including traps and interrupts. Traps occur for low-level events such as delays in remapping memory accesses or handling certain numerical exceptions.
Tools like busstat(1M) and cpustat(1M) can be used to report on low-level counters embedded in the CPU or system architecture. Useful analysis with these tools requires very specific knowledge of low-level architectural factors.
Elapsed time that is not attributable simply to CPU usage might fall into the following categories:
All I/O is very slow compared to CPU speeds.
Scheduling delays involve the allocation of and contention for CPU resources. This factor is not nearly as well understood as I/O and memory factors, and it can be very difficult to model, forecast, and comprehend.
Delays in memory access ensue when a memory reference cannot be immediately satisfied from the nearest cache, whether at the CPU chip level or from page faults at the virtual memory abstraction layer.
Protocols such as TCP/IP may stall on protocol events such as message acknowledgements or retransmissions.
Applications use a variety of mechanisms to coordinate their activities. These mechanisms range from inefficient file system-based schemes (for example, based on lockf(3C)) to more generally efficient techniques, such as machine-optimized mutual-exclusion locking mechanisms (that is, mutexes). Among the methods used to attain low latency in lock acquisition is the spin lock or busy wait, which is basically a form of polling.
These delays are not necessarily mutually exclusive, and they overlap with the previous categories of time attribution. For example, memory delays might count against %usr or %sys and might relate to the traps and esoterica mentioned above. I/O latencies can include significant CPU and protocol latencies.
Having a complete overview of where the time can go can be most helpful in formulating diagnostic strategies.