Drill-Down Monitoring of Database Servers
Introduction
Editor's Note
This material is extracted from Chapter 21 of Configuring and Tuning Databases on the Solaris Platform, by Allan N. Packer, (c) 2002, Sun Microsystems Press (ISBN 0-13-083417-3). Chapter 22, Monitoring and Tuning Oracle, will be presented in the July and August 2002 editions of Sun BluePrints(TM) OnLine.
The time has come. Armed only with your wits, a little common sense, and some basic system knowledge, you're going to crack the performance problem bedevilling your database server. You roll up your sleeves and seat yourself firmly in front of a keyboard. A cluster of slightly awed colleagues watches wide-eyed over your shoulder.
OK, perhaps I'm getting a bit carried away here. Suffice it to say that the aim of this chapter is to develop a simple method for identifying performance problems on database servers.
I'm assuming you have already looked at the issues affecting server performance (covered in earlier chapters of Configuring and Tuning Databases on the Solaris Platform). Your application's behavior is well understood and consultants in small doses have already done wonders with application performance. You've done what you can to fix any database schema design problems and the addition of a couple of crucial indexes has already calmed the users down a little.
You've checked out obscure things like environment variables and racked your brains for other issues that might need attention. But performance problems still persist. Perhaps you need to upgrade your hardware, but at this point you're not sure.
Where should you start? I'm going to suggest a five-step process that will walk you through the major components of the system: memory, disk I/O, network, and CPU, followed by database monitoring and tuning. If a problem becomes apparent in one of these areas, there may be further steps to narrow the problem. This kind of "drill down" approach is an effective way to identify and ultimately solve problems.
If you find a bottleneck (by which I mean a constriction of performance, just as the neck of a bottle limits the flow of liquid into or out of a bottle), does that mean you should look no further? I would suggest going through the whole process anyway to see what you can discover.
Bear in mind, though, that fixing a bottleneck in one place might expose another elsewhere. Suppose, for example, your system is paging severely due to a lack of memory, but no problems are apparent elsewhere. Adding memory might allow your throughput to improve to the point where one of the disks becomes overutilized, resulting in a disk bottleneck. Checking out the disks as well might give a hint of problems to come.
Once you've found and resolved a bottleneck, go through the entire process again.
Finally, is the order of the steps important? Of course there are many possible ways of tackling system monitoring, but I suggest you go through the steps in the order shown.
STEP 1. Monitoring Memory
To check for a memory bottleneck, use the vmstat utility, which shows, among other things, memory behavior for the system. A 5-second interval is a good choice for live monitoring.
The vmstat trace in Figure 1 shows a system with no evidence of memory shortfall.
Figure 1 vmstat trace with no memory shortfall
Figure 2 shows a vmstat trace from the same system during a severe memory shortfall.
Figure 2 vmstat trace with severe memory shortfall
What to Look For
Look for po (pageouts, the kilobytes paged out per second) and sr (scan rate, the number of pages scanned by the clock algorithm). When both are consistently high at the same time (much more than 100 per second, say, on a system with up to 4 Gbytes of memory, more on a larger system), then it is possible the page daemon is being forced to steal free memory from running processes. Do you need to add more memory to the system? Maybe.
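As a rough sketch of this check, the following filter flags any vmstat interval where both po and sr exceed 100. The column positions match the Solaris vmstat layout, and the 100-per-second threshold is the rule of thumb above, not a hard limit; the sample output embedded here is illustrative, not from a real system. For live monitoring you would pipe `vmstat 5` through the same awk filter.

```shell
# Flag intervals where both po (column 9) and sr (column 12) are high.
# The sample vmstat output below is illustrative only.
vmstat_sample='
 r b w   swap  free  re  mf pi  po fr de  sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 466000 23056   2  12  5 280 320  0 250  5  0  0  0  400 1200  300 20 10 70
 0 0 0 466000 22900   1   8  3   0   0  0   0  4  0  0  0  350 1100  280 18  9 73'
echo "$vmstat_sample" | awk 'NR > 2 && $9 > 100 && $12 > 100 \
    { print "possible memory shortfall: po=" $9 " sr=" $12 }'
```

Only the first sample interval trips the filter; the second shows the same system with no paging activity.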
More memory may not help, though. That might sound crazy, but unfortunately the water is a little muddy here. Some explanation might help clarify the situation.
Pageouts can happen for a number of reasons, including the following:
Dirty (modified) file system pages are being flushed to disk. Such flushing is normal behavior and does not represent a problem. If database files are placed on file system files, expect to see this kind of pageout.
Application pages are being pushed out to the swap device to free up memory for other purposes. If the applications in question are active or about to become active, paging is bad!
New memory has been allocated by an application and swap space is being assigned to it. This, too, is normal behavior and does not represent a problem.
Memory pages have been freed by applications and are being flushed to disk. Isn't paging a waste of time if the memory is no longer required by the applications? You bet! Solaris 8 introduced a new madvise() flag called MADV_FREE to enable developers to tell the operating system not to bother to flush such pages to swap.
The scan rate is a measure of the activity of the page daemon. The page daemon wakes up and looks for memory pages to free when an application is unable to find enough memory on the free list (memory has fallen to the lotsfree system parameter). The greater the memory shortfall, the faster the page daemon will scan pages in main memory.
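On Solaris 8, the lotsfree threshold and the current free list can be read with the kstat utility (`kstat -p unix:0:system_pages:lotsfree` and `...:freemem`, both in pages). The sketch below compares the two from a sample of that output; the values shown are invented for illustration, and on earlier Solaris releases the kstat command may not be available.

```shell
# On a live Solaris 8 system these values would come from:
#   kstat -p unix:0:system_pages:lotsfree unix:0:system_pages:freemem
# The sample output below is illustrative (values in pages).
kstat_sample='unix:0:system_pages:lotsfree	3984
unix:0:system_pages:freemem	2100'
echo "$kstat_sample" | awk '
    /lotsfree/ { lotsfree = $2 }
    /freemem/  { freemem  = $2 }
    END {
        if (freemem < lotsfree)
            print "free list below lotsfree: page daemon will be scanning"
        else
            print "free memory above lotsfree"
    }'
```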
The major consumers of memory in a system are:
Applications, including text (binary code), stack (which contains information related to the current phase of execution of the program and its functions), heap (which contains program working space), and shared memory.
The file system page cache, which contains file system data (all file system disk blocks must first be read into memory before they can be used). This cache becomes important when database files are stored on file systems.
The operating system kernel.
Normal Paging Behavior Prior to Solaris 8
Before Solaris 8, the free column reported by vmstat may not be a good indication of the available memory in the system. The reason is that once memory pages are used by the file system page cache, they are not returned to the free list. Instead, the file system data blocks are left in the cache in case they are needed again in the future.
When the page daemon detects a memory shortfall and scans for pages to free, it may well choose to free some of the pages in the file system page cache. If the pages have been modified, they are first flushed to disk. There is no simple way of finding out how much of main memory is being used by the file system page cache at any point, but you can bet it will be substantial if database files are located on UFS files rather than raw devices. The memtool package (Richard McDougall's memory monitoring tool), available on the book website, can identify memory use by UFS files.
The problem is that the page daemon may free application memory pages as well as file system page cache pages since it doesn't know which is which. The result can be severe paging and major performance problems. Adding more memory won't help much, either. It will simply mean that more database pages can be cached in memory. Fortunately, there is a solution.
Priority Paging
As of Solaris 7, a new feature called priority paging has been added. Priority paging lowers the priority of file system pages in memory so that the page daemon will choose to free them ahead of application pages. This behavior can make a huge difference to paging problems; priority paging should be enabled wherever databases coexist with active file systems, and especially where database files are placed on file systems.
You can activate priority paging by adding the following line to /etc/system and rebooting:
set priority_paging = 1
Patches are available for Solaris 2.5.1 and Solaris 2.6 to add priority paging functionality. From Solaris 8, changes to the virtual memory system mean that priority paging is no longer required and should not be used.
UFS Files and Paging
If your database files are UFS files rather than raw devices, you may observe significant scanning even once you have enabled priority paging. In fact, the scan rate may increase since priority paging causes the page daemon to become active sooner. This behavior is a natural consequence of the need to bring all database pages into the UFS page cache before they can be accessed by the database. The ongoing need to find free memory gives rise to constant scanning activity on busy database servers using UFS files.
If your application carries out updates, inserts, and deletes, you should also expect to see pageout activity. All database writes must go through the UFS page cache before being written to disk. Although the page being written to disk would previously have been read into the UFS page cache, the memory might have since been reused if the scan rate is high. In that case the page must be reread from disk before the write to disk can proceed. This process in turn displaces another page, and the cycle continues.
How do you stop all this paging activity? To eliminate scanning (assuming you have enough memory for your applications), either use raw devices or mount your database partitions with the Direct I/O flag (forcedirectio).
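A minimal sketch of the forcedirectio option follows; the device names and the /db01 mount point are illustrative assumptions, not values from the text.

```shell
# Hypothetical example: mounting a database partition with Direct I/O.

# One-off remount of an already-mounted UFS file system:
mount -o remount,forcedirectio /db01

# Or make it permanent with an /etc/vfstab entry (fields: device to mount,
# device to fsck, mount point, FS type, fsck pass, mount at boot, options):
# /dev/dsk/c1t0d0s6  /dev/rdsk/c1t0d0s6  /db01  ufs  2  yes  forcedirectio
```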
A word of caution: eliminating paging activity with Direct I/O may not always result in instant performance improvements. The file system page cache acts as a second-level cache for database pages, and removing it from the picture will expose any inadequacies in the sizing of your database buffer cache. Make sure that your database buffer cache is adequately sized; otherwise, you may find yourself with plenty of free memory and a database buffer cache starved of buffers.
Enabling Direct I/O for database files, and especially for database logs, can offer significant performance benefits as of the Solaris 8 1/01 release; earlier versions of Direct I/O may not offer significant performance gains. Direct I/O is described in more detail in "Unix File System Enhancements" on page 21 of Configuring and Tuning Databases on the Solaris Platform.
As a final caution, although Direct I/O can prove very useful for database files, do not enable it for nondatabase files without first examining carefully the performance implications of doing so.
Normal Paging Behavior as of Solaris 8
As we have seen, priority paging doesn't go all the way to solving the problem. Although the page daemon will choose file system pages in preference to application pages, the page daemon still has to search through the whole of memory to find them. And if large database buffer caches are being used, file system pages may only represent a small proportion of the total memory, so a lot of searching will be necessary.
As of Solaris 8, file system pages are separately accounted for, so they can be freed without a memory scan to find them. Consequently, the page daemon is not needed at all unless there is a major memory problem. As a result, the likelihood of paging problems is greatly diminished.
Drilling Down Further
If you want to find out where the memory is going, there are a number of options:
For the final answer on memory consumption, use memtool. The procmem script described below will provide a detailed breakdown of process memory usage. Both memtool and procmem are available on the book website.
Run dmesg and look for Avail mem. The difference between available memory and physical memory (use prtconf or /usr/platform/`arch -k`/sbin/prtdiag to find out the physical memory) indicates the amount of memory reserved for the kernel.
Use /usr/ucb/ps -aux to find out which processes are the major memory hogs. This command lists the percentage of memory used by each process. Beware, though! The memory listed is virtual memory, not physical memory, and it may not be a good indication of how much physical memory is actually being consumed by the process at any given moment.
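To surface the biggest consumers quickly, the %MEM column (the fourth field) can be sorted numerically. The sketch below runs against a small invented sample of /usr/ucb/ps -aux output; live use would be `/usr/ucb/ps -aux | awk 'NR > 1' | sort -rn -k4,4 | head` instead, with the same virtual-versus-physical caveat as above.

```shell
# Sort sample ps output by %MEM (field 4), largest first.
# The sample below is illustrative, not from a real system.
ps_sample='USER       PID %CPU %MEM      SZ     RSS TT S    START  TIME COMMAND
oracle    2231  0.3 12.1 1406976 1374208 ?  S   Mar 01  5:12 ora_dbw0_bench
root       143  0.1  0.4    3200    2928 ?  S   Mar 01  0:09 nscd
oracle    2245  0.2  8.7 1405248 1372472 ?  S   Mar 01  3:40 ora_lgwr_bench'
echo "$ps_sample" | awk 'NR > 1' | sort -rn -k4,4 | head -3
```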
If you have installed the unbundled memtool package, the procmem script will use the pmem program (similar to the standard Solaris pmap program, but with various bugs fixed on some Solaris releases). Some users are unwilling to install an unbundled package on a production system; procmem uses pmap instead if memtool (and therefore pmem) is not available. The procmem script summarizes memory use for all processes and gives a breakdown into resident, shared, and private memory usage. Please note that both procmem and memtool are unsupported software.
Passing the -h parameter to procmem results in the following usage information:
usage: procmem [-v] [-h | -p pidlist | [ -u username ] [ searchstring ]]

Examples:
  procmem -p 10784 10759   - show memory usage for processes with pids 10784 10759
  procmem -u root          - show memory usage for all processes owned by the root user
  procmem -u "daemon root" - show memory usage for all processes owned by the root & daemon users
  procmem netscape         - show memory usage for process(es) in 'ps -ef' with "netscape"
  procmem -u fred netscape - show memory usage for "netscape" processes owned by fred
  procmem                  - show memory usage for all processes (provided current user has superuser access privileges)

Definition of terms
  'Kbytes' is the total memory size of the process or file.
  'Resident' is that portion currently occupying physical memory.
  'Shared' is resident memory capable of being shared.
  'Private' is resident memory unique to this process or file.
  Resident = Shared + Private

Sizing
  For reporting purposes, the 'Shared' component has been counted once only
  while the 'Private' component has been summed for each process or file.
  The /usr/lib shared libraries have been reported separately since they tend
  to be widely used across applications. To be totally accurate, though, the
  shared component of these shared libraries should only be counted once
  across all applications, not once for every group of applications. The same
  logic may apply to other shared libraries also used by multiple applications.
The -v flag offers additional detail. An example of procmem output follows for all processes on a server running an Oracle database.
Processes               Kbytes  Resident   Shared  Private
---------               ------  --------   ------  -------
Process Summary (Count)
-csh (3)                  5376      2456     1064     1392
-ksh (8)                 14624      2912     1456     1456
automountd (1)            4088      3656     1792     1864
cimomboot (1)             1576      1384     1256      128
cron (1)                  1936      1744     1464      280
devfsadmd (1)             2776      2536     1592      944
devfseventd (1)           1272      1232      952      280
dmispd (1)                3160      2648     1744      904
dtlogin (1)               4920      2856     2192      664
dwhttpd (2)              20080      7440     4496     2944
esd (2)                  24712     21768     3880    17888
grep (1)                   968       936      832      104
in.ndpd (1)               1856      1488     1304      184
in.rdisc (1)              1616      1416     1272      144
in.rlogind (6)           10368      2464     1360     1104
inetd (1)                 2648      2384     1384     1000
init (1)                  1888      1608     1136      472
iostat (1)                1824      1784      904      880
ksh (5)                   9040      2320     1456      864
lockd (1)                 1896      1656     1264      392
lpsched (1)               3040      1768     1552      216
mibiisa (1)               2952      2752     1504     1248
mountd (1)                2952      2536     1528     1008
nfsd (1)                  1888      1672     1264      408
nscd (1)                  3200      2928     1528     1400
ora_ckpt_bench (1)     1405264   1372488  1371824      664
ora_dbw0_bench (1)     1406976   1374208  1371824     2384
ora_dbw1_bench (1)     1406968   1374200  1371824     2376
ora_dbw2_bench (1)     1406968   1374200  1371824     2376
ora_dbw3_bench (1)     1406968   1374200  1371824     2376
ora_lgwr_bench (1)     1405248   1372472  1371824      648
ora_pmon_bench (1)     1405696   1372920  1371824     1096
ora_reco_bench (1)     1405160   1372384  1371824      560
ora_smon_bench (1)     1405192   1372424  1371824      600
oraclebench (50)      70258944   1400768  1371824    28944
powerd (1)                1632      1576      976      600
rpcbind (1)               2584      2088     1296      792
sac (1)                   1736      1488     1296      192
sendmail (1)              2936      2280     1816      464
sh (4)                    4176      1320      912      408
snmpXdmid (1)             3744      3184     2024     1160
snmpdx (1)                2144      1920     1520      400
statd (1)                 2592      2208     1464      744
syslogd (1)               4192      3008     1456     1552
tail (1)                   968       936      792      144
tee (2)                   1808       952      792      160
tpccload (50)           418800     38944     3344    35600
ttymon (2)                3464      1744     1336      408
utmpd (1)                 1000       944      816      128
vmstat (1)                1312      1280      808      472
vold (1)                  2696      2408     1768      640
vxconfigd (1)            14656     13944     1328    12616
-----------------------------------------------------------------------------
File (Count)            Kbytes  Resident   Shared  Private
-----------             ------  --------   ------  -------
/usr/lib Shared Library Totals
                        291936     27928     2856    25072
Other Shared Library Totals
                       1473544     29304     7248    22056
Mapped File Totals         560       488      488        0
Binary File Totals     1378632     18864    13328     5536
Shared Memory Totals  80280128   1360688  1360688        0
Anonymous Memory Totals
                         89680     84008        0    84008
-----------------------------------------------------------------------------
Grand Totals          83514480   1521280  1384608   136672
The bulk of the 1.5 Gbytes of resident memory used on this server is accounted for by 1.3 Gbytes of shared memory, which also constitutes the major component of the memory used for the Oracle processes.
The script can be used to report the physical and virtual memory consumption for a group of processes. For example, procmem ora will report memory consumption for all processes that have the string ora in a ps -ef report (Oracle processes typically meet this criterion). If another Oracle user running the same applications is added to the system, you would not expect an increase in the Shared component of memory for /usr/lib shared libraries, other shared libraries, binary files, or shared memory segments. The Private component would be expected to grow, though, for the shared libraries, the mapped files, and anonymous memory. The additional private memory required would probably be roughly equivalent to the private memory total for these applications divided by the current number of users.
Detail is available for all processes, as well as summaries for /usr/lib shared libraries (which tend to be used by many processes throughout a system and so should be counted only once for sizing purposes), other shared libraries (for example, Oracle shared libraries), mapped files (memory-mapped file system files), binary files (executable programs), shared memory segments, and anonymous memory (heap and stack).
The procmem script will accurately show all memory directly used by processes, but not memory belonging to UFS files that are resident in the file system page cache. Since pages from UFS database files are not directly mapped into the address spaces of database processes, they will not appear in the totals. The memps -m command from memtool provides this information (it requires the memtool kernel module, installed when the memtool package is first set up, to be loaded).
What You Can Do to Reduce Paging
If you are using file systems for your database files, the first step is to upgrade to Solaris 8 or else enable priority paging for earlier releases of Solaris. If necessary, you could also consider the following steps to relieve memory pressures on your database server:
Add more memory to the system.
Use Direct I/O for database files.
Reduce the size of the database buffer cache. This reduction may result in additional database I/O, but that is almost always preferable to paging.
Remove applications from the server. If applications are running on the server, move them to a client system and run the applications in client/server mode. Memory should be freed up as a result.
Reduce the number of users on the system.