
Monitoring the Buffer Cache

Given that the buffer cache has a significant impact on the performance of OLTP applications, it should be no surprise that monitoring the effectiveness of the buffer cache is an important priority. The buffer cache hit rate is one of the main database metrics that you should monitor for OLTP workloads. The cache hit rate measures how many database blocks were found in the buffer cache rather than read from disk. Blocks read from disk are referred to as physical reads. Blocks retrieved either from disk or the buffer cache are referred to as logical reads (so physical disk reads are included in both metrics). The buffer cache hit rate is calculated in the following way:

BufferCacheHitRate = (1 - (PhysicalReads / LogicalReads)) × 100

Not all databases supply the buffer cache hit rate in a convenient form; you might need to calculate it yourself. The chapters focusing on database tuning for Oracle, Sybase, Informix, and DB2 (Chapters 22 to 25) explain how to calculate this metric for each database.
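As a minimal sketch, the formula above can be expressed as a small function (Python is used here purely for illustration; the per-database chapters show where to obtain the underlying counters):

```python
def buffer_cache_hit_rate(physical_reads, logical_reads):
    """Return the buffer cache hit rate as a percentage.

    logical_reads counts every block request (cache hits plus
    physical reads), so physical_reads must not exceed it.
    """
    if logical_reads == 0:
        return 0.0
    return (1 - physical_reads / logical_reads) * 100

# Example: 1,000 physical reads out of 20,000 logical reads
print(round(buffer_cache_hit_rate(1_000, 20_000), 1))  # 95.0
```

Note that because logical reads include physical reads, the ratio can never exceed 100%.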

Note that the buffer cache stores more than just table data. Index blocks are held in the buffer cache and tend to enjoy high cache hit rates since they are relatively compact and can store many index nodes per block. Since index blocks are frequently used, they are often given a more favorable weighting than data blocks so they will stay in the cache longer.

The data dictionary, which holds information about tables, columns, indexes, database files, and similar data, is held in database tables and therefore also finds its way into the buffer cache. Once the database has been running for a while, accesses to data dictionary blocks can usually be satisfied from the cache, leading to a high cache hit rate for these blocks (usually higher than for data table blocks).

An Acceptable Cache Hit Rate

What represents an acceptable buffer cache hit rate? Unfortunately, this question is not easy to answer.

I recently asked a database administrator about the cache hit rates on his Oracle instances. He told me they were within Oracle guidelines and added that they were over 90%.

If his answer was fuzzy, it certainly wasn't without foundation. I recently reviewed three Oracle8 tuning textbooks and discovered three different values for the minimum recommended buffer cache hit rate for OLTP applications: 70%, 95%, and 98%.

Unfortunately, achieving an acceptable buffer cache hit rate isn't as simple as setting an arbitrary value to aim for. As we will see, though, there are objective measures that you can use to evaluate your current hit rate. But first we need to consider the meaning and implications of the cache hit rate statistic.

The Cache Hit Rate Confusion

During training I have sometimes asked the following question: "If you were able to improve your cache hit rate from 90% to 95%, how much of a reduction would you see in your physical reads?"

Answers have ranged from "There will be a 5% reduction" to "You can't be sure." In fact you can be sure: the physical reads will be halved.

The confusion stems from the fact that the hit rate obscures the metric that we really care about: the miss rate. The miss rate is the number of I/O requests that cannot be satisfied from the cache and therefore result in a physical read. The miss rate is simple to compute: it is 100 minus the hit rate. So a 90% cache hit rate means a 10% cache miss rate, and a 95% hit rate means a 5% miss rate.

When the earlier question is expressed in terms of the miss rate, it becomes easier to answer: "If you were able to reduce your cache miss rate from 10% to 5%, how much of a reduction would you see in your physical reads?" Reducing the miss rate from 10% to 5% means that only half as many physical reads are required.

So the number of physical disk reads is halved when the buffer cache hit rate improves from 90% to 95%, and halved again from 95% to 97.5%.
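The arithmetic behind this halving effect can be demonstrated with a short sketch (the logical read count and Python phrasing here are illustrative assumptions, not from the text):

```python
def physical_reads(logical_reads, hit_rate_pct):
    """Physical reads implied by a given hit rate, assuming a
    fixed number of logical reads. Misses = 100% - hit rate."""
    miss_rate = (100 - hit_rate_pct) / 100
    return logical_reads * miss_rate

logical = 1_000_000  # hypothetical logical read count

print(round(physical_reads(logical, 90)))    # 100000
print(round(physical_reads(logical, 95)))    # 50000  -- half as many
print(round(physical_reads(logical, 97.5)))  # 25000  -- halved again
```

Each 50% cut in the miss rate halves the physical reads, regardless of the starting hit rate.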

With the implications of the cache hit rate clarified, which recommendation—70%, 95%, or 98%—is the optimal cache hit rate to aim for?

Cache Hit Rate Guidelines

There is no magic figure that represents the acceptable buffer cache hit rate for a particular database application or database system. There are, however, factors that you must consider in deciding whether your current hit rate is acceptable. The following list identifies the most important factors.

  • Available memory. You typically increase the cache hit rate by expanding the size of the buffer cache. Such an expansion only makes sense if you have enough main memory to support it, though. If you put too much memory into the buffer cache and starve the applications as a result, you will cause application paging. Application paging is likely to lead to much worse performance problems than those you solved by increasing the size of the buffer cache.

    Occasionally you might actually find yourself decreasing the size of the buffer cache. For example, if the number of active users connected to the system increases, the demands on main memory will increase also. By decreasing the size of the buffer cache, you can expect to suffer an increase in the number of physical database reads. The lower cache hit rate will be worthwhile, though, if it helps you avoid application paging.

  • Disk utilization. An increased cache hit rate means fewer physical reads, which means fewer I/Os on the data disks. Use the iostat, sar, or statit utilities to check the utilization of the relevant disks. If the utilization is high, reducing the number of disk reads might reduce the load on the disks sufficiently to reduce the disk response time. Reduced response times can translate into savings in transaction response times.

    Remember that changes in cache hit rates have little effect on data disk writes. If, for example, you observe that half the disk I/Os are reads and half are writes, then halving the cache miss rate will only reduce disk I/Os by 25% since disk writes will stay constant.

    If you are using UFS database files, then each physical read you eliminate means one less page the Solaris Operating Environment will need in the file system page cache (unless you are using Direct I/O, in which case UFS pages are not cached). The result is less kernel activity.

  • Application response times. The transaction path length (that is, the number of machine code instructions executed) is shorter if a block is found in the buffer cache rather than if it is read from disk. Shorter path lengths mean lower CPU consumption, and freeing up CPU cycles can help improve transaction response times. If transaction response times are an issue, then a higher cache hit rate may help. Note, though, that many other factors contribute to transaction response times, so improvements resulting from caching effects may not prove to be significant.
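The disk-utilization point above deserves a numeric check: because writes are unaffected by the cache hit rate, the reduction in total disk I/Os is smaller than the reduction in reads. A rough sketch (the 50/50 read/write split is the example assumption from the text):

```python
def total_io_reduction(reads, writes, miss_rate_factor):
    """Fractional reduction in total disk I/Os when the cache miss
    rate is scaled by miss_rate_factor; writes stay constant."""
    before = reads + writes
    after = reads * miss_rate_factor + writes
    return (before - after) / before

# Half the I/Os are reads, half are writes; halving the miss
# rate therefore cuts total disk I/Os by only 25%.
print(total_io_reduction(500, 500, 0.5))  # 0.25
```

The heavier the write component of the workload, the less leverage the buffer cache hit rate has on total disk traffic.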

A Worked Example

To illustrate the principles described above, let's consider a practical example. The following statistics, based on a real workload, were reported by the Oracle8 utlbstat and utlestat utilities (described in "The utlbstat and utlestat Scripts" on page 313):

 consistent gets          52180045
 db block gets            1675582
 physical reads           1217906
 physical writes          200263

We will flesh out the process by asking a number of questions.

  • What is the current cache hit rate?

    As we see from "Calculating the Buffer Cache Hit Rate" on page 317, logical reads for Oracle8 are the total of consistent gets and db block gets, so the cache hit rate is:

    (1 - (1217906 / (1675582 + 52180045))) × 100 = 97.7%

    According to two of the three Oracle8 tuning textbooks I referred to earlier, it's time to kick back and relax! We're almost at the 98% target suggested in the third book, too. So all appears to be well. However, there's still a lot we don't know, so we won't take a vacation just yet.

  • What is the physical read rate?

    We can see that we are doing more than a million physical reads, which sounds like a lot of I/O. Before we become too alarmed, though, we need to find out the duration of the measurement period. At the end of the report the measurement interval was reported as being one hour, from 17:53:06 to 18:53:06. So 1,217,906 physical reads in one hour means the rate was 338 reads per second.

  • What is the physical write rate?

    The report showed that in the same interval, 200,263 physical writes were completed, or 56 writes per second. The total I/Os per second to the Oracle data disks amounted to 394 (the sum of 338 and 56).

  • How many disks, and of what type, are used to store the data tables?

    In this case, I happen to know that the data tables were striped across eight old 5,400 rpm disks, each with a capacity of 2.1 Gbytes. These disks can reasonably be expected to sustain a peak of around 60 I/Os per second. Assuming the I/Os were evenly spread across the eight disks, we were averaging 49 I/Os per second, approximately 80% of the expected I/O capacity of the disks.

    Most workloads are not uniform in the load they place on the system over the course of a one-hour period; peaks and troughs are more usual. So it is likely that I/O during peak periods will exceed 60 I/Os per second per disk. Increasing the cache hit rate could prove beneficial in this case, especially if the one-hour measurement interval does not represent a heavy workload period. If it proved impossible to improve the cache hit rate for some reason, it would be wise to use more than eight disks in the stripe.

  • How much free memory is available to expand the buffer cache?

    Before we decide to increase the size of the buffer cache to improve the cache hit rate, we need to be sure that we have enough free memory in the system to support an increase.

    In this case, the buffer cache was tiny—only 40 Mbytes in size—and plenty of memory was available to increase it.

  • By what proportion can we increase the buffer cache?

    We also need to determine the percentage by which we can increase the buffer cache. In this case, the size of the buffer cache could have been quadrupled, and since the load on the disks was heavy, an expanded buffer cache might have resulted in worthwhile benefits.

    Unfortunately, the higher the cache hit rate, the more memory it takes to increase it further. Eventually a point is reached beyond which there is little improvement in the cache hit rate, no matter how much memory is added.

    Suppose, for example, your cache is 512 Mbytes in size and your hit rate is 80%. You double the size of the cache by adding another 512 Mbytes and discover that your cache hit rate has increased to 90%. If you are hoping that a further 512 Mbytes will improve your cache hit rate by the same amount, bringing it close to 100%, you are likely to be sadly disappointed. It is more probable that you will need to double the cache again to achieve a significant improvement. The law of diminishing returns applies to cache sizing; we discuss this effect in more detail in the next section.

    With a cache hit rate of 97.7%, we might have almost reached the upper limit. On the other hand, there could be further benefits to be enjoyed. The miss rate is 2.3%, and reducing it to 1.15% (representing a cache hit rate of 98.85%) would halve physical read I/Os. Only testing can reveal the possibilities.
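The arithmetic of the worked example above can be reproduced end to end as a sketch (all input figures come from the utlbstat/utlestat report quoted earlier; the eight-disk stripe is as described in the text):

```python
# Statistics from the Oracle8 utlbstat/utlestat report
consistent_gets = 52_180_045
db_block_gets   = 1_675_582
phys_reads      = 1_217_906
phys_writes     = 200_263
interval_secs   = 3600   # one-hour measurement window
num_disks       = 8      # data tables striped across eight disks

# Logical reads for Oracle8 = consistent gets + db block gets
logical_reads = consistent_gets + db_block_gets
hit_rate = (1 - phys_reads / logical_reads) * 100

read_rate  = phys_reads / interval_secs
write_rate = phys_writes / interval_secs
per_disk   = (read_rate + write_rate) / num_disks

print(f"hit rate:     {hit_rate:.1f}%")       # 97.7%
print(f"reads/sec:    {read_rate:.0f}")       # 338
print(f"writes/sec:   {write_rate:.0f}")      # 56
print(f"per-disk I/O: {per_disk:.0f} IOPS")   # 49
```

Against the 60 I/Os per second these 5,400 rpm disks can sustain, 49 IOPS per disk is roughly 80% of capacity, which is the basis for the concern about peak periods.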

In summary, the objective of monitoring the buffer cache hit rate is not simply to achieve some arbitrary number; it is to optimize database performance by finding a balance between memory consumption and disk I/O.

If your cache hit rate is 70%, you will probably find that even a modest increase in buffer cache size will reduce the rate of physical reads for an OLTP workload. If, however, you have little or no spare memory, your disk subsystem is comfortably coping with the load, and your users are happy, you may not need to make any changes at all.

If your cache hit rate is 95%, you will probably need a substantial increase in buffer cache size to make much impact on the rate of physical reads. If your disks are overloaded and you have plenty of spare memory, though, it is worth the attempt.

The appropriate balance in your environment will depend on data skew, application response times, and, especially, the availability of memory and disk I/O capacity.
