
SQL Server Reference Guide


Creating a Performance Tuning Audit - Interpretation

Last updated Mar 28, 2003.

We've arrived at the third phase of the Performance Tuning Audit. If you've been following along in this series, you've learned that one of the best ways to tune a system is to create a systematic plan for evaluating it from top to bottom. Having a plan helps you keep everything straight when the pressure is on, as it surely will be in a crisis. Having a documented process is even better, so I've created a four-step outline to help you work through your issues. You can find that outline explained in the first tutorial in this series.

But you will benefit even more if you develop your plan and implement it before you run into issues. The reason is that the audit creates a baseline you can work from when you do have system slowdowns. Comparing the metrics you collect when the system shows a performance issue with the baseline allows you to find changes in the metrics quickly, and more effectively direct your tuning efforts. Why fight a hardware issue when you have a poorly performing index?
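To make that comparison concrete, here's a minimal T-SQL sketch of the idea. It assumes two hypothetical tables of my own invention, dbo.CounterSamples for the ongoing collections and dbo.CounterBaseline for the values you captured during the audit; adjust the names, columns and threshold to your own process.

    -- Flag any counter whose recent average drifts more than 25 percent from baseline.
    -- dbo.CounterSamples and dbo.CounterBaseline are hypothetical tables used for illustration.
    SELECT  s.ServerName,
            s.ObjectName,
            s.CounterName,
            AVG(s.CounterValue) AS RecentAverage,
            b.BaselineValue
    FROM    dbo.CounterSamples AS s
            INNER JOIN dbo.CounterBaseline AS b
                ON  b.ServerName  = s.ServerName
                AND b.ObjectName  = s.ObjectName
                AND b.CounterName = s.CounterName
    WHERE   s.SampleTime >= DATEADD(hour, -1, GETDATE())   -- the window showing the slowdown
    GROUP BY s.ServerName, s.ObjectName, s.CounterName, b.BaselineValue
    HAVING  ABS(AVG(s.CounterValue) - b.BaselineValue) > b.BaselineValue * 0.25
    ORDER BY s.ServerName, s.ObjectName, s.CounterName;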

At this phase of the process you've already defined your components and identified the individual items within those components to track. You've also learned how to select the objects and counters you need to monitor and track, which makes up the second part of the process. As I've mentioned, how you store the monitoring results and the earlier component definitions is less important than making the collections in the first place and using tools readily available to everyone on your team. I'll cover the tools I recommend for documenting and collecting the objects in another tutorial.
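If you want a concrete starting point, here is one possible sketch in T-SQL, assuming you collect into a table I'm calling dbo.CounterSamples (a name and schema I've made up for illustration). The SQL Server-side counters can come from the sysperfinfo system table in SQL Server 2000; Windows-level objects such as Processor and Network Interface would come from your Performance Monitor logs instead.

    -- Hypothetical collection table; adjust names and types to suit your own process.
    CREATE TABLE dbo.CounterSamples
    (
        ServerName    sysname       NOT NULL,
        ObjectName    nvarchar(128) NOT NULL,
        CounterName   nvarchar(128) NOT NULL,
        InstanceName  nvarchar(128) NULL,
        SampleTime    datetime      NOT NULL DEFAULT (GETDATE()),
        CounterValue  bigint        NOT NULL
    );

    -- Capture the SQL Server-specific counters from sysperfinfo.
    INSERT INTO dbo.CounterSamples
            (ServerName, ObjectName, CounterName, InstanceName, CounterValue)
    SELECT  @@SERVERNAME,
            RTRIM(object_name),
            RTRIM(counter_name),
            RTRIM(instance_name),
            cntr_value
    FROM    master.dbo.sysperfinfo;

Schedule the INSERT as a job at whatever interval fits your baseline, and load the Performance Monitor counters into the same table so everything you interpret lives in one place.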

So you have a good idea about the layout of your system and a lot of data collected from multiple tools. All along I've cautioned you against watching the monitoring and taking immediate action based on what you see. You, or other subject matter experts, will certainly find obvious corrections you could make during the evaluation phase. Making those changes not only corrupts the test, it might also mask other issues and dilute the effectiveness of the entire exercise. Of course, if there is a problem that could cause major damage before the audit is complete, by all means correct it. But if the issue can wait, then wait until you've completed the phase I'm discussing in this section.

Another reason to wait until you have completed the audit is that you should always work from as much intelligence as possible. If you've followed the process so far, you've got a great deal of information to work with — and you'll need it. The metrics you've collected don't have as much relevance by themselves as they do when combined with the documentation of each component. Here's a simple example to illustrate the point. I once examined the monitoring results for a system that showed a high level of processor inefficiency. At first glance, it seemed the system needed faster processors or more of them. When I examined the entire set of documentation, I found that the server was running Windows 2000 on hyper-threaded processors. Windows 2000 doesn't schedule threads onto hyper-threaded processors in the most efficient way, so the interpretation was that the processor issue could be addressed by installing an operating system, such as Windows 2003, that works better with hyper-threading.

With all the information available from your earlier documentation efforts, start with the static information, specifically the layout drawings and/or documentation. Look for obvious "choke points" in the architecture, such as network junctions that rely on older or slower routers, and note them down. Also note any significant processing load in the design, along with the servers involved in that processing. Do the same for storage and memory requirements, again from the design standpoint.

Once you've noted these items, cross-reference the static information for the servers you flagged in the design as carrying a heavy load. Do those servers appear to have adequate power to handle it, or have newer servers crept into the landscape that could be rotated into that position? Now check the dynamic counters to see whether the processors on those servers show a high load (above 70% utilization or so for more than a few minutes at a time). Check the metrics for memory, storage access and the other objects on those servers as well.
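As one example of that check, here's a sketch against the same hypothetical dbo.CounterSamples table from earlier, assuming the Processor object's % Processor Time counter has been loaded from your Performance Monitor logs. It flags any server that averaged more than 70 percent utilization during a ten-minute stretch of the collection.

    -- Flag sustained high processor load: any ten-minute window averaging over 70 percent.
    -- dbo.CounterSamples is the hypothetical collection table described earlier.
    SELECT  ServerName,
            DATEADD(minute, (DATEDIFF(minute, 0, SampleTime) / 10) * 10, 0) AS TenMinuteWindow,
            AVG(CounterValue) AS AvgProcessorTime
    FROM    dbo.CounterSamples
    WHERE   ObjectName   = 'Processor'
      AND   CounterName  = '% Processor Time'
      AND   InstanceName = '_Total'
    GROUP BY ServerName,
             DATEADD(minute, (DATEDIFF(minute, 0, SampleTime) / 10) * 10, 0)
    HAVING  AVG(CounterValue) > 70
    ORDER BY ServerName, TenMinuteWindow;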

Now compare the dynamic network information with the static information for those critical network junctions. Are the routers or switches showing a high number of dropped packets? Are you seeing a lot of CRC errors across the WAN? Also check the latency to make sure everything is within tolerance.
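For the server side of that check, a sketch like the following works against the same hypothetical collection table, assuming the Network Interface object's error counters were captured from each server. Dropped-packet and CRC statistics for the routers and switches themselves would come from your network team's own monitoring tools rather than from Performance Monitor.

    -- Surface any network adapters reporting packet errors during the collection.
    SELECT  ServerName,
            InstanceName      AS NetworkAdapter,
            MAX(CounterValue) AS PacketErrors
    FROM    dbo.CounterSamples
    WHERE   ObjectName  = 'Network Interface'
      AND   CounterName IN ('Packets Received Errors', 'Packets Outbound Errors')
    GROUP BY ServerName, InstanceName
    HAVING  MAX(CounterValue) > 0
    ORDER BY PacketErrors DESC;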

Move on to the workstations. What you're looking for here is high resource use, such as processor, memory and storage loads. Also check for packet errors on the network interface.

If possible, try a representative operation from the application, such as a data entry or reporting query. Correlate the originating "click" on the workstation with the return of the results to the screen. Using what you know from the landscape, you can trace the packet from the client to the database and back, something called a "ripple". The dynamic information you collect during that timeframe will show where the highest lag is, and direct your tuning efforts.
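To time the database leg of that round trip, a simple sketch like the one below will do; the query and table name are only placeholders, so substitute one of your application's own calls and run the same test during the baseline and again during a slowdown. SET STATISTICS TIME ON is another option if you also want the parse, compile and CPU breakdown.

    -- Time a representative operation from the application.
    -- dbo.Orders and the query itself are placeholders; use a real application call.
    DECLARE @start datetime, @durationMs int;

    SET @start = GETDATE();

    SELECT COUNT(*)
    FROM   dbo.Orders
    WHERE  OrderDate >= DATEADD(day, -1, GETDATE());

    SET @durationMs = DATEDIFF(millisecond, @start, GETDATE());

    PRINT 'Database portion of the round trip: ' + CAST(@durationMs AS varchar(12)) + ' ms';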

In this phase you want to interpret the data you've collected. There are times when you can take the data and make direct statements about it — such as a high number of dropped packets on a network device. In those cases you can interpret that you have a networking issue. At other times you won't be able to make direct statements, because everything is working properly, or at least as efficiently as it can. In that case you might have to evaluate the system's design or the software calls. It's a little like driving to a location but taking the wrong route. There's nothing wrong with the driver, the car, or the road; it's just that you're taking the wrong path. Changing the path fixes the time problem.

It's not always as simple as "Watch this counter for this value. When it is X, the problem is Y." If that were the case, a tool could do the interpretation for you. Most of the time the answer is "it depends", which means you'll have to combine your knowledge with that of others to find the issues. For instance, a latency of 600 milliseconds or more is normally an indicator of a network issue, but not always. In some cases that level of latency is acceptable in a properly architected application.

No matter what you encounter, write down what you think is wrong. Don't stop with the first red light that you see. Continue evaluating the system as a whole by examining the individual parts. Don't recommend corrections in this phase — that's next. Right now you're trying to say what is wrong — not what you need to do to fix the problem. Make a complete list and place it into your documentation.

In the next section I'll show you how to make suggestions for the fixes and how to create a plan to implement those fixes.

Informit Articles and Sample Chapters

The concepts I'm covering in this series aren't unique to databases. You'll find a great article about characterizing your network layout here.

Online Resources

Did I pique your interest in hyper-threaded processors and operating systems? Check out this article for more.