SQL Server Reference Guide

Database Troubleshooting

Last updated Mar 28, 2003.

This reference guide on SQL Server is divided into sections that help you quickly locate information based on who you are and what you're trying to do. For instance, if you're a Database Administrator and you're interested in learning about the tools used for managing SQL Server, you'd look in the Microsoft SQL Server Administration area and find the tutorial on SQL Server 2000 Management Tools or SQL Server 2005 Management Tools. If you're a Database Developer and you want to know more about SQL Server's Data Types, you can navigate to the Microsoft SQL Server Programming area and select the tutorial called Data Types. We've also got sections for the Executive looking to gain an understanding of what SQL Server does, and others with practical examples of how you can manage and use SQL Server. The section you're in now, Professional Development, deals with how you can take your skills to the next level.

The other day I was asked how I go about solving problems. I was struck by the fact that many technical professionals solve problems in different ways. The approach I've seen many of them take is either ad hoc, relying on their own past experience or that of people they consult, or systematic. If you observe how the seasoned professionals in your organization work, it's often a combination of both.

I began to do a little searching on the process and found that there are many books, magazine articles, web pages and other reference materials that talk a lot about troubleshooting a specific problem, but not about troubleshooting in general. In this tutorial I'll do just that, although I'll slant the process towards solving database issues. I felt that this section was a good place to put the article, since one of the marks of a professional is how, and how well, he or she solves problems. The next level, of course, is how well you prevent problems, but that's another tutorial.

Interestingly enough, if you talk about troubleshooting in the general sense, it's a deceptively simple set of questions asked by almost all technicians in any discipline:

  1. What Does It Do?
  2. What Changed?
  3. How Does It Work?
  4. What About the Problem Doesn't Match the Previous Steps?
  5. What Do You Need To Do To Make It Work?

Let's take this question outline, extrapolate it to databases (and applications they interact with) and figure out how you can put it to good use.

What Does It Do?

The first question you need to ask and answer is What Does This System/Code/Hardware Do? In some cases this is obvious: a hard drive stores data, a memory stick is used in a server, a database processes data. But to troubleshoot, you need to have more than a passing definition for the components in the system. I usually find myself at least mentally running through a process of answering this question a bit more thoroughly.

First, I categorize the objects in the problem. Is this hardware? What is that hardware made up of? Is it software? What kind? What versions? And so on. Of course, the problem might involve an interaction of hardware and software – but I'll leave that question open for the moment.

Once I know what I'm dealing with, I try to identify what I know about it. If I'm very familiar with a particular piece of software, I can usually rely on my experience or an extension of it to zero in on what is wrong and how I can fix it. If I don't know that much about the software, hardware, system or process, the first thing I do is try to learn more about what it does. This is an important point.

The reason that this part of the process is so important is that you can actually do a great deal of harm if you rely on past experience or you don't completely understand the components within the problem. I once tracked down a very difficult problem in an application where only certain parts of the application were acting incorrectly. Entire datasets were disappearing, and it involved financial data. The problem turned out to be a DBA who had changed the ANSI_PADDING setting on one workstation. Normally the application relied on this setting being off, but the DBA had set it on because that's what he was familiar with. Every data screen that the workstation touched added 30 blank spaces to the fields, and the application no longer matched the data properly during search operations. We had to write a script to locate the spaces in the data and remove them, and all was well.
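
The cleanup script from that incident isn't reproduced here, but the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are plain dictionaries rather than a real table, and the names `find_padded`, `strip_padding`, `account` and `amount` are made up for the example.

```python
def find_padded(rows, columns):
    """Return (row_index, column) pairs whose text value ends in spaces."""
    hits = []
    for index, row in enumerate(rows):
        for column in columns:
            value = row.get(column)
            if isinstance(value, str) and value != value.rstrip(" "):
                hits.append((index, column))
    return hits


def strip_padding(rows, columns):
    """Return copies of the rows with trailing spaces removed from the given columns."""
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy, so the original rows stay available for comparison
        for column in columns:
            if isinstance(fixed.get(column), str):
                fixed[column] = fixed[column].rstrip(" ")
        cleaned.append(fixed)
    return cleaned


# Hypothetical sample data: the first row carries 30 trailing blanks.
rows = [
    {"account": "ACME" + " " * 30, "amount": 100},
    {"account": "GLOBEX", "amount": 250},
]

padded = find_padded(rows, ["account"])
cleaned = strip_padding(rows, ["account"])
```

Locating the padded values first, before changing anything, gives you a record of exactly which rows were affected. That is worth having when the data is financial.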

So even if you think you know what the hardware, software, process or product does, find out from those who know exactly what it does. If the situation warrants it, find out from more than one person.

What Changed?

This is the most infamous question in the history of troubleshooting. The reason is that the answer is always "nothing", at least at first. Most people who do a bad thing don't want to admit it, so there's a vested interest in your not finding this answer. But the fact remains that when something is working and suddenly stops, there's always a change involved. Either some component failed or is broken, or some process was violated, or some environmental factor is different than it was before.

The trick to getting the answer to this question is not to ask it. You'll need to research the activities that preceded the problem so that you can zero in on the change. And don't forget the political part of this process. You're going to need to allow whoever did the bad thing to save face, or at least not attack them directly. Make sure that they understand that your goal is to fix the problem, not the blame.

If you find the change and you can remedy it, you're finished with the process. Just two questions get you out of the outline. But if you can't undo the change, or you're not sure how to put things right again, you'll need to continue with a few more questions.
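
One way to research what preceded a problem without asking anyone is to pull every recorded change from a window of time before the incident. Here is a minimal Python sketch of that idea. It assumes you already have some kind of change log; the `changes_before` function, the 48-hour window and the sample entries are all hypothetical.

```python
from datetime import datetime, timedelta


def changes_before(events, incident_time, window_hours=48):
    """Return events recorded in the window before the incident, newest first."""
    start = incident_time - timedelta(hours=window_hours)
    recent = [e for e in events if start <= e["when"] <= incident_time]
    return sorted(recent, key=lambda e: e["when"], reverse=True)


# Made-up change log entries, for illustration only.
events = [
    {"when": datetime(2003, 3, 20, 14, 0), "what": "Applied patch to middle tier"},
    {"when": datetime(2003, 3, 27, 17, 30), "what": "Added user accounts"},
    {"when": datetime(2003, 3, 28, 8, 15), "what": "Brought new server online"},
    {"when": datetime(2003, 3, 28, 10, 0), "what": "Restarted service after failure"},
]

incident = datetime(2003, 3, 28, 9, 0)
recent = changes_before(events, incident)
for entry in recent:
    print(entry["when"].isoformat(), entry["what"])
```

The newest changes are listed first because the change closest to the failure is usually the first one worth checking.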

How Does It Work?

At this point I return to the question of the complex interactions between problem components. On some problems you don't actually have to be familiar with the technology to solve them. For instance, I don't have to be an expert on magnetic sputtering and platter layering techniques to realize I have a bad hard drive. I just need to know that the hard drive isn't doing what it's supposed to do, and that it's time to replace it.

To go to the next level, however, you need to have a firm understanding of the parts of the hardware, software and processes to solve complex problems. I think of it in the same terms I used in my military days – you prepare for battle when you're at peace. When you've got time, you should set up a lab, even a virtual one, and test and try out various technologies so that you understand everything involved with the problem. In some cases that isn't practical, so you'll need to involve others.

In any case, you need to make sure you know a lot about what you're working on. That's what sites like this one do for you – we distill knowledge and experiences into 30-minute articles that you can leverage to help you understand concepts and theories around the platform you're working with.

There's simply no substitute for knowledge. To be able to correct the issues you find, you'll have to educate yourself thoroughly about what happens at each step of the system's operation.

What About the Problem Doesn't Match the Previous Steps?

With the discovery complete, you need to find out what about the current situation doesn't match either a) how it was working or b) how it's supposed to work. It's sort of like how the bank professionals spot counterfeits. They don't study the various ways that criminals make fake money; they study the real thing. Once you know how something is supposed to look, you can quickly spot when it doesn't look that way anymore.
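
Studying "the real thing" can be as simple as keeping a record of the known-good settings and comparing the current state against it. The following Python sketch shows one way to do that. The `diff_settings` function and the setting names are assumptions made for illustration, not the output of any real tool.

```python
def diff_settings(baseline, current):
    """Return {setting: (expected, actual)} for every setting that differs."""
    missing = "<missing>"
    differences = {}
    for key in sorted(baseline.keys() | current.keys()):
        expected = baseline.get(key, missing)
        actual = current.get(key, missing)
        if expected != actual:
            differences[key] = (expected, actual)
    return differences


# Hypothetical snapshots: how the system was configured when it worked, and now.
baseline = {"ansi_padding": "OFF", "terminal_servers": 2, "middle_tier_boxes": 2}
current = {"ansi_padding": "OFF", "terminal_servers": 3, "middle_tier_boxes": 2}

differences = diff_settings(baseline, current)
for name, (expected, actual) in differences.items():
    print(f"{name}: expected {expected}, found {actual}")
```

The comparison only works if the baseline was captured while the system was healthy, which is one more reason to prepare before the problem occurs.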

What Do You Need To Do To Make It Work?

This is the part that is easier said than done. Once again there's no substitute for knowledge. You should be familiar with how the system works to be able to put it back together again. Using that knowledge you'll need to apply any configuration changes, make any process corrections, or replace any faulty components within the system to repair it.

It may be that the system is working outside of its design. Some systems I've encountered remind me of the circus cars holding dozens of clowns. You wonder to yourself why anyone would overload the system or use it in a certain way. Once again you have to rely on your political skills to explain that the organization might have to buy more hardware or do things in a different way.

Let's put all this into a practical example. You're working in your office when someone approaches you and says that the ERP system has suddenly stopped working.

First, you need to find out what the ERP system does. Your research leads you to two subject matter experts in the system, one business and the other technical. You find that it's a three-tier application with a client front end, a COM+ middle layer, and a single clustered database system that the middle layer talks to.

Your first look shows you that nothing obvious has changed, but on further investigation it appears that three new users were added. Although nothing else has changed, you record the additional load.

You find that the system works by connecting multiple users through a set of COM+ boxes that talk to the single database. You check the settings on the database system and find no problem with anything that affects connections, nor is the system starved of resources. Going back to the change, you notice that two of the new connections use Terminal Services as an interface to the application. Probing further, you find that since the current Terminal Servers were overloaded, the technical staff created a new system for the additional users, and moved a few users from the other Terminal Servers to balance the load.

As you might guess, this happened to me in a recent situation. The problem turned out to be that the application wasn't designed to allow a single user to connect from multiple stations; it simply locked up when that happened. One of the users moved from the other Terminal Servers had two connections open at one time. This couldn't happen while they were logged onto one station, but when another Terminal Server was added, the user connected twice.
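
A check for this condition can be sketched in Python. It is an illustration rather than the method used at the time: it assumes you can obtain a list of (user, host) pairs for the open sessions from whatever monitoring you have, and the function name, user names and host names are invented.

```python
from collections import defaultdict


def users_on_multiple_hosts(sessions):
    """Map each user connected from more than one host to the sorted list of hosts."""
    hosts_by_user = defaultdict(set)
    for user, host in sessions:
        hosts_by_user[user].add(host)
    return {
        user: sorted(hosts)
        for user, hosts in hosts_by_user.items()
        if len(hosts) > 1
    }


# Hypothetical open sessions as (user, host) pairs.
sessions = [
    ("alice", "TS1"),
    ("bob", "TS1"),
    ("alice", "TS3"),
    ("carol", "TS2"),
]

duplicates = users_on_multiple_hosts(sessions)
print(duplicates)
```

Collecting hosts in a set means a user with several sessions on the same station is not flagged. Only connections from different stations are reported, and that is the case the application could not handle.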

I used this simple process to correct an issue that had us all baffled for a while. It's a matter of staying calm, following the process, and being disciplined in your approach.

InformIT Articles and Sample Chapters

Another article that doesn't deal specifically with databases but does have some great information is this sample book chapter by Chris Wolf.

Online Resources

It has nothing to do with databases, but I think you'll recognize the process I follow in a different format in this document regarding troubleshooting network problems.