The Delta Method: Identifying Network Change
I wrote "Teach Yourself Network Troubleshooting" back in 1998 after having several conversations with some pretty smart folks who were frustrated that there was no concise and organized text on how to fix your network when it was broken. In fact, a buddy of mine—a real-life rocket scientist and MIT graduate—who used data networks extensively in his job would often complain that he simply didn’t have time to plow through 10 textbooks about 800 different networking topics: he just wanted a jump-start in the most important networking topics so that he could quickly dispatch problems and get back to building rockets. After several conversations with him and several other professionals—doctors, lawyers, engineers, and so on—my original outline for "Teach Yourself Network Troubleshooting" was born.
I suppose I shouldn’t have been surprised (but I was) when readers who were newcomers to the networking business started to write to me about how useful this book was in their new jobs as system administrators and network support personnel—even after they had received certifications in various vendor-specific fields! So, in the 2002 edition of this book, the folks at SAMS and I have made sure to address topics that are found in enterprise networks as well. Whether you’re new to computers or an old hand who wants to get up to speed with network troubleshooting, I’m pretty confident that you’ll find this book valuable. Enjoy!
The changes are small
"It ought to work," he says, yet:
The network is down.
The Mighty Quizro, The Law of Unintended Consequences
It can seem like gremlins are lurking in every server, wire center, and particularly in every Windows PC. Usually, however, what we attribute to gremlins is actually the result of a change. Frequently, this means that a user has changed something and subsequently has forgotten (or denied) making the change.
We all do this; we're our own worst enemies. For example, say you install a new application on your computer, decide you don't like it, forget to delete it, and then have trouble accessing your company's intranet the next day. If you're lucky, you won't forget you installed a new app, and you might uninstall it to find that, voilà, you can now access your company intranet. Typical application conflict, right? Easy.
But what happens if you have a really great weekend in between and forget? You might "spin your wheels" for hours while trying to figure out what's wrong. No shame in it; I've been there, too!
As often as we might suspect that there's a ghost in the machine, there usually isn't. For all their potential complexity, computers and networks, once set up, are rather predictable devices and tend to function exactly the same way all the time, until something changes. One of the most essential skills in troubleshooting, then, is finding that change or, as the math geeks might say, "finding the delta (Δ)."
Several common vectors of system change that you will want to consider are as follows:
Person initiated (yourself; your peers; vendors such as ISPs, VARs, consultants, and perhaps even a malicious outsider)
Equipment failure (a marginal component, perhaps, causing an intermittent or strange problem)
Resource allocation failure (your database running out of a resource, such as system inter-process communication structures, and perhaps giving you and your users a misleading error rather than a proper message about the resource failure)
The Fat Finger Factor
Let's look at person-initiated change first, starting with my favorite person to blame: me. Many times, when you make a change to a router or server in order to offer a new service or to fix a problem, you introduce new and wonderful problems, simply because to err is human. If the error that you introduce does not take effect immediately, the change seems to work, and it will be the last thing you think of if a problem surfaces later. (This is a really, really good reason to keep a log of changes; rather than having to remember or ask others what network attribute has changed, you can simply look it up.)
For example, one time, I was fixing the startup file for a NetWare server to automatically make a certain volume available upon the next bootup. I rebooted the server that night and tested the users of that server; all seemed well.
The next day, we found that Windows NT (but not Windows 95) users were complaining that their time was off by five hours. Related? Couldn't be! Surely, we thought, this had to be something that had globally affected our NT users' configurations; it couldn't possibly be the server, particularly because the users in the affected department did not log in to this server for file services.
Of course, it was the server I had touched. Although I thought that my change was benign, I had in fact accidentally introduced an error into the startup file. This is what I call the fat finger factor: while editing any configuration file (no matter how benign your changes are), you might introduce a stray character or two that makes a key command or parameter unintelligible to the server.
In my case, the problem was that I had edited the NetWare startup file for something quite innocuous and hit a random key by accident at the top of the file. The first line of the file was responsible for setting the time zone. Instead of reading
SET TIME ZONE=EST5EDT
the line was accidentally edited to read
\SET TIME ZONE=EST5EDT
When I saved the file with my "benign" changes, all of a sudden the server had no idea what time zone it was in because it didn't understand the \SET command. The server in question was not the file server for the department, but it was the time server!
Novell-connected Windows NT synchronizes the time from the server differently from Windows 9x; it relies heavily on time zones. So, my fat finger threw every NT workstation on our network off by five hours. (We're in the Eastern time zone, which is five hours off from universal time.) Ouch!
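A stray character like this is easy to miss by eye but trivial to find mechanically if you kept a known-good copy of the file. The following is a minimal Python sketch of that idea, not something from the original incident: the function name config_delta is made up for illustration, and the second line of the sample file (MOUNT ALL) is only a placeholder for "the rest of the startup file."

```python
import difflib


def config_delta(known_good: str, current: str) -> list[str]:
    """Return only the lines that differ between two versions of a config file.

    Lines prefixed "- " exist only in the known-good copy;
    lines prefixed "+ " exist only in the current copy.
    """
    diff = difflib.ndiff(known_good.splitlines(), current.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Illustrative file contents; only the first line comes from the story above.
known_good = "SET TIME ZONE=EST5EDT\nMOUNT ALL\n"
current = "\\SET TIME ZONE=EST5EDT\nMOUNT ALL\n"

for line in config_delta(known_good, current):
    print(line)
```

Run against the two versions above, the only lines reported are the time zone line before and after the stray backslash, which is exactly the delta you are hunting for.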
The lesson is this: Always point the finger at yourself first. Always be thinking, "What have I done lately?" Then, be prepared to undo your changes. As I discussed in the last chapter, keeping good notes, particularly a formal logbook, is a really good idea. That way, you can compare the date when the problem started to the dates of changes entered in your logbook.
Of course, others in your organization might be equally at fault, particularly if you share responsibility for your network. A logbook helps here, too, but don't rely solely on the logs; it also helps to talk to each other and discuss what you've been working on.
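If the logbook is kept in a form a program can read, comparing dates becomes a one-line query. Here is a rough Python sketch under that assumption; the Change record, the changes_before function, and every entry in the sample log are hypothetical and exist only to show the comparison.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    when: datetime  # when the change was made
    who: str        # who made it
    device: str     # what was touched
    note: str       # what was done


def changes_before(log, problem_start, window_days=30):
    """Return changes made in the window_days before the problem, newest first."""
    earliest = problem_start - timedelta(days=window_days)
    recent = [c for c in log if earliest <= c.when <= problem_start]
    return sorted(recent, key=lambda c: c.when, reverse=True)


# Made-up logbook entries, for illustration only.
log = [
    Change(datetime(2002, 3, 1, 9, 0), "pat", "router-1", "added static route"),
    Change(datetime(2002, 3, 11, 18, 30), "me", "time-server", "edited startup file"),
    Change(datetime(2002, 3, 13, 8, 0), "pat", "print-server", "new printer queue"),
]

problem_start = datetime(2002, 3, 12, 8, 15)
for c in changes_before(log, problem_start):
    print(c.when, c.who, c.device, c.note)
```

The most recent change before the problem is listed first, so "What have I done lately?" gets answered at the top of the list. A wider window_days is worth trying when nothing recent looks guilty.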
When starting to troubleshoot a problem, if there are others who work on your network, you should definitely query them about changes they've recently made.
It can be particularly troublesome when you, or someone else responsible for the network, does a "fat finger" on a device and doesn't immediately restart it. When the device does finally get restarted, the change is no longer recent and is therefore hard to point to as the culprit. When you consider this, it makes sense to think of any device that has been recently restarted as a suspect device.
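One way to act on this, sketched in Python below, is to list every change made to a device between its previous boot and its most recent boot, because those are the changes that the restart may have activated for the first time. This assumes that you have both boot times and the change timestamps on record, and that the changes in question only take effect at startup; the function name and all of the dates are invented for the example.

```python
from datetime import datetime


def activated_by_restart(change_times, previous_boot, last_boot):
    """Return changes made after previous_boot and no later than last_boot, oldest first.

    Under the stated assumption, these are the changes that first
    took effect when the device came back up at last_boot.
    """
    return sorted(t for t in change_times if previous_boot < t <= last_boot)


# Hypothetical boot times and change timestamps for one device.
previous_boot = datetime(2002, 1, 5, 2, 0)
last_boot = datetime(2002, 3, 11, 22, 0)
change_times = [
    datetime(2001, 12, 20, 14, 0),  # already tested by the January restart
    datetime(2002, 2, 9, 16, 45),   # a month old, but never restarted since
    datetime(2002, 3, 11, 18, 30),  # made a few hours before the restart
    datetime(2002, 3, 12, 9, 0),    # made after the restart
]

for t in activated_by_restart(change_times, previous_boot, last_boot):
    print(t)
```

Note that the month-old change is reported alongside the recent one: as far as the device is concerned, both are brand new.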
Of course, problems after a restart won't always be as subtle as the previous problem: I knew a Unix consultant who restarted a server to ensure that the new application he had installed would start up after the operating system started, only to find that the entire server operating system failed to load upon reboot. After much hair-tearing ("Jonathan, I SO did not touch anything to do with the OS loading, man!"), fixing the problem, which was unrelated to the change he made, and later investigating, we found that a system programmer with way too many security rights had accidentally removed the files in the startup filesystem and replaced them with his own files. He had done this a month previously, so this was a time-delayed fat finger factor. The programmer wasn't even the one who got burned; the consultant and I were the ones who had to fix the system by booting from removable media and replacing the files from backup.
In general, you should always restart a device after a change has been made to it: You want to know if your change does something bad so that you can undo it while it is fresh in your mind.
What about the problem of being wrongly accused of making a bad change? I combat this problem in two ways: First, for devices that I am responsible for, I put them on an automated restart schedule (see Hour 18, "Managing Change: Consistency and Standards," for some tips on how to do this); second, if I am able to, I restart a device before I start performing change operations. This way, I know that I am about to change a device that isn't already seriously broken.
Finally, Bill's 1st Law of Network Changes: Always, always, always make sure that everything is backed up before making changes!
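For a single configuration file, the smallest possible version of this habit is to copy the file aside before opening it in an editor. The Python sketch below shows one way to do that; the function name backup_before_edit and the naming scheme for the copy are assumptions made for illustration, and this is no substitute for a real backup of the whole system.

```python
import shutil
import time
from pathlib import Path


def backup_before_edit(path):
    """Copy a file to a timestamped sibling and return the copy's path."""
    src = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{src.name}.{stamp}.bak")
    shutil.copy2(src, dst)  # copy2 also tries to preserve file timestamps
    return dst
```

Calling backup_before_edit on a startup file just before editing it leaves a dated copy next to the original, which gives you both something to restore and something to compare against later.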