Windows Throughput, Resources, and Rescue
Here's where the rubber meets the road: The part in which, if you don't have the necessary resources, or if some bit of configuration fluff gets dinked up, all of the preceding fancy file and printer sharing simply seizes up and stops working.
The first bit of good news is that you can say what you want about Microsoft, but it built some killer statistic-gathering mechanisms into all Microsoft Networking server and client programs. This, coupled with System Monitor, enables you as Joe User to see how you're doing in terms of performance at any given time, so you can quickly and easily profile a particular server or network segment. This is something that used to take a network expert and a $10,000 piece of equipment to do, but, whoa, you can do this on pretty much any Windows workstation. (Did I mention that I like this feature?)
In all seriousness, the statistics-gathering abilities of Windows really allow you to do some powerful troubleshooting cheaply. Let's look at two quick examples.
Scenario 1: Suppose we have an application that's running very slowly for a given person. Other folks are running the application just fine, but our confidence level is pretty low concerning how alike each workstation is. We discover that the user can log in to someone else's workstation and work just fine.
Somebody suggests that we nuke the user's hard drive and make it just like a setup that works. This is greeted with stony silence by the user, who happens to be a big shot in the company. Apparently, this is not the path that we will take. At least, not if we want to continue our employment.
The user says that his machine runs like molasses, and he seems right from what we can tell of other people's workstations. What's going on? Is it the client or the server? Is this subjective slowness, or is there an objective way of measuring this?
We could break out the network analyzer to measure the network throughput by attaching it to the network segment and analyzing packets between the client and server. However, because network throughput is easy to measure with the Windows 9x family's System Monitor, we install it on the user's workstation and add the Microsoft Network client's Bytes Read/Second statistic to our chart. Figure 12.5 shows a monitor session; it reveals that the Microsoft file and print session is running reasonably fast on this user's 10Mbps network: 625KB to 991KB per second. You can't expect much faster than that on a 10Mbps Ethernet network. But would an upgrade to 100Mbps help? Well, it's an older workstation, and you can tell that it is working hard to keep up (look at the CPU utilization and free memory). Both drop when this file transfer is happening. In this example, this is the kind of normal output we would see when running the application against a different server.
Figure 12.5 A Windows 9x monitor session.
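To put those numbers in perspective, here is a minimal Python sketch of the arithmetic behind "you can't expect much faster than that." It assumes 1KB means 1,000 bytes and ignores protocol overhead, so treat the percentages as rough; the function names are made up for illustration.

```python
def link_capacity_bytes_per_sec(link_mbps: float) -> float:
    """Theoretical ceiling of a link in bytes per second (no protocol overhead)."""
    return link_mbps * 1_000_000 / 8


def utilization_percent(observed_bytes_per_sec: float, link_mbps: float) -> float:
    """Observed throughput as a percentage of the theoretical ceiling."""
    return 100.0 * observed_bytes_per_sec / link_capacity_bytes_per_sec(link_mbps)


if __name__ == "__main__":
    # The two ends of the range read off the System Monitor chart.
    for kb in (625, 991):
        pct = utilization_percent(kb * 1000, 10)
        print(f"{kb}KB/s on a 10Mbps link is about {pct:.0f}% of the ceiling")
```

By this reckoning, the session is using roughly half to four-fifths of what a 10Mbps wire can carry in theory, which is about as good as shared Ethernet of that era tends to get.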
We keep the System Monitor minimized and run the problem application. Yes, indeed, it's slow. We're seeing wide-area speeds on a LAN connection. Right away we know that there's truly a speed problem on the network; it has nothing to do with his workstation.
Or does it? We have a different server that runs another department's stuff, so in the spirit of ruling in or ruling out the problem, we decide to do the same measurements on a different server using the same application. In this case, the measurements are more in line with what we're seeing on other workstations on the LAN. It seems that this particular server and this person's workstation are not happy with each other. The five minutes we spent with the System Monitor accomplished a couple of important things:
This person is not imagining things; we have objective data to prove it.
We've discovered that the workstation can connect to a different server without the slowdown problem; we now have a workaround while we further diagnose the problem server or workstation.
But life isn't always simple. It turns out that the original server is not a Microsoft server; it's a UNIX server running Samba, the UNIX/Linux SMB server. Is this the same type of server as the test server we just used? Well, no. So, we do the tests again, this time to a different UNIX server running the same version of Samba. Ah hah! It turns out that this user's workstation seems to have problems talking to this version of Samba. (Others in the department are not having a problem.)
The solution to this problem consisted of (very carefully) reinstalling the Microsoft file and print client on the workstation and repatching the workstation, consistent with other workstations in the department. (Yes, indeed, because we were dealing with a VIP, a backup of his hard drive was done, just in case more problems got caused through this process.) After the application of the client and the service pack, we tested again with the System Monitor. The problem was gone, and the VIP was happy.
Scenario 2: Many, but not all, users in a department were complaining about slow application performance. These happened to be users of Windows 2000, so I used the Windows 2000 Control Panel applet (Administrative Tools, Performance) to take some measurements. The first statistic I added was "Redirector: Bytes Received per second," followed by "Redirector: Bytes Transmitted per second." (The redirector refers to the client portion of SMB that redirects server resources to look like local resources, such as drive letters.)
Unlike the first scenario, this network was 100Mbps to the desktop, with a very, very fast server at the other end. There was no reason at all for many people to be complaining, so I thought I'd take a look from my office.
The server itself was connected to a SAN, had huge amounts of memory, and so forth, so I became concerned that any file transfers from that server would be limited by my hard drive. So, I mapped my "O" drive to the server, identified a very large file to use as a test file, and then dropped to a command prompt and typed:
COPY O:BIGFILE NUL:
The NUL: parameter allows you to say, "Don't actually copy the file to my hard drive, just throw it in the bit-bucket," enabling you to avoid your hard drive as a limiting factor.
Because BIGFILE was huge (it was a SQL server database backup, and was about 10GB), I had plenty of time to switch over to the Performance applet and view how fast, on average, the file was transferring to my workstation. As Figure 12.6 shows, it was averaging around 6.1 megabytes per second, which was not bad.
Figure 12.6 Reasonable performance on a 100Mbps network.
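If you'd rather get a number than eyeball a chart, the same "read it and throw it away" trick can be sketched in a few lines of Python. This is only an illustration of the idea, not a replacement for the Performance applet: the function name is made up, and it times the whole read rather than sampling it second by second.

```python
import time


def measure_read_throughput(path: str, chunk_size: int = 1024 * 1024) -> float:
    """Read `path` start to finish, discard the data, and return bytes per second."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)  # count the bytes, then let them go
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very small files.
    return total / elapsed if elapsed > 0 else float("inf")
```

Pointing it at the big file on the mapped drive and dividing the result by 1,000,000 gives megabytes per second. Keep in mind that a second run against the same file may be flattered by caching, so use a file large enough that caching can't hide the network.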
I went down to the wire closet used by the folks who were having problems, using the same laptop that I had used in my office, hooked it into the Ethernet port of one of the problem workstations, and ran my tests again.
Whoa! As Figure 12.7 shows, there was quite a difference. We went down from 6.1MB per second to about a third of a megabyte per second. Also, see how the indicator is jerking up and down? There's no steady stream as you might expect.
Figure 12.7 Poor performance, given a nominally fast server on a 100Mbps network.
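The "jerking up and down" can be put into numbers, too. Here is a small sketch, assuming you have jotted down a handful of per-second throughput readings; it reports the average and how much the readings bounce around relative to that average. The sample values are invented for illustration and are not taken from the figures.

```python
from statistics import mean, pstdev


def summarize(samples_bytes_per_sec):
    """Return (average, relative spread) for a list of throughput readings."""
    avg = mean(samples_bytes_per_sec)
    spread = pstdev(samples_bytes_per_sec)
    # Relative spread near 0 means a steady stream; larger means jerky.
    return avg, (spread / avg if avg else 0.0)


if __name__ == "__main__":
    # Hypothetical readings: a steady transfer and a jerky one.
    steady = [6.0e6, 6.1e6, 6.2e6, 6.1e6]
    jerky = [0.1e6, 0.6e6, 0.2e6, 0.5e6]
    for name, samples in (("steady", steady), ("jerky", jerky)):
        avg, rel = summarize(samples)
        print(f"{name}: average {avg / 1e6:.2f}MB/s, relative spread {rel:.2f}")
```

A low average with a high relative spread is the pattern to watch for: the link isn't merely slow, it's stalling and recovering.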
So what was the deal? It turned out that the closet switch hadn't been properly configured for our environment; it was left at the default (that is, autosense) instead of being explicitly set to half duplex, 100Mbps. As discussed in Hour 10, "Ethernet and Switching," autosensing can cause all sorts of problems, including this type of throughput problem. But, thanks to the Performance applet, it got tracked down pretty quickly.
You can track a huge number of stats with the Performance applet; tracking redirector transmission speeds is just the beginning. You'll definitely want to get familiar with the Performance applet before trouble strikes.
Process, File, and Registry Tracking
Although you can use the Performance applet to track things like processor utilization, it won't tell you which process is using what proportion of processor resources. Fear not, though, the Windows NT family does in fact keep track of such things, just like a real operating system.
The simplest way to check out per-process utilization is to bring up the Windows Task Manager (Ctrl+Shift+Esc is a shortcut key for this) and to sort the columns by CPU time; this is the total amount of CPU time used by a process since it started. On a healthy computer, the System Idle Process should have the highest number. (One other interesting thing to do is to sort by Mem Usage, letting you know which processes are hogging memory.)
Of course, there are other, more buff ways to track processes. SysInternals (http://www.sysinternals.com) provides one freeware program called Process Explorer, which, put simply, rocks the house. Process Explorer (ProcExp.Exe) not only shows various resources used by processes, but it also sports a tree view, showing which processes started which others. It also identifies which files a given process has open, as shown in Figure 12.8. If you've ever been frustrated by "file in use" errors, kiss them goodbye once you download Process Explorer. Sweet!
Figure 12.8 Process Explorer shows a tree view of the process table, as well as additional process info, notably files open.
Similarly, if you've ever asked, "Which actual process owns that socket pair that I'm seeing in the netstat -a output?" check out TCPView. It identifies all currently used socket pairs (UDP and TCP) and maps them to processes. This can be hugely useful; I once saw a bug in a program that kept infinitely opening sockets, and basically created a denial of service (DoS) because it was opening them so fast that other programs couldn't open sockets. It's pretty good to be able to track this sort of thing and kill the offending process rather than reboot and pray.
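For a quick-and-dirty version of the same idea, versions of Windows whose netstat supports the -o switch (Windows XP and later) will print the owning process ID in the last column of netstat -ano. The sketch below, in Python, counts sockets per PID from that kind of output, which is just what you'd want when hunting a runaway socket opener. The sample text is made up for illustration, not captured from a real machine, and the function name is hypothetical.

```python
from collections import Counter


def sockets_per_pid(netstat_output: str) -> Counter:
    """Count TCP and UDP endpoints per owning PID in `netstat -ano` style text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with the protocol and end with the PID.
        if len(fields) >= 4 and fields[0] in ("TCP", "UDP") and fields[-1].isdigit():
            counts[int(fields[-1])] += 1
    return counts


# Illustrative sample only; addresses and PIDs are invented.
SAMPLE = """\
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       888
  TCP    10.0.0.5:1031          10.0.0.9:139           ESTABLISHED     1420
  TCP    10.0.0.5:1032          10.0.0.9:139           ESTABLISHED     1420
  TCP    10.0.0.5:1033          10.0.0.9:139           ESTABLISHED     1420
  UDP    0.0.0.0:500            *:*                                    4
"""

if __name__ == "__main__":
    # Busiest PIDs first.
    for pid, n in sockets_per_pid(SAMPLE).most_common():
        print(f"PID {pid}: {n} sockets")
```

A PID whose count keeps climbing between runs is your prime suspect; look it up in Task Manager or Process Explorer before you kill it.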
SysInternals also provides two other tools that no self-respecting system administrator should be without: FileMon and RegMon. Both utilities work on the Windows 9x and the Windows NT families.
FileMon, as it sounds, tracks all file access on a Windows system. Depending on your troubleshooting scenario, this can be a huge help. I can't even count the number of times I've been hit with some weird application error, "can't access file," and I had no idea what file, what error, or what the deal was. By reproducing the error in a small time frame on an otherwise idle system, FileMon quickly identified which file this was. Why not use ProcExp? Well, ProcExp only shows files that are currently open by processes, not open attempts.
FileMon actually shows failed attempts as well, which is awesome if you are a troubleshooter. (You might not know which application file you need to restore from backup, and this tells you.) Figure 12.9 shows FileMon tracking a Notepad session that is attempting to load a nonexistent file called Ouch.txt. As you can see, FileMon fingers both the nature of the error and the name of the missing file.
Figure 12.9 FileMon and RegMon show current activity on the file system and registry, respectively.
RegMon looks and works similarly to FileMon, but it deals with the registry instead of the Windows file system. Both tools are extremely powerful for dealing with corruption or security issues.
Nonbootable OS Recovery Methods
There are two ways in which you might not be able to access your Windows NT family machine: First, you might not be able to log on; second, you might not be able to boot into the OS at all.
In the good old days, you might boot an operating system diskette, fix the problem on the hard drive, and then reboot the hard drive. Well, there are a couple of problems with that nowadays. First, the Windows NT family doesn't offer a bootable OS diskette. Second, booting to a different operating system (such as DOS) presents a problem: The NTFS (NT File System) isn't readable by DOS, so restoring files that way is a bit of a problem. Besides, you can't exactly make registry tweaks here.
Microsoft has a couple of recovery methods that it recommends. First is the "Last Known Good" boot prompt. This has its uses, but I can tell you from experience that this does not always work. "Last Known Good" means, "restore the registry to the last known good point," where "last known good point" means "after Windows goes graphical." If Windows goes graphical, and then you have your problem, you're sort of out of luck.
For example, one time I ran the viewer application for a certain type of remote control package, and it randomly decided to modify my Windows 2000 workstation and install remote control. When I rebooted, I couldn't log in because it was complaining that it couldn't find a certain DLL. Well, sure, I hadn't installed the remote control piece. I tried using "Last Known Good" to no avail. How about the Emergency Recovery Diskette (ERD)? Well, I hadn't updated my workstation's ERD recently, so that was out.
There were a few methods I could have tried, in order, from most annoying and complex to least. First, I could have installed another copy of Windows 2000 in a different directory on the hard drive, and then manually copied the required file to the dead installation. Second, I could have moved my hard drive to another Windows 2000 workstation, and fixed it there. Third, I could have shared my hard drive from another Windows machine, and copied the remote control DLL to the proper place.
Clearly, the third option was, in this case, most preferable. But the first and second would have worked if my workstation was so dinked up that I could not share the drive from another machine. This happens with servers more often than we'd like to think.
If you find yourself doing a lot of recovery work, I'd like to recommend that you invest in some more utilities, produced by the same folks who write the SysInternals ones. They are
NTFSDOS Professional: Allows you to read and write files from an NTFS hard drive after booting from a DOS diskette.
Remote Recover: Allows you to read and write from a dead server over the LAN, by providing a LAN boot diskette for the server.