Net Therapy 101: Techniques for Using Your Analyzer
Each analyzer is different; choosing your weapon appropriately is one of the first steps toward success with a protocol analyzer. If you have an old and crusty FDDI network, for example, do not use an Ethernet-specific analyzer that "happens" to work with FDDI as well: use an analyzer made for FDDI.
Your scenario always dictates which tool you need. There's more than one tool out there because there's more than one problem out there. Because you can't buy all the tools available, it pays to know your network environment thoroughly before you invest so that you can buy the most appropriate tools for you.
Filtering: Cold-Filtered, Ice-Brewed Packets
As I mentioned earlier, knowing how and when to filter your capture data is one of the most important skills you can have when using a protocol analyzer to capture network traffic. Otherwise, you'll likely be searching for a very small needle in a very large haystack! Even a veteran packet analyst would get discouraged without filtered data.
Several types of filters are available both during capture and display:
MAC address filtering.
Network protocol filters: TCP/IP, IPX/SPX, AppleTalk...
Service protocol filters: Either filter by service or by service attribute (for example, DNS queries only, but not other DNS traffic).
Generic filters: Hexadecimal values within a packet.
Station filters: Usually based on some attribute that is at least temporarily assigned to a network name. Use with care because these can be quite dynamic. For example, say that I choose "JFPC" as a NetBIOS station to filter on. Ultimately, this is based on a pairing between network protocol and service naming, and if the "name discovery" is run earlier than the trace, you might find that you are capturing data for the wrong station (bearing in mind that address assignment protocols like DHCP do not guarantee that you receive the same address day after day).
Not every kind of filter is available on all analyzers; for instance, some analyzers won't filter every kind of service, at least by name. You can get around this by using a generic filter based on some portion of the decode window.
Let's look at how to make one analyzer filter by TCP port number. For example, WildPackets' EtherPeek doesn't specifically have a filter for Gnutella sessions, but it does let you build a filter by right-clicking in the decode window. In our case, we're interested in a Gnutella session, TCP port 6346, which translates to hexadecimal 18CA (see Figure 21.5).
Figure 21.5 Typically, packet analyzers let you filter on just about any decode field; for example, when using WildPackets' EtherPeek, simply right-click and choose "Make Filter."
Notice how two bytes (18 CA) are highlighted; these are the hex values in the packet that identify it as TCP port 6346. Right-clicking will bring up a menu with a choice to apply a filter based on these values. You can apply this to other fields in the decode as well. Very cool!
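If you want to double-check the port-to-hex arithmetic yourself, here is a minimal Python sketch. It has nothing to do with EtherPeek itself; the function names are made up for illustration, and it assumes you have the raw TCP header bytes in hand (the destination port sits in bytes 2 and 3 of a TCP header, in network byte order).

```python
import struct

def port_to_hex(port: int) -> str:
    """Return a TCP port as two hex bytes in network (big-endian) order."""
    raw = struct.pack("!H", port)          # "!H" = unsigned 16-bit, big-endian
    return " ".join(f"{b:02X}" for b in raw)

def dest_port_matches(tcp_header: bytes, port: int) -> bool:
    """Generic-filter style test: compare bytes 2-3 of a raw TCP header
    (the destination port field) with the port we care about."""
    return tcp_header[2:4] == struct.pack("!H", port)

print(port_to_hex(6346))  # 18 CA
```

A generic filter works the same way: it does not know what Gnutella is; it simply compares the bytes at a fixed position against the values you give it.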
Here are the two ways an analyzer can filter:
Precapture: This is useful when you don't want your buffer to overflow with needless data.
Postcapture: This is good for when you've already captured the general data in question and want to refine your search.
Why would you not filter while capturing? After all, we want to avoid the needle-in-a-haystack problem, right? Well, if it's possible to take a small trace by sticking to a very small time window, sometimes the additional information in that trace can be useful. Or, if you know that the problem lies in one protocol family, but not another, sometimes it's possible to only capture one protocol family. This can be pretty illuminating because a service can fail based on lower-level protocols failing. Read on.
Limited Filtering Scenario
Here's a case in point for how limited filtering can help in troubleshooting a busy server.
I once had to deal with a problem in which one server started to have trouble sending print jobs to a print server. The print server would all of a sudden, and at seemingly random times, generate socket errors in its log and stop processing. Only a restart of the server would make it start processing print jobs again. Our first question was, "Who changed something on the print server?" The answer was...nobody. Nothing had changed on the print server. No interrogation or torture was spared to verify this; we were absolutely certain (ha, ha) that nobody had changed anything in the time frame that we were talking about.
This was a really tough problem to troubleshoot: A search on the print server support site for the particular error message revealed nothing, and the problem was still popping up intermittently. We needed an answer relatively quickly because this print gateway was responsible for processing print for a time-sensitive function. Because we were relatively certain that nothing had changed on either the print server or the application server (in fact, the app server was printing fine to other print servers), we decided to see what was happening on the network. Maybe some errant evil packet was causing the print server some mental illness.
We connected a protocol analyzer to the print server's segment (because we suspected something bad was happening to the server) and considered what we wanted to filter on:
Because we knew something was happening to the print server, we would only capture packets destined for the print server's MAC address.
Because we knew that this was a very busy file and print server, it wasn't feasible to capture all packets destined for this server.
Because we knew that the problem was with Unix type printing (LPD), we would only accept TCP/IP packets. This eliminated most of the packets destined for this server; the file services didn't use TCP/IP, but used a different protocol. This left us with a test setup that looked something like what's shown in Figure 21.6.
Figure 21.6 The test setup for a tough app server/print server printing problem.
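The capture filter described above (destination MAC plus "TCP/IP only") can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular analyzer implements it: the function names are hypothetical, frames are assumed to be plain Ethernet II with no VLAN tag, and "TCP/IP" is taken to mean the IPv4 and ARP EtherTypes (0x0800 and 0x0806).

```python
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806   # ARP travels with the TCP/IP family, so keep it too

def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' or 'aa-bb-cc-dd-ee-ff' to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", "").replace("-", ""))

def keep_frame(frame: bytes, server_mac: bytes) -> bool:
    """Precapture-style test on a raw Ethernet II frame.

    Bytes 0-5 are the destination MAC, bytes 12-13 the EtherType.
    VLAN-tagged and 802.3/LLC frames are not handled in this sketch.
    """
    if len(frame) < 14:
        return False
    (ethertype,) = struct.unpack("!H", frame[12:14])
    return frame[0:6] == server_mac and ethertype in (ETHERTYPE_IPV4, ETHERTYPE_ARP)
```

Everything else addressed to the server, such as file service traffic on another protocol, fails the EtherType test and never reaches the buffer.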
As soon as the problem occurred again, we looked at the packet capture. There are two important concepts here: First, we ran the analyzer until we received the report, and then stopped the analyzer immediately. Second, we synchronized the clock on the protocol analyzer to the network time before we started capturing, and we asked the user who reported the problem to also report the time of the problem. Because this was a pretty busy print service, we were sure that the problem report was within plus or minus two or three minutes, so we now only had to consider packets around the time of the report (plus and minus a minute), thus limiting how much junk we had to wade through.
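Narrowing a trace to the time of the report is easy to picture in code. The following Python sketch assumes each captured packet is a (timestamp, summary) pair and that the analyzer clock matches the clock the user read; the function name and the default margin are my own choices, not anything an analyzer dictates.

```python
from datetime import datetime, timedelta

def packets_near(packets, reported_at, margin=timedelta(minutes=3)):
    """Keep only packets whose timestamp falls within `margin` of the report.

    `packets` is an iterable of (timestamp, summary) tuples.
    """
    return [p for p in packets if abs(p[0] - reported_at) <= margin]

# Example with made-up data: only the middle packet survives.
report = datetime(2001, 5, 1, 14, 30)
trace = [
    (datetime(2001, 5, 1, 14, 10), "LPD data"),
    (datetime(2001, 5, 1, 14, 29), "ARP"),
    (datetime(2001, 5, 1, 14, 50), "LPD data"),
]
nearby = packets_near(trace, report)
```

This only works if the clocks agree, which is exactly why synchronizing the analyzer to network time comes first.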
Skipping to the end of the trace, we first filtered on the LPD port number, TCP/515. We did see a problem: The server stopped responding to the LPD requests from the UNIX host at the end. Well, we knew that without taking a trace. Doh!
Still, this was useful: It let us know where in the packet list the problem occurred. Therefore, we got rid of the LPD filter, jumped to the packet where the problem occurred, and looked at the packets right before the problem.
Apparently, right before the problem occurred, there was an ARP request (ARP is TCP/IP's Address Resolution Protocol). Remember, each TCP/IP address must have a corresponding MAC address in order for two network cards to talk. The reply to the ARP request I saw carried the wrong MAC address. An ARP packet with the wrong MAC address typically means that someone else is using the same TCP/IP address as you, thus interrupting communications, but that was not the case here.
We tried to find the MAC address reported by the ARP request, but there was no such network card on our network. Not only that, but I couldn't find the OUI of the MAC address in my OUI table, which was also suspicious. Furthermore, this was a network where only one or two well-known vendors' cards were in use.
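The OUI check is simple enough to sketch in Python. The OUI is the first three bytes of a MAC address, which identify the card's manufacturer. The table below is entirely made up for illustration (it is not a real vendor list), and the function names are hypothetical; a real table would come from the registry your analyzer or documentation provides.

```python
def oui_of(mac: str) -> str:
    """Return the OUI (first three bytes) of a MAC address as 6 hex digits."""
    digits = mac.replace(":", "").replace("-", "").upper()
    if len(digits) != 12:
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    return digits[:6]

def vendor_for(mac: str, oui_table: dict):
    """Look up the vendor for a MAC address; None means 'not in my table'."""
    return oui_table.get(oui_of(mac))

# Hypothetical table: placeholder OUIs and vendor names, NOT real assignments.
MY_OUI_TABLE = {
    "AABBCC": "Vendor A (made up)",
    "DDEEFF": "Vendor B (made up)",
}

print(vendor_for("aa:bb:cc:01:02:03", MY_OUI_TABLE))  # Vendor A (made up)
print(vendor_for("12:34:56:01:02:03", MY_OUI_TABLE))  # None
```

On a network with only one or two vendors' cards in use, a lookup that comes back empty is worth a second look.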
Because there was no such device on the network, we next looked at the switch configuration. Because there was a MAC-level problem, we naturally suspected the switch. We asked the person responsible for switch configuration if anything had changed in the last couple of days, and, in fact, something had (so much for being "absolutely certain"). He therefore changed the configuration back to the way it used to be, and the problem went away. Tough problem solved!
Two things still bothered me, though. Why could I ping the print server at all if the ARP was incorrect? Well, because ARP is "redone" every couple of minutes, by the time I was on the scene troubleshooting, the ARP was correct again; therefore, I could ping the server without a problem. The switch was only sometimes messing up the ARP; usually, it was just fine. Second, why did a bad ARP hang the LPD server? That was a tougher question, and one I wasn't going to find the answer to, mostly because it didn't really matter. (Would the customer keep paying me to troubleshoot it further once the problem was fixed? Ah, no.)
For what it's worth, the print server program (and for that matter, the whole server in question) was somewhat old, and an interruption in the data stream was apparently driving it berserk. After the switch configuration was fixed and the ARP problem went away, everything was okay once more (and that, after all, is what's really important).
Decode View Options
When viewing specific packet traces, you'll want to explore your view options. Most analyzers have many options that allow you to be flexible about which attributes of the trace you're viewing at one time. Some of these attributes include the following:
Hexadecimal representation of packet
MAC and/or protocol and/or service decodes
Protocol or MAC address
Network name (DNS and NetBIOS)
Many protocol analyzers have a name-gathering feature; that is, they "read" the packets as they go by and see whether there's a name identifier in any of them. If there is, the analyzer will make an entry in its name table, which enables you to later specify a capture filter or view based on a network name. This, of course, is a much more "user friendly" way to specify a filter or view data.
Be aware that some analyzers do not capture names automatically; they offer it as a manual operation on data that you've already captured during the viewing portion of your analysis.
Be further aware that in a DHCP environment, viewing by station names can be dicey. What if you name-gather when MOSHEPC has IP 10.50.1.30, but later start analysis, capturing on MOSHEPC, not realizing that LEOPC now has IP 10.50.1.30? All of a sudden, you'll be capturing data for the wrong station! Be careful out there.
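One way to picture the stale-name hazard is a name table that remembers when each name was gathered. This Python sketch is purely illustrative: the table layout, the function name, and the one-hour cutoff are assumptions of mine, not a feature of any particular analyzer.

```python
from datetime import datetime, timedelta

def resolve_name(name_table, name, now, max_age=timedelta(hours=1)):
    """Return the IP gathered for `name`, refusing entries older than `max_age`.

    `name_table` maps a station name to (ip_address, gathered_at).
    """
    ip, gathered_at = name_table[name]
    if now - gathered_at > max_age:
        raise ValueError(f"{name} was gathered at {gathered_at}; re-gather names first")
    return ip

# Names gathered at 9:00 are fine at 9:30 but suspect the next morning.
names = {"MOSHEPC": ("10.50.1.30", datetime(2001, 5, 1, 9, 0))}
ip_now = resolve_name(names, "MOSHEPC", now=datetime(2001, 5, 1, 9, 30))
```

The safe habit is the same with or without code: gather names right before you filter on them.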
Even with a good analysis tool, your brain can only process so much input at one time; being able to specify view options lets you "keep it simple" so as not to overwhelm yourself with too much information. Accordingly, you can view strip charts that summarize certain aspects of your data, as shown in Figure 21.7, which divides network traffic by application.
Figure 21.7 Finisar Surveyor and other analyzers can graph "top talkers" and other statistics, thus helping you to interpret raw data.
You can also change your packet decode display options, in particular, how time and network names are displayed. Because a network is a timing-sensitive animal, the time-related options are particularly important. Your relative or interpacket time is important because it's the delay in between two packets. A value that looks way out of line with other packets indicates a delay caused by network glue such as routers or switches or, more likely, a delay caused by processing at the other end of the conversation (by a busy server, for example).
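Interpacket time is just subtraction, as this Python sketch shows. It assumes you have the packet timestamps in seconds; the function names and the "ten times the median" rule of thumb for "way out of line" are assumptions of mine, so adjust the factor to taste.

```python
import statistics

def interpacket_deltas(timestamps):
    """Delay between each packet and the one before it, in seconds."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

def suspicious_gaps(timestamps, factor=10.0):
    """Return (packet_index, delay) for delays far above the median delay."""
    deltas = interpacket_deltas(timestamps)
    if not deltas:
        return []
    typical = statistics.median(deltas)
    return [
        (i + 1, d)                      # i + 1 = index of the packet that arrived late
        for i, d in enumerate(deltas)
        if typical > 0 and d > factor * typical
    ]

# Made-up trace: packets every 10 ms, then a one-second stall before packet 4.
stamps = [0.0, 0.01, 0.02, 0.03, 1.03, 1.04]
gaps = suspicious_gaps(stamps)
```

Whether the stall came from a router, a switch, or a busy server is still for you to work out; the numbers only tell you where to look.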
High Level Packet Analysis: "Poor Man's" Statistical Analyzer
As helpful as capturing specific packets can be toward finding a solution to a specific problem, there are times when you'll want to run your analyzer "wide open" in order to get a general overview of your network segment. Some analyzers call this monitor mode; others simply bundle this mode into capture mode. Although this isn't a total substitute for statistical analysis (which we'll discuss in Hour 23), it can be helpful in identifying "big picture" issues on the network.
For example, when everybody on a given segment is complaining that they're running slowly, and you don't have any network management tools running, you might want to break out an analyzer that will statistically analyze the segment while it's capturing. The packet analyzer will likely keep a running total on several things:
Errors per station
Frames received per station
Frames transmitted per station
Total utilization of the network
Total errors on the network
On the slow segment, you might see that the total utilization of the network was running high, say, 65 percent. (Ethernet tends to degrade after 35 percent, so this is really high.) You would probably want to know why the utilization was high: Is it because of many users, all of whom are using a fair portion of the pipe, or a couple of users hogging up the pipe? A good way to find this out would be to sort your statistic list. For example, if you used WildPackets' EtherPeek, you might choose Node Statistics, choose All Sent, and sort by bytes, and you'd immediately find out that Fatboy seems to be a top transmitter on this pipe (see Figure 21.8). If that's what you expect (and in this case, it is because Fatboy is in fact a big fat server, and bunches of folks request stuff from him), then great. But if it's not, seeing this type of disproportionate utilization is an indicator to investigate this station further.
Figure 21.8 Packet analyzers such as EtherPeek can usually display a helpful list of network nodes sorted by various statistics, including byte count.
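The "sort by bytes" view boils down to a running total per station. Here is a rough Python sketch of that idea plus a back-of-the-envelope utilization figure. The function names are hypothetical, frames are assumed to be (source, length) pairs, the 10 Mbps link speed is an assumption, and the utilization math ignores preamble and interframe gap, so treat it as an estimate only.

```python
from collections import Counter

def bytes_sent_per_node(frames):
    """Total bytes sent per source, biggest talker first.

    `frames` is an iterable of (source_name, frame_length_in_bytes).
    """
    totals = Counter()
    for source, length in frames:
        totals[source] += length
    return totals.most_common()

def utilization_percent(total_bytes, seconds, link_bps=10_000_000):
    """Rough utilization: bits observed divided by bits the link could carry."""
    return 100.0 * total_bytes * 8 / (seconds * link_bps)

# Made-up sample: Fatboy dominates the segment.
sample = [("Fatboy", 1500), ("JFPC", 64), ("Fatboy", 1500), ("MOSHEPC", 512)]
talkers = bytes_sent_per_node(sample)
print(talkers[0])  # ('Fatboy', 3000)
```

For scale, 812,500 bytes seen in one second on a 10 Mbps segment works out to the 65 percent figure used in the example above.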
You might want to capture specific data from this station to find out just what type of traffic was being generated or, even quicker, check your documentation and make a phone call to determine what the user in question is doing. In this case, let's say that Fatboy is a workstation, and your phone call reveals that the Fatboy user was doing a backup of his hard drive to the network. You'd probably want to politely ask him to stop doing this during peak hours and suggest other methods for hard drive backup, such as a tape drive.
Just to make sure that the network is otherwise healthy, you can also sort your node list by errors. A couple of errors here or there is fine; you just want to make sure that there isn't one station that's jamming up the freeway by behaving badly.