This chapter is from the book

Hardware Troubleshooting

For the most part you will probably spend your time troubleshooting host or network issues. After all, a hardware failure is usually obvious: a hard drive will crash completely, and a failed CPU will likely take the entire system down. There are, however, a few circumstances in which hardware doesn’t fail completely and instead causes random, strange behavior. Here I will describe how to test a few hardware components for errors.

Network Card Errors

When a network card starts to fail, it can be rather unnerving, because you will try all sorts of network troubleshooting steps to no avail. Often when a network card, or some other network component to which your host is connected, starts to fail, you can see it in packet errors on your system. The ifconfig command we used for network troubleshooting before can also tell you about TX (transmit) or RX (receive) errors for a card. Here’s an example from a healthy card:

$ sudo ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:17:42:1f:18:be  
          inet addr:10.1.1.7  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::217:42ff:fe1f:18be/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:229 (229.0 B)  TX bytes:2178 (2.1 KB)
          Interrupt:10 

The lines you are most interested in are

RX packets:1 errors:0 dropped:0 overruns:0 frame:0
TX packets:11 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000 

These lines will tell you about any errors on the device. If you start to see lots of errors here, then it’s worth troubleshooting your physical network components. It’s possible a network card, cable, or switch port is going bad.
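
On Linux, the kernel also exposes these counters as plain files under /sys/class/net/<interface>/statistics, which makes them easy to read from a script. What follows is only a minimal sketch under that assumption. The function name nic_error_counts and its optional second argument (an alternate base directory, which is handy for testing) are hypothetical and not part of any standard tool.

```shell
#!/bin/sh
# Print error-related counters for one network interface.
# nic_error_counts is a hypothetical helper name; the optional second
# argument (base directory) is an assumption added to ease testing.
nic_error_counts() {
    iface=$1
    base=${2:-/sys/class/net}
    stats="$base/$iface/statistics"

    if [ ! -d "$stats" ]; then
        echo "no statistics found for $iface" >&2
        return 1
    fi

    for counter in rx_errors tx_errors rx_dropped tx_dropped collisions; do
        # Each file holds a single number maintained by the kernel.
        if [ -r "$stats/$counter" ]; then
            printf '%s=%s\n' "$counter" "$(cat "$stats/$counter")"
        fi
    done
}

# Example: nic_error_counts eth0
```

Running the function twice, a few minutes apart, shows whether the counters are still climbing, which is usually more telling than a single reading.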

Test Hard Drives

Of all of the hardware on your system, your hard drives are the components most likely to fail. Most hard drives these days support SMART, a system that can predict when a hard drive failure is imminent. To test your drives, first install the smartmontools package (sudo apt-get install smartmontools). Next, to test a particular drive’s health, pass the smartctl tool the -H option along with the device to scan. Here’s an example from a healthy drive:

$ sudo smartctl -H /dev/sda
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

SMART Health Status: OK
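
If you have several drives, it can be convenient to script this check rather than read each report by eye. The sketch below relies on the exit status described in the smartctl man page, where bit 3 (value 8) is set when the SMART status check returns "DISK FAILING". The function names smart_failing and check_drive and the SMARTCTL override variable are hypothetical, and the /dev/sd? glob in the example is an assumption about how your drives are named.

```shell
#!/bin/sh
# smart_failing: succeed if a smartctl exit status has bit 3 (value 8)
# set, which the smartctl man page documents as "DISK FAILING".
smart_failing() {
    [ $(( $1 & 8 )) -ne 0 ]
}

# check_drive: run the health check on one device and report the result.
# SMARTCTL is a hypothetical override for the command to run.
check_drive() {
    cmd=${SMARTCTL:-smartctl}
    if ! command -v "$cmd" >/dev/null 2>&1; then
        echo "$cmd not found" >&2
        return 2
    fi

    status=0
    "$cmd" -H "$1" >/dev/null 2>&1 || status=$?

    if smart_failing "$status"; then
        echo "$1: FAILING"
        return 1
    elif [ "$status" -ne 0 ]; then
        # Other bits mean other problems (for example, the device could
        # not be opened); the man page lists them.
        echo "$1: smartctl exit status $status, see the man page"
        return 2
    fi
    echo "$1: OK"
}

# Example (run as root; the glob is an assumption about device names):
# for dev in /dev/sd?; do check_drive "$dev"; done
```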

Checking a drive by hand is useful when a particular drive is suspect, but generally speaking, it would be nice to have your drives’ health monitored constantly and any problems reported to you. The smartmontools package is already set up for this purpose. All you need to do is open the /etc/default/smartmontools file in a text editor and uncomment the line that says

#start_smartd=yes

so that it looks like

start_smartd=yes

Then the next time the system reboots, smartd will launch automatically. Any errors will be e-mailed to the root user on the system. If you want to manually start the service, you can type sudo service smartmontools start or sudo /etc/init.d/smartmontools start.
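
If you manage more than a few servers, you may prefer to script the edit to /etc/default/smartmontools instead of opening an editor on each one. This is a minimal sketch, assuming the file contains the commented line as shown above; enable_smartd is a hypothetical function name, and it takes the path of the file to change as its only argument.

```shell
#!/bin/sh
# enable_smartd: uncomment "start_smartd=yes" in the given defaults file.
# The function name is hypothetical; it assumes the line appears as
# "#start_smartd=yes", as shown in the text.
enable_smartd() {
    file=$1
    tmp=$(mktemp) || return 1
    rc=0
    # Strip the leading "#" from the start_smartd line only, then write
    # the result back with cat so the file keeps its owner and mode.
    sed 's/^#[[:space:]]*start_smartd=yes/start_smartd=yes/' "$file" > "$tmp" &&
        cat "$tmp" > "$file" || rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Example (needs root to write under /etc):
# enable_smartd /etc/default/smartmontools
```

The function leaves every other line untouched, and running it a second time changes nothing.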

Test RAM

Some of the most irritating types of errors to troubleshoot are those caused by bad RAM. Often errors in RAM cause random mayhem on your machine, with programs crashing for no good reason or even random kernel panics. Ubuntu ships with an easy-to-use RAM testing tool called Memtest86+ that is not only installed by default but also available as a boot option. At boot time, hit the Esc key to see the full boot menu. One of the options in the GRUB menu will be labeled Memtest86+. Select that option and Memtest86+ will immediately launch and start scanning your RAM, as shown in Figure 11-1.

Figure 11-1 Memtest86+ RAM scan

Memtest86+ runs through a number of exhaustive tests that can identify different types of RAM errors. On the top right-hand side you can see which test is currently being run along with its progress, and in the Pass field you can see how far along you are with the complete test. A thorough memory test can take hours to run, and I know some administrators with questionable RAM who let the test run overnight, or over multiple days if necessary, to get through more than one complete pass. If Memtest86+ does find any errors, they will be reported in the results output at the bottom of the screen.
