Designing a Reliable Network Server
As businesses depend increasingly on computer networks, the reliability of the network and the network server becomes critical to the successful operation of the business. This is especially true of businesses that need to operate 24 hours a day, 7 days a week. The purpose of this article is to highlight the components and factors to consider when purchasing and configuring a network server to maximize its reliability.
Network server reliability is most commonly enhanced by providing redundancy of the components that might fail, and whose failure might cause the failure of the network server itself. This redundancy provides fault tolerance to the network server. The obvious goal is to have the network server never fail, thus making it available 100% of the time. This goal can never be achieved, but server availability of 99% or more can generally be attained. Table 1 shows how the amount of downtime in a year relates to the availability percentage.
Table 1 Availability Requirements and Allowable Downtime
Availability Requirement | Allowable Downtime per Year
100% | 0 minutes
99.999% | 5 minutes
99.99% | 53 minutes
99.95% | 4 hours and 23 minutes
99.9% | 8 hours and 46 minutes
99.5% | 43 hours and 48 minutes
99% | 87 hours and 36 minutes
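The arithmetic behind Table 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming a 365-day year (525,600 minutes) and rounding to the nearest minute; the function names are hypothetical and chosen only for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored (assumption)


def allowable_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year, in minutes, permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


def format_downtime(minutes: float) -> str:
    """Format minutes as whole hours and minutes, rounded to the nearest minute."""
    hours, mins = divmod(round(minutes), 60)
    if hours:
        return f"{hours} hours and {mins} minutes"
    return f"{mins} minutes"


# Print the allowable downtime for each availability level discussed above.
for target in (100, 99.999, 99.99, 99.95, 99.9, 99.5, 99):
    print(f"{target}% -> {format_downtime(allowable_downtime_minutes(target))}")
```

Working in minutes keeps the calculation simple, and the same function can be used to check an availability figure quoted by a vendor against the downtime a business can actually tolerate.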
In the following sections, I cover each of the main component subsystems.
Disk Subsystem Reliability
The network server component most prone to failure is the disk subsystem used to store the network server's data. The reason is that, other than the cooling fans, the disk drives that make up the disk subsystem are the only mechanical components in a network server.
Always select disk drives that support Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.). This technology monitors disk drive parameters that can predict the imminent failure of the disk drive. When S.M.A.R.T. reports that a drive is about to fail, the data on the disk drive can be backed up to tape and the disk drive replaced before it actually fails.
Most network server hardware vendors offer hot-swappable disk drives, which can be removed and replaced while the network server is operational. This makes the replacement of a failed disk drive a fairly easy operation that does not require a shutdown of the network server.
To provide fault tolerance for the disk subsystem on a network server, implement RAID (redundant array of inexpensive disks). RAID is available in several versions (or levels) with the most common implementations being RAID 1, which is also known as disk mirroring, and RAID 5, which is also known as striped set with parity. RAID is usually implemented by installing a RAID disk controller in the network server. (However, RAID also can be implemented in software by some network operating systems, such as NetWare, Windows NT, and Windows 2000.) RAID controllers are generally available from the network server hardware vendor as well as third-party vendors.
RAID 1 (disk mirroring) duplicates data on two different disk drives. If one disk drive fails, the data is still available on the second disk drive. The drawback to RAID 1 is that half of the total disk space is consumed by the duplicate copy. For example, a RAID 1 system implemented using two 40GB disk drives has a storage capacity of only 40GB.
RAID 5 (striped set with parity) provides fault tolerance by adding "parity" information for each block of data as it is written to the disk drives in the RAID 5 array. If a single disk drive in the array fails, the parity information can be used to "regenerate" the missing data. A minimum of three disk drives is required to implement a RAID 5 array. In most implementations of RAID 5, the maximum number of disk drives in the RAID array is 32. The storage overhead for RAID 5 is 1/n, where n is the number of disk drives in the array. For example, a RAID 5 array implemented with three 40GB disk drives has a total size of 120GB. In this example, 1/n is 1/3, so 1/3 of the total (or 40GB) is "lost" to the storage of the parity information, leaving a storage capacity of 80GB.
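The parity idea can be illustrated with a short Python sketch. It assumes the parity block is the byte-by-byte XOR of the data blocks in a stripe, which is the usual approach, and it models a single stripe only. Real RAID 5 controllers also rotate the parity block across the drives, which this example does not attempt to show. The function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def raid5_usable_gb(drive_count: int, drive_size_gb: int) -> int:
    """Usable capacity: one drive's worth of space (1/n of the total) holds parity."""
    if drive_count < 3:
        raise ValueError("RAID 5 requires at least three disk drives")
    return (drive_count - 1) * drive_size_gb


# One stripe on a three-drive array: two data blocks plus one parity block.
data_a = b"NETWORK "
data_b = b"SERVER!!"
parity = xor_blocks([data_a, data_b])

# Simulate the loss of the drive holding data_b, then regenerate its
# contents from the surviving data block and the parity block.
regenerated = xor_blocks([data_a, parity])
print(regenerated == data_b)   # True
print(raid5_usable_gb(3, 40))  # 80
```

Because XOR is its own inverse, the same operation that produces the parity block also recovers any one missing block, which is why a single drive failure can be tolerated but a second one cannot.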
Both RAID 1 and RAID 5 can withstand the loss of a single disk drive without loss of data and without the network server ceasing to operate. However, neither RAID level can survive the loss of a second disk drive in the same array. This makes it critical that the failed disk drive be replaced as quickly as possible and that the RAID array be reestablished with the new disk drive. Many RAID implementations allow a "hot spare" disk drive to be defined to the RAID system. The hot spare drive is installed in the network server and is fully powered. However, no data is written to this drive until a disk drive that is part of a RAID 1 or a RAID 5 array fails. At the point of failure, the RAID system automatically rebuilds the array using the hot spare drive.
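The hot spare behavior can be modeled with a toy example. The sketch below is a hypothetical in-memory model of a RAID 1 pair with one hot spare. It is not a real controller or operating system API, and each "drive" is simply a Python dictionary mapping block numbers to data.

```python
class MirroredPair:
    """Toy RAID 1 array with an optional hot spare (illustrative model only)."""

    def __init__(self, spare_available: bool = True):
        # Each "drive" is a dict mapping block number -> data.
        self.members: list[dict[int, bytes]] = [{}, {}]
        self.spares: list[dict[int, bytes]] = [{}] if spare_available else []

    def write(self, block: int, data: bytes) -> None:
        """Write the same data to every active member (mirroring)."""
        for drive in self.members:
            drive[block] = data

    def read(self, block: int) -> bytes:
        """Read from the first surviving member."""
        return self.members[0][block]

    def fail_drive(self, index: int) -> bool:
        """Drop a failed member; return True if a spare restored redundancy."""
        del self.members[index]
        if not self.members:
            raise RuntimeError("both drives lost: the data cannot be recovered")
        if not self.spares:
            return False  # array keeps running, but without redundancy
        spare = self.spares.pop()
        spare.update(self.members[0])  # copy the surviving mirror onto the spare
        self.members.append(spare)
        return True


array = MirroredPair()
array.write(0, b"payroll")
print(array.fail_drive(0))  # True: the hot spare took over
print(array.read(0))        # b'payroll'
print(array.fail_drive(1))  # False: no spare left, redundancy is gone
```

After the second failure the array still serves data from its one remaining drive, but a third failure would raise an error, which mirrors why a failed drive should be replaced promptly even when a hot spare has taken over.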