Storage Networking Building Blocks
A storage network may be composed of a wide variety of hardware and software products. Software products invariably involve management of some kind: management of logical volumes, management of disk resources, management of tape subsystems and backup, and management of the network transport status. Hardware products include the storage end systems and their interfaces to the network as well as the switches and bridge products that provide connectivity. Storage network interfaces typically include legacy SCSI, Fibre Channel, and Gigabit Ethernet. Just as first-generation Fibre Channel SANs accommodated legacy SCSI disk and tape devices, IP-based storage networks must incorporate both Fibre Channel and legacy SCSI devices. The challenge for vendors is to make this complexity disappear, and find inventive ways to integrate new and old technologies with minimal user involvement.
The following sections discuss the major building blocks of storage networks and highlight the technical aspects that differentiate them.
3.1 Storage Networking Terminology
Storage networking end systems include storage devices and the interfaces that bring them into the network. RAID arrays, just a bunch of disks (JBODs), tape subsystems, optical storage systems, and host adapter cards installed in servers are all end systems. The interconnect products that shape these end systems into a coherent network are discussed in the following sections on Fibre Channel, Gigabit Ethernet, and NAS. New SAN products such as virtualization devices and SAN appliances that front-end storage may, depending on implementation, also be considered end systems from the standpoint of storage targets, or interconnects from the standpoint of hosts.
RAID is both a generic term for intelligent storage arrays and a set of methods for the placement of data on multiple disks. Depending on the methods used, RAID can both enhance storage performance and enable data integrity. Because logic is required to distribute and retrieve data safely from multiple disk resources, the RAID function is performed by an intelligent controller. The controller may be implemented in either hardware or software, although optimal performance is achieved in hardware, in the form of application-specific integrated circuits (ASICs) or dedicated microprocessors. A RAID array embeds the controller function in the array enclosure, with the controller standing between the external interface to the host and the internal configuration of disks. RAID arrays may include eight to ten internal disks for departmental applications or many more for data center requirements.
The performance problem that RAID solves stems from the ability of host systems to deliver data much faster than storage systems can absorb it. When a server is connected to a single disk drive, reads or writes of multiple data blocks are limited by the buffering capability, seek time, and rotation speed of the disk. While the disk is busy processing one or more blocks of data, the host must wait for acknowledgment before sending or receiving more. Throughput can be significantly increased by distributing the stream of data block traffic across several disks in an array, a technique called striping. In a write operation, for example, the host can avoid swamping the buffering capacity of any individual drive by subdividing the data blocks into several concurrent transfers sent to multiple targets. This simplified RAID is called level 0. If the total latency of an individual disk restricted its bandwidth to 10 to 15 MBps, then eight disks in an array could saturate a gigabit link that provided approximately 100 MBps of effective throughput.
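The striping idea and the link-saturation arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names, the three-disk layout, and the round-robin placement are assumptions made for the example, not a description of any particular controller.

```python
import math

def stripe_blocks(blocks, num_disks):
    """Distribute blocks round-robin across num_disks, RAID 0 style."""
    disks = [[] for _ in range(num_disks)]
    for index, block in enumerate(blocks):
        disks[index % num_disks].append(block)
    return disks

def disks_to_saturate(link_mbps, per_disk_mbps):
    """Smallest number of disks whose combined rate meets the link rate."""
    return math.ceil(link_mbps / per_disk_mbps)

# Six blocks striped over three disks (disk count chosen for illustration).
layout = stripe_blocks(["A", "B", "C", "D", "E", "F"], 3)
for disk_number, contents in enumerate(layout):
    print(f"disk {disk_number}: {contents}")

# Per-disk rates of 10 to 15 MBps against a link of roughly 100 MBps.
for rate in (10, 12.5, 15):
    print(f"{rate} MBps per disk: {disks_to_saturate(100, rate)} disks fill the link")
```

At the midpoint of the quoted range, 12.5 MBps per disk, eight disks account for the full 100 MBps, which matches the figure given in the text.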
Although it boosts performance, RAID 0 does not provide data integrity. If a single disk in the RAID set fails, data cannot be reconstructed from the survivors. Other RAID techniques address this problem by either writing parity data on each drive in the array or by dedicating a single drive for parity information. RAID 3, for example, writes byte-level parity to a dedicated drive, and RAID 4 writes block-level parity to a dedicated drive. In either case, the dedicated parity drive contains the information required to reconstruct data if a disk failure occurs. RAID 3 and RAID 4 are less commonly used as stand-alone solutions because the parity drive itself becomes a performance bottleneck. RAID 5 is the preferred method for striped data because it distributes block-level parity information across each drive in the array. If an individual disk fails, its data can be reconstructed from the parity information contained on the other drives, and parity operations are spread out among the disks.
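Parity-based reconstruction is commonly implemented with a byte-wise exclusive OR (XOR). The sketch below assumes XOR parity and equally sized blocks. The `parity_disk` rotation is one possible placement, chosen only to illustrate how RAID 5 spreads parity across drives; actual controllers may rotate parity differently.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def parity_disk(stripe_number, num_disks):
    """Illustrative rotation: parity moves to a different disk on each stripe."""
    return (num_disks - 1 - stripe_number) % num_disks

# One stripe of three data blocks plus its parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the disk that held data[1]; rebuild it from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])

# Parity location for the first five stripes of a four-disk set.
print([parity_disk(stripe, 4) for stripe in range(5)])
```

Because XOR is its own inverse, the same routine that computes parity also regenerates a missing block, which is why any single disk failure in the set is recoverable.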
The RAID striping algorithms range from simple to complex, and RAID 5 in particular requires considerably more controller logic. More logic implies more expense. RAID 5 controllers must not only provide the intelligence to distribute data and parity information across multiple drives, but must also be able to reconstruct data automatically in case of a disk failure. Typically, additional drives are provisioned in standby mode as hot spares, available for duty if a primary disk fails. The additional manipulation of data performed by RAID 5 also adds latency, and vendors of these products compete on the basis of performance and optimized controller logic.
Storage applications may also require full data redundancy. If, for example, the RAID controller failed or a break occurred along the cable plant, none of the RAID striping methods would ensure data availability. RAID level 1 achieves full data integrity by trading the performance advantage of striping for the integrity of data replication. This is accomplished by mirroring. In disk mirroring, every write operation to a primary disk is repeated on a secondary or mirrored disk. If the primary disk fails, a host can switch over to the backup disk. Mirroring on its own is subject to two performance hits: once for the latency of disk buffering, seek time, and spindle speed, and once for the additional logic required to write to two targets simultaneously. The input/output (I/O) does not complete until both writes are successful. Mirroring is also an expensive data integrity solution because the investment in storage capacity is doubled with every new installation. Mirroring, however, is the only solution that offers a near-absolute guarantee that data will be readily available in the event of disk failure. In addition, because data is written to two separate disk targets, the targets themselves may be separated by distance. Mirroring thus provides a ready solution for disaster recovery applications, provided that latency is sufficiently low and adequate bandwidth is available between primary and remote sites. Storage replication over distance also makes mirroring a prime application for IP-based storage solutions.
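The mirroring behavior described above (every write repeated on a second disk, completion only after both writes succeed, and failover on reads) can be modeled with a small sketch. The `Disk` and `MirroredPair` classes are hypothetical and hold data in memory. The two writes are issued one after the other for simplicity, whereas a real controller would typically issue them concurrently.

```python
class Disk:
    """In-memory stand-in for a disk target (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.failed = False

    def write(self, address, data):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        self.blocks[address] = data

    def read(self, address):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        return self.blocks[address]


class MirroredPair:
    """RAID 1 style pair: writes go to both disks, reads fail over."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, address, data):
        # The I/O is reported complete only after both copies are written.
        self.primary.write(address, data)
        self.secondary.write(address, data)
        return True

    def read(self, address):
        try:
            return self.primary.read(address)
        except IOError:
            # Primary is unavailable: switch over to the mirrored copy.
            return self.secondary.read(address)


pair = MirroredPair(Disk("primary"), Disk("secondary"))
pair.write(0, b"payroll record")
pair.primary.failed = True   # simulate loss of the primary disk
print(pair.read(0))          # data is still served from the mirror
```

Nothing in the model requires the two disks to be in the same enclosure, which is the property that makes mirroring attractive for remote replication.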
RAID implementations may be combined to provide both striping throughput and data redundancy via mirroring. RAID 0 + 1, for example, simply replicates a RAID 0 disk set to create two separate copies of data on two separate, high-performance arrays. Just as mirroring doubled the cost of the capacity of a single drive, RAID 0 + 1 doubles the cost of a RAID striping array. As shown in Figure 3-1, data blocks are striped over disks in each array, providing an exact copy of data that can be written or read at high speed. Some implementations may combine RAID 5 with RAID 1 for an even higher level of data integrity and availability.
Figure 3-1 RAID 0 + 1 striping plus mirroring. Blocks A, B, C, D, E, and F of a single data file.
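A minimal sketch of the RAID 0 + 1 layout follows, using the six blocks named in the figure caption. The choice of three disks per array and the helper names are assumptions for illustration; the figure itself may show a different disk count.

```python
def stripe(blocks, num_disks):
    """Round-robin striping of blocks over num_disks."""
    return [blocks[i::num_disks] for i in range(num_disks)]

def raid_0_plus_1(blocks, disks_per_array):
    """Stripe the blocks over one array, then replicate that array."""
    primary = stripe(blocks, disks_per_array)
    mirror = [list(disk) for disk in primary]   # exact copy on a second array
    return primary, mirror

primary, mirror = raid_0_plus_1(["A", "B", "C", "D", "E", "F"], 3)
print("primary array:", primary)
print("mirror array: ", mirror)

# Twice the disks are purchased for the same usable capacity.
total_disks = len(primary) + len(mirror)
print(f"{total_disks} disks hold the data of {len(primary)}")
```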
Because RAID implementations treat the drives in an array as a single resource, homogeneity of access is imposed on each individual drive. The performance of the disk set is thus determined by the lowest common denominator. The disk with the slowest access time will dictate the performance of all other drives. RAID-based storage arrays are thus typically populated with the same drive type per set, including unused hot spares that can be brought on-line if a drive fails. As newer, lower cost, and higher capacity drives are constantly introduced into the market, however, a customer may replace a failed disk with a much higher performing unit. The performance and capacity advantage of the newer unit will not be realized because the operational parameters of the original drive will throttle the new disk.
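The lowest-common-denominator effect can be expressed as a simple model in which every member of the set is treated as if it were the smallest and slowest drive present. The capacities and transfer rates below are invented for the example, and the model deliberately ignores controller overhead and parity.

```python
def effective_set(drives):
    """drives: list of (capacity_gb, rate_mbps) tuples for one RAID set.

    Returns (usable_capacity_gb, aggregate_rate_mbps) under the simplifying
    assumption that each member operates at the minimum of the set.
    """
    per_drive_capacity = min(capacity for capacity, _ in drives)
    per_drive_rate = min(rate for _, rate in drives)
    count = len(drives)
    return per_drive_capacity * count, per_drive_rate * count

original = [(36, 12)] * 4                 # four identical drives
upgraded = [(36, 12)] * 3 + [(73, 20)]    # one failed drive replaced by a newer unit

print(effective_set(original))
print(effective_set(upgraded))            # same result: the new drive is throttled
```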
The main components of a RAID subsystem include the interface (parallel SCSI, Fibre Channel, or Gigabit Ethernet), the RAID controller logic, the backplane that accommodates the disks, the disks themselves, and the enclosure, including power supplies and fans. At the high end of the RAID food chain, RAID arrays may provide redundant power supplies and fans, diagnostic and "phone home" features for servicing, and high-performance logic to optimize RAID operations. At the low end, RAID may be implemented via software on a host system with data striped to unrelated disks (JBOD). Software RAID places additional burdens on the server or host because CPU cycles must be devoted to striping across multiple targets. The RAID function may also be provided by an HBA or interface card, which off-loads the host CPU and manages striping or mirroring functions to disk targets. This is advantageous from the standpoint of the host, but, as with software RAID, places additional transactions on the storage network for the parity operations. Despite higher cost, a RAID subsystem offers optimal performance for both RAID functions and storage network traffic compared with software or adapter card implementations.
A RAID enclosure hides the access method between the RAID controller and the disks it manages. The external interface between the host systems and the array may be parallel SCSI, Fibre Channel, or Gigabit Ethernet. The internal interface between the RAID controller and disks may be parallel SCSI or Fibre Channel, and at the very low end of the spectrum, even Integrated Drive Electronics (IDE). This separation between the internal workings of the array and the external interface provides flexibility in designing low-, medium-, and high-end RAID systems that target different markets. It also facilitates the introduction of new external interfaces such as IP over Gigabit Ethernet, which, although not trivial from an engineering standpoint, at least do not require redesign of the entire subsystem, including the back-end disks.
In Figure 3-2, the basic architecture of RAID subsystems is shown with the variations of external interfaces and internal disk configurations. If Fibre Channel disks are used, the internal Fibre Channel disk interface may be based on a shared loop or switched fabric (discussed later).
Figure 3-2 Basic architecture of a RAID subsystem.
RAID systems are a powerful component of storage networks. They offer the performance and data integrity features required for mission-critical applications, and the flexibility (given sufficient budget) for resilient data replication and disaster recovery strategies. They also provide, depending on vendor implementation, the ability to scale to terabytes of data in a single, highly available storage resource. In a storage network based on Fibre Channel or Gigabit Ethernet, the resources of a RAID array may be shared by multiple servers, thus facilitating data availability and reduction of storage management costs through storage consolidation.
A JBOD is an enclosure with multiple disk drives installed in a common backplane. Unlike a RAID array, a JBOD has no front-end logic to manage the distribution of data over the disks. Instead, the disks are addressed individually, either as separate storage resources or as part of a host-based software or an adapter card RAID set. JBODs may be used for direct-attached storage based on parallel SCSI cabling, or on a storage network with, typically, a Fibre Channel interface.
The advantage of a JBOD is its lower cost vis-à-vis a RAID array, and the consolidation of multiple disks into a single enclosure with shared power supplies and fans. JBODs are often marketed for installation in 19-inch racks and thus provide an economical and space-saving means to deploy storage. As disk drives with ever-higher capacity are brought to market, it is possible to build JBOD configurations with hundreds of gigabytes of storage.
Because a JBOD has no intelligence and no independent interface to a storage network, the interface type of the individual drives determines the type of connectivity to the SAN. An IP-based storage network using Gigabit Ethernet as a transport would therefore require Gigabit Ethernet/IP interfaces on the individual JBOD disks, or a bridge device to translate between IP over Gigabit Ethernet and Fibre Channel or parallel SCSI. Over time, disk drive manufacturers will determine the type of interface required by the market.
As shown in Figure 3-3, a JBOD built with SCSI disks is an enclosed SCSI daisy chain and offers a parallel SCSI connection to the host. A JBOD built with Fibre Channel disks may provide one or two Fibre Channel interfaces to the host and internally is composed of shared loop segments. In either configuration, a central issue is the vulnerability of the JBOD to individual disk failure. Without the appropriate bypass capability, the failure of a single drive could disable the entire JBOD.
Figure 3-3 JBOD disk configurations.
Management of a JBOD enclosure is normally limited to simple power and fan status. In-band management may be provided by the SCSI Enclosure Services protocol, which can be used in both parallel SCSI and Fibre Channel environments. Some vendor offerings also allow the JBOD to be divided into separate groups of disks via a hardware switch or jumpers. As shown in Figure 3-4, a single Fibre Channel JBOD may appear as two separate resources to the host.
Figure 3-4 Dividing a Fibre Channel JBOD backplane into separate loops.
How the individual disk drives within a JBOD are used for data storage is determined by the host server or workstation, or by RAID intelligence on an HBA. Windows Disk Administrator, for example, can be used to create individual volumes from individual JBOD disks, or can assign groups of JBOD disks as a volume composed of a striped software RAID set. Software RAID will increase performance in reads and writes to the JBOD, but will also give exclusive ownership of the striped set to a single server. Without volume-sharing middleware, multiple servers cannot simultaneously manage the organization of striped data on a JBOD without data corruption. The symptom of unsanctioned sharing is the triggering of endless check disk sequences as a host struggles with unexpected reorganization of data on the disks. Generally, software RAID on JBODs offers higher performance and redundancy for dedicated server-to-storage relationships, but does not lend itself to server clustering or serverless tape backup across the SAN.
One means of leveraging JBODs for shared storage is the use of storage virtualization appliances that sit between the host systems and the JBOD targets. The virtualization appliance manages the placement of data to multiple JBODs or RAID arrays, while presenting the illusion of a single storage resource to each host. This makes it possible to dispense with software RAID on the host because this function is now assumed by the appliance. Essentially, storage virtualization fulfills the same function as an intelligent RAID controller, except that the virtualization appliance and storage arrays now sit in separate enclosures across the storage network. As with Dorothy in The Wizard of Oz, however, one really should pay attention to the little man behind the curtain. Although presenting a simplified view of storage resources to the host systems, storage virtualization appliances must necessarily assume the complexity of managing data placement and be able to recover automatically from failures or disruptions. This is not a trivial task.
3.1.3 Tape Subsystems
Data storage on media of different kinds divides into a hierarchy of cost per megabyte, performance, and capacity. Generally, higher cost per megabyte brings greater performance and, ironically, lower capacity. At the high end, solid-state storage offers the performance of memory access, but with limited capacity. Spinning media such as disk drives in RAID configurations can support gigabit access speeds, with the capacity of more than a terabyte of data. Tape subsystems may only support a fifth of that speed, but large libraries can store multiple terabytes of data. At the low-performance end of the scale, optical media libraries offer nearly unlimited storage capacity. The two most commonly used solutions, disk and tape, are deployed for a sequential storage strategy, with normal data transactions based on disk, followed by periodic backup of that data to tape. Hierarchical storage management applications may be used to rationalize this process and to determine the frequency of and appropriate migration of data from disk to tape, and sometimes from tape to optical storage.
Securing a backup copy of data as a safeguard against disk or system failure is a universal problem. No institution or enterprise is likely to survive a loss of mission-critical information. In addition, a company may be obliged to keep reliable copies of its data according to government or commercial regulations. Financial institutions, for example, must keep long-term records of their transactions, which may require both tape and optical storage archiving.
If spinning disk media could provide affordable, scalable, and highly reliable long-term storage, mirroring would suffice. Although this solution is strongly supported by disk manufacturers, mirroring for ever-increasing amounts of data has proved too costly to implement. Tape, by contrast, has proved to be economical, scalable, and highly reliable, and despite the advances made by disk technology, tape is likely to endure as the primary tool for the archival preservation of data.
In its pure SCSI incarnation, tape provides a secure copy of data, but its performance is constrained by the typical topology in which it is deployed. For optimal performance in traditional environments, a SCSI tape device can be attached to a server/storage SCSI daisy chain. In this configuration, the tape device (like the storage arrays) becomes captive to an individual server. Each server would thus require its own tape unit for backup, multiplying the cost of tape systems and management throughout the network. Alternatively, a dedicated tape backup server and SCSI-attached tape library can provide centralized backup over the LAN. This facilitates resource sharing, but places large block data transfer on the same LAN that is used for user traffic. The bandwidth of the LAN itself may create additional problems. The backup window, or period of time in which a nondisruptive backup could occur, may not be sufficient for the amount of data that requires duplication to tape.
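Whether a backup fits its window is a matter of simple arithmetic: the amount of data divided by the sustained throughput of the path to tape. The sketch below uses decimal units (1 GB = 1000 MB), and the data volume, the two throughput figures, and the eight-hour window are hypothetical values chosen only to show the calculation.

```python
def backup_hours(data_gb, throughput_mbps):
    """Hours needed to move data_gb gigabytes at throughput_mbps megabytes per second."""
    seconds = (data_gb * 1000) / throughput_mbps
    return seconds / 3600

def fits_window(data_gb, throughput_mbps, window_hours):
    """True if the backup completes within the available window."""
    return backup_hours(data_gb, throughput_mbps) <= window_hours

DATA_GB = 500        # hypothetical amount of data to protect
WINDOW_HOURS = 8     # hypothetical overnight backup window

# Hypothetical sustained rates for a shared LAN path and a gigabit storage path.
for label, rate in (("shared LAN", 10), ("gigabit storage network", 100)):
    hours = backup_hours(DATA_GB, rate)
    verdict = "fits" if fits_window(DATA_GB, rate, WINDOW_HOURS) else "does not fit"
    print(f"{label}: {hours:.1f} hours, {verdict} in a {WINDOW_HOURS}-hour window")
```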
The conflict between backup requirements and the constraints imposed by LAN-based backup is resolved by storage networking. By placing tape subsystems on a storage network, they become shared resources for multiple servers and can now move backup traffic independently of the LAN. This simultaneously reduces costs, simplifies administration, and provides greater bandwidth for backup streams. LAN-free backup on a storage network has also enabled new backup solutions. Even on a SAN, the backup traffic moves from disk storage to server and from server to tape. The server is in the backup path because it is responsible for reads from disk and for writes to tape. However, because servers, storage, and tape subsystems are now peers on the storage network, data paths can be enabled directly between disk storage and tape resources. Server-free backup is predicated on intelligent backup agents on the SAN that can perform the server's read/write functions for tape. A third-party copy (extended copy) agent may be embedded in the tape library, in the SAN switch, or in a SAN-to-SCSI bridge used to connect a SCSI tape library. Because the third-party copy agent assumes the task of reading from disk and writing to tape, server CPU cycles are freed for user transactions. LAN-free and server-free backup solutions for IP-based SANs are discussed further in Chapter 13.
The internal design of a tape subsystem is vendor specific, but typically includes an external interface for accessing the subsystem, controller logic for formatting data for tape placement, one or more tape drives that perform the write and read functions, robotics for manipulating tape cartridges and feeding the drives, and slots to hold the cartridges while not in use. Vendors may promote a variety of tape technologies, including advanced intelligent tape, linear tape-open, and digital linear tape, which are differentiated by performance and capacity.
The external interface to a tape library may be legacy SCSI, Fibre Channel, or Gigabit Ethernet. Although each tape drive within a library may only support 10 to 15 MBps of throughput, multiple drives can leverage the bandwidth provided by a gigabit interface. Theoretically this would allow multiple servers (or third-party copy agents) to back up simultaneously to a single library over the SAN, although the library controller must support this feature. Another potential performance enhancement is provided by the application of RAID striping algorithms to tape, or tape RAID. As with disk RAID, tape RAID implies additional complexity and logic, and therefore additional expense.
The initiative for IP-based SCSI solutions for tape was launched by SpectraLogic Corporation in the spring of 2001. Tape, like storage arrays, relies on block SCSI data for moving large volumes of data efficiently. SCSI over IP on Gigabit Ethernet infrastructures provides tape vendors with much greater flexibility in deploying their solutions. The Gigabit Ethernet network may be a dedicated SAN or a virtual segment (VLAN) of an enterprise network. Backups may thus occur wherever sufficient bandwidth has been allocated, and familiar IP and Ethernet management tools can be leveraged to monitor backup traffic. Because third-party copy is infrastructure neutral, serverless backup can also be used for IP-based tape subsystems.
3.1.4 SCSI Over IP-to-Parallel SCSI Bridges
Like first-generation Fibre Channel SANs, IP-based SANs must accommodate legacy devices, including SCSI disk arrays and SCSI tape subsystems. SCSI tape libraries, in particular, represent a substantial investment, and few information technology (IT) administrators have the luxury of discarding a valuable resource simply because interface technology has improved. The common denominator between the IP SAN and the legacy tape device is the SCSI protocol. The legacy tape device, however, supports parallel SCSI, or the SCSI-2 protocol. The IP SAN supports serial SCSI, or the SCSI-3 protocol. The function of a bridge is to translate between the two SCSI variants, and to make the SCSI-2 tape or storage subsystem appear to be a bona fide IP-addressable device.
An IP storage-to-SCSI bridge may provide multiple parallel SCSI ports to accommodate legacy units, and one or more Gigabit Ethernet ports to front the SAN. Just as a Fibre Channel-to-SCSI bridge must assign a Fibre Channel address to each legacy SCSI device, an IP storage-to-SCSI bridge must proxy IP addresses. The specific type of serial SCSI-3 supported by the Gigabit Ethernet ports on the bridge is vendor dependent, but may be iSCSI or iFCP.
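The address proxying performed by a bridge amounts to a lookup table from IP addresses to legacy devices. The sketch below is a hypothetical model, not any vendor's implementation: the `ScsiDevice` fields, the one-address-per-device policy, and the addresses (taken from the 192.0.2.0/24 documentation range) are all assumptions made for illustration.

```python
import ipaddress
from dataclasses import dataclass

@dataclass(frozen=True)
class ScsiDevice:
    port: int          # parallel SCSI port on the bridge
    target_id: int     # SCSI ID of the device on that bus
    description: str

def build_proxy_table(devices, first_address):
    """Assign consecutive IP addresses, one per legacy device (assumed policy)."""
    base = ipaddress.ip_address(first_address)
    return {str(base + offset): device for offset, device in enumerate(devices)}

legacy_devices = [
    ScsiDevice(port=0, target_id=3, description="SCSI tape library"),
    ScsiDevice(port=1, target_id=0, description="SCSI disk array"),
]

table = build_proxy_table(legacy_devices, "192.0.2.10")
for address, device in table.items():
    print(f"{address} -> port {device.port}, ID {device.target_id} ({device.description})")
```

A host on the IP SAN would address the tape library by its proxied IP address, and the bridge would forward the SCSI commands to the matching port and target ID.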
SAN bridges provide a valuable function, both for preserving the customer's investment in expensive subsystems and for making those subsystems participants in a shared storage network. Although bridges are normally used to bring tape subsystems into a SAN, legacy SCSI disk arrays can also be supported. In some vendor implementations, even SCSI hosts (for example, servers with SCSI adapter cards) can be accommodated. This enables an IP storage network to be constructed with SCSI end systems only, using IP and Gigabit Ethernet as the SAN infrastructure. Customers with large SCSI installations could thus enjoy the benefits of shared storage networking without investing in new host adapters, storage, or tape.
3.1.5 Host Adapters
Host adapter cards provide the interface between the server or workstation internal bus and the external storage network. Interface cards are available for different bus architectures and may offer different physical connections for network interface. The adapter card vendor also supplies a device driver that allows the card to be recognized by the operating system. The device driver software may also perform protocol translation or other functions if these are not already executed by onboard logic.
Whether Fibre Channel or Gigabit Ethernet, the HBA or network interface card (NIC) must provide reliable gigabit communication at the physical and data link levels. As discussed later, Gigabit Ethernet has taken the physical layer and data encoding standards from Fibre Channel. Above the data encoding level, however, Gigabit Ethernet must appear as standard Ethernet to provide seamless integration to operating systems and applications.
For storage networking, two additional components may appear on the host Gigabit Ethernet interface card. To support storage data transfer efficiently, a storage NIC must incorporate an upper protocol layer for serial SCSI-3. This may be an iSCSI interface or an FCP interface. The purpose of this protocol interface is to deliver SCSI data to the operating system with high performance and low processor overhead. Fibre Channel FCP has solved this SCSI delivery issue, whereas the iSCSI initiative is reengineering an entirely new solution.
The second component that may be embedded on a storage NIC is additional logic to off-load TCP/IP processing from the host CPU. At gigabit speeds, TCP overhead may completely consume the resources of a host system. This would be unacceptable for servers in a storage network, which must simultaneously service both disk and user transactions. TCP off-load engines (for those lacking enough acronyms, TOEs) may be provided by software routines or, more efficiently, in ASICs or dedicated processors onboard the NIC.
Figure 3-5 A Fibre Channel HBA.
Figure 3-6 A storage over IP NIC.
Both adapters provide transmit and receive connections to the storage network, which may be via fiber-optic or copper cabling. Both adapter types provide clock and data recovery (CDR) logic to retrieve gigabit signaling from the inbound bit stream. Both provide serializing/deserializing logic to convert serial bits into parallel data characters for inbound streams, and convert parallel into serial for outbound streams. The mechanism for data encoding is provided by an 8b/10b encoding scheme originally developed by IBM and discussed in further detail later. For transmission on gigabit links, the data encoding method also uses special formatting of data and commands known as ordered sets.
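Two consequences of this design can be shown with a short sketch. First, because 8b/10b sends every 8-bit byte as a 10-bit character, only 80 percent of the line rate carries data. The line rates used below (1.0625 Gbaud for gigabit Fibre Channel and 1.25 Gbaud for Gigabit Ethernet) are the standard signaling rates. Second, the serializer and deserializer are simply a flatten and regroup of 10-bit characters. The bit ordering here is arbitrary and the character values are made up; the actual 8b/10b code tables are not modeled.

```python
def payload_rate_mbytes(line_rate_gbaud):
    """Data rate in megabytes per second when 10 line bits carry 8 data bits."""
    data_bits_per_second = line_rate_gbaud * 1e9 * 8 / 10
    return data_bits_per_second / 8 / 1e6

def serialize(characters):
    """Flatten 10-bit characters (integers 0..1023) into a list of bits."""
    return [(char >> shift) & 1 for char in characters for shift in range(9, -1, -1)]

def deserialize(bits):
    """Regroup a serial bit list into 10-bit characters."""
    characters = []
    for start in range(0, len(bits), 10):
        value = 0
        for bit in bits[start:start + 10]:
            value = (value << 1) | bit
        characters.append(value)
    return characters

print(payload_rate_mbytes(1.0625))   # gigabit Fibre Channel
print(payload_rate_mbytes(1.25))     # Gigabit Ethernet

outbound = [0b0101010101, 0b1100110011]   # made-up 10-bit characters
wire_bits = serialize(outbound)
print(deserialize(wire_bits) == outbound)
```

The Fibre Channel result, a little over 100 megabytes per second, is the basis for the commonly quoted figure of approximately 100 MBps on a gigabit link.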
Above the ordered set logic, a storage NIC will include a LAN controller chip, auxiliary logic and memory, an optional TOE, and hardware-based or software drivers for the serial SCSI-3 protocol. All of this functionality is made physically accessible to the host platform through the PCI, S-bus, or other bus interface and is logically accessible through the host device driver supplied by the manufacturer.
One of the often-cited benefits of IP-based storage networking is the ability to leverage familiar hardware and management software to deploy and maintain a SAN. In the example given here of a storage network adapter, there are clearly components in common with ordinary Ethernet NICs. It is unlikely, however, that off-the-shelf Gigabit Ethernet NICs will be suited to storage applications. Without embedded logic to speed serial SCSI processing and to off-load TCP overhead, a standard Gigabit Ethernet NIC will not provide the performance required for storage applications. From a management and support standpoint, however, the differences between a specialized storage NIC and a standard NIC are minimal when compared with Fibre Channel HBAs.