- Introduction to Cluster File Systems
- The NFS
- A Survey of Some Open-Source Parallel File Systems
- Commercially Available Cluster File Systems
- Cluster File System Summary
- Introduce the concept of a cluster file system
- Explore available file system options
- Present examples of cluster file system configurations
The compute slices in a cluster work on pieces of a larger problem. Without the ability to read data and write results, the cluster's computation is of little value. Whether the cluster is a database cluster or a scientific cluster, the compute slices need coordinated access to the shared data that is "behind" their computations. This chapter introduces some of the design issues associated with providing data to multiple compute slices in a cluster and presents some of the currently available solutions.
14.1 Introduction to Cluster File Systems
When talking about cluster file systems, we have to be careful about how we define the phrase. One simple definition might be: "a file system that is capable of serving data to a cluster." Another, more precise definition might be: "a file system that allows multiple systems to access shared file system data independently, while maintaining a consistent view of that data between clients."
There is a subtle difference in the two definitions to which we need to pay close attention. The first definition might include network-based file systems like the NFS, which has only advisory locking and can therefore allow applications to interfere with each other destructively. The second definition tends to imply a higher level of parallelism coupled with consistency checks and locking to prevent collisions between multiple application components accessing the same data. Although the first definition is somewhat vague, the second definition might be said to refer to a "parallel file system."
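To make the advisory-locking caveat concrete, here is a minimal sketch of how an advisory lock protects only cooperating writers. The file name and the two writers are illustrative inventions (and both run in one process for brevity); in a cluster, the uncooperative writer would be another NFS client entirely.

```python
import fcntl
import os
import tempfile

# Advisory locks (the kind NFS traditionally offers) protect only
# cooperating processes: a writer that never asks for the lock is
# free to scribble over the file anyway.
path = os.path.join(tempfile.mkdtemp(), "shared.dat")

with open(path, "w") as cooperative:
    # A cooperating client takes the lock before writing.
    fcntl.lockf(cooperative, fcntl.LOCK_EX)
    cooperative.write("checkpoint record\n")
    cooperative.flush()

    # An uncooperative writer simply opens the file and writes over it;
    # nothing in the kernel stops it, because the lock is advisory.
    with open(path, "w") as rogue:       # truncates despite the lock
        rogue.write("garbage\n")

print(open(path).read())   # the rogue write survives
```

A mandatory or parallel file system lock, by contrast, would have blocked or failed the second writer rather than letting the two interfere.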
14.1.1 Cluster File System Requirements
Because the I/O to a cluster's file system is performed by multiple clients at the same time, there is a primary requirement for performance. Systems that run large parallel jobs often perform data "checkpointing" to allow recovery in the event that the job fails. There may be written requirements for a "full cluster checkpoint" within a given period of time, which translates directly into file system throughput requirements.
If you assume that the compute slices in the cluster must simultaneously write all memory to the disk for a checkpoint in a given length of time, you can compute the upper-bound requirement for the cluster file system's write performance. For example, 128 compute slices, each with 4 GB of RAM, would generate 512 GB of checkpoint data. If the period of time allotted for a clusterwide checkpoint is ten minutes, this yields 51.2 GB/min, or 853 MB/s.
This is an extraordinarily high number, given the I/O capabilities of back-end storage systems. For example, a single 2-Gbps Fibre-Channel interface is capable of a theoretical maximum throughput of 250 MB/s. Getting 80% of this theoretical maximum is probably wishful thinking, but it would provide 200 MB/s of write capability. To "absorb" the level of I/O we are contemplating would take at least five independent 2-Gbps Fibre-Channel connections on the server, and this ignores the back-end disk array performance requirements. Large, sequential writes are a serious performance problem, particularly for RAID 5 arrays, which must calculate parity information.
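The checkpoint arithmetic above can be sketched as a quick calculation. The figures are the ones used in the text (treating 1 GB as 1,000 MB), and the 80% link efficiency is the same rough estimate, not a measured value.

```python
# Back-of-the-envelope sizing for the clusterwide checkpoint example.
compute_slices = 128
ram_gb_per_slice = 4
checkpoint_window_s = 10 * 60                         # ten minutes

total_mb = compute_slices * ram_gb_per_slice * 1000   # 512,000 MB
required_mb_s = total_mb / checkpoint_window_s        # ~853 MB/s

# A 2-Gbps Fibre-Channel link: 250 MB/s theoretical, ~80% usable.
usable_fc_mb_s = 0.80 * 250                           # 200 MB/s
links_needed = -(-required_mb_s // usable_fc_mb_s)    # ceiling division

print(f"{required_mb_s:.0f} MB/s requires {links_needed:.0f} FC links")
```

Running this reproduces the chapter's figures: roughly 853 MB/s of aggregate write bandwidth, and five 2-Gbps Fibre-Channel connections to absorb it.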
This example assumes a compute cluster performing sequential writes for a checkpoint operation. The requirements for database applications have different I/O characteristics, but can be just as demanding in their own right. Before we get too much ahead of ourselves, let's consider the general list of requirements for a cluster's file system. Some of the attributes we seek are
- High levels of performance
- Support for large data files or database tables
- Multiuser and multiclient security
One might also add "consistent name space" to this list. This phrase refers to the ability for all clients to access files using the same file system paths. Some file systems provide this ability automatically (AFS, the DCE distributed file system [DFS], and Microsoft's distributed file system [dfs]), and others require the system administrator to create a consistent access structure for the clients (NFS, for example). We will need to watch for situations that might affect the client's view of the file system hierarchy.
There are some standard questions that we can ask when we need to analyze the architecture of a given cluster file system solution.
- How is the file system storage shared?
- How many systems can access the file system simultaneously?
- How is the storage presented to client systems?
- Is data kept consistent across multiple systems and their accesses, and if so, how?
- How easy is the file system to administer?
- What is the relative cost of the hardware and software to support the file system?
- How many copies of the file system code are there, and where do they run?
- How well does the architecture scale?
These questions, and others, can help us to make a choice for our cluster's file system. Two primary types of file system architectures are available to us: network-attached storage (NAS) and parallel access. As we will see in the upcoming sections, each has its own fans and detractors, along with advantages and disadvantages.
14.1.2 Networked File System Access
Both NFS and CIFS are client-server implementations of "remote" network-attached file systems. These implementations are based on client-to-server RPC calls that allow the client's applications to make normal UNIX file system calls (open, close, read, write, stat, and so on) that are redirected to a remote server by the operating system. The server's local, physical file systems are exported via an Ethernet network to the server's clients.
The server's physical file systems, such as ext3, jfs, reiser, or xfs, which are exported via NFS, exist on physical storage attached to the server that exports them. The I/O between the network, system page cache, and the file blocks and meta-data on the storage is performed only by a single server that has the file system mounted and exported. In this configuration, the physical disks or disk array LUNs, and the data structures they contain, are modified by a single server system only, so coordination between multiple systems is not necessary. This architecture is shown in Figure 14-1.
Figure 14-1 A network-attached file system configuration
Figure 14-1 shows two disk arrays that present a single RAID 5 LUN each. Because each array has dual controllers for redundancy, the NFS servers "see" both LUNs through two 2-Gbps (full-duplex) Fibre-Channel paths. (It is important that the two systems "see" the same view of the storage, in terms of the devices. There are "multipath I/O" modules and capabilities available in Linux that can help with this task.) The primary server system uses the LUNs to create a single 2.8-TB file system that it exports to the NFS clients through the network. The secondary server system is a standby replacement for the primary and can replace the primary system in the event of a server failure. To keep the diagram from becoming too busy, the more realistic example of one LUN per array controller (four total) is not shown.
The NAS configuration can scale only to a point. The I/O and network interface capabilities of the server chassis determine how many client systems the NFS server can accommodate. If the physical I/O or storage attachment abilities of the server are exceeded, then the only easy way to scale the configuration is to split the data set between multiple servers, or to move to a larger and more expensive server chassis configuration. The single-server system serializes access to file system data and meta-data on the attached storage while maintaining consistency, but it also becomes a potential performance choke point.
Splitting data between multiple equivalent servers is inconvenient, mainly because it generates extra system administration work and can inhibit data sharing. Distributing the data between multiple servers produces multiple back-up points and multiple SPOFs, and it makes it difficult to share data between clients from multiple servers without explicit knowledge of which server "has" the data. If read-only data is replicated between multiple servers, the client access must somehow be balanced between the multiple instances of the data, and the system administrator must decide how to split and replicate the data.
In the case of extremely large data files, or files that must be both read and written by multiple groups of clients (or all clients) in a cluster, it is impossible to split either the data or the access between multiple NFS servers without a convoluted management process. To avoid splitting data, it is necessary to provide very large, high-performance NFS servers that can accommodate the number of clients accessing the data set. This can be a very expensive approach and once again introduces performance scaling issues.
Another point to make about this configuration is that although a SAN may appear to allow multiple storage LUNs containing a traditional physical file system to be simultaneously accessible to multiple server systems, this is not the case. With the types of local file systems that we are discussing, only one physical system may have the LUN and its associated file system data mounted for read and write at the same time. So only one server system may be exporting the file system to the network clients at a time. (This configuration is used in active-standby high-availability NFS configurations. Two [or more] LUNs in a shared SAN may be mounted by a server in read-write mode, whereas a second server mounts the LUNs in read-only mode, awaiting a failure. When the primary server fails, the secondary server may remount the LUNs in read-write mode, replay any journal information, and assume the IP address and host name of the primary server. The clients, as long as they have the NFS file systems mounted with the hard option, will retry failed operations until the secondary server takes over.)
The SAN's storage LUNs may indeed be "visible" to any system attached to the SAN (provided they are not "zoned" out by the switches), but the ability to read and write to them is not available. This is a limitation of the physical file system code (not an inherent issue with the SAN), which was not written to allow multiple systems to access the same physical file system's data structures simultaneously. (There are, however, restrictions on the number of systems that can perform "fabric logins" to the SAN at any one time. I have seen one large cluster design fall into this trap, so check the SAN capabilities before you commit to a particular design.) This type of multiple-system, readwrite access requires a parallel file system.
14.1.3 Parallel File System Access
If you imagine several to several thousand compute slices all reading and writing to separate files in the same file system, you can immediately see that the performance of the file system is important. If you imagine these same compute slices all reading and writing to the same file in a common file system, you should recognize that there are other attributes, in addition to performance, that the file system must provide. One of the primary requirements for this type of file system is to keep data consistent in the face of multiple client accesses, much as semaphores and spin locks are used to keep shared-memory resources consistent between multiple threads or processes.
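The semaphore analogy can be shown in miniature with a single-host sketch: many threads performing a read-modify-write on shared state, serialized by a lock. The counter and thread counts here are arbitrary; a parallel file system must provide the equivalent guarantee across whole machines rather than threads within one process.

```python
import threading

# The analogy in miniature: many writers updating one shared structure.
# Without the lock, the read-modify-write below can race and lose
# updates; with it, every update survives -- the same consistency
# guarantee a parallel file system must extend to multiple clients.
counter = {"blocks_written": 0}
lock = threading.Lock()

def writer(n_updates: int) -> None:
    for _ in range(n_updates):
        with lock:                      # serialize the critical section
            counter["blocks_written"] += 1

threads = [threading.Thread(target=writer, args=(10_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["blocks_written"])        # all 80,000 updates retained
```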
Along with keeping the file data blocks consistent during access by multiple clients, a parallel cluster file system must also contend with multiple client access to the file system meta-data. Updating file and directory access times, creating or removing file data in directories, allocating inodes for file extensions, and other common meta-data activities all become more complex when the parallel coordination of multiple client systems is considered. The use of SAN-based storage makes shared access to the file system data from multiple clients easier, but the expense of Fibre-Channel disk arrays, storage devices, and Fibre-Channel hub or switch interconnections must be weighed against the other requirements for the cluster's shared storage.
Figure 14-2 shows a parallel file system configuration with multiple LUNs presented to the client systems. Whether the LUNs may be striped together into a single file system depends on which parallel file system implementation is being examined, but the important aspect of this architecture is that each client has shared access to the data in the SAN, and the client's local parallel file system code maintains the data consistency across all accesses.
Figure 14-2 A parallel file system configuration
A primary issue with this type of "direct attach" parallel architecture involves the number of clients that can access the shared storage without performance and data contention issues. This number depends on the individual parallel file system implementation, but a frequently encountered number is in the 16 to 64 system range. It is becoming increasingly common to find a cluster of parallel file system clients in a configuration such as this acting as a file server to a larger group of cluster clients.
Such an arrangement allows scaling the file system I/O across multiple systems, while still providing a common view of the shared data "behind" the parallel file server members. Using another technology to "distribute" the file system access also reduces the number of direct attach SAN ports that are required, and therefore the associated cost. SAN interface cards and switches are still quite expensive, and a per-attachment cost, including the switch port and interface card, can run in the $2,000- to $4,000-per-system range.
There are a variety of interconnect technologies and protocols that can be used to distribute the parallel file system access to clients. Ethernet, using TCP/IP protocols like Internet SCSI (iSCSI), can "fan out" or distribute the parallel file server data to cluster clients. It is also possible to use NFS to provide file access to the larger group of compute slices in the cluster. An example is shown in Figure 14-3.
Figure 14-3 A parallel file system server
The advantage to this architecture is that the I/O bandwidth may be scaled across a larger number of clients without the expense of direct SAN attachments to each client, or the requirement for a large, single-chassis NFS server. The disadvantage is that by passing the parallel file system "through" an NFS server layer, the clients lose the fine-grained locking and other advantages of the parallel file system. Where and how the trade-off is made in your cluster depends on the I/O bandwidth and scaling requirements for your file system.
The example in Figure 14-3 shows an eight-system parallel file server distributing data from two disk arrays. Each disk array is dual controller, with a separate 2-Gbps Fibre-Channel interface to each controller. If a file system were striped across all four LUNs, we could expect the aggregate write I/O to be roughly four times the real bandwidth of a 2-Gbps FC link, or 640 MB/s (figuring 80% of 200 MB/s, or 160 MB/s, per link).
Dividing this bandwidth across eight server systems requires each system to deliver 80 MB/s, which is the approximate bandwidth of one half of a full-duplex GbE link (80% of 125 MB/s). Further dividing this bandwidth between four client systems in the 32-client cluster leaves 20 MB/s per client. At this rate, a client with 8 GB of RAM would require 6.67 minutes to save its RAM during a checkpoint.
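Working through that division explicitly, with the same assumptions as the text (decimal megabytes, roughly 160 MB/s of real bandwidth per FC link):

```python
# Following the arithmetic of the eight-server example in the text.
aggregate_mb_s = 4 * 160                  # four striped LUNs, ~160 MB/s each
servers = 8
per_server_mb_s = aggregate_mb_s / servers            # 80 MB/s per server

clients = 32
clients_per_server = clients // servers               # 4 clients each
per_client_mb_s = per_server_mb_s / clients_per_server  # 20 MB/s per client

client_ram_mb = 8 * 1000                  # 8 GB of RAM, decimal MB
checkpoint_minutes = client_ram_mb / per_client_mb_s / 60

print(f"{per_client_mb_s:.0f} MB/s per client; "
      f"{checkpoint_minutes:.2f} min to checkpoint 8 GB")
```

The result matches the chapter's figures: 20 MB/s per client, and about six and two-thirds minutes to write out 8 GB of RAM.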
Hopefully you can begin to see the I/O performance issues that clusters raise. It is likely that the selection of disk arrays, Fibre-Channel switches, network interfaces, and the core switch will influence the actual bandwidth available to the cluster's client systems. Scaling up this file server example involves increasing the number of Fibre-Channel arrays, LUNs, file server nodes, and GbE links.
The examples shown to this point illustrate two basic approaches to providing data to cluster compute slices. There are advantages and disadvantages to both types of file systems in terms of cost, administration overhead, scaling, and underlying features. In the upcoming sections I cover some of the many specific options that are available.