Understanding Differences Between Types of Shared File Systems
Three primary shared file systems are available on the Solaris OS: NFS, QFS Shared Writer (QFS/SW), and the cluster Proxy File System (PxFS). Each is designed for a different purpose, and their unique characteristics differentiate them sharply.
Sharing Data With NFS
To Solaris OS users, NFS is by far the most familiar file system. It is an explicit over-the-wire file sharing protocol that has been a part of the Solaris OS since 1986. Its manifest purpose is to permit safe, deterministic access to files located on a server with reasonable security. Although NFS is media independent, it is most commonly seen operating over TCP/IP networks.
NFS is specifically designed to operate in multiclient environments and to provide a reasonable tradeoff between performance, consistency, and ease of administration. Although NFS has historically been neither particularly fast nor particularly secure, recent enhancements address both of these areas. Performance improved by 50–60 percent between the Solaris 8 and Solaris 9 OSs, primarily due to greatly increased efficiency in processing attribute-oriented operations. Data-intensive operations do not improve by the same margin because they are dominated by data transfer times rather than attribute operations.
Security, particularly authentication, has been addressed through the use of much stronger authentication mechanisms such as those available using Kerberos. NFS clients now need to trust only their servers, rather than their servers and their client peers.
Understanding the Sharing Limitations of UFS
UFS is not a shared file system. Despite a fairly widespread interest in a limited-use configuration (specifically, mounted for read/write operation on one system, while mounted read-only on one or more "secondary" systems), UFS is not sharable without the use of an explicit file sharing protocol such as NFS. Although read-only sharing seems as though it should work, it doesn't. This is due to fairly fundamental decisions made in the UFS implementation many years ago, specifically in the caching of metadata. UFS was designed with only a single system in mind and it also has a relatively complex data structure for files, notably including "indirect blocks," which are blocks of metadata that contain the addresses of real user data.
To maintain reasonable performance, UFS caches metadata in memory, even though it writes metadata to disk synchronously. This way, it is not required to re-read inodes, indirect-blocks, and double-indirect blocks to follow an advancing file pointer. In a single-system environment, this is a safe assumption. However, when another system has access to the metadata, assuming that cached metadata is valid is unsafe at best and catastrophic at worst.
A writable UFS file system can change the metadata and write it to disk. Meanwhile, a read-only UFS file system on another node holds a cached copy of that metadata. If the writable system creates a new file or removes or extends an existing file, the metadata changes to reflect the request. Unfortunately, the read-only system does not see these changes and, therefore, has a stale view of the file system. This is nearly always a serious problem, with consequences ranging from corrupted data to a system crash. For example, if the writable system removes a file, its blocks are placed on the free list. The read-only system is not informed of this, so a subsequent read of the same file causes it to follow the original data pointers and read blocks that are now on the free list!
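The stale-cache hazard described above can be sketched in a few lines. This is a hypothetical simulation, not UFS code: two nodes share one "disk," each caching metadata independently at mount time, as UFS does in memory.

```python
# Hypothetical sketch of the UFS stale-metadata hazard (illustrative names only).
class Disk:
    def __init__(self):
        self.blocks = {1: "file data"}   # block number -> contents
        self.free_list = set()
        self.inodes = {"f": [1]}         # file name -> block pointers

class Node:
    def __init__(self, disk):
        self.disk = disk
        # Metadata is cached at mount time and never revalidated.
        self.cached_inodes = dict(disk.inodes)

    def remove(self, name):
        # Writable node: free the file's blocks and update on-disk metadata.
        for b in self.disk.inodes.pop(name):
            self.disk.free_list.add(b)
            self.disk.blocks[b] = "REUSED"   # freed block may be reallocated

    def read(self, name):
        # Read-only node: trusts its stale cached metadata.
        return [self.disk.blocks[b] for b in self.cached_inodes[name]]

disk = Disk()
writer, reader = Node(disk), Node(disk)
writer.remove("f")
print(reader.read("f"))  # ['REUSED'] -- stale pointers lead into freed blocks
```

The read-only node returns reallocated block contents rather than an error, which is exactly why the consequences range from silent corruption to a crash.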
Rather than risk such extreme consequences, it is better to use one of the many other options that exist. The choice among them is driven by a combination of how often updated data must be made available to the other systems and the size of the data sets involved.

If the data is not updated too often, the most logical option is to make a copy of the file system and provide the copy to other nodes. With point-in-time copy facilities such as Sun Instant Image, HDS ShadowImage, and EMC TimeFinder, copying a file system does not need to be an expensive operation. It is entirely reasonable to export a point-in-time copy of a UFS file system from storage to another node (for example, for backup) without risk, because neither the original nor the copy is being shared.

If the data changes frequently, the most practical alternative is to use NFS. Although performance is usually cited as a reason not to do this, the requirements are usually not demanding enough to warrant other solutions. NFS is far faster than most users realize, especially in environments where typical files are smaller than 5–10 megabytes. If the application involves distributing rapidly changing large streams of bulk data to multiple clients, QFS/SW is a more suitable solution, albeit one not bundled with the operating system.
Maintaining Metadata Consistency Among Multiple Systems With QFS Shared Writer
The architectural problem that prevents UFS file systems from being mounted on multiple systems is the absence of any provision for maintaining metadata consistency among multiple systems. QFS Shared Writer (QFS/SW) implements precisely such a mechanism by centralizing access to metadata in a metadata server located in the network. Typically, this is accomplished using a split data and metadata path: metadata is accessed through IP networks, while user data is transferred over a SAN.
All access to metadata is required to go over regular networks for arbitration by the metadata server. The metadata server is responsible for coordinating possibly conflicting access to metadata from varying clients. Assured by the protocol and the centralized server that metadata are consistent, all client systems are free to cache metadata without fear of catastrophic changes. Clients then use the metadata to access user data directly from the underlying disk resources, providing the most efficient available path to user data.
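The split data and metadata path can be illustrated with a minimal sketch. All names here are hypothetical; this is not the QFS/SW wire protocol, only the shape of it: block lists come from a central metadata server over IP, and data is then read directly from shared storage.

```python
# Illustrative sketch of a split metadata/data path (hypothetical names).

def metadata_server_lookup(path, allocation_table):
    # Central arbiter: the only place block lists are handed out, so every
    # client sees a consistent view and may safely cache the result.
    return allocation_table[path]

def client_read(path, allocation_table, san_blocks, cache):
    if path not in cache:
        # Metadata path: one exchange with the server over the IP network.
        cache[path] = metadata_server_lookup(path, allocation_table)
    # Data path: direct block reads from the SAN, bypassing the server.
    return b"".join(san_blocks[b] for b in cache[path])

san_blocks = {0: b"hello ", 1: b"world"}        # stands in for SAN storage
allocation_table = {"/shared/f": [0, 1]}        # server's metadata
cache = {}
print(client_read("/shared/f", allocation_table, san_blocks, cache))
```

Because the server arbitrates every metadata change, the client-side cache can be trusted between invalidations, and repeated reads touch only the SAN.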
Direct Access Shared Storage
The direct access architecture offers vastly higher performance than existing network-sharing protocols when it comes to manipulating bulk data. This arrangement eliminates or greatly reduces two completely different types of overhead. First, data is transferred using the semantic-free SCSI block protocol. Transferring the data between the disk array and the client system requires no interpretation; no semantics are implied in the nature of the protocol, thus eliminating any interpretation overhead. By comparison, the equivalent NFS operations must use the NFS, RPC, XDR, TCP, and IP protocols, all of which are normally interpreted by the main processors.
The QFS/SW arrangement also eliminates most of the copies involved in traditional file sharing protocols such as NFS and CIFS. These file systems transfer data several times to get from the disk to the client's memory. A typical NFS configuration transfers from the disk array to the server's memory, then from the server's memory to the NIC, then across the network, and then from the NIC to the client's memory. (This description overlooks many implementation details.) In contrast, QFS/SW simply transfers data directly from the disk to the client's memory, once the metadata operations are completed and the client is given permission to access data. For these reasons, QFS/SW handles bulk data transfers at vastly higher performance than traditional file sharing techniques.
Furthermore, QFS/SW shares its on-disk format with QFS/local. In particular, user data can be configured with all of the options available with QFS/local disk groups, including striped and round-robin organizations. These capabilities make it far easier to aggregate data transfer bandwidth with QFS than with NFS, further increasing the achievable throughput for bulk data operations. User installations have measured single-stream transfers in excess of 800 megabytes per second using QFS or QFS/SW. One system has been observed transferring in excess of 3 gigabytes per second. (Obviously, such transfer rates require non-trivial underlying storage configurations, usually requiring 10–32 Fibre Channel disk arrays, depending on both the file system parameters and the array capability and configuration.) For comparison, the maximum currently achievable single-stream throughput with NFS is roughly 70 megabytes per second.
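The difference between the two disk group organizations mentioned above can be sketched as block placement policies. The function names are illustrative, not QFS configuration syntax.

```python
# Hypothetical sketch of striped vs. round-robin data placement.

def striped_layout(num_blocks, num_arrays):
    # Striped: consecutive blocks rotate across arrays, so a single large
    # sequential read draws bandwidth from every array at once.
    return [b % num_arrays for b in range(num_blocks)]

def round_robin_layout(num_files, num_arrays):
    # Round-robin: each whole file lands on one array; bandwidth
    # aggregates across concurrent files rather than within one stream.
    return [f % num_arrays for f in range(num_files)]

print(striped_layout(8, 4))      # [0, 1, 2, 3, 0, 1, 2, 3]
print(round_robin_layout(5, 4))  # [0, 1, 2, 3, 0]
```

Striping is what makes the single-stream rates cited above possible: one client's transfer is spread over many arrays in parallel, so aggregate bandwidth scales with the number of arrays rather than being capped by any one of them.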
Handling of Metadata
The performance advantages of direct-client access to bulk data are so compelling that one might reasonably ask why data isn't always handled this way. There are several reasons. Oddly enough, one is performance, particularly scalability. Although performance is vastly improved when accessing user data using direct storage access, metadata operations are essentially the same speed for NFS and QFS/SW. However, the metadata protocols are not equivalent because they were designed for quite different applications. NFS scales well with the number of clients. NFS servers are able to support hundreds or thousands of clients with essentially the same performance. NFS was designed with sharing many files to many clients in mind, and it scales accordingly in this dimension, even when multiple clients are accessing the same files.
QFS/SW was designed primarily for environments in which data sets are accessed by only a few clients, and engineering tradeoffs favor high-performance bulk transfer over linear client scalability. Empirical studies report that while QFS/SW scales well for small numbers of nodes (four to eight), scalability diminishes rapidly thereafter. To a large degree, this is to be expected: the bulk transfer capabilities provided are so high that a few clients can easily exhaust the capabilities of even high performance disk arrays.
Another consideration is that the efficiency of QFS/SW derives fundamentally from its direct, low-overhead access to the storage. This necessarily limits the storage configurations to which it can be applied; QFS/SW is distinctly not a wide-area sharing protocol. Furthermore, as noted elsewhere in this article, the NFS and QFS/SW trust models are completely different.

One of the key considerations in the efficiency of shared file systems is the relative weight of metadata operations compared to the weight of data transfer operations. Most file sharing processes involve a number of metadata operations. Opening a file typically requires reading and identifying the file itself and all of its containing directories, as well as identifying access rights to each. The process of finding and opening /home/user/.csh requires an average of seven metadata lookup operations and nine client/server exchanges with NFS; QFS/SW is of similar complexity. Compared with the typical 70-kilobyte file found in user directories, these metadata operations so dominate the cost of data transfer that even completely eliminating transfer overhead would have little material impact on the efficiency of the client/server system. The efficiency advantages of direct access storage are only meaningful when the storage is sufficiently accessible and when the data is large enough for the transfer overhead to overwhelm the cost of the required metadata operations.
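A back-of-envelope calculation makes the point concrete. The exchange count and file size come from the example above; the round-trip latency and data-path bandwidth are assumed figures for illustration only.

```python
# Back-of-envelope model: metadata cost vs. transfer cost for a small file.
rtt_s = 0.001            # assumed per-exchange network round trip: 1 ms
exchanges = 9            # client/server exchanges for the NFS open, per the text
file_bytes = 70 * 1024   # typical ~70-kilobyte user file
bandwidth = 100e6        # assumed ~100 MB/s data path

metadata_cost = exchanges * rtt_s            # 9.0 ms
transfer_cost = file_bytes / bandwidth       # ~0.72 ms
print(f"metadata: {metadata_cost * 1000:.1f} ms, "
      f"transfer: {transfer_cost * 1000:.2f} ms")
```

Under these assumptions the metadata exchanges cost roughly an order of magnitude more than moving the data itself, so even a zero-overhead data path would barely change the total; only large transfers invert the ratio.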
Understanding How PxFS Interacts With Sun Cluster Software
One of Sun Cluster software's key components is the Proxy File System (PxFS). This is an abstraction used within a cluster to present to clients the illusion that all cluster members have access to a common pool of data. In particular, PxFS differs from more general file sharing protocols in two fundamental ways.
Most significantly, PxFS differs in scope. NFS provides data access to a set of clients that are general in nature; the clients do not even need to run the same OS as the server, and they definitely are not required to participate in the same cluster as the server. PxFS has quite a different scope: it exports data only to other members of the same Sun Cluster.
The type of processing clients apply to data obtained through PxFS depends on how the data will be used. For example, if clients outside the cluster must have access to data residing in the cluster, client member nodes can export data through standard export protocols such as NFS and CIFS. In this case, a node would simultaneously be both a PxFS client and an NFS server.
PxFS data is normally transferred over the cluster interconnect. This is usually the highest performance interconnect available in the configuration and it usually delivers somewhat higher performance than typical local area network (LAN) interfaces. However, because most of the overhead in sharing data is due to underlying transports rather than to the sharing protocol itself, PxFS transfers over the cluster interconnect are often not as fast as users might expect, even using fast media such as InterDomain Networking, Myrinet, or WildFire.
The second main area of difference between PxFS and general file sharing protocols is intimacy between the cooperating nodes. PxFS is an integral part of Sun Cluster software; it is layered above native local file systems on each cluster node. The PxFS interface is fairly intimate and requires explicit changes in the underlying file system to accommodate the PxFS support calls. At the time of this writing, UFS and VxFS are the only local file systems supported under PxFS. (Other file systems might operate on cluster nodes, but their data is not necessarily available to other members of the cluster.)
Another basic design characteristic of PxFS is that it operates at the same trust level as the cluster software and the operating system itself. Clients and servers inherently trust each other. This is not true with either NFS or CIFS, and the level of trust between client and server is also higher than for QFS/SW.
FIGURE 1 Capabilities of Various Solaris Shared File Systems
The preceding figure illustrates the various capabilities and design tradeoffs for typical Solaris shared file systems. Indicated throughput is for a single thread running on a single client. Greater (often much greater) throughput can be obtained from a server, usually by configuring additional storage resources.