Design, Features, and Applicability of Solaris File Systems

Date: Mar 26, 2004

Article is provided courtesy of Prentice Hall Professional.

The Solaris Operating System includes many file systems, and more are available as add-ons. Deciding which file system to apply to a particular application can be puzzling without insight into the design criteria and engineering tradeoffs that go into each product. This article offers a taxonomy of file systems, describes some of the strengths and weaknesses of the different file systems, and provides insight into the issues you should consider when deciding how to apply the set of file systems that are available for specific applications. This article is intended for intermediate readers.

The SolarisTM Operating System (Solaris OS) includes many file systems, and more are available as add-ons. Deciding which file system to apply to a particular application can be puzzling without some insight into the design criteria and engineering tradeoffs that go into each product.

This Sun BluePrintsTM OnLine article offers a taxonomy of file systems as a means of classifying the multitude of different offerings. This can be particularly difficult because some file systems have many different facets, leading to confusion even about what they do. Within each category, each file system has strengths and weaknesses, and these often dictate which applications are appropriate for each product. After addressing the designs, we consider the general problem of deciding how to apply the set of available file systems to specific applications.

This article provides crucial, often overlooked, information for system administrators and architects and assumes that you are comfortable with Solaris concepts and basic file systems. This article addresses the following topics:

Understanding What a File System Is
Understanding File System Taxonomy
Understanding Local File System Functionality
Understanding Differences Between Types of Shared File Systems
Understanding How Applications Interact With Different Types of File Systems

Understanding What a File System Is

Before considering what file systems to use with which applications, you should understand what we mean by the term file system. In the context of this article, a file system stores named data sets and attributes about those data sets for subsequent data access and interpretation of the attributes. Attributes include things like ownership, access rights, date of last access, and physical location. More advanced attributes might be textual descriptions, migration or performance policies, and encryption keys. This definition sounds simple, but in fact it is quite broad and covers many things.

Understanding File System Taxonomy

When thinking about file systems, it is useful to break the group into categories. In particular, there are five broad categories of file system available in the Solaris OS.

Although intellectually interesting and often useful for observing the way a system functions, pseudo file systems are typically not flexible; there is usually just one way to use them. In contrast, many of the other file systems can be used in a multitude of ways. Most of the remainder of this article is devoted to figuring out which file system to apply to what task, and how one might best go about applying them.

Understanding Local File System Functionality

Local file systems are the most visible type of file system. Those most often encountered on Solaris systems are UFS and its relatives, VxFS, QFS/local, and SAMFS. VxFS is a good compromise design that does most things fairly well. UFS and QFS/local, in particular, have almost diametrically opposite design centers. SAMFS is essentially QFS without the separation of data and metadata. QFS or SAMFS with the SAM option is a different case in several senses because its primary value is the hierarchical storage management functionality that is not available in the other three main local file systems.

Handling General File System Functionality With UFS

Every Solaris system includes UFS. Because it is the most integrated of the file systems, it has received a lot of development attention over the past few years. While it is definitely old and lacking some features, it is also very suitable for a wide variety of applications. The UFS design center handles typical files found in office and business automation systems. The basic I/O characteristics are huge numbers of small, cachable files, accessed randomly by individual processes; bandwidth demand is low. This profile is common in most workloads, such as software development and network services (for example, in name services, web sites, and ftp sites).

In addition to the basic UFS, there are two variants, logging UFS (LUFS) and the metatrans UFS. All three versions share the same basic code for block allocation, directory management, and data organization. In particular, all current versions of UFS have a nominal maximum file system size of 1 terabyte (the limit will be raised to 16 terabytes in the Solaris 10 OS). Obviously, a single file stored in any of them must fit inside a file system, so the maximum size file is slightly smaller, about 1009 gigabytes out of a 1024 gigabyte file system. There is no reasonable limit to the number of file systems that can be built on a single system; systems have been run with over 2880 UFS file systems.

Benefits of Logging

The major differences between the three UFS variants are in how they handle metadata. Metadata is information that the file system stores about the data, such as the name of the file, ownership and access rights, last modified date, file size, and other similar details. Other, less obvious, but possibly more important metadata are the location of the data on the disk, such as data blocks and the indirect blocks that indicate where data blocks reside on the disk.

Getting this metadata wrong would not only mean that the affected file might be lost, but could lead to serious file system-wide problems or even a system crash in the event that live data found itself in the free space list, or worse, that free blocks somehow appeared in the middle of a file. UFS takes the simplest approach to assuring metadata integrity: it writes metadata synchronously and requires an extensive fsck on recovery from a system crash. The time and expense of the fsck operation is proportional to the number of files in the file system being checked. Large file systems with millions of small files can take tens of hours to check. Logging file systems were developed to avoid both the ongoing performance issues associated with synchronous writes and excessive time for recovery.

Logging uses the two-phase commit technique to ensure that metadata updates are either fully updated on disk, or that they will be fully updated on disk upon crash recovery. Logging implementations store pending metadata in a reserved area, and then update the master file system based on the content of the reserved area or log. In the event of a crash, metadata integrity is assured by inspecting the log and applying any pending metadata updates to the master file system before accepting any new I/O operations from applications. The size of the log is dependent on the amount of changing metadata, not the size of the file system. The amount of pending metadata is quite small, usually on the order of a few hundred kilobytes for typical file systems and several tens of megabytes for very busy file systems, so replaying the log against the master is a very fast operation. Once metadata integrity is guaranteed, the fsck operation becomes a null operation and crash recovery becomes trivial. Note that for performance reasons, only metadata is logged; user data is not logged.

The metatrans implementation was the first version of UFS to implement logging. It is built into Solstice DiskSuiteTM or SolarisTM Volume Manager software (the name of the product depends on the version of the code, but otherwise, they are the same). When the metatrans implementation was integrated into the Solaris 7 OS foundation, the two versions were the same. The only difference visible to users is that the integrated version stores the log internally rather than on a separate device. Although one would expect that performance would be better with separate devices, this did not prove to be the case due to the typical access patterns to UFS files. The extra administrative overhead was accordingly removed.

The Solaris Volume Manager software version was withdrawn when it was integrated into the Solaris 8 OS. It is recommended only for very old releases (the Solaris 2.5.1 and Solaris 2.6 OSs) in which logging UFS (LUFS) is not available. Although LUFS has been integrated since the Solaris 7 OS, logging is not enabled by default. This is due to concerns about performance degradation, which is typically found only at artificially high load levels; almost no cases have been seen in the field.

As of the Solaris 10 OS, logging is enabled by default. In practice, Sun recommends using logging any time that fast crash recovery is required, with releases as early as the Solaris 8 OS first customer shipment (FCS). This is particularly true of root file systems, which do not sustain enough I/O to trip even the rather obscure performance bugs found in the Solaris 7 OS.
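
For releases in which logging is not on by default, it is enabled with a mount option. The following is a minimal sketch; the device and mount point are placeholders, and the option can also be made persistent in the mount options field of /etc/vfstab.

    # Enable logging on an already-mounted UFS file system (example device and mount point)
    mount -F ufs -o remount,logging /dev/dsk/c0t0d0s7 /export/home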

Performance Impact of Logging

One of the most confusing issues associated with logging file systems (and particularly with logging UFS, for some reason) is the effect that the log has on performance. First, and most importantly, logging has absolutely no impact on user data operations; this is because only metadata operations are logged.

The performance of metadata operations is another story, and it is not as easy to describe. The log works by writing pending changes to the log, then actually applying the changes to the master file system. When the master is safely updated, the log entry is marked as committed, meaning that it does not need to be reapplied to the master in the event of a crash. This algorithm means that metadata changes that are accomplished primarily when creating or deleting files might actually require twice as many physical I/O operations as a non-logging implementation. The net impact of this aspect of logging performance is that there are more I/O operations going to storage. Typically, this has no real impact on overall performance, but in the case where the underlying storage was already nearly 100 percent busy, the extra operations associated with logging can tip the balance and produce significantly lower file system throughput. (In this case, throughput is not measured in megabytes per second, but rather in file creations and deletions per second.) If the utilization of the underlying storage is less than approximately 90 percent, the logging overhead is inconsequential.

On the positive side of the ledger, the most common impact on performance has to do with the cancellation of some physical metadata operations. These cases occur only when metadata updates are issued very rapidly, such as when doing a tar(1) extract operation or when removing the entire contents of a directory ("rm -f *"). Without logging, the system is required to force the directory to disk after every file is processed (this is the definition of the phrase "writing metadata synchronously"); the effect is to write 512 or 2048 bytes every time 14 bytes is changed. When the file system is logging, the log record is pushed to disk when the log record fills, often when the 512-byte block is completed. This results in roughly a 512/14 ≈ 37-fold reduction in physical I/O, and obvious performance improvements result.

The following table illustrates these results. The times are given in seconds, and lower scores are better. Times are the average of five runs, and are intended to show relative differences rather than the fastest possible absolute time. These tests were run on Solaris 8 7/01 using a single disk drive.

TABLE 1 Analyzing the Impact of UFS Logging on Performance

Test                  No Logging (seconds)   Logging (seconds)   Delta
tar extract           127                    21                  505%
rm -rf *              76                     2                   3700%
Create 1 GB file      35                     34                  2.94%
Read 1 GB file        34                     34                  0.00%

The tar test consists of extracting 7092 files from a 175 megabyte archive (the contents of /usr/openwin). Although a significant amount of data is moved, this test is dominated by metadata updates for creating the files. Logging is five times faster. The rm test removes the 7092 extracted files. It is also dominated by metadata updates and is an astonishing 37 times faster than the non-logging case.

On the other hand, the dd write test creates a single 1 gigabyte file in the file system, and the difference between logging and non-logging is a measurable, but insignificant, three percent. Reading the created file from the file system shows no performance impact from logging. Both tests use large block sizes (1 megabyte per I/O) to optimize throughput of the underlying storage.
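
The workload behind these numbers is easy to approximate. The following sketch shows its general shape, assuming a tar archive built from /usr/openwin and a scratch UFS mount point; the paths are illustrative, and this is not the exact script used to produce the table.

    # Small-file, metadata-intensive phase: extract and then remove ~7000 files
    cd /mnt/test
    time tar xf /var/tmp/openwin.tar
    time rm -rf *

    # Bulk-data phase: write and then read a 1 GB file in 1 MB blocks
    time dd if=/dev/zero of=/mnt/test/bigfile bs=1024k count=1024
    time dd if=/mnt/test/bigfile of=/dev/null bs=1024k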

UFS Direct I/O

Another feature present in most of the local file systems is the use of direct I/O. UFS, VxFS, and QFS all have forms of this feature, which is primarily intended to avoid the overhead associated with managing cache buffers for large I/O. At first glance, it might seem that caching is a good thing and that it would improve I/O performance.

There is a great deal of reality underlying these expectations. All of the local file systems perform buffer caching by default. The expected improvements occur for typical workloads that are dominated by metadata manipulation and data sets that are very small when compared to main memory sizes. Metadata, in particular, is very small, amounting to less than one kilobyte per file in most UFS applications, and only slightly more in other file systems. Typical user data sets are also quite small; they average about 70 kilobytes. Even the larger files used in every day work such as presentations created using StarOfficeTM software, JPEG images, and audio clips are generally less than 2 megabytes. Compared to typical main memory sizes of 256–2048 megabytes, it is reasonable to expect that these data sets and their attributes can be cached for substantial periods of time. They are reasonably likely to still be in memory when they are accessed again, even if that access comes an hour later.

The situation is quite different with bulk data. Systems that process bulk data tend to have larger memories, up to perhaps 16 gigabytes (for example, 8–64 times larger than typical), but the data sets in these application spaces often exceed 1 gigabyte and sometimes range into the tens or even hundreds of gigabytes. Even if the file literally fits into memory and could theoretically be cached, these data sets are substantially larger than memory that is consistently available for I/O caching. As a result, the likelihood that the data will still be in cache when the data is referenced again is quite low. In practice, cache reuse in these environments is nil.

Direct I/O Performance

Caching data anyway would be fine, except that the process requires effort on the part of the OS and processors. For small files, this overhead is insignificant. However, the overhead becomes not only significant, but excessive when "tidal waves" of data flow through the system. When reading 1 gigabyte of data from a disk in large blocks, throughput is similar for both direct and buffered cases; the buffered case delivers 13 percent greater throughput. The big difference between these two cases is that the buffered process consumes five times as much CPU effort. Because there is so little practical value to caching large data sets, Sun recommends using the forcedirectio option on file systems that operate on large files. In this context, large generally means more than about 15–20 megabytes. Note that the direct I/O recommendation is especially true when the server in question is exporting large files through NFS.
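
Direct I/O can be requested for an entire UFS file system with the forcedirectio mount option, as in the following sketch; the device and mount point are placeholders.

    # Mount a UFS file system that holds large, sequentially accessed files with direct I/O
    mount -F ufs -o forcedirectio /dev/dsk/c1t0d0s0 /bulk

    # Equivalent /etc/vfstab entry
    /dev/dsk/c1t0d0s0  /dev/rdsk/c1t0d0s0  /bulk  ufs  2  yes  forcedirectio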

If direct I/O is so much more efficient, why not use direct I/O all the time? Direct I/O means that caching is disabled. The impact of standard caching becomes obvious when using a UFS file system in direct I/O mode while doing small file operations. The same tar extraction benchmark used in the logging section above takes over 51 minutes, even with logging enabled, about 24 times as long as when using regular caching (2:08)! The benchmark results are summarized in the following table.

TABLE 2 Analyzing the Performance of Direct I/O and Buffered I/O

Test                Direct I/O                          Buffered I/O
                    Throughput (seconds)   CPU %        Throughput (seconds)   CPU %
Create 1 GB file    36                     5.0%         31                     25.00%
Read 1 GB file      30                     0.0%         22                     22.00%
tar extract         3062                   0.0%         128                    6.0%
rm -rf *            76                     1.2%         65                     1.0%

In this table, throughput is represented by elapsed times in seconds, and smaller numbers are better. The system in question is running Solaris 9 FCS on a 750-megahertz processor. The tests are disk-bound on a single 10K RPM Fibre Channel disk drive. The differences in throughput are mainly attributable to how the file system makes use of the capabilities of the underlying hardware.

Supercaching and 32-Bit Binaries

A discussion of buffered and direct I/O methodology is incomplete without addressing one particular attribute of the cached I/O strategy. Because file systems are part of the operating system, they can access the entire capability of the hardware. Of particular relevance is that file systems are able to address all of the physical memory, which now regularly exceeds the ability of 32-bit addressing. As a result, the file system is able to function as a kind of memory management unit (MMU) that permits applications that are strictly 32-bit aware to make direct use of physical memories that are far larger than their address pointers.

This technique, known as supercaching, can be particularly useful to provide extended caching for applications that are not 64-bit aware. The best examples of this are the open-source databases, MySQL and Postgres. Both of these are compiled in 32-bit mode, leaving their direct addressing capabilities limited to 4 gigabytes.1 However, when their data tables are hosted on a file system operating in buffered mode, they benefit from cached I/O. This is not as efficient as simply using a 64-bit pointer because the application must run I/O system calls instead of merely dereferencing a 64-bit pointer, but the advantages gained by avoiding I/O outweigh these considerations by a wide margin.

Handling Very Large Data Sets With QFS/Local

Whereas UFS was designed as a general-purpose file system to handle the prosaic needs of typical users, QFS originates from a completely different design center. QFS is designed with bulk data sets in mind, especially those with high bandwidth requirements. These files are typically accessed sequentially at very high bandwidth. As previously noted, attempts to cache such files are usually futile.

The key design features of QFS/local are the ability to span disk devices, to separate data and metadata, and to explicitly handle the underlying data devices, with associated user policies for assigning data locations to specific devices. Taken together, the net effect of these features is to create a file system that can handle massive data sets and provide mechanisms for accessing them at very high speeds.

QFS Direct I/O

As one might expect from such design criteria, QFS also offers direct I/O capabilities, and for the same reasons that UFS has them. QFS's massive I/O throughput capability puts an even greater premium on eliminating overhead than with UFS. The major difference between the QFS and UFS direct I/O implementations is that QFS allocates a per-file attribute, permitting administrators more control of which files are accessed without the cache. This can be set with the setfa(1M) command. As with UFS, direct I/O can be selected on an entire file system using a mount option or on an individual file using a call to ioctl(2).
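
For example, direct I/O might be applied to a single large file or to an entire QFS file system roughly as follows. The file name, family set name, and mount point are placeholders, and the exact option letters should be verified against the setfa(1M) and mount_samfs(1M) man pages for the installed release.

    # Set the direct I/O attribute on one file (per-file control; -D flag assumed)
    setfa -D /qfs1/streams/capture.dat

    # Or force direct I/O for the whole file system at mount time
    mount -F samfs -o forcedirectio qfs1 /qfs1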

QFS Volume Management

Whereas UFS and VxFS are hosted on a single disk resource, QFS can be hosted on multiple disk resources. In this context, a disk resource is anything that presents the appearance of a single virtual disk. This could be a LUN exported out of a disk array, a slice of a directly attached disk drive, or some synthesis of one or more of these created by a volume manager such as VxVM or Solaris Volume Manager software. The main point is that there are limits to virtual disk resources. In particular, a 1 terabyte maximum size exists when the file system is hosted on a single disk resource, and it is necessarily limited to the size of that resource2.

QFS essentially includes a volume manager in its inner core. A QFS file system is hosted on top of disk groups. (Do not confuse these with the completely unrelated VxVM concept of the same name.) A QFS disk group is a collection of disk resources that QFS binds together internally.

There are two types of disk group: round-robin and striped. The striped disk group is effectively the same thing as a RAID-0 of the underlying disk resources. Blocks are logically dispersed across each of the constituent disk resources according to a RAID-0 organization. One might use this configuration to maximize the available I/O bandwidth from a given underlying storage configuration.

Disk Organizations

The round-robin disk group is one of the most interesting features in QFS. Like striped disk groups, round-robin disk groups permit a file system to span disk resources. The difference is that the round-robin disk group is explicitly known to the block allocation procedures in QFS. More specifically, all blocks for a given file are kept in a single disk resource within the disk group. The next file is allocated out of another disk resource, and so on. This has the property of segregating access to the data set to a specific set of disk resources. For typical high-bandwidth, relatively low-user-count applications, this is a major advance over the more common striping mechanisms because it allows bandwidth and disk resource attention to be devoted to servicing access to fewer files. In contrast, striped groups provide greater overall bandwidth, but they also require that every device participate in access to all files.
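
To make this concrete, the following is a rough sketch of what the QFS master configuration file (/etc/opt/SUNWsamfs/mcf) might contain for a file system with a separate metadata device and round-robin data devices. The device paths, equipment ordinals, and family set name are illustrative only; replacing the mr entries with striped-group entries (g0, g1, and so on) would produce a striped organization instead.

    # Equipment            Eq   Eq    Family  Device
    # Identifier           Ord  Type  Set     State
    qfs1                   10   ma    qfs1    on
    # metadata device (mm), followed by round-robin data devices (mr)
    /dev/dsk/c2t0d0s0      11   mm    qfs1    on
    /dev/dsk/c3t0d0s0      12   mr    qfs1    on
    /dev/dsk/c3t1d0s0      13   mr    qfs1    on

After the mcf is edited, the file system is built with sammkfs and mounted with mount -F samfs.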

To see how round-robin groups can improve performance over striped groups, consider a QFS file system containing two large files and built on two devices. If two processes each access one file, the access pattern to the underlying disks is very different, depending on whether the file system uses striped or round-robin groups.

In the round-robin case, each file is stored on a single device. Therefore, sequential access to each file results in sequential access to the underlying device, with no seeks between disk accesses. If the file system uses striped groups, each file resides on both devices; therefore, each of the two processes sends access to both drives, and the access is interleaved on the disk. As a result, virtually every I/O requires a seek to get from one file to the other, dropping the throughput of the underlying disks by a factor of 5–20 times. The following table illustrates these results. Results are given in megabytes per second, and larger numbers are better.

TABLE 3 Analyzing File Access in Round-Robin and Striped Groups

I/O Size (KB)   Segregated (MB/sec)   Interleaved (MB/sec)   Ratio
8               37.0                  3.2                    11.7
32              66.7                  3.4                    19.9
64              71.4                  5.0                    14.4
128             71.4                  8.1                    8.9
512             71.4                  13.2                   5.4
1024            71.4                  14.3                   5.0

In each case, two I/O threads each issue sequential requests to the disks using the listed I/O size. In the segregated case, each thread has a dedicated disk drive; in the interleaved case, the data is striped across both drives, so both threads issue requests to both drives.

The test system uses two disk drives, each delivering about 36 megabytes per second. The segregated case goes about as fast as theory suggests, while the seeks required in the interleaved case result in throughput that is far lower than users would expect.

Segregation of User and Metadata

The principle of segregating I/Os to reduce contention for physical disk resources is why QFS has options for placing metadata on different disk resources than user data. Moving the seek-laden metadata access to dedicated disk resources permits most of the disk resources to concentrate on transferring user data at maximum speed.

Note that even though QFS includes things that are like volume managers, it is still possible and useful to use other disk aggregation technologies such as volume managers and RAID disk arrays. For example, it is often useful to use the RAID-5 implementations inside disk arrays, aggregated together to form disk groups, either directly or indirectly through a volume manager. Of course, the complexity of the solution increases with each added step, but sometimes this complexity is appropriate when other goals (performance, reliability, and the like) cannot be met in other ways.

One of the obvious consequences of QFS including a volume manager is that it can accommodate file systems that are far larger than a single volume or LUN. The theoretical limit of a single QFS on disk file system is in excess of 250 terabytes, although no single on-disk file system of such size is currently deployed. When aggregate data sizes get this large, it is far more cost effective to deal with them in a mixed online and nearline environment made possible by hierarchical storage management software such as SAM. Hierarchical storage management (HSM) systems enable the effective management of single file systems extending into multiple petabytes.

Understanding the Differences Between QFS/Local and SAMFS

SAMFS is something of a confusing term. Strictly speaking, it refers to a local file system that is strongly related to QFS/local. More specifically, it is functionally the same as QFS/local, except that it does not offer the ability to place user data and metadata on separate devices. In my opinion, there is so little difference between QFS/local and SAMFS that they can be treated together as QFS/local.

Managing Storage Efficiency With SAM

Unfortunately, the related facility SAM is often confusingly referred to as SAMFS. SAM really refers to storage, archive, and migration, and is a hierarchical storage facility. SAM is actually built into the same executables as QFS and SAMFS, and is separately licensed.

The combination of QFS+SAM is, in many ways, almost a different file system. SAM is a tool for minimizing the cost of the storage that supports data. Its primary goal is to find data that is not being productively stored and to migrate it transparently to lower-cost storage. This can result in rather surprising savings.

A number of studies over the years have shown that a huge proportion of data is actually dead, in the sense that it will not be referenced again. Studies have shown that daily access to data is often as little as one percent3, and that data not accessed in three days has a 55–85 percent probability of being completely dead in some studied systems4. Clearly, moving unused or low-usage data to the least expensive storage will help save expensive, high-performance storage for more important, live data.

One of SAM's daemons periodically scans the file system searching for migration candidates. When suitable candidates are found, they are copied to the destination location and the directory entry is updated to reflect the new location. File systems maintain location information about where each file's data is stored. In traditional on-disk file systems such as UFS, the location is restricted to the host disk resource. SAM's host file systems (QFS and SAMFS) extend this notion to include other locations, such as an offset and file number on a tape with a particular serial number. Because mapping the file name to data location is handled inside the file system, the fact that the data has been migrated is completely invisible to the application, with the possible exception of how long it might take to retrieve data.

Partial Migration

To augment these traditional HSM techniques, SAM offers options for migration, including movement of partial data sets and disk-to-disk migration. Partial migration means that the file system has the notion of segmentation; each file is transparently divided into segments that can be staged or destaged independently. In effect, this feature provides each segment with its own location data and last-modified times. The feature is especially useful for very large data sets because applications might not actually reference all of a large data set. In this case, restaging the entire set is both slow and wasteful.

For example, the use of file(1) on an unsegmented file illustrates the extent to which transfer times can be affected by segmentation. In this case, the program reads the first few bytes of a file and compares them to a set of known signatures in an attempt to identify the contents of the file. If file(1) is applied to a stale, unsegmented 10-terabyte file, the entire 10 terabytes must be staged from tape! However, such large files are normally segmented, and only the referenced segment is restaged. Because segments can be as small as 1 megabyte, this represents a substantial savings in data transfer time.

When SAM destages data sets, it usually writes them to a lower-cost, nearline medium such as tape or, in some cases, optical disk. However, these media are generally quite slow compared to disks, especially for recovery scenarios. Tapes are inherently serial access devices, and even tapes with rapid-seek capability, such as the STK 9940B, locate data far more slowly than disks. Tapes take tens of seconds to locate a specific byte in a data set, compared to tens of milliseconds on a disk, a disparity of roughly three orders of magnitude. Traditional transports such as the DLT-8000 can take even longer (several minutes). Note also that locating a cartridge and physically mounting a tape also takes an interesting period of time, which is especially true in the event that human intervention is required to reload a tape into a library, or if all transports are busy.

Cached Archiving on Disk

In many cases, it might be preferable to archive recently used data on lower-cost disks, rather than on relatively inconvenient tape media. For example, archiving to disk might be preferable to avoid giving users the impression that they are losing access to their data. In these circumstances, SAM can be directed to place the archive files on disk. The destination can be any file system available to the archiver process, including QFS, UFS, PxFS, or even NFS.

The archive files placed on the destination disk have the same format as tape archives. This means that small files are packed into large archive files for placement on disk. This is significantly more efficient than simply copying the files because there is no fragmentation within the archive file. Therefore, space is more efficiently utilized, and the file system only needs to manage the name of the archive file itself, avoiding any hint of problems with directory sizes. Furthermore, access to the archives is a serial I/O problem, which is much more efficient than the typical random access I/O found when the files are copied wholesale (for example, with cp -pr /from/* /to or similar).

One of the reasons that on-disk archives have the same format as tape archives is because one of SAM's key features is that it is able to save multiple copies of a single data set. Multiple copies might be saved to different tapes, or one might be saved on disk, two on tape, and one on optical. This makes it possible to keep a cached copy of archived data sets on disk while still retaining them in safe off-site storage.

One of the advantages of retaining a cache copy on disk is that the disk underlying the cached copy does not need to be expensive or particularly reliable. Even if disks were to fail, data can be recovered transparently from tape copies.

Backup Considerations

One unexpected benefit from the use of an on-disk archive is a large improvement in backup speed, with the possible consequence of dramatically shortening backup windows. Such improvements will only occur in environments that are dominated by small files. The problem is that backup processes must read small files from the disk in order to copy them to tape. Small file I/O is dominated by random access (seek) time and small physical I/Os. These two factors combine to reduce the effective bandwidth of a disk by approximately 95–97 percent. In environments primarily consisting of small files, this is sometimes enough to extend backup windows unacceptably when the backup is written directly to tape. This is because the backup data is being supplied to the tape at rates far lower than the rated speed of the tape transports. During this time, the tape is opened for exclusive use and is unavailable for any other purpose.

An on-disk archive can be constructed at leisure because the process does not require exclusive access to a physical device. The on-disk archives have a different form; specifically, many small files are coalesced into large archive files. When the on-disk archive is complete, the resulting large archive files can be copied to tape at streaming media speeds. Because the tape transports can be easily driven to their full-rated speed without fear of backhitching, they are used efficiently both in time and in capacity. Backhitches, an inherent characteristic of tape technologies such as DLT, DAT, and some others, sometimes lead to compression being disabled, because the lower effective incoming data rate required without compression might be able to keep the transport streaming. In these cases, the on-disk archive even improves tape density.

Understanding Differences Between Types of Shared File Systems

Three primary shared file systems are available on the Solaris OS: NFS, QFS/Shared Writer (QFS/SW), and the cluster Proxy File System (PxFS). Each of these file systems is designed to do different things, and their unique characteristics differentiate them widely.

Sharing Data With NFS

To Solaris OS users, NFS is by far the most familiar file system. It is an explicit over-the-wire file sharing protocol that has been a part of the Solaris OS since 1986. Its manifest purpose is to permit safe, deterministic access to files located on a server with reasonable security. Although NFS is media independent, it is most commonly seen operating over TCP/IP networks.

NFS is specifically designed to operate in multiclient environments and to provide a reasonable tradeoff between performance, consistency, and ease-of-administration. Although NFS has historically been neither particularly fast nor particularly secure, recent enhancements address both of these areas. Performance improved by 50–60 percent between the Solaris 8 and Solaris 9 OSs, primarily due to greatly increased efficiency processing attribute-oriented operations5. Data-intensive operations don't improve by the same margin because they are dominated by data transfer times rather than attribute operations.

Security, particularly authentication, has been addressed through the use of much stronger authentication mechanisms such as those available using Kerberos. NFS clients now need to trust only their servers, rather than their servers and their client peers.

Understanding the Sharing Limitations of UFS

UFS is not a shared file system. Despite a fairly widespread interest in a limited-use configuration (specifically, mounted for read/write operation on one system, while mounted read-only on one or more "secondary" systems), UFS is not sharable without the use of an explicit file sharing protocol such as NFS. Although read-only sharing seems as though it should work, it doesn't. This is due to fairly fundamental decisions made in the UFS implementation many years ago, specifically in the caching of metadata. UFS was designed with only a single system in mind and it also has a relatively complex data structure for files, notably including "indirect blocks," which are blocks of metadata that contain the addresses of real user data.

To maintain reasonable performance, UFS caches metadata in memory, even though it writes metadata to disk synchronously. This way, it is not required to re-read inodes, indirect-blocks, and double-indirect blocks to follow an advancing file pointer. In a single-system environment, this is a safe assumption. However, when another system has access to the metadata, assuming that cached metadata is valid is unsafe at best and catastrophic at worst.

A writable UFS file system can change the metadata and write it to disk. Meanwhile, a read-only UFS file system on another node holds a cached copy of that metadata. If the writable system creates a new file or removes or extends an existing file, the metadata changes to reflect the request. Unfortunately, the read-only system does not see these changes and, therefore, has a stale view of the system. This is nearly always a serious problem, with the consequences ranging from corrupted data to a system crash. For example, if the writable system removes a file, its blocks are placed in the free list. The read-only system isn't provided with this information; therefore, a read of the same file causes the read-only system to follow the original data pointers and read blocks that are now on the free list!

Rather than risk such extreme consequences, it is better to use one of the many other options that exist. The selection of which option is driven by a combination of how often updated data must be made available to the other systems, and the size of the data sets involved. If the data is not updated too often, the most logical option is to make a copy of the file system and to provide the copy to other nodes. With point-in-time copy facilities such as Sun Instant Image, HDS ShadowImage, and EMC TimeFinder, copying a file system does not need to be an expensive operation. It is entirely reasonable to export a point-in-time copy of a UFS file system from storage to another node (for example, for backup) without risk because neither the original nor the copy is being shared. If the data changes frequently, the most practical alternative is to use NFS. Although performance is usually cited as a reason not to do this, the requirements are usually not demanding enough to warrant other solutions. NFS is far faster than most users realize, especially in environments that involve typical files smaller than 5–10 megabytes. If the application involves distributing rapidly changing large streams of bulk data to multiple clients, QFS/SW is a more suitable solution, albeit not bundled with the operating system.
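
For the read-mostly case described above, the NFS alternative can be as simple as a share entry on the owning server plus a read-only mount on each consumer; the path and access list shown here are placeholders.

    # On the server that owns the data (entry in /etc/dfs/dfstab)
    share -F nfs -o ro=webfarm /export/webdata

    # On each client
    mount -F nfs -o ro server1:/export/webdata /webdata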

Maintaining Metadata Consistency Among Multiple Systems With QFS Shared Writer

The architectural problem that prevents UFS file systems from being mounted on multiple systems is that there exist no provisions for maintaining metadata consistency among multiple systems. QFS Shared Writer (QFS/SW) implements precisely such a mechanism by centralizing access to metadata in a metadata server located in the network. Typically, this is accomplished using a split data and metadata path. Metadata is accessed through IP networks, while user data is transferred over a SAN.

All access to metadata is required to go over regular networks for arbitration by the metadata server. The metadata server is responsible for coordinating possibly conflicting access to metadata from varying clients. Assured by the protocol and the centralized server that metadata are consistent, all client systems are free to cache metadata without fear of catastrophic changes. Clients then use the metadata to access user data directly from the underlying disk resources, providing the most efficient available path to user data.

Direct Access Shared Storage

The direct access architecture offers vastly higher performance than existing network-sharing protocols when it comes to manipulating bulk data. This arrangement eliminates or greatly reduces two completely different types of overhead. First, data is transferred using the semantic-free SCSI block protocol. Transferring the data between the disk array and the client system requires no interpretation; no semantics are implied in the nature of the protocol, thus eliminating any interpretation overhead. By comparison, the equivalent NFS operations must use the NFS, RPC, XDR, TCP, and IP protocols, all of which are normally interpreted by the main processors.

The QFS/SW arrangement also eliminates most of the copies involved in traditional file sharing protocols such as NFS and CIFS. These file systems transfer data several times to get from the disk to the client's memory. A typical NFS configuration transfers from the disk array to the server's memory, then from the server's memory to the NIC, then across the network, and then from the NIC to the client's memory. (This description overlooks many implementation details.) In contrast, QFS/SW simply transfers data directly from the disk to the client's memory, once the metadata operations are completed and the client is given permission to access data. For these reasons, QFS/SW handles bulk data transfers at vastly higher performance than traditional file sharing techniques.

Furthermore, QFS/SW shares the on-disk format with QFS/local. In particular, user data can be configured with all of the options available with QFS/local disk groups, including striped and round-robin organizations. These capabilities make it far easier to aggregate data transfer bandwidth with QFS than with NFS, further increasing the achievable throughput for bulk data operations. User installations have measured single-stream transfers in excess of 800 megabytes per second using QFS or QFS/SW. One system has been observed transferring in excess of 3 gigabytes per second. (Obviously, such transfer rates require non-trivial underlying storage configurations, usually requiring 10–32 Fibre-Channel disk arrays, depending on both the file system parameters and array capability and configuration.) For comparison, the maximum currently achievable single-stream throughput with NFS is roughly 70 megabytes per second.

Handling of Metadata

The performance advantages of direct-client access to bulk data are so compelling that one might reasonably ask why data isn't always handled this way. There are several reasons. Oddly enough, one is performance, particularly scalability. Although performance is vastly improved when accessing user data using direct storage access, metadata operations are essentially the same speed for NFS and QFS/SW. However, the metadata protocols are not equivalent because they were designed for quite different applications. NFS scales well with the number of clients. NFS servers are able to support hundreds or thousands of clients with essentially the same performance. NFS was designed with sharing many files to many clients in mind, and it scales accordingly in this dimension, even when multiple clients are accessing the same files.

Scalability

QFS/SW was designed primarily for environments in which data sets are accessed by only a few clients, and engineering tradeoffs favor high-performance bulk transfer over linear client scalability. Empirical studies report that while QFS/SW scales well for small numbers of nodes (four to eight), scalability diminishes rapidly thereafter. To a large degree, this is to be expected: the bulk transfer capabilities provided are so high that a few clients can easily exhaust the capabilities of even high performance disk arrays.

Another consideration is that the efficiency of QFS/SW is fundamentally derived from its direct, low-overhead access to the storage. This necessarily limits the storage configurations to which it can be applied. QFS/SW is distinctly not a wide-area sharing protocol. Furthermore, as noted elsewhere in this article, the NFS and QFS/SW trust models are completely different. One of the key considerations in the efficiency of shared file systems is the relative weight of metadata operations when compared to the weight of data transfer operations. Most file sharing processes involve a number of metadata operations. Opening a file typically requires reading and identifying the file itself and all of its containing directories, as well as identifying access rights to each. The process of finding and opening /home/user/.csh requires an average of seven metadata lookup operations and nine client/server exchanges with NFS; QFS/SW is of similar complexity. Compared with the typical 70-kilobyte file in typical user directories, these metadata operations so dominate the cost of data transfer that even completely eliminating transfer overhead would have little material impact on the efficiency of the client/server system. The efficiency advantages of direct access storage are only meaningful when the storage is sufficiently accessible and when the data is large enough for the transfer overhead to overwhelm the cost of the required metadata operations.

Understanding How PxFS Interacts With Sun Cluster Software

One of Sun Cluster software's key components is the Proxy File System (PxFS). This is an abstraction used within a cluster to present to clients the illusion that all cluster members have access to a common pool of data. In particular, PxFS differs from more general file sharing protocols in two fundamental ways.

Most significantly, the PxFS differs in scope. NFS provides data access to a set of clients that are general in nature; the clients do not even need to run the same OS as the server, and the clients definitely have no requirement to participate in the same cluster as the server. PxFS has quite a different scope. It exports data only to other members of Sun Cluster.

The type of processing clients apply to data obtained through PxFS depends on how the data will be used. For example, if clients outside the cluster must have access to data residing in the cluster, client member nodes can export data through standard export protocols such as NFS and CIFS. In this case, a node would simultaneously be both a PxFS client and an NFS server.

PxFS data is normally transferred over the cluster interconnect. This is usually the highest performance interconnect available in the configuration and it usually delivers somewhat higher performance than typical local area network (LAN) interfaces. However, because most of the overhead in sharing data is due to underlying transports rather than to the sharing protocol itself, PxFS transfers over the cluster interconnect are often not as fast as users might expect, even using fast media such as InterDomain Networking, Myrinet, or WildFire.

The second main area of difference between PxFS and general file sharing protocols is intimacy between the cooperating nodes. PxFS is an integral part of Sun Cluster software; it is layered above native local file systems on each cluster node. The PxFS interface is fairly intimate and requires explicit changes in the underlying file system to accommodate the PxFS support calls. At the time of this writing, UFS and VxFS are the only local file systems supported under PxFS. (Other file systems might operate on cluster nodes, but their data is not necessarily available to other members of the cluster.)

Another basic design characteristic of PxFS is that it operates at the same trust level as the cluster software and the operating system itself. Clients and servers inherently trust each other. This is not true with either NFS or CIFS, and the level of trust between client and server is also higher than for QFS/SW.

FIGURE 1 Capabilities of Various Solaris Shared File Systems

The preceding figure illustrates the various capabilities and design tradeoffs for typical Solaris shared file systems. Indicated throughput is for a single thread running on a single client. Greater (often much greater) throughput can be obtained from a server, usually by configuring additional storage resources.

Understanding How Applications Interact With Different Types of File Systems

Although there are a wide variety of ways to categorize applications, one of the more straightforward methods is to consider the size of the data sets involved. This means both the size of individual data sets as well as the size of the logical groups of data sets; these correspond to the size of the files of significance and the file systems that will store them together. As we saw in the descriptions of each file system, one of the key engineering design tradeoffs that is made in the development of each file system is the way it handles scalability in the dimensions of size and number of clients.

In addition to the size of files, there are several other broad application categories that include the most common applications found in the Solaris environment. Database systems such as Oracle, DB2, MySQL, and Postgres contain a large proportion of all data on these servers. The applications that use the databases can be divided into two qualitatively different categories: online transaction processing (OLTP) and decision support (DSS) systems.

OLTP work is primarily concerned with recording and utilizing the transactions that represent the operation of a business. In particular, the dominant operations focus on individual transactions, which are relatively small. Decision support or business intelligence applications are concerned with identifying trends and patterns in transactions captured by OLTP systems. The key distinction is that DSS systems process groups of transactions, while OLTP systems deal with individual transactions.

For the most part, OLTP systems experience activity that is similar to small-file work, while DSS systems experience activity that is similar to large-file workloads. High performance technical computing (HPTC) is a basic category independent of the others. This group typically makes use of the largest files and usually also creates the highest demand on I/O bandwidth. The files that support various Internet service daemons, such as ftp servers, web servers, LDAP directories, and the like typically fall into the small-file category (even ftp servers). Occasionally, the concentrated access visibility will create higher than usual I/O rates to these files, but usually they are just more small files to a file system. One reason that ftp servers fall into this category, even if they are actually providing large files (for example, the 150–650 megabyte files associated with large software distributions such as StarOfficeTM, Linux, or Solaris) is that the ftp server's outbound bandwidth constrains individual I/O request rates for individual streams, causing physical I/O to be strongly interleaved. The result is activity that has the same properties as small-file activity rather than the patterns that are more commonly associated with files of this size.

The activities and I/O characteristics associated with these primary workload groups can be used to assign file systems to applications in a fairly straight-forward way.

Selecting a Shared File System

Deciding which shared file system to use is fairly straightforward because there is relatively little overlap between the main products in this space. The three primary considerations are as follows:

PxFS is the primary solution for sharing data within a cluster. Although NFS works within a cluster, it has somewhat higher overhead than PxFS, while not providing features that would be useful within a cluster.

QFS/SW is being evaluated in this application today, since the security model is common between Sun Cluster software and QFS/SW. Furthermore, Sun Cluster software and QFS/SW usually use the same interconnectivity model when the nodes are geographically collocated. Early indications suggest that QFS/SW might be able to serve a valuable role in this application with lower overhead, but such configurations are not currently supported because the evaluation is not complete.

For clients that are not clustered together, access to small files should use NFS unless the server that owns the files happens to run Windows. In this context, "small" means files that are less than a few megabytes or so.

The most common examples of small-file access are typical office automation home directories, software distribution servers (such as /usr/dist within Sun), software development build or archive servers, and support of typical Internet services such as web pages, mail, and even typical ftp sites. This last category usually occurs when multiple systems are used to provide service to a large population using a single set of shared data. The clearest example is the use of 10–20 web servers in a load-balanced configuration, all sharing the same highly available and possibly writable data.

CIFS can also be used in this same space, but the lack of an efficient kernel implementation in Solaris, combined with the lack of any CIFS client package6, means that CIFS will be used almost exclusively by Windows and Linux systems. Because server hardware and especially file system implementations scale to thousands of clients (NFS) or at least hundreds of clients (CIFS) per server, they are clearly preferred when processing is diverse and divided amongst many clients.

The other main application space is that of bulk data, applications in which the most interesting data sets are larger than a few megabytes and often vastly larger. Applications such as satellite download, oil and gas exploration, movie manipulation, and other scientific computations often manipulate files that are tens of gigabytes in size. The leading edge of these applications is approaching 100 terabytes per data set. Decision support or business intelligence systems have data set profiles that would place them in this category, but these systems do not typically use shared data.

Because of the efficiencies of handling bulk data through direct storage access, QFS/SW is the primary choice in this space. The nature of bulk data applications generally keeps data storage physically near the processing systems, and these configurations typically enjoy generous bandwidth. However, NFS is required in the uncommon case where Fibre Channel is not the storage interconnection medium or where disk interconnect bandwidth is quite limited. For example, if the distance between processor and storage is more than a few kilometers, NFS is the most suitable alternative.

A consideration that is receiving attention is security. Highly sensitive data might require substantial security measures that are not available in QFS/SW. Data privacy is typically addressed by encryption. Neither QFS/SW nor Fibre Channel has any effective encryption capability. This is hardly surprising given the performance goals associated with each of these technologies. In contrast, NFS can be configured with Kerberos-v5 for more secure authentication and/or privacy. Security can also be addressed at the transport level. The NFS transport is IP, which can optionally be configured with IPsec link-level encryption. No corresponding capability is available in Fibre Channel.
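
As a sketch, Kerberos protection is requested per share once the server and clients are configured in a Kerberos realm; sec=krb5 provides authentication only, krb5i adds integrity, and krb5p adds privacy (encryption). The path and server name below are placeholders.

    # Share with Kerberos authentication and privacy protection
    share -F nfs -o sec=krb5p,rw /export/secure

    # Clients request the same security mode at mount time
    mount -F nfs -o sec=krb5p server1:/export/secure /secure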

In addition to these implementation differences, there is a subtle architectural difference between QFS/SW and NFS. QFS/SW clients trust both the storage and the other clients. NFS clients only need to trust the server as long as the server is using a secure authentication mechanism.

Selecting a Local File System

As with shared file systems, deciding which local file system to apply to given applications centers on the size of the interesting files. The local file systems have the additional consideration of how large the file system is, and what proportion of it might be dead or inactive.

When the size of the important files is relatively small, any of the local file systems will handle the job. For example, files in most office automation home directories average around 70 kilobytes today, despite an interesting proportion of image files, MP3 files, massive StarOffice presentations, and archived mail folders.

How those files will be accessed is not a discriminator. The files might be used on the local server (such as on a time-sharing system) or exported through a file-sharing protocol for use on clients elsewhere in the network. Because logging UFS is a part of the Solaris OS, it is usually recommended for these applications, especially for the Solaris 9 OS and later with the availability of recent improvements such as hashed DNLC handling, logging performance, and optional suppression of access time updates.
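
A representative /etc/vfstab entry for such a file system might combine logging with suppressed access-time updates; the device names are placeholders, and noatime is appropriate only where last-access times are genuinely not needed.

    # Home-directory file system: logging for fast crash recovery, no atime updates
    /dev/dsk/c1t1d0s6  /dev/rdsk/c1t1d0s6  /export/home  ufs  2  yes  logging,noatime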

In the small-file arena, Sun recommends SAM functionality (for example, either QFS or SAMFS with the archiving features enabled) over logging UFS when the size of the file systems becomes truly massive or when the proportion of dead or stale files represents an interesting overhead cost. Although it might seem counter intuitive that small files could result in massive file systems, several fairly common scenarios result in such installations. One is sensor data capture, in which a large number of sensors each report a small amount of data. Another is server consolidation as applied to file servers. This particular application has been implemented within Sun's own infrastructure systems and has resulted in many very large file systems full of small, stale files.

One major exception to the rule of small files is the use of database tables in a file system. Because databases carefully manage the data within their tablespaces, the files visible to the file system have some special characteristics that make them particularly easy for relatively simple file systems to manage. The files tend to be constantly "used" in that the database always opens all of them, and the files are essentially never extended. Finally, the databases have their own caching and consistency mechanisms, so access is almost always quite careful and deliberate.

Operating under these assumptions, UFS is easily able to handle database tables used for OLTP systems. At one time (Solaris 2.3, 40-megahertz SuperSPARC®), the performance difference between raw disk devices and UFS tablespaces was quite substantial: performance was about 40 percent lower on file systems. Since that time, the constant evolution of the virtual memory system, the file system, and microprocessors has narrowed this gap to about 5 percent, an amount that is well worth paying for the ease of administration.
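
Where the database manages its own caching, the remaining file system overhead can often be reduced further by bypassing the page cache for tablespace I/O, for example by mounting the tablespace file system with the forcedirectio option; the device and mount point here are placeholders.

    # Bypass the UFS page cache for a tablespace file system
    mount -F ufs -o forcedirectio /dev/dsk/c2t0d0s0 /oradata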

Because the databases access many parts of the tablespaces, placing database tables on SAM file systems is rarely worthwhile. The access patterns result in files that are fully staged almost all the time; there is little or nothing to be gained for the complexity.

For a variety of reasons, file systems that contain large files not associated with database tables are best handled with QFS. Probably the most important consideration is linear (streaming) performance because QFS's storage organization is specifically designed to accommodate the needs of large data sets. The performance characteristics of the QFS storage organization also mean that databases hosted on file systems that are primarily used for decision-support applications will perform best on QFS rather than on UFS.
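
To give a concrete, if simplified, picture of that organization, the mcf fragment below defines a QFS file system with metadata on a dedicated device (type mm) and user data spread across separate devices (type mr); the device names and equipment ordinals are placeholders, and the details should be checked against the QFS documentation.

    # /etc/opt/SUNWsamfs/mcf (illustrative fragment)
    #
    # Equipment           Eq   Eq    Family  Device
    # Identifier          Ord  Type  Set     State
    #
    qfs1                  10   ma    qfs1    on
    /dev/dsk/c3t0d0s0     11   mm    qfs1    on
    /dev/dsk/c3t1d0s0     12   mr    qfs1    on
    /dev/dsk/c3t2d0s0     13   mr    qfs1    on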

As with small files, the intended use of the file is not particularly material. QFS is usually the most appropriate file system for large, data-intensive applications, whether the local system is doing the processing itself, as in a high performance computing (HPC) environment, or is exporting the data to some other client for processing. In particular, data-intensive NFS applications (such as geological exploration systems) should use QFS.

As one might expect, file systems that operate on very large data sets are prime candidates for SAM archiving; it does not take very many idle 50-gigabyte data sets to require a few more disk arrays to be purchased and managed. This is also the context in which segmented files are most useful. Because most of the overhead is in the manipulation of metadata rather than user data, there is very little additional work for the system to do, and the data moves through the staging mechanism much more smoothly. In fact, a reasonable default policy might be to segment all large files at 1-gigabyte boundaries. The 1-gigabyte segment size is a reasonable tradeoff between consumption of inodes (and especially their corresponding in-memory data structures) and effective use of the staging mechanism.
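
Assuming the segment(1) command provided with SAM-FS, such a policy might be applied to an individual file roughly as follows; the path is a placeholder, and the exact option syntax should be verified against the man page before use.

    # Set a 1-gigabyte segment size on a large data file (illustrative)
    segment -l 1g /sam1/projects/run42.dat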

Accelerating Operations With tmpfs

The one special file system whose application requires some explanation is tmpfs. As the name implies, tmpfs is intended for temporary data: it stores its data in virtual memory, it is volatile, and its contents do not persist across a reboot (whether intentional or not). The intent behind tmpfs is to accelerate operations on data that will not be saved permanently, for example, the intermediate files that are passed between the various phases of a compiler, sort temporary files, and editor scratch files. The running instances of these applications are lost in the event of a reboot, so the loss of their temporary data is of no consequence.

Even if data is temporary, there is no advantage to putting it in tmpfs if it is large enough to force the VM system to make special efforts to manage it. As a rule of thumb, tmpfs ceases to be productive when the data sets exceed about a third of the physical memory configured in the system. When the data sets get this big, they compete with the applications for physical memory, which lowers overall system performance. (Obviously, a truly massive number of small files that together exceed 35 percent of memory is also a problem.)
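
One way to keep tmpfs within such bounds is to cap its size explicitly at mount time; the mount point and size below are placeholders chosen only for illustration.

    # Mount a tmpfs limited to 512 megabytes
    mount -F tmpfs -o size=512m swap /scratch

    # Equivalent /etc/vfstab entry
    swap  -  /scratch  tmpfs  -  yes  size=512m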

In a very small number of cases, tmpfs has been used to accelerate specific database operations. This is a bit complex because it requires a tablespace to exist in the temporary file system before the database is brought up. In practice, the tablespace is created and initialized once, then copied to a reference location on stable storage. When the system boots, the reference copy is copied to a well-known location in a tmpfs, and then the database is started. This is a somewhat extreme solution, and most database systems are now capable of making direct use of very large (64-bit) shared memory pools, which removes the original motivation for the tmpfs technique. However, it might still be a useful trick for optimizing the performance of database engines that have not been extended to fully utilize 64-bit address spaces (notably MySQL and Postgres).
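
A minimal sketch of that boot-time sequence might look like the following, assuming a hypothetical reference tablespace stored under /var/db/reference and a dedicated tmpfs at /dbtmp; the database startup command is a placeholder for whatever mechanism the site actually uses.

    # Mount a dedicated tmpfs for the temporary tablespace
    mount -F tmpfs -o size=2048m swap /dbtmp

    # Copy the pre-initialized reference tablespace into place
    cp /var/db/reference/temp_ts.dbf /dbtmp/temp_ts.dbf

    # Start the database, which expects its tablespace at /dbtmp
    /etc/init.d/mydb start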

Conclusions

After proposing a taxonomy for classifying the many file systems that are available, this article has described some of the prominent design details of each of Sun's shared and local file systems. Each file system embodies design tradeoffs intended to address the peculiarities of specific target markets, and these considerations dictate how the products are most appropriately used. The most useful criterion for deciding which file system to use is usually the size of the data sets in question, with security being another notable consideration. Because most files are small, the familiar bundled file systems, UFS and NFS, are the most suitable choices in the majority of cases. When bulk data must be processed at high bandwidth, QFS/local and QFS/Shared Writer are almost always more appropriate. Finally, SAM capabilities can reduce the cost of the underlying storage substantially through policy-driven data migration.

About the Author

Brian Wong is a Sun Distinguished Engineer specializing in non-specialization. While earning an advanced degree in reality from the school of hard knocks, he has worked in the areas of capacity planning, expert systems development, deployment methodologies, operating systems, storage architecture, and systems analysis. A motorsport fanatic, he and his wife were most recently seen chasing chickens and ducks around their Virginia farm.

Ordering Sun Documents

The SunDocsSM program provides more than 250 manuals from Sun Microsystems, Inc. If you live in the United States, Canada, Europe, or Japan, you can purchase documentation sets or individual manuals through this program.

Accessing Sun Documentation Online

The docs.sun.com web site enables you to access Sun technical documentation online. You can browse the docs.sun.com archive or search for a specific book title or subject. The URL is http://docs.sun.com/

To reference Sun BluePrints OnLine articles, visit the Sun BluePrints OnLine Web site at: http://www.sun.com/blueprints/online.html
