Home > Articles > Programming > General Programming/Other Languages

To Protect and Serve: Providing Data to Your Cluster

Feb 25, 2005

␡

Introduction to Cluster File Systems
The NFS
A Survey of Some Open-Source Parallel File Systems
Commercially Available Cluster File Systems
Cluster File System Summary

⎙ Print

Page 1 of 5 Next >

This chapter introduces some of the design issues associated with providing data to multiple compute slices in a cluster and presents some of the currently available solutions.

This chapter is from the book 

Building Clustered Linux Systems

Learn More Buy

Chapter Objectives

Introduce the concept of a cluster file system
Explore available file system options
Present examples of cluster file system configurations

The compute slices in a cluster work on pieces of a larger problem. Without the ability to read data and write results, the cluster's computation is of little value. Whether the cluster is a database cluster or a scientific cluster, the compute slices need coordinated access to the shared data that is "behind" their computations. This chapter introduces some of the design issues associated with providing data to multiple compute slices in a cluster and presents some of the currently available solutions.

14.1 Introduction to Cluster File Systems

When talking about cluster file systems we have to be careful about how we define the phrase. One simple definition might be: "a file system that is capable of serving data to a cluster." Another more complex definition might be: "a file system that allows multiple systems to access shared file system data independently, while maintaining a consistent view of that data between clients."

There is a subtle difference in the two definitions to which we need to pay close attention. The first definition might include network-based file systems like the NFS, which has only advisory locking and can therefore allow applications to interfere with each other destructively. The second definition tends to imply a higher level of parallelism coupled with consistency checks and locking to prevent collisions between multiple application components accessing the same data. Although the first definition is somewhat vague, the second definition might be said to refer to a "parallel file system."

14.1.1 Cluster File System Requirements

Because the I/O to a cluster's file system is being performed by multiple clients at the same time, there is a primary requirement for performance. Systems that run large parallel jobs often perform data "checkpointing" to allow recovery in the event that the job fails. There may be written requirements for a "full cluster checkpoint' within a given period of time. This has very direct requirements in terms of file system throughput.

If one assumes that the compute slices in the cluster must simultaneously write all memory to the disk for a checkpoint in a given length of time, you can compute the upper boundary requirement for the cluster file system's write performance. For example, 128 compute slices, each with 4 GB of RAM, would generate 512 GB of checkpoint data. If the period of time required for a clusterwide checkpoint is ten minutes, this yields 51.2 GB/min or 853 MB/s.

This is an extraordinarily high number, given the I/O capabilities of back-end storage systems. For example, a single 2-Gbps Fibre-Channel interface is capable of a theoretical maximum throughput of 250 MB/s. Getting 80% of this theoretical maximum is probably wishful thinking, but provides 200 MB/s of write capability. To "absorb" the level of I/O we are contemplating would take (at least) five independent 2 Gbps Fibre-Channel connections on the server. This ignores the back-end disk array performance requirements. Large, sequential writes are a serious performance problem, particularly for RAID 5 arrays, which must calculate parity information.

This example assumes a compute cluster performing sequential writes for a checkpoint operation. The requirements for database applications have different I/O characteristics, but can be just as demanding in their own right. Before we get too much ahead of ourselves, let's consider the general list of requirements for a cluster's file system. Some of the attributes we seek are

High levels of performance
Support for large data files or database tables
Multiuser and multiclient security
Data consistency
Fault tolerance

One might also add "consistent name space" to this list. This phrase refers to the ability for all clients to access files using the same file system paths. Some file systems provide this ability automatically (AFS, the DCE distributed file system [DFS], and Microsoft's distributed file system [dfs]), and others require the system administrator to create a consistent access structure for the clients (NFS, for example). We will need to watch for situations that might affect the client's view of the file system hierarchy.

There are some standard questions that we can ask when we need to analyze the architecture of a given cluster file system solution.

How is the file system storage shared?

How many systems can access the file system simultaneously?

How is the storage presented to client systems?

Is data kept consistent across multiple systems and their accesses, and if so, how?

How easy is the file system to administer?

What is the relative cost of the hardware and software to support the file system?

How many copies of the file system code are there and where does it run?

How well does the architecture scale?

These questions, and others, can help us to make a choice for our cluster's file system. Two primary types of file system architectures are available to us: network-attached storage (NAS) and parallel access. As we will see in the upcoming sections, each has their own fans and detractors, along with advantages and disadvantages.

14.1.2 Networked File System Access

Both NFS and CIFS are client–server implementations of "remote" network-attached file systems. These implementations are based on client-to-server RPC calls that allow the client's applications to make normal UNIX file system calls (open, close, read, write, stat, and so on) that are redirected to a remote server by the operating system. The server's local, physical file systems are exported via an Ethernet network to the server's clients.

The server's physical file systems, such as ext3, jfs, reiser, or xfs, which are exported via NFS, exist on physical storage attached to the server that exports them. The I/O between the network, system page cache, and the file blocks and meta-data on the storage is performed only by a single server that has the file system mounted and exported. In this configuration, the physical disks or disk array LUNs, and the data structures they contain, are modified by a single server system only, so coordination between multiple systems is not necessary. This architecture is shown in Figure 14-1.

Figure 14-1 A network-attached file system configuration

Figure 14-1 shows two disk arrays that present a single RAID 5 LUN each. Because each array has dual controllers for redundancy, the NFS servers "see" both LUNs through two 2-Gbps (full-duplex) Fibre-Channel paths. (It is important that the two systems "see" the same view of the storage, in terms of the devices. There are "multipath I/O" modules and capabilities available in Linux that can help with this task.) The primary server system uses the LUNs to create a single 2.8-TB file system that it exports to the NFS clients through the network. The secondary server system is a standby replacement for the primary and can replace the primary system in the event of a server failure. To keep the diagram from becoming too busy, the more realistic example of one LUN per array controller (four total) is not shown.

The NAS configuration can scale only to a point. The I/O and network interface capabilities of the server chassis determine how many client systems the NFS server can accommodate. If the physical I/O or storage attachment abilities of the server are exceeded, then the only easy way to scale the configuration is to split the data set between multiple servers, or to move to a larger and more expensive server chassis configuration. The single-server system serializes access to file system data and meta-data on the attached storage while maintaining consistency, but it also becomes a potential performance choke point.

Splitting data between multiple equivalent servers is inconvenient, mainly because it generates extra system administration work and can inhibit data sharing. Distributing the data between multiple servers produces multiple back-up points and multiple SPOFs, and it makes it difficult to share data between clients from multiple servers without explicit knowledge of which server "has" the data. If read-only data is replicated between multiple servers, the client access must somehow be balanced between the multiple instances of the data, and the system administrator must decide how to split and replicate the data.

In the case of extremely large data files, or files that must be both read and written by multiple groups of clients (or all clients) in a cluster, it is impossible to split either the data or the access between multiple NFS servers without a convoluted management process. To avoid splitting data, it is necessary to provide very large, high-performance NFS servers that can accommodate the number of clients accessing the data set. This can be a very expensive approach and once again introduces performance scaling issues.

Another point to make about this configuration is that although a SAN may appear to allow multiple storage LUNs containing a traditional physical file system to be simultaneously accessible to multiple server systems, this is not the case. With the types of local file systems that we are discussing, only one physical system may have the LUN and its associated file system data mounted for read and write at the same time. So only one server system may be exporting the file system to the network clients at a time. (This configuration is used in active standby high-availability NFS configurations. Two [or more] LUNs in a shared SAN may be mounted by a server in read–write mode, whereas a second server mounts the LUNs in read-only mode, awaiting a failure. When the primary server fails, the secondary server may remount the LUNs in read–write mode, replay any journal information, and assume the IP address and host name of the primary server. The clients, as long as they have the NFS file systems mounted with the hard option, will retry failed operations until the secondary server takes over.)

The SAN's storage LUNs may indeed be "visible" to any system attached to the SAN (provided they are not "zoned" out by the switches), but the ability to read and write to them is not available. This is a limitation of the physical file system code (not an inherent issue with the SAN), which was not written to allow multiple systems to access the same physical file system's data structures simultaneously. (There are, however, restrictions on the number of systems that can perform "fabric logins" to the SAN at any one time. I have seen one large cluster design fall into this trap, so check the SAN capabilities before you commit to a particular design.) This type of multiple-system, read–write access requires a parallel file system.

14.1.3 Parallel File System Access

If you imagine several to several thousand compute slices all reading and writing to separate files in the same file system, you can immediately see that the performance of the file system is important. If you imagine these same compute slices all reading and writing to the same file in a common file system, you should recognize that there are other attributes, in addition to performance, that the file system must provide. One of the primary requirements for this type of file system is to keep data consistent in the face of multiple client accesses, much as semaphores and spin locks are used to keep shared-memory resources consistent between multiple threads or processes.

Along with keeping the file data blocks consistent during access by multiple clients, a parallel cluster file system must also contend with multiple client access to the file system meta-data. Updating file and directory access times, creating or removing file data in directories, allocating inodes for file extensions, and other common meta-data activities all become more complex when the parallel coordination of multiple client systems is considered. The use of SAN-based storage makes shared access to the file system data from multiple clients easier, but the expense of Fibre-Channel disk arrays, storage devices, and Fibre-Channel hub or switch interconnections must be weighed against the other requirements for the cluster's shared storage.

Figure 14-2 shows a parallel file system configuration with multiple LUNs presented to the client systems. Whether the LUNs may be striped together into a single file system depends on which parallel file system implementation is being examined, but the important aspect of this architecture is that each client has shared access to the data in the SAN, and the client's local parallel file system code maintains the data consistency across all accesses.

Figure 14-2 A parallel file system configuration

A primary issue with this type of "direct attach" parallel architecture involves the number of clients that can access the shared storage without performance and data contention issues. This number depends on the individual parallel file system implementation, but a frequently encountered number is in the 16 to 64 system range. It is becoming increasingly common to find a cluster of parallel file system clients in a configuration such as this acting as a file server to a larger group of cluster clients.

Such an arrangement allows scaling the file system I/O across multiple systems, while still providing a common view of the shared data "behind" the parallel file server members. Using another technology to "distribute" the file system access also reduces the number of direct attach SAN ports that are required, and therefore the associated cost. SAN interface cards and switches are still quite expensive, and a per-attachment cost, including the switch port and interface card, can run in the $2,000- to $4,000-per-system range.

There are a variety of interconnect technologies and protocols that can be used to distribute the parallel file system access to clients. Ethernet, using TCP/IP protocols like Internet SCSI (iSCSI), can "fan out" or distribute the parallel file server data to cluster clients. It is also possible to use NFS to provide file access to the larger group of compute slices in the cluster. An example is shown in Figure 14-3.

Figure 14-3 A parallel file system server

The advantage to this architecture is that the I/O bandwidth may be scaled across a larger number of clients without the expense of direct SAN attachments to each client, or the requirement for a large, single-chassis NFS server. The disadvantage is that by passing the parallel file system "through" an NFS server layer, the clients lose the fine-grained locking and other advantages of the parallel file system. Where and how the trade-off is made in your cluster depends on the I/O bandwidth and scaling requirements for your file system.

The example in Figure 14-3 shows an eight-system parallel file server distributing data from two disk arrays. Each disk array is dual controller, with a separate 2-Gbps Fibre-Channel interface to each controller. If a file system were striped across all four LUNs, we could expect the aggregate write I/O to be roughly four times the real bandwidth of a 2 Gbps FC link, or 640 MB/s (figuring 80% of 200 MB/s).

Dividing this bandwidth across eight server systems requires each system to deliver 80 MB/s, which is the approximate bandwidth of one half of a full-duplex GbE link (80% of 125 MB/s). Further dividing this bandwidth between four client systems in the 32-client cluster leaves 20 MB/s per client. At this rate, a client with 8 GB of RAM would require 6.67 minutes to save its RAM during a checkpoint.

Hopefully you can begin to see the I/O performance issues that clusters raise. It is likely that the selection of disk arrays, Fibre-Channel switches, network interfaces, and the core switch will influence the actual bandwidth available to the cluster's client systems. Scaling up this file server example involves increasing the number of Fibre-Channel arrays, LUNs, file server nodes, and GbE links.

The examples shown to this point illustrate two basic approaches to providing data to cluster compute slices. There are advantages and disadvantages to both types of file systems in terms of cost, administration overhead, scaling, and underlying features. In the upcoming sections I cover some of the many specific options that are available.

Page 1 of 5 Next >

🔖 Save To Your Account

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Privacy Notice

Overview

Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information

To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information

Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security

Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children

This site is not directed to children under the age of 13.

Marketing

Pearson may send or direct marketing communications to users, provided that

Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
Such marketing is consistent with applicable law and Pearson's legal obligations.
Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information

If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out

Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information

Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents

California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure

Pearson may disclose personal information, as follows:

As required by law.
With the consent of the individual (or their parent, if the individual is a minor)
In response to a subpoena, court order or legal process, to the extent permitted or required by law
To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
To investigate or address actual or suspected fraud or other illegal activities
To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links

This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact

Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice

We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020

Email Address