Introduction to the Sun Cluster Grid - Part 2
This article is a follow-up to the Sun BluePrints™ OnLine article titled "Introduction to the Cluster Grid Part 1", which described what a cluster grid is and the architecture of the Sun Cluster Grid stack.
This article takes the next step by describing the Sun Cluster Grid design phase, a process that involves information gathering followed by design implementation.
Information gathering involves defining the type and extent of service provision, and the type and mix of applications to be supported by the Sun Cluster Grid. The former affects the access and management tiers; the latter primarily affects the design of the compute tier.
This article is intended for IT professionals, system administrators, and anyone interested in understanding how to design a Sun Cluster Grid.
This section describes the services that can be implemented in the three logical tiers of the Sun Cluster Grid. Use this information in the information gathering stage when planning your Sun Cluster Grid.
The choice of which cluster grid services to provide depends on factors such as the requirements for security, service availability, manageability, and scalability. The only essential component of a Sun Cluster Grid is the distributed resource management (DRM) software, which provides a single point of access for job submission and controls the compute environment for the cluster grid. Additional services can be implemented in the tiers as follows:
Access tier: Authentication, web-based access, administration
Management tier: High availability, health monitoring, install management, hardware testing, NFS, license key management, backup management
Compute tier: MPI runtime environment, other runtime libraries
For each tier, the various services are discussed and reasons for providing (or not providing) the services are given.
The authentication schemes vary widely between implementations. In many cases, the cluster grid will run under an existing authentication scheme, so a new authentication scheme need not be implemented as a cluster grid service. In such cases, access to the cluster grid service can still be restricted by the administrator. The Sun™ Grid Engine (SGE) software integrates with the authentication services by automatically checking user credentials at job submission time. Access to the cluster grid can be restricted by explicitly denying (or enabling) user or group access to Sun Grid Engine software, or just certain SGE queues.
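As an illustration of this kind of restriction, the following qconf commands sketch how an administrator might create an access list and grant or deny users. The list and user names here are hypothetical, and the exact options available vary with the SGE version.

```shell
# Add user jsmith to the (hypothetical) access list "gridusers";
# the list is created if it does not already exist
qconf -au jsmith gridusers

# Remove the user from the list, denying further access
qconf -du jsmith gridusers
```

The resulting list can then be attached to the cluster configuration or to individual queues through their user_lists (or xuser_lists) attribute, so that only list members may submit jobs to the cluster or to those queues.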
If no web-service provision is needed, all user access to the cluster is usually through an SGE submit host. In some cases, the administrator designates users' desktop machines as submit hosts. Alternatively, the submit hosts could be physically under the control of the administrator (for example, in a secure data center) and reachable by users through commands such as telnet and ssh. The latter case would apply, for example, if the cluster grid is always accessed remotely, or if the cluster grid exists under its own authentication scheme.
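For example, assuming a hypothetical desktop host name, the administrator might register a user's workstation as a submit host on the SGE master, after which jobs can be submitted directly from that machine:

```shell
# On the SGE master host, declare the user's desktop a submit host
qconf -as desktop1.example.com

# On desktop1, the user can now submit a batch job
qsub myjob.sh
```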
At a minimum, this tier includes the SGE master. Other services can be implemented to increase reliability and availability, and to simplify management.
Health monitoring services are provided by Sun™ Management Center (SunMC) software. A minimal installation of SunMC software does not require agents to be installed on any hosts. In this case, the SunMC server still reports alive status for hosts on the network using SNMP ping. Installing the SunMC agent is particularly useful in maximizing the availability of management nodes that provide vital cluster grid services, and of NFS servers, or large SMP nodes. The agents report detailed information and can email the administrator to warn of low disk space, high swapping rates, or hardware failures and errors, and can be programmed to take prescribed corrective measures automatically. In a compute intensive environment, the decision to employ SunMC agents on smaller compute nodes depends on the perceived benefits given the inevitable small load that the agent introduces.
In large compute environments, particularly when based on large numbers of thin-node compute hosts, the Solaris Jumpstart™ environment should be used to facilitate installations. Custom scripts can be written to perform complete automated installations of new compute servers. Where large numbers of identically configured systems exist, Solaris™ Flash software becomes very efficient, allowing a single execution server image to be copied onto new servers in minutes. This is discussed in more detail in "Solaris Jumpstart and Flash Software" on page 16.
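A minimal sketch of the Flash-based approach, using hypothetical host and archive names: an archive is captured from one fully configured execution host, and a JumpStart profile then points new compute nodes at that archive.

```shell
# On a fully configured execution host, capture a Flash archive
# (the archive name and NFS path are assumptions for illustration)
flarcreate -n sge-exec-image /export/flash/sge-exec.flar

# A JumpStart profile for new compute nodes can then reference it:
#   install_type     flash_install
#   archive_location nfs install-server:/export/flash/sge-exec.flar
#   partitioning     explicit
```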
Both the Sun Grid Engine, Enterprise Edition software and the open-source version of Grid Engine provide facilities for multi-departmental control. Choose one of these rather than the Sun Grid Engine standard edition if advanced accounting and share-based resource allocation are needed. If there are plans to allow access to Sun Grid Engine software for external regional, national, or global grid users, the share-based scheme enables the local administrator to tightly control external use of the cluster grid.
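As a sketch of how external use might be controlled, the share tree below gives local users 90 percent and external grid users 10 percent of the resources. The node names and share values are purely illustrative; such a tree could be loaded from a file in the SGEEE share tree format with qconf -Astree.

```
id=0
name=Root
type=0
shares=1
childnodes=1,2
id=1
name=local
type=0
shares=90
childnodes=NONE
id=2
name=external
type=0
shares=10
childnodes=NONE
```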
The archiving and backup strategy usually does not encompass all the available storage in a cluster grid. Thin-node hosts in the compute tier usually hold only temporary data associated with running jobs and are typically excluded from backups. If the cost of backup hardware and software is to be minimized, users may be allocated a limited, backed-up storage space for home directories and a larger working directory for day-to-day use, which is not backed up.
For business critical implementations of the cluster grid, some high availability can be built in. High availability features can be implemented through HA clustering software such as Sun™ Cluster 3.0 software. The DRM and NFS services are usually the first candidates to be supported by an HA solution.
Compute Tier Application Support
The user applications that are to be supported by the cluster grid strongly affect the design of the compute tier. Both current and future applications should be characterized as well as possible. For each application that is to be supported, at least approximate answers to the following questions should be gathered:
Is the application a single-threaded, multi-threaded, or multiprocess application?
What data access patterns are expected?
What are the memory requirements?
What is the average runtime?
If the application is multiprocess, which message passing approach is implemented?
How does the application scale?
Also, information on how the applications are used from day-to-day can be important. For example, some applications in development and research environments require multiple test runs before final submission. In this case, it might be wise to provide some machines dedicated for interactive use.
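One way to dedicate machines for interactive use, as suggested above, is to configure an interactive-only SGE queue on those hosts. The queue and host names below are hypothetical, and queue attributes vary between SGE versions.

```shell
# Excerpt of a queue configuration created with qconf -aq:
#   qname     interactive.q
#   hostname  ws1
#   qtype     INTERACTIVE
# A user then starts an interactive session that SGE schedules
# onto an available interactive queue:
qsh
```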
Often, the cluster grid workload falls into one of the following categories, or some combination of them:
Throughput: Characterized by maximizing the execution rate of large numbers of independent, serial jobs.
On demand: Characterized by maximizing day-to-day utilization while enabling high-priority jobs to execute on demand.
Highly parallel: Characterized by minimizing the execution time for relatively small numbers of multiprocess jobs that scale beyond approximately 10 processors.