Cluster Grid Example Implementations
In this section, sample Sun Cluster Grid implementations are outlined, demonstrating how to perform the following tasks:
Translate site-gathered information into grid architectural features.
Take advantage of features that enable the management of heterogeneous environments in terms of the compute tier.
Configure MPI parallel environments and OpenMP, serial, and interactive queues to maximize utilization.
Take advantage of the inherent scalability of the architecture, and the flexibility with respect to changing user demands.
Consider various network architectures to enhance the grid environment.
Scenario 1: General HPC Environment
This example represents a multi-departmental academic site that caters to 200 users with differing computational requirements. The information gathering stage is outlined below in terms of service provision and the three tiers.
In this example, a highly available service is not required, but system health monitoring is considered advantageous. The authentication scheme for the new cluster grid should integrate with the existing University scheme so that users can use their existing user names and passwords.
It is expected that the cluster grid will be configured to allow the compute tier to scale four-fold in terms of CPU count without further enhancement of the management tier.
A new storage facility will be associated with the cluster grid giving users significant disk space (at least 2 Gbytes per user) on high performance storage arrays. The total storage capacity can be increased within the year.
The DRM service is not currently required to implement a fair-share scheme across the departments, but this might be a future requirement.
Users are competent with UNIX and expect to have telnet, rlogin, and ftp access to the cluster grid storage system to configure jobs. Because the authentication scheme is to be integrated with the university's existing authentication scheme, the access nodes will be added to that scheme. A maximum of 50 users are expected to be simultaneously accessing cluster grid services.
A desktop machine with a monitor is required by the administrator for accessing the cluster grid management software.
The HPC resource must cater to mixed workloads including throughput jobs, highly parallel jobs as well as large memory and medium-scaling threaded or OpenMP jobs.
Besides the DRM, there is a requirement for the file storage to be served out to the compute tier using NFS. This storage should be fully backed up. Some health monitoring service should also be installed.
The applications are roughly characterized in FIGURE 4. There are large numbers of serial jobs and a number of shared-memory parallel applications (OpenMP and threaded). A few OpenMP applications scale to 20 processors, while the jobs scaling to 20 or more processors are MPI.
Data access patterns are highly varied, but in this case one key serial application is typified by highly read-intensive I/O. Runtimes for serial applications are typically 2 to 4 hours. The parallel application users have requested the ability to run for 24 hours on occasion, but day-to-day usage will involve runtimes of about 2 to 12 hours.
In this section, the information gathering results are translated into hardware and software implementation decisions (FIGURE 3).
FIGURE 3 Grid Hardware and Network Infrastructure for Scenario 1
Users will be allowed login access to one dual-CPU node (a Sun Blade™ 1000 workstation) in order to access the cluster grid file system and submit jobs. These machines must cater for file transfers to and from the storage system.
A Sun Blade 100 is supplied for the administrator's use.
A four-way Sun Fire V480 server hosts the SGE and SunMC servers. The internal disks are mirrored for added resilience. This configuration should allow for substantial future scaling of the compute tier. The NFS service is provided by the Sun Fire 6800 server, which is also used in the compute tier. The Sun Fire 6800 server will also provide full backup service through the use of a Sun StorEdge L25 tape library.
The applications are characterized in FIGURE 4. There are large numbers of serial jobs and a number of shared-memory-based parallel applications (OpenMP and threaded), as well as MPI applications scaling beyond 40 processors.
FIGURE 4 Approximate Workload for Example Site
A single Sun Fire 6800 server is provided primarily to supply compute power for those OpenMP jobs that scale beyond eight threads. It may also provide a platform for those MPI jobs that benefit from the SMP architecture. Smaller threaded jobs and serial jobs requiring greater than 1 Gbyte of memory will be catered to by three Sun Fire V880 servers.
For appropriate applications, price and performance considerations usually favor distributed-memory systems. In this example, while applications scaling up to 24 processors must be catered to using a shared-memory system, the high-scaling MPI applications can take advantage of a distributed-memory system such as a network of workstations (NOW). A Linux-based cluster, comprising x86-architecture processors packaged in a large number of dual-processor enclosures, caters to those MPI jobs scaling beyond 24 processors. The Linux cluster is interconnected with Myrinet, Gigabit Ethernet, and serial networks.
Shared storage across the cluster grid is hosted by a Sun Fire 6800 server on two Sun StorEdge™ T3 disk arrays implementing hardware RAID 5. The storage is hosted here so that it is direct-attached (and therefore has the lowest access latency) to the largest SMP server, which is likely to be the host that accesses the storage most frequently.
Serial, Gigabit Ethernet, and Myrinet interconnects all feature in the example solution, as shown in FIGURE 3. Gigabit Ethernet interconnects the management server with all compute hosts. A dedicated Myrinet interconnect provides a low-latency connection between the nodes of the Linux cluster.
This section describes the software as it is applied to the example implementation.
Sun Grid Engine
In this example, SGE is sufficient (rather than SGEEE) because fair-share scheduling across multiple departments is not required. If resource allocation based on departments or projects becomes a requirement in the future, upgrading to the Enterprise Edition is straightforward.
The SGE queue structure must cater to serial, OpenMP and MPI jobs.
Queues are implemented on a calendar basis across the cluster grid. In the daytime, short 2- or 4-hour queues are enabled on all machines covering parallel environments and serial queues. Each evening, a 12-hour queue is enabled in place of the daytime queues. On weekends, 24-hour queues are enabled for jobs with long runtimes.
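As a sketch of how this time-of-day switching can be expressed, SGE supports calendar definitions that enable or disable queues on a schedule. The calendar name below is illustrative, not taken from the example site's configuration:

```
# Illustrative SGE calendar definition: a queue attached to this
# calendar is disabled Monday-Friday between 06:00 and 20:00 (the
# default state for a listed period is "off"), so it runs only
# in the evening and overnight.
calendar_name    night
year             NONE
week             mon-fri=6-20
```

The calendar would be registered with qconf -Acal and then attached to the 12-hour evening queue through the queue's calendar attribute.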
Three parallel queues for OpenMP jobs exist:
20-slot PE on Sun Fire 6800 server
8-slot PE on Sun Fire V880(a) server
4-slot PE on Sun Fire V880(b) server
A total of 12 slots are available for serial jobs on the Sun Fire V880 servers (four slots on the Sun Fire V880(b) and eight slots on the Sun Fire V880(c)).
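A parallel environment (PE) for the shared-memory jobs might be defined along the following lines. The PE name and slot count correspond to the 20-slot case above, but the remaining values are illustrative defaults:

```
# Illustrative SGE parallel environment for OpenMP jobs on the
# Sun Fire 6800 server. allocation_rule $pe_slots forces all of a
# job's slots onto a single host, as shared-memory jobs require.
pe_name            openmp20
slots              20
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
```

A user would then request the PE at submission time, for example with qsub -pe openmp20 8 job.sh.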
CacheFS software is implemented on one of the Sun Fire V880 servers. The CacheFS file system encompasses the subdirectory, which holds the data that is prone to high read-intensive activity by one key serial application. A Sun Grid Engine queue is configured on this machine to run jobs requesting this application.
On the management host, four slots are allocated for users to run interactive jobs (leaving approximately four processors continuously available for system processes, the DRM, and SunMC).
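The interactive slots could be provided by a queue of type INTERACTIVE on the management host; the queue and host names below are illustrative:

```
# Illustrative interactive-only queue definition (abridged)
qname       interactive.q
hostname    mgmt-host
qtype       INTERACTIVE
slots       4
```

Users would reach these slots with qlogin or qsh rather than with batch qsub submissions.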
MPICH is installed across the Linux cluster to provide the runtime environment for MPI applications. MPICH can be tightly integrated with Sun Grid Engine.
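Tight integration is typically achieved through a PE that invokes the helper scripts shipped with SGE under $SGE_ROOT/mpi. The sketch below follows that stock template; the PE name and slot count are assumptions:

```
# Illustrative PE for tightly integrated MPICH jobs. The startmpi.sh
# -catch_rsh option substitutes SGE's rsh wrapper so that MPI slave
# tasks are started under SGE control and accounted for correctly.
pe_name            mpich
slots              64
start_proc_args    $SGE_ROOT/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     $SGE_ROOT/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
```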
Scenario 2: Minimal Cluster Grid Implementation
This scenario (FIGURE 5) demonstrates a minimal cluster grid implementation that uses the Sun Grid Engine resource manager to control a number of single processor machines. The following information pertains to the requirements placed upon the cluster grid.
FIGURE 5 Grid Hardware and Network Infrastructure for Scenario 2
The service provision requirement calls for a queueing system to optimize the utilization of a new compute farm. The new service should impose minimal disruption to users, and take into account moderate scaling of the compute layer for future growth. Storage should be integrated within the new system.
Users currently work using desktop workstations and these should be integrated into the access tier to allow access to file storage, job submissions, and so on. A maximum of 30 users are expected to simultaneously access the cluster grid.
Minimal health monitoring of the system and administrative access is needed from an existing administration workstation.
DRM and NFS services must be provided. An existing SunMC server will be used to monitor any servers which are critical to the grid service.
The HPC resource must cater to serial jobs only. A small number of floating point intensive applications are used. Jobs have memory requirements of approximately 500 Mbytes, with approximately ten percent of the jobs requiring up to 2 Gbytes. Average runtimes are around 30 minutes with little variation.
In this section, the information gathering results are translated into hardware and software implementation decisions.
The users' existing desktop workstations will be registered with SGE as submit hosts, allowing users to submit applications without using telnet and rlogin type commands. Access to the cluster grid file system, where users have file space, will be enabled using NFS.
Administrative access will be provided by registering the existing administration workstation as an admin host with SGE.
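Both registrations are single qconf operations; the host names here are hypothetical:

```
qconf -as desktop01    # add a user desktop as a submit host
qconf -ah adminws      # add the administration workstation as an admin host
```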
The management tier service is provided by a single two-way Sun Fire 280R server with a Sun StorEdge D1000 array directly attached. A SunMC agent will be installed on the 280R server so that it can be monitored by the existing SunMC server. The file space will be NFS-shared across both the compute layer and the users' desktops. The cluster grid network (between the compute and management layers) will be isolated from office traffic through an additional Gigabit Ethernet interface, which connects the 280R to the compute layer only through a switch.
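On Solaris, the NFS export from the 280R amounts to a one-line entry; the exported path below is an assumption for illustration:

```
# /etc/dfs/dfstab on the Sun Fire 280R (exported path is illustrative)
share -F nfs -o rw /export/home
```

Running shareall (or rebooting) makes the export active, after which the compute nodes and desktops can mount it.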
Serial connections to the compute tier will be installed along with a terminal concentrator to allow console access by the administrator.
The compute tier is composed of 20 Sun Fire V100 servers, each configured with 1 Gbyte of RAM. An additional four Sun Fire V100 systems will be configured with 2 Gbytes of RAM.
The SGE queue structure will be simple. This simplicity will be reflected in a fast scheduling loop which will improve throughput if very large numbers of jobs are submitted over a short time span. By default, the SGE installation results in single slot queues on single processor machines. Those jobs requesting greater than 1 Gbyte of RAM will be automatically scheduled to use the 2 Gbyte compute nodes. Further adjustments to time limits on these queues or tuning of the scheduler should be considered during the early stages of implementation.
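One way the memory-based placement could work at submission time is through the standard mem_free load value; the job script names are illustrative:

```
# Jobs needing more than 1 Gbyte request memory explicitly, so the
# scheduler places them only on nodes reporting enough free memory
# (in practice, the 2-Gbyte compute nodes):
qsub -l mem_free=1.5G large_job.sh

# Jobs with default requirements can be scheduled to any node:
qsub small_job.sh
```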