HPC Administration Tips and Techniques
This article briefly introduces the features that are new in the Sun HPC ClusterTools 4 software and discusses administration practices for configuring that software successfully. The first practice, long requested by HPC customers, is giving regular HPC users root privileges so that they can maintain the Sun HPC ClusterTools software. The second practice relates to configuring mixed HPC clusters. This article also introduces the latest release of the Sun Grid Engine (Sun GE) software, release 184.108.40.206, and the Condor standalone user-level checkpointing library. Best practices are given for configuring a checkpointing and migration environment that uses both the Sun GE software and the Condor standalone checkpointing libraries.
Introduction to Grid Computing
The products covered in this article are among the fundamental components needed to build a grid infrastructure. Grid computing has been making headlines lately and is touted as the new computing paradigm for this decade because it can increase the return on your computing assets by using your existing hardware more effectively. The Sun GE software handles compute and resource management at the cluster level and provides the hooks required to access the computing grid through known application programming interfaces (APIs), such as Globus and Avaki. The Sun HPC ClusterTools 4 software provides the distributed parallel programming environment that enables users to execute their message passing interface (MPI) programs on a Sun UltraSPARC-based cluster. The Condor standalone libraries allow serial, single-threaded programs to be checkpointed for later restart if the need arises.
Sun HPC ClusterTools 4 Software Overview
The Sun HPC ClusterTools 4 software is designed specifically for compute-intensive, technical computing environments and enables the execution of serial and parallel high-performance applications. It provides middleware to facilitate and manage a workload of highly resource-intensive applications on Sun servers, as well as clusters of these servers. Additionally, it provides the software development environment for creating and debugging MPI applications that are parallelized and optimized for Sun servers and clusters.
The Sun HPC ClusterTools 4 software is the follow-on to the Sun HPC ClusterTools 3.1 release. Both versions can be installed on the same system, but only one release can be active at a time; the active release is selected by using a reconfig command. The Sun HPC ClusterTools 4 software includes the following new features:
Cluster nodes can span subnets.
Administrators can use the sudo utility to set superuser (root) privileges.
The software has been optimized for better visualization.
Loadable protocol modules are supported.
UltraSPARC III processors are supported.
The next-generation high-performance interconnect is supported.
The Sun HPC ClusterTools 4 software supports the Solaris™ 8 Operating Environment. It is also released under the Sun Community Source License. The Sun HPC ClusterTools 4 software consists of the following components:
Cluster runtime environment
Sun message passing interface
Sun Prism™ software programming environment
Scalable Scientific Subroutine library (S3L)
Parallel file system (PFS)
FIGURE 1 shows the software architecture of the Sun HPC ClusterTools 4 software.
FIGURE 1 Sun HPC ClusterTools Software Architecture
The Sun Cluster Runtime Environment (Sun CRE) is a principal component of the Sun HPC ClusterTools 4 software because it provides the job launching and load balancing capabilities for MPI-based C, C++, and Fortran programs. The Sun HPC ClusterTools 4 software supports up to 2048 processes and up to 64 nodes in a cluster. The software also supports Platform Computing's Load Sharing Facility (LSF) as a distributed resource manager; LSF provides batch queuing capabilities and integrated launching of MPI applications. The Sun GE software supports only the external launcher mechanism for MPI programs and is loosely integrated with the Sun HPC ClusterTools 4 software (refer to the Sun HPC ClusterTools 4 software documentation). A Portable Batch System (PBS) that is fully integrated with Sun CRE was recently made available through a patch. FIGURE 2 shows the current distributed resource management integration types with the Sun CRE software.
FIGURE 2 Distributed Resource Management Software Integration With CRE
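When planning a cluster, it can help to check a proposed job layout against the limits stated above (up to 2048 processes and up to 64 nodes). The following Python sketch illustrates such a check. It is not part of the Sun HPC ClusterTools software and does not call any Sun CRE interface; the names JobRequest and check_limits are hypothetical, and the sketch assumes the same number of processes on every node.

```python
from dataclasses import dataclass
from typing import List

# Limits stated in the text for the Sun HPC ClusterTools 4 software.
MAX_PROCESSES = 2048
MAX_NODES = 64


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical description of an MPI job; not a Sun CRE data structure."""
    nodes: int
    processes_per_node: int  # assumption: identical process count on each node

    @property
    def total_processes(self) -> int:
        return self.nodes * self.processes_per_node


def check_limits(job: JobRequest) -> List[str]:
    """Return a list of problems; an empty list means the job fits the limits."""
    problems = []
    if job.nodes < 1 or job.processes_per_node < 1:
        problems.append("nodes and processes_per_node must both be at least 1")
    if job.nodes > MAX_NODES:
        problems.append(
            f"{job.nodes} nodes exceeds the {MAX_NODES}-node limit"
        )
    if job.total_processes > MAX_PROCESSES:
        problems.append(
            f"{job.total_processes} processes exceeds "
            f"the {MAX_PROCESSES}-process limit"
        )
    return problems


if __name__ == "__main__":
    # 64 nodes with 32 processes each reaches the process limit exactly.
    for job in (JobRequest(64, 32), JobRequest(65, 1), JobRequest(64, 33)):
        issues = check_limits(job)
        print(job, "->", "OK" if not issues else "; ".join(issues))
```

A request for 64 nodes with 32 processes per node sits exactly at both limits and passes the check, whereas adding one more node or one more process per node causes it to be rejected.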