A New Open Resource Management Architecture in the Sun HPC ClusterTools Environment
This article presents a new architecture for the integration of the Sun HPC ClusterTools™ parallel computing environment with distributed resource management systems such as the Sun™ Grid Engine system. This new architecture achieves a tight integration with multiple distributed resource management systems in a uniform and extensible framework, which means that any of the popular management systems may be used to launch and monitor Sun™ MPI parallel jobs.
Unlike previously available loose integrations, tight integrations allow a resource manager (RM) to:
- Accurately measure resources used by the parallel processes
- Terminate jobs that exceed resource limits
- Generate accurate accounting information for multiprocess jobs
We have implemented tight integrations with Sun Grid Engine software, PBS from Veridian Systems, and LSF from Platform Computing. We provide examples that show correct resource accounting, how easily Sun MPI jobs can be launched and debugged under these systems, and the improvements in behavior that result from the tight integration.
This article consists of the following topics:
- "Introduction to Sun HPC ClusterTools"
- "Sun CRE Infrastructure"
- "Integrating a Resource Manager"
- "New Integration Architecture"
- "Usage"
- "Parallel Tool Support"
- "Other Features"
Introduction to Sun HPC ClusterTools
The Sun HPC ClusterTools environment enables users to develop, debug, and deploy multiprocess distributed-memory applications. The environment includes:
- A highly optimized version of the industry-standard MPI (Message Passing Interface)
- Sun™ Cluster Runtime Environment (Sun CRE) for launching and monitoring parallel jobs
- Prism™ multiprocess debugger
- Sun Scalable Scientific Subroutine Library
Applications developed with this environment are often put into production using a resource management system. High performance computing centers use distributed resource management systems such as Sun Grid Engine software to share resources fairly and accountably, and to guarantee that every job can obtain the resources it needs to run at maximum efficiency.
To properly monitor a job's resource consumption, the management system must be the agent that launches the job. However, these RM systems do not know how to launch multiprocess jobs such as MPI applications. Instead, they reserve the resources and then ask the native parallel environment to launch the job. As a result, the RM has no visibility into the new processes and cannot monitor their resource consumption. This means the RM cannot generate accurate accounting records and cannot terminate jobs that use more than their allowed share of resources. The result can be poor performance for all jobs on the system due to overcommitment of resources. This arrangement, in which the RM reserves resources but the parallel environment launches the processes, is commonly referred to as a loose integration between the RM and the parallel environment.
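The visibility gap in a loose integration can be illustrated with a small model. The following Python sketch is purely illustrative and is not part of Sun HPC ClusterTools or any resource manager: the `Process` and `Job` classes, the `known_to_rm` flag, and the functions `visible_cpu` and `over_limit` are all hypothetical names chosen for this example. It assumes a job whose launcher was started by the RM, while the MPI processes were started by the parallel environment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Process:
    pid: int
    cpu_seconds: float
    known_to_rm: bool  # hypothetical flag: True only if the RM itself started this process


@dataclass
class Job:
    job_id: str
    cpu_limit: float  # CPU seconds the job is allowed to consume
    processes: List[Process] = field(default_factory=list)


def visible_cpu(job: Job, tight: bool) -> float:
    """CPU seconds the RM can attribute to the job.

    In the loose case the RM sees only the processes it started itself.
    In the tight case it is assumed to know about every process in the job.
    """
    return sum(p.cpu_seconds for p in job.processes if tight or p.known_to_rm)


def over_limit(job: Job, tight: bool) -> bool:
    """True if the RM, given what it can see, would flag the job as over its limit."""
    return visible_cpu(job, tight) > job.cpu_limit


# One launcher started by the RM, two MPI processes started by the parallel environment.
job = Job(
    job_id="mpi-demo",
    cpu_limit=100.0,
    processes=[
        Process(pid=100, cpu_seconds=1.0, known_to_rm=True),
        Process(pid=101, cpu_seconds=60.0, known_to_rm=False),
        Process(pid=102, cpu_seconds=60.0, known_to_rm=False),
    ],
)

loose_cpu = visible_cpu(job, tight=False)
tight_cpu = visible_cpu(job, tight=True)
print("loose:", loose_cpu, over_limit(job, tight=False))
print("tight:", tight_cpu, over_limit(job, tight=True))
```

In this model the loosely integrated RM accounts only for the launcher's CPU time and never notices that the job has exceeded its limit, whereas the tightly integrated RM sees the full consumption and can act on it.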
Vendors have solved this problem by communicating the process information between the parallel environment and the RM, thus achieving a tight integration. For example, Sun provided a tight integration between Sun MPI and LSF in the Sun HPC ClusterTools 3.0 release. The disadvantage of this approach is that it tends to yield point solutions that are not easily extensible to additional resource management systems; yet it is critical that the parallel environment support all of the commonly used systems. The RM is an integral component of a production environment, and vendors cannot expect sites to adopt a different system as a prerequisite for running parallel jobs.