A New Open Resource Management Architecture in the Sun HPC ClusterTools Environment

This article presents a new architecture for integrating the Sun HPC ClusterTools parallel computing environment with distributed resource management systems such as the Sun Grid Engine system. The architecture achieves a tight integration with multiple distributed resource management systems in a uniform and extensible framework, which means that any of the popular management systems may be used to launch and monitor Sun MPI parallel jobs. Unlike previously available loose integrations, tight integrations allow a resource manager to accurately measure the resources used by the parallel processes, to terminate jobs that exceed resource limits, and to generate accurate accounting information for multiprocess jobs. Tight integrations are implemented with Sun Grid Engine software, PBS, and LSF. The article demonstrates correct resource accounting under these tight integrations and details how to launch and debug Sun MPI jobs with each system.
This article presents a new architecture for the integration of the Sun HPC ClusterTools™ parallel computing environment with distributed resource management systems such as the Sun™ Grid Engine system. This new architecture achieves a tight integration with multiple distributed resource management systems in a uniform and extensible framework, which means that any of the popular management systems may be used to launch and monitor Sun™ MPI parallel jobs.

Unlike previously available loose integrations, tight integrations allow a resource manager (RM) to:

  • Accurately measure resources used by the parallel processes

  • Terminate jobs that exceed resource limits

  • Generate accurate accounting information for multiprocess jobs

We have implemented tight integrations with Sun Grid Engine software, PBS from Veridian Systems, and LSF from Platform Computing. We provide examples that show correct resource accounting, the ease of launching and debugging Sun MPI jobs under these systems, and the improvements in behavior that result from the tight integration.

This article consists of the following topics:

  • "Introduction to Sun HPC ClusterTools"
  • "Sun CRE Infrastructure"
  • "Integrating a Resource Manager"
  • "New Integration Architecture"
  • "Usage"
  • "Parallel Tool Support"
  • "Other Features"

Introduction to Sun HPC ClusterTools

The Sun HPC ClusterTools environment enables users to develop, debug, and deploy multiprocess distributed-memory applications. The environment includes:

  • Highly optimized implementation of the industry-standard Message Passing Interface (MPI)

  • Sun™ Cluster Runtime Environment (Sun CRE) for launching and monitoring parallel jobs

  • Prism™ multiprocess debugger

  • Sun Scalable Scientific Subroutine Library

Applications developed with this environment are often put into production using a resource management system. High performance computing centers use distributed resource management systems such as Sun Grid Engine software to share resources fairly and accountably, and to guarantee that every job can obtain the resources it needs to run at maximum efficiency.

To properly monitor a job's resource consumption, the RM must be the agent that launches the job. However, resource managers generally do not know how to launch multiprocess jobs such as MPI applications. Instead, they reserve the resources and then ask the native parallel environment to launch the job. As a result, the RM has no visibility into the new processes and cannot monitor their resource consumption. It therefore cannot generate accurate accounting records, and it cannot terminate jobs that use more than their allowed share of resources. The result can be poor performance for all jobs on the system due to overcommitment of resources. An arrangement in which the RM reserves the resources but the parallel environment launches the processes is commonly referred to as a loose integration between the RM and the parallel environment.

Vendors have solved this problem by communicating the process information between the parallel environment and the RM, thus achieving a tight integration. For example, Sun provided a tight integration between Sun MPI and LSF in the Sun HPC ClusterTools 3.0 release. The disadvantage of this approach is that it tends to yield point solutions that are not easily extensible to additional resource management systems; yet it is critical that the parallel environment support all of the commonly used systems. The RM is an integral component of a production environment, and vendors cannot expect sites to adopt a different system as a prerequisite for running parallel jobs.
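
The difference between a loose and a tight integration can be illustrated with a small toy model. The sketch below is only an illustration under stated assumptions: the `Process` and `ResourceManager` classes, the `launch` function, and all of the numbers are hypothetical and do not correspond to any Sun CRE, Sun Grid Engine, PBS, or LSF interface. It assumes that the RM can account only for processes that have been reported to it, and that a tight integration amounts to the parallel environment reporting every process it starts.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """One process of a parallel job (hypothetical model, not a Sun CRE type)."""
    pid: int
    cpu_seconds: float


@dataclass
class ResourceManager:
    """Toy RM: it can only account for processes that were reported to it."""
    cpu_limit: float
    known: list = field(default_factory=list)

    def register(self, proc):
        # Record a process so its usage counts toward the job's total.
        self.known.append(proc)

    def accounted_cpu(self):
        # CPU time the RM believes the job has consumed.
        return sum(p.cpu_seconds for p in self.known)

    def over_limit(self):
        # True if the RM would consider the job to exceed its CPU limit.
        return self.accounted_cpu() > self.cpu_limit


def launch(rm, ranks, tight):
    """Start a job of `ranks` worker processes, each using 10 CPU-seconds.

    The RM always sees the launcher it started itself. The workers are
    started by the parallel environment, so the RM sees them only when the
    integration is tight and they are reported back to it.
    """
    launcher = Process(pid=100, cpu_seconds=0.5)
    rm.register(launcher)
    workers = [Process(pid=101 + i, cpu_seconds=10.0) for i in range(ranks)]
    if tight:
        for worker in workers:
            rm.register(worker)
    return workers


# Same job and same limit, run under a loose and a tight integration.
loose_rm = ResourceManager(cpu_limit=20.0)
launch(loose_rm, ranks=4, tight=False)

tight_rm = ResourceManager(cpu_limit=20.0)
launch(tight_rm, ranks=4, tight=True)

print(loose_rm.accounted_cpu(), loose_rm.over_limit())  # 0.5 False
print(tight_rm.accounted_cpu(), tight_rm.over_limit())  # 40.5 True
```

In this model the loosely integrated RM charges the job only for the launcher and never notices that the limit has been exceeded, while the tightly integrated RM sees the usage of all four workers and could terminate the job.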
