Issues in Selecting a Job Management System
The Job Management System (JMS) is one of the key software components needed to administer a typical compute site. Evaluating, selecting, and deploying a JMS has always been a daunting task for system administrators and data center managers. This article addresses this critical issue by giving the reader insight into how to evaluate and compare Job Management System products.
"Whose job you run in addition to when and where it is run, may be as important as how many jobs you run!" This expression typically defines the function of a Job Management System at large compute sites. No matter how fast computing power grows, it is inevitable that there is not enough to satisfy the user community which keeps always asking for more. The High Performance Computing (HPC) segment of the computing marketplace has long recognized this challenge and fostered a class of tools that promotes and effectively optimizes the sharing of resources across clusters of systems. These tools are frequently referred to as Resource Management Software, Queuing Systems, or Job Management System, which is the preferred name used throughout this article.
The reader is first introduced to the most important JMS features and requirements followed by a brief architectural overview of each JMS product. The article then presents a comparison of the three most popular JMSs available on Sun platforms and concludes with general recommendations on how to select the most appropriate JMS in a typical compute site. This article assumes that the reader is somewhat familiar with a particular JMS. The novice reader is strongly urged to consult the references in the bibliography when appropriate to be able to follow the contents of this document.
JMS Feature Classification
This section classifies the most important JMS features required by typical HPC sites. The requirements listed here have been distilled and modified from an extensive list published in the NAS technical report [7]. The curious reader will find a detailed description of each requirement in that NAS document. The features are classified into five separate categories:
Job Management System
Must operate in a heterogeneous multi-computer environment.
Should interoperate with OS-level checkpointing, enabling the JMS to restart a job from where it left off rather than from the beginning.
Must include a published API to every component of the JMS.
Must be able to enforce resource allocations and limits.
Must allow multiple instances and versions of the JMS software to run simultaneously on the cluster.
Should provide an administrative hierarchy.
Should be scalable.
Must integrate with the Sun HPC ClusterTools™ software.
Should make the source code available to the user community.
Must be compliant with the POSIX 1003.2d "Batch Queuing Extensions for Portable Operating Systems" standard.
Resource Manager Requirements
Must be parallel aware.
Must support MPI-based parallel programs.
Must support user-level checkpointing/restart.
Should provide a history log of all jobs.
Should provide asynchronous communication between application and Job Manager with a published API.
Must support an authentication/security system.
Should provide a mechanism to allow reservations of any resource.
Scheduler Requirements
Must be highly configurable.
Must provide well known scheduling policies such as first-in first-out, shortest job first, fair sharing, etc.
Must be separable from JMS. A site needs the ability to both modify and replace the scheduler.
Must schedule multiple resources simultaneously.
Must be able to change the priority, privileges, run order, and resource limits of all jobs, regardless of the job state.
Support for coordinated scheduling.
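The well-known policies named above (first-in first-out, shortest job first, fair sharing) can be illustrated with a minimal Python sketch. It is not taken from any JMS product: the Job fields, the sample jobs, and the past_usage figures are hypothetical, and the fair-share rule shown (users with less accumulated usage go first) is a deliberately simplified assumption.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    user: str
    submit_order: int   # position in the submission sequence
    est_runtime: int    # user-supplied runtime estimate, in minutes


def fifo(jobs):
    """First-in first-out: run jobs in submission order."""
    return sorted(jobs, key=lambda j: j.submit_order)


def shortest_job_first(jobs):
    """Shortest job first: the smallest runtime estimate runs first."""
    return sorted(jobs, key=lambda j: (j.est_runtime, j.submit_order))


def fair_share(jobs, past_usage):
    """Simplified fair share: users with less past usage go first."""
    return sorted(jobs, key=lambda j: (past_usage.get(j.user, 0), j.submit_order))


# Hypothetical workload, for illustration only.
jobs = [
    Job("a1", "alice", 0, 120),
    Job("b1", "bob", 1, 10),
    Job("a2", "alice", 2, 30),
]
past_usage = {"alice": 500, "bob": 50}  # hypothetical CPU-minutes already consumed

print([j.name for j in fifo(jobs)])                    # ['a1', 'b1', 'a2']
print([j.name for j in shortest_job_first(jobs)])      # ['b1', 'a2', 'a1']
print([j.name for j in fair_share(jobs, past_usage)])  # ['b1', 'a1', 'a2']
```

Because each policy is an ordinary function over the same job list, swapping one for another does not touch the rest of the system, which is the practical meaning of a scheduler that is separable from the JMS.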
Queuing System Requirements
Must support job migration: the ability to suspend a job or part of a job and move it to a different node or set of nodes for completion.
Must provide for restricting access to the batch system.
Must be fault tolerant and reliable.
Must provide support for complete administrative functionality.
Should provide support for jobs to be submitted from one cluster and run on another.
Should be resilient to dynamic reconfiguration of the system.
Must provide support for interactive-batch jobs.
Other Useful Capabilities
Should provide a graphical interface to all JMS components.
Should be able to support both hard and soft limits.
Should support job accounting.
Should provide wide area network support.
Should provide job array support.
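The distinction between hard and soft limits can be sketched as follows. The exact semantics vary from product to product; this Python sketch assumes one common interpretation, in which exceeding the soft limit only triggers a warning while reaching the hard limit terminates the job. The function name check_limit and the action labels are hypothetical.

```python
def check_limit(usage, soft_limit, hard_limit):
    """Return the action a JMS might take for a given resource usage.

    Assumed semantics: 'warn' once the soft limit is reached,
    'terminate' once the hard limit is reached, 'ok' otherwise.
    """
    if soft_limit > hard_limit:
        raise ValueError("soft limit must not exceed hard limit")
    if usage >= hard_limit:
        return "terminate"
    if usage >= soft_limit:
        return "warn"
    return "ok"


# Hypothetical wall-clock limits, in minutes.
for minutes in (30, 55, 61):
    print(minutes, check_limit(minutes, soft_limit=50, hard_limit=60))
```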
The feature requirements above form the basis for analyzing and comparing the three JMS packages in the JMS Comparative Analysis section.
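One way to put such a classification to work is to record each requirement with its category and its "Must" or "Should" status, then check a candidate product against the list. The Python sketch below does this under stated assumptions: only a handful of the requirements are encoded, the evaluate function and its pass rule (a product qualifies only if it meets every "Must" item) are illustrative choices, and product_x is a hypothetical product, not one of the packages compared in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    category: str
    text: str
    mandatory: bool  # True for "Must", False for "Should"


# A small subset of the requirements listed in this section.
REQUIREMENTS = [
    Requirement("Job Management System",
                "Operate in a heterogeneous multi-computer environment", True),
    Requirement("Job Management System", "Be scalable", False),
    Requirement("Resource Manager Requirements",
                "Support MPI-based parallel programs", True),
    Requirement("Queuing System Requirements",
                "Provide support for interactive-batch jobs", True),
    Requirement("Other Useful Capabilities", "Support job accounting", False),
]


def evaluate(supported):
    """Summarize a product, given the set of requirement texts it supports."""
    missing_must = [r.text for r in REQUIREMENTS
                    if r.mandatory and r.text not in supported]
    should = [r for r in REQUIREMENTS if not r.mandatory]
    should_met = sum(1 for r in should if r.text in supported)
    return {
        "qualifies": not missing_must,   # every "Must" item is satisfied
        "missing_must": missing_must,
        "should_met": should_met,
        "should_total": len(should),
    }


# Hypothetical product, for illustration only.
product_x = {
    "Operate in a heterogeneous multi-computer environment",
    "Support MPI-based parallel programs",
    "Support job accounting",
}

print(evaluate(product_x))
```

Keeping "Must" items as a pass/fail gate and counting "Should" items separately avoids letting many optional features mask a missing mandatory one.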