Integrating a Resource Manager
Each RM system defines a command that is used to submit a job script, which is a shell script on UNIX™ systems. To submit a Sun MPI job, the user includes an mprun command in the job script. The arguments required by the mprun command might vary depending on the type of integration and the target resource management system.
Loose integrations are currently possible between all of the major RM systems and MPI implementations [1, 2]. In a loose integration, the RM reserves the parallel resources, creates a host list that is saved in a file or environment variable, and executes the job script on the first host in the list. The job script invokes the mprun command, passing the host file as an argument. A loose integration works adequately when all users voluntarily stay within queue limits, and when programs are bug free and never spin out of control; that is, not in this world! Even in a hypothetical world where everyone is well behaved, the system cannot keep accounting records that reflect actual usage.
The key to achieving a tight integration is making the RM aware of the parallel processes. One solution is for the RM to walk the process tree and find all descendants of the job script, which include mprun and its children. This works on a single-host cluster for versions of MPI in which the mpirun command creates the MPI processes directly using the fork and exec system calls. The public domain MPICH implementation is a good example [3]. PBS implicitly takes this approach by looking for all processes on a host with the same session ID. However, this approach fails when multiple hosts are involved, or when mpirun must contact a daemon in the parallel environment to cause the process launch.
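The session-ID technique can be pictured with a short sketch. Everything here (the script name and its output format) is our own illustration, assuming a ps(1) that supports the standard sess output keyword; it is not PBS source code.

```shell
#!/bin/sh
# list_session.sh -- hypothetical sketch of the PBS-style approach:
# find every process on this host that belongs to the same session
# as a given job script, using only the process table.

job_pid=${1:-$$}        # PID of the job script (default: this shell)

# The session ID of the job script; every descendant started without
# setsid() shares it, including mprun and any locally forked MPI ranks.
sid=$(ps -o sess= -p "$job_pid" | tr -d ' ')

# Walk the full process table and keep the PIDs whose session matches.
ps -e -o pid= -o sess= |
awk -v s="$sid" '$2 == s { print $1 }'
```

Run with no arguments, the script lists the members of the current session. As the text notes, this fails across hosts: MPI ranks launched on remote nodes never appear in the local process table, no matter how it is walked.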
An alternate solution is to have the parallel environment communicate process identifiers back to the RM. However, it is more difficult and less accurate to monitor and control processes that are not children of the monitoring agent, because system calls such as getrusage and nice cannot be used.
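Without getrusage on non-children, a monitoring agent must fall back on periodically sampling the process table, which is exactly the "less accurate" path described above. A minimal sketch follows; the script name and the default interval are our own, and ps output keywords can vary by platform.

```shell
#!/bin/sh
# poll_usage.sh -- hypothetical sketch of sampling-based monitoring for
# a process that is NOT our child, so wait()/getrusage() cannot report
# on it.  Usage: poll_usage.sh <pid> [interval-seconds]

pid=$1
interval=${2:-5}

# kill -0 only checks that the process exists; it sends no signal.
while kill -0 "$pid" 2>/dev/null; do
    # Sample cumulative CPU time and resident set size from the process
    # table.  Unlike getrusage(), this works for any visible PID, but
    # usage between samples is invisible, and nice(2)-style control of
    # the target is not possible from here.
    ps -o pid=,time=,rss= -p "$pid"
    sleep "$interval"
done
```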
Yet another solution is to eliminate the separate parallel environment and migrate the parallel environment features into the RM system. This was how we previously achieved a tight integration between Sun MPI and LSF. It was a large effort, and not one we wish to repeat with other resource management systems. Also, Sun CRE is still required because many sites do not use LSF, and there is an ongoing cost to maintaining the Sun HPC ClusterTools infrastructure in both implementations. It takes twice the effort to add features that require new infrastructure support, and regression testing costs also double. These are real considerations for a systems vendor. In addition to these practical considerations, it is not elegant to require multiple implementations of the parallel environment.
Resource Usage Examples
The differences between a loose and a tight integration can be illustrated with a few simple examples, for which we use PBS. (Examples using Sun Grid Engine software or LSF would show the same effects.) The cluster used for these examples consists of four Sun Enterprise™ 450 servers with four CPUs each. For now, ignore the differences in syntax for each example; they will be explained shortly.
In CODE EXAMPLE 1, a CPU-intensive MPI application is submitted for execution on 16 processors. Here we show results based on a loose integration:
CODE EXAMPLE 1 Inadequate Resource Accounting In a Loose Integration
% cat loose.sh
mprun -np 16 -m $PBS_NODEFILE reactdiff
% qsub -l nodes=4:ppn=4 loose.sh
...
% qstat -f | grep resources_used
    resources_used.cput = 00:00:00
    resources_used.mem = 5256kb
    resources_used.vmem = 7672kb
    resources_used.walltime = 00:12:17
We submit the job script loose.sh, and towards the end of the run we check the status of the job using qstat. Although the job has been running for 12 minutes, the CPU time shown is 0! This is because PBS can only see the mprun process, which is idle during the entire run.
Now see what happens with our tight integration in CODE EXAMPLE 2. Here, qstat shows that the total CPU time is approximately 16 times the wall-clock time, which is the desired result.
CODE EXAMPLE 2 Proper Resource Accounting In a Tight Integration
% cat tight.sh
mprun -np 16 reactdiff
% qsub -l nodes=4:ppn=4 tight.sh
...
% qstat -f | grep resources_used
    resources_used.cput = 02:41:38
    resources_used.mem = 138520kb
    resources_used.vmem = 250304kb
    resources_used.walltime = 00:11:54
An RM must be able to account for all used resources in order to enforce limits on consumption. This is shown in CODE EXAMPLE 3, in which each process of a parallel program allocates a block of memory.
CODE EXAMPLE 3 Resource Limit Enforcement
% qsub -l nodes=3:ppn=3,vmem=30mb -I
qsub: waiting for job 64.hpc-4 to start
qsub: job 64.hpc-4 ready
%
% mprun -np 9 memtest 10
Rank 0 on hpc-6: allocating 10 Mbytes
Rank 1 on hpc-6: allocating 10 Mbytes
Rank 2 on hpc-6: allocating 10 Mbytes
Rank 3 on hpc-5: allocating 10 Mbytes
[...]
=>> PBS: job killed: vmem 47570944 exceeded limit 31457280
Aborted.
The program is run in interactive mode using the PBS -I flag, under the tight integration, with the restriction that the job may use no more than 30 megabytes of aggregate memory. A short time after this limit is exceeded, PBS kills the job. (The job temporarily exceeds the limit, because PBS only measures resources at regular intervals.) A loose integration would not be able to detect this condition, and the program would continue to run. In CODE EXAMPLE 3, the memory restriction is specified by the user at job submission time, but the system administrator could also define the restriction as a property of the queue with equivalent results.
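One way to picture this interval-based enforcement is a periodic sweep that sums the virtual memory of every process in the job's session and compares the total against the limit. Everything in this sketch (the script name, the 10-second interval, and the use of GNU pkill to signal a session) is our illustration of the idea, not PBS source; the per-host totals would still have to be aggregated across nodes for a multi-host job.

```shell
#!/bin/sh
# enforce_vmem.sh -- hypothetical sketch of interval-based memory-limit
# enforcement.  Usage: enforce_vmem.sh <limit-kb> <session-id>

limit_kb=$1
sid=$2

while :; do
    # Sum the virtual memory (vsz, in Kbytes) of every process in the
    # session; this is the per-host component of the aggregate usage.
    total_kb=$(ps -e -o sess= -o vsz= |
               awk -v s="$sid" '$1 == s { sum += $2 } END { print sum + 0 }')

    if [ "$total_kb" -gt "$limit_kb" ]; then
        echo "=>> job killed: vmem ${total_kb}kb exceeded limit ${limit_kb}kb"
        pkill -s "$sid"     # GNU pkill: signal the whole session
        break
    fi

    # Sampling only at intervals is why a job can briefly overshoot
    # its limit before being killed.
    sleep 10
done
```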
The shortcomings of the loose integration schemes drove us to develop the following architecture.