
New Integration Architecture

The key enabling element of our architecture is that it moves responsibility for launching processes from the parallel environment to the RM, while preserving the environment's infrastructure. This solves the fundamental problem of giving the RM visibility into the parallel processes, and enables the proper monitoring found in a tight integration. Our approach requires that the RM export three basic capabilities:

  • Ability to list the set of allocated resources (hosts). This allows the parallel environment to start its own daemon infrastructure on the proper set of hosts.

  • Ability to launch a new process under RM control. The parallel environment calls the RM rather than forking a new process.

  • Ability for that process to identify its containing job. The launched process must rendezvous with the parallel environment, and needs a means of identifying itself to the environment.
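As a sketch only, the three capabilities above might be modeled as the following Python interface. The environment variable names used by the mock (PBS_NODEFILE, PBS_JOBID) come from the article's TABLE 1; the class names and the recorded-spawn behavior are hypothetical illustrations, not part of the actual Sun CRE plug-in specification:

```python
import os
from abc import ABC, abstractmethod

class RMPlugin(ABC):
    """Abstract form of the three capabilities the RM must export."""

    @abstractmethod
    def host_list(self):
        """Return the set of hosts allocated to this job."""

    @abstractmethod
    def job_id(self):
        """Return the RM's identifier for the containing job."""

    @abstractmethod
    def spawn(self, host, argv):
        """Ask the RM to launch argv on host under RM control."""

class MockPBSPlugin(RMPlugin):
    """Stand-in plug-in that reads PBS-style interfaces (see TABLE 1)."""

    def host_list(self):
        # PBS publishes the allocated hosts in a node file.
        with open(os.environ["PBS_NODEFILE"]) as f:
            return [line.strip() for line in f if line.strip()]

    def job_id(self):
        return os.environ["PBS_JOBID"]

    def spawn(self, host, argv):
        # A real plug-in would call the RM here (PBS uses tm_spawn());
        # the mock just records what would have been launched.
        return (host, argv)
```

In this sketch the parallel environment would call `host_list()` to set up its daemons, `spawn()` instead of forking, and `job_id()` so a launched process can identify itself.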

We find these capabilities to be ubiquitous among distributed resource management systems. TABLE 1 shows the interfaces that export these capabilities for our target systems [4, 5, 6, 7].

TABLE 1 Resource Manager Interfaces

Resource Manager     Host List       Job ID        Spawn Method
----------------     -------------   -----------   -------------
LSF                  LSB_HOSTS       LSB_JOBID     pam
Sun Grid Engine      PE_HOSTFILE     JOB_ID        qrsh -inherit
PBS                  PBS_NODEFILE    PBS_JOBID     tm_spawn()

We have defined abstract function call interfaces for each of these capabilities that are called during job startup, as we will describe shortly. The interfaces for a particular RM are collected into a single plug-in library that the parallel environment dynamically loads with dlopen(3). Thus, the environment is insulated from the implementation details of the RM. The plug-in library is small; it requires fewer than 250 lines of source code for each RM that we implement. The plug-in architecture allows customers and other third parties to implement tight integrations with additional RMs without modifying the source code of Sun CRE. The plug-in interface specification is described in the xrm.3 manual page in the HPC ClusterTools 5 software documentation.
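The dynamic-loading idea can be illustrated with a small Python analogue of dlopen(3). This is a sketch only: a real Sun CRE plug-in is a shared library with the xrm.3 interfaces, whereas here a hypothetical plug-in is written out as a module file and loaded with importlib; all names in the stand-in plug-in are invented for illustration:

```python
import importlib.util

# A tiny stand-in plug-in, expressed as Python source. A real plug-in
# would be a compiled shared library opened with dlopen(3).
PLUGIN_SOURCE = '''
def host_list():
    return ["node0", "node1"]

def job_id():
    return "42.pbs-server"

def spawn(host, argv):
    return (host, argv)
'''

def load_plugin(path):
    """Load an RM plug-in module from a file at runtime, insulating the
    caller from the RM's implementation details (the dlopen analogue)."""
    spec = importlib.util.spec_from_file_location("rm_plugin", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

The calling environment only needs the plug-in's entry-point names; which RM sits behind them is decided by which library file is loaded.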

The interaction between Sun CRE and PBS, shown in FIGURE 2, illustrates the specifics of this architecture. (The interaction with other RM systems is similar.)

FIGURE 2 Interaction Between Sun CRE and PBS During Job Startup

The user submits a job with the PBS qsub command, which contacts the pbsmom daemon in step 1. The pbsmom daemon forks the job script myjob.sh (step 2), which invokes an mprun command (step 3). In step 4, mprun reads the list of allocated hosts and performs the Sun CRE daemon setup on those hosts as shown in FIGURE 1 (these steps are elided in FIGURE 2 for clarity). Unlike in FIGURE 1, iod no longer forks a.out. In step 5, mprun invokes a launcher program, which calls the PBS tm_spawn function (step 6); this contacts another pbsmom daemon, which forks the a.out process in step 7. Lastly, the process determines its job ID, finds the iod associated with this job, and establishes a socket connection to that iod. This connection becomes the query socket over which the process obtains services from the parallel environment.

Although our PBS integration calls the launcher from the single mprun process, we could instead make simultaneous calls to the launcher from multiple iod instances, which yields better startup scalability. This is done in our Sun Grid Engine integration. An RM's ability to handle a distributed launch is a property in our plug-in interface.
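The final rendezvous step can be sketched in Python under stated assumptions: the launched process reads its job ID from PBS_JOBID (per TABLE 1), looks up the iod for that job, and opens the query socket. The `registry` mapping and the `iod_server` stub are hypothetical stand-ins for how the real process locates and talks to iod, which the article does not specify:

```python
import os
import socket

def iod_server(listener, seen):
    """Minimal stand-in for iod: accept one process's connection and
    read the job ID it announces."""
    conn, _ = listener.accept()
    data = b""
    while True:
        chunk = conn.recv(64)
        if not chunk:
            break
        data += chunk
    conn.close()
    seen.append(data.decode())

def rendezvous(registry):
    """What the spawned a.out does last: determine its job ID, find the
    iod associated with that job, and open the query socket."""
    job = os.environ["PBS_JOBID"]          # job ID interface from TABLE 1
    host, port = registry[job]             # hypothetical job-to-iod lookup
    sock = socket.create_connection((host, port))
    sock.sendall(job.encode())             # identify itself to the environment
    sock.close()
    return job
```

In the real system this socket then carries ongoing service requests; the sketch only performs the identification handshake.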

Our new architecture has a number of advantages:

  • Achieves a tight integration with multiple RMs in a uniform way. Differences between RM systems are abstracted in a plug-in library.

  • Allows the RM to have a parent-child relationship with the processes it manages.

  • Allows fast parallel startup of jobs.

  • Requires no modifications to the resource management software.

  • Gives the parallel environment and the RM visibility into a job. Commands of both systems can be used to view and signal jobs.

  • Makes the complete infrastructure of the parallel environment available to parallel jobs and tools.
