JMS Comparative Analysis

This section is devoted to comparing the three JMSs. This comparison will not stress the common features found in the three JMSs; rather, the discussion will be particularly focused on the features that differentiate them.

General JMS Features

TABLE 1 gives a brief comparison of the general features followed by a detailed explanation of how each JMS supports these features.

TABLE 1 General JMS Features Comparison

| Feature | SGE | LSF | PBS |
|---|---|---|---|
| Heterogeneous platform support | Unix flavors only | Unix & NT | Unix flavors only |
| Multi-cluster support | No | Yes | Somewhat |
| System-level checkpoint restart (when underlying OS supports it) | Yes | Yes | Yes |
| User-level checkpoint restart | Yes | Yes | No |
| Large computational grid support | Somewhat | Somewhat | No |
| Massive scalability (jobs & nodes) | Yes | Yes | Yes |
| Parallel job support with Sun HPC ClusterTools | Loose integration | Tight integration | Loose integration |
| Distribution format of end product | Binary and source | Binary only | Source |
| Freely available | Yes | No | Yes |
| POSIX 1003.2d compliance | Yes | No | Yes |


Platform offers the LSF MultiCluster product, which allows resource sharing among heterogeneous computer clusters. PBS allows multiple clusters to be linked together, which makes it possible to route and forward jobs between them. The restriction is that PBS enforces a single scheduling policy across all of the linked clusters. The Sun Grid Engine product is released mainly for use within one computer cluster because of the support issues that result when using the multi-cluster feature.

Before the checkpoint restart features are compared, it is necessary to define them. The system-level checkpoint restart feature allows a machine that crashes and later restarts to continue at the same point, without loss of data, as if no failure had occurred. This feature relies strictly on support from the underlying operating system. The user-level checkpoint restart feature, however, requires that applications link to user-level libraries so that they can restart transparently following a recovery from either a system or user failure.

Be aware that support for the checkpoint restart feature comes with limitations and restrictions. Checkpointed applications that had open socket or pipe connections, or private stack operations in progress, at the time of failure will not be able to recover properly and must be restarted from their initial point. An application written with sufficient sophistication can, however, define explicit checkpoints in its source code, which allow it to recover to the last checkpoint it reached. The reader is referred to the documentation of each JMS for the limitations of its user-level checkpointing feature.
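The idea of checkpoints written into the application source can be illustrated with a minimal sketch. It is not taken from any of the three JMS packages or their checkpoint libraries; the function names, the JSON state file, and the `fail_at` parameter are hypothetical and exist only to simulate a failure and a restart.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, ckpt_path, fail_at=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    If ckpt_path already exists, the computation resumes from the state
    stored there. fail_at (hypothetical) simulates a crash before a step.
    """
    state = {"next_step": 0, "partial_sum": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)

    for step in range(state["next_step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure before step {step}")
        state["partial_sum"] += step
        state["next_step"] = step + 1
        # Write to a temporary file, then rename it over the checkpoint,
        # so a crash during the write never leaves a half-written file.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)

    return state["partial_sum"]


def demo():
    """Run once with a simulated failure, then restart from the checkpoint."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "job.ckpt")
        try:
            run_with_checkpoints(10, path, fail_at=6)
        except RuntimeError:
            pass  # the "job" died; its checkpoint file survives
        return run_with_checkpoints(10, path)
```

The second call in `demo()` does not repeat steps 0 through 5; it picks up at step 6 from the saved state. State held in open sockets or pipes cannot be captured this way, which is consistent with the limitation described above.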

System-level checkpoint/restart is supported by each of the three JMS packages wherever it is supported by the operating system, such as on Cray's UNICOS or SGI's IRIX. As for the user-level checkpoint feature, PBS is the only product that does not currently include user-level checkpoint libraries.

An advanced version of PBS is called PBS Pro [9]. PBS Pro provides support for large-scale computational grids and an interface to the Globus Grid tool. LSF supports Grid environments through the LSF MultiCluster product. LSF is not integrated with Globus; however, Globus can send jobs to LSF through the Globus resource manager interface (GRAM). The Sun Grid Engine package is integrated with Globus, Legion, and Punch. Globus and Legion were supported with Codine, and they work with the Sun Grid Engine package as well. Punch has been integrated recently. Integration with these Grid packages is currently the Sun Grid Engine solution for multi-clustering.

All three JMSs claim to scale up to thousands of nodes. A comparison based on this criterion requires a separate study and is beyond the scope of this article.

All three packages support parallel jobs. LSF is tightly integrated with the Sun HPC ClusterTools software, which allows Sun MPI jobs to be submitted directly from the LSF environment. PBS and SGE, however, currently support only the external launcher mechanism for Sun MPI programs and are not yet fully integrated with the Sun HPC ClusterTools product. More information about this issue can be found in the paper presented last year [6] and in a paper that is due to be presented at the same conference [11].

As for the distribution format, PBS is currently provided as a source code distribution and LSF is distributed as binaries. The Sun Grid Engine package, however, is currently distributed both as a binary [12] and as source under the Source Code Community License [13]. The Sun Grid Engine package and PBS are available free of charge; LSF and PBS Pro are not.

PBS and the Sun Grid Engine package are both compliant with the POSIX 1003.2d "Batch Queuing Extensions for Portable Operating Systems" standard. However, Platform's LSF is not compliant with POSIX 1003.2d.

Scheduler

This section compares the scheduler component of the three JMS products. TABLE 2 summarizes the comparison and is followed by a detailed explanation.

TABLE 2 JMS Scheduler Comparison

| Feature | SGE | LSF | PBS |
|---|---|---|---|
| Known job scheduling policies | Yes | Yes | Yes |
| Independent scheduler component | Yes | No | Yes |


The three JMS schedulers have the capability to provide most of the known job scheduling policies such as First-In First-Out (FIFO), Fairshare, etc. The current version of the Sun Grid Engine package supports only the default option which implements FIFO.
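To make the difference between the two named policies concrete, here is a deliberately simplified sketch. It does not reproduce the scheduler of SGE, LSF, or PBS: the `Job` record, the `usage` table, and the rule "least past usage first" are illustrative assumptions, and production fair-share schedulers typically also weigh configured shares and decay usage over time.

```python
from collections import namedtuple

# Hypothetical job record: an id, the submitting user, and a submit time.
Job = namedtuple("Job", ["job_id", "user", "submit_time"])


def fifo_order(jobs):
    """First-In First-Out: dispatch strictly by submission time."""
    return sorted(jobs, key=lambda job: job.submit_time)


def fairshare_order(jobs, usage):
    """Simplified fair share: users who have consumed less go first.

    usage maps a user name to the resources that user has already
    consumed (in arbitrary units). Ties fall back to submission time.
    """
    return sorted(jobs, key=lambda job: (usage.get(job.user, 0), job.submit_time))


pending = [
    Job(1, "alice", 0),
    Job(2, "alice", 1),
    Job(3, "bob", 2),
]
past_usage = {"alice": 100, "bob": 10}

fifo_ids = [job.job_id for job in fifo_order(pending)]
fair_ids = [job.job_id for job in fairshare_order(pending, past_usage)]
```

With this input, FIFO yields the order 1, 2, 3, while the fair-share rule moves bob's job ahead of alice's because alice has already consumed more of the cluster.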

The high-level software architecture described earlier in this article shows that both PBS and the Sun Grid Engine package run the scheduler as a separate process. This makes the scheduler an easy candidate for substitution or enhancement. The LSF scheduler is integrated with the remaining software components, which prevents LSF from accepting a customized scheduler. Since PBS is distributed as source, it is possible to rewrite its scheduler. Similarly, the Sun Grid Engine package's open source code provides a framework [14] for modifying the scheduler.

Queuing System

TABLE 3 compares the queuing components of the three JMSs.

TABLE 3 JMS Queuing Component Comparison

| Feature | SGE | LSF | PBS |
|---|---|---|---|
| Support for interactive jobs | Yes | Yes | Yes |
| Job migration feature | Yes | Yes | Somewhat |
| Inter-cluster job launching | No | Yes | Yes |
| Logical queues concept | No | Yes | Yes |


The three JMSs support both interactive and batch jobs. They are all highly available and can survive system failures. They also allow the administrator to configure programs that the JMS runs before and after a job is executed.
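The before-and-after hooks can be pictured with a small sketch. It is not the configuration syntax or the behavior of any of the three products; `run_job`, its parameters, and the choice to skip the job when the pre-job program fails are assumptions made for illustration only.

```python
import subprocess
import sys


def run_job(job_cmd, prolog_cmd=None, epilog_cmd=None):
    """Run an optional pre-job program, the job, and an optional post-job program.

    Each command is an argument list as accepted by subprocess.run.
    Returns a list of (stage, exit_code) pairs in execution order.
    Assumed policy: if the pre-job program fails, nothing else runs.
    """
    log = []
    if prolog_cmd is not None:
        rc = subprocess.run(prolog_cmd).returncode
        log.append(("prolog", rc))
        if rc != 0:
            return log
    log.append(("job", subprocess.run(job_cmd).returncode))
    if epilog_cmd is not None:
        log.append(("epilog", subprocess.run(epilog_cmd).returncode))
    return log


# Stand-in commands: tiny Python one-liners instead of real site scripts.
ok = [sys.executable, "-c", "pass"]
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]

normal_run = run_job(ok, prolog_cmd=ok, epilog_cmd=ok)
blocked_run = run_job(ok, prolog_cmd=failing, epilog_cmd=ok)
```

A site would typically use such hooks for tasks like preparing scratch space before the job and cleaning it up afterward.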

Job migration is closely tied to the user-level checkpoint restart feature mentioned earlier. Both the Sun Grid Engine package and LSF allow jobs that have been checkpointed at user level to be migrated from one machine to another within the cluster. PBS currently does not provide this feature, because it does not yet support user-level checkpointing.

LSF and PBS allow a job to be launched from one cluster and executed on another. The Sun Grid Engine package does not provide this capability because it does not support multi-clustering.

The concept of queues in the Sun Grid Engine package differs from that of its counterparts. Sun Grid Engine queues are defined per host, whereas LSF and PBS provide logical queues that can dispatch jobs to one host or to multiple hosts. It is still possible to define logical queues in the Sun Grid Engine package using the complex feature, but having logical queues available by default would be preferable.
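The distinction between per-host and logical queues can be sketched as two ways of filling in the same mapping from queue name to hosts. The queue names, host names, slot counts, and the `dispatch` function below are hypothetical and do not correspond to the configuration format of any of the three products.

```python
def dispatch(queue_hosts, queue_name, free_slots):
    """Place one job from the named queue on the first host with a free slot.

    queue_hosts maps a queue name to the list of hosts it may use.
    free_slots maps a host name to its remaining job slots and is
    updated in place. Returns the chosen host, or None if all are busy.
    """
    for host in queue_hosts[queue_name]:
        if free_slots.get(host, 0) > 0:
            free_slots[host] -= 1
            return host
    return None


# Per-host model: every queue is bound to exactly one host.
per_host_queues = {"node1.q": ["node1"], "node2.q": ["node2"]}

# Logical model: one queue may dispatch to several hosts.
logical_queues = {"batch": ["node1", "node2"]}

slots = {"node1": 1, "node2": 1}
first = dispatch(logical_queues, "batch", slots)   # lands on node1
second = dispatch(logical_queues, "batch", slots)  # node1 is full, so node2
third = dispatch(logical_queues, "batch", slots)   # no free slot anywhere

busy = {"node1": 0, "node2": 1}
stuck = dispatch(per_host_queues, "node1.q", busy)  # cannot spill to node2
```

In the per-host model, a job submitted to a queue whose host is busy waits even though another host is idle, unless some higher-level mechanism chooses the queue on the user's behalf.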
