User-Level Checkpointing Deployment

The submission of a checkpoint job to a Sun GE environment is similar to the submission of a regular job, with the addition of the following options to the qsub(1) command:

  • -ckpt checkpoint_env_name

  • -c [m|s|n|x]

FIGURE 6 shows how a checkpointing job is submitted.

FIGURE 6 Submitting a Checkpointing Application to the Sun GE Environment

In the Sun lab experiment, the -c x option was used so that the job would be checkpointed only when it was suspended. The Sun GE software supports other checkpointing occasions; consult the qsub(1) man page to choose the behavior that suits your application.
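As a rough sketch of the submission described above, the helper below assembles the qsub(1) command line and prints it for inspection rather than running it. The checkpointing environment name condor_ckpt comes from the migration procedure in this article; the helper name build_ckpt_qsub and the script name my_job.sh are hypothetical, and the check covers only the four -c values listed above.

```shell
#!/bin/sh
# Sketch: assemble a qsub(1) command line for a checkpointing job.
# build_ckpt_qsub and my_job.sh are hypothetical names used for illustration.

build_ckpt_qsub() {
    ckpt_env=$1    # checkpointing environment name, passed to -ckpt
    occasion=$2    # checkpointing occasion, passed to -c (m, s, n, or x)
    job_script=$3  # job script to submit

    # Accept only the occasion specifiers listed in this article.
    case $occasion in
        m|s|n|x) ;;
        *) echo "unsupported -c value: $occasion" >&2; return 1 ;;
    esac

    echo "qsub -ckpt $ckpt_env -c $occasion $job_script"
}

# Print the form used in the Sun lab experiment (checkpoint on suspension).
build_ckpt_qsub condor_ckpt x my_job.sh
```

On a host with Sun GE installed, the printed line can be pasted into a shell or passed to eval once it has been reviewed.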

Migration of Checkpointing Jobs

The Sun GE software provides several ways to initiate job migration. FIGURE 7 shows the framework of the migration feature. In the Sun lab experiment, two migration triggers were tested: suspending the job and suspending the queue.

FIGURE 7 Migrating a Checkpointing Application

You can use the following procedure to apply job migration for a checkpointing application.

To Migrate a Job Using the Sun GE Software

  1. Type the following qsub(1) command:

     qsub -ckpt condor_ckpt -c x ...

  2. Use the qmon graphical window to monitor the job execution on a particular queue.

  3. Open the qmon Job Control window, and suspend the job.

FIGURE 8 Job Control Window

The job then shows up on the queue of a second execution host.

  4. Suspend the job on the second host.

The job should migrate back to the queue of the first execution host and complete successfully. The migration feature was also tested by suspending the queue instead of the job; migration completed successfully in that case as well.
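The procedure above uses the qmon window. As a hedged command-line counterpart, the sketch below uses qmod -s to suspend a job and qstat to see which queue the job is assigned to afterward. This route was not part of the Sun lab experiment, so treating it as equivalent to the qmon steps is an assumption. The DRY_RUN switch, the run and suspend_job helpers, and the job ID 42 are hypothetical; by default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: suspend a checkpointing job from the command line.
# DRY_RUN, run, suspend_job, and the job ID 42 are hypothetical.

DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only, 0 = execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

suspend_job() {
    job_id=$1
    # Suspend the job; with -c x, suspension is the checkpoint trigger.
    run qmod -s "$job_id"
    # List jobs and their queues so that a move to another host is visible.
    run qstat
}

suspend_job 42
```

Setting DRY_RUN=0 on a Sun GE host runs the commands instead of printing them; the job ID is the one reported by qsub at submission time.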

Condor User-Level Checkpointing Limitations

The Condor user-level checkpointing libraries limit the kinds of jobs that they can transparently checkpoint and migrate. Some of these limitations are:

  • Multiprocess jobs are not supported. This includes system calls such as fork(), exec(), and system(). Consequently, MPI programs are not supported.

  • Interprocess communication is not supported. This includes pipes, semaphores, and shared memory.

  • Network communication must be short. A job may make network connections using system calls, such as socket(), but a network connection left open for long periods will delay checkpointing and migration.

  • Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. These signals are reserved by the Condor system. Sending or receiving all other signals is allowed.

  • Alarms, timers, and sleep calls are not allowed. This includes system calls such as alarm(), getitimer(), and sleep().

  • Multiple kernel-level threads are not supported. However, multiple user-level threads are supported.

  • Memory mapped files are not supported. This includes system calls such as mmap() and munmap().

  • File locks are allowed, but they are not retained between checkpoints.

  • All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning, but not an error.

  • A fair amount of disk space must be available on the submitting machine for storing checkpoint images. A checkpoint image is approximately equal in size to the virtual memory consumed by a job while it runs. If disk space is short, a special checkpoint server can be designated for storing all of the checkpoint images in a pool.
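Several of the limitations above name specific calls, so a quick text search of the job's C source can catch obvious problems before submission. The sketch below is an illustrative pre-flight check, not part of Condor or Sun GE: the function name check_ckpt_limits is hypothetical, the timer call is assumed to be the POSIX getitimer(), and the exec() entry is assumed to cover the usual execl/execv family. Because it is a plain pattern match, it can flag calls that appear only in comments or strings, and it cannot see calls made inside libraries.

```shell
#!/bin/sh
# Sketch: flag calls named in the limitations list in a C source file.
# check_ckpt_limits is a hypothetical helper; this is a plain text search.
# Usage: check_ckpt_limits my_job.c   (my_job.c is a placeholder name)

check_ckpt_limits() {
    src=$1
    if [ ! -r "$src" ]; then
        echo "cannot read: $src" >&2
        return 2
    fi

    # Match a listed name followed by "(", but not as part of a longer identifier.
    pattern='(^|[^A-Za-z0-9_])(fork|exec[lv]p?e?|system|alarm|getitimer|sleep|mmap|munmap)[[:space:]]*\('

    # Print matching lines with line numbers; return 1 if any were found.
    if grep -nE "$pattern" "$src"; then
        return 1
    fi
    return 0
}
```

A return status of 0 means only that none of the listed names were found; it does not establish that the job can be checkpointed.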
