Sun GE Software Checkpointing Environment
In a Sun lab experiment, the following system configuration was used to test the Sun GE software checkpointing environment:
Two Sun Ultra™ 80 workstations were used. One was configured as master-submit-execution host and the other as an execution host. Both machines were loaded with the Solaris 8 OE, the Sun GE 220.127.116.11 software, the Forte™ 6U1 compilers, and the Condor 6.2.1 libraries.
The two Sun GE host queues were configured to support checkpointing.
The application was set up to checkpoint when the sge_execd daemon was shut down or when the job was suspended.
Sun GE was configured to reschedule the job if it was suspended.
The checkpoint signal was set to SIGTSTP because the Condor libraries use it to checkpoint the application. Alternatively, the SIGUSR2 signal could have been used; on SIGUSR2 the Condor libraries checkpoint the application and then let it continue its normal execution.
Finally, user-defined checkpointing was set. User-defined checkpointing means that the application periodically writes checkpoints without any intervention by the Sun GE software. At restart time, the application continues from the last checkpoint. FIGURE 4 shows the Checkpoint Configuration window used during the test.
FIGURE 4 Checkpoint Configuration Window in the qmon GUI
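The user-defined checkpointing behavior described above (periodic checkpoints written by the application itself, and a restart that continues from the last one) can be illustrated with a small shell sketch. This is only an illustration under stated assumptions, not the application used in the test: the function name run_with_checkpoints, the checkpoint file format (a single step counter), and the "work" loop are all hypothetical.

```shell
#!/bin/sh
# run_with_checkpoints CKPT_FILE TOTAL_STEPS INTERVAL   (hypothetical helper)
# Counts from 1 to TOTAL_STEPS and saves its progress every INTERVAL steps.
# If CKPT_FILE already exists, the run continues from the step stored in it.
run_with_checkpoints() {
    ckpt=$1
    total=$2
    interval=$3
    step=0

    # At restart time, continue from the last checkpoint if there is one.
    if [ -f "$ckpt" ]; then
        step=$(cat "$ckpt")
    fi

    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))
        # ... one unit of real work would be done here ...

        if [ $((step % interval)) -eq 0 ]; then
            # Write to a temporary file and rename it, so that an
            # interruption never leaves a half-written checkpoint.
            echo "$step" > "$ckpt.new" && mv "$ckpt.new" "$ckpt"
        fi
    done

    # Report the last step reached.
    echo "$step"
}
```

For example, `run_with_checkpoints /tmp/job.ckpt 1000 50` would write a checkpoint every 50 steps; if the job is rescheduled and the same command is run again, it resumes from the last saved step instead of starting over.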
Standalone Checkpointing Setup
The full Condor source code can be downloaded as a TAR file from:
For the Sun lab experiment, there was no need to install the entire Condor software, because only the lib subdirectory and the condor_compile command from the bin subdirectory were needed.
The condor_compile shell script needs to be modified at the following line:
CONDOR_LIBDIR=`condor_config_val LIB` to CONDOR_LIBDIR=full_path_of_lib
Where full_path_of_lib is the path to the highest level of the Condor lib subdirectory.
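The edit to condor_compile can also be scripted. The following is a minimal sketch, assuming the script sets CONDOR_LIBDIR in a single assignment at the start of a line (optionally indented); the helper name patch_condor_libdir and any paths passed to it are hypothetical.

```shell
#!/bin/sh
# patch_condor_libdir SCRIPT LIBDIR   (hypothetical helper)
# Rewrite the CONDOR_LIBDIR assignment in SCRIPT so that it points directly
# at LIBDIR instead of calling condor_config_val.
# LIBDIR must not contain the characters '|' or '&', which are special to
# the sed expression used below.
patch_condor_libdir() {
    script=$1
    libdir=$2
    tmp="$script.tmp.$$"

    # Replace the whole assignment line, keeping any leading indentation.
    if sed "s|^\([[:space:]]*\)CONDOR_LIBDIR=.*|\1CONDOR_LIBDIR=$libdir|" \
           "$script" > "$tmp"; then
        # Copy the contents back with cat so that the original file,
        # including its execute permission, is kept in place.
        cat "$tmp" > "$script"
        status=$?
    else
        status=1
    fi

    rm -f "$tmp"
    return $status
}
```

A call such as `patch_condor_libdir ./bin/condor_compile /opt/condor/lib` (both paths are examples only) would leave every other line of the script unchanged. It is worth keeping a copy of the original script before patching it.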
The above setup allows sequential applications to be checkpointed by using the user-level checkpointing Condor libraries.
Checkpointing Application Preparation
A normal application that needs to be checkpointed does not need source-level modifications. FIGURE 5 shows how to checkpoint an application. The application source or object files only need to be relinked with the Condor checkpointing libraries to take advantage of checkpointing and remote system calls. The Condor libraries provide an easy mechanism for performing the relink operation through the condor_compile command, as follows:
condor_compile -condor_standalone command [options|files ...]
Where command is any of cc, f77, f90, or ld, and where [options|files ...] are the normal arguments used by the compiler and linker.
FIGURE 5 Checkpointing an Application
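The relink step can be wrapped in a small helper that accepts only the commands listed above. This is a sketch under stated assumptions: the function name relink_standalone and the DRY_RUN switch are hypothetical conveniences of this example, not part of Condor, and the file names in the usage note are placeholders.

```shell
#!/bin/sh
# relink_standalone COMMAND [options|files ...]   (hypothetical helper)
# Run COMMAND through condor_compile -condor_standalone.
# Only cc, f77, f90, and ld are accepted as COMMAND.
# If DRY_RUN is set to 1, the command line is printed instead of executed.
relink_standalone() {
    case $1 in
        cc|f77|f90|ld) ;;
        *)
            echo "unsupported command: $1" >&2
            return 1
            ;;
    esac

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "condor_compile -condor_standalone $*"
    else
        condor_compile -condor_standalone "$@"
    fi
}
```

For example, `relink_standalone cc -o myapp myapp.c` would pass the usual compiler arguments through unchanged, so an existing build command only needs the wrapper placed in front of it.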