+++
title = "DMTCP Checkpointing"
description = "How to use the DMTCP utility to checkpoint your application."
+++

[DMTCP](http://dmtcp.sourceforge.net) (Distributed MultiThreaded Checkpointing) is a checkpointing package for applications. Checkpointing lets you resume a simulation that was interrupted by a resource failure (e.g. a hardware or software fault, or exceeding the time or memory limit). DMTCP supports both sequential and multi-threaded applications. Some examples of binary programs on Linux distributions that can be used with DMTCP are OpenMP, MATLAB, Python, Perl, MySQL, bash, gdb, X-Windows etc.

DMTCP provides support for several resource managers, including SLURM, the resource manager used at HCC. The DMTCP module is available on Crane and is loaded by typing:

{{< highlight bash >}}
module load dmtcp
{{< /highlight >}}

After the module is loaded, the first step is to run the command:

{{< highlight bash >}}
[<username>@login.crane ~]$ dmtcp_launch --new-coordinator --rm --interval <interval_time_seconds> <your_command>
{{< /highlight >}}

where the `--rm` option enables SLURM support, **\<interval_time_seconds\>** is the time in seconds between automatic checkpoints, and **\<your_command\>** is the actual command you want to run and checkpoint.

Besides the general options shown above, more `dmtcp_launch` options can be seen by using:

{{< highlight bash >}}
[<username>@login.crane ~]$ dmtcp_launch --help
{{< /highlight >}}

`dmtcp_launch` creates a few files that are used to resume the cancelled job, such as *ckpt\_\*.dmtcp* and *dmtcp\_restart\_script\*.sh*. Unless stated otherwise (using the `--ckptdir` option), these files are stored in the current working directory.

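
To confirm that checkpoints are actually being written, it can help to list the restart artifacts in the checkpoint directory. The following is a minimal sketch, not part of DMTCP: `list_ckpt_files` is a hypothetical helper name, and it only matches the file name patterns mentioned above.

```sh
#!/bin/sh
# Hypothetical helper (not a DMTCP command): list the checkpoint images and
# restart scripts found directly inside a directory (default: current one).
list_ckpt_files() {
    find "${1:-.}" -maxdepth 1 \
        \( -name 'ckpt_*.dmtcp' -o -name 'dmtcp_restart_script*.sh' \) | sort
}

# Example use from the directory where dmtcp_launch was started:
#   list_ckpt_files .
```

If the function prints nothing after the first checkpoint interval has passed, the job has most likely not produced a checkpoint yet, or the files were written elsewhere via `--ckptdir`.
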
The second step is to restart the cancelled job. There are two ways of doing that:

- `dmtcp_restart ckpt_*.dmtcp` *\<options\>* (before running this command, delete any old *ckpt\_\*.dmtcp* files in your current directory so that the wildcard matches only the checkpoint you want to restart from)
- `./dmtcp_restart_script.sh` *\<options\>*

If no options are given in the *\<options\>* field, DMTCP keeps running with the options defined in the initial **dmtcp\_launch** call (such as interval time, output directory etc).

A simple example of using DMTCP with [BLAST]({{< relref "/applications/app_specific/bioinformatics_tools/alignment_tools/blast/running_blast_alignment" >}}) on Crane is shown below:

{{% panel theme="info" header="dmtcp_blastx.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_1.txt
#SBATCH --error=BlastX_error_1.txt

module load dmtcp
module load blast/2.4

cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/
cp input_reads.fasta /tmp/

dmtcp_launch --new-coordinator --rm --interval 3600 blastx -query \
/tmp/input_reads.fasta -db /tmp/nr/nr -out blastx_output.alignments \
-num_threads $SLURM_NTASKS
{{< /highlight >}}
{{% /panel %}}

In this example, DMTCP takes a checkpoint every hour (`--interval 3600`), and the actual command we want to checkpoint is `blastx`, with some general BLAST options defined with `-query`, `-db`, `-out` and `-num_threads`.

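
To get a feel for how the checkpoint interval relates to the requested walltime, the arithmetic can be sketched in a few lines of shell. The function names `walltime_to_seconds` and `checkpoints_in_walltime` are hypothetical helpers, the sketch only handles the `HH:MM:SS` form used in the submit file above, and the result is an upper bound because writing each checkpoint also takes time.

```sh
#!/bin/sh
# Hypothetical helper: convert an HH:MM:SS walltime into seconds.
# awk reads the fields as decimal numbers, so "08" is treated as 8.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Hypothetical helper: number of whole checkpoint intervals that fit into
# the walltime. $1 is the walltime (HH:MM:SS), $2 the interval in seconds.
checkpoints_in_walltime() {
    echo $(( $(walltime_to_seconds "$1") / $2 ))
}

# Values taken from the submit file above: 50 hours, one checkpoint per hour.
checkpoints_in_walltime "50:00:00" 3600   # prints 50
```

With an hourly interval, at most about one hour of computation is lost when the job is killed. A shorter interval reduces that loss at the cost of more time spent writing checkpoints.
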
If this job is killed for any reason, it can be restarted using the following submit file:

{{% panel theme="info" header="dmtcp_restart_blastx.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_2.txt
#SBATCH --error=BlastX_error_2.txt

module load dmtcp
module load blast/2.4

cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/
cp input_reads.fasta /tmp/

# Start the DMTCP coordinator on a free port and record the port in a file
dmtcp_coordinator --daemon --port 0 --port-file /tmp/port
export DMTCP_COORD_HOST=$(hostname)
export DMTCP_COORD_PORT=$(</tmp/port)

# Restart job
./dmtcp_restart_script.sh
{{< /highlight >}}
{{% /panel %}}

{{% notice info %}}
`dmtcp_restart` generates new `ckpt_*.dmtcp` and `dmtcp_restart_script*.sh` files. Therefore, if the restarted job is also killed due to unavailable or exceeded resources, you can resubmit the same job again without any changes to the submit file shown above (just remember to delete the old `ckpt_*.dmtcp` files if you restart from these files instead of `dmtcp_restart_script.sh`).
{{% /notice %}}

Even though DMTCP tries to support most mainstream and commonly used applications, there is no guarantee that every application can be checkpointed and restarted.
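
Since the restart submit file can be resubmitted unchanged, one possible variation is a single submit file that decides at run time whether to launch the job fresh or restart it from a checkpoint. The sketch below is an assumption about how this could be done, not an HCC-provided script: `choose_start_mode` is a hypothetical helper that only checks whether `dmtcp_restart_script.sh` exists.

```sh
#!/bin/sh
# Hypothetical helper (not part of DMTCP): print "restart" when a DMTCP
# restart script exists in the given directory, otherwise print "launch".
choose_start_mode() {
    if [ -e "${1:-.}/dmtcp_restart_script.sh" ]; then
        echo restart
    else
        echo launch
    fi
}

# Possible use inside a submit file, with the commands shown earlier:
#   if [ "$(choose_start_mode .)" = restart ]; then
#       # start dmtcp_coordinator and export DMTCP_COORD_HOST/PORT first
#       ./dmtcp_restart_script.sh
#   else
#       dmtcp_launch --new-coordinator --rm --interval 3600 <your_command>
#   fi
```

This check assumes that a restart script left in the working directory belongs to the job you want to continue. Remove stale checkpoint files from earlier, unrelated runs before relying on it.
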