+++
title = "DMTCP Checkpointing"
description = "How to use the DMTCP utility to checkpoint your application."
+++

[DMTCP](http://dmtcp.sourceforge.net)
(Distributed MultiThreaded Checkpointing) is a checkpointing package for
applications. Checkpointing lets you resume a simulation that was
interrupted by a resource failure (e.g. hardware or software faults, or
exceeded time or memory limits).

DMTCP supports both sequential and multi-threaded applications. Examples
of programs on Linux distributions that can be used with DMTCP include
OpenMP, MATLAB, Python, Perl, MySQL, bash, gdb, and X-Windows.

DMTCP provides support for several resource managers, including SLURM,
the resource manager used at HCC. The DMTCP module is available on
Crane and is enabled by typing:

{{< highlight bash >}}
module load dmtcp
{{< /highlight >}}
  
After the module is loaded, the first step is to run the command:

{{< highlight bash >}}
[<username>@login.crane ~]$ dmtcp_launch --new-coordinator --rm --interval <interval_time_seconds> <your_command>
{{< /highlight >}}

where the `--rm` option enables SLURM support,
**\<interval_time_seconds\>** is the time in seconds between
automatic checkpoints, and **\<your_command\>** is the actual
command you want to run and checkpoint.
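
As a concrete illustration of the options above, the following sketch converts a checkpoint interval given in minutes into the seconds that `--interval` expects, then prints the resulting command for review. The application name `./my_simulation` and the 30-minute interval are placeholders chosen for this example, not values from this guide.

```shell
#!/bin/sh
# Hypothetical values; adjust them for your own job.
interval_minutes=30          # desired time between automatic checkpoints
app="./my_simulation"        # placeholder for <your_command>

# dmtcp_launch expects the interval in seconds.
interval_seconds=$(( interval_minutes * 60 ))

# Build the command line and print it so it can be checked before use.
launch_cmd="dmtcp_launch --new-coordinator --rm --interval $interval_seconds $app"
echo "$launch_cmd"
```

Once the printed command looks right, it can be pasted into a terminal or a submit file after `module load dmtcp`.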

Besides the general options shown above, more `dmtcp_launch` options
can be seen by using:

{{< highlight bash >}}
[<username>@login.crane ~]$ dmtcp_launch --help
{{< /highlight >}}

`dmtcp_launch` creates a few files that are used to resume the
cancelled job, such as *ckpt\_\*.dmtcp* and
*dmtcp\_restart\_script\*.sh*. Unless otherwise stated
(using the `--ckptdir` option), these files are stored in the current
working directory.
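
To keep checkpoint files out of the working directory, `--ckptdir` can point to a dedicated folder. The sketch below is one possible arrangement: the folder name `dmtcp_checkpoints` and the application `./my_simulation` are assumptions made for illustration, and the launch is skipped when DMTCP or the application is not available.

```shell
#!/bin/sh
# Assumption: checkpoint files are kept in a "dmtcp_checkpoints" folder under
# $WORK, falling back to the current directory if $WORK is not set.
ckpt_dir="${WORK:-$PWD}/dmtcp_checkpoints"
mkdir -p "$ckpt_dir"

app="./my_simulation"   # placeholder for <your_command>

# Launch only when DMTCP is loaded and the placeholder application exists.
if command -v dmtcp_launch >/dev/null 2>&1 && [ -x "$app" ]; then
    dmtcp_launch --new-coordinator --rm --interval 3600 \
        --ckptdir "$ckpt_dir" "$app"
else
    echo "Skipping launch: dmtcp_launch or $app not available" >&2
fi
```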

  
The second step of DMTCP is to restart the cancelled job, and there are
two ways of doing that:

-   `dmtcp_restart ckpt_*.dmtcp` *\<options\>* (before running
    this command, delete any old *ckpt\_\*.dmtcp* files in your current
    directory)

-   `./dmtcp_restart_script.sh` *\<options\>*

If no options are given in the *\<options\>* field, DMTCP
will keep running with the options defined in the initial
`dmtcp_launch` call (such as interval time, output directory, etc.).
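
The first restart method can be sketched as follows, with a check that checkpoint images are actually present before restarting. The helper `has_checkpoints` is a hypothetical function written for this example, and the new interval of 1800 seconds is an arbitrary value; confirm the options available in your installed version with `dmtcp_restart --help`.

```shell
#!/bin/sh
# has_checkpoints DIR: succeed if DIR holds at least one ckpt_*.dmtcp image.
# (Hypothetical helper, not part of DMTCP.)
has_checkpoints() {
    for f in "$1"/ckpt_*.dmtcp; do
        [ -e "$f" ] && return 0
    done
    return 1
}

if has_checkpoints .; then
    if command -v dmtcp_restart >/dev/null 2>&1; then
        # Restart and override the checkpoint interval used at launch time.
        dmtcp_restart --interval 1800 ckpt_*.dmtcp
    else
        echo "dmtcp_restart not found; load the dmtcp module first" >&2
    fi
else
    echo "No ckpt_*.dmtcp files in $PWD; nothing to restart" >&2
fi
```

Omitting `--interval 1800` would leave the interval from the original `dmtcp_launch` call in effect.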

  
A simple example of using DMTCP with
[BLAST]({{< relref "/applications/app_specific/bioinformatics_tools/alignment_tools/blast/running_blast_alignment" >}})
on Crane is shown below:

{{% panel theme="info" header="dmtcp_blastx.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_1.txt
#SBATCH --error=BlastX_error_1.txt
 
module load dmtcp
module load blast/2.4

cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/  
cp input_reads.fasta /tmp/

dmtcp_launch --new-coordinator --rm --interval 3600 blastx -query \
/tmp/input_reads.fasta -db /tmp/nr/nr -out blastx_output.alignments \
-num_threads $SLURM_NTASKS_PER_NODE
{{< /highlight >}}
{{% /panel %}}

In this example, DMTCP takes checkpoints every hour (`--interval 3600`),
and the actual command we want to checkpoint is `blastx` with
some general BLAST options defined with `-query`, `-db`, `-out`,
`-num_threads`.

If this job is killed for any reason, it can be restarted using the
following submit file:

{{% panel theme="info" header="dmtcp_restart_blastx.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_2.txt
#SBATCH --error=BlastX_error_2.txt

module load dmtcp
module load blast/2.4

cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/
cp input_reads.fasta /tmp/

# Start DMTCP
dmtcp_coordinator --daemon --port 0 --port-file /tmp/port
export DMTCP_COORD_HOST=`hostname`
export DMTCP_COORD_PORT=$(</tmp/port)

# Restart job 
./dmtcp_restart_script.sh
{{< /highlight >}}
{{% /panel %}}

{{% notice info %}}
`dmtcp_restart` generates new
`ckpt_*.dmtcp` and `dmtcp_restart_script*.sh` files. Therefore, if
the restarted job is also killed due to unavailable or exceeded resources,
you can resubmit the same job without any changes to the submit
file shown above. If you restart from the `ckpt_*.dmtcp` files instead
of `dmtcp_restart_script.sh`, remember to delete the old `ckpt_*.dmtcp`
files first.
{{% /notice %}}
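
The two submit files above differ only in their final step, so the choice between launching and restarting could in principle be made by the script itself. The sketch below shows that decision in isolation; `choose_action` is a hypothetical helper written for this example, and it only prints the command that would be run rather than running it.

```shell
#!/bin/sh
# choose_action DIR: print "restart" if DIR contains an executable
# dmtcp_restart_script.sh, otherwise print "launch".
# (Hypothetical helper, not part of DMTCP.)
choose_action() {
    if [ -x "$1/dmtcp_restart_script.sh" ]; then
        echo "restart"
    else
        echo "launch"
    fi
}

# Report what a combined submit file would do in the current directory.
case "$(choose_action .)" in
    restart)
        echo "Restart script found: would run ./dmtcp_restart_script.sh"
        ;;
    launch)
        echo "No restart script: would run dmtcp_launch --new-coordinator --rm --interval 3600 <your_command>"
        ;;
esac
```

In a real submit file, the `echo` lines would be replaced by the corresponding commands from the two examples above, including the coordinator setup in the restart case.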
  
Even though DMTCP tries to support most mainstream and commonly used
applications, there is no guarantee that every application can be
checkpointed and restarted.