+++
title = "DMTCP Checkpointing"
description = "How to use the DMTCP utility to checkpoint your application."
+++
[DMTCP](http://dmtcp.sourceforge.net)
(Distributed MultiThreaded Checkpointing) is a checkpointing package for
applications. Checkpointing allows a simulation to be resumed after it
fails because of a resource problem (e.g. a hardware or software failure,
or exceeded time or memory limits).
DMTCP supports both sequential and multi-threaded applications. Some
examples of binary programs on Linux distributions that can be used with
DMTCP are OpenMP, MATLAB, Python, Perl, MySQL, bash, gdb, X-Windows etc.
DMTCP provides support for several resource managers, including SLURM,
the resource manager used in HCC. The DMTCP module is available on
Crane and is enabled by typing:
{{< highlight bash >}}
module load dmtcp
{{< /highlight >}}
After the module is loaded, the first step is to run the command:
{{< highlight bash >}}
[<username>@login.crane ~]$ dmtcp_launch --new-coordinator --rm --interval <interval_time_seconds> <your_command>
{{< /highlight >}}
where the `--rm` option enables SLURM support,
**\<interval_time_seconds\>** is the time in seconds between
automatic checkpoints, and **\<your_command\>** is the actual
command you want to run and checkpoint.
Besides the general options shown above, more `dmtcp_launch` options
can be seen by using:
{{< highlight bash >}}
[<username>@login.crane ~]$ dmtcp_launch --help
{{< /highlight >}}
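
Since `--interval` expects seconds, it can be convenient to derive the value from a more readable duration. The following is a minimal sketch only: `to_seconds` is a hypothetical helper (not part of DMTCP) that converts an `HH:MM:SS` string, and the command is printed rather than executed.

```bash
#!/bin/sh
# to_seconds is a hypothetical helper, not a DMTCP command.
# It converts an HH:MM:SS duration into seconds for the --interval option.
to_seconds() {
    # awk reads each field as a decimal number, so "08" is handled correctly
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# One hour between automatic checkpoints
interval=$(to_seconds "01:00:00")

# Print the resulting command instead of running it
echo "dmtcp_launch --new-coordinator --rm --interval ${interval} <your_command>"
```

Replace `<your_command>` with the program to checkpoint before running the printed command.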
`dmtcp_launch` creates a few files that are used to resume the
cancelled job, such as *ckpt\_\*.dmtcp* and
*dmtcp\_restart\_script\*.sh*. Unless otherwise specified
(using the `--ckptdir` option), these files are stored in the current
working directory.
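
As an illustration of the `--ckptdir` option, the sketch below keeps checkpoint files in a dedicated directory instead of the working directory. `launch_with_ckptdir` and the directory layout are assumptions made for this example, and the one-hour interval is an arbitrary choice.

```bash
#!/bin/sh
# launch_with_ckptdir is a hypothetical wrapper, not part of DMTCP.
# Usage: launch_with_ckptdir <checkpoint_dir> <your_command> [args]
launch_with_ckptdir() {
    ckpt_dir=$1
    shift
    # Make sure the checkpoint directory exists before launching
    mkdir -p "$ckpt_dir" || return 1
    if command -v dmtcp_launch >/dev/null 2>&1; then
        dmtcp_launch --new-coordinator --rm --interval 3600 \
            --ckptdir "$ckpt_dir" "$@"
    else
        echo "dmtcp_launch not found; run 'module load dmtcp' first" >&2
        return 127
    fi
}

# Example call (assumed layout: one directory per SLURM job):
# launch_with_ckptdir "$PWD/checkpoints/$SLURM_JOB_ID" ./my_program
```

Here `./my_program` is a placeholder for the command to checkpoint.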
The second step is to restart the cancelled job, and there are
two ways of doing that:
- `dmtcp_restart ckpt_*.dmtcp` *\<options\>* (before running
  this command, delete any old *ckpt\_\*.dmtcp* files in your current
  directory)
- `./dmtcp_restart_script.sh` *\<options\>*

If no options are given in the *\<options\>* field, DMTCP
will keep running with the options defined in the initial
`dmtcp_launch` call (such as interval time, output directory etc).
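
The choice between the two restart methods can be sketched as a small check on which files are present. `pick_restart_cmd` is a hypothetical helper that only prints the matching command. Preferring the restart script when both kinds of file exist is an assumption of this sketch, not a DMTCP rule.

```bash
#!/bin/sh
# pick_restart_cmd is a hypothetical helper, not part of DMTCP.
# It prints the restart command that matches the files in a directory.
pick_restart_cmd() {
    dir=${1:-.}
    if [ -x "$dir/dmtcp_restart_script.sh" ]; then
        # The generated restart script is present and executable
        echo "./dmtcp_restart_script.sh"
    elif ls "$dir"/ckpt_*.dmtcp >/dev/null 2>&1; then
        # Only checkpoint images are present
        echo "dmtcp_restart ckpt_*.dmtcp"
    else
        # Nothing to restart from
        echo "none"
    fi
}

# Report what could be used in the current working directory
pick_restart_cmd .
```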
A simple example of using DMTCP with
[BLAST]({{< relref "/applications/app_specific/bioinformatics_tools/alignment_tools/blast/running_blast_alignment" >}})
on Crane is shown below:
{{% panel theme="info" header="dmtcp_blastx.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_1.txt
#SBATCH --error=BlastX_error_1.txt
module load dmtcp
module load blast/2.4
cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/
cp input_reads.fasta /tmp/
dmtcp_launch --new-coordinator --rm --interval 3600 blastx -query \
/tmp/input_reads.fasta -db /tmp/nr/nr -out blastx_output.alignments \
-num_threads $SLURM_NTASKS
{{< /highlight >}}
{{% /panel %}}
In this example, DMTCP takes checkpoints every hour (`--interval 3600`),
and the actual command we want to checkpoint is `blastx` with
some general BLAST options defined with `-query`, `-db`, `-out`,
`-num_threads`.
If this job is killed for any reason, it can be restarted using the
following submit file:
{{% panel theme="info" header="dmtcp_restart_blastx.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=50:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX_info_2.txt
#SBATCH --error=BlastX_error_2.txt
module load dmtcp
module load blast/2.4
cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nr/ /tmp/
cp input_reads.fasta /tmp/
# Start DMTCP
dmtcp_coordinator --daemon --port 0 --port-file /tmp/port
export DMTCP_COORD_HOST=$(hostname)
export DMTCP_COORD_PORT=$(cat /tmp/port)
# Restart job
./dmtcp_restart_script.sh
{{< /highlight >}}
{{% /panel %}}
{{% notice info %}}
`dmtcp_restart` generates new
`ckpt_*.dmtcp` and `dmtcp_restart_script*.sh` files. Therefore, if
the restarted job is also killed due to unavailable/exceeded resources,
you can resubmit the same job again without any changes to the submit
file shown above (just don't forget to delete the old `ckpt_*.dmtcp`
files if you are using these files instead of `dmtcp_restart_script.sh`).
{{% /notice %}}
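
The launch and restart submit files can, in principle, be combined so that one file handles both the first run and every resubmission. The sketch below is an untested assumption rather than an HCC-provided script: `job_mode` and `run_or_restart` are hypothetical helpers, and the `#SBATCH` header, module loads, and data copying from the submit files above are omitted.

```bash
#!/bin/sh
# job_mode and run_or_restart are hypothetical helpers, not part of DMTCP.

# Decide whether a previous run left a restart script in the working directory
job_mode() {
    if [ -x ./dmtcp_restart_script.sh ]; then
        echo "restart"
    else
        echo "launch"
    fi
}

# Usage: run_or_restart <your_command> [args]
run_or_restart() {
    if [ "$(job_mode)" = "restart" ]; then
        # Same steps as in dmtcp_restart_blastx.submit
        dmtcp_coordinator --daemon --port 0 --port-file /tmp/port
        DMTCP_COORD_HOST="$(hostname)"
        DMTCP_COORD_PORT="$(cat /tmp/port)"
        export DMTCP_COORD_HOST DMTCP_COORD_PORT
        ./dmtcp_restart_script.sh
    else
        # First run: start the command under DMTCP
        dmtcp_launch --new-coordinator --rm --interval 3600 "$@"
    fi
}

echo "This job would: $(job_mode)"

# In a real submit file, the last line would be something like:
# run_or_restart blastx -query /tmp/input_reads.fasta -db /tmp/nr/nr \
#     -out blastx_output.alignments -num_threads 8
```

This assumes the working directory holds checkpoint files from a single job only. A restart script left over from an unrelated or already finished run would also trigger the restart branch.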
Even though DMTCP tries to support most mainstream and commonly used
applications, there is no guarantee that every application can be
checkpointed and restarted.