+++
title = "Submitting Jobs"
description =  "How to submit jobs to HCC resources"
weight = "10"
+++

Crane and Rhino are managed by
the [SLURM](https://slurm.schedmd.com) resource manager.  
In order to run processing on Crane, you
must create a SLURM script that will run your processing. After
submitting the job, SLURM will schedule your processing on an available
worker node.

Before writing a submit file, you may need to
[compile your application]({{< relref "/guides/running_applications/compiling_source_code" >}}).

- [Ensure proper working directory for job output](#ensure-proper-working-directory-for-job-output)
- [Creating a SLURM Submit File](#creating-a-slurm-submit-file)
- [Submitting the job](#submitting-the-job)
- [Checking Job Status](#checking-job-status)
  -   [Checking Job Start](#checking-job-start)
  -   [Removing the Job](#removing-the-job)
- [Next Steps](#next-steps)


### Ensure proper working directory for job output

{{% notice info %}}
Because the /home directories are not writable from the worker nodes, all SLURM job output should be directed to your /work path.
{{% /notice %}}

{{% panel theme="info" header="Manual specification of /work path" %}}
{{< highlight bash >}}
$ cd /work/[groupname]/[username]
{{< /highlight >}}
{{% /panel %}}

The environment variable `$WORK` can also be used.
{{% panel theme="info" header="Using environment variable for /work path" %}}
{{< highlight bash >}}
$ cd $WORK
$ pwd
/work/[groupname]/[username]
{{< /highlight >}}
{{% /panel %}}

Review how /work differs from /home [here]({{< relref "/guides/handling_data/_index.md" >}}).
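
If you do not remember your group name or username, they can be filled in
on the fly with the `id` command; a small sketch (the resulting path is
shown with placeholders):

{{% panel theme="info" header="Using id to locate your /work path" %}}
{{< highlight bash >}}
$ cd /work/$(id -ng)/$(id -un)
$ pwd
/work/[groupname]/[username]
{{< /highlight >}}
{{% /panel %}}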

### Creating a SLURM Submit File

{{% notice info %}}
The below example is for a serial job. For submitting MPI jobs, please
look at the [MPI Submission Guide.]({{< relref "submitting_an_mpi_job" >}})
{{% /notice %}}

A SLURM submit file is broken into two sections: the job description and
the processing.  SLURM job description lines are prefixed with `#SBATCH` in
the submit file.

**SLURM Submit File**

{{< highlight batch >}}
#!/bin/sh
#SBATCH --time=03:15:00          # Run time in hh:mm:ss
#SBATCH --mem-per-cpu=1024       # Maximum memory required per CPU (in megabytes)
#SBATCH --job-name=hello-world
#SBATCH --error=/work/[groupname]/[username]/job.%J.err
#SBATCH --output=/work/[groupname]/[username]/job.%J.out

module load example/test

hostname
sleep 60
{{< /highlight >}}

- **time**  
  Maximum walltime the job can run.  After this time has expired, the
  job will be stopped.
- **mem-per-cpu**  
  Memory that is allocated per core for the job.  If you exceed this
  memory limit, your job will be stopped.
- **mem**  
  Specify the real memory required per node, in megabytes. If you
  exceed this limit, your job will be stopped. Note that you
  should ask for less memory than each node actually has. For Crane, the
  max is 500GB.
- **job-name**  
  The name of the job.  Will be reported in the job listing.
- **partition**  
  The partition the job should run in.  Partitions determine the job's
  priority and the nodes on which it can run.  See the
  [Partitions]({{< relref "/guides/submitting_jobs/partitions/_index.md" >}}) page for a list of possible partitions; a submit file sketch that requests a partition follows this list.
- **error**  
  Location where the stderr for the job will be written.  `[groupname]`
  and `[username]` should be replaced with your group name and username.
  Your username can be retrieved with the command `id -un` and your
  group with `id -ng`.
- **output**  
  Location where the stdout for the job will be written.
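
For reference, here is a sketch of the submit file above adjusted to use
`--mem` and `--partition` in place of `--mem-per-cpu`. The partition name
and memory value are illustrative only; substitute ones appropriate for
your group and job.

{{< highlight batch >}}
#!/bin/sh
#SBATCH --time=03:15:00          # Run time in hh:mm:ss
#SBATCH --mem=4096               # Maximum memory required per node (in megabytes)
#SBATCH --partition=batch        # Illustrative partition name; see the Partitions page
#SBATCH --job-name=hello-world
#SBATCH --error=/work/[groupname]/[username]/job.%J.err
#SBATCH --output=/work/[groupname]/[username]/job.%J.out

module load example/test

hostname
sleep 60
{{< /highlight >}}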

More advanced submit commands can be found in the [SLURM Docs](https://slurm.schedmd.com/sbatch.html).
You can also find an example of an MPI submission on [Submitting an MPI Job]({{< relref "submitting_an_mpi_job" >}}).

### Submitting the job

Submitting a SLURM job is done with the `sbatch` command.  SLURM will
read the submit file and schedule the job according to the description
it contains.

Submitting the job described above looks like this:

{{% panel theme="info" header="SLURM Submission" %}}
{{< highlight batch >}}
$ sbatch example.slurm
Submitted batch job 24603
{{< /highlight >}}
{{% /panel %}}

The job was successfully submitted.
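
Options passed on the `sbatch` command line take precedence over the
matching `#SBATCH` directives in the submit file, which is convenient for
one-off changes. A sketch with illustrative values:

{{< highlight batch >}}
$ sbatch --time=00:30:00 --job-name=hello-short example.slurm
{{< /highlight >}}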

### Checking Job Status

Job status is found with the command `squeue`.  It will provide
information such as:

- The State of the job: 
    - **R** - Running
    - **PD** - Pending - Job is awaiting resource allocation.
    - Additional codes are available
      on the [squeue](http://slurm.schedmd.com/squeue.html)
      page.
- Job Name
- Run Time
- Nodes running the job

The easiest way to check the status of your own jobs is to filter by
your username with the `-u` option to `squeue`.

{{< highlight batch >}}
$ squeue -u <username>
  JOBID PARTITION     NAME       USER  ST       TIME  NODES NODELIST(REASON)
  24605     batch hello-wo <username>   R       0:56      1 b01
{{< /highlight >}}
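
The state codes can also be used as a filter. For example, to list only
your pending jobs, combine `-u` with the `-t`/`--states` option (a sketch):

{{< highlight batch >}}
$ squeue -u <username> --states=PD
{{< /highlight >}}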

Additionally, if you want to see the status of a specific partition,
for example if you have access to a particular [partition]({{< relref "/guides/submitting_jobs/partitions/_index.md" >}}),
you can use the `-p` option to `squeue`:

{{< highlight batch >}}
$ squeue -p esquared
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  73435  esquared MyRandom tingting   R   10:35:20      1 ri19n10
  73436  esquared MyRandom tingting   R   10:35:20      1 ri19n12
  73735  esquared SW2_driv   hroehr   R   10:14:11      1 ri20n07
  73736  esquared SW2_driv   hroehr   R   10:14:11      1 ri20n07
{{< /highlight >}}

#### Checking Job Start

You may view the expected start times of your pending jobs with the
`squeue --start` command.

{{< highlight batch >}}
$ squeue --start --user lypeng
  JOBID PARTITION     NAME     USER  ST           START_TIME  NODES NODELIST(REASON)
   5822     batch  Starace   lypeng  PD  2013-06-08T00:05:09      3 (Priority)
   5823     batch  Starace   lypeng  PD  2013-06-08T00:07:39      3 (Priority)
   5824     batch  Starace   lypeng  PD  2013-06-08T00:09:09      3 (Priority)
   5825     batch  Starace   lypeng  PD  2013-06-08T00:12:09      3 (Priority)
   5826     batch  Starace   lypeng  PD  2013-06-08T00:12:39      3 (Priority)
   5827     batch  Starace   lypeng  PD  2013-06-08T00:12:39      3 (Priority)
   5828     batch  Starace   lypeng  PD  2013-06-08T00:12:39      3 (Priority)
   5829     batch  Starace   lypeng  PD  2013-06-08T00:13:09      3 (Priority)
   5830     batch  Starace   lypeng  PD  2013-06-08T00:13:09      3 (Priority)
   5831     batch  Starace   lypeng  PD  2013-06-08T00:14:09      3 (Priority)
   5832     batch  Starace   lypeng  PD                  N/A      3 (Priority)
{{< /highlight >}}

The output shows the expected start time of the jobs, as well as the
reason that the jobs are currently idle (in this case, low priority of
the user due to running numerous jobs already).
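
To check a single job instead of every job belonging to a user, pass the
job id with the `-j`/`--jobs` option (the job id below is illustrative):

{{< highlight batch >}}
$ squeue --start --jobs 24605
{{< /highlight >}}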
 
#### Removing the Job

Removing a job is done with the `scancel` command.  In its simplest
form, the only argument to `scancel` is the job id.  For the job above,
the command is:

{{< highlight batch >}}
$ scancel 24605
{{< /highlight >}}
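
`scancel` can also remove jobs in bulk. For example, all of your own jobs
can be cancelled with the `-u` option, or every job with a given name with
`--name` (a sketch; replace the placeholder with your username):

{{< highlight batch >}}
$ scancel -u <username>
$ scancel --name hello-world
{{< /highlight >}}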

### Next Steps

{{% children  %}}