+++
title = "Submitting Jobs"
description =  "How to submit jobs to HCC resources"
weight = "10"
+++

Crane and Rhino are managed by
the [SLURM](https://slurm.schedmd.com) resource manager.  
In order to run processing on Crane or Rhino, you
must create a SLURM script that will run your processing. After
submitting the job, SLURM will schedule your processing on an available
worker node.

Before writing a submit file, you may need to
[compile your application]({{< relref "/guides/running_applications/compiling_source_code" >}}).

- [Ensure proper working directory for job output](#ensure-proper-working-directory-for-job-output)
- [Creating a SLURM Submit File](#creating-a-slurm-submit-file)
- [Submitting the job](#submitting-the-job)
- [Checking Job Status](#checking-job-status)
  -   [Checking Job Start](#checking-job-start)
  -   [Removing the Job](#removing-the-job)
- [Next Steps](#next-steps)


### Ensure proper working directory for job output

{{% notice info %}}
All SLURM job output should be directed to your /work path.
{{% /notice %}}

{{% panel theme="info" header="Manual specification of /work path" %}}
{{< highlight bash >}}
$ cd /work/[groupname]/[username]
{{< /highlight >}}
{{% /panel %}}

The environment variable `$WORK` can also be used.
{{% panel theme="info" header="Using environment variable for /work path" %}}
{{< highlight bash >}}
$ cd $WORK
$ pwd
/work/[groupname]/[username]
{{< /highlight >}}
{{% /panel %}}

Review how /work differs from /home [here.]({{< relref "/guides/handling_data/_index.md" >}})

### Creating a SLURM Submit File

{{% notice info %}}
The below example is for a serial job. For submitting MPI jobs, please
look at the [MPI Submission Guide.]({{< relref "submitting_an_mpi_job" >}})
{{% /notice %}}

A SLURM submit file is broken into two sections: the job description and
the processing.  SLURM job description lines are prefixed with `#SBATCH` in
the submit file.

**SLURM Submit File**

{{< highlight batch >}}
#!/bin/sh
#SBATCH --time=03:15:00          # Run time in hh:mm:ss
#SBATCH --mem-per-cpu=1024       # Maximum memory required per CPU (in megabytes)
#SBATCH --job-name=hello-world
#SBATCH --error=/work/[groupname]/[username]/job.%J.err
#SBATCH --output=/work/[groupname]/[username]/job.%J.out

module load example/test

hostname
sleep 60
{{< /highlight >}}

- **time**  
  Maximum walltime the job can run.  After this time has expired, the
  job will be stopped.
- **mem-per-cpu**  
  Memory that is allocated per core for the job.  If you exceed this
  memory limit, your job will be stopped.
- **mem**  
  Specify the real memory required per node in megabytes. If you
  exceed this limit, your job will be stopped. Note that you
  should ask for less memory than each node actually has. For
  instance, Rhino has 1TB, 512GB, 256GB, and 192GB of RAM per node. You may
  only request 1000GB of RAM for the 1TB node, 500GB of RAM for the
  512GB nodes, 250GB of RAM for the 256GB nodes, and 187.5GB for the 192GB nodes.
  For Crane, the max is 500GB. An example of specifying `mem` and
  `partition` is shown after this list.
- **job-name**  
  The name of the job.  Will be reported in the job listing.
- **partition**  
  The partition the job should run in.  Partitions determine the job's
  priority and which nodes the job can run on.  See the
  [Partitions]({{< relref "partitions" >}}) page for a list of possible partitions.
- **error**  
  Location where the stderr for the job will be written.  `[groupname]`
  and `[username]` should be replaced with your group name and username.
  Your username can be retrieved with the command `id -un` and your
  group with `id -ng`.
- **output**  
  Location where the stdout for the job will be written.

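The `mem` and `partition` options described above do not appear in the example
submit file. The following is a minimal sketch of how they might be added; the
partition name `batch` and the 4GB-per-node memory request are placeholder
values that should be adjusted for your job and for the nodes you intend to use.

{{< highlight batch >}}
#!/bin/sh
#SBATCH --time=03:15:00          # Run time in hh:mm:ss
#SBATCH --mem=4096               # Real memory required per node (in megabytes)
#SBATCH --partition=batch        # Partition to submit the job to (placeholder)
#SBATCH --job-name=hello-world
#SBATCH --error=/work/[groupname]/[username]/job.%J.err
#SBATCH --output=/work/[groupname]/[username]/job.%J.out

module load example/test

hostname
sleep 60
{{< /highlight >}}
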
More advanced submit commands can be found on the [SLURM Docs](https://slurm.schedmd.com/sbatch.html).
You can also find an example of an MPI submission on [Submitting an MPI Job]({{< relref "submitting_an_mpi_job" >}}).

### Submitting the job

Submitting the SLURM job is done with the command `sbatch`.  SLURM will read
the submit file and schedule the job according to the description in
the submit file.

Submitting the job described above is done as follows:

{{% panel theme="info" header="SLURM Submission" %}}
{{< highlight batch >}}
$ sbatch example.slurm
Submitted batch job 24603
{{< /highlight >}}
{{% /panel %}}

The job was successfully submitted.
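
You can refer back to this job using the job id that `sbatch` printed
(24603 above). For example, `squeue` accepts a `-j` option to show only the
listed job ids; this is a minimal sketch, and the output columns match the
`squeue` examples in the next section.

{{< highlight batch >}}
$ squeue -j 24603
{{< /highlight >}}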

### Checking Job Status

Job status is found with the command `squeue`.  It will provide
information such as:

- The State of the job: 
    - **R** - Running
    - **PD** - Pending - Job is awaiting resource allocation.
    - Additional codes are available
      on the [squeue](http://slurm.schedmd.com/squeue.html)
      page.
- Job Name
- Run Time
- Nodes running the job

The easiest way to check the status of your own jobs is to filter by your
username, using the `-u` option to `squeue`.

{{< highlight batch >}}
$ squeue -u <username>
  JOBID PARTITION     NAME       USER  ST       TIME  NODES NODELIST(REASON)
  24605     batch hello-wo <username>   R       0:56      1 b01
{{< /highlight >}}

Additionally, if you want to see the status of a specific partition, for
Adam Caprez's avatar
Adam Caprez committed
148
example if you are part of a [partition]({{< relref "partitions" >}}),
Adam Caprez's avatar
Adam Caprez committed
149
you can use the `-p` option to `squeue`:

{{< highlight batch >}}
$ squeue -p esquared
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  73435  esquared MyRandom tingting   R   10:35:20      1 ri19n10
  73436  esquared MyRandom tingting   R   10:35:20      1 ri19n12
  73735  esquared SW2_driv   hroehr   R   10:14:11      1 ri20n07
  73736  esquared SW2_driv   hroehr   R   10:14:11      1 ri20n07
{{< /highlight >}}

#### Checking Job Start

You may view the start time of your job with the
command `squeue --start`.  The output of the command will show the
expected start time of the jobs.

{{< highlight batch >}}
$ squeue --start --user lypeng
  JOBID PARTITION     NAME     USER  ST           START_TIME  NODES NODELIST(REASON)
   5822     batch  Starace   lypeng  PD  2013-06-08T00:05:09      3 (Priority)
   5823     batch  Starace   lypeng  PD  2013-06-08T00:07:39      3 (Priority)
   5824     batch  Starace   lypeng  PD  2013-06-08T00:09:09      3 (Priority)
   5825     batch  Starace   lypeng  PD  2013-06-08T00:12:09      3 (Priority)
   5826     batch  Starace   lypeng  PD  2013-06-08T00:12:39      3 (Priority)
   5827     batch  Starace   lypeng  PD  2013-06-08T00:12:39      3 (Priority)
   5828     batch  Starace   lypeng  PD  2013-06-08T00:12:39      3 (Priority)
   5829     batch  Starace   lypeng  PD  2013-06-08T00:13:09      3 (Priority)
   5830     batch  Starace   lypeng  PD  2013-06-08T00:13:09      3 (Priority)
   5831     batch  Starace   lypeng  PD  2013-06-08T00:14:09      3 (Priority)
   5832     batch  Starace   lypeng  PD                  N/A      3 (Priority)
{{< /highlight >}}

The output shows the expected start time of the jobs, as well as the
reason that the jobs are currently idle (in this case, low priority of
the user due to running numerous jobs already).
 
#### Removing the Job

Removing the job is done with the `scancel` command.  The only argument
to the `scancel` command is the job id.  For the job above, the command
is:

{{< highlight batch >}}
$ scancel 24605
{{< /highlight >}}

### Next Steps

{{% children  %}}