+++
title = "Monitoring Jobs"
description = "How to find out information about running and completed jobs."
weight = 55
+++

Careful examination of running times, memory usage, and output files will
allow you to confirm that a job completed correctly and give you a good idea
of what memory and time limits to request in the future.

### Monitoring Completed Jobs:

To see the runtime and memory usage of a job that has completed, use the
`sacct` command:

{{< highlight bash >}}
sacct
{{< /highlight >}}

This lists all jobs run by the current user and displays information such as
JobID, JobName, State, and ExitCode.

{{< figure src="/images/21070053.png" height="150" >}}

Coupling this command with the `--format` flag will allow you to see more
than the default information about a job. Fields to display should be
given as a comma-separated list (without spaces) after the `--format`
flag. For example, to see the elapsed time and maximum memory used by
a job, use this command:

{{< highlight bash >}}
sacct --format JobID,JobName,Elapsed,MaxRSS
{{< /highlight >}}

{{< figure src="/images/21070054.png" height="150" >}}

Additional arguments and format field information can be found in
[the SLURM documentation](https://slurm.schedmd.com/sacct.html).
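
For example, assuming the `-j` and `--units` options are available in your
Slurm version, the output can be restricted to a single job and memory values
reported in gigabytes (replace `<JOB_ID>` with the job's ID):

{{< highlight bash >}}
sacct -j <JOB_ID> --format JobID,JobName,Elapsed,MaxRSS,State,ExitCode --units=G
{{< /highlight >}}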

### Monitoring Running Jobs:

There are two ways to monitor running jobs: the `top` command and
the cgroup files. `top` is helpful when monitoring
multi-process jobs, whereas the cgroup files provide information on
memory usage. Both of these tools require an interactive job
on the same node as the job to be monitored.

{{% notice warning %}}
If the job to be monitored is using all available resources for a node,
the user will not be able to obtain a simultaneous interactive job.
{{% /notice %}}

After the job to be monitored is submitted and has begun to run, request
an interactive job on the same node using the `srun` command:

{{< highlight bash >}}
srun --jobid=<JOB_ID> --pty bash
{{< /highlight >}}

Where `<JOB_ID>` is replaced by the job ID for the monitored job as
assigned by SLURM.

Alternatively, you can request the interactive job by node name as follows:

{{< highlight bash >}}
srun --nodelist=<NODE_ID> --pty bash
{{< /highlight >}}

Where `<NODE_ID>` is replaced by the name of the node where the monitored
job is running. This information can be found by looking at the
`squeue` output under the `NODELIST` column.
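
For example, a minimal `squeue` invocation that lists only your own jobs,
including the node assigned to each (replace `<username>` with your login
name), would be:

{{< highlight bash >}}
squeue -u <username>
{{< /highlight >}}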

{{< figure src="/images/21070055.png" width="700" >}}

Once the interactive job begins, you can run `top` to view the processes
on the node you are on:

{{< figure src="/images/21070056.png" height="400" >}}

The output of `top` displays each running process on the node. From the above
image, we can see the various MATLAB processes being run by user
cathrine98. To filter the list of processes, you can type `u` followed
by the username of the user who owns the processes. To exit this screen,
press `q`.
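
If you only need a one-time snapshot rather than the interactive display,
`top` can also be run in batch mode. As a sketch (replace `<username>` with
the name of the user whose processes you want to see):

{{< highlight bash >}}
top -b -n 1 -u <username>
{{< /highlight >}}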

During a running job, a cgroup directory is created which contains much
of the information used by sacct. These files can provide a live
overview of the resources used by a running job. To access the cgroup
files, you will need to be in an interactive job on the same node as the
monitored job. To view specific files and information, use one of the
following commands:

##### To view current memory usage:

{{< highlight bash >}}
less /cgroup/memory/slurm/uid_<UID>/job_<SLURM_JOB_ID>/memory.usage_in_bytes
{{< /highlight >}}

Where `<UID>` is replaced by your UID and `<SLURM_JOB_ID>` is
replaced by the monitored job's Job ID as assigned by Slurm.

{{% notice note %}}
To find your UID, use the command `id -u`. Your UID never changes and is
the same on all HCC clusters (*not* on Anvil, however!).
{{% /notice %}}

##### To view maximum memory usage from the start of the job to the current point:

{{< highlight bash >}}
cat /cgroup/memory/slurm/uid_${UID}/job_${SLURM_JOBID}/memory.max_usage_in_bytes
{{< /highlight >}}
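
Both files report values in bytes. If the `numfmt` utility (part of GNU
coreutils) is available on the node, the value can be converted to a
human-readable form, for example:

{{< highlight bash >}}
numfmt --to=iec < /cgroup/memory/slurm/uid_${UID}/job_${SLURM_JOBID}/memory.max_usage_in_bytes
{{< /highlight >}}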