+++
title = "Monitoring Jobs"
description =  "How to find out information about running and completed jobs."
+++

Careful examination of running times, memory usage and output files will
allow you to ensure the job completed correctly and give you a good idea
of what memory and time limits to request in the future.

### Monitoring Completed Jobs:

To see the runtime and memory usage of a job that has completed, use the
sacct command:

{{< highlight bash >}}
sacct
{{< /highlight >}}

By default, this lists the current user's jobs since the start of the
current day and displays information such as JobID, JobName, State, and
ExitCode.

{{< figure src="/images/21070053.png" height="150" >}}

Coupling this command with the `--format` flag lets you see more than the
default information about a job. List the fields to display after the
`--format` flag as a comma-separated list without spaces. For example, to
see the elapsed time and maximum memory used by a job, use this command:

{{< highlight bash >}}
sacct --format JobID,JobName,Elapsed,MaxRSS
{{< /highlight >}}

{{< figure src="/images/21070054.png" height="150" >}}

Additional arguments and format fields are described in
[the Slurm documentation](https://slurm.schedmd.com/sacct.html).
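
For a single job, the same fields can be requested in a machine-readable form and post-processed. The sketch below assumes sacct's `-j`, `--noheader` and `--parsable2` options and that MaxRSS is printed with a `K`, `M`, `G` or `T` suffix; the helper name `peak_maxrss_mb` is hypothetical and not part of Slurm.

```bash
# Hypothetical helper: read "JobID|MaxRSS" lines (sacct --parsable2 output)
# on stdin and print the largest MaxRSS value in megabytes.
peak_maxrss_mb() {
  awk -F'|' '
    $2 != "" {
      value = $2 + 0                      # numeric part, e.g. 2048 from "2048K"
      unit = substr($2, length($2), 1)    # trailing unit letter, if any
      if (unit == "K") value = value / 1024
      else if (unit == "G") value = value * 1024
      else if (unit == "T") value = value * 1024 * 1024
      else if (unit != "M") value = value / 1024 / 1024   # no suffix: assume bytes
      if (value > max) max = value
    }
    END { printf "%.1f\n", max }
  '
}

# Usage (replace <JOB_ID> with the job ID assigned by Slurm):
#   sacct -j <JOB_ID> --noheader --parsable2 --format=JobID,MaxRSS | peak_maxrss_mb
```

Job steps that report no MaxRSS leave the second field empty and are skipped, so the result reflects only the steps that recorded memory usage.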

### Monitoring Running Jobs:

There are two ways to monitor running jobs: the `top` command and the
cgroup files. `top` is helpful when monitoring multi-process jobs, whereas
the cgroup files provide information on memory usage. Both tools require
an interactive job on the same node as the job to be monitored.

{{% notice warning %}}
If the job to be monitored is using all available resources for a node,
the user will not be able to obtain a simultaneous interactive job.
{{% /notice %}}

After the job to be monitored is submitted and has begun to run, request
an interactive job on the same node using the srun command:

{{< highlight bash >}}
srun --jobid=<JOB_ID> --pty bash
{{< /highlight >}}

Where `<JOB_ID>` is replaced by the job ID of the monitored job, as
assigned by Slurm.

Alternatively, you can request the interactive job by node name as follows:

{{< highlight bash >}}
srun --nodelist=<NODE_ID> --pty bash
{{< /highlight >}}

Where `<NODE_ID>` is replaced by the name of the node on which the
monitored job is running. This name appears in the `NODELIST` column of
the squeue output.

{{< figure src="/images/21070055.png" width="700" >}}
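
The node name can also be looked up from the command line instead of read off the table. This is a minimal sketch, assuming squeue's `-j`, `-h` (no header) and `-o` options with the `%N` node-list format; the function name `node_of_job` is hypothetical.

```bash
# Hypothetical helper: print the node list of the job whose ID is given as $1.
node_of_job() {
  squeue -j "$1" -h -o '%N'
}

# Usage (replace <JOB_ID> with the job ID assigned by Slurm):
#   srun --nodelist="$(node_of_job <JOB_ID>)" --pty bash
```

For a job that spans several nodes, the output is a compressed node list rather than a single name, so pick one node from it before passing it to `--nodelist`.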

Once the interactive job begins, you can run top to view the processes
on the node you are on:

{{< figure src="/images/21070056.png" height="400" >}}

The output of top lists each running process on the node. In the image
above, we can see the various MATLAB processes being run by user
cathrine98. To filter the list of processes, type `u` followed by the
username of the user who owns the processes. To exit this screen,
press `q`.
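
When a one-off snapshot is more convenient than the interactive screen, top can also run non-interactively. This is a minimal sketch, assuming the procps version of top with its `-b` (batch), `-n` (number of iterations) and `-u` (user) options; the function name `top_snapshot` is hypothetical.

```bash
# Hypothetical helper: print a single batch-mode snapshot of one user's
# processes. With no argument, it defaults to the current user.
top_snapshot() {
  top -b -n 1 -u "${1:-$(id -un)}"
}

# Usage:
#   top_snapshot              # your own processes
#   top_snapshot <USERNAME>   # another user's processes
```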

While a job is running, a cgroup folder is created that contains much
of the information used by sacct. These files provide a live overview of
the resources used by a running job. To access the cgroup files, you need
to be in an interactive job on the same node as the monitored job. To view
specific files and information, use one of the following commands:

##### To view current memory usage:

{{< highlight bash >}}
less /cgroup/memory/slurm/uid_<UID>/job_<SLURM_JOB_ID>/memory.usage_in_bytes
{{< /highlight >}}

Where `<UID>` is replaced by your UID and `<SLURM_JOB_ID>` is
replaced by the monitored job's job ID as assigned by Slurm.

{{% notice note %}}
To find your UID, use the command `id -u`. Your UID never changes and is
the same on all HCC clusters (*not* on Anvil, however!).
{{% /notice %}}

##### To view maximum memory usage from start of job to current point:

{{< highlight bash >}}
cat /cgroup/memory/slurm/uid_${UID}/job_${SLURM_JOBID}/memory.max_usage_in_bytes
{{< /highlight >}}
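
The raw byte counts are easier to read when converted and sampled periodically. This is a minimal sketch, assuming the cgroup path layout shown above; the helper names `bytes_to_mib` and `watch_job_memory` and the 10-second default interval are illustrative choices, not part of Slurm.

```bash
# Hypothetical helper: convert a byte count to MiB with one decimal place.
bytes_to_mib() {
  awk -v bytes="$1" 'BEGIN { printf "%.1f\n", bytes / 1048576 }'
}

# Hypothetical helper: print a job's current memory usage at intervals.
#   $1 = UID, $2 = job ID, $3 = seconds between samples (default 10)
watch_job_memory() {
  usage_file="/cgroup/memory/slurm/uid_$1/job_$2/memory.usage_in_bytes"
  # Stop when the file is no longer readable, e.g. after the job ends.
  while [ -r "$usage_file" ]; do
    echo "$(date +%T) $(bytes_to_mib "$(cat "$usage_file")") MiB"
    sleep "${3:-10}"
  done
}

# Usage (press Ctrl-C to stop early):
#   watch_job_memory "$(id -u)" <SLURM_JOB_ID>
```

Run it from the interactive job on the same node as the monitored job, since the cgroup files exist only there.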