+++
title = "Monitoring Jobs"
description = "How to find information about running and completed jobs."
weight=55
+++

Careful examination of running times, memory usage and output files will
allow you to ensure the job completed correctly and give you a good idea
of what memory and time limits to request in the future.

### Monitoring Completed Jobs:

To see the runtime and memory usage of a job that has completed, use the
`sacct` command:

{{< highlight bash >}}
sacct
{{< /highlight >}}

This lists all jobs run by the current user and displays information such as
JobID, JobName, State, and ExitCode.

{{< figure src="/images/21070053.png" height="150" >}}

Coupling this command with the `--format` flag will allow you to see more
than the default information about a job. Fields to display should be
listed as a comma-separated list after the `--format` flag (without
spaces). For example, to see the elapsed time and maximum memory used by
a job, this command can be used:

{{< highlight bash >}}
sacct --format JobID,JobName,Elapsed,MaxRSS
{{< /highlight >}}

{{< figure src="/images/21070054.png" height="150" >}}

Additional arguments and format field information can be found in
[the SLURM documentation](https://slurm.schedmd.com/sacct.html).
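
For example, to inspect a single completed job you can pass its job ID to `sacct` with the `-j` flag and request any additional fields you need (the job ID below is a placeholder):

{{< highlight bash >}}
sacct -j <JOB_ID> --format JobID,JobName,Elapsed,MaxRSS,State,ExitCode
{{< /highlight >}}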

### Monitoring Running Jobs:
There are two ways to monitor running jobs: using the `top` command and
monitoring the `cgroup` files with the utility `cgget`. `top` is helpful
when monitoring multi-process jobs, whereas the `cgroup` files provide
information on memory usage. Both of these tools require the use of an
interactive job on the same node as the job to be monitored while the job
is running.

{{% notice warning %}}
If the job to be monitored is using all available resources for a node,
the user will not be able to obtain a simultaneous interactive job.
{{% /notice %}}

After the job to be monitored is submitted and has begun to run, request
an interactive job on the same node using the `srun` command:

{{< highlight bash >}}
srun --jobid=<JOB_ID> --pty bash
{{< /highlight >}}

where `<JOB_ID>` is replaced by the job id for the monitored job as
assigned by SLURM.
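
For example, if SLURM assigned the job ID 25745709 to the monitored job (this ID is purely illustrative), the command would be:

{{< highlight bash >}}
srun --jobid=25745709 --pty bash
{{< /highlight >}}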

Alternatively, you can request the interactive job by node name as follows:

{{< highlight bash >}}
srun --nodelist=<NODE_ID> --pty bash
{{< /highlight >}}

where `<NODE_ID>` is replaced by the name of the node where the monitored
job is running. This information can be found by looking at the
`squeue` output under the `NODELIST` column.
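
To list only your own jobs (replace `<username>` with your HCC login), you can run:

{{< highlight bash >}}
squeue -u <username>
{{< /highlight >}}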

{{< figure src="/images/21070055.png" width="700" >}}

### Using `top` to monitor running jobs
Once the interactive job begins, you can run `top` to view the processes
on the node you are on:

{{< figure src="/images/21070056.png" height="400" >}}

Output for `top` displays each running process on the node. From the above
image, we can see the various MATLAB processes being run by user
cathrine98. To filter the list of processes, you can type `u` followed
by the username of the user who owns the processes. To exit this screen,
press `q`.
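
Alternatively, `top` can be started already filtered to a single user's processes (replace `<username>` with the user of interest):

{{< highlight bash >}}
top -u <username>
{{< /highlight >}}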

### Using `cgget` to monitor running jobs
While a job is running, a `cgroup` folder is created on the node where the job
is running. This folder contains much of the information used by `sacct`.
However, while `sacct` reports information gathered every 30 seconds, the
`cgroup` files are updated more frequently and can detect quick spikes in
resource usage that `sacct` misses. Thus, using the `cgroup` files can give more
accurate information, especially regarding RAM usage.

One way to access the `cgroup` files with `cgget` is to start an interactive job
on the same node as the monitored job. Then, to view specific files and information,
use one of the following commands:

##### To view current memory usage:
{{< highlight bash >}}
cgget -r memory.usage_in_bytes /slurm/uid_<UID>/job_<SLURM_JOBID>/
{{< /highlight >}}

where `<UID>` is replaced by your UID and `<SLURM_JOBID>` is
replaced by the monitored job's Job ID as assigned by SLURM.

{{% notice note %}}
To find your `uid`, use the command `id -u`. Your UID never changes and is
the same on all HCC clusters (*not* on Anvil, however!).
{{% /notice %}}
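
For instance, inside an interactive session on the job's node you can substitute your UID automatically with `id -u` (the job ID below remains a placeholder to replace):

{{< highlight bash >}}
cgget -r memory.usage_in_bytes /slurm/uid_$(id -u)/job_<SLURM_JOBID>/
{{< /highlight >}}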

##### To view the total CPU time, in nanoseconds, consumed by the job:
{{< highlight bash >}}
cgget -r cpuacct.usage /slurm/uid_<UID>/job_<SLURM_JOBID>/
{{< /highlight >}}
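
Since the value is reported in nanoseconds, dividing it by 10^9 gives CPU seconds. A small sketch of that conversion (it assumes `cgget`'s `-n` and `-v` flags, which suppress the group header and parameter name, and uses a placeholder job ID):

{{< highlight bash >}}
# print the job's total CPU time in seconds
cgget -n -v -r cpuacct.usage /slurm/uid_$(id -u)/job_<SLURM_JOBID>/ | awk '{printf "%.1f CPU-seconds\n", $1/1e9}'
{{< /highlight >}}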


Since the `cgroup` files are available only while the job is running, another
way of accessing the information from these files is from within the submit script itself.

To track, for example, the maximum memory usage of a job, you can add
{{< highlight bash >}}
cgget -r memory.max_usage_in_bytes /slurm/uid_${UID}/job_${SLURM_JOBID}/
{{< /highlight >}}
at the end of your submit file. Unlike the previous examples, you do not need to 
modify this command - here `UID` and `SLURM_JOBID` are variables that will be set 
when the job is submitted.
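
A minimal submit-file sketch showing where such a line goes (the job name, resource requests, and application are placeholders):

{{< highlight bash >}}
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --time=00:10:00
#SBATCH --mem=1gb

# the actual work of the job
./my_application

# report the peak memory usage recorded in the job's cgroup
cgget -r memory.max_usage_in_bytes /slurm/uid_${UID}/job_${SLURM_JOBID}/
{{< /highlight >}}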

For information on more parameters that can be queried with `cgget`, please check [here](https://reposcope.com/man/en/1/cgget).

We also provide a script, `mem_report`, that reports the current and maximum
memory usage for a job. This script is a wrapper for the `cgget` commands shown above
and generates user-friendly output. To use this script, you need to add
{{< highlight bash >}}
mem_report
{{< /highlight >}}
at the end of your submit script.

`mem_report` can also be run as part of an interactive job:

{{< highlight bash >}}
[demo13@c0218.crane ~]$ mem_report 
Current memory usage for job 25745709 is: 2.57 MBs
Maximum memory usage for job 25745709 is: 3.27 MBs
{{< /highlight >}}

When `cgget` and `mem_report` are used as part of the submit script, their output
is printed in the generated SLURM log files, unless otherwise specified.
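
By default (unless the `--output` option is set in the submit file), that log is the `slurm-<JOB_ID>.out` file created in the directory the job was submitted from, so the report can be viewed with, for example:

{{< highlight bash >}}
cat slurm-<JOB_ID>.out
{{< /highlight >}}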

### Monitoring Queued Jobs:

The HCC clusters use a fair-share queue, which means a job's priority depends on how long the job has been waiting in the queue, your past usage of the cluster, the job's size, the memory and time requested, and other factors. Priority is also affected by the number of jobs waiting in the queue and how many resources are available on the cluster. The more jobs you submit, the lower the priority of your subsequent jobs will be.

You can check when your jobs are expected to start running on the cluster using the command:
{{< highlight bash >}}
sacct -u <user_id> --format=start
{{< /highlight >}}

To check the start time for a specific job, you can use the following command:
{{< highlight bash >}}
sacct -u <user_id> --job=<job_id> --format=start
{{< /highlight >}}

Finally, you can check your fair-share score by running the following command:
{{< highlight bash >}}
sshare --account=<group_name> -a
{{< /highlight >}}

The output of the above command shows your fair-share score. If your fair-share score is 1.0, your account has not run any jobs recently (unused). If your fair-share score is 0.5, that indicates average utilization: the account is, on average, using exactly as much as its granted share. If your fair-share score is between 0 and 0.5, that indicates over-utilization: the account has used more than its granted share. Finally, if your fair-share score is 0, no share is left: the account has vastly overused its granted share. Even in that case, if there is no contention for resources, the jobs will still start.

Another way to have your jobs start sooner is through [Priority Access](https://hcc.unl.edu/priority-access-pricing).