<span id="title-text"> HCC-DOCS : Monitoring Jobs </span>
=========================================================

Created by <span class="author"> Carrie Brown</span>, last modified by
<span class="editor"> Natasha Pavlovikj</span> on Sep 19, 2018

Carefully examining a job's run time, memory usage, and output files
lets you confirm that the job completed correctly and helps you choose
memory and time limits for future requests.

### Monitoring Completed Jobs:

To see the runtime and memory usage of a job that has completed, use the
sacct command:

``` syntaxhighlighter-pre
sacct
```

Run without arguments, sacct lists the current user's jobs (by default,
only those from the current day) and displays information such as
JobID, JobName, State, and ExitCode.

<span
class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img src="assets/images/21070052/21070053.png" class="confluence-embedded-image" height="150" /></span>

  

Adding the --format flag lets you display more than the default
information about a job. List the fields to display after the --format
flag as a comma-separated list without spaces. For example, to see the
elapsed time and the maximum memory used by a job, use this command:

``` syntaxhighlighter-pre
sacct --format JobID,JobName,Elapsed,MaxRSS
```

<span
class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img src="assets/images/21070052/21070054.png" class="confluence-embedded-image" height="150" /></span>

Additional arguments and format field information can be found in
<a href="https://slurm.schedmd.com/sacct.html" class="external-link">the Slurm documentation</a>.
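
When a job has several steps, it can help to pick out the step with the highest MaxRSS before choosing a memory request. The sketch below is one way to do that. The helper name `peak_rss_step` is hypothetical, and it assumes sacct is run with `--parsable2` so that fields are separated by `|`, and that MaxRSS values end in a K, M, or G suffix.

```shell
# Print the line with the largest MaxRSS from pipe-delimited sacct output.
# Reads lines of the form JobID|Elapsed|MaxRSS on stdin.
# Assumption: MaxRSS ends in a K, M, or G suffix (for example 2048K).
peak_rss_step() {
  awk -F'|' '
    $3 != "" {
      value = $3 + 0                      # numeric prefix, e.g. 2048 from "2048K"
      unit = substr($3, length($3), 1)    # trailing unit letter
      if (unit == "M") value *= 1024      # normalize everything to KiB
      else if (unit == "G") value *= 1024 * 1024
      if (value > peak) { peak = value; line = $0 }
    }
    END { if (line != "") print line }
  '
}

# Usage on a cluster (replace <JOB_ID> with a real job ID):
#   sacct -j <JOB_ID> --noheader --parsable2 --format=JobID,Elapsed,MaxRSS | peak_rss_step
```

Steps with an empty MaxRSS field (typically the job's summary line) are skipped.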

### Monitoring Running Jobs:

There are two ways to monitor running jobs: the top command and the
cgroup files. top is helpful when monitoring multi-process jobs, whereas
the cgroup files provide information on memory usage. Both of these
tools require an interactive job on the same node as the job to be
monitored.

If the job to be monitored is using all available resources on its node,
you will not be able to start an interactive job on that node at the
same time.

After the job to be monitored is submitted and has begun to run, request
an interactive job on the same node using the srun command:

``` syntaxhighlighter-pre
srun --jobid=<JOB_ID> --pty bash
```

Where &lt;JOB\_ID&gt; is replaced by the job ID that Slurm assigned to
the monitored job.

Alternatively, you can request the interactive job by node name as follows:

``` syntaxhighlighter-pre
srun --nodelist=<NODE_ID> --pty bash
```

Where &lt;NODE\_ID&gt; is replaced by the name of the node on which the
monitored job is running. You can find this name in the NODELIST column
of the squeue output.

<span
class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img src="assets/images/21070052/21070055.png" class="confluence-embedded-image" width="700" /></span>
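
The node lookup can also be scripted rather than read from the table by eye. This is a minimal sketch, assuming a single-node job. The helper name `node_of_job` and the job ID 12345 are placeholders, not part of the original instructions.

```shell
# Print the node(s) a job is running on, using squeue's %N (NODELIST) field.
node_of_job() {
  squeue --noheader --jobs="$1" --format="%N"
}

# Usage on a cluster (12345 is a placeholder job ID):
#   node=$(node_of_job 12345)
#   srun --nodelist="$node" --pty bash
```

For a multi-node job, squeue prints a node list expression rather than a single name, so you would need to pick one node from it before passing it to srun.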

Once the interactive job begins, you can run top to view the processes
on the node you are on:

<span
class="confluence-embedded-file-wrapper confluence-embedded-manual-size"><img src="assets/images/21070052/21070056.png" class="confluence-embedded-image" height="400" /></span>

The top output displays each running process on the node. From the above
image, we can see the various MATLAB processes being run by user
cathrine98. To filter the list of processes, you can type `u` followed
by the username of the user who owns the processes. To exit this screen,
press `q`.
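
If you would rather capture a single snapshot than use the interactive screen, top can also run in batch mode. The sketch below assumes the procps version of top found on most Linux systems. The helper name `top_snapshot` is hypothetical.

```shell
# Print one non-interactive snapshot of a user's processes.
# -b: batch mode, -n 1: a single iteration, -u: filter by user.
# Defaults to the current user when no username is given.
top_snapshot() {
  top -b -n 1 -u "${1:-$USER}"
}

# Usage inside the interactive job:
#   top_snapshot            # your own processes
#   top_snapshot | head -20 # only the summary and the first few processes
```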

While a job is running, a cgroup folder is created for it that contains
much of the information used by sacct. These files can provide a live
overview of the resources used by the running job. To access the cgroup
files, you will need to be in an interactive job on the same node as the
monitored job. To view a specific value, use one of the following
commands:

##### To view current memory usage:

``` syntaxhighlighter-pre
less /cgroup/memory/slurm/uid_<UID>/job_<SLURM_JOB_ID>/memory.usage_in_bytes
```

Where &lt;UID&gt; is replaced by your UID and &lt;SLURM\_JOB\_ID&gt; is
replaced by the monitored job's Job ID as assigned by Slurm.

To find your UID, use the command `id -u`. Your UID never changes but
is cluster-specific (i.e., your UID on Crane will always be the same but
will differ from your UID on the other clusters).
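
The cgroup memory files report a raw byte count, which is awkward to compare with a memory request given in megabytes or gigabytes. A minimal sketch of a conversion helper follows. The name `mem_file_mib` is hypothetical, and the sketch assumes the file holds a single integer.

```shell
# Print the byte count stored in a cgroup memory file as MiB.
# $1: path to a file containing a single integer number of bytes.
mem_file_mib() {
  awk '{ printf "%.1f MiB\n", $1 / (1024 * 1024) }' "$1"
}

# Usage inside the interactive job (fill in the placeholders):
#   mem_file_mib /cgroup/memory/slurm/uid_<UID>/job_<SLURM_JOB_ID>/memory.usage_in_bytes
```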

##### To view maximum memory usage from start of job to current point:

``` syntaxhighlighter-pre
cat /cgroup/memory/slurm/uid_${UID}/job_${SLURM_JOBID}/memory.max_usage_in_bytes
```

Here ${UID} is expanded by bash to your UID, and ${SLURM\_JOBID} is set
by Slurm inside the interactive session. If you connected with
--nodelist rather than --jobid, ${SLURM\_JOBID} refers to the
interactive job itself, so substitute the monitored job's ID instead.
