diff --git a/content/submitting_jobs/monitoring_jobs.md b/content/submitting_jobs/monitoring_jobs.md
index b6b458eb6ced01143f60d2755f583e74f3a6eec1..99b26910c6248796b01ee1ec2d93753c89e1b396 100644
--- a/content/submitting_jobs/monitoring_jobs.md
+++ b/content/submitting_jobs/monitoring_jobs.md
@@ -38,12 +38,12 @@ Additional arguments and format field information can be found in
 [the SLURM documentation](https://slurm.schedmd.com/sacct.html).
 
 ### Monitoring Running Jobs:
-
-There are two ways to monitor running jobs, the top command and
-monitoring the cgroup files. Top is helpful when monitoring
-multi-process jobs, whereas the cgroup files provide information on
-memory usage. Both of these tools require the use of an interactive job
-on the same node as the job to be monitored.
+There are two ways to monitor running jobs: the `top` command and
+monitoring the `cgroup` files using the utility `cgget`. `top` is helpful
+when monitoring multi-process jobs, whereas the `cgroup` files provide
+information on memory usage. Both of these tools require the use of an
+interactive job on the same node as the job to be monitored while the job
+is running.
 
 {{% notice warning %}}
 If the job to be monitored is using all available resources for a node,
@@ -57,7 +57,7 @@ an interactive job on the same node using the srun command:
 srun --jobid=<JOB_ID> --pty bash
 {{< /highlight >}}
 
-Where `<JOB_ID>` is replaced by the job id for the monitored job as
+where `<JOB_ID>` is replaced by the job ID for the monitored job as
 assigned by SLURM.
 
 Alternately, you can request the interactive job by nodename as follows:
@@ -66,46 +66,83 @@ Alternately, you can request the interactive job by nodename as follows:
 srun --nodelist=<NODE_ID> --pty bash
 {{< /highlight >}}
 
-Where `<NODE_ID>` is replaced by the node name that the monitored
+where `<NODE_ID>` is replaced by the name of the node where the monitored
 job is running.
 This information can be found out by looking at the squeue output under
 the `NODELIST` column.
 
 {{< figure src="/images/21070055.png" width="700" >}}
 
-Once the interactive job begins, you can run top to view the processes
+### Using `top` to monitor running jobs
+Once the interactive job begins, you can run `top` to view the processes
 on the node you are on:
 
 {{< figure src="/images/21070056.png" height="400" >}}
 
-Output for top displays each running process on the node. From the above
+Output for `top` displays each running process on the node. From the above
 image, we can see the various MATLAB processes being run by user
 cathrine98. To filter the list of processes, you can type `u` followed
 by the username of the user who owns the processes. To exit this screen,
 press `q`.
 
-During a running job, the cgroup folder is created which contains much
-of the information used by sacct. These files can provide a live
-overview of resources used for a running job. To access the cgroup
-files, you will need to be in an interactive job on the same node as the
-monitored job. To view specific files, and information, use one of the
-following commands:
+### Using `cgget` to monitor running jobs
+During a running job, the `cgroup` folder is created on the node where the job
+is running. This folder contains much of the information used by `sacct`.
+However, while `sacct` reports information gathered every 30 seconds, the
+`cgroup` files are updated more frequently and can detect quick spikes in
+resource usage missed by `sacct`. Thus, using the `cgroup` files can give more
+accurate information, especially regarding RAM usage.
 
-##### To view current memory usage:
+One way to access the `cgroup` files with `cgget` is to start an interactive job
+on the same node as the monitored job.
+Then, to view specific files and information, use one of the following commands:
 
+##### To view current memory usage:
 {{< highlight bash >}}
-less /cgroup/memory/slurm/uid_<UID>/job_<SLURM_JOB_ID>/memory.usage_in_bytes
+cgget -r memory.usage_in_bytes /slurm/uid_<UID>/job_<SLURM_JOBID>/
 {{< /highlight >}}
 
-Where `<UID>` is replaced by your UID and `<SLURM_JOB_ID>` is
-replaced by the monitored job's Job ID as assigned by Slurm.
+where `<UID>` is replaced by your UID and `<SLURM_JOBID>` is
+replaced by the monitored job's Job ID as assigned by SLURM.
 
 {{% notice note %}}
-To find your uid, use the command `id -u`. Your UID never changes and is
+To find your `uid`, use the command `id -u`. Your UID never changes and is
 the same on all HCC clusters (*not* on Anvil, however!).
 {{% /notice %}}
 
-##### To view maximum memory usage from start of job to current point:
+##### To view the total CPU time, in nanoseconds, consumed by the job:
+{{< highlight bash >}}
+cgget -r cpuacct.usage /slurm/uid_<UID>/job_<SLURM_JOBID>/
+{{< /highlight >}}
+
+
+Since the `cgroup` files are available only while the job is running, another
+way of accessing the information from these files is through the submit script.
+To track, for example, the maximum memory usage of a job, you can add
 {{< highlight bash >}}
-cat /cgroup/memory/slurm/uid_${UID}/job_${SLURM_JOBID}/memory.max_usage_in_bytes
+cgget -r memory.max_usage_in_bytes /slurm/uid_${UID}/job_${SLURM_JOBID}/
 {{< /highlight >}}
+at the end of your submit file. Unlike the previous examples, you do not need to
+modify this command: here `UID` and `SLURM_JOBID` are variables that will be set
+when the job is submitted.
+
+For information on more parameters that can be used with `cgget`, please check [here](https://reposcope.com/man/en/1/cgget).
+
+We also provide a script, `mem_report`, that reports the current and maximum memory usage for a job.
+This script is a wrapper for the `cgget` commands shown above
+and generates user-friendly output. To use this script, you need to add
+{{< highlight bash >}}
+mem_report
+{{< /highlight >}}
+at the end of your submit script.
+
+`mem_report` can also be run as part of an interactive job:
+
+{{< highlight bash >}}
+[demo13@c0218.crane ~]$ mem_report
+Current memory usage for job 25745709 is: 2.57 MBs
+Maximum memory usage for job 25745709 is: 3.27 MBs
+{{< /highlight >}}
+
+When `cgget` and `mem_report` are used as part of the submit script, the respective
+output is printed in the generated SLURM log files, unless otherwise specified.
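
A note on the figures in the patch above: the `cgroup` memory files report raw byte counts, and a wrapper such as `mem_report` presumably converts them to megabytes for display. A minimal sketch of that conversion is below; the `bytes` value is hard-coded here for illustration, whereas on a compute node it would come from the `cgget` commands shown in the patch:

```shell
# Hypothetical sketch: convert a raw byte count (as reported by
# memory.usage_in_bytes) into megabytes for readability.
# The value below is a made-up example, not live cgget output.
bytes=2694840
awk -v b="$bytes" 'BEGIN { printf "%.2f MBs\n", b / (1024 * 1024) }'
```

Dividing by `1024 * 1024` yields binary megabytes, matching the style of the `mem_report` output shown in the patch.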