---
title: Submitting Jobs
summary: "How to submit jobs to HCC resources"
weight: 5
---
Swan is managed by
the [SLURM](https://slurm.schedmd.com) resource manager.
In order to run processing on Swan, you
must create a SLURM script that will run your processing. After
submitting the job, SLURM will schedule your processing on an available
worker node.
Before writing a submit file, you may need to
[compile your application](/applications/user_software/).
- [Ensure proper working directory for job output](#ensure-proper-working-directory-for-job-output)
- [Creating a SLURM Submit File](#creating-a-slurm-submit-file)
- [Submitting the job](#submitting-the-job)
- [Checking Job Status](#checking-job-status)
- [Checking Job Start](#checking-job-start)
- [Removing the Job](#removing-the-job)
- [Next Steps](#next-steps)
### Ensure proper working directory for job output
!!! note
    All SLURM job output should be directed to your /work path.
!!! note "Manual specification of /work path"
```bash
$ cd /work/[groupname]/[username]
```
The environment variable `$WORK` can also be used.
!!! note "Using environment variable for /work path"
```bash
$ cd $WORK
$ pwd
/work/[groupname]/[username]
```
Review how /work differs from /home [here.](/handling_data)
### Creating a SLURM Submit File
!!! note
The below example is for a serial job. For submitting MPI jobs, please look at the [MPI Submission Guide.](submitting_an_mpi_job/)
A SLURM submit file is broken into two sections: the job description and
the processing. SLURM job description lines are prefixed with `#SBATCH` in
the submit file.
**SLURM Submit File**
```bash
#!/bin/bash
#SBATCH --time=03:15:00 # Run time in hh:mm:ss
#SBATCH --mem-per-cpu=1024 # Maximum memory required per CPU (in megabytes)
#SBATCH --job-name=hello-world
module load example/test
hostname
sleep 60
```
- **time**
Maximum walltime the job can run. After this time has expired, the
job will be stopped.
- **mem-per-cpu**
Memory that is allocated per core for the job. If you exceed this
memory limit, your job will be stopped.
- **mem**
Specify the real memory required per node in MegaBytes. If you
exceed this limit, your job will be stopped. Note that you
should ask for less memory than each node actually has. For Swan, the
max is 2000GB.
- **job-name**
The name of the job. Will be reported in the job listing.
- **partition**
The partition the job should run in. Partitions determine the job's
priority and what nodes the partition can run on. See the
[Partitions](/submitting_jobs/partitions) page for a list of possible partitions.
- **error**
Location where stderr will be written for the job. `[groupname]`
and `[username]` should be replaced with your group name and username.
Your username can be retrieved with the command `id -un` and your
group with `id -ng`.
- **output**
Location where stdout will be written for the job.
More advanced submit commands can be found on the [SLURM Docs](https://slurm.schedmd.com/sbatch.html).
You can also find an example of an MPI submission on [Submitting an MPI Job](submitting_an_mpi_job).
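Putting the options above together, a complete serial submit file might look like the
following sketch; the module name is the placeholder used in the example above, and the
error/output paths should point at your own `/work` locations.
!!! note "example.slurm"
    ```bash
    #!/bin/bash
    #SBATCH --time=03:15:00          # Run time in hh:mm:ss
    #SBATCH --mem-per-cpu=1024       # Maximum memory required per CPU (in megabytes)
    #SBATCH --job-name=hello-world
    #SBATCH --error=/work/[groupname]/[username]/job.%J.err
    #SBATCH --output=/work/[groupname]/[username]/job.%J.out

    # Load any modules your application needs (placeholder module shown)
    module load example/test

    # The actual processing: print the worker node name and wait briefly
    hostname
    sleep 60
    ```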
### Submitting the job
Submitting the SLURM job is done with the `sbatch` command. SLURM will read
the submit file, and schedule the job according to the description in
the submit file.
Submitting the job described above is:
!!! note "SLURM Submission"
```bash
$ sbatch example.slurm
Submitted batch job 24603
```
The job was successfully submitted.
### Checking Job Status
Job status is found with the `squeue` command. It will provide
information such as:
- The State of the job:
- **R** - Running
- **PD** - Pending - Job is awaiting resource allocation.
- Additional codes are available
- Nodes running the job
Checking the status of the job is easiest by filtering by your username,
using the `-u` option to `squeue`.
```bash
$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
24605 batch hello-wo <username> R 0:56 1 b01
```
Additionally, if you want to see the status of a specific partition, for
example if you are part of a [partition](/submitting_jobs/partitions),
you can use the `-p` option to `squeue`:
```bash
$ squeue -p guest
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
73435 guest MyRandom demo01 R 10:35:20 1 ri19n10
73436 guest MyRandom demo01 R 10:35:20 1 ri19n12
73735 guest SW2_driv demo02 R 10:14:11 1 ri20n07
73736 guest SW2_driv demo02 R 10:14:11 1 ri20n07
```
#### Checking Job Start
You may view the start time of your job with the
command `squeue --start`. The output of the command will show the
expected start time of the jobs.
```bash
$ squeue --start --user demo03
JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)
5822 batch python demo03 PD 2013-06-08T00:05:09 3 (Priority)
5823 batch python demo03 PD 2013-06-08T00:07:39 3 (Priority)
5824 batch python demo03 PD 2013-06-08T00:09:09 3 (Priority)
5825 batch python demo03 PD 2013-06-08T00:12:09 3 (Priority)
5826 batch python demo03 PD 2013-06-08T00:12:39 3 (Priority)
5827 batch python demo03 PD 2013-06-08T00:12:39 3 (Priority)
5828 batch python demo03 PD 2013-06-08T00:12:39 3 (Priority)
5829 batch python demo03 PD 2013-06-08T00:13:09 3 (Priority)
5830 batch python demo03 PD 2013-06-08T00:13:09 3 (Priority)
5831 batch python demo03 PD 2013-06-08T00:14:09 3 (Priority)
5832 batch python demo03 PD N/A 3 (Priority)
```
The output shows the expected start time of the jobs, as well as the
reason that the jobs are currently idle (in this case, low priority of
the user due to running numerous jobs already).
#### Removing the Job
Removing the job is done with the `scancel` command. The only argument
to the `scancel` command is the job id. For the job above, the command
is:
```bash
$ scancel 24605
```
### Next Steps
!!! tip "Looking to reduce your wait time on Swan?"
HCC wants to hear more about your research! If you acknowledge HCC in your publications, posters, or journal articles, you can receive a boost in priority on Swan!
Details on the process and requirements are available in the [HCC Acknowledgement Credit](./hcc_acknowledgment_credit.md) documentation page.
- [Application Specific Guides](./app_specific)
- [Monitoring Jobs](./monitoring_jobs.md)
- [Creating an Interactive Job](./creating_an_interactive_job.md)
- [Submitting a GPU Job](./submitting_gpu_jobs.md)
- [GPU Job Monitoring and Optimization](./monitoring_GPU_usage.md)
- [Submitting an MPI Job](./submitting_an_mpi_job.md)
- [Submitting an OpenMP Job](./submitting_an_openmp_job.md)
- [Submitting a Job Array](./submitting_a_job_array.md)
- [Setting up Dependent Jobs](./job_dependencies.md)
- [Available Partitions on Swan](./partitions/swan_available_partitions.md)
---
title: Job Dependencies
summary: "How to use job dependencies with the SLURM scheduler."
weight: 55
---
The job dependency feature of SLURM is useful when you need to run
multiple jobs in a particular order. A standard example of this is a
workflow where later jobs use the output of earlier jobs as input.
This example is usually referred to as a "diamond" workflow: Jobs
B and C both depend on Job A completing before they can run. Job D then
depends on Jobs B and C completing.
<img src="/images/4980738.png" width="400">
The SLURM submit files for each step are below.
{{%expand "JobA.submit" %}}
{{< highlight batch >}}
#!/bin/sh
!!! note "JobA.submit"
```bat
#!/bin/bash
#SBATCH --job-name=JobA
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
echo "I'm job A"
echo "Sample job A output" > jobA.out
sleep 120
```
{{%expand "JobB.submit" %}}
{{< highlight batch >}}
#!/bin/sh
!!! note "JobB.submit"
```bat
#!/bin/bash
#SBATCH --job-name=JobB
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
cat jobA.out >> jobB.out
echo "" >> jobB.out
echo "Sample job B output" >> jobB.out
sleep 120
```
{{%expand "JobC.submit" %}}
{{< highlight batch >}}
#!/bin/sh
!!! note "JobC.submit"
```bat
#!/bin/bash
#SBATCH --job-name=JobC
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
cat jobA.out >> jobC.out
echo "" >> jobC.out
echo "Sample job C output" >> jobC.out
sleep 120
```
{{%expand "JobC.submit" %}}
{{< highlight batch >}}
#!/bin/sh
!!! note "JobD.submit"
```bat
#!/bin/bash
#SBATCH --job-name=JobD
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
cat jobC.out >> jobD.out
echo "" >> jobD.out
echo "Sample job D output" >> jobD.out
sleep 120
```
To start the workflow, submit Job A first:
!!! note "Submit Job A"
```bash
[demo01@login.swan demo01]$ sbatch JobA.submit
Submitted batch job 666898
```
Now submit jobs B and C, using the job id from Job A to indicate the
dependency:
!!! note "Submit Jobs B and C"
```bash
[demo01@login.swan demo01]$ sbatch -d afterok:666898 JobB.submit
Submitted batch job 666899
[demo01@login.swan demo01]$ sbatch -d afterok:666898 JobC.submit
Submitted batch job 666900
```
Finally, submit Job D as depending on both jobs B and C:
!!! note "Submit Job D"
```bash
[demo01@login.swan demo01]$ sbatch -d afterok:666899:666900 JobD.submit
Submitted batch job 666901
```
Running `squeue` will now show all four jobs. The output from `squeue`
will also indicate that Jobs B, C, and D are in a pending state because
of the dependency.
!!! note "Squeue Output"
```bash
[demo01@login.swan demo01]$ squeue -u demo01
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
666899 batch JobB demo01 PD 0:00 1 (Dependency)
666900 batch JobC demo01 PD 0:00 1 (Dependency)
666901 batch JobD demo01 PD 0:00 1 (Dependency)
666898 batch JobA demo01 R 0:52 1 c2409
```
As each job completes successfully, SLURM will run the job(s) in the
workflow as resources become available.
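Typing the job IDs by hand, as above, is error-prone for longer workflows. `sbatch --parsable` prints only the job ID, so the whole diamond can be scripted; a minimal sketch, assuming the four submit files above are in the current directory:
```bash
#!/bin/bash
# Submit Job A and capture its job ID (--parsable prints just the ID)
JOBA=$(sbatch --parsable JobA.submit)

# Jobs B and C may start only after Job A completes successfully
JOBB=$(sbatch --parsable -d afterok:$JOBA JobB.submit)
JOBC=$(sbatch --parsable -d afterok:$JOBA JobC.submit)

# Job D waits for both B and C
sbatch -d afterok:$JOBB:$JOBC JobD.submit
```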
---
title: GPU Monitoring and Optimizing
summary: "How to monitor GPU usage in real time and optimize GPU performance."
weight: 60
---
This document provides a comprehensive guide to monitoring GPU usage and optimizing GPU performance on the HCC. Its goal is to help you identify GPU bottlenecks in your jobs and offer instructions for optimizing GPU resource utilization.
### Table of Contents
- [Measuring GPU Utilization in Real Time](#measuring-gpu-utilization-in-real-time)
- [Logging and Reporting GPU Utilization](#logging-and-reporting-gpu-utilization)
- [nvidia-smi](#nvidia-smi)
- [TensorBoard](#tensorboard)
- [How to Improve Your GPU Utilization](#how-to-improve-your-gpu-utilization)
- [Maximize Parallelism](#maximize-parallelism)
- [Memory Management and Optimization](#memory-management-and-optimization)
- [Use Shared Memory Effectively](#use-shared-memory-effectively)
- [Avoid Memory Divergence](#avoid-memory-divergence)
- [Reduce Memory Footprint](#reduce-memory-footprint)
- [Minimize CPU-GPU Memory Transferring Overhead](#minimize-cpu-gpu-memory-transferring-overhead)
- [How to Improve Your GPU Utilization for Deep Learning Jobs](#how-to-improve-your-gpu-utilization-for-deep-learning-jobs)
- [Maximize Batch Size](#maximize-batch-size)
- [Optimize Data Loading and Preprocessing](#optimize-data-loading-and-preprocessing)
- [Optimize Model Architecture](#optimize-model-architecture)
- [Common Oversights](#common-oversights)
- [Overlooking GPU-CPU Memory Transfer Costs](#overlooking-gpu-cpu-memory-transfer-costs)
- [Not Leveraging GPU Libraries](#not-leveraging-gpu-libraries)
- [Not Handling GPU-Specific Errors](#not-handling-gpu-specific-errors)
- [Neglecting Multi-GPU Scalability](#neglecting-multi-gpu-scalability)
### Measuring GPU Utilization in Real Time
You can use the `nvidia-smi` command to monitor GPU usage in real time. This tool provides details on GPU memory usage and utilization. To monitor a job, you need access to the same node where the job is running.
!!! warning
If the job to be monitored is using all available resources for a node, the user will not be able to obtain a simultaneous interactive job.
Once the job has been submitted and is running, you can request an interactive session on the same node using the following srun command:
```bash
srun --jobid=<JOB_ID> --pty bash
```
where `<JOB_ID>` is replaced by the job ID for the monitored job as assigned by SLURM.
After getting access to the node, use the following command to monitor GPU performance in real time:
```bash
watch -n 1 nvidia-smi
```
<img src="/images/nvidia-smi_example.png" width="700">
Note that `nvidia-smi` only shows the process ID (`PID`) of the running GPU jobs. If multiple jobs are running on the same node, you'll need to match the `PID` to your job using the `top` command. Start `top` as follows:
```bash
top
```
In top, the `PID` appears in the first column, and your login ID is shown in the `USER` column. Use this to identify the process corresponding to your job.
<img src="/images/srun_top.png" width="700">
### Logging and Reporting GPU Utilization
#### nvidia-smi
You can use `nvidia-smi` to periodically log GPU usage to CSV files for later analysis. It is convenient to add this to the SLURM submit script instead of running it interactively as shown above. To do this, wrap your job command with the following in your SLURM submission script. This will generate three files in your `$WORK` directory:
1. **`gpu_usage_log.csv`**: contains overall GPU performance data, including GPU utilization, memory utilization, and total GPU memory.
2. **`pid_gpu_usage_log.csv`**: logs GPU usage for each process, including the process ID (PID) and GPU memory used by each process. Note that, to match a specific PID with overall GPU performance in the generated file, use the GPU bus ID.
3. **`pid_lookup.txt`**: provides the process ID to help identify which one corresponds to your job in the GPU records.
Note that the job ID will be appended to the file names to help match the logs with your specific job.
```bash
curpath=`pwd`
cd $WORK
nohup nvidia-smi --query-gpu=timestamp,index,gpu_bus_id,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -f gpu_usage_log.csv-$SLURM_JOB_ID -l 1 > /dev/null 2>&1 &
gpumonpid=$!
nohup nvidia-smi --query-compute-apps=timestamp,gpu_bus_id,pid,used_memory --format=csv -f pid_gpu_usage_log-$SLURM_JOB_ID.csv -l 1 > /dev/null 2>&1 &
gpumonprocpid=$!
nohup top -u <LOGIN-ID> -d 10 -c -b -n 2 > pid_lookup-$SLURM_JOB_ID.txt 2>&1 &
cd $curpath
<YOUR_JOB_COMMAND>
kill $gpumonpid
kill $gpumonprocpid
```
where `<LOGIN-ID>` is replaced by your HCC login ID and `<YOUR_JOB_COMMAND>` is replaced by your job command. A complete example SLURM submit script that utilizes this approach can be found [here](https://github.com/unlhcc/job-examples/tree/master/tensorflow_gpu_tracking).
#### TensorBoard
If your deep learning job utilizes libraries such as `TensorFlow` or `PyTorch`, you can use TensorBoard to monitor and visualize GPU usage metrics, including GPU utilization, memory consumption, and model performance. TensorBoard provides real-time insights into how your job interacts with the GPU, helping you optimize performance and identify bottlenecks.
To monitor GPU usage with `TensorBoard`, refer to the specific instructions of `TensorFlow` or `PyTorch` to enable logging with `TensorBoard` in your job code:
1. **`TensorFlow`** - [TensorFlow Profiler Guide](https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras)
2. **`PyTorch`** - [PyTorch Profiler with TensorBoard](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html)
On Swan, TensorBoard is available as [Open OnDemand App](https://hcc.unl.edu/docs/open_ondemand/virtual_desktop_and_interactive_apps/).
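If you prefer the command line to the Open OnDemand app, you can also start TensorBoard yourself and point it at the log directory your job writes. A minimal sketch, assuming TensorBoard is installed in your Python environment and your profiler traces are under `$WORK/tb_logs`:
```bash
# Serve the TensorBoard UI for the logs written by your training job
tensorboard --logdir $WORK/tb_logs --port 6006
```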
### How to Improve Your GPU Utilization
Improving GPU utilization means maximizing both the computational and memory usage of the GPU to ensure your program fully utilizes the GPU's processing power. Low utilization can result from various bottlenecks, including improper parallelism, insufficient memory management, or CPU-GPU communication overhead.
#### Maximize Parallelism
The GPU is powerful because of its parallel processing capabilities. Your job should leverage parallelism effectively:
1. **Optimize grid and block dimensions**: configure your thread and block settings to match your job's data size to fully utilize GPU cores.
2. **Occupancy**: use tools like CUDA’s occupancy calculator to determine the best number of threads per block that maximizes utilization.
3. **Streamlining parallel tasks**: CUDA streams can be used to execute multiple operations concurrently. This allows for overlapping computation on the GPU with data transfers, improving efficiency.
#### Memory Management and Optimization
##### Use Shared Memory Effectively
Shared memory is a small, high-speed memory located on the GPU. It can be used to reduce global memory access latency by storing frequently used data. Use shared memory to cache data that is repeatedly accessed by multiple threads.
##### Avoid Memory Divergence
Memory divergence occurs when threads in a warp access non-contiguous memory locations, resulting in multiple memory transactions. To minimize divergence:
- **Align memory access**: ensure that threads in a warp access contiguous memory addresses.
- **Use memory coalescing**: organize memory access patterns to allow for coalesced memory transactions, reducing the number of memory accesses required.
##### Reduce Memory Footprint
Excessive memory use can lead to spills into slower global memory. Minimize your program’s memory footprint by:
- **Freeing unused memory**: always release memory that is no longer needed.
- **Optimizing data structures**: use more compact data structures and reduce precision when possible (e.g., using floats instead of doubles).
#### Minimize CPU-GPU Memory Transferring Overhead
Data transfer between the CPU and GPU is often a bottleneck in scientific programs. It is essential to minimize these transfers to improve overall GPU performance. Here are some tips:
1. **Batch data transfers**: transfer large chunks of data at once rather than sending small bits frequently.
2. **Asynchronous memory transfers**: use non-blocking memory transfer operations (e.g., cudaMemcpyAsync for CUDA) to allow computation and data transfer to overlap.
3. **Pin memory**: use pinned (page-locked) memory on the CPU for faster transfer of data to and from the GPU.
### How to Improve Your GPU Utilization for Deep Learning Jobs
In deep learning, GPUs are a key component for accelerating model training and inference due to their ability to handle large-scale matrix operations and parallelism. Below are tips to maximize GPU utilization in deep learning jobs.
#### Maximize Batch Size
Batch size refers to the number of training samples processed simultaneously. Larger batch sizes improve GPU utilization by increasing the workload per step. The batch size should fit within the GPU’s memory constraints:
- **Larger batch sizes**: result in better utilization but require more memory.
- **Gradient accumulation**: if GPU memory limits are reached, you can accumulate gradients over several smaller batches before performing a parameter update, effectively simulating larger batch sizes.
#### Optimize Data Loading and Preprocessing
Data loading can become a bottleneck, causing the GPU to idle while waiting for data.
- **Parallel data loading**: load data in parallel to speed up (e.g., using libraries like PyTorch’s `DataLoader` or TensorFlow’s `tf.data` pipeline).
- **Prefetch data**: use techniques (e.g., double-buffering) to overlap data preprocessing and augmentation with model computation, enabling data to be fetched in advance. This helps reduce the GPU idle time.
#### Optimize Model Architecture
Model architecture impacts the GPU utilization. Here are some optimization tips:
- **Reduce memory bottlenecks**: avoid excessive use of operations that cause memory overhead (e.g., deep recursive layers).
- **Improve parallelism**: use layers that can exploit parallelism (e.g., convolutions, matrix multiplications).
- **Prune unnecessary layers**: prune your model by removing layers or neurons that don’t contribute significantly to the output, reducing computation time and improving efficiency.
### Common Oversights
#### Overlooking GPU-CPU Memory Transfer Costs
Memory transfers between CPU and GPU can be expensive, and excessive data movement can reverse the performance gains offered by parallelism on GPU.
#### Not Leveraging GPU Libraries
There are highly optimized libraries available for GPU-accelerated algorithms, such as linear algebra and FFTs. Always check for these libraries before implementing your own solution, as they are often more efficient and reliable.
#### Not Handling GPU-Specific Errors
GPU computation errors can lead to silent failures, making debugging extremely difficult. For example, insufficient memory on the GPU or illegal memory access can go undetected without proper error handling.
#### Neglecting Multi-GPU Scalability
Many programs are initially designed for single-GPU execution and lack support for multiple GPUs. Make sure your program is optimized for multi-GPU execution before scaling up to request multiple GPU resources.
---
title: Monitoring Jobs
summary: "How to find out information about running and completed jobs."
weight: 55
---
Careful examination of running times, memory usage and output files will
allow you to ensure the job completed correctly and give you a good idea
of what memory and time limits to request in the future.
### Monitoring Completed Jobs:
#### seff
The `seff` command provides a quick summary of a single job's resource utilization and efficiency after it has been completed, including status, wall usage, runtime, and memory usage of a job:
```bash
seff <JOB_ID>
```
<img src="/images/slurm_seff_1.png" height="250">
!!! note
1. `seff` gathers resource utilization every 30 seconds, so it is possible for some peak utilization to be missed in the report.
2. For multi-node jobs, the `Memory Utilized` reported by `seff` is for **one node only**.
For a more accurate report, please use `sacct` instead.
#### sacct
To see the runtime and memory usage of a job that has completed, use the
sacct command:
```bash
sacct
```
Lists all jobs by the current user and displays information such as
JobID, JobName, State, and ExitCode.
<img src="/images/sacct_generic.png" height="150">
Coupling this command with the --format flag will allow you to see more
than the default information about a job. Fields to display should be
listed as a comma separated list after the --format flag (without
spaces). For example, to see the Elapsed time and Maximum used memory by
a job, this command can be used:
```bash
sacct --format JobID,JobName,Elapsed,MaxRSS
```
<img src="/images/sacct_format.png" height="150">
Additional arguments and format field information can be found in
[the SLURM documentation](https://slurm.schedmd.com/sacct.html).
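To inspect a single completed job instead of your whole history, the job ID can be passed with the `-j` option; for example, combining it with a few commonly useful fields:
```bash
# Elapsed time, peak memory, final state, and exit code for one job
sacct -j <JOB_ID> --format=JobID,JobName,Elapsed,MaxRSS,State,ExitCode
```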
### Monitoring Running Jobs:
There are two ways to monitor running jobs, the `top` command and
monitoring the `cgroup` files using the utility `cgget`. `top` is helpful
when monitoring multi-process jobs, whereas the `cgroup` files provide
information on memory usage. Both of these tools require the use of an
interactive job on the same node as the job to be monitored while the job
is running.
!!! warning
If the job to be monitored is using all available resources for a node,
the user will not be able to obtain a simultaneous interactive job.
After the job to be monitored is submitted and has begun to run, request
an interactive job on the same node using the srun command:
```bash
srun --jobid=<JOB_ID> --pty bash
```
where `<JOB_ID>` is replaced by the job id for the monitored job as
assigned by SLURM.
Alternatively, you can request the interactive job by nodename as follows:
```bash
srun --nodelist=<NODE_ID> --pty bash
```
where `<NODE_ID>` is replaced by the name of the node where the monitored
job is running. This information can be found by looking at the
`squeue` output under the `NODELIST` column.
<img src="/images/srun_node_id.png" width="700">
### Using `top` to monitor running jobs
Once the interactive job begins, you can run `top` to view the processes
on the node you are on:
<img src="/images/srun_top.png" height="400">
Output for `top` displays each running process on the node. From the above
image, we can see the various MATLAB processes being run by user
hccdemo. To filter the list of processes, you can type `u` followed
by the username of the user who owns the processes. To exit this screen,
press `q`.
### Using `cgget` to monitor running jobs
During a running job, the `cgroup` folder is created on the node where the job
is running. This folder contains much of the information used by `sacct`.
However, while `sacct` reports information gathered every 30 seconds, the
`cgroup` files are updated more frequently and can detect quick spikes in
resource usage missed by `sacct`. Thus, using the `cgroup` files can give more
accurate information, especially regarding the RAM usage.
One way to access the `cgroup` files with `cgget` is to start an interactive job
on the same node as the monitored job. Then, to view specific files and information,
use one of the following commands:
##### To view current memory usage:
```bash
cgget -r memory.usage_in_bytes /slurm/uid_<UID>/job_<SLURM_JOBID>/
```
where `<UID>` is replaced by your UID and `<SLURM_JOBID>` is
replaced by the monitored job's Job ID as assigned by SLURM.
!!! note
To find your `uid`, use the command `id -u`. Your UID never changes and is the same on all HCC clusters (*not* on Anvil, however!).
##### To view the total CPU time, in nanoseconds, consumed by the job:
```bash
cgget -r cpuacct.usage /slurm/uid_<UID>/job_<SLURM_JOBID>/
```
Since the `cgroup` files are available only while the job is running, another
way of accessing the information from these files is through the submit script.
To track, for example, the maximum memory usage of a job, you can add
```bash
cgget -r memory.max_usage_in_bytes /slurm/uid_${UID}/job_${SLURM_JOBID}/
```
at the end of your submit file. Unlike the previous examples, you do not need to
modify this command - here `UID` and `SLURM_JOBID` are variables that will be set
when the job is submitted.
For information on more variables that can be used with `cgget`, please check [here](https://reposcope.com/man/en/1/cgget).
We also provide a script, `mem_report`, that reports the current and maximum
memory usage for a job. This script is a wrapper for the `cgget` commands shown above
and generates user-friendly output. To use this script, you need to add
```
mem_report
```
at the end of your submit script.
`mem_report` can also be run as part of an interactive job:
```bash
[demo13@c0218.swan ~]$ mem_report
Current memory usage for job 25745709 is: 2.57 MBs
Maximum memory usage for job 25745709 is: 3.27 MBs
```
When `cgget` and `mem_report` are used as part of the submit script, the respective output
is printed in the generated SLURM log files, unless otherwise specified.
### Monitoring Queued Jobs:
The queue on Swan is fair-share, which means your job's priority depends on how long the job has been waiting in the queue, past usage of the cluster, the job size, memory and time requested, etc. It is also affected by the number of jobs waiting in the queue and how many resources are available on the cluster. The more jobs you submit and run, the lower your priority for running additional jobs on the cluster becomes.
You can check when your jobs will be running on the cluster using the command:
```bash
sacct -u <user_id> --format=start
```
To check the expected start time for a specific job, you can use the following command:
```bash
sacct -u <user_id> --job=<job_id> --format=start
```
Finally, you can check your fair-share score by running the following command:
```bash
sshare --account=<group_name> -a
```
After you run the above command you will be able to see your fair-share score.
- If your fair-share score is 1.0, your account has not run any jobs recently (unused).
- If your fair-share score is 0.5, your account has average utilization: on average, the account is using exactly as much as its granted share.
- If your fair-share score is between 0 and 0.5, your account has higher than average utilization and has overused its granted share.
- Finally, if your fair-share score is 0, there is no share left: the account has vastly overused its granted share. If there is no contention for resources, the jobs will still start.
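For more context around the score itself, `sshare` also accepts a `--format` option; a sketch showing a few standard fields (replace `<group_name>` with your group):
```bash
# Show raw shares, accumulated usage, and the resulting fair-share score for your group
sshare --account=<group_name> -a --format=Account,User,RawShares,RawUsage,FairShare
```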
!!! note "Job Wait Time"
Fairshare priority is not the only factor in how long a job takes to start. The SLURM scheduler needs to find a time where resources are available.
Larger jobs or jobs requiring GPUs may take longer to start in queue while SLURM waits for resources to be available.
Another way to reduce your job's wait time is to have [Priority Access](https://hcc.unl.edu/priority-access-pricing).
---
title: Available Partitions
summary: "Listing of partitions on Swan."
weight: 70
---
Partitions are used on Swan to distinguish different
resources. You can view the partitions with the command `sinfo`.
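The default `sinfo` output groups nodes by partition and state; for a compact per-partition summary you can select fields with `--format`. A sketch using standard format specifiers:
```bash
# Summarize each partition: name, availability, time limit, node count, CPUs and memory per node
sinfo --format="%P %a %l %D %c %m"
```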
### Swan:
Swan has two shared public partitions available for use: the default partition, `{{ hcc.swan.partition.default }}`, and the GPU-enabled partition, `{{ hcc.swan.partition.gpu }}`.
When you submit a job on Swan without specifying a partition, it will automatically use the `{{ hcc.swan.partition.default }}` partition.
| Partition Name | Notes |
|----------------------------------|--------------------------------------------------|
| {{ hcc.swan.partition.default }} | Default Partition </br></br> Does not have GPUs |
| {{ hcc.swan.partition.gpu }} | Shared partition with GPUs |
On Swan jobs have a maximum runtime of 7 days, can request up to 2000 cores per user, and run up to 1000 jobs.
#### Worker Node Configuration
The standard configuration of a Swan worker node is:
| Configuration | Value |
|-----------------|--------|
| Cores | 56 |
| Memory | 250 GB |
| Scratch Storage | 3.5 TB |
Some Swan worker nodes are equipped with additional memory, with up to 2TB of memory available in some nodes.
##### GPU Enabled Worker Nodes
For GPU enabled worker nodes in the {{ hcc.swan.partition.gpu }} partition, the following GPUs are available:
{%
include-markdown "../submitting_gpu_jobs.md"
start="requirements if necessary."
end="### Specifying GPU memory (optional)"
%}
Additional GPUs are available in the `guest_gpu` partition, but jobs running on this partition will be preemptable. Details on how the partition operates are available below in [Guest Partition(s)](#guest-partitions). The GPUs in this partition are listed in the [partition list](swan_available_partitions/) for Swan under the [priority access partitions](#ownedpriority-access-partitions).
!!! warning "Resource requests and utilization"
Please make sure your applications and software support the resources you are requesting.
Many applications are only able to use a single worker node and may not scale well with large numbers of cores.
Please review our information on how many resources to request in our [FAQ](/FAQ/#how-many-nodesmemorytime-should-i-request)
For GPU monitoring and resource requests, please review our page on [monitoring and optimizing GPU resources](submitting_jobs/monitoring_GPU_usage/)
[A full list of partitions is available for Swan](swan_available_partitions/)
### SLURM Quality of Service
Swan has two Quality of Service (QoS) types available which help manage how a job gets scheduled.
Overall limitations on maximum job wall time, CPUs, etc. are set for
all jobs with the default setting (when the "--qos=" option is omitted)
and for "short" jobs (described below) on Swan.
The limitations are shown in the following table.
| | SLURM Specification | Max Job Run Time | Max CPUs per User | Max Jobs per User |
| ------- | -------------------- | ---------------- | ----------------- | ----------------- |
| Default | Leave blank | 7 days | 2000 | 1000 |
| Short | #SBATCH --qos=short | 6 hours | 16 | 2 |
Please also note that the memory and
local hard drive limits are subject to the physical limitations of the
nodes, described in the resources capabilities section of the
[HCC Documentation](/#resource-capabilities)
and the partition sections above.
#### Priority for short jobs
To run short jobs for testing and development work, a job can specify a
different quality of service (QoS). The *short* QoS increases a job's
priority so it will run as soon as possible.
| SLURM Specification |
|----------------------- |
| `#SBATCH --qos=short` |
!!! warning "Limits per user for 'short' QoS"
- 6 hour job run time
- 2 jobs of 16 CPUs or fewer
- No more than 256 CPUs in use for *short* jobs from all users
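The QoS can be set in the submit file as shown above, or directly on the `sbatch` command line; a quick sketch, assuming a submit file named `example.slurm`:
```bash
# Submit a small test job with the higher-priority short QoS
sbatch --qos=short example.slurm
```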
### Owned/Priority Access Partitions
A partition marked as owned by a group means only specific groups are
allowed to submit jobs to that partition. Groups are manually added to
the list allowed to submit jobs to the partition. If you are unable to
submit jobs to a partition, and you feel that you should be, please
contact [hcc-support@unl.edu](mailto:hcc-support@unl.edu).
To submit jobs to an owned partition, use the SLURM `--partition` option. Jobs
can either be submitted *only* to an owned partition, or to *both* the owned
partition and the general access queue. For example, assuming a partition
named `mypartition`:
!!! note "Submit only to an owned partition"
```bash
#SBATCH --partition=mypartition
```
Submitting solely to an owned partition means jobs will start immediately until
the resources on the partition are full, then queue until prior jobs finish and
resources become available.
!!! note "Submit to both an owned partition and general queue"
```bash
#SBATCH --partition=mypartition,batch
```
Submitting to both an owned partition and `batch` means jobs will run on both the owned
partition and the general batch queue. Jobs will start immediately until the resources
on the partition are full, then queue. Pending jobs will then start either on the owned partition
or in the general queue, wherever resources become available first
(taking into account FairShare). Unless there are specific reasons to limit jobs
to owned resources, this method is recommended to maximize job throughput.
[A full list of partitions is available for Swan](swan_available_partitions/)
### Guest Partition(s)
The `guest` partition can be used by users and groups that do not own
dedicated resources on Swan. Jobs running in the `guest` partition
will run on the owned resources with Intel OPA interconnect. The jobs
are preempted when the resources are needed by the resource owners:
guest jobs will be killed and returned to the queue in a pending state
until they can be started on another node.
HCC recommends verifying job behavior will support the restart and
modifying job scripts if necessary.
To submit your job to the guest partition, add the line
!!! note "Submit to guest partition"
```bash
#SBATCH --partition=guest
```
to your submit script.
Owned GPU resources may also be accessed in an opportunistic manner by
submitting to the `guest_gpu` partition. Similar to `guest`, jobs are
preempted when the GPU resources are needed by the owners. To submit
your job to the `guest_gpu` partition, add the lines
!!! note "Submit to guest_gpu partition"
```bash
#SBATCH --partition=guest_gpu
#SBATCH --gres=gpu
```
to your SLURM script.
#### Preventing job restart
By default, jobs on the `guest` partition will be restarted elsewhere when they
are preempted. To prevent preempted jobs from being restarted add the line
!!! note "Prevent job restart on guest partition"
```bash
#SBATCH --no-requeue
```
to your SLURM submit file.
---
title: Available Partitions for Swan
summary: "List of available partitions for swan.unl.edu."
---
### Swan:
{{ json_table("docs/static/json/swan_partitions.json") }}
---
title: Submitting a Job Array
summary: "How to use job arrays with the SLURM scheduler."
weight: 30
---
A job array is a set of jobs that share the same submit file, but will
run multiple copies with an environment variable incremented. These are
useful when you need to run the same application multiple times.
### Creating an Array Submit File
An array submit file is very similar to the example submit files
in [Submitting Jobs](/submitting_jobs/).
!!! note "example.slurm"
```bash
#!/bin/bash
#SBATCH --array=0-31
#SBATCH --time=03:15:00 # Run time in hh:mm:ss
#SBATCH --mem-per-cpu=1024 # Minimum memory required per CPU (in megabytes)
module load example/test
echo "I am task $SLURM_ARRAY_TASK_ID on node `hostname`"
sleep 60
```
The submit file above will output the `$SLURM_ARRAY_TASK_ID`, which will
be different for every one of the 32 (0-31) jobs, to the output files.
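A common pattern is to use `$SLURM_ARRAY_TASK_ID` to select a different input for each task. A minimal sketch, assuming hypothetical input files named `input_0.txt` through `input_31.txt`; the application name is a placeholder:
```bash
# Inside the array submit file: pick this task's input by its array index
INPUT=input_${SLURM_ARRAY_TASK_ID}.txt

# Run the same application on a different file in every array task
./my-app < $INPUT > output_${SLURM_ARRAY_TASK_ID}.txt
```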
---
title: Submitting an MPI Job
summary: "How to submit an MPI job on HCC resources."
weight: 40
---
This script requests 16 cores on nodes with InfiniBand:
!!! note "mpi.submit"
```bash
#!/bin/bash
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=1024
#SBATCH --time=03:15:00
#SBATCH --error=/work/[groupname]/[username]/job.%J.err
#SBATCH --output=/work/[groupname]/[username]/job.%J.out
module load compiler/gcc/8.2 openmpi/2.1
mpirun /home/[groupname]/[username]/mpiprogram
```
The above job will allocate 16 cores on the default partition. The 16
cores could be on any of the nodes in the partition, even split between
multiple nodes.
### Advanced Submission
Some users may prefer to specify more details. This will allocate 32
tasks, 16 on each of two nodes:
!!! note "mpi.submit"
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=1024
#SBATCH --time=03:15:00
#SBATCH --error=/work/[groupname]/[username]/job.%J.err
#SBATCH --output=/work/[groupname]/[username]/job.%J.out
module load compiler/gcc/8.2 openmpi/2.1
mpirun /home/[groupname]/[username]/mpiprogram
```
---
title: Submitting an OpenMP Job
summary: "How to submit an OpenMP job on HCC resources."
weight: 45
---
Submitting an OpenMP job is different from
[Submitting an MPI Job](../submitting_an_mpi_job/)
since you must request multiple cores from a single node.
!!! note "OpenMP example submission"
```bash
#!/bin/bash
#SBATCH --ntasks-per-node=16 # 16 cores
#SBATCH --nodes=1 # 1 node
#SBATCH --mem-per-cpu=1024 # Minimum memory required per CPU (in megabytes)
export OMP_NUM_THREADS=${SLURM_NTASKS_PER_NODE}
./openmp-app.exe
```
Notice that we used `ntasks-per-node` to specify the number of cores we
want on a single node. Additionally, we specify that we only want
one node, and set `OMP_NUM_THREADS` so that the number of OpenMP threads will
automatically match the `ntasks-per-node` value (in this example 16).
### Compiling
Directions to compile OpenMP can be found on
[Compiling an OpenMP Application](/applications/user_software/compiling_an_openmp_application/).
### Further Documentation
---
title: Submitting GPU Jobs
summary: "How to submit GPU (CUDA/OpenACC) jobs on HCC resources."
weight: 35
---
### Available GPUs
Swan has two types of GPUs available in the `gpu` partition. The
type of GPU is configured as a SLURM feature, so you can specify a type
of GPU in your job resource requirements if necessary.
| Description | SLURM Feature | Available Hardware |
| -------------------- | ------------- | ---------------------------- |
| Tesla V100, with 10GbE | gpu_v100 | 1 node - 4 GPUs with 16 GB per node |
| Tesla V100, with OPA | gpu_v100 | 21 nodes - 2 GPUs with 32GB per node |
| Tesla V100S | gpu_v100 | 4 nodes - 2 GPUs with 32GB per node |
| Tesla T4 | gpu_t4 | 12 nodes - 2 GPUs with 16GB per node |
| NVIDIA A30 | gpu_a30 | 2 nodes - 4 GPUs with 24GB per node |
### Specifying GPU memory (optional)
You may optionally specify a GPU memory amount via the use of an additional feature statement.
The available memory specifications are:
| Description | SLURM Feature |
| -------------- | ------------- |
| 12 GB RAM | gpu_12gb |
| 16 GB RAM | gpu_16gb |
| 24 GB RAM | gpu_24gb |
| 32 GB RAM | gpu_32gb |
### Requesting GPU resources in your SLURM script
To run your job on the next available GPU regardless of type, add the
following options to your srun or sbatch command:
```bash
--partition=gpu --gres=gpu
```
To run on a specific type of GPU, you can constrain your job to require
a feature. To run on V100 GPUs for example:
```bash
--partition=gpu --gres=gpu --constraint=gpu_v100
```
!!! note
You may request multiple GPUs by changing the `--gres` value to
`--gres=gpu:2`. Note that this value is **per node**. For example,
`--nodes=2 --gres=gpu:2` will request 2 nodes with 2 GPUs each, for a
total of 4 GPUs.
The GPU memory feature may be used to specify a GPU RAM amount either
independent of architecture, or in combination with it.
For example, using
```bash
--partition=gpu --gres=gpu --constraint=gpu_16gb
```
will request a GPU with 16GB of RAM, independent of the type of card
(V100, T4, etc.). You may also request both a GPU type _and_
memory amount using the `&` operator (single quotes are used because
`&` is a special character).
For example,
```bash
--partition=gpu --gres=gpu --constraint='gpu_32gb&gpu_v100'
```
will request a V100 GPU with 32GB RAM.
!!! warning
You must verify the GPU type and memory combination is valid based on the
[available GPU types](../submitting_gpu_jobs/#available-gpus).
Requesting a nonexistent combination will cause your job to be rejected with
a `Requested node configuration is not available` error.
### Compiling
Compilation of CUDA or OpenACC jobs must be performed on the GPU nodes.
Therefore, you must run an [interactive job](../creating_an_interactive_job/)
to compile. An example command to compile in the `gpu` partition could be:
```bash
$ srun --partition=gpu --gres=gpu --mem=4gb --ntasks-per-node=2 --nodes=1 --pty $SHELL
```
The above command will start a shell on a GPU node with 2 cores and 4GB
of RAM in order to compile a GPU job. The above command could also be
useful if you want to run a test GPU job interactively.
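Once the interactive shell on the GPU node starts, compilation proceeds as usual; a minimal sketch, assuming a hypothetical CUDA source file named `cuda-app.cu`:
```bash
# Load the CUDA toolkit and build the executable used in the submit script below
module load cuda
nvcc -o cuda-app.exe cuda-app.cu
```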
### Submitting Jobs
CUDA and OpenACC submissions require running on GPU nodes.
!!! note "cuda.submit"
```bash
#!/bin/bash
#SBATCH --time=03:15:00
#SBATCH --mem-per-cpu=1024
#SBATCH --job-name=cuda
#SBATCH --partition=gpu
#SBATCH --gres=gpu
#SBATCH --error=/work/[groupname]/[username]/job.%J.err
#SBATCH --output=/work/[groupname]/[username]/job.%J.out
module load cuda
./cuda-app.exe
```
OpenACC submissions require loading the PGI compiler (which is currently
required to compile as well).
!!! note "openacc.submit"
```bash
#!/bin/bash
#SBATCH --time=03:15:00
#SBATCH --mem-per-cpu=1024
#SBATCH --job-name=cuda-acc
#SBATCH --partition=gpu
#SBATCH --gres=gpu
#SBATCH --error=/work/[groupname]/[username]/job.%J.err
#SBATCH --output=/work/[groupname]/[username]/job.%J.out
module load cuda/8.0 compiler/pgi/16
./acc-app.exe
```
### Submitting Pre-emptable Jobs
Some GPU hardware is reserved by various groups for priority access. While the group that has
purchased the priority access will always have immediate access, HCC makes these nodes
available opportunistically. When not otherwise utilized, **jobs can run on these resources with the
limitation that they may be pre-empted (i.e. killed) at any time**.
To submit jobs to these resources, add the following to your srun or sbatch command:
```bash
--partition=guest_gpu --gres=gpu
```
**In order to properly utilize pre-emptable resources, your job must be able to support
some type of checkpoint/resume functionality.**
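What checkpoint/resume looks like is application-specific, but one common building block is reacting to the SIGTERM that SLURM sends when a job is preempted, so the application can save its state before the node is reclaimed. The sketch below is only one possible approach; the flag file and the `--resume-if-present` option are placeholders for your application's own checkpoint mechanism:
```bash
#!/bin/bash
#SBATCH --partition=guest_gpu
#SBATCH --gres=gpu

# When this job is preempted, SLURM sends SIGTERM before killing it;
# use the signal to ask the application to write a checkpoint.
trap 'touch checkpoint.flag' TERM

# Run in the background and wait, so the shell can handle the trap while the application runs
./my-gpu-app --resume-if-present checkpoint.flag &
wait
```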
import json
import requests
import markdown
import os
import glob
def define_env(env):
@env.macro
def children(path):
# Use: {{ children('path') }}
# Replace path with the directory path after docs/
# For example docs/handling_data would be a value of 'handling_data'
output = """
"""
dir_path = os.getcwd()+'/docs/'+path
for file in sorted(glob.glob(f'{dir_path}/*')):
# Handle sub-directories
if os.path.isdir(file):
file_path = file+'/index.md'
else:
file_path = file
if os.path.exists(file_path) and not file.endswith(f'{dir_path}/index.md'):
with open(file_path, 'r') as f:
title = ""
summary = ""
for line in f.readlines():
if line.strip().startswith('title:'):
title = line.split(':')[1].replace('"','')
if line.strip().startswith('summary:'):
summary = line.split(':')[1].replace('"','')
if summary:
summary = f'- Description: {summary}'
page_title = title.strip(' ').strip('\n')
# Handle sub-directories
if os.path.isdir(file):
url = file_path.split(path+'/')[-1]
else:
url = file_path.split(path+'/')[-1]
if title and not os.path.isdir(file): # If its the index page of a child dir
output += f"""### [{page_title}]({url.replace('.md','/')})
{summary}
"""
else: # If its the index page of a child dir
output += f"""### [{page_title}]({url})
{summary}
"""
### Full Content Return
return output.replace('`','')
@env.macro
def youtube(youtube_url):
# Based on https://github.com/UBCSailbot/sailbot_workspace/pull/374/files#diff-dd05ba889655ed64b86d9ffe222960b781cda2b9ec094f40d4744050eb6c0b2b
youtube_link = youtube_url if 'https' in youtube_url else f'https://www.youtube.com/embed/{youtube_url}'
return f'''<div class="video-wrapper">
<iframe width="560" height="315" src="{youtube_link}" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</div>'''
@env.macro
def json_table(table_url):
if 'http' in table_url:
table_data = requests.get(table_url).json()
else:
with open(table_url, 'r') as f:
table_data = json.load(f)
#table_data = requests.get(table_url).json()
table_generated = table_data['table_generated']
table_html = '''<p><em>last generated table_generated </em></p>
<div class="pager">
<img src="/images/tablesorter/first.png" class="first"/>
<img src="/images/tablesorter/prev.png" class="prev"/>
<!-- the "pagedisplay" can be any element, including an input -->
<span class="pagedisplay" data-pager-output-filtered="{startRow:input} &ndash; {endRow} / {filteredRows} of {totalRows} total rows"></span>
<img src="/images/tablesorter/next.png" class="next"/>
<img src="/images/tablesorter/last.png" class="last"/>
<select class="pagesize">
<option value="5">5</option>
<option value="10">10</option>
<option value="20">20</option>
<option value="30">30</option>
<option value="40">40</option>
<option value="all">All Rows</option>
</select>
<select class="gotoPage" title="Select page number">
<option value="1">1</option>
<option value="2">2</option>
<option value="3">3</option>
<option value="4">4</option>
<option value="5">5</option>
</select>
</div>
<table class="sorttable">
<thead>
<tr>
'''.replace('table_generated',table_generated)
# Add Headers
for header in table_data['table_header']:
table_html += f'<th>{header}</th>'
table_html += '</tr></thead><tbody>'
# Generate Rows
for row in table_data['table_data']:
table_html += '<tr>'
for entry in row:
table_html += f'<td>{entry}</td>'
table_html += '</tr>'
table_html += '</tbody></table>'
# Add Ending HTML
table_html += '''<div class="pager">
<img src="/images/tablesorter/first.png" class="first"/>
<img src="/images/tablesorter/prev.png" class="prev"/>
<!-- the "pagedisplay" can be any element, including an input -->
<span class="pagedisplay" data-pager-output-filtered="{startRow:input} &ndash; {endRow} / {filteredRows} of {totalRows} total rows"></span>
<img src="/images/tablesorter/next.png" class="next"/>
<img src="/images/tablesorter/last.png" class="last"/>
<select class="pagesize">
<option value="10">10</option>
<option value="20">20</option>
<option value="30">30</option>
<option value="40">40</option>
<option value="all">All Rows</option>
</select>
<select class="gotoPage" title="Select page number">
<option value="1">1</option>
<option value="2">2</option>
<option value="3">3</option>
<option value="4">4</option>
<option value="5">5</option>
</select>
</div>
'''
return table_html
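    # md_table(): render a Markdown table from a URL or local file as a sortable, paginated HTML table.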
@env.macro
def md_table(table_url):
if 'http' in table_url:
table_data = requests.get(table_url).content.decode("utf-8")
else:
with open(table_url, 'r') as f:
table_data = f.read()
table_html = '''<div class="pager">
<img src="/images/tablesorter/first.png" class="first"/>
<img src="/images/tablesorter/prev.png" class="prev"/>
<!-- the "pagedisplay" can be any element, including an input -->
<span class="pagedisplay" data-pager-output-filtered="{startRow:input} &ndash; {endRow} / {filteredRows} of {totalRows} total rows"></span>
<img src="/images/tablesorter/next.png" class="next"/>
<img src="/images/tablesorter/last.png" class="last"/>
<select class="pagesize">
<option value="5">5</option>
<option value="10">10</option>
<option value="20">20</option>
<option value="30">30</option>
<option value="40">40</option>
<option value="all">All Rows</option>
</select>
<select class="gotoPage" title="Select page number">
<option value="1">1</option>
<option value="2">2</option>
<option value="3">3</option>
<option value="4">4</option>
<option value="5">5</option>
</select>
</div>
'''
table_html += markdown.markdown(table_data, extensions=['markdown.extensions.tables']).replace('<table>','<table class="sorttable">')
# Add Ending HTML
table_html += '''<div class="pager">
<img src="/images/tablesorter/first.png" class="first"/>
<img src="/images/tablesorter/prev.png" class="prev"/>
<!-- the "pagedisplay" can be any element, including an input -->
<span class="pagedisplay" data-pager-output-filtered="{startRow:input} &ndash; {endRow} / {filteredRows} of {totalRows} total rows"></span>
<img src="/images/tablesorter/next.png" class="next"/>
<img src="/images/tablesorter/last.png" class="last"/>
<select class="pagesize">
<option value="10">10</option>
<option value="20">20</option>
<option value="30">30</option>
<option value="40">40</option>
<option value="all">All Rows</option>
</select>
<select class="gotoPage" title="Select page number">
<option value="1">1</option>
<option value="2">2</option>
<option value="3">3</option>
<option value="4">4</option>
<option value="5">5</option>
</select>
</div>
'''
return table_html
\ No newline at end of file
site_name: HCC-DOCS
dev_addr: 0.0.0.0:8080
site_url: https://hcc.unl.edu/docs/
copyright: Holland Computing Center | 118 Schorr Center, Lincoln NE 68588
repo_url: https://git.unl.edu/hcc/hcc-docs
repo_name: hcc/hcc-docs
edit_uri: edit/master/docs/
theme:
name: material
custom_dir: overrides
logo: images/site/UNMasterwhite.gif
favicon: images/site/favicon.png
icon:
repo: fontawesome/brands/gitlab
features:
- content.code.copy
- content.action.edit
- content.code.select
- navigation.indexes
- navigation.tracking
    - navigation.instant
- search.suggest
- announce.dismiss
#- toc.integrate # Moves right ToC under page in nav
- navigation.top
#- navigation.tabs -- We might use this for main HCC links like OOD
palette:
# Palette toggle for light mode
- scheme: default
toggle:
icon: material/brightness-7
name: Switch to dark mode
# Palette toggle for dark mode
- scheme: slate
toggle:
icon: material/brightness-4
name: Switch to light mode
plugins:
- search
- macros
- social
- awesome-nav
# - table-reader
- include-markdown
- mkdocs-nav-weight:
section_renamed: false
index_weight: -10
warning: true
reverse: false
headless_included: false
markdown_extensions:
- toc:
permalink: true
- abbr
- attr_list
- md_in_html
- def_list
- pymdownx.tasklist:
custom_checkbox: true
- admonition
- footnotes
- pymdownx.details
- pymdownx.superfences
- pymdownx.critic
- pymdownx.caret
- pymdownx.magiclink
- pymdownx.keys
- pymdownx.mark
- pymdownx.tilde
- pymdownx.tabbed:
alternate_style: true
slugify: !!python/object/apply:pymdownx.slugs.slugify
kwds:
case: lower
- tables
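# Values under 'extra' are exposed to pages by the macros plugin (e.g. {{ hcc.support_email }})
# and to theme templates (e.g. the social links below).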
extra:
hcc:
# General HCC Links and Info
support_email: "hcc-support@unl.edu"
website: "https://hcc.unl.edu"
status_page: "https://status.hcc.unl.edu"
ooh:
page: "https://hcc.unl.edu/ooh"
zoom: "http://go.unl.edu/HCChelp"
hours: "every Tuesday and Thursday from 2-3 PM central"
# Swan Cluster Specific Information
swan:
partition:
default: batch
gpu: gpu
qos_limits:
normal:
time: 7 days
cpu_per_user: 2000
jobs_per_user: 1000
short:
name: short
time: 6 hours
cpu_per_user: 16
jobs_per_user: 2
xfer: "swan-xfer.unl.edu"
ood:
name: "Swan Open OnDemand"
url: "https://swan-ood.unl.edu"
work:
block: "100 TiB"
inode: "5 Million"
purge: "6 Months"
path: "/work"
variable: "$WORK"
home:
block: "20 GiB"
inode: "1 Million"
purge: "None"
path: "/home"
variable: "$HOME"
# Anvil Cloud Information
anvil:
dashboard: "https://anvil.unl.edu/"
limits:
standard:
instance_count: "10"
cores: "20"
ram: "60 GB"
volume_count: "10"
volume_quota: "100 GB"
extended:
instance_count: "20"
cores: "40"
ram: "120 GB"
volume_count: "20"
volume_quota: "200 GB"
# Attic Archive Information
attic:
xfer: "swan-xfer.unl.edu"
price: "$28 / TB / Year"
# Common Filesystem Information
common:
block: "50 TiB"
inode: "5 Million"
purge: "None"
path: "/common"
variable: "$COMMON"
retirement_date: "July 1st, 2025"
# NRDStor Filesystem Information
nrdstor:
block: "50 TiB"
inode: "5 Million"
purge: "None"
path: "/mnt/nrdstor/"
variable: "$NRDSTOR"
mac_address: "smb://nrdstor.unl.edu/nrdstor"
windows_address: "\\\\nrdstor.unl.edu\\nrdstor"
linux_address: "\\\\\\\\nrdstor.unl.edu\\\\nrdstor"
nrp:
base_url: https://nrp.ai
docs_url: https://nrp.ai/documentation
resources:
url: https://portal.nrp.ai/resources
generator: true
social:
- icon: material/home
name: Website
link: https://hcc.unl.edu
- icon: fontawesome/regular/envelope
name: Send us an email or ticket!
link: mailto:hcc-support@unl.edu
- icon: material/phone-classic
name: Call Us
link: tel:4024725041
- icon: simple/zoom
name: Open Office Hours Every Tuesday and Thursday from 2 to 3PM Central
link: http://go.unl.edu/HCChelp
- icon: simple/github
name: HCC Github
link: https://github.com/unlhcc
- icon: simple/gitlab
name: HCC UNL Gitlab
link: https://git.unl.edu/hcc
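# jQuery tablesorter assets and sort-table.js power the paginated tables produced by the table macros.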
extra_javascript:
- https://cdnjs.cloudflare.com/ajax/libs/jquery/3.6.0/jquery.min.js
- https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/jquery.tablesorter.min.js
- https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/widgets/widget-pager.min.js
- https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/widgets/widget-filter.min.js
- js/sort-table.js # For the generated tables
- https://unpkg.com/tablesort@5.3.0/dist/tablesort.min.js
#- js/tablesort.js # For the standard markdown tables
extra_css:
- https://mottie.github.io/tablesorter/css/theme.dropbox.css
- https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/css/jquery.tablesorter.pager.min.css
- https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/css/filter.formatter.min.css
- stylesheets/extra.css
{% extends "base.html" %}
{% block announce %}
<h4>Filesystem Retirement - July 1, 2025 - More information on the <a href = "https://hcc.unl.edu/docs/FAQ/common_retirement/">Common Retirement FAQ page</a></h4>
{% endblock %}
#!/bin/bash
# Make the facilities document use Times New Roman, for NSF reasons
# Create a pandoc document, and change the font
PANTMP=$(mktemp -d)
pushd ${PANTMP}
echo "hello world" | pandoc -o reference.docx
unzip -q reference.docx word/theme/theme1.xml
sed -i 's/Calibri/Times New Roman/' word/theme/theme1.xml
sed -i 's/Cambria/Times New Roman/' word/theme/theme1.xml
zip -q -r --move reference.docx word
popd
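# Convert the facilities and class guidelines pages to .docx using the modified reference document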
pandoc --reference-doc=${PANTMP}/reference.docx -s docs/facilities.md -o site/facilities.docx
pandoc --reference-doc=${PANTMP}/reference.docx -s docs/static/markdown/class-guidelines.md -o site/class-guidelines.docx
rm -rf ${PANTMP}