Add GPU monitoring documentation.
You can use `nvidia-smi` to periodically log GPU usage to CSV files for later analysis. Rather than running it interactively as shown above, it is convenient to add this to your SLURM submission script by wrapping your job command with an `nvidia-smi` logging loop. The wrapped job will then generate three files in your `$WORK` directory.
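A minimal sketch of such a wrapper is shown below, assuming a single-GPU job. The `#SBATCH` directives, query fields, and output file name (here a single `gpu_usage_${SLURM_JOB_ID}.csv`) are illustrative and will differ from the exact snippet and the three files described in this documentation.

```bash
#!/bin/bash
#SBATCH --job-name=gpu-monitor-demo   # illustrative job name
#SBATCH --gres=gpu:1                  # request one GPU (syntax may differ on your cluster)
#SBATCH --time=01:00:00

# Start nvidia-smi in the background, appending a CSV sample every 30 seconds.
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used,memory.total \
           --format=csv -l 30 > "$WORK/gpu_usage_${SLURM_JOB_ID}.csv" &
MONITOR_PID=$!

# Run the actual job command (placeholder for your own workload).
srun python train.py

# Stop the background logger once the job finishes.
kill "$MONITOR_PID"
```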
If your deep learning job uses a framework such as `TensorFlow` or `PyTorch`, you can use TensorBoard to monitor and visualize metrics such as GPU utilization, memory consumption, and model performance. TensorBoard provides real-time insight into how your job interacts with the GPU, helping you optimize performance and identify bottlenecks.
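As a rough sketch, assuming your training script already writes TensorBoard logs (for example via the framework's profiler or summary writer) to `$WORK/tb_logs`, you could inspect them from your workstation by forwarding a port to the cluster. The hostname, port, and log directory below are placeholders.

```bash
# On your local machine: forward local port 6006 to the cluster login node
# (replace user and login.cluster.example.org with your account and host).
ssh -L 6006:localhost:6006 user@login.cluster.example.org

# On the login node: point TensorBoard at the log directory written by your job.
tensorboard --logdir "$WORK/tb_logs" --port 6006

# Then open http://localhost:6006 in your local browser.
```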
Improving GPU utilization means maximizing both the computational and memory usage of the GPU so that your program fully exploits the GPU's processing power. Low utilization can result from various bottlenecks, including improper parallelism, insufficient memory management, or CPU-GPU communication overhead.
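One quick way to check whether you are hitting such a bottleneck is to watch the device counters live while the job runs; this sketch assumes you can open a shell on the compute node running your job (how to obtain one depends on your cluster's policy).

```bash
# Watch per-GPU compute utilization (sm), memory activity, and framebuffer usage,
# refreshing every 5 seconds. Consistently low 'sm' values during the compute
# phase often point to a data-loading or CPU-side bottleneck rather than the GPU.
nvidia-smi dmon -s um -d 5
```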