Update docs

Commit 23ff55c9 authored 1 year ago by Natasha Pavlovikj
--- a/content/FAQ/_index.md
+++ b/content/FAQ/_index.md
@@ -12,6 +12,8 @@ weight = "95"
 - [How many nodes/memory/time should I request?](#how-many-nodes-memory-time-should-i-request)
 - [I am trying to run a job but nothing happens?](#i-am-trying-to-run-a-job-but-nothing-happens)
 - [I keep getting the error "slurmstepd: error: Exceeded step memory limit at some point." What does this mean and how do I fix it?](#i-keep-getting-the-error-slurmstepd-error-exceeded-step-memory-limit-at-some-point-what-does-this-mean-and-how-do-i-fix-it)
+- [I keep getting the error "Some of your processes may have been killed by the cgroup out-of-memory handler." What does this mean and how do I fix it?](#i-keep-getting-the-error-some-of-your-processes-may-have-been-killed-by-the-cgroup-out-of-memory-handler-what-does-this-mean-and-how-do-i-fix-it)
+- [I keep getting the error "Job cancelled due to time limit." What does this mean and how do I fix it?](#i-keep-getting-the-error-job-cancelled-due-to-time-limit-what-does-this-mean-and-how-do-i-fix-it)
 - [I want to talk to a human about my problem. Can I do that?](#i-want-to-talk-to-a-human-about-my-problem-can-i-do-that)
 - [My submitted job takes long time waiting in the queue or it is not running?](#my-submitted-job-takes-long-time-waiting-in-the-queue-or-it-is-not-running)
 - [What IP's do I use to allow connections to/from HCC resources?](#what-ip-s-do-i-use-to-allow-connections-to-from-hcc-resources)
@@ -136,7 +138,7 @@ with your login, the name of the cluster you are running on, and the
 full path to your submit script and we will be happy to help solve the
 issue.

-##### I keep getting the error "slurmstepd: error: Exceeded step memory limit at some point." What does this mean and how do I fix it?
+#### I keep getting the error "slurmstepd: error: Exceeded step memory limit at some point." What does this mean and how do I fix it?

 This error occurs when the job you are running uses more memory than was
 requested in your submit script.
@@ -162,6 +164,57 @@ If you continue to run into issues, please contact us at
 {{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
 for additional assistance.

+#### I keep getting the error "Some of your processes may have been killed by the cgroup out-of-memory handler." What does this mean and how do I fix it?
+
+This is another error that occurs when the job you are running uses more memory than was
+requested in your submit script.
+
+If you specified `--mem` or `--mem-per-cpu` in your submit script, try
+increasing this value and resubmitting your job.
+
+If you did not specify `--mem` or `--mem-per-cpu` in your submit script,
+chances are the default amount allotted is not sufficient. Add the line
+
+{{< highlight batch >}}
+#SBATCH --mem=<memory_amount>
+{{< /highlight >}}
+
+to your script with a reasonable amount of memory and try running it again. If you keep
+getting this error, continue to increase the requested memory amount and
+resubmit the job until it finishes successfully.
+
+For additional details on how to monitor usage on jobs, check out the
+documentation on [Monitoring Jobs]({{< relref "monitoring_jobs" >}}).
+
+If you continue to run into issues, please contact us at
+{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
+for additional assistance.
+
+#### I keep getting the error "Job cancelled due to time limit." What does this mean and how do I fix it?
+
+This error occurs when the job you are running reaches the time limit that was
+requested in your submit script before it finishes.
+
+If you specified `--time` in your submit script, try
+increasing this value and resubmitting your job.
+
+If you did not specify `--time` in your submit script,
+chances are the default runtime of 1 hour is not sufficient. Add the line
+
+{{< highlight batch >}}
+#SBATCH --time=<runtime>
+{{< /highlight >}}
+
+to your script with an increased runtime value and try running it again. The maximum runtime on Swan
+is 7 days (168 hours).
+
+For additional details on how to monitor usage on jobs, check out the
+documentation on [Monitoring Jobs]({{< relref "monitoring_jobs" >}}).
+
+If you continue to run into issues, please contact us at
+{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
+for additional assistance.
+
 #### I want to talk to a human about my problem. Can I do that?

 Of course! We have an open door policy and invite you to ~~stop by

--- a/content/good_hcc_practices/_index.md
+++ b/content/good_hcc_practices/_index.md
@@ -80,6 +80,17 @@ the memory and time requirements appropriately.
 tools such as [Allinea Performance Reports]({{< relref "/applications/app_specific/allinea_profiling_and_debugging/allinea_performance_reports" >}}) 
 and [mem_report]({{< relref "monitoring_jobs" >}}). While these tools can not predict the needed resources, they can provide 
 useful information the researcher can use the next time that particular application is run.
+* **Before you request a GPU in your submit script, make sure that the application and code you are
+using support it.** Only code that is written to support GPUs can take advantage of GPU nodes. Please read the documentation of
+the application and code you are using to see if a GPU can be used. Requesting a GPU that your code cannot use may lengthen your
+waiting time in the queue and leave resources underused. It is very important to request GPUs only when your code and application can
+efficiently utilize them.
+* **Before you request multiple GPUs in your submit script, make sure that the application and code you are
+using support it.** Only code that is written to support multiple GPUs can take advantage of multiple GPUs. Please read the
+documentation of the application and code you are using to see if multiple GPUs can be used.
+Requesting more GPUs than your code can use may lengthen your waiting time in the queue and leave resources underused.
+It is very important to request multiple GPUs only when your code and application can efficiently utilize them.
+

 We strongly recommend you to read and follow this guidance. If you have any concerns about your workflows or need any 
 assistance, please contact HCC Support at {{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu).
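
Note on the new time-limit FAQ entry: a small helper could make the `<runtime>` placeholder more concrete for readers. The sketch below is not part of this commit. The function name `hours_to_slurm_time` is hypothetical, and the 168-hour cap is taken from the Swan limit stated in the FAQ text above. It only formats a whole number of hours in the `days-hours:minutes:seconds` form that `--time` accepts.

```shell
#!/bin/bash
# Hypothetical helper (not part of HCC tooling): convert whole hours into
# a Slurm --time value of the form D-HH:MM:SS.
# Assumption: the 168-hour maximum comes from the FAQ text for Swan.
hours_to_slurm_time() {
  local max_hours=168
  # Accept only non-negative whole numbers.
  if ! [[ "$1" =~ ^[0-9]+$ ]]; then
    echo "runtime must be a whole number of hours" >&2
    return 1
  fi
  # Force base 10 so that input such as "08" is not parsed as octal.
  local hours=$((10#$1))
  if (( hours < 1 || hours > max_hours )); then
    echo "runtime must be between 1 and ${max_hours} hours" >&2
    return 1
  fi
  printf '%d-%02d:00:00\n' $(( hours / 24 )) $(( hours % 24 ))
}
```

Assuming the function has been sourced into the shell, a job could then be resubmitted with a longer limit using something like `sbatch --time="$(hours_to_slurm_time 36)" submit.sh`, where `submit.sh` stands in for your own submit script.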