diff --git a/content/FAQ/_index.md b/content/FAQ/_index.md
index 6f9f00a4c8a8ad6e830c85e2a30dd758052a05e4..2f0385175773edc91c96b23875b26f032f97ed28 100644
--- a/content/FAQ/_index.md
+++ b/content/FAQ/_index.md
@@ -12,6 +12,8 @@ weight = "95"
 - [How many nodes/memory/time should I request?](#how-many-nodes-memory-time-should-i-request)
 - [I am trying to run a job but nothing happens?](#i-am-trying-to-run-a-job-but-nothing-happens)
 - [I keep getting the error "slurmstepd: error: Exceeded step memory limit at some point." What does this mean and how do I fix it?](#i-keep-getting-the-error-slurmstepd-error-exceeded-step-memory-limit-at-some-point-what-does-this-mean-and-how-do-i-fix-it)
+- [I keep getting the error "Some of your processes may have been killed by the cgroup out-of-memory handler." What does this mean and how do I fix it?](#i-keep-getting-the-error-some-of-your-processes-may-have-been-killed-by-the-cgroup-out-of-memory-handler-what-does-this-mean-and-how-do-i-fix-it)
+- [I keep getting the error "Job cancelled due to time limit." What does this mean and how do I fix it?](#i-keep-getting-the-error-job-cancelled-due-to-time-limit-what-does-this-mean-and-how-do-i-fix-it)
 - [I want to talk to a human about my problem. Can I do that?](#i-want-to-talk-to-a-human-about-my-problem-can-i-do-that)
 - [My submitted job takes long time waiting in the queue or it is not running?](#my-submitted-job-takes-long-time-waiting-in-the-queue-or-it-is-not-running)
 - [What IP's do I use to allow connections to/from HCC resources?](#what-ip-s-do-i-use-to-allow-connections-to-from-hcc-resources)
@@ -136,7 +138,7 @@ with your login, the name of the cluster you are running on, and the full path
 to your submit script and we will be happy to help solve the issue.
 
-##### I keep getting the error "slurmstepd: error: Exceeded step memory limit at some point." What does this mean and how do I fix it?
+#### I keep getting the error "slurmstepd: error: Exceeded step memory limit at some point." What does this mean and how do I fix it?
 
 This error occurs when the job you are running uses more memory than was
 requested in your submit script.
 
@@ -162,6 +164,57 @@ If you continue to run into issues, please contact us at
 {{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
 for additional assistance.
 
+#### I keep getting the error "Some of your processes may have been killed by the cgroup out-of-memory handler." What does this mean and how do I fix it?
+
+This is another error that occurs when the job you are running uses more memory than was
+requested in your submit script.
+
+If you specified `--mem` or `--mem-per-cpu` in your submit script, try
+increasing this value and resubmitting your job.
+
+If you did not specify `--mem` or `--mem-per-cpu` in your submit script,
+chances are the default amount allotted is not sufficient. Add the line
+
+{{< highlight batch >}}
+#SBATCH --mem=<memory_amount>
+{{< /highlight >}}
+
+to your script with a reasonable amount of memory and try running it again. If you keep
+getting this error, continue to increase the requested memory amount and
+resubmit the job until it finishes successfully.
+
+For additional details on how to monitor usage on jobs, check out the
+documentation on [Monitoring Jobs]({{< relref "monitoring_jobs" >}}).
+
+If you continue to run into issues, please contact us at
+{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
+for additional assistance.
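+
+For example, if a job killed by the out-of-memory handler had been running with the default
+allocation, a resubmission requesting 8 GB (an illustrative amount, not a recommendation) would
+include a line such as
+
+{{< highlight batch >}}
+#SBATCH --mem=8G
+{{< /highlight >}}
+
+After the job completes, a command along the lines of `sacct -j <job_id> --format=JobID,MaxRSS,Elapsed`
+reports the maximum memory the job actually used (MaxRSS), which can guide the next request.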
+
+#### I keep getting the error "Job cancelled due to time limit." What does this mean and how do I fix it?
+
+This error occurs when the job you are running reaches the time limit that was
+requested in your submit script before it finishes successfully.
+
+If you specified `--time` in your submit script, try
+increasing this value and resubmitting your job.
+
+If you did not specify `--time` in your submit script,
+chances are the default runtime of 1 hour is not sufficient. Add the line
+
+{{< highlight batch >}}
+#SBATCH --time=<runtime>
+{{< /highlight >}}
+
+to your script with an increased runtime value and try running it again. The maximum runtime on Swan
+is 7 days (168 hours).
+
+For additional details on how to monitor usage on jobs, check out the
+documentation on [Monitoring Jobs]({{< relref "monitoring_jobs" >}}).
+
+If you continue to run into issues, please contact us at
+{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
+for additional assistance.
+
 #### I want to talk to a human about my problem. Can I do that?
 
 Of course! We have an open door policy and invite you to ~~stop by
diff --git a/content/good_hcc_practices/_index.md b/content/good_hcc_practices/_index.md
index b1b1bf15926f47318cd08d743e09c68dc6eba1f9..192312f787762f81103eb59a2c792765585c411b 100644
--- a/content/good_hcc_practices/_index.md
+++ b/content/good_hcc_practices/_index.md
@@ -80,6 +80,17 @@ the memory and time requirements appropriately.
 tools such as [Allinea Performance Reports]({{< relref "/applications/app_specific/allinea_profiling_and_debugging/allinea_performance_reports" >}})
 and [mem_report]({{< relref "monitoring_jobs" >}}). While these tools can not predict the needed resources, they can
 provide useful information the researcher can use the next time that particular application is run.
+* **Before you request a GPU in your submit script, make sure that the application and code you are
+using support it.** Only code that is written to use GPUs can take advantage of GPU nodes. Please read the documentation of
+the application and code you are using to see whether a GPU can be used. Requesting a GPU that your code cannot use
+increases your waiting time in the queue and leaves the GPU idle. It is very important to request GPUs only when your code
+and application can efficiently utilize them (see the example request at the end of this page).
+* **Before you request multiple GPUs in your submit script, make sure that the application and code you are
+using support that.** Only code that is written for multi-GPU execution can take advantage of multiple GPUs. Please read the
+documentation of the application and code you are using to see whether multiple GPUs can be used.
+Requesting more GPUs than your code can use increases your waiting time in the queue and leaves the extra GPUs idle.
+It is very important to request multiple GPUs only when your code and application can efficiently utilize them.
+
 We strongly recommend you to read and follow this guidance. If you have any concerns about your
 workflows or need any assistance, please contact HCC Support at {{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu).
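+
+As a concrete illustration of the GPU guidance above, a submit script that requests a single GPU
+usually combines a GPU partition with a `--gres` request. The partition name, memory, runtime,
+module, and executable below are placeholders; check the HCC documentation on submitting GPU jobs
+for the values that apply to your cluster and allocation.
+
+{{< highlight batch >}}
+#!/bin/bash
+#SBATCH --job-name=gpu_example
+#SBATCH --partition=gpu
+#SBATCH --gres=gpu:1
+#SBATCH --mem=16G
+#SBATCH --time=02:00:00
+
+# "gpu" and "gpu:1" above are placeholders: use the GPU partition available on your cluster and
+# request more than one GPU only if your code can actually use them.
+module load cuda               # example: load the toolkit your GPU-enabled application needs
+./my_gpu_application           # placeholder for your GPU-enabled executable
+{{< /highlight >}}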