+++
title = "Good HCC Practices"
description = "Guidelines for good HCC practices"
weight = "95"
+++
Crane and Rhino, our two high-performance clusters, are shared among all our users.
At times, one user's activities may negatively impact the clusters and the other users on them.
To avoid this, we provide the following guidelines for good HCC practices.
## Login Node
* **Do not run jobs on the login node.** The login node is shared among all users and
should be used only for light tasks, such as moving and editing files, compiling programs,
and submitting and monitoring jobs. If a researcher runs a computationally intensive task
on the login node, it will negatively impact performance for the other users. For any CPU-
or memory-intensive operations, such as testing and running applications, use an
[interactive session]({{< relref "creating_an_interactive_job" >}}) (a sketch of requesting
one is shown after this list), or [submit a job to the batch queue]({{< relref "submitting_jobs" >}}).
* **Do not launch multiple simultaneous processes on the login node.** This may include using
lots of threads for compiling applications, or checking the job status multiple times a minute.
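
For reference, a minimal sketch of requesting an interactive session with SLURM is shown below.
The resource values are illustrative placeholders, and the exact options and limits are
cluster-specific; see the linked pages for the settings that apply on Crane and Rhino.

```bash
# Request an interactive shell on a worker node instead of working on the login node.
# The resource values below are placeholders - adjust them to your needs.
srun --nodes=1 --ntasks-per-node=1 --mem=4G --time=2:00:00 --pty $SHELL
```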
## File Systems
* Some I/O intensive jobs may benefit from **copying the data to the fast, temporary /scratch
file system local to each worker node**. The */scratch* directories are unique per job, and
are deleted when the job finishes. Thus, the last step of the batch script should copy the
needed output files from */scratch* to either */work* or */common*; this workflow is sketched
after this list. Please see the [Running BLAST Alignment]({{< relref "running_blast_alignment" >}})
page for a complete example. Currently, we do not have a quota on the */scratch* file system.
* */work* has two quotas - one for **file count** and one for **disk space**.
Reaching these quotas puts additional stress on the file system. Therefore, please make sure you
monitor these quotas regularly (a quick way to check a directory's usage is sketched after this
list), and delete files that are no longer needed or copy them to a more permanent location.
* */work* is intended as a **temporary location for storing job outputs and files**. After that,
all the necessary files need to be either moved to permanent storage or deleted.
* **Avoid rapidly opening and closing many files, as well as frequently reading and writing to
disk, in your program.** These access patterns stress the file system and can cause issues for
all users. Instead, consider reading and writing large blocks of data in memory over time, or
utilizing more advanced parallel I/O libraries, such as *parallel hdf5* and *parallel netcdf*.
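
As a sketch of the */scratch* workflow described above, the batch script below stages input
from */work* to the node-local */scratch*, runs there, and copies results back as its last
step. The application name, paths, and resource values are hypothetical placeholders, not a
specific HCC recipe.

```bash
#!/bin/bash
#SBATCH --time=4:00:00        # placeholder resource requests
#SBATCH --mem=8G
#SBATCH --ntasks-per-node=1

# Stage the input data to the fast, node-local /scratch file system.
cp /work/mygroup/myuser/input.dat /scratch/
cd /scratch

# Run the application against the local copy (my_app is a placeholder).
/work/mygroup/myuser/my_app input.dat > output.dat

# /scratch is deleted when the job finishes, so copying the results
# back must be the last step of the script.
cp /scratch/output.dat /work/mygroup/myuser/
```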
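
And since */work* has both a disk-space and a file-count quota, standard Linux tools can give a
rough view of how much a directory contributes to each; the path below is a placeholder, and
HCC may additionally provide its own quota-reporting utilities.

```bash
# Disk space used by a directory (counts toward the disk-space quota).
du -sh /work/mygroup/myuser

# Number of files underneath it (counts toward the file-count quota).
find /work/mygroup/myuser -type f | wc -l
```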
## Internal and External Networks
* **Transferring many files to/from/within the cluster can harm the file system.** If you are
transferring many small files, please pack them into an archive file format, so that the many
files are replaced by a single one. We recommend *zip* as the archive format, because zip files
keep an index of their contents: they can be quickly indexed by the various zip tools, and they
allow extraction of all files or only a subset (see the sketch after this list). The *tar*
formats, by contrast, are stream-oriented, and a full decompression is required for the tools
to know whether the requested files have been found.
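
To illustrate, the commands below pack a directory of many small files into a single zip
archive before transfer, then use the archive's index to list and selectively extract files;
the directory and file names are placeholders.

```bash
# Pack many small files into one archive before transferring it.
zip -r results.zip results/

# List the archive's index without extracting anything.
unzip -l results.zip

# Extract everything, or only a subset of the files.
unzip results.zip
unzip results.zip "results/sample01/*"
```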
## Running Applications
* **Before you request multiple nodes and cores in your submit script, make sure that the
application you are using supports that.** MPI applications can utilize multiple nodes and
cores, while threaded or OpenMP applications are limited to a single node. Requesting resources
the application cannot use increases the researcher's waiting time in the queue and may even
hurt the application's performance.
* Threaded and OpenMP applications can utilize multiple cores within a node. However, most of
the applications **do not perform significantly better when more than 16 cores are used**. On
the other hand, requesting more cores increases the waiting time for resources in the queue, so
please make sure you request a reasonable number of cores.
* If an application uses multiple threads or cores, that number needs to be specified with the
*"--ntasks-per-node"* or *"--ntasks"* options of SLURM (see the first sketch after this list).
If you use multiple threads or cores with your application but don't specify the respective
SLURM options, your application will use only 1 core by default.
* **Do not submit a large number of short (less than half an hour of running time) SLURM jobs.**
The scheduler spends more time and memory processing those jobs, which may cause problems and
reduce the scheduler's responsiveness for everyone. Instead, group the short tasks into jobs
that will run longer (see the second sketch after this list).
* **The maximum running time on our clusters is 7 days.** If your job needs more time than that,
please consider improving the code, splitting the job into smaller tasks, or using checkpointing
tools such as [DMTCP]({{< relref "dmtcp_checkpointing" >}}).
* Before submitting a job, make sure that **you are executing the application correctly, you are
passing the right arguments, and you don't have typos**. You can verify this using an
[interactive session]({{< relref "creating_an_interactive_job" >}}). Otherwise, your job may wait
in the queue for resources only to fail immediately because of a typo or a missing argument.
* If no memory, time, and core requirements are specified in your SLURM submit script, the
**default resources allocated are 1GB of RAM, 1 hour of running time, and a single CPU core**,
respectively. Oftentimes, these resources are not enough. If a job is terminated, there is a
high chance the reason is exceeded resources, so please make sure you set the memory and time
requirements appropriately.
* The run time and memory usage depend heavily on the application and the data used. You can
monitor your application's needs with tools such as
[Allinea Performance Reports]({{< relref "/applications/app_specific/allinea_profiling_and_debugging/allinea_performance_reports" >}})
and [mem_report]({{< relref "monitoring_jobs" >}}). While these tools cannot predict the needed
resources, they can provide useful information the researcher can use the next time that
particular application is run.
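
Tying the points above together, here is a sketch of a submit script that requests time,
memory, and cores explicitly for a threaded (OpenMP) application; the values and the
application name are placeholders, not recommendations.

```bash
#!/bin/bash
#SBATCH --time=12:00:00       # override the 1-hour default
#SBATCH --mem=16G             # override the 1GB default
#SBATCH --nodes=1             # threaded/OpenMP codes cannot span nodes
#SBATCH --ntasks-per-node=8   # match the number of threads the code will use

# Tell the OpenMP application how many cores it was actually allocated.
export OMP_NUM_THREADS=$SLURM_NTASKS_PER_NODE

./my_threaded_app input.dat
```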
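
And as a sketch of grouping short tasks: instead of submitting hundreds of few-minute jobs, a
single longer job can loop over the work items. The file layout and application are hypothetical.

```bash
#!/bin/bash
#SBATCH --time=6:00:00        # one longer job instead of many short ones
#SBATCH --mem=4G
#SBATCH --ntasks-per-node=1

# Process all inputs sequentially within a single job, so the scheduler
# tracks one job rather than hundreds of short-lived ones.
for input in /work/mygroup/myuser/inputs/*.dat; do
    ./my_app "$input" > "${input%.dat}.out"
done
```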
We strongly recommend that you read and follow this guidance. If you have any concerns about
your workflows or need any assistance, please contact HCC Support at
{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu).