Add page for Good HCC practices

Merged Natasha Pavlovikj requested to merge practices into master
+++
 
title = "Good HCC Practices"
 
description = "Guidelines for good HCC practices"
 
weight = "95"
 
+++
 
 
Crane and Rhino, our two high-performance clusters, are shared among all our users.
 
Sometimes, one user's activities may negatively impact the clusters and, in turn, the work of other users.
 
To avoid this, we provide the following guidelines for good HCC practices.
 
 
## Login Node
 
* **Be kind to the login node.** The login node is shared among all users and it
 
should be used only for light tasks, such as moving and editing files, compiling programs,
 
and submitting and monitoring jobs. If a researcher runs a computationally intensive task
 
on the login node, it will negatively impact performance for all other users. Moreover, the
 
resources on the login node are limited, so any lengthy or intensive task will very likely

exceed them and be terminated. For any CPU- or memory-intensive
 
operations, such as testing and running applications, one should use an
 
[interactive session]({{< relref "creating_an_interactive_job" >}}), or
 
[submit a job to the batch queue]({{< relref "submitting_jobs" >}}); a sample interactive session command is shown after this list.
 
* **Avoid launching multiple simultaneous processes on the login node.** This includes, for

example, compiling applications with many parallel threads or checking job status several times a minute.
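
For example, a quick test of a CPU- or memory-intensive program belongs in an interactive session rather than on the login node. Below is a minimal sketch; the partition name *batch* and the resource values are assumptions, so substitute the ones appropriate for your work:

```bash
# Request an interactive shell on a worker node: 1 core, 4GB of RAM, 2 hours of walltime.
# "batch" is an assumed partition name - replace it with the partition you normally use.
srun --partition=batch --nodes=1 --ntasks-per-node=1 --mem=4gb --time=2:00:00 --pty $SHELL
```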
 
 
## File Systems
 
* Some I/O intensive jobs may benefit from **copying the data to the fast, temporary /scratch
 
file system local to each worker node**. The */scratch* directories are unique per job, and
 
are deleted when the job finishes. Thus, the last step of the batch script should copy the
 
needed output files from */scratch* to either */work* or */common*. Please see the
 
[Running BLAST Alignment]({{< relref "running_blast_alignment" >}}) page for a complete example, and the sketch after this list for the general pattern.
 
* */work* has two quotas - one for **file count** and one for **disk space**.

Reaching either quota puts additional stress on the file system. Therefore, please monitor these

quotas regularly, and delete any files you no longer need or copy them to a more permanent location.
 
* */work* is intended to be a **temporary location for storing job outputs and files**. After that,

all necessary files should be either moved to permanent storage or deleted.
 
* **Avoid rapidly opening and closing many files, as well as frequently reading and writing to

disk, in your program.** These access patterns stress the file system and can degrade its

performance for everyone. Instead, buffer large blocks of data in memory and read or write them

in fewer, larger operations, or use a parallel I/O library such as *parallel HDF5* or *parallel NetCDF*.
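
As a sketch of the */scratch* pattern from the first bullet above: the script below stages input data in, runs a hypothetical application named *my_app*, and copies the results back as its last step. The file names, resource values, and the use of `$WORK` for your */work* path are all assumptions to adapt:

```bash
#!/bin/bash
#SBATCH --job-name=scratch-example
#SBATCH --time=04:00:00
#SBATCH --mem=8gb

# Stage the input data onto the fast, node-local /scratch file system.
cp $WORK/input.dat /scratch/

# Run the (hypothetical) application against the local copy, writing output locally as well.
./my_app --input /scratch/input.dat --output /scratch/output.dat

# /scratch is deleted when the job finishes, so copying the results back must be the last step.
cp /scratch/output.dat $WORK/
```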
 
 
## Internal and External Networks
 
* **Use archives to transfer a large number of files.** If you are transferring many small

files, please pack them into an archive file first, so that the many files are replaced by a

single one. We recommend *zip* as the archive format because zip files keep an index of their

contents. This index lets the various zip tools quickly list an archive and extract either all

files or just a subset. The *tar* formats, by contrast, are stream oriented: a full scan (and

decompression, if the archive is compressed) is required before the tools can tell whether the

requested files are present. Example commands are shown below.
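
For illustration, the commands below pack a hypothetical *dataset/* directory into a zip archive before transfer, and later extract either the whole archive or a single file from it:

```bash
# Pack the directory into a single archive before transferring it.
zip -r dataset.zip dataset/

# Extract the entire archive...
unzip dataset.zip

# ...or use the zip index to pull out a single file without scanning the whole archive.
unzip dataset.zip dataset/sample001.txt
```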
 
 
## Running Applications
 
* **Before you request multiple nodes and cores in your submit script, make sure that the application you are
 
using supports that.** MPI applications can utilize multiple nodes and cores, while threaded or OpenMP applications are
 
limited to a single node. Requesting resources your application cannot use lengthens your wait in the queue and can also hurt the application's performance.
 
* Threaded and OpenMP applications can utilize multiple cores within a node. However, most applications **do not

perform significantly better when more than 16 cores are used**. On the other hand, requesting more cores increases the

waiting time for resources in the queue, so please request a reasonable number of cores.
 
* If an application uses multiple threads or cores, that number needs to be specified with the *"--ntasks-per-node"*
 
or *"--ntasks"* options of SLURM. If you use multiple threads or cores with your application, but you don't specify
 
the respective SLURM options, your application will use only 1 core by default.
 
* **Avoid submitting a large number of short (less than half an hour of running time) SLURM jobs.** The scheduler spends more

time and memory processing such jobs, which may cause problems and reduce the scheduler's responsiveness for everyone.
 
Instead, group the short tasks into jobs that will run longer.
 
* **The maximum running time on our clusters is 7 days.** If your job needs more time than that, please consider
 
improving the code, splitting the job into smaller tasks, or using checkpointing tools such as [DMTCP]({{< relref "dmtcp_checkpointing" >}}).
 
* Before submitting a job, it is recommended to make sure that **you are executing the application correctly, you are
 
passing the right arguments, and you don't have typos**. You can do this using an [interactive session]({{< relref "creating_an_interactive_job" >}}).
 
Otherwise, your job may wait in the queue for resources only to fail immediately because of a typo or a missing argument.
 
* If no memory, time, or core requirements are specified in your SLURM submit script, the **default resources allocated are

1GB of RAM, 1 hour of running time, and a single CPU core** respectively. Oftentimes, these resources are not enough. If your job

is terminated, there is a high chance that it exceeded these defaults, so please make sure you set

the memory and time requirements appropriately (see the sample submit script after this list).
 
* The run time and memory usage depend heavily on the application and the data used. You can monitor your application's needs with
 
tools such as [Allinea Performance Reports]({{< relref "/applications/app_specific/allinea_profiling_and_debugging/allinea_performance_reports" >}})
 
and [mem_report]({{< relref "monitoring_jobs" >}}). While these tools cannot predict the needed resources in advance, they report

the actual usage, which you can use to set your requests the next time you run that particular application.
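
Putting the points above together, a submit script that states its resource needs explicitly might look like the following sketch; the application *my_threaded_app*, its input file, and the specific values are hypothetical placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --nodes=1              # threaded/OpenMP applications are limited to a single node
#SBATCH --ntasks-per-node=8    # the number of cores the application will actually use
#SBATCH --mem=16gb             # override the 1GB default
#SBATCH --time=12:00:00        # override the 1 hour default (7 days is the maximum)

# Hypothetical threaded application; OpenMP programs read OMP_NUM_THREADS.
export OMP_NUM_THREADS=$SLURM_NTASKS_PER_NODE
./my_threaded_app --input data.in
```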
 
 
We strongly recommend that you read and follow these guidelines. If you have any concerns about your workflows or need any

assistance, please contact HCC Support at {{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu).