diff --git a/content/good_hcc_practices/_index.md b/content/good_hcc_practices/_index.md
new file mode 100644
index 0000000000000000000000000000000000000000..1ee1df04e9ecae8889490e1acf6a88cb97e4a5ec
--- /dev/null
+++ b/content/good_hcc_practices/_index.md
@@ -0,0 +1,75 @@
++++
+title = "Good HCC Practices"
+description = "Guidelines for good HCC practices"
+weight = "95"
++++
+
+Crane and Rhino, our two high-performance clusters, are shared among all our users.
+Occasionally, one user's activities may negatively impact the clusters and the other users.
+To help everyone avoid this, we provide the following guidelines for good HCC practices.
+
+## Login Node
+* **Do not run jobs on the login node.** The login node is shared among all users and
+should be used only for light tasks, such as moving and editing files, compiling programs,
+and submitting and monitoring jobs. Running a computationally intensive task on the login
+node degrades performance for every other user. For any CPU- or memory-intensive operations,
+such as testing and running applications, use an
+[interactive session]({{< relref "creating_an_interactive_job" >}}) or
+[submit a job to the batch queue]({{< relref "submitting_jobs" >}}).
+* **Do not launch multiple simultaneous processes on the login node.** This includes, for example,
+compiling applications with many parallel threads or checking job status several times a minute.
+
+## File Systems
+* Some I/O-intensive jobs may benefit from **copying the data to the fast, temporary */scratch*
+file system local to each worker node**. The */scratch* directories are unique per job and
+are deleted when the job finishes. Therefore, the last step of the batch script should copy the
+needed output files from */scratch* to either */work* or */common*. Please see the
+[Running BLAST Alignment]({{< relref "running_blast_alignment" >}}) page for an example.
+Currently, there is no quota on the */scratch* file system.
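As a sketch of this staging pattern, the data-movement part of a batch script might look like the following. The file names and the `sort` command standing in for a real application are illustrative, and the fallback to a temporary directory is only there so the sketch runs outside a job:

```bash
# Illustrative sketch: stage data through fast node-local scratch, then copy
# results back before the job finishes (after which /scratch is deleted).
# In a real job, SCRATCH would be the per-job /scratch directory; the mktemp
# fallback only makes this sketch runnable outside the cluster.
SCRATCH="${SCRATCH:-$(mktemp -d)}"
WORKDIR="${WORKDIR:-$PWD}"

printf 'c\na\nb\n' > "$WORKDIR/input.dat"         # stand-in input data
cp "$WORKDIR/input.dat" "$SCRATCH/"               # stage in
( cd "$SCRATCH" && sort input.dat > output.dat )  # compute against local disk
cp "$SCRATCH/output.dat" "$WORKDIR/"              # stage out before /scratch is wiped
```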
+* */work* has two quotas: one for **file count** and one for **disk space**.
+Reaching these quotas puts additional stress on the file system, so please monitor them
+regularly, and delete the files you no longer need or copy them to a more permanent location.
+* */work* is intended as a **temporary location for storing job outputs and files**. Afterwards,
+all the necessary files need to be either moved to permanent storage or deleted.
+* **Avoid rapidly opening and closing many files, as well as frequently reading from and writing to
+disk, in your program.** This access pattern stresses the file system and may cause general issues.
+Instead, consider buffering large blocks of data in memory and reading or writing them in fewer,
+larger operations, or utilize more advanced parallel I/O libraries, such as *parallel HDF5* and
+*parallel netCDF*.
+
+## Internal and External Networks
+* **Transferring many files to/from/within the cluster can harm the file system.** If you are
+transferring many small files, please pack them into a single archive file first. We recommend
+the *zip* format because zip files keep an index of their contents: the various zip tools can
+list an archive quickly and extract either all files or just a subset. The *tar* formats, by
+contrast, are stream-oriented, so a full decompression is required before the tools can tell
+whether a requested file is present in the archive.
+
+## Running Applications
+* **Before you request multiple nodes and cores in your submit script, make sure that the
+application you are using supports that.** MPI applications can utilize multiple nodes and cores,
+while threaded or OpenMP applications are limited to a single node. Requesting resources your
+application cannot use increases your waiting time in the queue and may even hurt the
+application's performance.
+* Threaded and OpenMP applications can utilize multiple cores within a node.
However, most applications
+**do not perform significantly better when more than 16 cores are used**. On the other hand,
+requesting more cores increases the waiting time for resources in the queue, so please make sure
+you request a reasonable number of cores.
+* If an application uses multiple threads or cores, that number needs to be specified with the
+*--ntasks-per-node* or *--ntasks* option of SLURM. If your application uses multiple threads or
+cores but you do not set the respective SLURM options, it will use only 1 core by default.
+* **Do not submit a large number of short SLURM jobs (less than half an hour of running time).**
+The scheduler spends more time and memory processing such jobs, which may cause problems and
+reduce the scheduler's responsiveness for everyone. Instead, group the short tasks into jobs
+that will run longer.
+* **The maximum running time on our clusters is 7 days.** If your job needs more time than that,
+please consider improving the code, splitting the job into smaller tasks, or using checkpointing
+tools such as [DMTCP]({{< relref "dmtcp_checkpointing" >}}).
+* Before submitting a job, make sure that **you are executing the application correctly, you are
+passing the right arguments, and you don't have typos**. You can check this in an
+[interactive session]({{< relref "creating_an_interactive_job" >}}).
+Otherwise, your job may wait in the queue for resources only to fail immediately because of a
+typo or a missing argument.
+* If no memory, time, or core requirements are specified in your SLURM submit script, the
+**defaults allocated are 1GB of RAM, 1 hour of running time, and a single CPU core**. Oftentimes,
+these resources are not enough. If your job is terminated, there is a high chance that it
+exceeded the requested resources, so please make sure you set the memory and time requirements
+appropriately.
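Putting these resource-request points together, the header of a submit script might look like the sketch below. The job name, the resource values, and the `my_application` call are illustrative placeholders, not recommendations:

```bash
#!/bin/bash
#SBATCH --job-name=example_job   # illustrative name
#SBATCH --nodes=1                # threaded/OpenMP applications cannot span nodes
#SBATCH --ntasks-per-node=4      # match the number of cores your application uses
#SBATCH --mem=8gb                # raise the 1GB default as needed
#SBATCH --time=12:00:00          # raise the 1-hour default; the maximum is 7 days

# Keep an OpenMP application on the requested number of cores.
export OMP_NUM_THREADS=$SLURM_NTASKS_PER_NODE
./my_application input.dat       # hypothetical application and input file
```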
+* The run time and memory usage depend heavily on the application and the data used. You can
+monitor your application's needs with tools such as
+[Allinea Performance Reports]({{< relref "/applications/app_specific/allinea_profiling_and_debugging/allinea_performance_reports" >}})
+and [mem_report]({{< relref "monitoring_jobs" >}}). While these tools cannot predict the needed
+resources in advance, they provide useful information you can apply the next time you run that
+particular application.
+
+We strongly recommend that you read and follow these guidelines. If you have any concerns about
+your workflows or need any assistance, please contact HCC Support at
+{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu).