+++
title = "Good HCC Practices"
description = "Guidelines for good HCC practices"
weight = "95"
+++
Crane and Rhino, our two high-performance clusters, are shared among all our users.
At times, one user's activities may negatively impact the clusters and the other users on them.
To avoid this, we provide the following guidelines for good HCC practices.
## Login Node
* **Do not run jobs on the login node.** The login node is shared among all users and
should be used only for light tasks, such as moving and editing files, compiling programs,
and submitting and monitoring jobs. If a researcher runs a computationally intensive task
on the login node, it will negatively impact performance for the other users. For any CPU-
or memory-intensive operations, such as testing and running applications, use an
[interactive session]({{< relref "creating_an_interactive_job" >}}) (a sketch of requesting
one is shown after this list), or [submit a job to the batch queue]({{< relref "submitting_jobs" >}}).
* **Do not launch multiple simultaneous processes on the login node.** This may include using
lots of threads for compiling applications, or checking the job status multiple times a minute.
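
For reference, a minimal sketch of requesting an interactive session with SLURM is shown below.
The resource values are illustrative placeholders, and the exact options and limits are
cluster-specific; see the linked pages for the settings that apply on Crane and Rhino.

```bash
# Request an interactive shell on a worker node instead of working on the login node.
# The resource values below are placeholders - adjust them to your needs.
srun --nodes=1 --ntasks-per-node=1 --mem=4G --time=2:00:00 --pty $SHELL
```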
## File Systems
* Some I/O intensive jobs may benefit from **copying the data to the fast, temporary /scratch
file system local to each worker node**. The */scratch* directories are unique per job, and
are deleted when the job finishes. Thus, the last step of the batch script should copy the
needed output files from */scratch* to either */work* or */common*; this workflow is sketched
after this list. Please see the [Running BLAST Alignment]({{< relref "running_blast_alignment" >}})
page for a complete example. Currently, we do not have a quota on the */scratch* file system.
* */work* has two quotas - one for **file count** and one for **disk space**.
Reaching these quotas puts additional stress on the file system. Therefore, please make sure you
monitor these quotas regularly (a quick way to check a directory's usage is sketched after this
list), and delete files that are no longer needed or copy them to a more permanent location.
* */work* is intended as a **temporary location for storing job outputs and files**. After that,
all the necessary files need to be either moved to permanent storage or deleted.
* **Avoid rapidly opening and closing many files, as well as frequently reading and writing to
disk, in your program.** These access patterns stress the file system and can cause issues for
all users. Instead, consider reading and writing large blocks of data in memory over time, or
utilizing more advanced parallel I/O libraries, such as *parallel hdf5* and *parallel netcdf*.
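
As a sketch of the */scratch* workflow described above, the batch script below stages input
from */work* to the node-local */scratch*, runs there, and copies results back as its last
step. The application name, paths, and resource values are hypothetical placeholders, not a
specific HCC recipe.

```bash
#!/bin/bash
#SBATCH --time=4:00:00        # placeholder resource requests
#SBATCH --mem=8G
#SBATCH --ntasks-per-node=1

# Stage the input data to the fast, node-local /scratch file system.
cp /work/mygroup/myuser/input.dat /scratch/
cd /scratch

# Run the application against the local copy (my_app is a placeholder).
/work/mygroup/myuser/my_app input.dat > output.dat

# /scratch is deleted when the job finishes, so copying the results
# back must be the last step of the script.
cp /scratch/output.dat /work/mygroup/myuser/
```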
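
And since */work* has both a disk-space and a file-count quota, standard Linux tools can give a
rough view of how much a directory contributes to each; the path below is a placeholder, and
HCC may additionally provide its own quota-reporting utilities.

```bash
# Disk space used by a directory (counts toward the disk-space quota).
du -sh /work/mygroup/myuser

# Number of files underneath it (counts toward the file-count quota).
find /work/mygroup/myuser -type f | wc -l
```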
## Internal and External Networks
* **Transferring many files to/from/within the cluster can harm the file system.** If you are
transferring many small files, please pack them into an archive file format, so that the many
files are replaced by a single one. We recommend *zip* as the archive format, because zip files
keep an index of their contents: they can be quickly indexed by the various zip tools, and they
allow extraction of all files or only a subset (see the sketch after this list). The *tar*
formats, by contrast, are stream-oriented, and a full decompression is required for the tools
to know whether the requested files have been found.
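
To illustrate, the commands below pack a directory of many small files into a single zip
archive before transfer, then use the archive's index to list and selectively extract files;
the directory and file names are placeholders.

```bash
# Pack many small files into one archive before transferring it.
zip -r results.zip results/

# List the archive's index without extracting anything.
unzip -l results.zip

# Extract everything, or only a subset of the files.
unzip results.zip
unzip results.zip "results/sample01/*"
```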
## Running Applications
* **Before you request multiple nodes and cores in your submit script, make sure that the
application you are using supports that.** MPI applications can utilize multiple nodes and
cores, while threaded or OpenMP applications are limited to a single node. Requesting resources
the application cannot use increases the researcher's waiting time in the queue and may even
hurt the application's performance.
* Threaded and OpenMP applications can utilize multiple cores within a node. However, most of
the applications **do not perform significantly better when more than 16 cores are used**. On
the other hand, requesting more cores increases the waiting time for resources in the queue, so
please make sure you request a reasonable number of cores.
* If an application uses multiple threads or cores, that number needs to be specified with the
*"--ntasks-per-node"* or *"--ntasks"* options of SLURM (see the first sketch after this list).
If you use multiple threads or cores with your application but don't specify the respective
SLURM options, your application will use only 1 core by default.
* **Do not submit a large number of short (less than half an hour of running time) SLURM jobs.**
The scheduler spends more time and memory processing those jobs, which may cause problems and
reduce the scheduler's responsiveness for everyone. Instead, group the short tasks into jobs
that will run longer (see the second sketch after this list).
* **The maximum running time on our clusters is 7 days.** If your job needs more time than that,
please consider improving the code, splitting the job into smaller tasks, or using checkpointing
tools such as [DMTCP]({{< relref "dmtcp_checkpointing" >}}).
* Before submitting a job, make sure that **you are executing the application correctly, you are
passing the right arguments, and you don't have typos**. You can verify this using an
[interactive session]({{< relref "creating_an_interactive_job" >}}). Otherwise, your job may wait
in the queue for resources only to fail immediately because of a typo or a missing argument.
* If no memory, time, and core requirements are specified in your SLURM submit script, the
**default resources allocated are 1GB of RAM, 1 hour of running time, and a single CPU core**,
respectively. Oftentimes, these resources are not enough. If a job is terminated, there is a
high chance the reason is exceeded resources, so please make sure you set the memory and time
requirements appropriately.
* The run time and memory usage depend heavily on the application and the data used. You can
monitor your application's needs with tools such as
[Allinea Performance Reports]({{< relref "/applications/app_specific/allinea_profiling_and_debugging/allinea_performance_reports" >}})
and [mem_report]({{< relref "monitoring_jobs" >}}). While these tools cannot predict the needed
resources, they can provide useful information the researcher can use the next time that
particular application is run.
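
Tying the points above together, here is a sketch of a submit script that requests time,
memory, and cores explicitly for a threaded (OpenMP) application; the values and the
application name are placeholders, not recommendations.

```bash
#!/bin/bash
#SBATCH --time=12:00:00       # override the 1-hour default
#SBATCH --mem=16G             # override the 1GB default
#SBATCH --nodes=1             # threaded/OpenMP codes cannot span nodes
#SBATCH --ntasks-per-node=8   # match the number of threads the code will use

# Tell the OpenMP application how many cores it was actually allocated.
export OMP_NUM_THREADS=$SLURM_NTASKS_PER_NODE

./my_threaded_app input.dat
```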
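
And as a sketch of grouping short tasks: instead of submitting hundreds of few-minute jobs, a
single longer job can loop over the work items. The file layout and application are hypothetical.

```bash
#!/bin/bash
#SBATCH --time=6:00:00        # one longer job instead of many short ones
#SBATCH --mem=4G
#SBATCH --ntasks-per-node=1

# Process all inputs sequentially within a single job, so the scheduler
# tracks one job rather than hundreds of short-lived ones.
for input in /work/mygroup/myuser/inputs/*.dat; do
    ./my_app "$input" > "${input%.dat}.out"
done
```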
We strongly recommend that you read and follow this guidance. If you have any concerns about
your workflows or need any assistance, please contact HCC Support at
{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu).