Commit 8422b537 authored by Josh Samuelson's avatar Josh Samuelson
Using scratch topic tweaks

parent f1de1d4a
{{% panel theme="danger" header="**Sensitive and Protected Data**" %}}HCC currently has *no storage* that is suitable for **HIPAA** or other **PID** data sets. Users are not permitted to store such data on HCC machines.{{% /panel %}}
All HCC machines have three separate areas for every user to store data,
each intended for a different purpose. The three areas are `/common`, `/work`, and `/home`, each with a different function. `/home` is your home directory with a quota limit of **20GB** and is backed up for best-effort disaster recovery purposes. `/work` is the high-performance, I/O-focused directory for running jobs. `/work` has a **50TB per-group quota**, is not backed up, and is subject to a [purge policy]({{<relref "data_storage/#purge-policy" >}}) of **6 months of inactivity on a file**. `/common` works similarly to `/work` and is mounted with read and write capabilities on all HCC clusters, meaning any files on `/common` can be accessed from all HCC clusters, unlike `/home` and `/work`, which are cluster dependent. More information on the three storage areas on HCC's clusters is available on the [Data Storage]({{<relref "data_storage">}}) page.
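On the clusters, the storage areas are commonly reached through environment variables. A minimal shell sketch (the `$WORK` and `$COMMON` variable names are assumptions based on this page; `$HOME` is standard, and the placeholder paths are illustrative only):

```shell
# Minimal sketch: inspect the three storage areas.
# $WORK and $COMMON are assumed to be set on the cluster; fall back to
# placeholder paths when they are not.
echo "Home:   $HOME"                             # 20GB quota, backed up
echo "Work:   ${WORK:-/work/<group>/<user>}"     # 50TB group quota, purge policy
echo "Common: ${COMMON:-/common/<group>/<user>}" # read/write on all HCC clusters
df -h "$HOME"   # capacity and usage of the filesystem behind $HOME
```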
{{< figure src="/images/35325560.png" height="500" class="img-border">}}
HCC also offers a separate, near-line archive with space available for lease called Attic. Attic provides large data storage designed to be more reliable than `/work` and larger than `/home`. More information on Attic and how to transfer data to and from Attic can be found on the [Using Attic]({{<relref "data_storage/using_attic">}}) page.
You can also use your [UNL OneDrive account]({{< relref "data_transfer/using_rclone_with_hcc/" >}}) to download and
upload files from any of the HCC clusters.
If you have space requirements outside what is currently provided or any questions,
please
email <a href="mailto:hcc-support@unl.edu" class="external-link">hcc-support@unl.edu</a>.
### Using */scratch* storage space to improve running jobs:
[Using Scratch]({{<relref "using_scratch_space" >}})
## What is Scratch?
*Scratch* is temporary local storage on the compute/worker node where the job is running.
Depending on the application's input/output (I/O) patterns, this may be the fastest storage available to a running job.
The *scratch* space is temporary and accessible only while the job is running; it is discarded after the job finishes.
Therefore, any important data in the *scratch* space should be moved to a permanent location on the cluster (such as *$WORK|$HOME|$COMMON*) before the job ends.
The *scratch* space is not backed up.
## When to use Scratch?
Using the correct tool for the task at hand is important (the tool here being worker-node local *scratch* vs. permanent storage locations), so know your application's I/O patterns to make the correct selection. Using *scratch* improves performance for certain applications that cause load issues for network-attached storage (which *$WORK|$HOME|$COMMON* are). Problematic I/O patterns for network-attached storage include:
- many rapid I/O operations (directory/file creation, renaming, or removal)
- interacting with and modifying many files
- non-sequential/random seeking over file contents
- rapid temporary-file I/O involving a mixture of the above
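A hypothetical sketch of such a pattern (placeholder filenames; a temporary directory stands in for node-local */scratch* so it can run anywhere): creating, merging, and removing a thousand tiny files, which is cheap on local disk but generates a storm of metadata requests on network-attached storage:

```shell
workdir=$(mktemp -d)   # stand-in for node-local /scratch in this sketch
cd "$workdir"
for i in $(seq 1 1000); do
  echo "record $i" > "part_$i.tmp"   # many rapid file creates
done
cat part_*.tmp > merged.txt          # touches many files in quick succession
rm -f part_*.tmp                     # many rapid removes
```

Run against `$WORK` or `$COMMON`, each of those creates and removes is a round trip to the file server; on local *scratch* it stays on the node.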
{{% notice info %}}
When a permanent location on the cluster (such as *$WORK|$COMMON*) is used for the analyses above (avoid [$HOME]({{< relref "./data_storage/#home-directory" >}}), which is intended for code/scripts/programs only), various issues can occur that can affect the cluster and everyone currently using it.
{{% /notice %}}
## How to use Scratch?
*Scratch* is accessible on the compute node while the job is running, and no additional permission or setup is needed to access it.
*Scratch* can be utilized efficiently by:
- copying all needed input data to the temporary *scratch* space at the beginning of the job to ensure fast reading
- writing job output to *scratch* using the proper output arguments of the program used
- copying needed output data/folders back to a permanent location on the cluster before the job finishes
These modifications are done in the SLURM submit script.
The *scratch* storage is accessed through the **/scratch** path.
Below is an example SLURM submit script.
This script assumes that the input data is in the current directory (please change that line if different),
and the final output data is copied back to $WORK. *my_program -\-output* is used just as an example,
and it should be replaced with the program/application you use and its respective output arguments.
{{% panel header="`use_scratch_example.submit`"%}}
{{< highlight bash >}}
# load necessary modules
# copy all needed input data to /scratch [input matches problematic I/O patterns]
cp -r input_data /scratch/
# if needed, change current working directory, e.g., $WORK job path to /scratch
# pushd /scratch
# use your program of interest and write program output to /scratch
# using the proper output arguments from the used program, e.g.,
my_program --output /scratch/output
# return the batch script shell to where it was at when pushd was called
# and copy the data to that prior $WORK job path
# popd
# cp -r /scratch/output .
# copy needed output to $WORK
cp -r /scratch/output $WORK
{{< /highlight >}}
{{% /panel %}}
{{% notice info %}}
If your application requires the input data to be in the current working directory (cwd), or the output to be stored in the cwd, make sure you change the cwd with **pushd /scratch** before you start running your application. The **popd** command returns the cwd to where the shell was when the job was submitted to the scheduler, which is the path used when the job starts running.
{{% /notice %}}
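The *pushd*/*popd* pair (bash builtins) can be sketched outside a job script; here */tmp* stands in for */scratch*:

```shell
start=$PWD               # e.g., the $WORK path the job was submitted from
pushd /tmp > /dev/null   # change cwd to the fast storage; old cwd is remembered
# ... run the application here; it reads and writes in the current directory ...
popd > /dev/null         # return to the remembered directory
# $PWD is now $start again, so a relative copy such as `cp -r /scratch/output .`
# lands in the original submit path
```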
Additional examples of SLURM submit scripts that use **scratch** and are used on Swan are provided for
[BLAST](https://hcc.unl.edu/docs/applications/app_specific/bioinformatics_tools/alignment_tools/blast/running_blast_alignment/)
and [Trinity](https://hcc.unl.edu/docs/applications/app_specific/bioinformatics_tools/de_novo_assembly_tools/trinity/running_trinity_in_multiple_steps/).
{{% notice note %}}
Please note that after the job finishes (whether it succeeds or fails), the data in *scratch* for that job is permanently deleted.
{{% /notice %}}
## Disadvantages of Scratch
- limited storage capacity
- capacity shared with other jobs running on the same compute/worker node
- a job spanning multiple compute nodes has its own unique *scratch* storage per compute node
- data stored in *scratch* on one compute node cannot be directly accessed from a different compute node and the processes running there
- temporary storage only while the job is running
- if the job fails, no output is saved, and checkpointing cannot be used
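Because each node gets its own *scratch*, a multi-node job must stage input on every allocated node, not just the first. A hedged sketch of one common approach, assuming a standard SLURM setup (one staging task per node via `srun`; this fragment only runs inside a job allocation):

```shell
# Fragment for a multi-node SLURM job script (not runnable outside a cluster):
# run the copy once on every allocated node so that each node-local /scratch
# receives its own copy of the input data.
srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 \
     cp -r input_data /scratch/
```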
{{% notice note %}}
Using *scratch* is especially recommended for many Bioinformatics applications (such as BLAST, GATK, and Trinity)
that perform many rapid I/O operations and can affect the file system on the cluster.
{{% /notice %}}