Using scratch topic tweaks

Commit 8422b537 authored 1 year ago by Josh Samuelson
--- a/content/handling_data/_index.md
+++ b/content/handling_data/_index.md
--- a/content/handling_data/using_scratch_space.md
+++ b/content/handling_data/using_scratch_space.md
@@ -7,26 +7,28 @@ weight = "10"
 ## What is Scratch?
 *Scratch* is temporary local storage on the compute/worker node where the job is running.
-This is the fastest storage available to an active running job. 
+Depending on the application's input/output (I/O) patterns, this may be the fastest storage available to a running job.
 The *scratch* space is temporary and accessible only while the job is running, and it is discarded after the job finishes.
 Therefore, any important data from the *scratch* space should be moved to a permanent location on the cluster (such as *$WORK|$HOME|$COMMON*).
 The *scratch* space is not backed-up.
 ## When to use Scratch?
-Using *scratch* improves the performance and is ideal for jobs that: 
+Using the correct tool for the task at hand is important (the tools here being worker-node-local *scratch* versus permanent storage locations), so know your application's I/O patterns to make the correct selection. Using *scratch* improves performance for applications that cause load issues for network-attached storage (which *$WORK|$HOME|$COMMON* are). Applications with I/O patterns that are problematic for network-attached storage typically:
- perform many rapid input/output operations
- modify and interact with many files
+- perform many rapid I/O operations (directory/file creation, renaming or removal)
- create many or large temporary files
+- interact with and modify many files
+- seek non-sequentially/randomly over file contents
+- produce rapid temporary file I/O patterns involving a mixture of the above
 {{% notice info %}}
-When a permanent location on the cluster (such as *$WORK|$HOME|$COMMON*) is used for the analyses from above, 
+When a permanent location on the cluster (such as *$WORK|$COMMON*) is used for the analyses above, various issues can occur that affect the cluster and everyone using it at that moment. Avoid [$HOME]({{< relref "./data_storage/#home-directory" >}}) for such work; it is intended for code, scripts, and programs only.
-various issues can occur that can affect the cluster and everyone using it at the moment.
 {{% /notice %}}
 ## How to use Scratch?
 *Scratch* is accessible on the compute node while the job is running and no additional permission or setup is needed for its access.
 *Scratch* can be utilized efficiently by:
 - copying all needed input data to the temporary *scratch* space at the beginning of a job to ensure fast reading
 - writing job output to *scratch* using the proper output arguments from the used program
 - copying needed output data/folder back to a permanent location on the cluster before the job finishes
@@ -52,10 +54,10 @@ and it should be replaced with the program/application you use and its respectiv
 # load necessary modules
-# copy all needed input data to /scratch 
+# copy all needed input data to /scratch [input matches problematic I/O patterns]
 cp -r input_data /scratch/
-# if needed, change current working directory, e.g., $WORK to /scratch
+# if needed, change the current working directory, e.g., from the $WORK job path to /scratch
 # pushd /scratch
 # use your program of interest and write program output to /scratch
@@ -63,15 +65,18 @@ cp -r input_data /scratch/
 my_program --output /scratch/output
 # return the batch script shell to where it was at when pushd was called
+# and copy the data to that prior $WORK job path
 # popd
+# cp -r /scratch/output .
 # copy needed output to $WORK
 cp -r /scratch/output $WORK
 {{< /highlight >}}
 {{% /panel %}}
 {{% notice info %}}
-If your application requires for the input data to be in the current working directory (cwd) or the output to be stored in the current workng directory, then make sure you change the current working directory with **pushd /scratch** before you start running your application.
+If your application requires the input data to be in the current working directory (cwd) or the output to be stored in the cwd, then make sure you change the cwd with **pushd /scratch** before you start running your application. The **popd** command returns the cwd to where the shell was when **pushd** was called; unless the script changed directories earlier, that is where the shell was when submitted to the scheduler, which is the path used when the job starts running.
 {{% /notice %}}
 Additional examples of SLURM submit scripts that use **scratch** and are used on Swan are provided for
@@ -84,7 +89,7 @@ Please note that after the job finishes (either successfully or fails), the data
 ## Disadvantages of Scratch
 - limited storage capacity
- shared with other jobs that are running on the same compute/worker node
+- capacity shared with other jobs that are running on the same compute/worker node
 - job spanning across multiple compute nodes have its own unique *scratch* storage per compute node
 - data stored in *scratch* on one compute node can not be directly accessed by a different compute node and the processes that run there
 - temporary storage while the job is running
@@ -92,5 +97,5 @@ Please note that after the job finishes (either successfully or fails), the data
 {{% notice note %}}
 Using *scratch* is especially recommended for many Bioinformatics applications (such as BLAST, GATK, Trinity)
-that perform many rapid input/output operations and can affect the file system on the cluster.
+that perform many rapid I/O operations and can affect the file system on the cluster.
 {{% /notice %}}
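
The stage-in, run, and stage-out workflow that the updated script comments describe can be sketched as a small standalone Bash function. This is only an illustration under stated assumptions: `run_in_scratch`, `SCRATCH_DIR`, and `WORK_DIR` are hypothetical names that do not appear in the documentation, and temporary directories stand in for `/scratch` and `$WORK` so the sketch can be tried outside of a job.

```shell
#!/usr/bin/env bash
# Sketch only: run_in_scratch, SCRATCH_DIR, and WORK_DIR are hypothetical names.
set -euo pipefail

# Stage input into scratch, run a command there, and copy its output back.
# Arguments: <scratch dir> <work dir> <input name> <command...>
run_in_scratch() {
    local scratch_dir="$1" work_dir="$2" input="$3"
    shift 3

    # copy all needed input data to scratch for fast local reads
    cp -r "$work_dir/$input" "$scratch_dir/"

    # make scratch the cwd; pushd remembers the previous cwd
    pushd "$scratch_dir" > /dev/null
    mkdir -p output
    "$@"
    # popd returns to the directory that was current when pushd was called
    popd > /dev/null

    # copy needed output back to permanent storage before the job finishes
    cp -r "$scratch_dir/output" "$work_dir/"
}

# Demo: temporary directories stand in for /scratch and $WORK (assumption)
SCRATCH_DIR="$(mktemp -d)"
WORK_DIR="$(mktemp -d)"
mkdir -p "$WORK_DIR/input_data"
printf 'a\nb\nc\n' > "$WORK_DIR/input_data/sample.txt"

cd "$WORK_DIR"
run_in_scratch "$SCRATCH_DIR" "$WORK_DIR" input_data \
    sh -c 'wc -l < input_data/sample.txt > output/line_count.txt'

# record where the shell ended up after popd
CWD_AFTER="$(pwd -P)"
```

In a real submit script the two directories would presumably be `/scratch` and the `$WORK` job path, and the command would be the application of interest with its own output arguments.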