+++
title = "FAQ"
description = "HCC Frequently Asked Questions"
weight = "10"
+++
- [I have an account, now what?](#i-have-an-account-now-what)
- [How do I change my password?](#how-do-i-change-my-password)
- [I forgot my password, how can I retrieve it?](#i-forgot-my-password-how-can-i-retrieve-it)
- [I just deleted some files and didn't mean to! Can I get them back?](#i-just-deleted-some-files-and-didn-t-mean-to-can-i-get-them-back)
- [How do I (re)activate Duo?](#how-do-i-re-activate-duo)
- [How many nodes/memory/time should I request?](#how-many-nodes-memory-time-should-i-request)
- [I am trying to run a job but nothing happens?](#i-am-trying-to-run-a-job-but-nothing-happens)
- [I keep getting the error "slurmstepd: error: Exceeded step memory limit at some point." What does this mean and how do I fix it?](#i-keep-getting-the-error-slurmstepd-error-exceeded-step-memory-limit-at-some-point-what-does-this-mean-and-how-do-i-fix-it)
- [I want to talk to a human about my problem. Can I do that?](#i-want-to-talk-to-a-human-about-my-problem-can-i-do-that)
---
#### I have an account, now what?
Congrats on getting an HCC account! Now you need to connect to a Holland
cluster. To do this, we use an SSH connection. SSH stands for Secure
Shell, and it allows you to securely connect to a remote computer and
operate it just like you would a personal machine.
Depending on your operating system, you may need to install software to
make this connection. Check out our Quick Start Guides for information on
how to install the necessary software for your operating system:
- [For Mac/Linux Users]({{< relref "for_maclinux_users" >}})
- [For Windows Users]({{< relref "for_windows_users" >}})
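Once the software is installed, making the connection is a single command. As a minimal example (assuming you are connecting to the Crane cluster; replace `<username>` with your HCC login):
{{< highlight bash >}}
# Open a secure shell session to the Crane login node
ssh <username>@crane.unl.edu
{{< /highlight >}}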
#### How do I change my password?
#### I forgot my password, how can I retrieve it?
Information on how to change or retrieve your password can be found on
the documentation page: [How to change your
password]({{< relref "/accounts/how_to_change_your_password" >}})
All passwords must be at least 8 characters in length and must contain
at least one capital letter and one numeric digit. Passwords also cannot
contain any dictionary words. If you need help picking a good password,
consider using a (secure!) password generator such as
[this one provided by Random.org](https://www.random.org/passwords).
To preserve the security of your account, we recommend changing the
default password you were given as soon as possible.
#### I just deleted some files and didn't mean to! Can I get them back?
That depends. Where were the files you deleted?
**If the files were in your $HOME directory (/home/group/user/):** It's
possible.
$HOME directories are backed up daily and we can restore your files as
they were at the time of our last backup. Please note that any changes
made to the files between when the backup was made and when you deleted
them will not be preserved. To have these files restored, please contact
HCC Support at
{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
as soon as possible.
**If the files were in your $WORK directory (/work/group/user/):** No.
Unfortunately, the $WORK directories are created as a short term place
to hold job files. This storage was designed to be quickly and easily
accessed by our worker nodes and as such is not conducive to backups.
Any irreplaceable files should be backed up in a secondary location,
such as Attic, the cloud, or on your personal machine. For more
information on how to prevent file loss, check out [Preventing File
Loss]({{< relref "preventing_file_loss" >}}).
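For instance, a rough sketch of backing up a results directory to your personal machine with `scp`, run from the personal machine (the paths and directory names are placeholders for your own setup):
{{< highlight bash >}}
# Pull a directory from your $WORK space on Crane to the local machine
scp -r <username>@crane.unl.edu:/work/<group>/<username>/results ./results_backup
{{< /highlight >}}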
#### How do I (re)activate Duo?
**If you have not activated Duo before:**
Please stop by
[our offices](http://hcc.unl.edu/location)
along with a photo ID and we will be happy to activate it for you. If
you are not local to Omaha or Lincoln, contact us at
{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
and we will help you activate Duo remotely.
**If you have activated Duo previously but now have a different phone
number:**
Stop by our offices along with a photo ID and we can help you reactivate
Duo and update your account with your new phone number.
**If you have activated Duo previously and have the same phone number:**
Email us at
{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
from the email address your account is registered under and we will send
you a new link that you can use to activate Duo.
#### How many nodes/memory/time should I request?
**Short answer:** We don’t know.
**Long answer:** The amount of resources required is highly dependent on
the application you are using, the input file sizes and the parameters
you select. Sometimes it can help to speak with someone else who has
used the software before to see if they can give you an idea of what has
worked for them.
But ultimately, it comes down to trial and error; try different
combinations and see what works and what doesn’t. Good practice is to
check the output and utilization of each job you run. This will help you
determine what parameters you will need in the future.
For more information on how to determine how many resources a completed
job used, check out the documentation on [Monitoring Jobs]({{< relref "monitoring_jobs" >}}).
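As a quick sketch (assuming a completed job with ID `<job_id>`), SLURM's `sacct` can report the peak memory and run time a job actually used:
{{< highlight bash >}}
# Show per-step peak memory (MaxRSS) and elapsed time for a finished job
sacct -j <job_id> --format=JobID,JobName,MaxRSS,Elapsed,State
{{< /highlight >}}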
#### I am trying to run a job but nothing happens?
Where are you trying to run the job from? You can check this by typing
the command `pwd` into the terminal.
**If you are running from inside your $HOME directory
(/home/group/user/)**:
Move your files to your $WORK directory (/work/group/user/) and resubmit
your job.
The worker nodes on our clusters have read-only access to the files in
$HOME directories. This means that when a job is submitted from $HOME,
the scheduler cannot write the output and error files in the directory
and the job is killed. It appears the job does nothing because no output
is produced.
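A minimal sketch of the fix (the directory and script names are placeholders):
{{< highlight bash >}}
# Copy the job directory from $HOME to $WORK and resubmit from there
cp -r $HOME/my_job_dir $WORK/my_job_dir
cd $WORK/my_job_dir
sbatch my_submit_script.slurm
{{< /highlight >}}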
**If you are running from inside your $WORK directory:**
Contact us at
{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
with your login, the name of the cluster you are running on, and the
full path to your submit script and we will be happy to help solve the
issue.
#### I keep getting the error "slurmstepd: error: Exceeded step memory limit at some point." What does this mean and how do I fix it?
This error occurs when the job you are running uses more memory than was
requested in your submit script.
If you specified `--mem` or `--mem-per-cpu` in your submit script, try
increasing this value and resubmitting your job.
If you did not specify `--mem` or `--mem-per-cpu` in your submit script,
chances are the default amount allotted is not sufficient. Add the line
{{< highlight batch >}}
#SBATCH --mem=<memory_amount>
{{< /highlight >}}
to your script with a reasonable amount of memory and try running it again. If you keep
getting this error, continue to increase the requested memory amount and
resubmit the job until it finishes successfully.
For additional details on how to monitor usage on jobs, check out the
documentation on [Monitoring Jobs]({{< relref "monitoring_jobs" >}}).
If you continue to run into issues, please contact us at
{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
for additional assistance.
#### I want to talk to a human about my problem. Can I do that?
Of course! We have an open door policy and invite you to stop by
[either of our offices](http://hcc.unl.edu/location)
anytime Monday through Friday between 9 am and 5 pm. One of the HCC
staff would be happy to help you with whatever problem or question you
have.  Alternatively, you can drop one of us a line and we'll arrange a
time to meet: [Contact Us](https://hcc.unl.edu/contact-us).
+++
title = "Jupyter Notebooks on Crane"
description = "How to access and use a Jupyter Notebook"
weight = 20
+++
- [Connecting to Crane](#connecting-to-crane)
- [Running Code](#running-code)
- [Opening a Terminal](#opening-a-terminal)
- [Using Custom Packages](#using-custom-packages)
## Connecting to Crane
Jupyter defines its Notebook as
an open-source web application that allows you to create and share documents that contain live code,
equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation,
statistical modeling, data visualization, machine learning, and much more.
1. To open a Jupyter notebook, [sign in](https://crane.unl.edu) to crane.unl.edu using your HCC credentials (NOT your
campus credentials).
{{< figure src="/images/jupyterLogin.png" >}}
2. Select your preferred authentication method.
{{< figure src="/images/jupyterPush.png" >}}
3. Choose a job profile. Select "Notebook via SLURM Job | Small (1 core, 4GB RAM, 8 hours)" for light tasks such as debugging or small-scale testing.
Select the other options based on your computing needs. Note that a SLURM Job will save to your "work" directory.
{{< figure src="/images/jupyterjob.png" >}}
## Running Code
1. Select the "New" dropdown menu and select the file type you want to create.
{{< figure src="/images/jupyterNew.png" >}}
2. A new tab will open, where you can enter your code. Run your code by selecting the "play" icon.
{{< figure src="/images/jupyterCode.png">}}
## Opening a Terminal
1. From your user home page, select "terminal" from the "New" drop-down menu.
{{< figure src="/images/jupyterTerminal.png">}}
2. A terminal opens in a new tab. You can enter [Linux commands]({{< relref "basic_linux_commands" >}})
at the prompt.
{{< figure src="/images/jupyterTerminal2.png">}}
## Using Custom Packages
Many popular `python` and `R` packages are already installed and available within Jupyter Notebooks.
However, it is possible to install custom packages to be used in notebooks by creating a custom Anaconda
Environment. Detailed information on how to create such an environment can be found at
[Using an Anaconda Environment in a Jupyter Notebook on Crane]({{< relref "/applications/user_software/using_anaconda_package_manager#using-an-anaconda-environment-in-a-jupyter-notebook-on-crane" >}}).
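As a rough sketch only (the environment and package names are placeholders; follow the linked page for the exact steps to make the environment visible in Jupyter), a custom environment is created from a terminal along these lines:
{{< highlight bash >}}
# Load Anaconda and create an environment containing the packages you need
module load anaconda
conda create -n my_custom_env python numpy pandas
{{< /highlight >}}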
---
 
+++
title = "BLAST with Allinea Performance Reports"
description = "Example of how to profile BLAST using Allinea Performance Reports."
+++
A simple example of using
[BLAST]({{< relref "/applications/app_specific/bioinformatics_tools/alignment_tools/blast/running_blast_alignment" >}}) 
with Allinea Performance Reports (`perf-report`) on Crane is shown below:
{{% panel theme="info" header="blastn_perf_report.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --job-name=BlastN
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=20:00:00
#SBATCH --mem=50gb
#SBATCH --output=BlastN.info
#SBATCH --error=BlastN.error
module load allinea
module load blast/2.2.29
cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nt/ /tmp/
cp input_reads.fasta /tmp/
perf-report --openmp-threads=$SLURM_NTASKS_PER_NODE --nompi `which blastn` \
-query /tmp/input_reads.fasta -db /tmp/nt/nt -out \
blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE
cp blastn_output.alignments .
{{< /highlight >}}
{{% /panel %}}
BLAST uses OpenMP and therefore the Allinea Performance Reports options
`--openmp-threads` and `--nompi` are used. The perf-report
part, `perf-report --openmp-threads=$SLURM_NTASKS_PER_NODE --nompi`,
is placed in front of the actual `blastn` command we want
to analyze.
{{% notice info %}}
If you see the error "**Allinea Performance Reports - target file
'application' does not exist on this machine... exiting**", this means
that instead of just using the executable '*application*', the full path
to that application is required. This is why, in the script
above, instead of using "*blastn*", we use `` `which blastn` ``, which
gives the full path of the *blastn* executable.
{{% /notice %}}
When the application finishes, the performance report is generated in
the working directory.
For the executed application, this is how the report looks:
{{< figure src="/images/11635296.png" width="850" >}}
From the report, we can see that **blastn** is a compute-bound
application. The difference between the mean (11.1 GB) and peak (26.3 GB)
memory is significant, and this may be a sign of workload imbalance or a
memory leak. Moreover, 89.6% of the time is spent synchronizing
threads in parallel regions, which suggests workload imbalance.
Running Allinea Performance Reports and identifying application
bottlenecks is very useful for improving the application and making
better use of the available resources.
+++
title = "Ray with Allinea Performance Reports"
description = "Example of how to profile Ray using Allinea Performance Reports"
+++
A simple example of using [Ray]({{< relref "/applications/app_specific/bioinformatics_tools/de_novo_assembly_tools/ray" >}})
with Allinea Performance Reports (`perf-report`) on Tusker is shown below:
{{% panel theme="info" header="ray_perf_report.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --job-name=Ray
#SBATCH --ntasks-per-node=16
#SBATCH --time=10:00:00
#SBATCH --mem=70gb
#SBATCH --output=Ray.info
#SBATCH --error=Ray.error
module load allinea
module load compiler/gcc/4.7 openmpi/1.6 ray/2.3
perf-report mpiexec -n 16 Ray -k 31 -p input_reads_pair_1.fasta input_reads_pair_2.fasta -o output_directory
{{< /highlight >}}
{{% /panel %}}
Ray is an MPI application, and therefore no additional Allinea Performance
Reports options are required. The `perf-report` command is placed in front of the
actual `Ray` command we want to analyze.
When the application finishes, the performance report is generated in
the working directory.
For the executed application, this is how the report looks:
{{< figure src="/images/11635303.png" width="850" >}}
From the report, we can see that **Ray** is a compute-bound application.
Most of the running time is spent in point-to-point calls with a low
transfer rate, which may be caused by inefficient message sizes.
Therefore, running this application with fewer MPI processes and more
data on each process may be more efficient.
Running Allinea Performance Reports and identifying application
bottlenecks is very useful for improving the application and making
better use of the available resources.
+++
title = " Running BLAST Alignment"
description = "How to run BLAST alignment on HCC resources"
weight = "10"
+++
Basic BLAST has the following commands:
- **blastn**: search nucleotide database using a nucleotide query
- **blastp**: search protein database using a protein query
- **blastx**: search protein database using a translated nucleotide query
- **tblastn**: search translated nucleotide database using a protein query
- **tblastx**: search translated nucleotide database using a translated nucleotide query
The basic usage of **blastn** is:
{{< highlight bash >}}
$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments [options]
{{< /highlight >}}
where **input_reads.fasta** is an input file of sequence data in fasta format, **input_reads_db** is the generated BLAST database, and **blastn_output.alignments** is the output file where the alignments are stored.
Additional parameters can be found in the [BLAST manual](https://www.ncbi.nlm.nih.gov/books/NBK279690/), or by typing:
{{< highlight bash >}}
$ blastn -help
{{< /highlight >}}
These BLAST alignment commands are multi-threaded, and therefore using the BLAST option **-num_threads <number_of_CPUs>** is recommended.
HCC hosts multiple BLAST databases and indices on Crane. In order to use these resources, the ["biodata" module]({{<relref "/applications/app_specific/bioinformatics_tools/biodata_module">}}) needs to be loaded first. The **$BLAST** variable contains the following currently available databases:
- **16SMicrobial**
- **env_nt**
- **est**
- **est_human**
- **est_mouse**
- **est_others**
- **gss**
- **human_genomic**
- **human_genomic_transcript**
- **mouse_genomic_transcript**
- **nr**
- **nt**
- **other_genomic**
- **refseq_genomic**
- **refseq_rna**
- **sts**
- **swissprot**
- **tsa_nr**
- **tsa_nt**
If you want to create and use a BLAST database that is not mentioned above, check [Create Local BLAST Database]({{<relref "create_local_blast_database" >}}).
A basic SLURM example of a nucleotide BLAST run against the non-redundant **nt** BLAST database with `8 CPUs` is provided below. When running a BLAST alignment, it is recommended to first copy the query and database files to the **/scratch/** directory of the worker node. Moreover, the BLAST output is also saved in this directory (**/scratch/blastn_output.alignments**). After BLAST finishes, the output file is copied from the worker node to your current work directory.
{{% notice info %}}
**Please note that the worker nodes can not write to the */home/* directories and therefore you need to run your job from your */work/* directory.**
**This example will first copy your database to faster local storage called “scratch”. This can greatly improve performance!**
{{% /notice %}}
{{% panel header="`blastn_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=BlastN
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastN.%J.out
#SBATCH --error=BlastN.%J.err
module load blast/2.7
module load biodata/1.0
cd $WORK/<project_folder>
cp $BLAST/nt.* /scratch/
cp input_reads.fasta /scratch/
blastn -query /scratch/input_reads.fasta -db /scratch/nt -out /scratch/blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE
cp /scratch/blastn_output.alignments $WORK/<project_folder>
{{< /highlight >}}
{{% /panel %}}
One important BLAST parameter is the **e-value threshold**, which limits the hits returned to those with an e-value lower than the given threshold. To show only the hits with an **e-value** lower than 1e-10, modify the given script as follows:
{{< highlight bash >}}
$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE -evalue 1e-10
{{< /highlight >}}
The default BLAST output is in pairwise format. However, BLAST's parameter **-outfmt** supports output in [different formats](https://www.ncbi.nlm.nih.gov/books/NBK279684/) that are easier to parse.
A basic SLURM example of a protein BLAST run against the non-redundant **nr** BLAST database with tabular output format and `8 CPUs` is shown below. Similarly as before, the query and database files are copied to the **/scratch/** directory. The BLAST output is also saved in this directory (**/scratch/blastx_output.alignments**). After BLAST finishes, the output file is copied from the worker node to your current work directory.
{{% notice info %}}
**Please note that the worker nodes can not write to the */home/* directories and therefore you need to run your job from your */work/* directory.**
**This example will first copy your database to faster local storage called “scratch”. This can greatly improve performance!**
{{% /notice %}}
{{% panel header="`blastx_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX.%J.out
#SBATCH --error=BlastX.%J.err
module load blast/2.7
module load biodata/1.0
cd $WORK/<project_folder>
cp $BLAST/nr.* /scratch/
cp input_reads.fasta /scratch/
blastx -query /scratch/input_reads.fasta -db /scratch/nr -outfmt 6 -out /scratch/blastx_output.alignments -num_threads $SLURM_NTASKS_PER_NODE
cp /scratch/blastx_output.alignments $WORK/<project_folder>
{{< /highlight >}}
{{% /panel %}}
+++
title = "Biodata Module"
description = "How to use Biodata Module on HCC machines"
scripts = ["https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/jquery.tablesorter.min.js", "https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/widgets/widget-pager.min.js","https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/widgets/widget-filter.min.js","/js/sort-table.js"]
css = ["http://mottie.github.io/tablesorter/css/theme.default.css","https://mottie.github.io/tablesorter/css/theme.dropbox.css", "https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/css/jquery.tablesorter.pager.min.css","https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/css/filter.formatter.min.css"]
weight = "52"
+++
HCC hosts multiple databases (BLAST, KEGG, PANTHER, InterProScan), genome files, short read aligned indices etc. on Crane.
In order to use these resources, the "**biodata**" module needs to be loaded first.
For details on how to load modules, please check [Module Commands]({{< relref "/applications/modules/_index.md" >}}).
Loading the "**biodata**" module will pre-set many environment variables, but most likely you will only need a subset of them. Environment variables can be used in your command or script by prefixing `$` to the name.
The major environment variables are:
- **$DATA** - main directory
- **$BLAST** - Directory containing all available BLAST (nucleotide and protein) databases
- **$KEGG** - KEGG database main entry point (requires license)
- **$PANTHER** - PANTHER database main entry point (latest)
- **$IPR** - InterProScan database main entry point (latest)
- **$GENOMES** - Directory containing all available genomes (multiple sources, builds possible)
- **$INDICES** - Directory containing indices for bowtie, bowtie2, bwa for all available genomes
- **$UNIPROT** - Directory containing the latest release of the full UniProt database
In order to check what genomes are available, you can type:
{{< highlight bash >}}
$ ls $GENOMES
{{< /highlight >}}
In order to check what BLAST databases are available, you can just type:
{{< highlight bash >}}
$ ls $BLAST
{{< /highlight >}}
An example of how to run Bowtie2 local alignment on Crane utilizing the default Horse (*Equus caballus*) index (`$BOWTIE2_HORSE`) with paired-end fasta files and 8 CPUs is shown below:
{{% panel header="`bowtie2_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=Bowtie2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=Bowtie2.%J.out
#SBATCH --error=Bowtie2.%J.err
module load bowtie/2.2
module load biodata
bowtie2 -x $BOWTIE2_HORSE -f -1 input_reads_pair_1.fasta -2 input_reads_pair_2.fasta -S bowtie2_alignments.sam --local -p $SLURM_NTASKS_PER_NODE
{{< /highlight >}}
{{% /panel %}}
An example of BLAST run against the non-redundant nucleotide database available on Crane is provided below:
{{% panel header="`blastn_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=BlastN
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=BlastN.%J.out
#SBATCH --error=BlastN.%J.err
module load blast/2.7
module load biodata
cp $BLAST/nt.* /scratch
cp input_reads.fasta /scratch
blastn -db /scratch/nt -query /scratch/input_reads.fasta -out /scratch/blast_nucleotide.results
cp /scratch/blast_nucleotide.results .
{{< /highlight >}}
{{% /panel %}}
### Available Organisms
The organisms and their appropriate environment variables for all genomes and chromosome files, as well as indices, are shown in the table below.
{{< table url="http://rhino-head.unl.edu:8192/bio/data/json" >}}
+++
title = "DMTCP Checkpointing"
description = "How to use the DMTCP utility to checkpoint your application."
+++
[DMTCP](http://dmtcp.sourceforge.net)
(Distributed MultiThreaded Checkpointing) is a checkpointing package for
applications. Checkpointing allows a simulation to be resumed after it
fails due to failing resources (e.g. hardware, software, or exceeded
time and memory limits).
DMTCP supports both sequential and multi-threaded applications. Some
examples of programs on Linux distributions that can be used with
DMTCP are OpenMP applications, MATLAB, Python, Perl, MySQL, bash, gdb, X Windows, etc.
DMTCP provides support for several resource managers, including SLURM,
the resource manager used at HCC. The DMTCP module is available on
Crane, and is enabled by typing:
{{< highlight bash >}}
module load dmtcp
{{< /highlight >}}
After the module is loaded, the first step is to run the command:
{{< highlight bash >}}
[<username>@login.crane ~]$ dmtcp_launch --new-coordinator --rm --interval <interval_time_seconds> <your_command>
{{< /highlight >}}
where the `--rm` option enables SLURM support,
**\<interval_time_seconds\>** is the time in seconds between
automatic checkpoints, and **\<your_command\>** is the actual
command you want to run and checkpoint.
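For instance, a minimal sketch that checkpoints a long-running program once an hour (the program name and its input file are placeholders):
{{< highlight bash >}}
# Checkpoint ./my_simulation every 3600 seconds, with SLURM support enabled
dmtcp_launch --new-coordinator --rm --interval 3600 ./my_simulation input.dat
{{< /highlight >}}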
Besides the general options shown above, more `dmtcp_launch` options
can be seen by using:
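{{< highlight bash >}}
dmtcp_launch --help
{{< /highlight >}}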