+++
title = "Quickstarts"
weight = "10"
+++
The quick start guides require that you already have an HCC account.  You
can get an HCC account by applying on the
[HCC website](http://hcc.unl.edu/newusers/).
{{% children %}}
+++
title = "Anvil: HCC's Cloud"
description = "How to use Anvil, HCC's OpenStack-based cloud resource"
weight = "60"
+++
- [Overview](#overview)
- [Cloud Terms](#cloud-terms)
- [Steps for Access](#steps-for-access)
- [Backups](#backups)
{{% notice tip %}}
Have your account and are ready to go? Visit the Anvil OpenStack web
interface at https://anvil.unl.edu.
{{% /notice %}}
---
### Overview
Anvil is the Holland Computing Center's cloud computing resource, based
on the [OpenStack](https://www.openstack.org) software.  
OpenStack is a free and open-source software platform for
cloud computing.  Anvil was created to address the needs of NU's
research community that are not well served by a traditional
batch-scheduled Linux cluster environment.  Examples of use cases that
are well suited to Anvil include:
- A highly interactive environment, especially GUI applications
- Applications that require root-level access, such as kernel modification or
virtualization work
- Alternate operating systems, such as Windows or other distributions
of Linux
- Test cluster environments for various software frameworks, such as
[Hadoop](http://hadoop.apache.org)
or [Spark](https://spark.apache.org)
- Cluster applications that require a persistent resource, such as a
web or database server
Using Anvil, one or more virtual machines (VMs) can easily be created
via a user-friendly web dashboard.  Created VMs are then accessible from
HCC clusters, or your own workstation once connected to the Anvil
Virtual Private Network (VPN).  Access is through standard means,
typically via SSH for Linux VMs and Remote Desktop for Windows VMs.
### Cloud Terms
There are a few terms used within the OpenStack interface and in the
instructions below that may be unfamiliar.  The following brief
definitions may be useful.  More detailed information is available in
the [OpenStack User Guide](http://docs.openstack.org/user-guide).
- **Project**:  A project is the base unit of ownership in
OpenStack.  Resources (CPUs, RAM, storage, etc.) are allocated and
user accounts are associated with a project.  Within Anvil, each HCC
research group corresponds directly to a project.  Similar to
resource allocation on HCC clusters, the members of a group share
the [project's resources]({{< relref "what_are_the_per_group_resource_limits" >}}).
 
- **Image**:  An image corresponds to everything needed to create a
virtual machine for a specific operating system (OS), such as Linux
or Windows.  HCC creates and maintains [basic Windows and Linux]({{< relref "available_images" >}})
images for convenience.
Users can also create their own images that can then be uploaded to
OpenStack and used within the project.
 
- **Flavor**:  A flavor (also known as an *instance type*) defines the
parameters (i.e. resources) of the virtual machine.  This includes
things such as number of CPUs, amount of RAM, storage, etc.  There
are many instance types [available within Anvil]({{< relref "anvil_instance_types" >}}),
designed to meet a variety of needs.
 
- **Instance**:  An instance is a running virtual machine, created
by combining an image (the basic OS) with a flavor (resources).
 That is, *Image + Flavor = Instance*.
 
- **Volume**:  A volume is a means for persistent storage within
OpenStack.  When an instance is destroyed, any additional data that
was on the OS hard drive is lost.  A volume can be thought of as
similar to an external hard drive.  It can be attached to an
instance and accessed as a second drive.  When the instance is
destroyed, data on the volume is retained.  It can then be attached
and accessed from another instance later.
### Steps for Access
The guide below outlines the steps needed to begin using Anvil.  Please
note that Anvil is currently in the *beta testing* phase.  While
reasonable precautions are taken against data loss, **sole copies of
precious or irreproducible data should not be placed or left on Anvil**.
1. **Request access to Anvil**
Access and resources are provided on a per-group basis, similar to
HCC clusters.  For details, please see [What are the per group
resource limits?]({{< relref "what_are_the_per_group_resource_limits" >}})
To begin using Anvil, users should fill out the short request form
at http://hcc.unl.edu/request-anvil-access.
An automated confirmation email will be sent. After the group owner approves the request, an HCC staff
member will follow up once access is available.
2. **Create SSH keys**
OpenStack uses SSH key pairs to identify users and control access to
the VMs themselves, as opposed to the traditional username/password
combination.  SSH key pairs consist of two files, a public key and a
private key.  The public file can be shared freely; this file will
be uploaded to OpenStack and associated with your account.  The
private key file should be treated the same as a password.  **Do not
share your private key and always keep it in a secure location.**
Even if you have previously created a key pair for another purpose,
it's best practice to create a dedicated pair for use with Anvil.
The process for creating key pairs differs between Windows and
Mac.  Follow the relevant guide below for your operating system; a
minimal command-line sketch follows the guide links.
1. [Creating SSH key pairs on Windows]({{< relref "creating_ssh_key_pairs_on_windows" >}})
2. [Creating SSH key pairs on Mac]({{< relref "creating_ssh_key_pairs_on_mac" >}})
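If you prefer the command line on Mac (or Linux), a minimal sketch of the process looks like the following; the file name `anvil_key` is just an example, and the guides above remain the authoritative instructions.
{{< highlight bash >}}
# Generate a dedicated RSA key pair for Anvil (the file name is an example)
ssh-keygen -t rsa -b 4096 -f ~/.ssh/anvil_key

# The public key (~/.ssh/anvil_key.pub) is what gets uploaded to OpenStack;
# keep the private key (~/.ssh/anvil_key) secret, like a password
chmod 600 ~/.ssh/anvil_key
{{< /highlight >}}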
3. **Connect to the Anvil VPN**
The Anvil web portal is accessible from the Internet.  For security
reasons, however, the Anvil instances themselves are not generally
accessible from the Internet.  In order to access the instances from
either on or off campus, you will first need to be connected to the
Anvil VPN.  Follow the instructions below to connect.
1. [Connecting to the Anvil VPN]({{< relref "connecting_to_the_anvil_vpn" >}})
4. **Add the SSH Key Pair to your account**
Before creating your first instance, you'll need to associate the
SSH key created in step 2 with your account.  Follow the guide
below to log in to the web dashboard and add the key pair.
1. [Adding SSH Key Pairs]({{< relref "adding_ssh_key_pairs" >}})
5. **Create an instance**
Once the setup steps above are completed, you can create an
instance within the web dashboard.  Follow the guide below to create
an instance.
1. [Creating an Instance]({{< relref "creating_an_instance" >}})
6. **Connect to your instance**
After an instance has been created, you can connect (log in) and
begin to use it.  Connecting is done via SSH or X2Go for Linux
instances and via Remote Desktop (RDP) for Windows instances.
Follow the relevant guide below for your instance and the type of
OS you're connecting from; a brief SSH sketch follows the guide links.
1. [Connecting to Windows Instances]({{< relref "connecting_to_windows_instances" >}})
2. [Connecting to Linux Instances via SSH from Mac]({{< relref "connecting_to_linux_instances_from_mac" >}})
3. [Connecting to Linux instances via SSH from Windows]({{< relref "connecting_to_linux_instances_from_windows" >}})
4. [Connecting to Linux instances using X2Go (for images with Xfce)]({{< relref "connecting_to_linux_instances_using_x2go" >}})
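As a minimal illustration for Linux instances, an SSH connection from a Mac or Linux terminal (while connected to the Anvil VPN) might look like the following; the key file, username, and address are placeholders, and the guides above cover the details for each OS.
{{< highlight bash >}}
# Connect to a Linux instance with the private key created earlier.
# Replace the placeholders; the default username depends on the image used.
ssh -i ~/.ssh/anvil_key <username>@<instance-ip-address>
{{< /highlight >}}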
7. **Create and attach a volume to your instance (optional)**
Volumes are a means within OpenStack for persistent storage.  When
an instance is destroyed, all data that was placed on the OS hard
drive is lost.  A volume can be thought of as similar to an external
hard drive.  It can be attached to and detached from an instance as
needed.  Data on the volume will persist until the volume itself is
destroyed.  Creating a volume is an optional step, but may be useful
in certain cases.  The process of creating and attaching a volume
from the web dashboard is the same regardless of the type (Linux or
Windows) of instance it will be attached to.  Once the volume is
attached, follow the corresponding guide for your instance's OS to
format and make the volume usable within your instance; a brief Linux
sketch follows the guide links.
1. [Creating and attaching a volume]({{< relref "creating_and_attaching_a_volume" >}})
2. [Formatting and mounting a volume in Windows]({{< relref "formatting_and_mounting_a_volume_in_windows" >}})
3. [Formatting and mounting a volume in Linux]({{< relref "formatting_and_mounting_a_volume_in_linux" >}})
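As a rough sketch of the Linux case, preparing a newly attached, empty volume typically involves steps like these; the device name `/dev/vdb` and mount point `/mnt/data` are assumptions that may differ on your instance, so follow the guide above before running anything.
{{< highlight bash >}}
# Identify the newly attached volume (often /dev/vdb, but verify first)
lsblk

# Create a filesystem on the empty volume -- this erases any existing data
sudo mkfs.ext4 /dev/vdb

# Mount the volume at a directory of your choice
sudo mkdir -p /mnt/data
sudo mount /dev/vdb /mnt/data
{{< /highlight >}}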
8. **Transferring files to or from your instance (optional)**
Transferring files to or from an instance is similar to doing so
with a personal laptop or workstation.  To transfer between an
instance and another HCC resource, both SCP and [Globus
Connect]({{< relref "/Data_Transfer/globus_connect" >}}) can be used; a brief SCP sketch is
shown at the end of this step.  For transferring between an instance
and a laptop/workstation or another instance, standard file sharing
utilities such as Dropbox or Box can be used.
Globus may also be used, with one stipulation.  In order to
transfer files between two personal endpoints, a Globus Plus
subscription is required.  As part of HCC's Globus Provider Plan,
HCC can provide this on a per-user basis free of charge.  If you are
interested in Globus Plus, please email
{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
with your request and a brief explanation.
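As a minimal SCP sketch, copying files between your current machine and a Linux instance (while connected to the Anvil VPN) might look like this; the key file, username, address, and file names are placeholders.
{{< highlight bash >}}
# Copy a local file to the instance's home directory
scp -i ~/.ssh/anvil_key results.tar.gz <username>@<instance-ip-address>:~/

# Copy a file from the instance back to the current directory
scp -i ~/.ssh/anvil_key <username>@<instance-ip-address>:~/data.csv .
{{< /highlight >}}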
### Backups
HCC creates daily backups of images and volume snapshots for disaster
recovery. All users' images, detached volumes, and volume snapshots will
be backed up on a daily basis. The ephemeral disks of VMs and attached
volumes will NOT be backed up. If you would like your attached volumes
to be backed up, make a snapshot by going to the “Volumes” tab, clicking
the down arrow next to the “Edit Volume” button of the volume you want
to snapshot, and then selecting “Create Snapshot”.
Please note the backup function is for disaster recovery use only. HCC
is unable to restore single files within instances.  Further, HCC's
disaster recovery backups should not be the only source of backups for
important data. The backup policies are subject to change without prior
notice. To retrieve your backups, please contact HCC. If you have
special concerns, please contact us at
{{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu).
+++
title = "Jupyter Notebooks on Crane"
description = "How to access and use a Jupyter Notebook"
weight = 20
+++
- [Connecting to Crane](#connecting-to-crane)
- [Running Code](#running-code)
- [Opening a Terminal](#opening-a-terminal)
- [Using Custom Packages](#using-custom-packages)
## Connecting to Crane
The Jupyter project describes its notebook ("Jupyter Notebook") as
an open-source web application that allows you to create and share documents that contain live code,
equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation,
statistical modeling, data visualization, machine learning, and much more.
1. To open a Jupyter notebook, [Sign in](https://crane.unl.edu) to crane.unl.edu using your HCC credentials (NOT your
campus credentials).
{{< figure src="/images/jupyterLogin.png" >}}
2. Select your preferred authentication method.
{{< figure src="/images/jupyterPush.png" >}}
3. Choose a job profile. Select "Notebook via SLURM Job | Small (1 core, 4GB RAM, 8 hours)" for light tasks such as debugging or small-scale testing.
Select the other options based on your computing needs. Note that a SLURM Job will save to your "work" directory.
{{< figure src="/images/jupyterjob.png" >}}
## Running Code
1. Select the "New" dropdown menu and select the file type you want to create.
{{< figure src="/images/jupyterNew.png" >}}
2. A new tab will open, where you can enter your code. Run your code by selecting the "play" icon.
{{< figure src="/images/jupyterCode.png">}}
## Opening a Terminal
1. From your user home page, select "terminal" from the "New" drop-down menu.
{{< figure src="/images/jupyterTerminal.png">}}
2. A terminal opens in a new tab. You can enter [Linux commands]({{< relref "basic_linux_commands" >}})
at the prompt, as in the brief example below.
{{< figure src="/images/jupyterTerminal2.png">}}
## Using Custom Packages
Many popular `python` and `R` packages are already installed and available within Jupyter Notebooks.
However, it is possible to install custom packages to be used in notebooks by creating a custom Anaconda
Environment. Detailed information on how to create such an environment can be found at
[Using an Anaconda Environment in a Jupyter Notebook on Crane]({{< relref "/Applications/Using_Your_Own_Software/using_anaconda_package_manager#using-an-anaconda-environment-in-a-jupyter-notebook-on-crane" >}}).
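As a rough sketch of that workflow, creating a custom environment that includes a notebook kernel might look like the following; the environment name and package list are placeholders, and the linked guide is the authoritative reference for making the environment visible in Jupyter.
{{< highlight bash >}}
# From a Crane terminal: load Anaconda and create an environment that
# contains ipykernel plus the packages your notebook needs (names are examples)
module load anaconda
conda create -n mynotebook python numpy pandas ipykernel
{{< /highlight >}}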
---
 
+++
title = "BLAST with Allinea Performance Reports"
description = "Example of how to profile BLAST using Allinea Performance Reports."
+++
A simple example of using
[BLAST]({{< relref "/Applications/Application_Specific_Guides/bioinformatics_tools/alignment_tools/blast/running_blast_alignment" >}}) 
with Allinea Performance Reports (`perf-report`) on Crane is shown below:
{{% panel theme="info" header="blastn_perf_report.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --job-name=BlastN
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=20:00:00
#SBATCH --mem=50gb
#SBATCH --output=BlastN.info
#SBATCH --error=BlastN.error
module load allinea
module load blast/2.2.29
cd $WORK/<project_folder>
cp -r /work/HCC/DATA/blastdb/nt/ /tmp/
cp input_reads.fasta /tmp/
perf-report --openmp-threads=$SLURM_NTASKS_PER_NODE --nompi `which blastn` \
-query /tmp/input_reads.fasta -db /tmp/nt/nt -out \
blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE
cp blastn_output.alignments .
{{< /highlight >}}
{{% /panel %}}
BLAST uses OpenMP and therefore the Allinea Performance Reports options
`--openmp-threads` and `--nompi` are used. The perf-report
part, `perf-report --openmp-threads=$SLURM_NTASKS_PER_NODE --nompi`,
is placed in front of the actual `blastn` command we want
to analyze.
{{% notice info %}}
If you see the error "**Allinea Performance Reports - target file
'application' does not exist on this machine... exiting**", this means
that instead of just using the executable '*application*', the full path
to that application is required. This is the reason why in the script
above, instead of using "*blastn*", we use *\`which blastn\`* which
gives the full path of the *blastn* executable.
{{% /notice %}}
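For illustration, `which` simply prints the full path of the executable found on your `$PATH`, which is what `perf-report` needs:
{{< highlight bash >}}
# Print the full path of the blastn executable provided by the loaded module
which blastn
{{< /highlight >}}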
When the application finishes, the performance report is generated in
the working directory.
For the executed application, this is what the report looks like:
{{< figure src="/images/11635296.png" width="850" >}}
From the report, we can see that **blastn** is a Compute-Bound
application. The difference between mean (11.1 GB) and peak (26.3 GB)
memory is significant, and this may be a sign of workload imbalance or a
memory leak. Moreover, 89.6% of the time is spent synchronizing
threads in parallel regions, which also suggests workload imbalance.
Running Allinea Performance Reports and identifying application
bottlenecks is useful for improving the application and making
better use of the available resources.
+++
title = "Ray with Allinea Performance Reports"
description = "Example of how to profile Ray using Allinea Performance Reports"
+++
A simple example of using [Ray]({{< relref "/Applications/Application_Specific_Guides/bioinformatics_tools/de_novo_assembly_tools/ray" >}})
with Allinea Performance Reports (`perf-report`) on Tusker is shown below:
{{% panel theme="info" header="ray_perf_report.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --job-name=Ray
#SBATCH --ntasks-per-node=16
#SBATCH --time=10:00:00
#SBATCH --mem=70gb
#SBATCH --output=Ray.info
#SBATCH --error=Ray.error
module load allinea
module load compiler/gcc/4.7 openmpi/1.6 ray/2.3
perf-report mpiexec -n 16 Ray -k 31 -p input_reads_pair_1.fasta input_reads_pair_2.fasta -o output_directory
{{< /highlight >}}
{{% /panel %}}
Ray is an MPI application, and therefore additional Allinea Performance Reports
options are not required.  The `perf-report` command is placed in front of the
actual `Ray` command we want to analyze.
When the application finishes, the performance report is generated in
the working directory.
For the executed application, this is what the report looks like:
{{< figure src="/images/11635303.png" width="850" >}}
From the report, we can see that **Ray** is a Compute-Bound application.
Most of the running time is spent in point-to-point calls with a low
transfer rate, which may be caused by inefficient message sizes.
Therefore, running this application with fewer MPI processes and more
data on each process may be more efficient.
Running Allinea Performance Reports and identifying application
bottlenecks is useful for improving the application and making
better use of the available resources.
+++
title = " Running BLAST Alignment"
description = "How to run BLAST alignment on HCC resources"
weight = "10"
+++
Basic BLAST has the following commands:
- **blastn**: search nucleotide database using a nucleotide query
- **blastp**: search protein database using a protein query
- **blastx**: search protein database using a translated nucleotide query
- **tblastn**: search translated nucleotide database using a protein query
- **tblastx**: search translated nucleotide database using a translated nucleotide query
The basic usage of **blastn** is:
{{< highlight bash >}}
$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments [options]
{{< /highlight >}}
where **input_reads.fasta** is an input file of sequence data in fasta format, **input_reads_db** is the generated BLAST database, and **blastn_output.alignments** is the output file where the alignments are stored.
Additional parameters can be found in the [BLAST manual](https://www.ncbi.nlm.nih.gov/books/NBK279690/), or by typing:
{{< highlight bash >}}
$ blastn -help
{{< /highlight >}}
These BLAST alignment commands are multi-threaded, and therefore using the BLAST option **-num_threads <number_of_CPUs>** is recommended.
HCC hosts multiple BLAST databases and indices on Crane. In order to use these resources, the ["biodata" module]({{<relref "/Applications/Application_Specific_Guides/bioinformatics_tools/biodata_module">}}) needs to be loaded first. The **$BLAST** variable contains the following currently available databases:
- **16SMicrobial**
- **env_nt**
- **est**
- **est_human**
- **est_mouse**
- **est_others**
- **gss**
- **human_genomic**
- **human_genomic_transcript**
- **mouse_genomic_transcript**
- **nr**
- **nt**
- **other_genomic**
- **refseq_genomic**
- **refseq_rna**
- **sts**
- **swissprot**
- **tsa_nr**
- **tsa_nt**
If you want to create and use a BLAST database that is not mentioned above, check [Create Local BLAST Database]({{<relref "create_local_blast_database" >}}).
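If you do build your own database, a minimal sketch using the BLAST+ `makeblastdb` utility looks something like this (the file names are placeholders matching the usage example above); see the linked page for details.
{{< highlight bash >}}
# Build a nucleotide BLAST database from a FASTA file; the resulting
# input_reads_db.* files can then be passed to blastn via -db
makeblastdb -in input_reads.fasta -dbtype nucl -out input_reads_db
{{< /highlight >}}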
A basic SLURM example of a nucleotide BLAST run against the non-redundant **nt** BLAST database with `8 CPUs` is provided below. When running a BLAST alignment, it is recommended to first copy the query and database files to the **/scratch/** directory of the worker node. Moreover, the BLAST output is also saved in this directory (**/scratch/blastn_output.alignments**). After BLAST finishes, the output file is copied from the worker node to your current work directory.
{{% notice info %}}
**Please note that the worker nodes cannot write to the */home/* directories and therefore you need to run your job from your */work/* directory.**
**This example will first copy your database to faster local storage called “scratch”. This can greatly improve performance!**
{{% /notice %}}
{{% panel header="`blastn_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=BlastN
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastN.%J.out
#SBATCH --error=BlastN.%J.err
module load blast/2.7
module load biodata/1.0
cd $WORK/<project_folder>
cp $BLAST/nt.* /scratch/
cp input_reads.fasta /scratch/
blastn -query /scratch/input_reads.fasta -db /scratch/nt -out /scratch/blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE
cp /scratch/blastn_output.alignments $WORK/<project_folder>
{{< /highlight >}}
{{% /panel %}}
One important BLAST parameter is the **e-value threshold**, which changes the number of hits returned by showing only those with an e-value lower than the given threshold. To show the hits with an **e-value** lower than 1e-10, modify the given script as follows:
{{< highlight bash >}}
$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE -evalue 1e-10
{{< /highlight >}}
The default BLAST output is in pairwise format. However, BLAST's parameter **-outfmt** supports output in [different formats](https://www.ncbi.nlm.nih.gov/books/NBK279684/) that are easier to parse.
A basic SLURM example of a protein BLAST run against the non-redundant **nr** BLAST database with tabular output format and `8 CPUs` is shown below. As before, the query and database files are copied to the **/scratch/** directory. The BLAST output is also saved in this directory (**/scratch/blastx_output.alignments**). After BLAST finishes, the output file is copied from the worker node to your current work directory.
{{% notice info %}}
**Please note that the worker nodes cannot write to the */home/* directories and therefore you need to run your job from your */work/* directory.**
**This example will first copy your database to faster local storage called “scratch”. This can greatly improve performance!**
{{% /notice %}}
{{% panel header="`blastx_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX.%J.out
#SBATCH --error=BlastX.%J.err
module load blast/2.7
module load biodata/1.0
cd $WORK/<project_folder>
cp $BLAST/nr.* /scratch/
cp input_reads.fasta /scratch/
blastx -query /scratch/input_reads.fasta -db /scratch/nr -outfmt 6 -out /scratch/blastx_output.alignments -num_threads $SLURM_NTASKS_PER_NODE
cp /scratch/blastx_output.alignments $WORK/<project_folder>
{{< /highlight >}}
{{% /panel %}}
+++
title = "Biodata Module"
description = "How to use Biodata Module on HCC machines"
scripts = ["https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/jquery.tablesorter.min.js", "https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/widgets/widget-pager.min.js","https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/widgets/widget-filter.min.js","/js/sort-table.js"]
css = ["http://mottie.github.io/tablesorter/css/theme.default.css","https://mottie.github.io/tablesorter/css/theme.dropbox.css", "https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/css/jquery.tablesorter.pager.min.css","https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/css/filter.formatter.min.css"]
weight = "52"
+++
HCC hosts multiple databases (BLAST, KEGG, PANTHER, InterProScan), genome files, short-read alignment indices, etc. on Crane.
In order to use these resources, the "**biodata**" module needs to be loaded first.
For instructions on how to load modules, please check [Module Commands]({{< relref "module_commands" >}}).
Loading the "**biodata**" module will pre-set many environment variables, but most likely you will only need a subset of them. Environment variables can be used in your command or script by prefixing `$` to the name.
The major environment variables are:
- **$DATA** - main directory
- **$BLAST** - Directory containing all available BLAST (nucleotide and protein) databases
- **$KEGG** - KEGG database main entry point (requires license)
- **$PANTHER** - PANTHER database main entry point (latest)
- **$IPR** - InterProScan database main entry point (latest)
- **$GENOMES** - Directory containing all available genomes (multiple sources, builds possible)
- **$INDICES** - Directory containing indices for bowtie, bowtie2, bwa for all available genomes
- **$UNIPROT** - Directory containing latest release of full UniProt database
In order to check what genomes are available, you can type:
{{< highlight bash >}}
$ ls $GENOMES
{{< /highlight >}}
In order to check what BLAST databases are available, you can just type:
{{< highlight bash >}}
$ ls $BLAST
{{< /highlight >}}
An example of how to run Bowtie2 local alignment on Crane utilizing the default Horse, *Equus caballus* index (*BOWTIE2\_HORSE*) with paired-end fasta files and 8 CPUs is shown below:
{{% panel header="`bowtie2_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=Bowtie2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=Bowtie2.%J.out
#SBATCH --error=Bowtie2.%J.err
module load bowtie/2.2
module load biodata
bowtie2 -x $BOWTIE2_HORSE -f -1 input_reads_pair_1.fasta -2 input_reads_pair_2.fasta -S bowtie2_alignments.sam --local -p $SLURM_NTASKS_PER_NODE
{{< /highlight >}}
{{% /panel %}}
An example of a BLAST run against the non-redundant nucleotide database available on Crane is provided below:
{{% panel header="`blastn_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=BlastN
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=BlastN.%J.out
#SBATCH --error=BlastN.%J.err
module load blast/2.7
module load biodata
cp $BLAST/nt.* /scratch
cp input_reads.fasta /scratch
blastn -db /scratch/nt -query /scratch/input_reads.fasta -out /scratch/blast_nucleotide.results
cp /scratch/blast_nucleotide.results .
{{< /highlight >}}
{{% /panel %}}
### Available Organisms
The organisms and their corresponding environment variables for all genomes and chromosome files, as well as indices, are shown in the table below.