+++
title = "Biodata Module"
description = "How to use Biodata Module on HCC machines"
weight = "52"
+++
HCC hosts multiple databases (BLAST, KEGG, PANTHER, InterProScan), genome files, short read aligned indices, etc. on both Tusker and Crane.
| Name    | Version | Resource |
|---------|---------|----------|
| biodata | 1.0     | Tusker   |
| biodata | 1.0     | Crane    |
In order to use these resources, the "**biodata**" module needs to be loaded first. For how to load a module, please check [Module Commands](#module_commands).

Loading the "**biodata**" module will pre-set many environment variables, but most likely you will only need a subset of them. Environment variables can be used in your command or script by prefixing `$` to the name.
The major environment variables are:

- **$DATA** - main directory
- **$BLAST** - directory containing all available BLAST (nucleotide and protein) databases
- **$KEGG** - KEGG database main entry point (requires license)
- **$PANTHER** - PANTHER database main entry point (latest)
- **$IPR** - InterProScan database main entry point (latest)
- **$GENOMES** - directory containing all available genomes (multiple sources and builds possible)
- **$INDICES** - directory containing indices for bowtie, bowtie2, and bwa for all available genomes
- **$UNIPROT** - directory containing the latest release of the full UniProt database
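For example, you can list the available versions, load the module, and print one of these variables to see where the corresponding data lives (a quick sanity check; the printed path differs per cluster):

{{< highlight bash >}}
$ module avail biodata
$ module load biodata
$ echo $BLAST
{{< /highlight >}}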
In order to check what genomes are available, you can type:

{{< highlight bash >}}
$ ls $GENOMES
{{< /highlight >}}
In order to check what BLAST databases are available, you can just type:

{{< highlight bash >}}
$ ls $BLAST
{{< /highlight >}}

An example of how to run Bowtie2 local alignment on Crane utilizing the default Horse, *Equus caballus* index (*BOWTIE2_HORSE*) with paired-end fasta files and 8 CPUs is shown below:
{{% panel header="`bowtie2_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=Bowtie2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=Bowtie2.%J.out
#SBATCH --error=Bowtie2.%J.err

module load bowtie/2.2
module load biodata

bowtie2 -x $BOWTIE2_HORSE -f -1 input_reads_pair_1.fasta -2 input_reads_pair_2.fasta -S bowtie2_alignments.sam --local -p $SLURM_NTASKS_PER_NODE
{{< /highlight >}}
{{% /panel %}}
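To run the alignment, submit the script to SLURM with `sbatch` and monitor it with `squeue` (a typical workflow sketch; the job ID in the output will differ):

{{< highlight bash >}}
$ sbatch bowtie2_alignment.submit
$ squeue -u <username>
{{< /highlight >}}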
An example of BLAST run against the non-redundant nucleotide database available on Crane is provided below:
{{% panel header="`blastn_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=BlastN
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=BlastN.%J.out
#SBATCH --error=BlastN.%J.err

module load blast/2.7
module load biodata
cp $BLAST/nt.* /scratch
cp input_reads.fasta /scratch

blastn -db /scratch/nt -query /scratch/input_reads.fasta -out /scratch/blast_nucleotide.results
cp /scratch/blast_nucleotide.results .
{{< /highlight >}}
{{% /panel %}}
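Note that the script above requests 8 tasks but `blastn` runs single-threaded by default; if you want the search itself to use those cores, you can pass the count through with the standard BLAST+ `-num_threads` option (a suggested tweak, not part of the original script):

{{< highlight bash >}}
blastn -db /scratch/nt -query /scratch/input_reads.fasta -out /scratch/blast_nucleotide.results -num_threads $SLURM_NTASKS_PER_NODE
{{< /highlight >}}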
The organisms and their corresponding environment variables for all genome and chromosome files, as well as for short read alignment indices, are shown on the link below:

[Organisms](#organisms)
+++
title = "Downloading SRA data from NCBI"
description = "How to download data from NCBI"
weight = "52"
+++
One way to download high-volume data from NCBI is to use command line utilities such as **wget**, **ftp**, or the Aspera Connect **ascp** plugin. The Aspera Connect plugin is a commonly used high-performance transfer plugin that provides the best transfer speed.

This plugin is available on our clusters as a module. In order to use it, load the appropriate module first:

{{< highlight bash >}}
$ module load aspera-cli
{{< /highlight >}}
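You can confirm that the module has put `ascp` on your path and defined the key variable used in the examples below (a quick check; the exact locations vary by cluster):

{{< highlight bash >}}
$ which ascp
$ echo $ASPERA_PUBLIC_KEY
{{< /highlight >}}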
The basic usage of the Aspera plugin is:

{{< highlight bash >}}
$ ascp -i $ASPERA_PUBLIC_KEY -k 1 -T -l <max_download_rate_in_Mbps>m anonftp@ftp.ncbi.nlm.nih.gov:/<files_to_transfer> <local_work_output_directory>
{{< /highlight >}}

where **-k 1** enables resume of partial transfers, **-T** disables encryption for maximum throughput, and **-l** sets the transfer rate.
**\<files_to_transfer\>** mentioned in the basic usage of the Aspera plugin has a specifically defined pattern that needs to be followed:

{{< highlight bash >}}
<files_to_transfer> = /sra/sra-instant/reads/ByRun/sra/SRR|ERR|DRR/<first_6_characters_of_accession>/<accession>/<accession>.sra
{{< /highlight >}}

where **SRR\|ERR\|DRR** should be either **SRR**, **ERR**, or **DRR** and should match the prefix of the target **.sra** file.
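For illustration, the pattern can be assembled from an accession number with plain shell substring expansion (`ACC` below is just a placeholder variable for this sketch, not part of the Aspera tooling):

{{< highlight bash >}}
# Build <files_to_transfer> for a given accession, e.g. SRR304976
ACC=SRR304976
echo "/sra/sra-instant/reads/ByRun/sra/${ACC:0:3}/${ACC:0:6}/${ACC}/${ACC}.sra"
# -> /sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra
{{< /highlight >}}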
More **ascp** options can be seen by using:

{{< highlight bash >}}
$ ascp --help
{{< /highlight >}}

For example, if you want to download the **SRR304976** file from NCBI into the **data/** directory in your $WORK space with a download speed of **1000 Mbps**, you should use the following command:

{{< highlight bash >}}
$ ascp -i $ASPERA_PUBLIC_KEY -k 1 -T -l 1000m anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra /work/[groupname]/[username]/data/
{{< /highlight >}}
+++
title = "QIIME"
description = "How to run QIIME jobs on HCC machines"
weight = "52"
+++
QIIME (Quantitative Insights Into Microbial Ecology) (http://qiime.org) is a bioinformatics package for conducting microbial community analysis. It is used to analyze raw DNA sequencing data generated from various sequencing technologies (Sanger, Roche/454, Illumina) from fungal, viral, bacterial, and archaeal communities. As part of its analysis, QIIME produces many statistics, publication-quality graphics, and different options for viewing the outputs.
QIIME consists of a number of scripts that have different functionalities. Some of these include demultiplexing and quality filtering, OTU picking, phylogenetic reconstruction, taxonomic assignment and diversity analyses and visualizations.
\\
Some common QIIME scripts are:
- validate_mapping_file.py
- split_libraries.py
- pick_de_novo_otus.py
- pick_closed_reference_otus.py
- pick_open_reference_otus.py
- summarize_taxa_through_plots.py
- multiple_rarefactions.py
- alpha_diversity.py
- collate_alpha.py
- make_rarefaction_plots.py
- jackknifed_beta_diversity.py
- otu_category_significance.py
- group_significance.py
- summarize_taxa.py
\\
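Each of these is a standalone command line script; once the QIIME module is loaded, a typical invocation looks like the following (a sketch assuming QIIME 1-style scripts and a mapping file named `mapping.txt`):

{{< highlight bash >}}
$ module load qiime/2.0
$ validate_mapping_file.py -m mapping.txt -o validate_out/
{{< /highlight >}}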
Sample QIIME submit script to run **pick_open_reference_otus.py** is:
{{% panel header="`qiime.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=QIIME
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=64gb
#SBATCH --output=qiime.%J.out
#SBATCH --error=qiime.%J.err
module load qiime/2.0
pick_open_reference_otus.py --parallel --jobs_to_start $SLURM_CPUS_ON_NODE -i /work/[groupname]/[username]/input.fasta -o /work/[groupname]/[username]/project/out_${SLURM_JOB_ID}
{{< /highlight >}}
{{% /panel %}}
To run QIIME with this script, update the input sequences option (**-i**) and the output directory path (**-o**).
In the example above, we use the variable **${SLURM_JOB_ID}** as part of the output directory. This ensures each QIIME run will have a unique output directory.
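For illustration, inside a running job **${SLURM_JOB_ID}** expands to the numeric ID SLURM assigned to that job (the value 1234567 below is made up):

{{< highlight bash >}}
# Inside a job whose ID is 1234567:
$ echo out_${SLURM_JOB_ID}
out_1234567
{{< /highlight >}}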