Add part 1 for bioinformatics pages

Commit 03656231 authored 6 years ago by npavlovikj
--- a/content/guides/running_applications/bioinformatics_tools/biodata_module/_index.md
+++ b/content/guides/running_applications/bioinformatics_tools/biodata_module/_index.md
-1.  [HCC-DOCS](index.html)
+++
-2.  [HCC-DOCS Home](HCC-DOCS-Home_327685.html)
+title = "Biodata Module"
-3.  [HCC Documentation](HCC-Documentation_332651.html)
+description = "How to use Biodata Module on HCC machines"
-4.  [Running Applications](Running-Applications_7471153.html)
+weight = "52"
-5.  [Bioinformatics Tools](Bioinformatics-Tools_8193279.html)
+++
-<span id="title-text"> HCC-DOCS : Biodata Module </span>
+HCC hosts multiple databases (BLAST, KEGG, PANTHER, InterProScan), genome files, short-read alignment indices, etc. on both Tusker and Crane.  
-========================================================
+In order to use these resources, the "**biodata**" module needs to be loaded first.  
+For instructions on loading a module, see [Module Commands](#module_commands).
-Created by <span class="author"> Adam Caprez</span>, last modified on
+Loading the "**biodata**" module will pre-set many environment variables, but most likely you will only need a subset of them. Environment variables can be used in your command or script by prefixing `$` to the name.
-Feb 22, 2017
-| Name    | Version | Resource |
-|---------|---------|----------|
-| biodata | 1.0     | Tusker   |
-| Name    | Version | Resource |
-|---------|---------|----------|
-| biodata | 1.0     | Crane    |
-HCC hosts multiple databases (BLAST, KEGG, PANTHER, InterProScan),
-genome files, short read aligned indices etc. on both Tusker and
-Crane.  
-In order to use these resources, the "**biodata**" module needs to be
-loaded first.  
-For how to load module, please check [Module
-Commands](Module-Commands_332464.html). 
-Loading the "**biodata**" module will pre-set many environment
-variables, but most likely you will only need a subset of them.  
-Environment variables can be used in your command or script by prefixing
-a **$** sign to the name.
 The major environment variables are:  
 **$DATA** - main directory  
-**$BLAST** - Directory containing all available BLAST (nucleotide and
+**$BLAST** - Directory containing all available BLAST (nucleotide and protein) databases  
-protein) databases  
 **$KEGG** - KEGG database main entry point (requires license)  
 **$PANTHER** - PANTHER database main entry point (latest)  
 **$IPR** - InterProScan database main entry point (latest)  
-**$GENOMES** - Directory containing all available genomes (multiple
+**$GENOMES** - Directory containing all available genomes (multiple sources, builds possible)  
-sources, builds possible  
+**$INDICES** - Directory containing indices for bowtie, bowtie2, bwa for all available genomes  
-**$INDICES** - Directory containing indices for bowtie, bowtie2, bwa for
+**$UNIPROT** - Directory containing latest release of full UniProt database
-all available genomes  
+\\
-**$UNIPROT** - Directory containing latest release of full UniProt
+\\
-database
+\\
+In order to check what genomes are available, you can type:
+{{< highlight bash >}}
-In order to check what genomes are available, you can just type:
+$ ls $GENOMES
+{{< /highlight >}}
-**Check available GENOMES**
+\\
-``` syntaxhighlighter-pre
-ls $GENOMES
-```
 In order to check what BLAST databases are available, you can just type:
+{{< highlight bash >}}
-**Check available BLAST databases**
+$ ls $BLAST
+{{< /highlight >}}
-``` syntaxhighlighter-pre
+\\
-ls $BLAST
+An example of how to run Bowtie2 local alignment on Crane utilizing the default Horse, *Equus caballus* index (*BOWTIE2\_HORSE*) with paired-end fasta files and 8 CPUs is shown below:
-```
+{{% panel header="`bowtie2_alignment.submit`"%}}
+{{< highlight bash >}}
+#!/bin/sh
-An example of how to run Bowtie2 local alignment on Tusker utilizing the
+#SBATCH --job-name=Bowtie2
-default Horse, *Equus caballus* index (*BOWTIE2\_HORSE*) with paired-end
+#SBATCH --nodes=1
-fasta files and 8 CPUs is shown below:
+#SBATCH --ntasks-per-node=8
+#SBATCH --time=168:00:00
-**bowtie2\_alignment.submit**
+#SBATCH --mem=10gb
+#SBATCH --output=Bowtie2.%J.out
-\#!/bin/sh  
+#SBATCH --error=Bowtie2.%J.err
-\#SBATCH --job-name=Bowtie2  
-\#SBATCH --nodes=1  
-\#SBATCH --ntasks-per-node=8  
-\#SBATCH --time=168:00:00  
-\#SBATCH --mem=50gb  
-\#SBATCH --output=Bowtie2.%J.out  
-\#SBATCH --error=Bowtie2.%J.err
-module load biodata/1.0
 module load bowtie/2.2
+module load biodata
-bowtie2 -x $BOWTIE2\_HORSE -f -1 input\_reads\_pair\_1.fasta -2
+bowtie2 -x $BOWTIE2_HORSE -f -1 input_reads_pair_1.fasta -2 input_reads_pair_2.fasta -S bowtie2_alignments.sam --local -p $SLURM_NTASKS_PER_NODE
-input\_reads\_pair\_2.fasta -S bowtie2\_alignments.sam --local -p
-$SLURM\_NTASKS\_PER\_NODE 
+{{< /highlight >}}
+{{% /panel %}}
+\\
-An example of BLAST run against the yeast nucleotide database available
+An example of a BLAST run against the non-redundant nucleotide database available on Crane is provided below:
-on Tusker is provided below:
+{{% panel header="`blastn_alignment.submit`"%}}
+{{< highlight bash >}}
-**blastn\_alignment.submit**
+#!/bin/sh
+#SBATCH --job-name=BlastN
-\#!/bin/sh  
+#SBATCH --nodes=1
-\#SBATCH --job-name=BlastN  
+#SBATCH --ntasks-per-node=8
-\#SBATCH --nodes=1  
+#SBATCH --time=168:00:00
-\#SBATCH --ntasks-per-node=8  
+#SBATCH --mem=10gb
-\#SBATCH --time=168:00:00  
+#SBATCH --output=BlastN.%J.out
-\#SBATCH --mem=50gb  
+#SBATCH --error=BlastN.%J.err
-\#SBATCH --output=BlastN.%J.out  
-\#SBATCH --error=BlastN.%J.err
+module load blast/2.7
+module load biodata
+cp $BLAST/nt.* /scratch
+cp input_reads.fasta /scratch
-module load biodata/1.0
+blastn -db /scratch/nt -query /scratch/input_reads.fasta -out /scratch/blast_nucleotide.results -num_threads $SLURM_NTASKS_PER_NODE
-module load blast/2.2
+cp /scratch/blast_nucleotide.results .
-cp $BLAST/yeast.nt.\* /tmp  
+{{< /highlight >}}
-cp yeast.query /tmp
+{{% /panel %}}
-blastn -db /tmp/yeast.nt -query /tmp/yeast.query -out
+The organisms and their corresponding environment variables for all genomes and chromosome files, as well as for short-read alignment indices, are listed at the link below:  
-/tmp/blast\_nucleotide.results
+[Organisms](#organisms)
-cp /tmp/blast\_nucleotide.results .
-The organisms and their appropriate environmental variables for all
-genomes and chromosome files, as well as for short read aligned indices
-are shown on the link below:  
-Attachments:
------------
-<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
-[cb\_blastn\_biodata.xsl](attachments/15171887/15171888.xsl)
-(application/octet-stream)  
-<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
-[cb\_bowtie2\_biodata.xsl](attachments/15171887/15171889.xsl)
-(application/octet-stream)  
-<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
-[crane\_biodata\_version.xsl](attachments/15171887/15171890.xsl)
-(application/octet-stream)  
-<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
-[crane\_modules.xml](attachments/15171887/15171891.xml)
-(application/octet-stream)  
-<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
-[tusker\_biodata\_version.xsl](attachments/15171887/15171892.xsl)
-(application/octet-stream)  
-<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
-[tusker\_modules.xml](attachments/15171887/15171893.xml)
-(application/octet-stream)  
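The Biodata Module page above relies on environment variables such as `$GENOMES` and `$BLAST` being set by `module load biodata`. The following is a minimal sketch, not part of this commit, of how a job script might check that such a variable is set before using it. The helper name `require_var` is hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not provided by HCC or the biodata module):
# print the value of the named environment variable, or fail with a
# message if it is unset or empty, e.g. because biodata is not loaded.
require_var() {
    name=$1
    # Indirect lookup; assumes $name is a plain variable name like GENOMES.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "error: \$$name is not set; try 'module load biodata'" >&2
        return 1
    fi
    printf '%s\n' "$value"
}

# Example usage inside a submit script (after 'module load biodata'):
#   genomes_dir=$(require_var GENOMES) || exit 1
#   ls "$genomes_dir"
```

Failing early in this way avoids a job that silently runs `ls` or `cp` against an empty path when the module was not loaded.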
--- a/content/guides/running_applications/bioinformatics_tools/downloading_sra_data_from_ncbi.md
+++ b/content/guides/running_applications/bioinformatics_tools/downloading_sra_data_from_ncbi.md
-1.  [HCC-DOCS](index.html)
+++
-2.  [HCC-DOCS Home](HCC-DOCS-Home_327685.html)
+title = "Downloading SRA data from NCBI"
-3.  [HCC Documentation](HCC-Documentation_332651.html)
+description = "How to download data from NCBI"
-4.  [Running Applications](Running-Applications_7471153.html)
+weight = "52"
-5.  [Bioinformatics Tools](Bioinformatics-Tools_8193279.html)
+++
-<span id="title-text"> HCC-DOCS : Downloading SRA data from NCBI </span>
-========================================================================
-Created by <span class="author"> Adam Caprez</span>, last modified by
-<span class="editor"> Natasha Pavlovikj</span> on Apr 16, 2018
 One way to download high-volume data from NCBI is to use command line
 utilities, such as **wget**, **ftp** or Aspera Connect **ascp**
-plugin.  
+plugin. The Aspera Connect plugin is a commonly used high-performance transfer
-The Aspera Connect plugin is commonly used high-performance transfer
 plugin that provides the best transfer speed.
-In order to use this plugin on the HCC supercomputers, you need to
+This plugin is available on our clusters as a module. In order to use it, load the appropriate module first:
-download the latest Linux version from the Asera web site
+{{< highlight bash >}}
-(<a href="http://downloads.asperasoft.com/en/downloads/8?list" class="external-link">http://downloads.asperasoft.com/en/downloads/8?list</a>).  
+$ module load aspera-cli
-The current Linux version is 3.7.4, and after you login to HCC and open
+{{< /highlight >}}
-a command line interface, just type:
+\\
+The basic usage of the Aspera plugin is:
-**Download Aspera Plugin**
+{{< highlight bash >}}
+$ ascp -i $ASPERA_PUBLIC_KEY -k 1 -T -l <max_download_rate_in_Mbps>m anonftp@ftp.ncbi.nlm.nih.gov:/<files_to_transfer> <local_work_output_directory>
-``` syntaxhighlighter-pre
+{{< /highlight >}}
-[<username>@login.tusker ~]$ wget https://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz
+where **-k 1** enables resume of partial transfers, **-T** disables encryption for maximum throughput, and **-l** sets the transfer rate.
-[<username>@login.tusker ~]$ tar xvf aspera-connect-3.7.4.147727-linux-64.tar.gz
+\\
-```
+\\
+\\
-This will download and extract the plugin in your current
+**\<files_to_transfer\>** mentioned in the basic usage of Aspera
-directory. After that, you need to install the plugin by typing:
-**Install Aspera Plugin**
-``` syntaxhighlighter-pre
-[<username>@login.tusker ~]$ sh aspera-connect-3.7.4.147727-linux-64.sh
-```
-This command will install the plugin in **.aspera/** directory in your
-**home/** directory
-(*/home/&lt;groupname&gt;/&lt;username&gt;/.aspera/*).  
-The basic usage of the Aspera plugin is:
-**Basic Usage of Aspera Plugin**
-``` syntaxhighlighter-pre
-[<username>@login.tusker ~]$ ~/.aspera/connect/bin/ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l <max_download_rate_in_Mbps>m anonftp@ftp.ncbi.nlm.nih.gov:/<files_to_transfer> <local_work_output_directory>
-```
-where **-k 1** enables resume of partial transfers, **-T** disables
-encryption for maximum throughput, and **-l** sets the transfer rate.
-**&lt;files\_to\_transfer&gt;** mentioned in the basic usage of Aspera
 plugin has a specifically defined pattern that needs to be followed:
+{{< highlight bash >}}
-**&lt;files\_to\_transfer&gt;**
-``` syntaxhighlighter-pre
 <files_to_transfer> = /sra/sra-instant/reads/ByRun/sra/SRR|ERR|DRR/<first_6_characters_of_accession>/<accession>/<accession>.sra
-```
+{{< /highlight >}}
+where **SRR\|ERR\|DRR** should be either **SRR**, **ERR**, or **DRR** and should match the prefix of the target **.sra** file.
-where **SRR\|ERR\|DRR** should be either **SRR**, **ERR **or **DRR** and
+\\
-should match the prefix of the target **.sra** file.
+\\
+\\
 More **ascp** options can be seen by using:
+{{< highlight bash >}}
-**Additional ASCP Options**
+$ ascp --help
+{{< /highlight >}}
-``` syntaxhighlighter-pre
+\\
-[<username>@login.tusker ~]$ ascp --help
+For example, to download the **SRR304976** file from NCBI into the **data/** directory under your $WORK with a download speed of **1000 Mbps**, use the following command:
-```
+{{< highlight bash >}}
+$ ascp -i $ASPERA_PUBLIC_KEY -k 1 -T -l 1000m anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra /work/[groupname]/[username]/data/
+{{< /highlight >}}
-Finally, if you want to download the **SRR304976** file from NCBI in
\ No newline at end of file
-your work **data/** directory with downloading speed of **1000 Mbps**,
-you use:
-**ASCP Command Run**
-``` syntaxhighlighter-pre
-[<username>@login.tusker ~]$ ~/.aspera/connect/bin/ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l 1000m anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra /work/<groupname>/<username>/data/
-```
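The `<files_to_transfer>` pattern described in the SRA page above can be derived mechanically from an accession. The sketch below is an illustration only and is not part of this commit; the function name `sra_remote_path` is hypothetical, and the path layout is taken directly from the pattern shown on the page.

```shell
#!/bin/sh
# Hypothetical helper: build the <files_to_transfer> path for an SRA run
# accession, following the pattern
#   /sra/sra-instant/reads/ByRun/sra/SRR|ERR|DRR/<first_6_characters>/<accession>/<accession>.sra
sra_remote_path() {
    acc=$1
    # Only the three prefixes named in the documentation are accepted.
    case $acc in
        SRR*|ERR*|DRR*) ;;
        *)
            echo "error: accession must start with SRR, ERR or DRR: $acc" >&2
            return 1
            ;;
    esac
    prefix=$(printf '%s' "$acc" | cut -c1-3)   # SRR, ERR or DRR
    first6=$(printf '%s' "$acc" | cut -c1-6)   # first 6 characters of the accession
    printf '/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
        "$prefix" "$first6" "$acc" "$acc"
}

# Example usage with the ascp command from the page:
#   ascp -i "$ASPERA_PUBLIC_KEY" -k 1 -T -l 1000m \
#       "anonftp@ftp.ncbi.nlm.nih.gov:$(sra_remote_path SRR304976)" \
#       /work/[groupname]/[username]/data/
```

For the accession used on the page, `sra_remote_path SRR304976` yields the same remote path that appears in the example command.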
--- a/content/guides/running_applications/bioinformatics_tools/qiime.md
+++ b/content/guides/running_applications/bioinformatics_tools/qiime.md
+++
+title = "QIIME"
+description = "How to run QIIME jobs on HCC machines"
+weight = "52"
+++
+QIIME (Quantitative Insights Into Microbial Ecology) (http://qiime.org) is a bioinformatics software package for conducting microbial community analysis. It is used to analyze raw DNA sequencing data generated from various sequencing technologies (Sanger, Roche/454, Illumina) from fungal, viral, bacterial and archaeal communities. As part of its analysis, QIIME produces many statistics, publication-quality graphics, and different options for viewing the outputs.
+QIIME consists of a number of scripts that have different functionalities. Some of these include demultiplexing and quality filtering, OTU picking, phylogenetic reconstruction, taxonomic assignment and diversity analyses and visualizations.
+\\
+\\
+\\
+Some common QIIME scripts are:
+- validate_mapping_file.py
+- split_libraries.py
+- pick_de_novo_otus.py
+- pick_closed_reference_otus.py
+- pick_open_reference_otus.py
+- summarize_taxa_through_plots.py
+- multiple_rarefactions.py
+- alpha_diversity.py
+- collate_alpha.py
+- make_rarefaction_plots.py
+- jackknifed_beta_diversity.py
+- otu_category_significance.py
+- group_significance.py
+- summarize_taxa.py
+\\
+A sample QIIME submit script to run **pick_open_reference_otus.py** is:
+{{% panel header="`qiime.submit`"%}}
+{{< highlight bash >}}
+#!/bin/sh
+#SBATCH --job-name=QIIME
+#SBATCH --nodes=1
+#SBATCH --ntasks-per-node=8
+#SBATCH --time=168:00:00
+#SBATCH --mem=64gb
+#SBATCH --output=qiime.%J.out
+#SBATCH --error=qiime.%J.err
+module load qiime/2.0
+pick_open_reference_otus.py --parallel --jobs_to_start $SLURM_CPUS_ON_NODE -i /work/[groupname]/[username]/input.fasta -o /work/[groupname]/[username]/project/out_${SLURM_JOB_ID}
+{{< /highlight >}}
+{{% /panel %}}
+To run QIIME with this script, update the input sequences option (**-i**) and the output directory path (**-o**).
+In the example above, we use the variable **${SLURM_JOB_ID}** as part of the output directory. This ensures each QIIME run will have a unique output directory.
\ No newline at end of file
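The QIIME page above uses `${SLURM_JOB_ID}` to give each run a unique output directory. As a minimal sketch, not part of this commit, the same idea can be wrapped in a small function that also works when the script is tested outside of SLURM. The function name `qiime_out_dir` and the `manual_<pid>` fallback are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build a per-run output directory name under a base
# directory. Inside a SLURM job the job ID makes the name unique; outside
# of SLURM (an assumed fallback) the shell's process ID is used instead.
qiime_out_dir() {
    base=$1
    printf '%s/out_%s\n' "$base" "${SLURM_JOB_ID:-manual_$$}"
}

# Example usage in the submit script:
#   out_dir=$(qiime_out_dir /work/[groupname]/[username]/project)
#   pick_open_reference_otus.py ... -o "$out_dir"
```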