diff --git a/content/guides/running_applications/bioinformatics_tools/biodata_module/_index.md b/content/guides/running_applications/bioinformatics_tools/biodata_module/_index.md index 6c0ffb769cd3df8897c0c92aeb18b0fc310a9c67..91fa25446604ef2bc84ae686bb7dd94bf55f7bd8 100644 --- a/content/guides/running_applications/bioinformatics_tools/biodata_module/_index.md +++ b/content/guides/running_applications/bioinformatics_tools/biodata_module/_index.md @@ -1,149 +1,79 @@ -1. [HCC-DOCS](index.html) -2. [HCC-DOCS Home](HCC-DOCS-Home_327685.html) -3. [HCC Documentation](HCC-Documentation_332651.html) -4. [Running Applications](Running-Applications_7471153.html) -5. [Bioinformatics Tools](Bioinformatics-Tools_8193279.html) ++++ +title = "Biodata Module" +description = "How to use Biodata Module on HCC machines" +weight = "52" ++++ -<span id="title-text"> HCC-DOCS : Biodata Module </span> -======================================================== +HCC hosts multiple databases (BLAST, KEGG, PANTHER, InterProScan), genome files, short read aligned indices, etc. on both Tusker and Crane. +In order to use these resources, the "**biodata**" module needs to be loaded first. +For information on how to load modules, please check [Module Commands](#module_commands). -Created by <span class="author"> Adam Caprez</span>, last modified on -Feb 22, 2017 - -| Name | Version | Resource | |---------|---------|----------| | biodata | 1.0 | Tusker | - -| Name | Version | Resource | |---------|---------|----------| | biodata | 1.0 | Crane | - - -HCC hosts multiple databases (BLAST, KEGG, PANTHER, InterProScan), -genome files, short read aligned indices etc. on both Tusker and -Crane. -In order to use these resources, the "**biodata**" module needs to be -loaded first. -For how to load module, please check [Module -Commands](Module-Commands_332464.html). - -Loading the "**biodata**" module will pre-set many environment -variables, but most likely you will only need a subset of them. 
-Environment variables can be used in your command or script by prefixing -a **$** sign to the name. +Loading the "**biodata**" module will pre-set many environment variables, but most likely you will only need a subset of them. Environment variables can be used in your command or script by prefixing `$` to the name. The major environment variables are: **$DATA** - main directory -**$BLAST** - Directory containing all available BLAST (nucleotide and -protein) databases +**$BLAST** - Directory containing all available BLAST (nucleotide and protein) databases **$KEGG** - KEGG database main entry point (requires license) **$PANTHER** - PANTHER database main entry point (latest) **$IPR** - InterProScan database main entry point (latest) -**$GENOMES** - Directory containing all available genomes (multiple -sources, builds possible -**$INDICES** - Directory containing indices for bowtie, bowtie2, bwa for -all available genomes -**$UNIPROT** - Directory containing latest release of full UniProt -database - - In order to check what genomes are available, you can just type: - -**Check available GENOMES** - -``` syntaxhighlighter-pre -ls $GENOMES -``` - - +**$GENOMES** - Directory containing all available genomes (multiple sources, builds possible) +**$INDICES** - Directory containing indices for bowtie, bowtie2, bwa for all available genomes +**$UNIPROT** - Directory containing latest release of full UniProt database +\\ +\\ +\\ +In order to check what genomes are available, you can type: +{{< highlight bash >}} +$ ls $GENOMES +{{< /highlight >}} +\\ In order to check what BLAST databases are available, you can just type: - -**Check available BLAST databases** - -``` syntaxhighlighter-pre -ls $BLAST -``` - - -An example of how to run Bowtie2 local alignment on Tusker utilizing the -default Horse, *Equus caballus* index (*BOWTIE2\_HORSE*) with paired-end -fasta files and 8 CPUs is shown below: - -**bowtie2\_alignment.submit** - -\#!/bin/sh -\#SBATCH --job-name=Bowtie2 
-\#SBATCH --nodes=1 -\#SBATCH --ntasks-per-node=8 -\#SBATCH --time=168:00:00 -\#SBATCH --mem=50gb -\#SBATCH --output=Bowtie2.%J.out -\#SBATCH --error=Bowtie2.%J.err - - - -module load biodata/1.0 +{{< highlight bash >}} +$ ls $BLAST +{{< /highlight >}} +\\ +An example of how to run Bowtie2 local alignment on Crane utilizing the default Horse, *Equus caballus* index (*BOWTIE2\_HORSE*) with paired-end fasta files and 8 CPUs is shown below: +{{% panel header="`bowtie2_alignment.submit`"%}} +{{< highlight bash >}} +#!/bin/sh +#SBATCH --job-name=Bowtie2 +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=8 +#SBATCH --time=168:00:00 +#SBATCH --mem=10gb +#SBATCH --output=Bowtie2.%J.out +#SBATCH --error=Bowtie2.%J.err module load bowtie/2.2 - -bowtie2 -x $BOWTIE2\_HORSE -f -1 input\_reads\_pair\_1.fasta -2 -input\_reads\_pair\_2.fasta -S bowtie2\_alignments.sam --local -p -$SLURM\_NTASKS\_PER\_NODE - +module load biodata +bowtie2 -x $BOWTIE2_HORSE -f -1 input_reads_pair_1.fasta -2 input_reads_pair_2.fasta -S bowtie2_alignments.sam --local -p $SLURM_NTASKS_PER_NODE + +{{< /highlight >}} +{{% /panel %}} +\\ +An example of a BLAST run against the non-redundant nucleotide database available on Crane is provided below: +{{% panel header="`blastn_alignment.submit`"%}} +{{< highlight bash >}} +#!/bin/sh +#SBATCH --job-name=BlastN +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=8 +#SBATCH --time=168:00:00 +#SBATCH --mem=10gb +#SBATCH --output=BlastN.%J.out +#SBATCH --error=BlastN.%J.err + +module load blast/2.7 +module load biodata +cp $BLAST/nt.* /scratch +cp input_reads.fasta /scratch + +blastn -db /scratch/nt -query /scratch/input_reads.fasta -out /scratch/blast_nucleotide.results +cp /scratch/blast_nucleotide.results . 
+ +{{< /highlight >}} +{{% /panel %}} -An example of BLAST run against the yeast nucleotide database available -on Tusker is provided below: - -**blastn\_alignment.submit** - -\#!/bin/sh -\#SBATCH --job-name=BlastN -\#SBATCH --nodes=1 -\#SBATCH --ntasks-per-node=8 -\#SBATCH --time=168:00:00 -\#SBATCH --mem=50gb -\#SBATCH --output=BlastN.%J.out -\#SBATCH --error=BlastN.%J.err - - - -module load biodata/1.0 - -module load blast/2.2 - -cp $BLAST/yeast.nt.\* /tmp -cp yeast.query /tmp - -blastn -db /tmp/yeast.nt -query /tmp/yeast.query -out -/tmp/blast\_nucleotide.results - -cp /tmp/blast\_nucleotide.results . - - -The organisms and their appropriate environmental variables for all -genomes and chromosome files, as well as for short read aligned indices -are shown on the link below: - -Attachments: ------------- - -<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" /> -[cb\_blastn\_biodata.xsl](attachments/15171887/15171888.xsl) -(application/octet-stream) -<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" /> -[cb\_bowtie2\_biodata.xsl](attachments/15171887/15171889.xsl) -(application/octet-stream) -<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" /> -[crane\_biodata\_version.xsl](attachments/15171887/15171890.xsl) -(application/octet-stream) -<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" /> -[crane\_modules.xml](attachments/15171887/15171891.xml) -(application/octet-stream) -<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" /> -[tusker\_biodata\_version.xsl](attachments/15171887/15171892.xsl) -(application/octet-stream) -<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" /> -[tusker\_modules.xml](attachments/15171887/15171893.xml) -(application/octet-stream) - +The organisms and their appropriate environment variables for all genomes and chromosome files, as well as for short read aligned indices, are shown at the link below: +[Organisms](#organisms) diff 
--git a/content/guides/running_applications/bioinformatics_tools/downloading_sra_data_from_ncbi.md b/content/guides/running_applications/bioinformatics_tools/downloading_sra_data_from_ncbi.md index 8d233fe2995f11c25ffc38bc7156df6088155a96..60186ff0cd494b9a8acd7f15400a9a509c160d01 100644 --- a/content/guides/running_applications/bioinformatics_tools/downloading_sra_data_from_ncbi.md +++ b/content/guides/running_applications/bioinformatics_tools/downloading_sra_data_from_ncbi.md @@ -1,87 +1,42 @@ -1. [HCC-DOCS](index.html) -2. [HCC-DOCS Home](HCC-DOCS-Home_327685.html) -3. [HCC Documentation](HCC-Documentation_332651.html) -4. [Running Applications](Running-Applications_7471153.html) -5. [Bioinformatics Tools](Bioinformatics-Tools_8193279.html) - -<span id="title-text"> HCC-DOCS : Downloading SRA data from NCBI </span> -======================================================================== - -Created by <span class="author"> Adam Caprez</span>, last modified by -<span class="editor"> Natasha Pavlovikj</span> on Apr 16, 2018 ++++ +title = "Downloading SRA data from NCBI" +description = "How to download data from NCBI" +weight = "52" ++++ One way to download high-volume data from NCBI is to use command line utilities, such as **wget**, **ftp** or Aspera Connect **ascp** -plugin. -The Aspera Connect plugin is commonly used high-performance transfer -plugin that provides the best transfer speed. - -In order to use this plugin on the HCC supercomputers, you need to -download the latest Linux version from the Asera web site -(<a href="http://downloads.asperasoft.com/en/downloads/8?list" class="external-link">http://downloads.asperasoft.com/en/downloads/8?list</a>). 
-The current Linux version is 3.7.4, and after you login to HCC and open -a command line interface, just type: - -**Download Aspera Plugin** - -``` syntaxhighlighter-pre -[<username>@login.tusker ~]$ wget https://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz -[<username>@login.tusker ~]$ tar xvf aspera-connect-3.7.4.147727-linux-64.tar.gz -``` - -This will download and extract the plugin in your current -directory. After that, you need to install the plugin by typing: - -**Install Aspera Plugin** - -``` syntaxhighlighter-pre -[<username>@login.tusker ~]$ sh aspera-connect-3.7.4.147727-linux-64.sh -``` - -This command will install the plugin in **.aspera/** directory in your -**home/** directory -(*/home/<groupname>/<username>/.aspera/*). - -The basic usage of the Aspera plugin is: - -**Basic Usage of Aspera Plugin** - -``` syntaxhighlighter-pre -[<username>@login.tusker ~]$ ~/.aspera/connect/bin/ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l <max_download_rate_in_Mbps>m anonftp@ftp.ncbi.nlm.nih.gov:/<files_to_transfer> <local_work_output_directory> -``` - -where **-k 1** enables resume of partial transfers, **-T** disables -encryption for maximum throughput, and **-l** sets the transfer rate. - -**<files\_to\_transfer>** mentioned in the basic usage of Aspera +plugin. The Aspera Connect plugin is a commonly used high-performance transfer +plugin that provides the best transfer speed. + +This plugin is available on our clusters as a module. 
In order to use it, load the appropriate module first: +{{< highlight bash >}} +$ module load aspera-cli +{{< /highlight >}} +\\ +The basic usage of the Aspera plugin is: +{{< highlight bash >}} +$ ascp -i $ASPERA_PUBLIC_KEY -k 1 -T -l <max_download_rate_in_Mbps>m anonftp@ftp.ncbi.nlm.nih.gov:/<files_to_transfer> <local_work_output_directory> +{{< /highlight >}} +where **-k 1** enables resume of partial transfers, **-T** disables encryption for maximum throughput, and **-l** sets the transfer rate. +\\ +\\ +\\ +**\<files_to_transfer\>** mentioned in the basic usage of Aspera plugin has a specifically defined pattern that needs to be followed: - -**<files\_to\_transfer>** - -``` syntaxhighlighter-pre +{{< highlight bash >}} <files_to_transfer> = /sra/sra-instant/reads/ByRun/sra/SRR|ERR|DRR/<first_6_characters_of_accession>/<accession>/<accession>.sra -``` - -where **SRR\|ERR\|DRR** should be either **SRR**, **ERR **or **DRR** and -should match the prefix of the target **.sra** file. - +{{< /highlight >}} +where **SRR\|ERR\|DRR** should be either **SRR**, **ERR** or **DRR** and should match the prefix of the target **.sra** file. 
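As a sketch of how this pattern expands, the path can be assembled from the run accession alone (the accession SRR304976 below is just an example; variable names are illustrative):

```bash
# Build the <files_to_transfer> path from an SRA run accession.
# The three-letter prefix (SRR, ERR or DRR) and the first six
# characters of the accession determine the directory layout.
ACC=SRR304976
PREFIX=${ACC:0:3}   # SRR, ERR or DRR
SIX=${ACC:0:6}      # first 6 characters of the accession
FILES_TO_TRANSFER="/sra/sra-instant/reads/ByRun/sra/${PREFIX}/${SIX}/${ACC}/${ACC}.sra"
echo "$FILES_TO_TRANSFER"
# -> /sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra
```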
+\\ +\\ +\\ More **ascp** options can be seen by using: - -**Additional ASCP Options** - -``` syntaxhighlighter-pre -[<username>@login.tusker ~]$ ascp --help -``` - - -Finally, if you want to download the **SRR304976** file from NCBI in -your work **data/** directory with downloading speed of **1000 Mbps**, -you use: - -**ASCP Command Run** - -``` syntaxhighlighter-pre -[<username>@login.tusker ~]$ ~/.aspera/connect/bin/ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l 1000m anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra /work/<groupname>/<username>/data/ -``` - - +{{< highlight bash >}} +$ ascp --help +{{< /highlight >}} +\\ +For example, if you want to download the **SRR304976** file from NCBI to your $WORK **data/** directory with a download speed of **1000 Mbps**, you should use the following command: +{{< highlight bash >}} +$ ascp -i $ASPERA_PUBLIC_KEY -k 1 -T -l 1000m anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra /work/[groupname]/[username]/data/ +{{< /highlight >}} \ No newline at end of file diff --git a/content/guides/running_applications/bioinformatics_tools/qiime.md b/content/guides/running_applications/bioinformatics_tools/qiime.md new file mode 100644 index 0000000000000000000000000000000000000000..b8ad2d3d8911b0dca9c90b25ebe4657e7323d3e8 --- /dev/null +++ b/content/guides/running_applications/bioinformatics_tools/qiime.md @@ -0,0 +1,54 @@ ++++ +title = "QIIME" +description = "How to run QIIME jobs on HCC machines" +weight = "52" ++++ + +QIIME (Quantitative Insights Into Microbial Ecology) (http://qiime.org) is a bioinformatics software package for conducting microbial community analysis. It is used to analyze raw DNA sequencing data generated from various sequencing technologies (Sanger, Roche/454, Illumina) from fungal, viral, bacterial and archaeal communities. 
As part of its analysis, QIIME produces a wide range of statistics, publication-quality graphics and different options for viewing the outputs. + +QIIME consists of a number of scripts that have different functionalities. Some of these include demultiplexing and quality filtering, OTU picking, phylogenetic reconstruction, taxonomic assignment, and diversity analyses and visualizations. +\\ +\\ +\\ +Some common QIIME scripts are: + +- validate_mapping_file.py +- split_libraries.py +- pick_de_novo_otus.py +- pick_closed_reference_otus.py +- pick_open_reference_otus.py +- summarize_taxa_through_plots.py +- multiple_rarefactions.py +- alpha_diversity.py +- collate_alpha.py +- make_rarefaction_plots.py +- jackknifed_beta_diversity.py +- otu_category_significance.py +- group_significance.py +- summarize_taxa.py + +\\ +A sample QIIME submit script to run **pick_open_reference_otus.py** is: + +{{% panel header="`qiime.submit`"%}} +{{< highlight bash >}} +#!/bin/sh +#SBATCH --job-name=QIIME +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=8 +#SBATCH --time=168:00:00 +#SBATCH --mem=64gb +#SBATCH --output=qiime.%J.out +#SBATCH --error=qiime.%J.err + +module load qiime/2.0 +pick_open_reference_otus.py --parallel --jobs_to_start $SLURM_CPUS_ON_NODE -i /work/[groupname]/[username]/input.fasta -o /work/[groupname]/[username]/project/out_${SLURM_JOB_ID} + +{{< /highlight >}} +{{% /panel %}} + +To run QIIME with this script, update the input sequences option (**-i**) and the output directory path (**-o**). + +In the example above, we use the variable **${SLURM_JOB_ID}** as part of the output directory. This ensures each QIIME run will have a unique output directory. \ No newline at end of file
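A minimal sketch of that unique-directory pattern, run outside of Slurm with SLURM_JOB_ID set by hand (inside a real batch job the scheduler exports it automatically; the group and user names here are hypothetical):

```bash
# Derive a unique output directory from the Slurm job ID.
# SLURM_JOB_ID is set manually here only to illustrate the result.
SLURM_JOB_ID=1234567
OUTDIR="/work/mygroup/myuser/project/out_${SLURM_JOB_ID}"
echo "$OUTDIR"
# -> /work/mygroup/myuser/project/out_1234567
```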