---
title: BLAT
summary: "How to run BLAT on HCC resources"
---
BLAT is a pairwise alignment tool similar to BLAST. It is more accurate and about 500 times faster than existing tools for mRNA/DNA alignments, and about 50 times faster for protein/protein alignments. BLAT accepts both short and long query and database sequences as input files.
The basic usage of BLAT is:
```bash
$ blat database query output_alignment.txt [options]
```
where **database** is the name of the database used for the alignment, **query** is the name of the input file of sequence data in `fasta/nib/2bit` format, and **output_alignment.txt** is the output alignment file.
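For example, following the usage above, mRNA reads could be aligned against a DNA database with tabular BLAST-like output (the file names here are placeholders):
```bash
$ blat db.fa input_reads.fasta output_alignment.txt -t=dna -q=rna -out=blast8
```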
Additional parameters for BLAT alignment can be found in the [manual](http://genome.ucsc.edu/FAQ/FAQblat), or by using:
```bash
$ blat
```
Running BLAT on Swan with query file `input_reads.fasta` and database `db.fa` is shown below:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=Blat
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
...

module load blat/35x1
blat db.fa input_reads.fasta output_alignment.txt
```
Although BLAT is a single-threaded program (`#SBATCH --nodes=1`, `#SBATCH --ntasks-per-node=1`), it is still much faster than other alignment tools.
---
title: Bowtie
summary: "How to run Bowtie on HCC resources"
---
[Bowtie](http://bowtie-bio.sourceforge.net/index.shtml) is an ultrafast and memory-efficient aligner for aligning large sets of sequencing reads to a reference genome. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small. Bowtie also supports the use of multiple processors to achieve greater alignment speed.
The first and basic step of running Bowtie is to build and format an index from the reference genome. The basic usage of this command, **bowtie-build**, is:
```bash
$ bowtie-build input_reference.fasta index_prefix
```
where **input_reference.fasta** is an input file of sequence reads in fasta format, and **index_prefix** is the prefix of the generated index files.
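For example, building an index from a hypothetical reference `reference.fasta` produces a set of `.ebwt` files that share the chosen prefix:
```bash
$ bowtie-build reference.fasta my_index
$ ls my_index*
my_index.1.ebwt  my_index.2.ebwt  my_index.3.ebwt  my_index.4.ebwt  my_index.rev.1.ebwt  my_index.rev.2.ebwt
```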
After the index of the reference genome is generated, the next step is to align the reads. The basic usage of bowtie is:
```bash
$ bowtie [-q|-f|-r|-c] index_prefix [-1 input_reads_pair_1.[fasta|fastq] -2 input_reads_pair_2.[fasta|fastq] | input_reads.[fasta|fastq]] [options]
```
where **index_prefix** is the generated index using the **bowtie-build** command, and **options** are optional parameters that can be found in the [Bowtie manual](http://bowtie-bio.sourceforge.net/manual.shtml).
Bowtie supports both single-end (`input_reads.[fasta|fastq]`) and paired-end (`input_reads_pair_1.[fasta|fastq]`, `input_reads_pair_2.[fasta|fastq]`) files in fasta or fastq format. The format of the input files also needs to be specified by using the following flags: **-q** (fastq files), **-f** (fasta files), **-r** (raw one-sequence per line), or **-c** (sequences given on command line).
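For instance, a sketch of aligning a single-end fasta file (placeholder file names) combines the **-f** flag with the index and an output file:
```bash
$ bowtie -f index_prefix input_reads.fasta bowtie_alignments.txt
```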
An example of how to run Bowtie alignment on Swan with a single-end fastq file and `8 CPUs` is shown below:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=Bowtie
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
...

module load bowtie/1.1
bowtie -q index_prefix input_reads.fastq -p $SLURM_NTASKS_PER_NODE > bowtie_alignments.sam
```
### Bowtie Output
---
title: Bowtie2
summary: "How to run Bowtie2 on HCC resources"
---
[Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. Although Bowtie and Bowtie2 are both fast read aligners, there are a few main differences between them:
...
Same as with Bowtie, the first and basic step of running Bowtie2 is to build a Bowtie2 index from a reference genome sequence. The basic usage of the command **bowtie2-build** is:
```bash
$ bowtie2-build -f input_reference.fasta index_prefix
```
where **input_reference.fasta** is an input file of sequence reads in fasta format, and **index_prefix** is the prefix of the generated index files. Besides the option **-f**, which is used when the reference input file is a fasta file, the option **-c** can be used when the reference sequences are given on the command line.
The command **bowtie2** takes a Bowtie2 index and a set of sequencing read files and outputs a set of alignments in SAM format. The general **bowtie2** usage is:
```bash
$ bowtie2 -x index_prefix [-q|--qseq|-f|-r|-c] [-1 input_reads_pair_1.[fasta|fastq] -2 input_reads_pair_2.[fasta|fastq] | -U input_reads.[fasta|fastq]] -S bowtie2_alignments.sam [options]
```
where **index_prefix** is the generated index using the **bowtie2-build** command, and **options** are optional parameters that can be found in the [Bowtie2 manual](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml). Bowtie2 supports both single-end (`input_reads.[fasta|fastq]`) and paired-end (`input_reads_pair_1.[fasta|fastq]`, `input_reads_pair_2.[fasta|fastq]`) files in fasta or fastq format. The format of the input files also needs to be specified by using one of the following flags: **-q** (fastq files), **--qseq** (Illumina's qseq format), **-f** (fasta files), **-r** (raw one sequence per line), or **-c** (sequences given on command line).
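As an illustration, a sketch of aligning hypothetical paired-end fastq files combines the **-q** flag with **-1**/**-2**:
```bash
$ bowtie2 -x index_prefix -q -1 input_reads_pair_1.fastq -2 input_reads_pair_2.fastq -S bowtie2_alignments.sam
```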
An example of how to run Bowtie2 local alignment on Swan with paired-end fasta files and `8 CPUs` is shown below:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=Bowtie2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
...

module load bowtie/2.3
bowtie2 -x index_prefix -f -1 input_reads_pair_1.fasta -2 input_reads_pair_2.fasta -S bowtie2_alignments.sam --local -p $SLURM_NTASKS_PER_NODE
```
### Bowtie2 Output
---
title: BWA
summary: "How to use BWA on HCC machines"
---
BWA (Burrows-Wheeler Aligner) is a software package for mapping relatively short nucleotide sequences against a long reference sequence. BWA is slower than Bowtie, but allows indels in the alignment.
The basic usage of BWA is:
```bash
$ bwa COMMAND [options]
```
where **COMMAND** is one of the available BWA commands:
- **index**: index sequences in the FASTA format
...
BWA supports three alignment algorithms, **mem**, **bwasw**, and **aln**/**samse**/**sampe**. **bwa mem** is the latest algorithm, and is faster, more accurate and has better performance than **bwa bwasw** and **bwa aln**/**samse**/**sampe**. Therefore, unless there is a specific reason otherwise, **bwa mem** is recommended for first-time users.
For a detailed description and more information on a specific command, just type:
```bash
$ bwa COMMAND
```
or check the [BWA manual](http://bio-bwa.sourceforge.net/bwa.shtml).
The page [Running BWA Commands](running_bwa_commands) shows how to run BWA on HCC.
---
title: Running BWA Commands
summary: "How to run BWA commands on HCC resources"
---
## BWA Index
The first step of using BWA is to make an index of the reference genome in fasta format. The basic usage of the **bwa index** is:
```bash
$ bwa index [-a bwtsw|is] input_reference.fasta index_prefix
```
where **input_reference.fasta** is an input file of the reference genome in fasta format, and **index_prefix** is the prefix of the generated index files. The option **-a** is required and can have two values: **bwtsw** (does not work for short genomes) and **is** (does not work for long genomes). Therefore, this value is chosen according to the length of the genome.
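For example, following the usage above, a long (e.g., mammalian-sized) reference would be indexed with the **bwtsw** algorithm (file names are placeholders):
```bash
$ bwa index -a bwtsw input_reference.fasta index_prefix
```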
## BWA Mem
The **bwa mem** algorithm is one of the three algorithms provided by BWA. It performs local alignment and produces alignments for different parts of the query sequence. The basic usage of **bwa mem** is:
```bash
$ bwa mem index_prefix [input_reads.fastq|input_reads_pair_1.fastq input_reads_pair_2.fastq] [options]
```
where **index_prefix** is the index for the reference genome generated from **bwa index**, and **input_reads.fastq**, **input_reads_pair_1.fastq**, **input_reads_pair_2.fastq** are the input files of sequencing data that can be single-end or paired-end, respectively. Additional **options** for **bwa mem** can be found in the BWA manual.
A simple SLURM script for running **bwa mem** on Swan with paired-end fastq input data, `index_prefix` as the reference genome index, SAM output file and `8 CPUs` is shown below:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=Bwa_Mem
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
...

module load bwa/0.7
bwa mem index_prefix input_reads_pair_1.fastq input_reads_pair_2.fastq -t $SLURM_NTASKS_PER_NODE > bwa_mem_alignments.sam
```
## BWA Bwasw
The **bwa bwasw** algorithm is another algorithm provided by BWA. For input files with single-end reads it aligns the query sequences. For input files with paired-end reads it performs paired-end alignment that only works for Illumina reads.
An example of **bwa bwasw** for single-end input file `input-reads.fasta` in fasta format and output file `bwa_bwasw_alignments.sam` where the alignments are stored, is shown below:
```bash
$ bwa bwasw index_prefix input_reads.fasta -t $SLURM_NTASKS_PER_NODE > bwa_bwasw_alignments.sam
```
## BWA Aln
The third BWA algorithm, **bwa aln**, aligns the input file of sequence data to the reference genome. An example of running **bwa aln** with the single-end input file `input_reads.fasta` and `8 CPUs` is shown below:
```bash
$ bwa aln index_prefix input_reads.fasta -0 -t $SLURM_NTASKS_PER_NODE > bwa_aln_alignments.sai
```
## BWA Samse and BWA Sampe
The command **bwa samse** uses the `bwa_aln_alignments.sai` output from **bwa aln** in order to generate a SAM file from the alignments for single-end reads.
!!! note ""
```bash
$ bwa samse -f bwa_aln_alignments.sam index_prefix bwa_aln_alignments.sai input_reads.fasta
```
The command **bwa sampe** uses the `bwa_aln_alignments.sai` output from **bwa aln** in order to generate a SAM file from the alignments for paired-end reads.
!!! note ""
```bash
$ bwa sampe -f bwa_aln_alignments.sam index_prefix bwa_aln_alignments_pair_1.sai bwa_aln_alignments_pair_2.sai input_reads_pair_1.fasta input_reads_pair_2.fasta
```
## BWA Fastmap
The command **bwa fastmap** identifies and outputs super-maximal exact matches (SMEMs). The basic usage of **bwa fastmap** is:
```bash
$ bwa fastmap index_prefix input_reads.fasta > bwa_fastmap.matches
```
## BWA Pemerge
The command **bwa pemerge** merges overlapping paired ends and can print either only the merged reads or the unmerged ones. An example of **bwa pemerge** of `input_reads_pair_1.fastq` and `input_reads_pair_2.fastq` with `8 CPUs` and output file `output_reads_merged.fastq` that contains only the merged reads is shown below:
```bash
$ bwa pemerge -m input_reads_pair_1.fastq input_reads_pair_2.fastq -t $SLURM_NTASKS_PER_NODE > output_reads_merged.fastq
```
## BWA Fa2pac
The command **bwa fa2pac** converts fasta to pac files. The general usage of **bwa fa2pac** is:
```bash
$ bwa fa2pac input_reads.fasta pac_prefix
```
## BWA Pac2bwt and BWA Pac2bwtgen
The commands **bwa pac2bwt** and **bwa pac2bwtgen** convert pac to bwt files.
!!! note ""
```bash
$ bwa pac2bwt input_reads.pac output_reads.bwt
```
!!! note ""
```bash
$ bwa pac2bwtgen input_reads.pac output_reads.bwt
```
## BWA Bwtupdate
The command **bwa bwtupdate** updates bwt files to the new format. The general usage of **bwa bwtupdate** is:
```bash
$ bwa bwtupdate input_reads.bwt
```
## BWA Bwt2sa
The command **bwa bwt2sa** generates sa files from bwt and Occ files. The basic usage of **bwa bwt2sa** is:
```bash
$ bwa bwt2sa input_reads.bwt output_reads.sa
```
### Useful Information
In order to test the scalability of BWA (bwa/0.7) on Swan, we used two paired-end input fastq files, `large_1.fastq` and `large_2.fastq`, and one single-end input fasta file, `large.fasta`. Some statistics about the input files and the time and memory resources used by **bwa mem** are shown in the table below:
{% include "../../../../../static/html/bwa.html"%}
---
title: Clustal Omega
summary: "How to run Clustal Omega on HCC resources"
---
[Clustal Omega](http://www.clustal.org/omega/) is a general purpose multiple sequence alignment (MSA) tool used mainly with protein, as well as DNA and RNA sequences. Clustal Omega is a fast and scalable aligner that can align datasets of hundreds of thousands of sequences in a reasonable time.
The general usage of Clustal Omega is:
```bash
$ clustalo -i input_file.fasta -o output_file.fasta [options]
```
where **input_file.fasta** is the multiple sequence input file in `fasta` format, and **output_file.fasta** is the multiple sequence alignment output file in `fasta` format.
...
More Clustal Omega options can be found by typing:
```bash
$ clustalo -h
```
Running Clustal Omega on Swan with input file `input_reads.fasta` with `8 threads` and `10GB memory` is shown below:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=Clustal_Omega
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
...

module load clustal-omega/1.2
clustalo -i input_reads.fasta -o output_msa.sto --outfmt=st --threads=$SLURM_NTASKS_PER_NODE
```
The output file `output_msa.sto` contains the resulting multiple sequence alignments in Stockholm format (**--outfmt=st**).
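Other output formats can be requested in the same way; for example, a sketch producing Phylip output (placeholder file names):
```bash
$ clustalo -i input_reads.fasta -o output_msa.phy --outfmt=phy
```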
Moreover, if you change the command above to:
```bash
$ clustalo -i input_reads.sto --dealign -v
```
Clustal Omega will read the input file in Stockholm format, de-align the sequences, and then re-align them, printing a progress report in the meantime (**-v**). Because no output format is specified, the output will be in the default `fasta` format.
...
### Useful Information
In order to test the Clustal Omega performance, we used three DNA and protein input fasta files, `data_1.fasta`, `data_2.fasta`, `data_3.fasta`. Some statistics about the input files and the time and memory resources used by Clustal Omega are shown in the table below:
{% include "../../../../static/html/clustal_omega.html"%}
---
title: Alignment Tools
summary: "How to use various alignment tools on HCC machines"
---
{{ children('applications/app_specific/bioinformatics_tools/alignment_tools') }}
---
title: TopHat/TopHat2
summary: "How to run TopHat/TopHat2 on HCC resources"
---
[TopHat](https://ccb.jhu.edu/software/tophat/index.shtml) is a fast splice junction mapper for RNA-Seq data. It first aligns RNA-Seq reads to reference genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
...
The basic usage of TopHat2 is:
```bash
$ [tophat|tophat2] [options] index_prefix [input_reads_pair_1.[fasta|fastq] input_reads_pair_2.[fasta|fastq] | input_reads.[fasta|fastq]]
```
where **index_prefix** is the basename of the genome index to be searched. This index is generated prior to running TopHat/TopHat2 by using [Bowtie](../bowtie/)/[Bowtie2](../bowtie2/).
TopHat2 accepts a single file or a comma-separated list of paired-end and single-end reads in fasta or fastq format. The single-end reads need to be provided after the paired-end reads.
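For example, a sketch of supplying two lanes of hypothetical paired-end reads as comma-separated lists (no spaces around the commas):
```bash
$ tophat2 index_prefix lane1_pair_1.fastq,lane2_pair_1.fastq lane1_pair_2.fastq,lane2_pair_2.fastq
```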
More advanced TopHat2 options can be found in [its manual](https://ccb.jhu.edu/software/tophat/manual.shtml), or by typing:
```bash
$ tophat2 -h
```
Prior to running TopHat/TopHat2, an index from the reference genome should be built using Bowtie/Bowtie2. Moreover, TopHat2 requires both the index file and the reference file to be in the same directory. If the reference file is not available, TopHat2 reconstructs it in its initial step using the index file.
An example of how to run TopHat2 on Swan with paired-end fastq files `input_reads_pair_1.fastq` and `input_reads_pair_2.fastq`, reference index `index_prefix` and `8 CPUs` is shown below:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=Tophat2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
...

module load samtools/1.3 bowtie/2.3 tophat/2.0
tophat2 -p $SLURM_NTASKS_PER_NODE index_prefix input_reads_pair_1.fastq input_reads_pair_2.fastq
```
TopHat2 generates its own output directory `tophat_output/` that contains multiple TopHat2-generated files.
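A representative listing might look like the following (exact contents vary with the TopHat2 version and options used):
```bash
$ ls tophat_output/
accepted_hits.bam  align_summary.txt  deletions.bed  insertions.bed  junctions.bed  logs  prep_reads.info  unmapped.bam
```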
---
title: Biodata Module
summary: "How to use Biodata Module on HCC machines"
---
HCC hosts multiple databases (BLAST, KEGG, PANTHER, InterProScan), genome files, short read aligner indices, etc. on Swan.
In order to use these resources, the "**biodata**" module needs to be loaded first.
For how to load modules, please check [Module Commands](/applications/modules/).
!!! note
The *biodata* module is maintained and updated by the Bioinformatics Core Research Facility (BCRF).
Please email bcrf-support@unl.edu or hcc-support@unl.edu with any questions or issues with the module.
Loading the "**biodata**" module will pre-set many environment variables, but most likely you will only need a subset of them. Environment variables can be used in your command or script by prefixing `$` to the name.
The major environment variables are:
...
**$INDICES** - Directory containing indices for bowtie, bowtie2, bwa for all available genomes
**$UNIPROT** - Directory containing latest release of full UniProt database
!!! note
**To access the older format of BLAST databases that work with BLAST+ 2.9 and lower, please use the variable BLAST_V4.**
**The variable BLAST points to the directory with the new version 5 of the nucleotide and protein databases required for BLAST+ 2.10 and higher.**
In order to check what genomes are available, you can type:
```bash
$ ls $GENOMES
```
In order to check what BLAST databases are available, you can just type:
```bash
$ ls $BLAST
```
An example of how to run Bowtie2 local alignment on Swan utilizing the default Horse, *Equus caballus* index (*BOWTIE2\_HORSE*) with paired-end fasta files and 8 CPUs is shown below:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=Bowtie2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
...
module load biodata
bowtie2 -x $BOWTIE2_HORSE -f -1 input_reads_pair_1.fasta -2 input_reads_pair_2.fasta -S bowtie2_alignments.sam --local -p $SLURM_NTASKS_PER_NODE
```
An example of a BLAST run against the non-redundant nucleotide database available on Swan is provided below:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=BlastN
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
...
cp input_reads.fasta /scratch
blastn -db /scratch/nt -query /scratch/input_reads.fasta -out /scratch/blast_nucleotide.results
cp /scratch/blast_nucleotide.results .
```
### Available Organisms
The organisms and their appropriate environment variables for all genomes and chromosome files, as well as indices, are shown in the table below.
{{ json_table("docs/static/json/biodata.json") }}
---
title: BamTools
summary: "How to use BamTools on HCC machines"
---
The SAM/BAM format is a standard format for short read alignments. While SAM is the plain-text version of the alignments, BAM is the compressed, binary version that is used to save space. BamTools is a toolkit for handling BAM files that provides a powerful suite of command-line programs for manipulating and querying BAM files for data.
The basic usage of BamTools is:
```bash
$ bamtools COMMAND [options]
```
where **COMMAND** is one of the following BamTools commands:
- **convert**: Converts between BAM and a number of other formats
...
For a detailed description and more information on a specific command, just type:
```bash
$ bamtools help COMMAND
```
or check the BamTools wiki: https://github.com/pezmaster31/bamtools/wiki.
The page [Running BamTools Commands](running_bamtools_commands) shows how to run BamTools on HCC.
---
title: Running BamTools Commands
summary: "How to run BamTools commands on HCC resources"
---
## BamTools Convert
One of the most frequently used BamTools commands is **convert**.
The basic usage of the BamTools **convert** is:
```bash
$ bamtools convert -format [bed|fasta|fastq|json|pileup|sam|yaml] -in input_alignments.bam -out output_reads.[bed|fasta|fastq|json|pileup|sam|yaml]
```
where the option **-format** specifies the type of the output file, **input_alignments.bam** is the input BAM file, and **-out** defines the name and the type of the converted file.
Running BamTools **convert** on Swan with input file `input_alignments.bam` and output file `output_reads.fastq` is shown below:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=BamTools_Convert
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
...

module load bamtools/2.4
bamtools convert -format fastq -in input_alignments.bam -out output_reads.fastq
```
All BamTools commands are single-threaded, and therefore both `#SBATCH --nodes` and `#SBATCH --ntasks-per-node` are set to **1**.
## BamTools Count
The basic usage of the BamTools **count** is:
```bash
$ bamtools count -in input_alignments.bam
```
The command **bamtools count** outputs the total number of alignments in the BAM file.
## BamTools Coverage
The basic usage of the BamTools **coverage** is:
```bash
$ bamtools coverage -in input_alignments.bam -out output_reads_coverage.txt
```
The command **bamtools coverage** prints the coverage data for a single BAM file.
## BamTools Filter
The basic usage of the BamTools **filter** is:
```bash
$ bamtools filter -in input_alignments.bam -out output_alignments_filtered.bam -length 100
```
The command **bamtools filter** filters the BAM file based on specified options. In this example, the resulting bam file `output_alignments_filtered.bam` contains alignments with length longer than 100 base pairs.
## BamTools Header
The basic usage of the BamTools **header** is:
```bash
$ bamtools header -in input_alignments.bam -out output_alignments_header.txt
```
The command **bamtools header** prints the header from a BAM file.
## BamTools Index
The basic usage of the BamTools **index** is:
```bash
$ bamtools index -in input_alignments.bam
```
The command **bamtools index** creates an index for the BAM file and outputs the `input_alignments.bam.bai` file.
## BamTools Merge
The basic usage of the BamTools **merge** is:
```bash
$ bamtools merge -in input_alignments_1.bam -in input_alignments_2.bam -in input_alignments_3.bam -out output_alignments_merged.bam
```
The command **bamtools merge** merges multiple (more than 2) BAM files into one.
## BamTools Random
The basic usage of the BamTools **random** is:
```bash
$ bamtools random -in input_alignments.bam -out output_alignments_100.bam -n 100
```
The command **bamtools random** grabs a random subset of alignments. With the option `-n 100`, 100 randomly chosen alignments are stored in the output file `output_alignments_100.bam`.
## BamTools Resolve
The basic usage of the BamTools **resolve** is:
```bash
$ bamtools resolve -twoPass -in input_alignments.bam -out output_alignments.bam
```
The command **bamtools resolve** resolves paired-end reads. The resolving mode is required, and it can be `-makeStats`, `-markPairs`, or `-twoPass`.
## BamTools Revert
The basic usage of the BamTools **revert** is:
```bash
$ bamtools revert -in input_alignments.bam -out output_alignments_reverted.bam
```
The command **bamtools revert** removes duplicate marks and restores original base qualities.
## BamTools Sort
The basic usage of the BamTools **sort** is:
```bash
$ bamtools sort -in input_alignments.bam -out output_alignments_sorted.bam -byname
```
The command **bamtools sort** sorts a BAM file according to a given option. `output_alignments_sorted.bam` is the resulting file, where the alignments are sorted by name.
## BamTools Split
The basic usage of the BamTools **split** is:
```bash
$ bamtools split -in input_alignments.bam -mapped
```
The command **bamtools split** splits a BAM file on a user-specified property and creates a new BAM output file for each value found. In the given example, an output file `input_alignments.MAPPED.bam` is produced after the `-mapped` split option is specified. Besides `-mapped`, the split option can be: `-paired`, `-reference`, or `-tag <tag_name>`.
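For instance, a sketch of splitting by the read group tag (the resulting file names depend on the tag values present in the input):
```bash
$ bamtools split -in input_alignments.bam -tag RG
```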
## BamTools Stats
The basic usage of the BamTools **stats** is:
```bash
$ bamtools stats -in input_alignments.bam
```
The command **bamtools stats** prints general alignment statistics from the BAM file.
---
title: Data Manipulation Tools
summary: "How to use data manipulation tools on HCC machines"
---
{{ children('applications/app_specific/bioinformatics_tools/data_manipulation_tools') }}
---
title: SAMtools
summary: "How to use SAMtools on HCC machines"
---
The SAM format is a standard format for storing large nucleotide sequence alignments. The BAM format is just the binary form of SAM. [SAMtools](http://www.htslib.org/) is a toolkit for manipulating alignments in SAM/BAM format, including sorting, merging, indexing and generating alignments in a per-position format.
The basic usage of SAMtools is:
```bash
$ samtools COMMAND [options]
```
where **COMMAND** is one of the following SAMtools commands:
- **view**: SAM/BAM and BAM/SAM conversion
...
For a detailed description and more information on a specific command, just type:
```bash
$ samtools COMMAND
```
or check the [SAMtools manual](http://www.htslib.org/doc/samtools.html).
The page [Running SAMtools Commands](running_samtools_commands) shows how to run SAMtools on HCC.
---
title: Running SAMtools Commands
summary: "How to run SAMtools commands on HCC resources"
---
## SAMtools View
One of the most frequently used SAMtools commands is **view**. The basic usage of **samtools view** is:
```bash
$ samtools view input_alignments.[bam|sam] [options] -o output_alignments.[sam|bam]
```
where **input_alignments.[bam|sam]** is the input file with the alignments in BAM/SAM format, and **output_alignments.[sam|bam]** is the output file with the alignments converted into SAM or BAM format, respectively.
Running **samtools view** on Swan with `8 CPUs`, input file `input_alignments.sam` with available header (**-S**), output in BAM format (**-b**) and output file `output_alignments.bam` is shown below:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=SAMtools_View
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
...

module load samtools/1.9
samtools view -bS -@ $SLURM_NTASKS_PER_NODE input_alignments.sam -o output_alignments.bam
```
The most intensive SAMtools commands (**samtools view**, **samtools sort**) are multi-threaded, and therefore using the SAMtools option **-@ <number_of_CPUs>** is recommended.
## SAMtools Sort
Sorting BAM files is recommended for further analysis of these files. The BAM file is sorted based on its position in the reference, as determined by its alignment. An example of using `4 CPUs` to sort the input file `input_alignments.bam` by the read name follows:
```bash
$ samtools sort -n -@ 4 input_alignments.bam -o output_alignments_sorted.bam
```
## SAMtools Index
The **samtools index** command creates a new index file that allows fast look-up of the data in a sorted SAM or BAM file.
```bash
$ samtools index input_alignments_sorted.bam output_index.bai
```
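Once the sorted BAM file is indexed, fast region lookups become possible; for example (hypothetical reference name and coordinates):
```bash
$ samtools view input_alignments_sorted.bam chr1:10000-20000
```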
## SAMtools Idxstats
The **samtools idxstats** command prints stats for the BAM index file. The output is TAB delimited with each line consisting of *reference sequence name*, *sequence length*, *number of mapped reads* and *number of unmapped reads*.
```bash
$ samtools idxstats input_alignments_sorted.bam
```
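The output might look like the following (hypothetical values; the `*` line collects unmapped reads without coordinates):
```bash
$ samtools idxstats input_alignments_sorted.bam
chr1    248956422    52310    120
chr2    242193529    48721    98
*       0            0        1312
```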
## SAMtools Merge
The **samtools merge** command merges multiple sorted alignments into one output file.
```bash
$ samtools merge output_alignments_merge.bam input_alignments_sorted_1.bam input_alignments_sorted_2.bam
```
## SAMtools Faidx
The command **samtools faidx** indexes the reference sequence in fasta format or extracts subsequence from indexed reference sequence.
```bash
$ samtools faidx input_reference.fasta
```
## SAMtools Mpileup
The **samtools mpileup** command generates file in `bcf` or `pileup` format for one or multiple BAM files. For each genomic coordinate, the overlapping read bases and indels at that position in the input BAM file are printed.
```bash
$ samtools mpileup input_alignments_sorted.bam -o output_alignments.bcf
```
## SAMtools Tview
The **samtools tview** command starts an interactive text alignment viewer that can be used to visualize how reads are aligned to specific regions of the reference genome.
```bash
$ samtools tview input_alignments_sorted.bam
```
---
title: SRAtoolkit
summary: "How to run SRAtoolkit on HCC resources"
---
[SRA (Sequence Read Archive)](http://www.ncbi.nlm.nih.gov/sra) is an NCBI-defined format for NGS data. All data submitted to NCBI needs to be in SRA format. The SRA Toolkit provides tools for downloading data, converting different formats of data into SRA format, and, vice versa, extracting SRA data into other formats.
The SRA Toolkit allows converting data from the SRA format to the following formats: `ABI SOLiD native`, `fasta`, `fastq`, `sff`, `sam`, and `Illumina native`. Also, the SRA Toolkit allows converting data from `fasta`, `fastq`, `AB SOLiD-SRF`, `AB SOLiD-native`, `Illumina SRF`, `Illumina native`, `sff`, and `bam` format into the SRA format.
The SRA Toolkit supports downloading SRA data using the **"prefetch"** command:
```bash
$ prefetch <sra_id>
```
where `<sra_id>` is the assigned SRA identification in NCBI (e.g., SRR1482462).
The SRA Toolkit contains multiple **"format"-dump** commands, where **format** is the file format the SRA data is converted to: **abi-dump**, **fastq-dump**, **illumina-dump**, **sam-dump**, **sff-dump**, and **vdb-dump**.
One of the most commonly used commands is **fastq-dump**:
```bash
$ fastq-dump [options] input_reads.sra
```
This command can be applied to SRA data downloaded with **prefetch**.
An example of running **fastq-dump** on Swan to convert an SRA file containing paired-end reads is:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=SRAtoolkit
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
...
#SBATCH --output=SRAtoolkit.%J.out
#SBATCH --error=SRAtoolkit.%J.err

module load SRAtoolkit/2.11
fastq-dump --split-files input_reads.sra
```
This script outputs two paired-end fastq files, `input_reads_1.fastq` and `input_reads_2.fastq`.
To download `bam` files from NCBI using the SRA identification, the following commands can be used:
```bash
$ module load SRAtoolkit/2.11 samtools
$ sam-dump <sra_id> | samtools view -bS - > <sra_id>.bam
```
where `<sra_id>` is the assigned SRA identification in NCBI (e.g., SRR1482462).
All SRAtoolkit commands are single-threaded, and therefore both `#SBATCH --nodes` and `#SBATCH --ntasks-per-node` in the SLURM script are set to **1**.
The SRA Toolkit contains multiple **"format"-load** commands, where **format** is the file format of the input data being converted into SRA format.
An example of the bam file `input_alignments.bam` being uploaded to NCBI is shown below:
```bash
$ bam-load -o input_reads.sra input_alignments.bam
```
Other frequently used SRAtoolkit tools are:
- **prefetch**: allows command-line downloading of SRA, dbGaP, and ADSP data
- **sra-stat**: generate statistics about SRA data
- **sra-pileup**: generate pileup statistics on aligned SRA data
- **vdb-config**: display and modify VDB configuration information
...
- **vdb-decrypt**: decrypt non-SRA dbGaP data
- **vdb-validate**: validate the integrity of downloaded SRA data
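As a quick illustration, the statistics and validation tools can be pointed at a downloaded accession (the accession ID here is only an example):
```bash
$ sra-stat --quick SRR1482462
$ vdb-validate SRR1482462
```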
!!! note
**If needed, the location of the caching on a per-user basis can be changed with `vdb-config -i`.**
---
title: De Novo Assembly Tools
summary: "How to use de novo assembly tools on HCC machines"
---
{{ children('applications/app_specific/bioinformatics_tools/de_novo_assembly_tools') }}
---
title: Oases
summary: "How to run Oases on HCC resources"
---
Velvet by itself generates assembled contigs for DNA data. However, using the Oases extension for Velvet, a transcriptome assembly can be produced. [Oases](https://www.ebi.ac.uk/~zerbino/oases/) is an extension of Velvet for generating de novo assembly for RNA-Seq data. Oases uses the preliminary assembly produced by Velvet as an input, and constructs transcripts.
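For context, a minimal sketch of the preceding Velvet step might look like the following (the hash length and file names are placeholders):
```bash
$ velveth output_directory/ 31 -fastq -shortPaired input_reads.fastq
```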
In order to be able to run Oases, after `velveth`, `velvetg` needs to be run with the `-read_trkg yes` option:
```bash
$ velvetg output_directory/ -min_contig_lgth 200 -read_trkg yes
```
The `output_directory/` after running `velvetg` with the `-read_trkg` option contains the following files:
!!! note ""
```bash
$ ls
contigs.fa Graph2 LastGraph Log PreGraph Roadmaps Sequences stats.txt
```
Oases has a lot of parameters that can be found in its [manual](https://www.ebi.ac.uk/~zerbino/oases/OasesManual.pdf). While Velvet is multi-threaded, Oases is not.
A simple SLURM script to run Oases on the Velvet output stored in `output_directory/` with minimum transcript length of `200` is shown below:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=Velvet_Oases
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
...

module load oases/0.2
oases output_directory/ -min_trans_lgth 200
```
### Oases Output
The `output_directory/` after Oases contains the following files:
!!! note ""
```bash
$ ls output_directory/
contig-ordering.txt contigs.fa Graph2 LastGraph Log PreGraph Roadmaps Sequences stats.txt transcripts.fa
```
Oases produces two additional output files: `transcripts.fa` and `contig-ordering.txt`. The predicted transcript sequences are found in the fasta file `transcripts.fa`.
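A quick way to count the predicted transcripts is to count the fasta headers:
```bash
$ grep -c ">" output_directory/transcripts.fa
```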
### Useful Information
In order to test the Oases (oases/0.2.8) performance, we used three paired-end input fastq files, `small_1.fastq` and `small_2.fastq`, `medium_1.fastq` and `medium_2.fastq`, and `large_1.fastq` and `large_2.fastq`. Some statistics about the input files and the time and memory resources used by Oases are shown in the table below:
{% include "../../../../static/html/oases.html"%}
---
title: Ray
summary: "How to run Ray on HCC resources"
---
[Ray](http://denovoassembler.sourceforge.net/) is a de novo de Bruijn genome assembler that works with next-generation sequencing data (Illumina, 454, SOLiD). Ray is scalable and parallel software that takes advantage of multiple nodes and multiple CPUs using MPI (message passing interface).
Ray can be used for building multiple applications:
...
In order to see all options available for running Ray, just type:
```bash
$ mpiexec Ray -help
```
All options used for Ray can be defined on the command line:
```bash
$ mpiexec Ray -k <kmer_value> -p input_reads_pair_1.[fa|fq] input_reads_pair_2.[fa|fq] -s input_reads.[fa|fq] -o <output_directory>
```
or can be stored in a configuration file `.conf` (one option per line):
```bash
$ mpiexec Ray Ray.conf
```
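A hypothetical `Ray.conf` matching the command line above would simply list the same options, one per line:
```bash
-k 31
-p input_reads_pair_1.fastq input_reads_pair_2.fastq
-s input_reads.fasta
-o output_directory
```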
Ray supports both paired-end (`-p`) and single-end reads (`-s`). Moreover, Ray can detect the input files automatically if the input directory is provided (`-detect-sequence-files input_directory`).
Ray supports odd values for k-mer equal to or greater than 21 (`-k <kmer_value>`).
...
A simple SLURM script for running Ray with both paired-end and single-end data with `k-mer=31`, `8 CPUs` and `4 GB RAM per CPU` is shown below:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=Ray
#SBATCH --ntasks=8
#SBATCH --time=168:00:00
...

module load compiler/gcc/4.7 openmpi/1.6 ray/2.3
mpiexec Ray -k 31 -p input_reads_pair_1.fastq input_reads_pair_2.fastq -s input_reads.fasta -o output_directory
```
where **input_reads_pair_1.fastq** and **input_reads_pair_2.fastq** are the paired-end input files in `fastq` format, and **input_reads.fasta** is the single-end input file in `fasta` format.
!!! note
It is **not** necessary to specify the number of processes with the `-n` option to `mpiexec`. OpenMPI will determine that automatically from SLURM based on the value of the `--ntasks` option.
### Ray Output
...
### Useful Information
In order to test the Ray performance, we used three paired-end input fastq files, `small_1.fastq` and `small_2.fastq`, `medium_1.fastq` and `medium_2.fastq`, and `large_1.fastq` and `large_2.fastq`. Some statistics about the input files and the time and memory resources used by Ray are shown in the table below:
{% include "../../../../static/html/ray.html"%}
---
title: SOAPdenovo2
summary: "How to run SOAPdenovo2 on HCC resources"
---
[SOAPdenovo](http://soap.genomics.org.cn/soapdenovo.html) is a de novo genome assembler for short reads. It is specially designed for Illumina GA short reads and large plant and animal genomes. SOAPdenovo2 is a newer version of SOAPdenovo with an improved algorithm that reduces memory consumption, resolves more repeat regions, increases coverage, and optimizes the assembly for large genomes.
SOAPdenovo2 has two commands, **SOAPdenovo-63mer** and **SOAPdenovo-127mer**.
...
In order to see the options available for **SOAPdenovo-63mer**, just type:
```bash
$ SOAPdenovo-63mer
```
SOAPdenovo2 provides a mechanism to run the whole workflow at once, or in 5 separate steps.
The basic usage of SOAPdenovo2 is:
```bash
$ SOAPdenovo-63mer all -s configFile -o output_directory/outputGraph -K <kmer_value> [options]
```
where **configFile** is a defined configuration file, **outputGraph** is the prefix of the output files, and **kmer_value** is the value of k-mer used for building the assembly (`<=63` for SOAPdenovo-63mer and `<=127` for SOAPdenovo-127mer).
If you want to run the assembly process step by step, then use the following sequential commands:
!!! note "SOAPdenovo2 Step 1 Options"
```bash
SOAPdenovo-63mer pregraph -s configFile -o outputGraph [options]
OR
SOAPdenovo-63mer sparse_pregraph -s configFile -K <kmer_value> -z <genome_size> -o outputGraph [options]
```
!!! note "SOAPdenovo2 Step 2 Options"
```bash
SOAPdenovo-63mer contig -g inputGraph [options]
```
!!! note "SOAPdenovo2 Step 3 Options"
```bash
SOAPdenovo-63mer map -s configFile -g inputGraph [options]
```
!!! note "SOAPdenovo2 Step 4 Options"
```bash
SOAPdenovo-63mer scaff -g inputGraph [options]
```
As the commands above show, in order to run SOAPdenovo2 you first need to create a configuration file (`configFile`) that contains different information about the read files (`read length`, `insert size`, `reads location`). SOAPdenovo2 accepts read files in 3 formats: fasta, fastq and bam.
The example configuration file **configFile** for 2 paired-end fastq files, 1 paired-end fasta file and 1 single-end fastq file looks like:
!!! note ""
```bash
#maximal read length
max_rd_len=150
[LIB]
...
f1=input_reads_pair_1.fa
f2=input_reads_pair_2.fa
#fastq file for single reads
q=input_reads.fq
```
After creating the configuration file **configFile**, the next step is to run the assembler using this file.
A simple SLURM script for running SOAPdenovo2 with `k-mer=31`, `8 CPUs` and `50GB of RAM` is shown below:
!!! note ""
```bash
#!/bin/bash
#SBATCH --job-name=SOAPdenovo2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
...

module load soapdenovo2/r240
SOAPdenovo-63mer all -s configFile -K 31 -o output_directory/output31 -p $SLURM_NTASKS_PER_NODE
```
### SOAPdenovo2 Output
SOAPdenovo2 outputs a number of files in its `output_directory/` after each executed step. The final assembly output is in the `.contig` file.
!!! note ""
```bash
$ ls
output31.Arc output31.ContigIndex output31.gapSeq output31.newContigIndex
output31.bubbleInScaff output31.contigPosInscaff output31.kmerFreq output31.peGrads
output31.contig output31.edge.gz output31.links output31.preArc
```
### Useful Information
In order to test the SOAPdenovo2 (soapdenovo2/r240) performance, we used three input files of different sizes. Some statistics about the input files and the time and memory resources used by SOAPdenovo2 are shown in the table below:
{% include "../../../../static/html/soapdenovo2.html"%}
In general, SOAPdenovo2 is a memory intensive assembler that requires approximately 30-60 GB memory for assembling 50 million reads. However, SOAPdenovo2 is a fast assembler and it takes around an hour to assemble 50 million reads.
---
title: Trinity
summary: "How to use Trinity on HCC machines"
---
[Trinity](https://github.com/trinityrnaseq/trinityrnaseq/wiki) is a method for efficient and robust de novo reconstruction of transcriptomes from RNA-Seq data. Trinity combines four independent software modules: `Normalization`, `Inchworm`, `Chrysalis` and `Assembly`. All these modules can be applied sequentially to process large RNA-Seq datasets.
The basic usage of Trinity is:
```bash
$ Trinity --seqType [fa|fq] --max_memory <maximum_memory> --left input_reads_pair_1.[fa|fq] --right input_reads_pair_2.[fa|fq] [options]
```
where **input_reads_pair_1.[fa|fq]** and **input_reads_pair_2.[fa|fq]** are the input paired-end files of sequence reads in fasta/fastq format, and **--seqType** is the type of these input reads. The option **--max_memory** specifies the maximum memory to use with Trinity.
!!! note
**Trinity produces many intermediate files that can affect the file system. To avoid any issues, please copy all the input data to the faster local storage called "scratch", store the output in "scratch" and finally copy all the needed output files from "scratch" to /work. The "scratch" directories are unique per job and are deleted when the job finishes. This can greatly improve performance!**
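A minimal sketch of that pattern inside a job script might look like the following (the paths and the `--max_memory`/`--CPU` values are placeholders to adapt to your job's resources):
```bash
# copy input data to the faster per-job scratch storage
cp input_reads_pair_1.fq input_reads_pair_2.fq /scratch
# run Trinity with both the intermediate files and the output on scratch
Trinity --seqType fq --max_memory 40G --CPU 8 \
    --left /scratch/input_reads_pair_1.fq --right /scratch/input_reads_pair_2.fq \
    --output /scratch/trinity_out
# copy the final assembly back before the job (and its scratch directory) ends
cp /scratch/trinity_out/Trinity.fasta .
```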
Additional Trinity **options** can be found on the Trinity website, or by typing:
```bash
$ Trinity
```
Running the Trinity pipeline from beginning to end on large datasets may exceed the walltime limit for a single job. Therefore, Trinity provides a mechanism to run the workflow in four separate steps, where each step resumes from the previous one. The same Trinity command and options are run for each step, with an additional option that is included for the different steps. On the last step, the Trinity command is run as normal.
!!! note "Step 1 Options"
```bash
Trinity [options] --no_run_inchworm
```
!!! note "Step 2 Options"
```bash
Trinity [options] --no_run_chrysalis
```
!!! note "Step 3 Options"
```bash
Trinity [options] --no_distributed_trinity_exec
```
!!! note "Step 4 Options"
```bash
Trinity [options]
```
Each step may be run as its own job, providing a workaround for the single job walltime limit. To see how to run each step of Trinity as a single job under the SLURM scheduler on HCC, please check:
{{ children('applications/app_specific/bioinformatics_tools/de_novo_assembly_tools/trinity') }}
### Useful Information
In order to test the Trinity (trinity/r2014-04-13p1) performance, we used three paired-end input fastq files, `small_1.fastq` and `small_2.fastq`, `medium_1.fastq` and `medium_2.fastq`, and `large_1.fastq` and `large_2.fastq`. Some statistics about the input files and the time and memory resources used by Trinity are shown in the table below:
{% include "../../../../../static/html/trinity.html"%}
!!! tip
The Inchworm (step 1) and Chrysalis (step 2) steps can be memory intensive. A basic recommendation is to have **1GB of RAM per 1M ~76 base Illumina paired-end reads**.
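For example, assembling 50 million ~76 base paired-end Illumina reads would call for roughly 50 GB of RAM under this rule of thumb.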