Commit fec34637 authored by npavlovikj

add bioinformatics pages part 2

parent 9c602df3
......@@ -46,6 +46,7 @@ An example of how to run Bowtie2 local alignment on Crane utilizing the default
module load bowtie/2.2
module load biodata
bowtie2 -x $BOWTIE2_HORSE -f -1 input_reads_pair_1.fasta -2 input_reads_pair_2.fasta -S bowtie2_alignments.sam --local -p $SLURM_NTASKS_PER_NODE
{{< /highlight >}}
......@@ -76,4 +77,3 @@ cp /scratch/blast_nucleotide.results .
The organisms and their appropriate environmental variables for all genomes and chromosome files, as well as for short read aligned indices are shown on the link below:
[Organisms](#organisms)
<span id="title-text"> HCC-DOCS : Pre-Processing Tools </span>
==============================================================
Created by <span class="author"> Adam Caprez</span> on Sep 04, 2014
 
+++
title = "Pre-processing Tools"
description = "How to use pre-processing tools on HCC machines"
weight = "52"
+++
{{% children %}}
\ No newline at end of file
+++
title = "Cutadapt"
description = "How to run Cutadapt on HCC resources"
weight = "10"
+++
<span id="title-text"> HCC-DOCS : Cutadapt </span>
==================================================
Created by <span class="author"> Adam Caprez</span>, last modified by
<span class="editor"> Natasha Pavlovikj</span> on Dec 12, 2016
| Name | Version | Resource |
|----------|---------|----------|
| cutadapt | 1.4 | Tusker |
| cutadapt | 1.4 | Crane |
 
Cutadapt
(<a href="https://code.google.com/p/cutadapt/" class="external-link">https://code.google.com/p/cutadapt/</a>)
is a tool for removing adapter sequences from DNA sequencing data.
Although most of the adapters are located at the 3' end of the
sequencing read, Cutadapt allows multiple adapter removal from both 3'
and 5' ends.
[Cutadapt](https://cutadapt.readthedocs.io/en/stable/index.html) is a tool for removing adapter sequences from DNA sequencing data. Although most of the adapters are located at the 3' end of the sequencing read, Cutadapt allows multiple adapter removal from both 3' and 5' ends.
The basic usage of Cutadapt is:
{{< highlight bash >}}
$ cutadapt [-a|-b|-g] <adapter_sequence> input_reads.[fasta|fastq] > output_reads.[fasta|fastq]
{{< /highlight >}}
**General Cutadapt Usage**
``` syntaxhighlighter-pre
cutadapt [-a|-b|-g] <adapter_sequence> input_reads.[fasta|fastq] > output_reads.[fasta|fastq]
```
where **&lt;adapter\_sequence&gt;** is the nucleotide sequence of the
actual adapter, **input\_reads.\[fasta\|fastq\]** is the input file with
sequencing data in fasta/fastq format, and
**output\_reads.\[fasta\|fastq\]** is the final trimmed file in
fasta/fastq format. The option **-a** removes an adapter from
the 3' end of the sequencing read. The option **-b** removes adapters
ligated to the 5' or 3' end. The option **-g** removes adapter sequences
from the 5' end. These options can be used multiple times for different
adapters.
where **&lt;adapter_sequence&gt;** is the nucleotide sequence of the actual adapter, **input_reads.[fasta|fastq]** is the input file with sequencing data in fasta/fastq format, and **output_reads.[fasta|fastq]** is the final trimmed file in fasta/fastq format.
\\
The option **-a** allows removal of adapters from the 3' end of the sequencing read. The option **-b** removes adapters ligated to the 5' or 3' end. The option **-g** removes adapter sequences from the 5' end. These options can be used multiple times for different adapters.
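As a minimal sketch of how repeated adapter options combine, the helper below assembles a Cutadapt command line from a list of 3' adapters (`-a`) and 5' adapters (`-g`) and prints it instead of running it. The function name `build_cutadapt_cmd` and its argument order are hypothetical and chosen only for illustration; the adapter sequences and file names are the ones used on this page.

```bash
#!/bin/sh
# Hypothetical helper: print a cutadapt command for the given adapters.
#   $1 = space-separated 3' adapters (each becomes "-a <seq>")
#   $2 = space-separated 5' adapters (each becomes "-g <seq>")
#   $3 = input file, $4 = output file
build_cutadapt_cmd() {
    adapters_3p=$1
    adapters_5p=$2
    input=$3
    output=$4
    cmd="cutadapt"
    for a in $adapters_3p; do cmd="$cmd -a $a"; done
    for g in $adapters_5p; do cmd="$cmd -g $g"; done
    # Only print the command; nothing is executed here.
    printf '%s %s > %s\n' "$cmd" "$input" "$output"
}

build_cutadapt_cmd "AGGCACACAGGG TGAGACACGCA" "AACCGGTT" input_reads.fasta output_reads.fasta
```

Printing the command first makes it easy to review the adapter list before placing the line in a submit script.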
More information about the Cutadapt options can be found by typing:
**Additional Cutadapt Options**
``` syntaxhighlighter-pre
[<username>@login.tusker~]$ cutadapt --help
```
A simple Cutadapt script that trims the adapter sequences **AGGCACACAGGG**
and **TGAGACACGCA** from the 3' end and **AACCGGTT** from the 5' end of
a single-end fasta input file is shown below:
**cutadapt.submit**
\#!/bin/sh
\#SBATCH --job-name=Cutadapt
\#SBATCH --nodes=1
\#SBATCH --ntasks-per-node=1
\#SBATCH --time=168:00:00
\#SBATCH --mem=30gb
\#SBATCH --output=Cutadapt.%J.out
\#SBATCH --error=Cutadapt.%J.err
 
module load python/2.7 cutadapt/1.4
cutadapt -a AGGCACACAGGG -a TGAGACACGCA -g AACCGGTT input\_reads.fasta &gt; output\_reads.fasta
Cutadapt is a single-threaded program, which is why the script sets
**\#SBATCH --nodes=1** and **\#SBATCH --ntasks-per-node=1**. Cutadapt
allows paired-end trimming, where each file of the pair is trimmed
separately in a single pass:
**Cutadapt Usage for Paired-End Reads**
``` syntaxhighlighter-pre
cutadapt -a ADAPTER_PAIR_1 input_reads_pair_1.fastq > output_reads_pair_1.fastq
cutadapt -a ADAPTER_PAIR_2 input_reads_pair_2.fastq > output_reads_pair_2.fastq
```
 
**Cutadapt Output**
Besides the fasta/fastq file of reads with the adapter sequences removed,
Cutadapt also outputs statistics for each adapter sequence.
{{< highlight bash >}}
$ cutadapt --help
{{< /highlight >}}
\\
A simple Cutadapt script that trims the adapter sequences **AGGCACACAGGG** and **TGAGACACGCA** from the 3' end and **AACCGGTT** from the 5' end of a single-end fasta input file is shown below:
{{% panel header="`cutadapt.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=Cutadapt
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=Cutadapt.%J.out
#SBATCH --error=Cutadapt.%J.err
module load cutadapt/1.13
cutadapt -a AGGCACACAGGG -a TGAGACACGCA -g AACCGGTT input_reads.fasta > output_reads.fasta
{{< /highlight >}}
{{% /panel %}}
Cutadapt is a single-threaded program, which is why the script sets `#SBATCH --nodes=1` and `#SBATCH --ntasks-per-node=1`.
\\
\\
\\
Cutadapt allows paired-end trimming, where each file of the pair is trimmed separately in a single pass:
{{< highlight bash >}}
$ cutadapt -a ADAPTER_PAIR_1 input_reads_pair_1.fastq > output_reads_pair_1.fastq
$ cutadapt -a ADAPTER_PAIR_2 input_reads_pair_2.fastq > output_reads_pair_2.fastq
{{< /highlight >}}
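Because the two files are trimmed independently, it can be worth confirming that both outputs still contain the same number of reads before passing them to a paired-end tool, for example if filtering options are added to the commands later. The sketch below assumes the common four-lines-per-read FASTQ layout; the helper names `count_fastq_reads` and `pairs_in_sync` are hypothetical.

```bash
#!/bin/sh
# Count records in a FASTQ file, assuming exactly 4 lines per read.
count_fastq_reads() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Succeed only if both mate files hold the same number of reads.
pairs_in_sync() {
    [ "$(count_fastq_reads "$1")" -eq "$(count_fastq_reads "$2")" ]
}

# Run the check only when the trimmed files from the commands above exist.
if [ -f output_reads_pair_1.fastq ] && [ -f output_reads_pair_2.fastq ]; then
    if pairs_in_sync output_reads_pair_1.fastq output_reads_pair_2.fastq; then
        echo "pairs in sync"
    else
        echo "read counts differ between the two files" >&2
    fi
fi
```

This only compares counts; it does not verify that read names match between the two files.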
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">Cutadapt Output</span>
Besides the fasta/fastq file of reads with the adapter sequences removed, Cutadapt also outputs statistics for each adapter sequence.
\ No newline at end of file
+++
title = "PRINSEQ"
description = "How to run PRINSEQ on HCC resources"
weight = "10"
+++
<span id="title-text"> HCC-DOCS : PRINSEQ </span>
=================================================
Created by <span class="author"> Adam Caprez</span>, last modified by
<span class="editor"> Natasha Pavlovikj</span> on Dec 12, 2016
| Name | Version | Resource |
|--------------|---------|----------|
| prinseq-lite | 0.20 | Crane |
 
PRINSEQ (PReprocessing and INformation of SEQuence data)
(<a href="http://prinseq.sourceforge.net/" class="external-link">http://prinseq.sourceforge.net/</a>)
is a tool used for filtering, formatting or trimming genome and
metagenomic sequence data in fasta/fastq format. Moreover, PRINSEQ
generates summary statistics of sequence and quality data.
[PRINSEQ (PReprocessing and INformation of SEQuence data)](http://prinseq.sourceforge.net/) is a tool for filtering, formatting, or trimming genome and metagenomic sequence data in fasta/fastq format. PRINSEQ also generates summary statistics of sequence and quality data.
More information about the PRINSEQ program can be shown with:
**Additional PRINSEQ Options**
``` syntaxhighlighter-pre
[<username>@login.crane ~]$ prinseq-lite.pl --help
```
**PRINSEQ script for single-end fasta data**
{{< highlight bash >}}
$ prinseq-lite.pl --help
{{< /highlight >}}
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">PRINSEQ with single-end fasta data</span>
The basic usage of PRINSEQ for single-end data is:
**General PRINSEQ Usage for Single-End Data**
``` syntaxhighlighter-pre
prinseq-lite.pl [-fasta|-fastq] input_reads.[fasta|fastq] -out_format [1|2|3|4|5] [options]
```
where **input\_reads.\[fasta\|fastq\]** is an input file of sequence
data in fasta/fastq format, and **options** are additional parameters
that can be found in the PRINSEQ manual:
<a href="http://prinseq.sourceforge.net/manual.html" class="external-link">http://prinseq.sourceforge.net/manual.html</a>.
The output format (**-out\_format**) can be **1** (fasta only),
**2** (fasta and qual), **3** (fastq), **4** (fastq and input fasta),
and **5** (fastq, fasta and qual).
A simple PRINSEQ SLURM script for single-end fasta data and fasta output
format is shown below:
**prinseq\_single\_end.submit**
\#!/bin/sh
\#SBATCH --job-name=PRINSEQ
\#SBATCH --nodes=1
\#SBATCH --ntasks-per-node=1
\#SBATCH --time=168:00:00
\#SBATCH --mem=20gb
\#SBATCH --output=PRINSEQ\_single.%J.out
\#SBATCH --error=PRINSEQ\_single.%J.err
 
module load prinseq-lite/0.20
prinseq-lite.pl -fasta input\_reads.fasta -out\_format 1
PRINSEQ is a single-threaded program, and therefore both **\#SBATCH
--nodes** and **\#SBATCH --ntasks-per-node** are set to **1**.
 
**PRINSEQ script for paired-end fastq data**
{{< highlight bash >}}
$ prinseq-lite.pl [-fasta|-fastq] input_reads.[fasta|fastq] -out_format [1|2|3|4|5] [options]
{{< /highlight >}}
where **input_reads.[fasta|fastq]** is an input file of sequence data in fasta/fastq format, and **options** are additional parameters that can be found in the [PRINSEQ manual](http://prinseq.sourceforge.net/manual.html).
The output format (`-out_format`) can be **1** (fasta only), **2** (fasta and qual), **3** (fastq), **4** (fastq and input fasta), and **5** (fastq, fasta and qual).
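The mapping between `-out_format` codes and output types can be captured in a small lookup, sketched below. The function name `describe_out_format` is hypothetical; the codes and descriptions are the ones listed above.

```bash
#!/bin/sh
# Translate a PRINSEQ -out_format code into the file types it produces.
describe_out_format() {
    case "$1" in
        1) echo "fasta only" ;;
        2) echo "fasta and qual" ;;
        3) echo "fastq" ;;
        4) echo "fastq and input fasta" ;;
        5) echo "fastq, fasta and qual" ;;
        *) echo "unknown -out_format: $1" >&2; return 1 ;;
    esac
}

describe_out_format 3
```

A check like this can be run at the top of a submit script to reject an invalid code before the job spends time in the queue.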
A simple PRINSEQ SLURM script for single-end fasta data and fasta output format is shown below:
{{% panel header="`prinseq_single_end.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=PRINSEQ
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=PRINSEQ_single.%J.out
#SBATCH --error=PRINSEQ_single.%J.err
module load prinseq-lite/0.20
prinseq-lite.pl -fasta input_reads.fasta -out_format 1
{{< /highlight >}}
{{% /panel %}}
PRINSEQ is a single-threaded program, and therefore both `#SBATCH --nodes` and `#SBATCH --ntasks-per-node` are set to **1**.
\\
\\
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">PRINSEQ for paired-end fastq data</span>
The basic usage of PRINSEQ for paired-end data is:
{{< highlight bash >}}
$ prinseq-lite.pl [-fasta|-fastq] input_reads_pair_1.[fasta|fastq] [-fasta2|-fastq2] input_reads_pair_2.[fasta|fastq] -out_format [1|2|3|4|5] [options]
{{< /highlight >}}
where **input_reads_pair_1.[fasta|fastq]** and **input_reads_pair_2.[fasta|fastq]** are pair 1 and pair 2 of the input files of sequence data in fasta/fastq format, and **options** are additional parameters that can be found in the [PRINSEQ manual](http://prinseq.sourceforge.net/manual.html).
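The input flags for both files follow from the file format, so they can be derived from the extension. The sketch below prints the resulting command rather than running it; the helper name `prinseq_paired_cmd` and the reliance on `.fasta`/`.fastq` extensions are assumptions made for illustration.

```bash
#!/bin/sh
# Hypothetical helper: print a paired-end prinseq-lite.pl command.
#   $1 = pair 1 file, $2 = pair 2 file, $3 = -out_format code
prinseq_paired_cmd() {
    r1=$1
    r2=$2
    fmt=$3
    # Pick -fasta/-fasta2 or -fastq/-fastq2 from the extension of pair 1.
    case "$r1" in
        *.fastq) flag=fastq ;;
        *.fasta) flag=fasta ;;
        *) echo "unsupported extension: $r1" >&2; return 1 ;;
    esac
    echo "prinseq-lite.pl -$flag $r1 -${flag}2 $r2 -out_format $fmt"
}

prinseq_paired_cmd input_reads_pair_1.fastq input_reads_pair_2.fastq 3
```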
**General PRINSEQ Usage for Paired-End Data**
``` syntaxhighlighter-pre
prinseq-lite.pl [-fasta|-fastq] input_reads_pair_1.[fasta|fastq] [-fasta2|-fastq2] input_reads_pair_2.[fasta|fastq] -out_format [1|2|3|4|5] [options]
```
where **input\_reads\_pair\_1.\[fasta\|fastq\]**
and **input\_reads\_pair\_2.\[fasta\|fastq\]** are pair 1 and pair 2
of the input files of sequence data in fasta/fastq format,
and **options** are additional parameters that can be found in the
PRINSEQ
manual: <a href="http://prinseq.sourceforge.net/manual.html" class="external-link">http://prinseq.sourceforge.net/manual.html</a>.
The output format (**-out\_format**) can be **1** (fasta only),
**2** (fasta and qual), **3** (fastq), **4** (fastq and input fasta),
and **5** (fastq, fasta and qual).
A simple PRINSEQ SLURM script for paired-end fastq data and fastq output
format is shown below:
**prinseq\_paired\_end.submit**
\#!/bin/sh
\#SBATCH --job-name=PRINSEQ
\#SBATCH --nodes=1
\#SBATCH --ntasks-per-node=1
\#SBATCH --time=168:00:00
\#SBATCH --mem=20gb
\#SBATCH --output=PRINSEQ\_paired.%J.out
\#SBATCH --error=PRINSEQ\_paired.%J.err
 
module load prinseq-lite/0.20
prinseq-lite.pl -fastq input\_reads\_pair\_1.fastq -fastq2 input\_reads\_pair\_2.fastq -out\_format 3
PRINSEQ is a single-threaded program, and therefore both **\#SBATCH
--nodes** and **\#SBATCH --ntasks-per-node** are set to **1**.
The output format (`-out_format`) can be **1** (fasta only), **2** (fasta and qual), **3** (fastq), **4** (fastq and input fasta), and **5** (fastq, fasta and qual).
**PRINSEQ Output**
A simple PRINSEQ SLURM script for paired-end fastq data and fastq output format is shown below:
{{% panel header="`prinseq_paired_end.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=PRINSEQ
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=PRINSEQ_paired.%J.out
#SBATCH --error=PRINSEQ_paired.%J.err
module load prinseq-lite/0.20
prinseq-lite.pl -fastq input_reads_pair_1.fastq -fastq2 input_reads_pair_2.fastq -out_format 3
{{< /highlight >}}
{{% /panel %}}
PRINSEQ is a single-threaded program, and therefore both `#SBATCH --nodes` and `#SBATCH --ntasks-per-node` are set to **1**.
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">PRINSEQ Output</span>
PRINSEQ gives statistics about the input and filtered sequences, and also outputs files of single-end or paired-end sequences filtered by specified parameters.
\ No newline at end of file
+++
title = "Scythe"
description = "How to run Scythe on HCC resources"
weight = "10"
+++
<span id="title-text"> HCC-DOCS : Scythe </span>
================================================
Created by <span class="author"> Adam Caprez</span>, last modified by
<span class="editor"> Natasha Pavlovikj</span> on Dec 12, 2016
| Name | Version | Resource |
|--------|---------|----------|
| scythe | 0.991 | Tusker |
 
Scythe
(<a href="https://github.com/vsbuffalo/scythe" class="external-link">https://github.com/vsbuffalo/scythe</a>)
is a 3' end adapter trimmer that uses a Naive Bayesian approach to
classify contaminant substrings in sequence reads. 3' ends often include
poor quality bases which need to be removed prior to quality-based
trimming, mapping, assembly, and further analysis.
[Scythe](https://github.com/vsbuffalo/scythe) is a 3' end adapter trimmer that uses a Naive Bayesian approach to classify contaminant substrings in sequence reads. 3' ends often include poor quality bases which need to be removed prior to quality-based trimming, mapping, assembly, and further analysis.
The basic usage of Scythe is:
{{< highlight bash >}}
$ scythe -a adapter_file.fasta input_reads.fastq -o output_reads.fastq
{{< /highlight >}}
where **adapter_file.fasta** is a fasta input file of the adapter sequences that need to be removed from the 3' end of the sequence data, and **input_reads.fastq** is the input sequencing data in fastq format.
**General Scythe Usage**
``` syntaxhighlighter-pre
scythe -a adapter_file.fasta input_reads.fastq -o output_reads.fastq
```
where **adapter\_file.fasta** is a fasta input file of the adapter
sequences that need to be removed from the 3' end of the sequence data,
and **input\_reads.fastq** is the input sequencing data in fastq format.
The file **output\_reads.fastq** contains the sequencing reads with
removed adapters. If the adapter sequences are unknown, Scythe by itself
provides two adapter sequences that can be used with the **-a**
option: **illumina\_adapters.fa** and **truseq\_adapters.fasta**.
The file **output_reads.fastq** contains the sequencing reads with removed adapters. If the adapter sequences are unknown, Scythe by itself provides two adapter sequences that can be used with the **-a** option: **illumina_adapters.fa** and **truseq_adapters.fasta**.
More information about Scythe can be found by typing:
**Additional Scythe Options**
``` syntaxhighlighter-pre
[<username>@login.tusker ~]$ scythe --help
```
A simple Scythe script that uses the **illumina\_adapters.fa** file and
**input\_reads.fastq** on Tusker is shown below:
**scythe.submit**
\#!/bin/sh
\#SBATCH --job-name=Scythe
\#SBATCH --nodes=1
\#SBATCH --ntasks-per-node=1
\#SBATCH --time=168:00:00
\#SBATCH --mem=20gb
\#SBATCH --output=Scythe.%J.out
\#SBATCH --error=Scythe.%J.err
 
module load scythe/0.991
scythe -a $SCYTHE\_HOME/illumina\_adapters.fa input\_reads.fastq -o output\_reads.fastq
Scythe is a single-threaded program, and therefore both **\#SBATCH
--nodes** and **\#SBATCH --ntasks-per-node** are set to **1**. The two
adapter files provided by Scythe are stored in **$SCYTHE\_HOME**.
Hence, to access the Illumina adapter file
use **$SCYTHE\_HOME/illumina\_adapters.fa**, and to access the TruSeq
file use **$SCYTHE\_HOME/truseq\_adapters.fasta**.
**Scythe Output**
{{< highlight bash >}}
$ scythe --help
{{< /highlight >}}
\\
A simple Scythe script that uses the `illumina_adapters.fa` file and `input_reads.fastq` on Tusker is shown below:
{{% panel header="`scythe.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=Scythe
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=Scythe.%J.out
#SBATCH --error=Scythe.%J.err
module load scythe/0.991
scythe -a ${SCYTHE_HOME}/illumina_adapters.fa input_reads.fastq -o output_reads.fastq
{{< /highlight >}}
{{% /panel %}}
Scythe is a single-threaded program, and therefore both `#SBATCH --nodes` and `#SBATCH --ntasks-per-node` are set to **1**.
The two adapter files provided by Scythe are stored in **$SCYTHE_HOME**. Hence, to access the Illumina adapter file use `$SCYTHE_HOME/illumina_adapters.fa`, and to access the TruSeq file use `$SCYTHE_HOME/truseq_adapters.fasta`.
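A submit script can resolve the adapter file and confirm it is readable before starting the trimming step. The sketch below is one way to do that; the helper name `scythe_adapter_path`, the short names `illumina` and `truseq`, and the placeholder fallback path are assumptions made for illustration, while the two file names are the ones given above.

```bash
#!/bin/sh
# Resolve one of the two bundled adapter files under $SCYTHE_HOME.
scythe_adapter_path() {
    case "$1" in
        illumina) echo "${SCYTHE_HOME}/illumina_adapters.fa" ;;
        truseq)   echo "${SCYTHE_HOME}/truseq_adapters.fasta" ;;
        *) echo "unknown adapter set: $1" >&2; return 1 ;;
    esac
}

# Placeholder value so the sketch also runs where the module is not loaded.
: "${SCYTHE_HOME:=/path/to/scythe}"

adapters=$(scythe_adapter_path illumina)
if [ -r "$adapters" ]; then
    echo "using adapter file: $adapters"
else
    echo "adapter file not found: $adapters (is the scythe module loaded?)" >&2
fi
```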
\\
\\
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">Scythe Output</span>
Scythe returns a fastq file of reads with the adapter sequences removed.
\\
\\
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">Useful Information</span>
 
**Useful Information**
To test the performance of Scythe (scythe/0.991) on Tusker, we
used three pairs of paired-end input fastq files: **small\_1.fastq** and
**small\_2.fastq**, **medium\_1.fastq** and **medium\_2.fastq**, and
**large\_1.fastq** and **large\_2.fastq**. Some statistics about the input
files, and the time and memory Scythe required, are shown in
the table below:
<table style="width:100%;">
<colgroup>
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
</colgroup>
<thead>
<tr class="header">
<th> </th>
<th><p><strong>total # of sequences</strong></p></th>
<th><p><strong>total # of bases</strong></p></th>
<th><p><strong>total size in MB</strong></p></th>
<th><p><strong>required time</strong></p></th>
<th><p><strong>required memory</strong></p></th>
<th># of used CPUs</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><p><strong>small_1.fastq</strong></p></td>
<td><p>50,121</p></td>
<td><p>2,506,050</p></td>
<td><p>8.010 MB</p></td>