Commit f0671b9d authored by Adam Caprez's avatar Adam Caprez

Merge branch 'bioinformatics-part4' into 'master'

Add bio pages part 4

See merge request !29
parents ba5c0711 4a59f001
Showing
with 952 additions and 10932 deletions
+++ +++
title = "Bioinformatics Tools" title = "Bioinformatics Tools"
description = "How to use various bioinformatics tools on HCC machines"
weight = "52"
+++ +++
<span style="color: rgb(0,0,0);">The following is a categorized list of The following is a categorized list of bioinformatics tools available on HCC. Each page contains summary of the tool, information about the HCC resources that have the specific tool, links to user documentation, as well as example SLURM submit scripts.
bioinformatics tools available on HCC. Each page contains summary of the
tool, information about the HCC resources that have the specific
tool, links to user documentation, as well as example SLURM submit
scripts. More detailed information about submitting SLURM jobs and
checking job status on HCC can be
found [here](Submitting-Jobs_332222.html).</span>
<span style="color: rgb(0,0,0);"> </span> More detailed information about submitting SLURM jobs and checking job status on HCC can be found [here](../../submitting_jobs)
{{% children %}}
+++ +++
title = "Alignment Tools" title = "Alignment Tools"
description = "How to use various alignment tools on HCC machines"
weight = "52"
+++ +++
1. [HCC-DOCS](index.html) {{% children %}}
2. [HCC-DOCS Home](HCC-DOCS-Home_327685.html) \ No newline at end of file
3. [HCC Documentation](HCC-Documentation_332651.html)
4. [Running Applications](Running-Applications_7471153.html)
5. [Bioinformatics Tools](Bioinformatics-Tools_8193279.html)
<span id="title-text"> HCC-DOCS : Alignment Tools </span>
=========================================================
Created by <span class="author"> Adam Caprez</span> on Sep 04, 2014
1. [HCC-DOCS](index.html) +++
2. [HCC-DOCS Home](HCC-DOCS-Home_327685.html) title = "BLAT"
3. [HCC Documentation](HCC-Documentation_332651.html) description = "How to run BLAT on HCC resources"
4. [Running Applications](Running-Applications_7471153.html) weight = "10"
5. [Bioinformatics Tools](Bioinformatics-Tools_8193279.html) +++
6. [Alignment Tools](Alignment-Tools_8193288.html)
<span id="title-text"> HCC-DOCS : BLAT </span>
==============================================
Created by <span class="author"> Adam Caprez</span>, last modified by BLAT is a pairwise alignment tool similar to BLAST. It is more accurate and about 500 times faster than the existing tools for mRNA/DNA alignments and it is about 50 times faster with protein/protein alignments. BLAT accepts short and long query and database sequences as input files.
<span class="editor"> Natasha Pavlovikj</span> on Dec 12, 2016
| Name | Version | Resource |
|------|---------|----------|
| blat | 35x1 | Tusker |
| | | |
|------|------|-------|
| blat | 35x1 | Crane |
<span style="line-height: 1.4285715;">
</span>
<span style="line-height: 1.4285715;">BLAT is a pairwise alignment tool
similar to BLAST. It is more accurate and about 500 times faster than
the existing tools for mRNA/DNA alignments and it is about 50 times
faster with protein/protein alignments. BLAT accepts short and long
query and database sequences as input files.</span>
The basic usage of BLAT is: The basic usage of BLAT is:
{{< highlight bash >}}
**General BLAT Usage** $ blat database query output_alignment.txt [options]
{{< /highlight >}}
``` syntaxhighlighter-pre where **database** is the name of the database used for the alignment, **query** is the name of the input file of sequence data in `fasta/nib/2bit` format, and **output_alignment.txt** is the output alignment file.
blat database query output_alignment.txt [options]
``` Additional parameters for BLAT alignment can be found in the [manual](http://genome.ucsc.edu/FAQ/FAQblat), or by using:
{{< highlight bash >}}
where **database** is the name of the database used for the alignment, $ blat
**query** is the name of the input file of sequence data in {{< /highlight >}}
fasta/nib/2bit format, and **output\_alignment.txt** is the output
alignment file. Additional parameters for BLAT alignment can be found in \\
the Running BLAT on Tusker with query file `input_reads.fasta` and database `db.fa` is shown below:
manual: <a href="http://genome.ucsc.edu/goldenPath/help/blatSpec.html" class="external-link">http://genome.ucsc.edu/goldenPath/help/blatSpec.html</a>, {{% panel header="`blat_alignment.submit`"%}}
or by using {{< highlight bash >}}
#!/bin/sh
**Additional BLAT Options** #SBATCH --job-name=Blat
#SBATCH --nodes=1
``` syntaxhighlighter-pre #SBATCH --ntasks-per-node=1
[<username>@login.tusker~]$ blat #SBATCH --time=168:00:00
``` #SBATCH --mem=50gb
#SBATCH --output=Blat.%J.out
Running BLAT on Tusker with query file **input\_reads.fasta** and #SBATCH --error=Blat.%J.err
database **db.fa** is shown below:
module load blat/35x1
**blat\_alignment.submit**
blat db.fa input_reads.fasta output_alignment.txt
\#!/bin/sh {{< /highlight >}}
\#SBATCH --job-name=Blat {{% /panel %}}
\#SBATCH --nodes=1
\#SBATCH --ntasks-per-node=1 Although BLAT is a single-threaded program (`#SBATCH --nodes=1`, `#SBATCH --ntasks-per-node=1`), it is still much faster than the other alignment tools.
\#SBATCH --time=168:00:00
\#SBATCH --mem=50gb \\
\#SBATCH --output=Blat.%J.out <span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">BLAT Output</span>
\#SBATCH --error=Blat.%J.err
BLAT output is a list containing the following information:
- the score of the alignment
| | - the region of query sequence that matches the database sequence
|-----------------------| - the size of the query sequence
| module load blat/35x1 | - the level of identity as a percentage of the alignment
- the chromosome and position that the query sequence maps to
blat db.fa input\_reads.fasta output\_alignment.txt \ No newline at end of file
Although BLAT is a single threaded program (**\#SBATCH --nodes=1**,
**\#SBATCH --ntasks-per-node=1**) it is still much faster than the other
alignment tools.
**BLAT Output**
BLAT output is a list containing the following information: *the score
of the alignment*, *the region of query sequence that matches the
database sequence*, *the size of the query sequence*, *the level of
identity as a percentage of the alignment* and *the chromosome and
position that the query sequence maps to*.
Attachments:
------------
<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
[cb\_blat\_module.xsl](attachments/8193292/8127546.xsl)
(application/octet-stream)
<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
[crane\_blat\_version.xsl](attachments/8193292/8127547.xsl)
(application/octet-stream)
<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
[crane\_modules.xml](attachments/8193292/8127548.xml)
(application/octet-stream)
<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
[tusker\_blat\_version.xsl](attachments/8193292/8127549.xsl)
(application/octet-stream)
<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
[tusker\_modules.xml](attachments/8193292/8127550.xml)
(application/octet-stream)
1. [HCC-DOCS](index.html) +++
2. [HCC-DOCS Home](HCC-DOCS-Home_327685.html) title = "Clustal Omega"
3. [HCC Documentation](HCC-Documentation_332651.html) description = "How to run Clustal Omega on HCC resources"
4. [Running Applications](Running-Applications_7471153.html) weight = "10"
5. [Bioinformatics Tools](Bioinformatics-Tools_8193279.html) +++
6. [Alignment Tools](Alignment-Tools_8193288.html)
<span id="title-text"> HCC-DOCS : Clustal Omega </span> [Clustal Omega](http://www.clustal.org/omega/) is a general-purpose multiple sequence alignment (MSA) tool used mainly with protein, as well as DNA and RNA sequences. Clustal Omega is a fast and scalable aligner that can align datasets of hundreds of thousands of sequences in reasonable time.
=======================================================
Created by <span class="author"> Adam Caprez</span>, last modified by
<span class="editor"> Natasha Pavlovikj</span> on Dec 12, 2016
| Name | Version | Resource |
|---------------|---------|----------|
| clustal-omega | 1.2 | Tusker |
| | | |
|---------------|-----|-------|
| clustal-omega | 1.2 | Crane |
Clustal Omega
(<a href="http://www.clustal.org/omega/" class="external-link">http://www.clustal.org/omega/</a>)
is a general purpose multiple sequence alignment (MSA) tool used mainly
with protein, as well as DNA and RNA sequences. Clustal Omega is fast
and scalable aligner that can align datasets of hundreds of thousands of
sequences in reasonable time.
The general usage of Clustal Omega is: The general usage of Clustal Omega is:
{{< highlight bash >}}
$ clustalo -i input_file.fasta -o output_file.fasta [options]
{{< /highlight >}}
where **input_file.fasta** is the multiple sequence input file in `fasta` format, and **output_file.fasta** is the multiple sequence alignment output file in `fasta` format.
**General Clustal Omega Usage** \\
``` syntaxhighlighter-pre
clustalo -i input_file.fasta -o output_file.fasta [options]
```
where **input\_file.fasta** is the multiple sequence input file in
*fasta* format, and **output\_file.fasta** is the multiple sequence
alignment output file in *fasta* format.
Clustal Omega accepts 3 types of sequence input files: Clustal Omega accepts 3 types of sequence input files:
- sequence file with aligned/unaligned sequences - sequence file with aligned/unaligned sequences
<!-- -->
- multiple alignment in a file/profile of aligned sequences - multiple alignment in a file/profile of aligned sequences
<!-- -->
- Hidden Markov Model (HMM) - Hidden Markov Model (HMM)
These input files must contain at least 2 sequences and must be in one These input files must contain at least 2 sequences and must be in one of the following MSA file formats: `a2m`, `fa[sta]`, `clu[stal]`, `msf`, `phy[lip]`, `selex`, `st[ockholm]`, `vie[nna]`. Moreover, if not specified, the generated output file is in `fasta` format.
of the following MSA file formats: **a2m**, **fa\[sta\]**,
**clu\[stal\]**, **msf**, **phy\[lip\]**, **selex**, **st\[ockholm\]**,
**vie\[nna\]**. Moreover, if not specified, the generated output file is
in *fasta* format.
\\
More Clustal Omega options can be found by typing: More Clustal Omega options can be found by typing:
{{< highlight bash >}}
**Additional Clustal Omega Options** $ clustalo -h
{{< /highlight >}}
``` syntaxhighlighter-pre
[<username>@login.tusker~]$ clustalo -h \\
``` Running Clustal Omega on Tusker with input file `input_reads.fasta` with `8 threads` and `10GB memory` is shown below:
{{% panel header="`clustal_omega.submit`"%}}
{{< highlight bash >}}
Running Clustal Omega on Tusker with input #!/bin/sh
file **input\_reads.fasta** with **8 threads** and **10GB memory** is #SBATCH --job-name=Clustal_Omega
shown below: #SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
**clustal\_omega.submit** #SBATCH --time=10:00:00
#SBATCH --mem=10gb
\#!/bin/sh #SBATCH --output=ClustalOmega.%J.out
\#SBATCH --job-name=Clustal\_Omega #SBATCH --error=ClustalOmega.%J.err
\#SBATCH --nodes=1
\#SBATCH --ntasks-per-node=8 module load clustal-omega/1.2
\#SBATCH --time=10:00:00
\#SBATCH --mem=10gb clustalo -i input_reads.fasta -o output_msa.sto --outfmt=st --threads=$SLURM_NTASKS_PER_NODE
\#SBATCH --output=ClustalOmega.%J.out {{< /highlight >}}
\#SBATCH --error=ClustalOmega.%J.err {{% /panel %}}
The output file `output_msa.sto` contains the resulting multiple sequence alignments in Stockholm format (**--outfmt=st**).
| |
|-------------------------------|
| module load clustal-omega/1.2 |
clustalo -i input\_reads.fasta -o output\_msa.sto --outfmt=st
--threads=$SLURM\_NTASKS\_PER\_NODE
The output file **output\_msa.sto** contains the resulting multiple
sequence alignments in Stockholm format (**--outfmt=st**).
Moreover, if you change the command above with: Moreover, if you change the command above with:
{{< highlight bash >}}
$ clustalo -i input_reads.sto --dealign -v
{{< /highlight >}}
Clustal Omega will read the input file in Stockholm format, de-align the sequences, and then re-align them, printing a progress report in the meantime (**-v**). Because no output format is specified, the output will be in the default `fasta` format.
**Clustal Omega with De-align Option** \\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">Clustal Omega Output</span>
``` syntaxhighlighter-pre
clustalo -i input_reads.sto --dealign -v
```
Clustal Omega will read the input file in Stockholm format, de-align the
sequences, and then re-align them, printing progress report in meanwhile
(**-v**). Because it is not specified, the output will be in the default
**fasta** format.
**Clustal Omega Output**
The basic Clustal Omega output produces one alignment file in the
specified output format. More intermediate outputs can be generated
using specific Clustal Omega options, such
as: **--distmat-out=&lt;file&gt;** (*pairwise distance matrix output
file*) and **--guidetree-out=&lt;file&gt;** (*guide tree output file*).
**
Useful Information**
In order to test the Clustal Omega performance on Tusker, we used three
DNA and protein input fasta files: **data\_1. fasta, data\_2. fasta,
data\_3.fasta**. Some statistics about the input files and the time and
memory resources required for Clustal Omega are shown on the table
below:
<table style="width:100%;">
<colgroup>
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
</colgroup>
<thead>
<tr class="header">
<th> </th>
<th><p><strong>total # of sequences</strong></p></th>
<th><p><strong>average sequence length</strong></p></th>
<th><p><strong>total size in MB</strong></p></th>
<th><p><strong>Clustal Omega required time</strong></p></th>
<th><p><strong>Clustal Omega required memory</strong></p></th>
<th># of used CPUs</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><p><strong>data_1.fasta</strong></p></td>
<td><p>1,200</p></td>
<td><p>510.17</p></td>
<td><p>641 KB</p></td>
<td><p>~ 5 minutes</p></td>
<td><span>~ 65 MB</span></td>
<td>8</td>
</tr>
<tr class="even">
<td><p><strong>data_2.fasta</strong></p></td>
<td><p>5,715</p></td>
<td><p>174.20</p></td>
<td><p>1,100 KB</p></td>
<td>~ 5 minutes</td>
<td><p>~ 140 MB</p></td>
<td><p>8</p></td>
</tr>
<tr class="odd">
<td><p><strong>data_3.fasta</strong></p></td>
<td><p>93,675</p></td>
<td><p>94.29</p></td>
<td><p>11,000 KB</p></td>
<td><p>~ 30 minutes</p></td>
<td><p>~ 2 GB</p></td>
<td><p>8</p></td>
</tr>
</tbody>
</table>
Attachments:
------------
<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" /> The basic Clustal Omega output produces one alignment file in the specified output format. More intermediate outputs can be generated using specific Clustal Omega options, such as: **--distmat-out=<file>** (*pairwise distance matrix output file*) and **--guidetree-out=<file>** (*guide tree output file*).
[crane\_clustal\_omega\_version.xsl](attachments/9470379/9863812.xsl)
(application/octet-stream)
<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
[cb\_clustal\_omega\_module.xsl](attachments/9470379/9863813.xsl)
(application/octet-stream)
<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
[tusker\_clustal\_omega\_version.xsl](attachments/9470379/9863814.xsl)
(application/octet-stream)
<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
[crane\_modules.xml](attachments/9470379/9863815.xml)
(application/octet-stream)
<img src="assets/images/icons/bullet_blue.gif" width="8" height="8" />
[tusker\_modules.xml](attachments/9470379/9863816.xml)
(application/octet-stream)
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">Useful Information</span>
In order to test the Clustal Omega performance on Tusker, we used three DNA and protein input fasta files: `data_1.fasta`, `data_2.fasta`, and `data_3.fasta`. Some statistics about the input files and the time and memory resources used by Clustal Omega on Tusker are shown in the table below:
{{< readfile file="/static/html/clustal_omega.html" >}}
\ No newline at end of file
+++ +++
title = "Data Manipulation Tools" title = "Data Manipulation Tools"
description = "How to use data manipulation tools on HCC machines"
weight = "52"
+++ +++
1. [HCC-DOCS](index.html) {{% children %}}
2. [HCC-DOCS Home](HCC-DOCS-Home_327685.html) \ No newline at end of file
3. [HCC Documentation](HCC-Documentation_332651.html)
4. [Running Applications](Running-Applications_7471153.html)
5. [Bioinformatics Tools](Bioinformatics-Tools_8193279.html)
<span id="title-text"> HCC-DOCS : Data Manipulation Tools </span>
=================================================================
Created by <span class="author"> Adam Caprez</span> on Sep 04, 2014
+++
title = "Running SAMtools Commands"
description = "How to run SAMtools commands on HCC resources"
weight = "10"
+++
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">SAMtools View:</span>
One of the most frequently used SAMtools commands is **view**. The basic usage of **samtools view** is:
{{< highlight bash >}}
$ samtools view input_alignments.[bam|sam] [options] -o output_alignments.[sam|bam]
{{< /highlight >}}
where **input_alignments.[bam|sam]** is the input file with the alignments in BAM/SAM format, and **output_alignments.[sam|bam]** is the output file, converted into SAM or BAM format respectively.
Running **samtools view** on Tusker with `8 CPUs`, input file `input_alignments.sam` with available header (**-S**), output in BAM format (**-b**) and output file `output_alignments.bam` is shown below:
{{% panel header="`samtools_view.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=SAMtools_View
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=SAMtools.%J.out
#SBATCH --error=SAMtools.%J.err
module load samtools/1.9
samtools view -bS -@ $SLURM_NTASKS_PER_NODE input_alignments.sam -o output_alignments.bam
{{< /highlight >}}
{{% /panel %}}
The most intensive SAMtools commands (**samtools view**, **samtools sort**) are multi-threaded, and therefore using the SAMtools option **-@ <number_of_CPUs>** is recommended.
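When the same command is also tested outside of a SLURM job, `$SLURM_NTASKS_PER_NODE` is not set and **-@** would receive an empty value. The following is a minimal sketch of a fallback, assuming a default of one thread is acceptable; the helper name `pick_threads` is hypothetical and not part of SAMtools or SLURM. The command is only printed, so the sketch can be tried without loading the module.

```bash
#!/bin/bash
# Hypothetical helper: use the SLURM task count when it is set,
# otherwise fall back to a single thread.
pick_threads() {
    echo "${SLURM_NTASKS_PER_NODE:-1}"
}

threads="$(pick_threads)"

# Print the command instead of running it, so this works without samtools.
echo "samtools view -bS -@ ${threads} input_alignments.sam -o output_alignments.bam"
```

Inside a submit script, the `echo` would be dropped so that the command runs with the selected thread count.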
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">SAMtools Sort:</span>
Sorting BAM files is recommended for further analysis of these files. By default, the alignments in the BAM file are sorted by their position in the reference; the option **-n** sorts them by read name instead. An example of using `4 CPUs` to sort the input file `input_alignments.bam` by read name and write the result to `output_alignments_sorted.bam` (**-o**) follows:
{{< highlight bash >}}
$ samtools sort -n -@ 4 -o output_alignments_sorted.bam input_alignments.bam
{{< /highlight >}}
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">SAMtools Index:</span>
The **samtools index** command creates a new index file that allows fast look-up of the data in a BAM file sorted by position.
{{< highlight bash >}}
$ samtools index input_alignments_sorted.bam output_index.bai
{{< /highlight >}}
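The view, sort and index steps are often run one after another on the same file. The sketch below chains them under a few assumptions: the function names `run` and `process_alignments` and the `DRY_RUN` variable are hypothetical, the input is `<prefix>.sam`, and the sort is by position (no **-n**) so that the result can be indexed. With `DRY_RUN=1`, the default here, the commands are only printed.

```bash
#!/bin/bash
# DRY_RUN=1 prints the commands; DRY_RUN=0 executes them (requires samtools).
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print or execute a command depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Convert <prefix>.sam to BAM, sort it by position, and index the result.
# Each step runs only if the previous one succeeded.
process_alignments() {
    prefix="$1"
    threads="${2:-1}"
    run samtools view -b -@ "$threads" -o "${prefix}.bam" "${prefix}.sam" &&
    run samtools sort -@ "$threads" -o "${prefix}_sorted.bam" "${prefix}.bam" &&
    run samtools index "${prefix}_sorted.bam"
}

process_alignments input_alignments 4
```

Setting `DRY_RUN=0` in a submit script, after `module load samtools/1.9`, would run the three steps for real.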
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">SAMtools Idxstats:</span>
The **samtools idxstats** command prints stats for the BAM index file. The output is TAB delimited with each line consisting of *reference sequence name*, *sequence length*, *number of mapped reads* and *number of unmapped reads*.
{{< highlight bash >}}
$ samtools idxstats input_alignments_sorted.bam
{{< /highlight >}}
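Because the output is TAB delimited, it can be summarized with standard tools. The following sketch sums the mapped and unmapped read counts (columns 3 and 4) with `awk`; the function name `summarize_idxstats` is hypothetical, and the input lines are made-up sample data standing in for real **samtools idxstats** output.

```bash
#!/bin/bash
# Sum columns 3 (mapped reads) and 4 (unmapped reads) of
# TAB-delimited idxstats-style lines read from standard input.
summarize_idxstats() {
    awk -F'\t' '{ mapped += $3; unmapped += $4 }
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }'
}

# Made-up sample: name, length, mapped reads, unmapped reads.
printf 'chr1\t1000\t40\t2\nchr2\t500\t10\t0\n*\t0\t0\t5\n' | summarize_idxstats
```

With real data, the function would be used as `samtools idxstats input_alignments_sorted.bam | summarize_idxstats`.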
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">SAMtools Merge:</span>
The **samtools merge** command merges multiple sorted alignments into one output file.
{{< highlight bash >}}
$ samtools merge output_alignments_merge.bam input_alignments_sorted_1.bam input_alignments_sorted_2.bam
{{< /highlight >}}
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">SAMtools Faidx:</span>
The **samtools faidx** command indexes the reference sequence in fasta format, or extracts a subsequence from an indexed reference sequence.
{{< highlight bash >}}
$ samtools faidx input_reference.fasta
{{< /highlight >}}
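To give a rough idea of what such an index records, the sketch below computes the name and length of every sequence in a fasta file, which corresponds to the first two columns of the `.fai` file written by **samtools faidx**. It is an illustration with `awk` only, not a replacement for the command; the function name `fasta_lengths` is hypothetical and the input is made-up sample data.

```bash
#!/bin/bash
# Print "name<TAB>length" for every sequence of a fasta stream on stdin.
fasta_lengths() {
    awk '/^>/ { if (name != "") printf "%s\t%d\n", name, len
                name = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") printf "%s\t%d\n", name, len }'
}

# Made-up two-sequence fasta input.
printf '>seq1 description\nACGT\nAC\n>seq2\nGGG\n' | fasta_lengths
```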
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">SAMtools Mpileup:</span>
The **samtools mpileup** command generates a file in `bcf` or `pileup` format for one or multiple BAM files. For each genomic coordinate, the overlapping read bases and indels at that position in the input BAM file are printed. Without additional options, the output is in `pileup` format:
{{< highlight bash >}}
$ samtools mpileup input_alignments_sorted.bam > output_alignments.pileup
{{< /highlight >}}
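In the text `pileup` format, every line describes one position, and the fourth TAB-delimited column holds the number of reads covering it. As a minimal sketch under that assumption, the average depth over the reported positions can be computed with `awk`; the function name `mean_depth` is hypothetical and the input lines are made-up sample data.

```bash
#!/bin/bash
# Average of column 4 (read count) over all pileup lines on stdin.
mean_depth() {
    awk -F'\t' '{ total += $4; n++ }
        END { if (n > 0) printf "%.2f\n", total / n; else print "0.00" }'
}

# Made-up sample: chromosome, position, reference base, depth, bases, qualities.
printf 'chr1\t1\tN\t3\tAAA\tIII\nchr1\t2\tN\t5\tCCCCC\tIIIII\n' | mean_depth
```

Only positions that appear in the pileup are counted, so the value is an average over the reported positions rather than over the whole reference.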
\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">SAMtools Tview:</span>
The **samtools tview** command starts an interactive text alignment viewer that can be used to visualize how reads are aligned to specific regions of the reference genome.
{{< highlight bash >}}
$ samtools tview input_alignments_sorted.bam
{{< /highlight >}}
\ No newline at end of file
...@@ -35,7 +35,7 @@ A simple SLURM script to run Oases on the Velvet output stored in `output_direct ...@@ -35,7 +35,7 @@ A simple SLURM script to run Oases on the Velvet output stored in `output_direct
#SBATCH --output=Oases.%J.out #SBATCH --output=Oases.%J.out
#SBATCH --error=Oases.%J.err #SBATCH --error=Oases.%J.err
module load oases/0.2.8 module load oases/0.2
oases output_directory/ -min_trans_lgth 200 oases output_directory/ -min_trans_lgth 200
{{< /highlight >}} {{< /highlight >}}
......