+++
title = "Clustal Omega"
description =  "How to run Clustal Omega on HCC resources"
weight = "10"
+++

[Clustal Omega](http://www.clustal.org/omega/) is a general-purpose multiple sequence alignment (MSA) tool used mainly with protein sequences, as well as DNA and RNA sequences. Clustal Omega is a fast and scalable aligner that can align datasets of hundreds of thousands of sequences in a reasonable time.

The general usage of Clustal Omega is:
{{< highlight bash >}}
$ clustalo -i input_file.fasta -o output_file.fasta [options]
{{< /highlight >}}
where **input_file.fasta** is the multiple sequence input file in `fasta` format, and **output_file.fasta** is the multiple sequence alignment output file in `fasta` format.

\\
Clustal Omega accepts 3 types of sequence input files:

- sequence file with aligned/unaligned sequences
- multiple alignment in a file/profile of aligned sequences
- Hidden Markov Model (HMM) 

These input files must contain at least 2 sequences and must be in one of the following MSA file formats: `a2m`, `fa[sta]`, `clu[stal]`, `msf`, `phy[lip]`, `selex`, `st[ockholm]`, `vie[nna]`. If no output format is specified, the generated output file is in `fasta` format.
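
The two requirements above (at least 2 sequences, a supported format) can be checked before running the aligner. The following is a minimal sketch, not part of the original workflow: the file names `example_input.fasta` and `example_output.clu` are hypothetical, and the tiny input is written only to keep the example self-contained. It counts FASTA header lines and, if `clustalo` is available, states the input and output formats explicitly with `--infmt` and `--outfmt` instead of relying on the defaults.

```bash
#!/bin/sh
# Hypothetical file names, used only for illustration.
input=example_input.fasta
output=example_output.clu

# Write a tiny two-sequence protein FASTA file so the sketch is self-contained.
printf '>seq1\nMKVLAAGIV\n>seq2\nMKILAAGLV\n' > "$input"

# Count FASTA records: every record starts with a ">" header line.
nseq=$(grep -c '^>' "$input")

if [ "$nseq" -lt 2 ]; then
    echo "Error: $input has $nseq sequence(s); at least 2 are needed" >&2
elif command -v clustalo >/dev/null 2>&1; then
    # Name the input and output formats explicitly (fasta in, clustal out);
    # --force allows overwriting an existing output file.
    clustalo -i "$input" --infmt=fa -o "$output" --outfmt=clu --force
else
    echo "clustalo not found; load the clustal-omega module first" >&2
fi
```

Checking the sequence count up front is a cheap way to catch an empty or truncated input file before a job is submitted to the queue.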

\\
More Clustal Omega options can be found by typing:
{{< highlight bash >}}
$ clustalo -h
{{< /highlight >}}

\\
An example SLURM submit script for running Clustal Omega on Tusker with the input file `input_reads.fasta`, `8 threads`, and `10GB memory` is shown below:
{{% panel header="`clustal_omega.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=Clustal_Omega
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=10:00:00
#SBATCH --mem=10gb
#SBATCH --output=ClustalOmega.%J.out
#SBATCH --error=ClustalOmega.%J.err

module load clustal-omega/1.2

clustalo -i input_reads.fasta -o output_msa.sto --outfmt=st --threads=$SLURM_NTASKS_PER_NODE
{{< /highlight >}}
{{% /panel %}}

The output file `output_msa.sto` contains the resulting multiple sequence alignment in Stockholm format (**--outfmt=st**).

Moreover, if you replace the command above with:
{{< highlight bash >}}
$ clustalo -i input_reads.sto --dealign -v
{{< /highlight >}}
Clustal Omega will read the input file in Stockholm format, de-align the sequences, and then re-align them, printing a progress report in the meantime (**-v**). Because neither an output file nor an output format is specified, the alignment is printed to standard output in the default `fasta` format.

\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">Clustal Omega Output</span>

By default, Clustal Omega produces one alignment file in the specified output format. Additional intermediate outputs can be generated using specific Clustal Omega options, such as **--distmat-out=<file>** (*pairwise distance matrix output file*) and **--guidetree-out=<file>** (*guide tree output file*).
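
As an illustration of the options above, the sketch below requests both intermediate files alongside the alignment. The output file names (`output_msa.fasta`, `distances.mat`, `guide_tree.dnd`) are hypothetical. The `--full` option is added on the assumption that the pairwise distance matrix is only written when the full distance matrix is computed; check `clustalo -h` for the installed version before relying on this.

```bash
#!/bin/sh
input=input_reads.fasta

# Hypothetical output file names, chosen only for this example.
msa=output_msa.fasta
distmat=distances.mat
tree=guide_tree.dnd

# Options that request the intermediate outputs.
# --full is an assumption: it asks for the full distance matrix so that
# there is a matrix to write to the --distmat-out file.
opts="--full --distmat-out=$distmat --guidetree-out=$tree"

if command -v clustalo >/dev/null 2>&1 && [ -f "$input" ]; then
    # $opts is intentionally unquoted so it splits into separate options.
    clustalo -i "$input" -o "$msa" $opts --force
else
    # Dry run: show the command that would be executed.
    echo "Would run: clustalo -i $input -o $msa $opts --force"
fi
```

Computing the full distance matrix can be slow and memory-hungry for large inputs, so it may be worth requesting these files only for smaller datasets.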

\\
<span style="color: rgb(0,0,0);font-size: 20.0px;line-height: 1.5;">Useful Information</span>

To test the performance of Clustal Omega on Tusker, we used three DNA and protein input fasta files: `data_1.fasta`, `data_2.fasta`, and `data_3.fasta`. Some statistics about the input files and the time and memory resources used by Clustal Omega on Tusker are shown in the table below:
{{< readfile file="/static/html/clustal_omega.html" >}}