+++
title = "Clustal Omega"
description =  "How to run Clustal Omega on HCC resources"
weight = "10"
+++

[Clustal Omega](http://www.clustal.org/omega/) is a general-purpose multiple sequence alignment (MSA) tool used mainly with protein sequences, as well as DNA and RNA sequences. Clustal Omega is a fast and scalable aligner that can align datasets of hundreds of thousands of sequences in a reasonable time.

The general usage of Clustal Omega is:
{{< highlight bash >}}
$ clustalo -i input_file.fasta -o output_file.fasta [options]
{{< /highlight >}}
where **input_file.fasta** is the multiple sequence input file in `fasta` format, and **output_file.fasta** is the multiple sequence alignment output file in `fasta` format.

Clustal Omega accepts 3 types of sequence input files:

- sequence file with aligned/unaligned sequences
- multiple alignment in a file/profile of aligned sequences
- Hidden Markov Model (HMM) 
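
As a rough sketch of how each of the three input types might be passed on the command line, the wrappers below use the `-i`, `--profile1`, and `--hmm-in` options listed by `clustalo -h`. The function names and file arguments are hypothetical placeholders, not part of Clustal Omega.

```bash
#!/bin/bash
# Hypothetical wrapper functions; only the clustalo options are real.
# Arguments are file names chosen by the caller.

# 1. Sequence file ($1) with aligned or unaligned sequences; output goes to $2.
align_sequences() {
    clustalo -i "$1" -o "$2"
}

# 2. Sequences in $1 combined with an existing alignment (profile) in $2;
#    output goes to $3.
align_to_profile() {
    clustalo -i "$1" --profile1="$2" -o "$3"
}

# 3. Sequences in $1 aligned with an HMM in $2 as additional input;
#    output goes to $3.
align_with_hmm() {
    clustalo -i "$1" --hmm-in="$2" -o "$3"
}
```

For instance, `align_to_profile new_seqs.fasta existing_msa.sto combined.fasta` would run `clustalo` with the new sequences as `-i` and the existing alignment as `--profile1`.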

These input files must contain at least 2 sequences and must be in one of the following MSA file formats: `a2m`, `fa[sta]`, `clu[stal]`, `msf`, `phy[lip]`, `selex`, `st[ockholm]`, `vie[nna]`. If no output format is specified, the generated output file is written in `fasta` format.
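
Because the input must contain at least 2 sequences, it can be worth checking a file before submitting a job. The following is a minimal sketch that assumes the input is in `fasta` format, where every record starts with a `>` header line; the helper names are hypothetical and not part of Clustal Omega.

```bash
#!/bin/bash
# Hypothetical helper: count FASTA records by counting header lines,
# i.e. lines that start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical pre-check: succeeds only if the file holds at least 2 records.
has_enough_sequences() {
    [ "$(count_fasta_records "$1")" -ge 2 ]
}
```

A job script could then guard the alignment step with `has_enough_sequences input_file.fasta || exit 1` before calling `clustalo`.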

More Clustal Omega options can be found by typing:
{{< highlight bash >}}
$ clustalo -h
{{< /highlight >}}

An example of running Clustal Omega on Crane with the input file `input_reads.fasta`, using 8 threads and 10 GB of memory, is shown below:
{{% panel header="`clustal_omega.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=Clustal_Omega
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=10:00:00
#SBATCH --mem=10gb
#SBATCH --output=ClustalOmega.%J.out
#SBATCH --error=ClustalOmega.%J.err

module load clustal-omega/1.2

clustalo -i input_reads.fasta -o output_msa.sto --outfmt=st --threads=$SLURM_NTASKS_PER_NODE
{{< /highlight >}}
{{% /panel %}}

The output file `output_msa.sto` contains the resulting multiple sequence alignments in Stockholm format (**--outfmt=st**).
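
The submit script takes its thread count from `$SLURM_NTASKS_PER_NODE`, which is only set inside a SLURM job. As a small sketch, the hypothetical helper below falls back to a single thread when that variable is unset, for example during a quick interactive test; the helper name and the fallback value of 1 are assumptions for illustration.

```bash
#!/bin/bash
# Hypothetical helper: print the SLURM task count when it is set,
# otherwise print 1 so that clustalo runs single-threaded.
clustalo_threads() {
    echo "${SLURM_NTASKS_PER_NODE:-1}"
}

# Example use (commented out so that nothing runs when this file is sourced):
# clustalo -i input_reads.fasta -o output_msa.sto --outfmt=st \
#     --threads="$(clustalo_threads)"
```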

Alternatively, if you replace the command above with:
{{< highlight bash >}}
$ clustalo -i input_reads.sto --dealign -v
{{< /highlight >}}
Clustal Omega will read the input file in Stockholm format, de-align the sequences, and then re-align them, printing a progress report in the meantime (**-v**). Because no output format is specified, the output will be in the default `fasta` format.

### Clustal Omega Output

A basic Clustal Omega run produces one alignment file in the specified output format. Additional intermediate outputs can be generated using specific Clustal Omega options, such as **--distmat-out=<file>** (*pairwise distance matrix output file*) and **--guidetree-out=<file>** (*guide tree output file*).
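
As an illustration, the sketch below requests both intermediate outputs alongside the alignment. The function name, the output prefix, and the file extensions are hypothetical. `--full` is included on the assumption that, as in the 1.2 series, the distance matrix is only written when the full distance matrix is computed; check `clustalo -h` for the version you load.

```bash
#!/bin/bash
# Hypothetical wrapper: $1 is the input file, $2 is a prefix for all outputs.
# Writes the alignment, the pairwise distance matrix, and the guide tree.
align_with_intermediates() {
    clustalo -i "$1" -o "$2.fasta" \
        --full \
        --distmat-out="$2.distmat" \
        --guidetree-out="$2.dnd"
}
```

Calling `align_with_intermediates input_reads.fasta run1` would then be expected to produce `run1.fasta`, `run1.distmat`, and `run1.dnd`.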

### Useful Information

To test Clustal Omega performance on Tusker, we used three DNA and protein input fasta files: `data_1.fasta`, `data_2.fasta`, and `data_3.fasta`. Some statistics about the input files and the time and memory resources used by Clustal Omega on Tusker are shown in the table below:
{{< readfile file="/static/html/clustal_omega.html" >}}