+++
title = "SOAPdenovo2"
description =  "How to run SOAPdenovo2 on HCC resources"
weight = "10"
+++


[SOAPdenovo](http://soap.genomics.org.cn/soapdenovo.html) is a de novo genome assembler for short reads. It is specially designed for Illumina GA short reads and large plant and animal genomes. SOAPdenovo2 is a newer version of SOAPdenovo with an improved algorithm that reduces memory consumption, resolves more repeat regions, increases coverage, and optimizes the assembly for large genomes.

SOAPdenovo2 has two commands, **SOAPdenovo-63mer** and **SOAPdenovo-127mer**. The first is suitable for assemblies with k-mer values up to 63, requires less memory, and runs faster. The second works for k-mer values up to 127.

To see the options available for **SOAPdenovo-63mer**, run it without arguments:
{{< highlight bash >}}
$ SOAPdenovo-63mer
{{< /highlight >}}

SOAPdenovo2 can run the whole workflow at once, or in four separate steps.

The basic usage of SOAPdenovo2 is:
{{< highlight bash >}}
$ SOAPdenovo-63mer all -s configFile -o output_directory/outputGraph -K <kmer_value> [options]
{{< /highlight >}}
where **configFile** is a defined configuration file, **outputGraph** is the prefix of the output files, and **kmer_value** is the value of k-mer used for building the assembly (`<=63` for SOAPdenovo-63mer and `<=127` for SOAPdenovo-127mer).
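
Because the k-mer limit determines which executable to call, the choice can be made in the shell before the assembler is started. The snippet below is a minimal sketch of that idea; `pick_soapdenovo_binary` is a hypothetical helper name and is not part of SOAPdenovo2.

```bash
# Hypothetical helper (not part of SOAPdenovo2): print the name of the
# executable that supports the given k-mer value.
pick_soapdenovo_binary() {
    k="$1"
    # Reject empty input and anything that is not a whole number.
    case "$k" in
        ''|*[!0-9]*)
            echo "k-mer value must be a whole number" >&2
            return 1
            ;;
    esac
    if [ "$k" -le 63 ]; then
        echo "SOAPdenovo-63mer"
    elif [ "$k" -le 127 ]; then
        echo "SOAPdenovo-127mer"
    else
        echo "k-mer value $k is above the 127 limit" >&2
        return 1
    fi
}
```

For example, `pick_soapdenovo_binary 31` prints `SOAPdenovo-63mer`, and the printed name can then be used in place of the executable in the command above.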

To run the assembly process step by step, use the following commands in order:

{{% panel theme="info" header="SOAPdenovo2 Step 1 Options" %}}
{{< highlight bash >}}
SOAPdenovo-63mer pregraph -s configFile -o outputGraph [options]
# OR
SOAPdenovo-63mer sparse_pregraph -s configFile -K <kmer_value> -z <genome_size> -o outputGraph [options]
{{< /highlight >}}
{{% /panel %}}

{{% panel theme="info" header="SOAPdenovo2 Step 2 Options" %}}
{{< highlight bash >}}
SOAPdenovo-63mer contig -g inputGraph [options]
{{< /highlight >}}
{{% /panel %}}

{{% panel theme="info" header="SOAPdenovo2 Step 3 Options" %}}
{{< highlight bash >}}
SOAPdenovo-63mer map -s configFile -g inputGraph [options]
{{< /highlight >}}
{{% /panel %}}

{{% panel theme="info" header="SOAPdenovo2 Step 4 Options" %}}
{{< highlight bash >}}
SOAPdenovo-63mer scaff -g inputGraph [options]
{{< /highlight >}}
{{% /panel %}}
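
The four steps can be chained so that each one runs only if the previous one succeeded. The function below is a sketch under a few assumptions: `run_soapdenovo_steps` and the `SOAP_BIN` variable are hypothetical names introduced here, the same prefix is passed to `-o` in step 1 and to `-g` in the later steps, and `-K` is given to `pregraph` in the same way as in the `all` command.

```bash
# Sketch: run the four SOAPdenovo2 steps in order, stopping at the first failure.
# SOAP_BIN is an assumed variable naming the executable to call;
# it defaults to SOAPdenovo-63mer when unset.
run_soapdenovo_steps() {
    config="$1"   # configuration file (configFile)
    prefix="$2"   # prefix of the graph/output files
    kmer="$3"     # k-mer value
    bin="${SOAP_BIN:-SOAPdenovo-63mer}"

    # Each "&&" runs the next step only if the previous one exited with 0.
    "$bin" pregraph -s "$config" -K "$kmer" -o "$prefix" &&
    "$bin" contig -g "$prefix" &&
    "$bin" map -s "$config" -g "$prefix" &&
    "$bin" scaff -g "$prefix"
}
```

A call such as `run_soapdenovo_steps configFile output_directory/output31 31` would then run the same sequence as the panels above; any additional `[options]` still need to be added to the individual steps.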

As the commands above show, SOAPdenovo2 requires a configuration file (`configFile`) that describes the read files (read length, insert size, and read locations). SOAPdenovo2 accepts read files in three formats: fasta, fastq, and bam.

An example configuration file **configFile** for 2 paired-end fastq files, 1 paired-end fasta file, and 1 single-end fastq file looks like:
{{% panel header="`configFile`"%}}
{{< highlight bash >}}
#maximal read length
max_rd_len=150
[LIB]
#average insert size of the library
avg_ins=300
#if sequences are forward-reverse or reverse-forward
reverse_seq=0
#in which part(s) the reads are used (only contigs, only scaffolds, both contigs and scaffolds, only gap closure)
asm_flags=3
#cut the reads to the given length
rd_len_cutoff=100
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#paired-end fastq files, read 1 file should always be followed by read 2 file
q1=input_reads1_pair_1.fq
q2=input_reads1_pair_2.fq
#another pair of paired-end fastq files, read 1 file should always be followed by read 2 file
q1=input_reads2_pair_1.fq
q2=input_reads2_pair_2.fq
#paired-end fasta files, read 1 file should always be followed by read 2 file
f1=input_reads_pair_1.fa
f2=input_reads_pair_2.fa
#fastq file for single reads
q=input_reads.fq
{{< /highlight >}}
{{% /panel %}}
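
Since a long assembly job can fail late because of a mistyped read path, it may help to check the configuration file before submitting. The function below is a minimal sketch; `check_config_reads` is a hypothetical helper, it only looks at the keys used in the example above (`q1`, `q2`, `f1`, `f2`, `q`), and it resolves relative paths against the current working directory.

```bash
# Sketch: report read files named in a SOAPdenovo2 config file that do not exist.
# Only the keys shown in the example configFile (q1, q2, f1, f2, q) are checked.
check_config_reads() {
    config="$1"
    # Keep the "key=path" lines, strip the key, and collect paths that are not files.
    missing=$(grep -E '^(q1|q2|f1|f2|q)=' "$config" | cut -d= -f2- |
        while IFS= read -r path; do
            [ -f "$path" ] || echo "$path"
        done)
    if [ -n "$missing" ]; then
        echo "missing read files:" >&2
        echo "$missing" >&2
        return 1
    fi
}
```

Running `check_config_reads configFile` returns a non-zero exit status and lists the missing paths when any referenced read file cannot be found.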

After creating the configuration file **configFile**, the next step is to run the assembler using this file.

A simple SLURM script for running SOAPdenovo2 with `k-mer=31`, `8 CPUs`, and `50GB of RAM` is shown below:
{{% panel header="`soapdenovo2.submit`"%}}
{{< highlight bash >}}
#!/bin/bash
#SBATCH --job-name=SOAPdenovo2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=50gb
#SBATCH --output=SOAPdenovo2.%J.out
#SBATCH --error=SOAPdenovo2.%J.err

module load soapdenovo2/r240

SOAPdenovo-63mer all -s configFile -K 31 -o output_directory/output31 -p $SLURM_NTASKS_PER_NODE
{{< /highlight >}}
{{% /panel %}}


### SOAPdenovo2 Output

SOAPdenovo2 writes a number of files to its `output_directory/` after each executed step. The final assembly output is in the `.contig` file.
{{% panel header="`Output directory after SOAPdenovo2`"%}}
{{< highlight bash >}}
$ ls
output31.Arc            output31.ContigIndex       output31.gapSeq    output31.newContigInde
output31.bubbleInScaff  output31.contigPosInscaff  output31.kmerFreq  output31.peGrads
output31.contig         output31.edge.gz           output31.links     output31.preArc
{{< /highlight >}}
{{% /panel %}}
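
A quick sanity check after the job finishes is to confirm that the `.contig` file exists and to count the sequences in it. The sketch below assumes the `.contig` file is in FASTA format, where every sequence starts with a `>` header line; `count_contigs` is a hypothetical helper name.

```bash
# Sketch: count the sequences in a .contig file, assuming FASTA format.
count_contigs() {
    contig_file="$1"
    # -s is true only if the file exists and is not empty.
    if [ ! -s "$contig_file" ]; then
        echo "contig file is missing or empty: $contig_file" >&2
        return 1
    fi
    # Each FASTA record starts with a ">" header line.
    grep -c '^>' "$contig_file"
}
```

For the run above, this would be called as `count_contigs output_directory/output31.contig`.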


### Useful Information

To test the performance of SOAPdenovo2 (soapdenovo2/r240), we used input files of three different sizes. Some statistics about the input files, and the time and memory resources used by SOAPdenovo2, are shown in the table below:
{{< readfile file="/static/html/soapdenovo2.html" >}}

In general, SOAPdenovo2 is a memory-intensive assembler that requires approximately 30-60 GB of memory to assemble 50 million reads. It is also a fast assembler, taking around an hour to assemble 50 million reads.