running_blast_alignment.md 5.24 KB
Newer Older
npavlovikj's avatar
npavlovikj committed
1
2
3
4
5
+++
title = " Running BLAST Alignment"
description =  "How to run BLAST alignment on HCC resources"
weight = "10"
+++
6
7
8
9


Basic BLAST has the following commands:

npavlovikj's avatar
npavlovikj committed
10
11
12
13
14
- **blastn**: search nucleotide database using a nucleotide query
- **blastp**: search protein database using a protein query
- **blastx**: search protein database using a translated nucleotide query
- **tblastn**: search translated nucleotide database using a protein query
- **tblastx**: search translated nucleotide database using a translated nucleotide query
15

npavlovikj's avatar
i    
npavlovikj committed
16

17
The basic usage of **blastn** is:
npavlovikj's avatar
npavlovikj committed
18
19
20
21
22
23
24
25
26
27
28
29
{{< highlight bash >}}
$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments [options]
{{< /highlight >}}
where **input_reads.fasta** is an input file of sequence data in fasta format, **input_reads_db** is the generated BLAST database, and **blastn_output.alignments** is the output file where the alignments are stored.

Additional parameters can be found in the [BLAST manual] (https://www.ncbi.nlm.nih.gov/books/NBK279690/), or by typing:
{{< highlight bash >}}
$ blastn -help
{{< /highlight >}}

These BLAST alignment commands are multi-threaded, and therefore using the BLAST option **-num_threads <number_of_CPUs>** is recommended.

npavlovikj's avatar
i    
npavlovikj committed
30

npavlovikj's avatar
npavlovikj committed
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
HCC hosts multiple BLAST databases and indices on both Tusker and Crane. In order to use these resources, the ["biodata" module] (../../../biodata_module) needs to be loaded first. The **$BLAST** variable contains the following currently available databases:

- **16SMicrobial**
- **env_nt**
- **est**
- **est_human**
- **est_mouse**
- **est_others**
- **gss**
- **human_genomic**
- **human_genomic_transcript**
- **mouse_genomic_transcript**
- **nr**
- **nt**
- **other_genomic**
- **refseq_genomic**
- **refseq_rna**
- **sts**
- **swissprot**
- **tsa_nr**
- **tsa_nt**

If you want to create and use a BLAST database that is not mentioned above, check [Create Local BLAST Database](create_local_blast_database).

npavlovikj's avatar
i    
npavlovikj committed
55

npavlovikj's avatar
npavlovikj committed
56
57
58
Basic SLURM example of nucleotide BLAST run against the non-redundant **nt** BLAST database with `8 CPUs` is provided below. When running BLAST alignment, it is recommended to first copy the query and database files to the **/scratch/** directory of the worker node. Moreover, the BLAST output is also saved in this directory (**/scratch/blastn_output.alignments**). After BLAST finishes, the output file is copied from the worker node to your current work directory.
{{% notice info %}}
**Please note that the worker nodes can not write to the */home/* directories and therefore you need to run your job from your */work/* directory.**
59
**This example will first copy your database to faster local storage called “scratch”.  This can greatly improve performance!**
npavlovikj's avatar
npavlovikj committed
60
61
62
63
64
65
66
67
68
69
70
71
{{% /notice %}}

{{% panel header="`blastn_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=BlastN
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastN.%J.out
#SBATCH --error=BlastN.%J.err
72
73
74
75

module load blast/2.7
module load biodata/1.0

npavlovikj's avatar
npavlovikj committed
76
77
78
79
80
81
82
83
84
85
cd $WORK/<project_folder>
cp $BLAST/nt.* /scratch/
cp input_reads.fasta /scratch/

blastn -query /scratch/input_reads.fasta -db /scratch/nt -out /scratch/blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE

cp /scratch/blastn_output.alignments $WORK/<project_folder>
{{< /highlight >}}
{{% /panel %}}

npavlovikj's avatar
i    
npavlovikj committed
86

npavlovikj's avatar
npavlovikj committed
87
88
89
90
91
One important BLAST parameter is the **e-value threshold** that changes the number of hits returned by showing only those with value lower than the given. To show the hits with **e-value** lower than 1e-10, modify the given script as follows:
{{< highlight bash >}}
$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE -evalue 1e-10
{{< /highlight >}}

npavlovikj's avatar
i    
npavlovikj committed
92

npavlovikj's avatar
npavlovikj committed
93
94
The default BLAST output is in pairwise format. However, BLAST’s parameter **-outfmt** supports output in [different formats] (https://www.ncbi.nlm.nih.gov/books/NBK279684/) that are easier for parsing.

npavlovikj's avatar
i    
npavlovikj committed
95

npavlovikj's avatar
npavlovikj committed
96
97
98
Basic SLURM example of protein BLAST run against the non-redundant **nr **BLAST database with tabular output format and `8 CPUs` is shown below. Similarly as before, the query and database files are copied to the **/scratch/** directory. The BLAST output is also saved in this directory (**/scratch/blastx_output.alignments**). After BLAST finishes, the output file is copied from the worker node to your current work directory.
{{% notice info %}}
**Please note that the worker nodes can not write to the */home/* directories and therefore you need to run your job from your */work/* directory.**
99
**This example will first copy your database to faster local storage called “scratch”.  This can greatly improve performance!**
npavlovikj's avatar
npavlovikj committed
100
101
102
103
104
105
106
107
108
109
110
111
{{% /notice %}}

{{% panel header="`blastx_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX.%J.out
#SBATCH --error=BlastX.%J.err
112

npavlovikj's avatar
npavlovikj committed
113
114
module load blast/2.7
module load biodata/1.0
115

npavlovikj's avatar
npavlovikj committed
116
117
118
cd $WORK/<project_folder>
cp $BLAST/nr.* /scratch/
cp input_reads.fasta /scratch/
119

npavlovikj's avatar
npavlovikj committed
120
blastx -query /scratch/input_reads.fasta -db /scratch/nr -outfmt 6 -out /scratch/blastx_output.alignments -num_threads $SLURM_NTASKS_PER_NODE
121

npavlovikj's avatar
npavlovikj committed
122
123
cp /scratch/blastx_output.alignments $WORK/<project_folder>
{{< /highlight >}}
npavlovikj's avatar
i    
npavlovikj committed
124
{{% /panel %}}