running_blast_alignment.md 5.31 KB
Newer Older
npavlovikj's avatar
npavlovikj committed
1
2
3
4
5
+++
title = " Running BLAST Alignment"
description =  "How to run BLAST alignment on HCC resources"
weight = "10"
+++
6
7
8
9


Basic BLAST has the following commands:

npavlovikj's avatar
npavlovikj committed
10
11
12
13
14
- **blastn**: search nucleotide database using a nucleotide query
- **blastp**: search protein database using a protein query
- **blastx**: search protein database using a translated nucleotide query
- **tblastn**: search translated nucleotide database using a protein query
- **tblastx**: search translated nucleotide database using a translated nucleotide query
15

npavlovikj's avatar
i    
npavlovikj committed
16

17
The basic usage of **blastn** is:
npavlovikj's avatar
npavlovikj committed
18
19
20
21
22
23
24
25
26
27
28
29
{{< highlight bash >}}
$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments [options]
{{< /highlight >}}
where **input_reads.fasta** is an input file of sequence data in fasta format, **input_reads_db** is the generated BLAST database, and **blastn_output.alignments** is the output file where the alignments are stored.

Additional parameters can be found in the [BLAST manual] (https://www.ncbi.nlm.nih.gov/books/NBK279690/), or by typing:
{{< highlight bash >}}
$ blastn -help
{{< /highlight >}}

These BLAST alignment commands are multi-threaded, and therefore using the BLAST option **-num_threads <number_of_CPUs>** is recommended.

npavlovikj's avatar
i    
npavlovikj committed
30

31
HCC hosts multiple BLAST databases and indices on both Tusker and Crane. In order to use these resources, the ["biodata" module] ({{<relref "/guides/running_applications/bioinformatics_tools/biodata_module">}}) needs to be loaded first. The **$BLAST** variable contains the following currently available databases:
npavlovikj's avatar
npavlovikj committed
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52

- **16SMicrobial**
- **env_nt**
- **est**
- **est_human**
- **est_mouse**
- **est_others**
- **gss**
- **human_genomic**
- **human_genomic_transcript**
- **mouse_genomic_transcript**
- **nr**
- **nt**
- **other_genomic**
- **refseq_genomic**
- **refseq_rna**
- **sts**
- **swissprot**
- **tsa_nr**
- **tsa_nt**

53
If you want to create and use a BLAST database that is not mentioned above, check [Create Local BLAST Database]({{<relref "create_local_blast_database" >}}).
npavlovikj's avatar
npavlovikj committed
54

npavlovikj's avatar
i    
npavlovikj committed
55

npavlovikj's avatar
npavlovikj committed
56
57
58
Basic SLURM example of nucleotide BLAST run against the non-redundant **nt** BLAST database with `8 CPUs` is provided below. When running BLAST alignment, it is recommended to first copy the query and database files to the **/scratch/** directory of the worker node. Moreover, the BLAST output is also saved in this directory (**/scratch/blastn_output.alignments**). After BLAST finishes, the output file is copied from the worker node to your current work directory.
{{% notice info %}}
**Please note that the worker nodes can not write to the */home/* directories and therefore you need to run your job from your */work/* directory.**
59
**This example will first copy your database to faster local storage called “scratch”.  This can greatly improve performance!**
npavlovikj's avatar
npavlovikj committed
60
61
62
63
64
65
66
67
68
69
70
71
{{% /notice %}}

{{% panel header="`blastn_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=BlastN
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastN.%J.out
#SBATCH --error=BlastN.%J.err
72
73
74
75

module load blast/2.7
module load biodata/1.0

npavlovikj's avatar
npavlovikj committed
76
77
78
79
80
81
82
83
84
85
cd $WORK/<project_folder>
cp $BLAST/nt.* /scratch/
cp input_reads.fasta /scratch/

blastn -query /scratch/input_reads.fasta -db /scratch/nt -out /scratch/blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE

cp /scratch/blastn_output.alignments $WORK/<project_folder>
{{< /highlight >}}
{{% /panel %}}

npavlovikj's avatar
i    
npavlovikj committed
86

npavlovikj's avatar
npavlovikj committed
87
88
89
90
91
One important BLAST parameter is the **e-value threshold** that changes the number of hits returned by showing only those with value lower than the given. To show the hits with **e-value** lower than 1e-10, modify the given script as follows:
{{< highlight bash >}}
$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE -evalue 1e-10
{{< /highlight >}}

npavlovikj's avatar
i    
npavlovikj committed
92

npavlovikj's avatar
npavlovikj committed
93
94
The default BLAST output is in pairwise format. However, BLAST’s parameter **-outfmt** supports output in [different formats] (https://www.ncbi.nlm.nih.gov/books/NBK279684/) that are easier for parsing.

npavlovikj's avatar
i    
npavlovikj committed
95

npavlovikj's avatar
npavlovikj committed
96
97
98
Basic SLURM example of protein BLAST run against the non-redundant **nr **BLAST database with tabular output format and `8 CPUs` is shown below. Similarly as before, the query and database files are copied to the **/scratch/** directory. The BLAST output is also saved in this directory (**/scratch/blastx_output.alignments**). After BLAST finishes, the output file is copied from the worker node to your current work directory.
{{% notice info %}}
**Please note that the worker nodes can not write to the */home/* directories and therefore you need to run your job from your */work/* directory.**
99
**This example will first copy your database to faster local storage called “scratch”.  This can greatly improve performance!**
npavlovikj's avatar
npavlovikj committed
100
101
102
103
104
105
106
107
108
109
110
111
{{% /notice %}}

{{% panel header="`blastx_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX.%J.out
#SBATCH --error=BlastX.%J.err
112

npavlovikj's avatar
npavlovikj committed
113
114
module load blast/2.7
module load biodata/1.0
115

npavlovikj's avatar
npavlovikj committed
116
117
118
cd $WORK/<project_folder>
cp $BLAST/nr.* /scratch/
cp input_reads.fasta /scratch/
119

npavlovikj's avatar
npavlovikj committed
120
blastx -query /scratch/input_reads.fasta -db /scratch/nr -outfmt 6 -out /scratch/blastx_output.alignments -num_threads $SLURM_NTASKS_PER_NODE
121

npavlovikj's avatar
npavlovikj committed
122
123
cp /scratch/blastx_output.alignments $WORK/<project_folder>
{{< /highlight >}}
npavlovikj's avatar
i    
npavlovikj committed
124
{{% /panel %}}