running_blast_alignment.md 4.99 KB
Newer Older
npavlovikj's avatar
npavlovikj committed
1
2
3
4
5
+++
title = " Running BLAST Alignment"
description =  "How to run BLAST alignment on HCC resources"
weight = "10"
+++
6
7
8
9


Basic BLAST has the following commands:

npavlovikj's avatar
npavlovikj committed
10
11
12
13
14
- **blastn**: search nucleotide database using a nucleotide query
- **blastp**: search protein database using a protein query
- **blastx**: search protein database using a translated nucleotide query
- **tblastn**: search translated nucleotide database using a protein query
- **tblastx**: search translated nucleotide database using a translated nucleotide query
15
16

The basic usage of **blastn** is:
npavlovikj's avatar
npavlovikj committed
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
{{< highlight bash >}}
$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments [options]
{{< /highlight >}}
where **input_reads.fasta** is an input file of sequence data in fasta format, **input_reads_db** is the generated BLAST database, and **blastn_output.alignments** is the output file where the alignments are stored.

Additional parameters can be found in the [BLAST manual] (https://www.ncbi.nlm.nih.gov/books/NBK279690/), or by typing:
{{< highlight bash >}}
$ blastn -help
{{< /highlight >}}

These BLAST alignment commands are multi-threaded, and therefore using the BLAST option **-num_threads <number_of_CPUs>** is recommended.

\\
HCC hosts multiple BLAST databases and indices on both Tusker and Crane. In order to use these resources, the ["biodata" module] (../../../biodata_module) needs to be loaded first. The **$BLAST** variable contains the following currently available databases:

- **16SMicrobial**
- **env_nt**
- **est**
- **est_human**
- **est_mouse**
- **est_others**
- **gss**
- **human_genomic**
- **human_genomic_transcript**
- **mouse_genomic_transcript**
- **nr**
- **nt**
- **other_genomic**
- **refseq_genomic**
- **refseq_rna**
- **sts**
- **swissprot**
- **tsa_nr**
- **tsa_nt**

If you want to create and use a BLAST database that is not mentioned above, check [Create Local BLAST Database](create_local_blast_database).

\\
Basic SLURM example of nucleotide BLAST run against the non-redundant **nt** BLAST database with `8 CPUs` is provided below. When running BLAST alignment, it is recommended to first copy the query and database files to the **/scratch/** directory of the worker node. Moreover, the BLAST output is also saved in this directory (**/scratch/blastn_output.alignments**). After BLAST finishes, the output file is copied from the worker node to your current work directory.
{{% notice info %}}
**Please note that the worker nodes can not write to the */home/* directories and therefore you need to run your job from your */work/* directory.**
{{% /notice %}}

{{% panel header="`blastn_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=BlastN
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastN.%J.out
#SBATCH --error=BlastN.%J.err
70
71
72
73

module load blast/2.7
module load biodata/1.0

npavlovikj's avatar
npavlovikj committed
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
cd $WORK/<project_folder>
cp $BLAST/nt.* /scratch/
cp input_reads.fasta /scratch/

blastn -query /scratch/input_reads.fasta -db /scratch/nt -out /scratch/blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE

cp /scratch/blastn_output.alignments $WORK/<project_folder>
{{< /highlight >}}
{{% /panel %}}

\\
One important BLAST parameter is the **e-value threshold** that changes the number of hits returned by showing only those with value lower than the given. To show the hits with **e-value** lower than 1e-10, modify the given script as follows:
{{< highlight bash >}}
$ blastn -query input_reads.fasta -db input_reads_db -out blastn_output.alignments -num_threads $SLURM_NTASKS_PER_NODE -evalue 1e-10
{{< /highlight >}}

\\
The default BLAST output is in pairwise format. However, BLAST’s parameter **-outfmt** supports output in [different formats] (https://www.ncbi.nlm.nih.gov/books/NBK279684/) that are easier for parsing.

\\
Basic SLURM example of protein BLAST run against the non-redundant **nr **BLAST database with tabular output format and `8 CPUs` is shown below. Similarly as before, the query and database files are copied to the **/scratch/** directory. The BLAST output is also saved in this directory (**/scratch/blastx_output.alignments**). After BLAST finishes, the output file is copied from the worker node to your current work directory.
{{% notice info %}}
**Please note that the worker nodes can not write to the */home/* directories and therefore you need to run your job from your */work/* directory.**
{{% /notice %}}

{{% panel header="`blastx_alignment.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=BlastX
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=168:00:00
#SBATCH --mem=20gb
#SBATCH --output=BlastX.%J.out
#SBATCH --error=BlastX.%J.err
109

npavlovikj's avatar
npavlovikj committed
110
111
module load blast/2.7
module load biodata/1.0
112

npavlovikj's avatar
npavlovikj committed
113
114
115
cd $WORK/<project_folder>
cp $BLAST/nr.* /scratch/
cp input_reads.fasta /scratch/
116

npavlovikj's avatar
npavlovikj committed
117
blastx -query /scratch/input_reads.fasta -db /scratch/nr -outfmt 6 -out /scratch/blastx_output.alignments -num_threads $SLURM_NTASKS_PER_NODE
118

npavlovikj's avatar
npavlovikj committed
119
120
121
cp /scratch/blastx_output.alignments $WORK/<project_folder>
{{< /highlight >}}
{{% /panel %}}