+++
title = "SRAtoolkit"
description =  "How to run SRAtoolkit on HCC resources"
weight = "10"
+++


[SRA (Sequence Read Archive)](http://www.ncbi.nlm.nih.gov/sra) is an NCBI-defined format for NGS data. All NGS data submitted to NCBI must be in SRA format. The SRA Toolkit provides tools for downloading data, for converting data from other formats into SRA format, and for extracting SRA data into other formats.

The SRA Toolkit allows converting data from the SRA format to the following formats: `ABI SOLiD native`, `fasta`, `fastq`, `sff`, `sam`, and `Illumina native`. Also, the SRA Toolkit allows converting data from `fasta`, `fastq`, `AB SOLiD-SRF`, `AB SOLiD-native`, `Illumina SRF`, `Illumina native`, `sff`, and `bam` format into the SRA format.

The SRA Toolkit supports downloading SRA data using the **"prefetch"** command:
{{< highlight bash >}}
$ prefetch <sra_id>
{{< /highlight >}}
where `<sra_id>` is the assigned SRA identification in NCBI (e.g., SRR1482462). 
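
When several runs are needed, the same command can be looped over a list of accessions. The following is a minimal sketch rather than an HCC-provided script: the function name `prefetch_list`, the input file `accessions.txt`, and the `PREFETCH_CMD` override (handy for a dry run) are assumptions made for illustration.

```bash
#!/bin/bash
# Hypothetical helper: download every accession listed in a text file,
# one SRA ID per line. Empty lines and lines starting with '#' are skipped.
# PREFETCH_CMD is an illustrative override; it defaults to the real prefetch.
prefetch_list() {
    local list_file="$1"
    local sra_id
    while IFS= read -r sra_id || [ -n "$sra_id" ]; do
        # Skip empty lines and comment lines
        case "$sra_id" in
            ''|'#'*) continue ;;
        esac
        # Redirect stdin so the command cannot consume the accession list
        ${PREFETCH_CMD:-prefetch} "$sra_id" < /dev/null \
            || echo "failed: $sra_id" >&2
    done < "$list_file"
}
```

With the function defined, `prefetch_list accessions.txt` requests each listed run in turn, while `PREFETCH_CMD=echo prefetch_list accessions.txt` only prints the IDs that would be requested.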

The SRA Toolkit contains multiple **"format"-dump** commands, where **format** is the file format the SRA data is converted to: **abi-dump**, **fastq-dump**, **illumina-dump**, **sam-dump**, **sff-dump**, and **vdb-dump**.

One of the most commonly used commands is **fastq-dump**:
{{< highlight bash >}}
$ fastq-dump [options] input_reads.sra
{{< /highlight >}}
This command can be applied to SRA data downloaded with **"prefetch"**.
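
The download and conversion steps can be chained so that the conversion runs only if the download succeeds. This is a sketch under stated assumptions: the names `run_step` and `fetch_and_dump` and the `SRA_DRY_RUN` variable are hypothetical, and passing the accession (rather than a path to the `.sra` file) to **fastq-dump** relies on the toolkit locating the copy that **prefetch** stored, which may depend on the local SRA Toolkit configuration.

```bash
#!/bin/bash
# Hypothetical helper: run a command, or only print it when SRA_DRY_RUN=1.
run_step() {
    if [ "${SRA_DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: download one accession, then convert it to FASTQ.
fetch_and_dump() {
    local sra_id="$1"
    # '&&' ensures fastq-dump runs only after a successful prefetch
    run_step prefetch "$sra_id" && run_step fastq-dump "$sra_id"
}
```

Setting `SRA_DRY_RUN=1` before calling `fetch_and_dump SRR1482462` prints the two commands without running them, which is a cheap way to check the accession before submitting a long job.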

An example of running **fastq-dump** on Crane to convert an SRA file containing paired-end reads is:
{{% panel header="`sratoolkit.submit`"%}}
{{< highlight bash >}}
#!/bin/bash
#SBATCH --job-name=SRAtoolkit
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=SRAtoolkit.%J.out
#SBATCH --error=SRAtoolkit.%J.err

module load SRAtoolkit/2.11

fastq-dump --split-files input_reads.sra
{{< /highlight >}}
{{% /panel %}}
This script writes the paired-end reads to two FASTQ files, `input_reads_1.fastq` and `input_reads_2.fastq`.
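
A quick sanity check on the result is that both files hold the same number of reads. The sketch below assumes standard four-line FASTQ records (no wrapped sequence lines); the function names `count_reads` and `check_pairs` are illustrative and are not part of the SRA Toolkit.

```bash
#!/bin/bash
# Count reads in a FASTQ file, assuming exactly four lines per record.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Compare the read counts of the two mate files; return non-zero on mismatch.
check_pairs() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 read pairs"
    else
        echo "MISMATCH: $1 has $n1 reads, $2 has $n2 reads" >&2
        return 1
    fi
}
```

For the job above, the check would be invoked as `check_pairs input_reads_1.fastq input_reads_2.fastq`.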

The SRAtoolkit commands used here are single-threaded, and therefore both `#SBATCH --nodes` and `#SBATCH --ntasks-per-node` in the SLURM script are set to **1**.

The SRA Toolkit contains multiple **"format"-load** commands, where **format** is the file format of the data that is uploaded to NCBI: `srf-load`, `sff-load`, `refseq-load`, `pacbio-load`, `illumina-load`, `helicos-load`, `fastq-load`, `cg-load`, `bam-load`, and `abi-load`.

An example of converting the BAM file `input_alignments.bam` into SRA format for upload to NCBI is shown below:
{{< highlight bash >}}
$ bam-load -o input_reads.sra input_alignments.bam
{{< /highlight >}}

Other frequently used SRAtoolkit tools are:

- **sra-stat**: generate statistics about SRA data
- **sra-pileup**: generate pileup statistics on aligned SRA data
- **vdb-config**: display and modify VDB configuration information
- **vdb-encrypt**: encrypt non-SRA dbGaP data
- **vdb-decrypt**: decrypt non-SRA dbGaP data
- **vdb-validate**: validate the integrity of downloaded SRA data
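
As an illustration of the last tool in the list, the sketch below runs **vdb-validate** on each file given and reports which ones fail. The function name `validate_all` and the `VALIDATE_CMD` override are assumptions for illustration, and the check assumes that a non-zero exit status from the command signals a problem with the file.

```bash
#!/bin/bash
# Hypothetical helper: validate each SRA file passed as an argument.
# VALIDATE_CMD is an illustrative override; it defaults to vdb-validate.
validate_all() {
    local failed=0
    local f
    for f in "$@"; do
        # Only the exit status is used; the tool's own report is discarded
        if ${VALIDATE_CMD:-vdb-validate} "$f" > /dev/null 2>&1; then
            echo "valid: $f"
        else
            echo "FAILED: $f"
            failed=$(( failed + 1 ))
        fi
    done
    # Succeed only if every file passed validation
    [ "$failed" -eq 0 ]
}
```

A call such as `validate_all input_reads.sra` could be placed before the conversion step in a job script so that a corrupted download is caught early.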