+++
title = "SRAtoolkit"
description =  "How to run SRAtoolkit on HCC resources"
weight = "10"
+++


[SRA (Sequence Read Archive)](http://www.ncbi.nlm.nih.gov/sra) is an NCBI-defined format for NGS data. All data submitted to NCBI must be in SRA format. The SRA Toolkit provides tools for converting data from other formats into SRA format and, conversely, for extracting SRA data into other formats.

The SRA Toolkit allows converting data from the SRA format to the following formats: `ABI SOLiD native`, `fasta`, `fastq`, `sff`, `sam`, and `Illumina native`. Also, the SRA Toolkit allows converting data from `fasta`, `fastq`, `AB SOLiD-SRF`, `AB SOLiD-native`, `Illumina SRF`, `Illumina native`, `sff`, and `bam` format into the SRA format.

The SRA Toolkit contains multiple **"format"-dump** commands, where **format** is the file format the SRA data is converted to: **abi-dump**, **fastq-dump**, **illumina-dump**, **sam-dump**, **sff-dump**, and **vdb-dump**.

One of the most commonly used commands is **fastq-dump**:
{{< highlight bash >}}
$ fastq-dump [options] input_reads.sra
{{< /highlight >}}
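
As an illustration of the `[options]` placeholder above, the following minimal sketch previews the first few spots of an SRA file on the terminal without writing any files. The function name `preview_reads` and its default of 5 spots are hypothetical choices made for this example; `-X` (maximum spot ID) and `-Z` (write to standard output) are **fastq-dump** options.

```bash
# Sketch only: print the first N spots of an SRA file to standard output.
# preview_reads is a hypothetical helper name, not part of the SRA Toolkit.
preview_reads() {
    sra_file="$1"        # path to the SRA file, e.g. input_reads.sra
    n_spots="${2:-5}"    # number of spots to show (defaults to 5)
    # -X stops after the given spot ID; -Z writes to stdout instead of files.
    fastq-dump -X "$n_spots" -Z "$sra_file"
}

# Example call (requires the SRAtoolkit module to be loaded):
# preview_reads input_reads.sra 10
```

A preview like this can be a quick way to confirm that a file is readable before submitting a long conversion job.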

An example of running **fastq-dump** on Crane to convert an SRA file containing paired-end reads is:
{{% panel header="`sratoolkit.submit`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --job-name=SRAtoolkit
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=168:00:00
#SBATCH --mem=10gb
#SBATCH --output=SRAtoolkit.%J.out
#SBATCH --error=SRAtoolkit.%J.err

module load SRAtoolkit/2.9

fastq-dump --split-files input_reads.sra
{{< /highlight >}}
{{% /panel %}}
This script writes the paired-end reads to two FASTQ files, `input_reads_1.fastq` and `input_reads_2.fastq`.
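
Once the job finishes, a quick sanity check on the two output files can catch an incomplete conversion. The sketch below is one possible check, not part of the SRA Toolkit: the function names `count_reads` and `check_paired_output` are hypothetical, and it assumes the common FASTQ layout of four lines per read.

```bash
# Sketch only: helper names are hypothetical, not SRA Toolkit commands.

# Count reads in a FASTQ file, assuming 4 lines per record.
count_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Succeed only if both mate files exist, are non-empty, and hold the
# same number of reads. The argument is the SRA file name without ".sra".
check_paired_output() {
    prefix="$1"
    r1="${prefix}_1.fastq"
    r2="${prefix}_2.fastq"
    [ -s "$r1" ] && [ -s "$r2" ] || {
        echo "missing or empty output for $prefix" >&2
        return 1
    }
    n1=$(count_reads "$r1")
    n2=$(count_reads "$r2")
    echo "$r1: $n1 reads, $r2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}

# Example call after the job above has completed:
# check_paired_output input_reads
```

The function returns a non-zero exit status when a file is missing or the read counts differ, so it can be chained after **fastq-dump** in a submit script.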

All SRAtoolkit commands are single-threaded, so both `#SBATCH --nodes` and `#SBATCH --ntasks-per-node` in the SLURM script are set to **1**.

The SRA Toolkit contains multiple **"format"-load** commands, where **format** is the file format of the data that is uploaded to NCBI: `srf-load`, `sff-load`, `refseq-load`, `pacbio-load`, `illumina-load`, `helicos-load`, `fastq-load`, `cg-load`, `bam-load`, and `abi-load`.

An example of converting the BAM file `input_alignments.bam` into SRA format, ready for upload to NCBI, is shown below:
{{< highlight bash >}}
$ bam-load -o input_reads.sra input_alignments.bam
{{< /highlight >}}

Other frequently used SRAtoolkit tools are:

- **prefetch**: allows command-line downloading of SRA, dbGaP, and ADSP data
- **sra-stat**: generate statistics about SRA data
- **sra-pileup**: generate pileup statistics on aligned SRA data
- **vdb-config**: display and modify VDB configuration information
- **vdb-encrypt**: encrypt non-SRA dbGaP data
- **vdb-decrypt**: decrypt non-SRA dbGaP data
- **vdb-validate**: validate the integrity of downloaded SRA data
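
Several of the tools listed above are often chained together. The following is a minimal sketch, assuming each tool accepts a run accession as its only positional argument; the function name `fetch_and_convert` is hypothetical and `SRR000001` is only a placeholder accession.

```bash
# Sketch only: download a run, check its integrity, then convert it.
# fetch_and_convert is a hypothetical helper name.
fetch_and_convert() {
    accession="$1"
    # Download the run into the configured prefetch directory.
    prefetch "$accession" || return 1
    # Stop here if the downloaded data fails validation.
    vdb-validate "$accession" || return 1
    # Convert to FASTQ, writing one file per mate for paired-end data.
    fastq-dump --split-files "$accession"
}

# Example call (requires the SRAtoolkit module to be loaded):
# fetch_and_convert SRR000001
```

Because each step returns early on failure, the conversion only runs on data that downloaded and validated successfully.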

{{% notice info %}}
**Prefetch instructions:**
\\
\\
When **prefetch** is used, the files are downloaded to **${HOME}/ncbi/public** by default.
\\
Since the */home* directory (*$HOME*) is not writable from the worker nodes, the files cannot be saved in *${HOME}/ncbi/public* when submitting a SLURM job.
\\
\\
To change the default output directory for **prefetch** to **${WORK}/ncbi/public**, please follow these three steps:
\\
**$ wget https://raw.githubusercontent.com/ncbi/ncbi-vdb/master/libs/kfg/default.kfg -P $HOME/.ncbi/**
\\
**$ vim $HOME/.ncbi/default.kfg**
\\
Here, set *"/repository/user/main/public/root"* to *"/work/group/username/ncbi/public"*, where **group** is the name of **your HCC group**, and **username** is **your HCC username**.
\\
**$ export VDB_CONFIG=$HOME/.ncbi/default.kfg**
\\
\\
You need to do these steps only once.
{{% /notice %}}
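
The manual editing step in the prefetch instructions above could also be scripted. The sketch below is an illustration under stated assumptions, not an official procedure: the function name `set_public_root` is hypothetical, and it assumes the configuration file stores settings one per line in the form `key = "value"`.

```bash
# Sketch only: set /repository/user/main/public/root in a .kfg file.
# set_public_root is a hypothetical helper; the key = "value" line
# format is an assumption about the configuration file layout.
set_public_root() {
    kfg="$1"         # path to the configuration file
    new_root="$2"    # new download directory
    key="/repository/user/main/public/root"
    kfg_new="${kfg}.new"
    touch "$kfg"
    # Drop any existing assignment of the key, then append the new one.
    grep -v "^${key}[[:space:]]*=" "$kfg" > "$kfg_new" || true
    printf '%s = "%s"\n' "$key" "$new_root" >> "$kfg_new"
    mv "$kfg_new" "$kfg"
}

# Example call, where group and username are your HCC group and username:
# set_public_root "$HOME/.ncbi/default.kfg" "/work/group/username/ncbi/public"
# export VDB_CONFIG=$HOME/.ncbi/default.kfg
```

Checking the resulting file by eye before exporting `VDB_CONFIG` is still advisable, since the assumed line format may not match every version of the file.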