Commit 54bd0d9a authored by Natasha Pavlovikj

Remove aspera doc and update SRAToolkit

parent b05dba19
@@ -5,10 +5,16 @@ weight = "10"
+++
[SRA (Sequence Read Archive)](http://www.ncbi.nlm.nih.gov/sra) is an NCBI-defined format for NGS data. All data submitted to NCBI must be in SRA format. The SRA Toolkit provides tools for converting data from other formats into SRA format and, conversely, for extracting SRA data into other formats.
[SRA (Sequence Read Archive)](http://www.ncbi.nlm.nih.gov/sra) is an NCBI-defined format for NGS data. All data submitted to NCBI must be in SRA format. The SRA Toolkit provides tools for downloading data, converting data from other formats into SRA format and, conversely, extracting SRA data into other formats.
The SRA Toolkit allows converting data from the SRA format to the following formats: `ABI SOLiD native`, `fasta`, `fastq`, `sff`, `sam`, and `Illumina native`. Also, the SRA Toolkit allows converting data from `fasta`, `fastq`, `AB SOLiD-SRF`, `AB SOLiD-native`, `Illumina SRF`, `Illumina native`, `sff`, and `bam` format into the SRA format.
The SRA Toolkit supports downloading SRA data using the **"prefetch"** command:
{{< highlight bash >}}
$ prefetch <sra_id>
{{< /highlight >}}
where `<sra_id>` is the SRA accession assigned by NCBI (e.g., SRR1482462).
The SRA Toolkit contains multiple **"format"-dump** commands, where **format** is the file format the SRA data is converted to: **abi-dump**, **fastq-dump**, **illumina-dump**, **sam-dump**, **sff-dump**, and **vdb-dump**.
@@ -16,6 +22,7 @@ One of the most commonly used commands is **fastq-dump**:
{{< highlight bash >}}
$ fastq-dump [options] input_reads.sra
{{< /highlight >}}
This command can be applied to SRA data downloaded with **"prefetch"**.
An example of running **fastq-dump** on Crane to convert an SRA file containing paired-end reads is:
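The download and conversion steps can be chained. The following is a minimal sketch, not an official workflow: the `run` and `fetch_and_convert` helpers are hypothetical names introduced here, and the sketch assumes that **fastq-dump** can resolve the accession that **prefetch** just downloaded, which depends on the local VDB configuration. Setting `DRY_RUN=1` only prints the commands instead of executing them.

```bash
#!/usr/bin/env bash

# Hypothetical helper: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: download one accession with prefetch, then convert it
# to FASTQ. --split-files is the option used in the Crane example for
# paired-end reads.
fetch_and_convert() {
    local sra_id="$1"
    run prefetch "$sra_id" || return 1
    run fastq-dump --split-files "$sra_id"
}

# Example (prints the two commands without running them):
#   DRY_RUN=1 fetch_and_convert SRR1482462
```

Stopping after a failed **prefetch** avoids running the conversion on a missing or partial download.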
@@ -30,7 +37,7 @@ An example of running **fastq-dump** on Crane to convert SRA file containing pai
#SBATCH --output=SRAtoolkit.%J.out
#SBATCH --error=SRAtoolkit.%J.err
module load SRAtoolkit/2.9
module load SRAtoolkit/2.11
fastq-dump --split-files input_reads.sra
{{< /highlight >}}
@@ -51,33 +58,9 @@ $ bam-load \-o input_reads.sra input_alignments.bam
Other frequently used SRAtoolkit tools are:
- **prefetch**: allows command-line downloading of SRA, dbGaP, and ADSP data
- **sra-stat**: generate statistics about SRA data
- **sra-pileup**: generate pileup statistics on aligned SRA data
- **vdb-config**: display and modify VDB configuration information
- **vdb-encrypt**: encrypt non-SRA dbGaP data
- **vdb-decrypt**: decrypt non-SRA dbGaP data
- **vdb-validate**: validate the integrity of downloaded SRA data
{{% notice info %}}
**Prefetch instructions:**
\\
\\
When **prefetch** is used, the files are downloaded in **${HOME}/ncbi/public** by default.
\\
Since the */home* directory (*$HOME*) is not writable from the worker nodes, the file cannot be saved in *${HOME}/ncbi/public* when submitting a SLURM job.
\\
\\
To change the default output directory for **prefetch** to **${WORK}/ncbi/public**, please follow these three steps:
\\
**$ wget https://raw.githubusercontent.com/ncbi/ncbi-vdb/master/libs/kfg/default.kfg -P $HOME/.ncbi/**
\\
**$ vim $HOME/.ncbi/default.kfg**
\\
Here, set *"/repository/user/main/public/root"* to *"/work/group/username/ncbi/public"*, where **group** is the name of **your HCC group**, and **username** is **your HCC username**.
\\
**$ export VDB_CONFIG=$HOME/.ncbi/default.kfg**
\\
\\
You need to do these steps only once.
{{% /notice %}}
+++
title = "Downloading SRA data from NCBI"
description = "How to download data from NCBI"
weight = "52"
+++
One way to download high-volume data from NCBI is to use command line
utilities, such as **wget**, **ftp**, or the Aspera Connect **ascp**
plugin. The Aspera Connect plugin is a commonly used high-performance transfer
plugin that provides the best transfer speed.
This plugin is available on our clusters as a module. To use it, load the appropriate module first:
{{< highlight bash >}}
$ module load aspera-cli
{{< /highlight >}}
The basic usage of the Aspera plugin is:
{{< highlight bash >}}
$ ascp -i $ASPERA_PUBLIC_KEY -k 1 -T -l <max_download_rate_in_Mbps>m anonftp@ftp.ncbi.nlm.nih.gov:/<files_to_transfer> <local_work_output_directory>
{{< /highlight >}}
where **-k 1** enables resuming partial transfers, **-T** disables encryption for maximum throughput, and **-l** sets the maximum transfer rate.
The **\<files_to_transfer\>** argument in the basic usage above
must follow a specific pattern:
{{< highlight bash >}}
<files_to_transfer> = /sra/sra-instant/reads/ByRun/sra/SRR|ERR|DRR/<first_6_characters_of_accession>/<accession>/<accession>.sra
{{< /highlight >}}
where **SRR\|ERR\|DRR** should be one of **SRR**, **ERR**, or **DRR**, matching the prefix of the target **.sra** file.
More **ascp** options can be seen by using:
{{< highlight bash >}}
$ ascp --help
{{< /highlight >}}
For example, to download the **SRR304976** file from NCBI into the **data/** directory in your $WORK with a maximum download rate of **1000 Mbps**, use the following command:
{{< highlight bash >}}
$ ascp -i $ASPERA_PUBLIC_KEY -k 1 -T -l 1000m anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra /work/[groupname]/[username]/data/
{{< /highlight >}}
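The **\<files_to_transfer\>** pattern described above can also be built from the accession alone. The following is a minimal sketch: `sra_ascp_path` is a hypothetical helper introduced here for illustration and is not part of Aspera or the SRA Toolkit. It assumes the accession starts with one of the three prefixes named above and only assembles the path; it does not check that the file exists on the server.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the remote .sra path for an accession, following
# /sra/sra-instant/reads/ByRun/sra/<prefix>/<first_6_characters>/<accession>/<accession>.sra
sra_ascp_path() {
    local accession="$1"
    local prefix="${accession:0:3}"   # SRR, ERR, or DRR
    local first6="${accession:0:6}"   # first 6 characters of the accession

    case "$prefix" in
        SRR|ERR|DRR) ;;
        *)
            echo "unsupported accession prefix: $prefix" >&2
            return 1
            ;;
    esac

    printf '/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
        "$prefix" "$first6" "$accession" "$accession"
}
```

With this helper, the remote argument of the example above could be written as `anonftp@ftp.ncbi.nlm.nih.gov:$(sra_ascp_path SRR304976)`.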