+++
title = "Ray"
description =  "How to run Ray on HCC resources"
weight = "10"
+++

[Ray](http://denovoassembler.sourceforge.net/) is a de novo de Bruijn genome assembler that works with next-generation sequencing data (Illumina, 454, SOLiD). Ray is scalable, parallel software that uses MPI (Message Passing Interface) to take advantage of multiple nodes and multiple CPUs.

Ray can be used for several applications:

- de novo genome assembly
- de novo meta-genome assembly
- de novo transcriptome assembly
- quantification of contig abundances, microbiome consortia members, transcript expression
- taxonomy and gene ontology profiling of samples
- comparing DNA samples using words

To see all options available for running Ray, type:
{{< highlight bash >}}
$ mpiexec Ray -help
{{< /highlight >}}

All options used for Ray can be defined on the command line:
{{< highlight bash >}}
$ mpiexec Ray -k <kmer_value> -p input_reads_pair_1.[fa|fq] input_reads_pair_2.[fa|fq] -s input_reads.[fa|fq] -o <output_directory>
{{< /highlight >}}
or stored in a `.conf` configuration file (one option per line):
{{< highlight bash >}}
$ mpiexec Ray Ray.conf
{{< /highlight >}}
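
As a minimal sketch of the one-option-per-line layout, the snippet below writes a hypothetical `Ray.conf` that mirrors the command-line example above. The k-mer value, input file names, and output directory are placeholders chosen for illustration, not values taken from a real run.

```shell
# Write a hypothetical Ray.conf with one option per line.
# The k-mer value, file names, and output directory are placeholders.
cat > Ray.conf <<'EOF'
-k 31
-p input_reads_pair_1.fastq input_reads_pair_2.fastq
-s input_reads.fasta
-o output_directory
EOF

# Print the file to confirm its contents.
cat Ray.conf
```

The assembly would then be started with `mpiexec Ray Ray.conf`, as in the example above.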

Ray supports both paired-end (`-p`) and single-end reads (`-s`). Moreover, Ray can detect the input files automatically if the input directory is provided (`-detect-sequence-files input_directory`).

Ray supports odd values for k-mer equal to or greater than 21 (`-k <kmer_value>`). Ray supports multiple file formats such as `fasta`, `fa`, `fasta.gz`, `fa.gz`, `fasta.bz2`, `fa.bz2`, `fastq`, `fq`, `fastq.gz`, `fq.gz`, `fastq.bz2`, `fq.bz2`, `sff`, `csfasta`, `csfa`.
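
The k-mer rule above (odd, and at least 21) can be checked before a job is submitted. The helper below is a sketch only: `is_valid_kmer` is a hypothetical function name, not part of Ray, and it encodes just the rule stated on this page.

```shell
# Hypothetical helper (not part of Ray): succeed only if the argument
# is an odd integer greater than or equal to 21.
is_valid_kmer() {
    local k="$1"
    case "$k" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a non-negative integer
        *[13579]) ;;               # last digit is odd, so the value is odd
        *) return 1 ;;             # even value
    esac
    [ "$k" -ge 21 ]                # enforce the minimum of 21
}

# Try a few candidate values.
for k in 19 21 30 31; do
    if is_valid_kmer "$k"; then
        echo "k=$k: accepted"
    else
        echo "k=$k: rejected"
    fi
done
```

A submit script could call this helper on the chosen value and exit early instead of waiting in the queue for a job that Ray would reject.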

A simple SLURM script for running Ray with both paired-end and single-end data, using a k-mer length of 31, 8 CPUs, and 4 GB of RAM per CPU, is shown below:
{{% panel header="`ray.submit`"%}}
{{< highlight bash >}}
#!/bin/bash
#SBATCH --job-name=Ray
#SBATCH --ntasks=8
#SBATCH --time=168:00:00
#SBATCH --mem-per-cpu=4gb
#SBATCH --output=Ray.%J.out
#SBATCH --error=Ray.%J.err

module load compiler/gcc/4.7 openmpi/1.6 ray/2.3

mpiexec Ray -k 31 -p input_reads_pair_1.fastq input_reads_pair_2.fastq -s input_reads.fasta -o output_directory
{{< /highlight >}}
{{% /panel %}}
where **input_reads_pair_1.fastq** and **input_reads_pair_2.fastq** are the paired-end input files in `fastq` format, and **input_reads.fasta** is the single-end input file in `fasta` format.

{{% notice note %}}
It is **not** necessary to specify the number of processes with the `-n` option to `mpiexec`. OpenMPI will determine that automatically from SLURM based on the value of the `--ntasks` option.
{{% /notice %}}


### Ray Output

In the output folder (`-o output_directory`), Ray writes many files with information about the different steps and statistics of the run. Information about all output files can be found in Ray's manual.

The most important output files are:

- **Scaffolds.fasta**: scaffold sequences in FASTA format
- **ScaffoldComponents.txt**: components of each scaffold
- **Contigs.fasta**: contiguous sequences in FASTA format
- **OutputNumbers.txt**: overall numbers for the assembly
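
After a job finishes, a quick check that these four files exist and are non-empty can catch a failed or truncated run. The function below is a sketch under that assumption: `check_ray_output` is a hypothetical name, and the only file names it knows are the ones listed above.

```shell
# Hypothetical helper: report which of the key output files listed above
# are present and non-empty in the given directory.
# Returns 0 if all four are found, 1 otherwise.
check_ray_output() {
    local dir="$1" missing=0 f
    for f in Scaffolds.fasta ScaffoldComponents.txt Contigs.fasta OutputNumbers.txt; do
        if [ -s "$dir/$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

It would be called with the directory passed to `-o`, for example `check_ray_output output_directory`.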


### Useful Information

To test Ray's performance, we used three pairs of paired-end FASTQ input files: `small_1.fastq` and `small_2.fastq`, `medium_1.fastq` and `medium_2.fastq`, and `large_1.fastq` and `large_2.fastq`. Statistics about the input files, along with the time and memory used by Ray, are shown in the table below:
{{< readfile file="/static/html/ray.html" >}}