Commit a53bc884 authored by aknecht2

Updated workflow_module documentation to the updated format.

parent fc535dca
......@@ -173,68 +173,93 @@ In this snippet, we define a few things. The job to be run in this case is
bwa_align_single, which should line up with an existing WorkflowJob. For this
job we define all inputs and outputs required for the job. In general,
inputs and outputs are files, however some are strings or numeric arguments
that need to be loaded at run time. The param_name should match the name of
the parameter defined in the workflow_job yaml. Each key in inputs/outputs
defines the name of the file that will be used by the job. Output files will
be created with the specified file name as a postfix. Prefixes for output
files are generated based on the tools used on them up to the current point
in the workflow.
The job bwa_align_single aligns a single-end read fastq file to the target
genome. We need to pass both the reference genome and fastq file as inputs.
The additional genome indices need to be passed as additional_inputs since they
are not used as arguments but are still required as input from bwa. Finally,
we write a single output file, align.sai. Let's consider a similar job, this
time using bowtie2 and paired end reads:
that need to be loaded at run time. The param_name defined here should match
the name of the parameter defined in the workflow_job yaml exactly. Here's
a snippet from the bwa_align_single.yaml:
.. code-block:: yaml
- bowtie2_align_paired:
inputs:
- ref_genome_prefix:
type: string
- download_1.fastq:
type: file
- download_2.fastq:
type: file
additional_inputs:
- ref_genome:
type: file
- ref_genome.1.bt2:
type: file
- ref_genome.2.bt2:
type: file
- ref_genome.3.bt2:
type: file
- ref_genome.4.bt2:
type: file
- ref_genome.rev.1.bt2:
type: file
- ref_genome.rev.2.bt2:
type: file
outputs:
- align.sam:
type: file
- align.quality:
type: stderr
bwa_align_single:
inputs:
base_genome_file:
type: file
file_type: genome_index
fastq1:
type: file
file_type: fastq
genome.fna.amb:
type: file
file_type: bwa_genome
genome.fna.ann:
type: file
file_type: bwa_genome
genome.fna.bwt:
type: file
file_type: bwa_genome
genome.fna.pac:
type: file
file_type: bwa_genome
genome.fna.sa:
type: file
file_type: bwa_genome
outputs:
output_sai:
type: file
file_type: sai
We can see that each of the param_names defined in our workflow_module
corresponds to the name of each input / output in our bwa_align_single job.
The param_names are only used to correctly pass arguments through to the job.
The keys of each input/output dictionary defined in the workflow_module yaml
correspond to the actual names of the input files used, or the output files
that are created. For example, the output param_name for this job is
"output_sai" but the file that will actually be created is "align.sai".
To maintain uniqueness, output files are created with a prefix based on the
tools used on a file up to that point in the workflow, and the accession
numbers of the base files used. If we are using bwa_align_single on a fastq
file ENCFF000A.fastq.gz, the output sai file created will be
ENCFF000A_bwa_single_align.sai. Let's look at a similar job:
.. code-block:: yaml
bowtie2_align_paired:
inputs:
ref_genome_prefix:
param_name: genome_prefix
download_1.fastq:
param_name: fastq1
download_2.fastq:
param_name: fastq2
ref_genome:
param_name: base_genome_file
ref_genome.1.bt2:
param_name: genome.1.bt2
ref_genome.2.bt2:
param_name: genome.2.bt2
ref_genome.3.bt2:
param_name: genome.3.bt2
ref_genome.4.bt2:
param_name: genome.4.bt2
ref_genome.rev.1.bt2:
param_name: genome.rev.1.bt2
ref_genome.rev.2.bt2:
param_name: genome.rev.2.bt2
outputs:
align.sam:
param_name: output_sam
align.quality:
param_name: fastq_quality
bowtie2_align_paired aligns two paired-end read fastq files to the target
genome, and has a few interesting differences. Just as before we define all
the inputs, additional_inputs, and outputs required for the job. However,
in this case not all arguments are type=file. The first input
"ref_genome_prefix" is defined as type string. bowtie2 requires the prefix for
the genome to be passed in as an argument, not the main genome fasta file
itself. As such, this value needs to be calculated at run time based on which
genome is used for bowtie2.
The second output argument "align.quality" has type=stderr, which will redirect
stderr to a new file. This can be similarly used to redirect stdout. To sum
up:
| **Inputs** Can have a type=[file, string, numeric, list]
| **Additional Inputs** Can have a type=[file, list]
| **Outputs** Must have a type=file
|
the inputs and outputs required for the job. However, in this case not all
the arguments are file arguments. The first input defined "ref_genome_prefix"
is a string used to find all the necessary genomic files. In this case we have
two outputs defined: align.sam, the final aligned file, and align.quality, a file
that is redirected from stderr. To sum up, the keys define the names of the
generated files/arguments, and the param_names define how to pass these files and
arguments through to the jobs.
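The relationship between keys and param_names can be illustrated with a small
Python sketch. This is not the project's actual loader: the dictionaries are
hand-written, trimmed stand-ins for the two yaml files, and ``map_params`` is a
hypothetical helper introduced only for this example. It checks that every
param_name used in the workflow_module exists in the workflow_job definition,
then returns which file key is passed to which parameter.

```python
# Hypothetical, trimmed stand-in for the workflow_module entry (keys -> param_name).
module_job = {
    "inputs": {
        "ref_genome": {"param_name": "base_genome_file"},
        "download_1.fastq": {"param_name": "fastq1"},
    },
    "outputs": {
        "align.sai": {"param_name": "output_sai"},
    },
}

# Hypothetical, trimmed stand-in for bwa_align_single.yaml (param_name -> type info).
job_definition = {
    "inputs": {
        "base_genome_file": {"type": "file", "file_type": "genome_index"},
        "fastq1": {"type": "file", "file_type": "fastq"},
    },
    "outputs": {
        "output_sai": {"type": "file", "file_type": "sai"},
    },
}


def map_params(module_job, job_definition):
    """Return {section: {param_name: key}}, raising on an unknown param_name."""
    mapping = {}
    for section in ("inputs", "outputs"):
        mapping[section] = {}
        for key, info in module_job[section].items():
            param_name = info["param_name"]
            # The param_name must match the job definition exactly.
            if param_name not in job_definition[section]:
                raise KeyError(
                    f"{param_name!r} is not defined in the job's {section}"
                )
            mapping[section][param_name] = key
    return mapping


param_map = map_params(module_job, job_definition)
print(param_map["outputs"]["output_sai"])  # align.sai
```

A misspelled param_name raises a ``KeyError`` in this sketch, which mirrors the
requirement that param_names match the workflow_job yaml exactly.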
Logical Splits (Markers)
^^^^^^^^^^^^^^^^^^^^^^^^^
......@@ -331,24 +356,23 @@ bwa_align_single:
- bwa_align_single:
inputs:
- ref_genome:
type: file
- download_1.fastq:
type: file
additional_inputs:
- ref_genome.amb:
type: file
- ref_genome.ann:
type: file
- ref_genome.bwt:
type: file
- ref_genome.pac:
type: file
- ref_genome.sa:
type: file
ref_genome:
param_name: base_genome_file
download_1.fastq:
param_name: fastq1
ref_genome.amb:
param_name: genome.fna.amb
ref_genome.ann:
param_name: genome.fna.ann
ref_genome.bwt:
param_name: genome.fna.bwt
ref_genome.pac:
param_name: genome.fna.pac
ref_genome.sa:
param_name: genome.fna.sa
outputs:
- align.sai:
type: file
align.sai:
param_name: output_sai
bwa_sai_to_sam:
......@@ -356,26 +380,25 @@ bwa_sai_to_sam:
- bwa_sai_to_sam:
inputs:
- ref_genome:
type: file
- align.sai:
type: file
- download_1.fastq:
type: file
additional_inputs:
- ref_genome.amb:
type: file
- ref_genome.ann:
type: file
- ref_genome.bwt:
type: file
- ref_genome.pac:
type: file
- ref_genome.sa:
type: file
ref_genome:
param_name: base_genome_file
align.sai:
param_name: input_sai
download_1.fastq:
param_name: base_fastq
ref_genome.amb:
param_name: genome.fna.amb
ref_genome.ann:
param_name: genome.fna.ann
ref_genome.bwt:
param_name: genome.fna.bwt
ref_genome.pac:
param_name: genome.fna.pac
ref_genome.sa:
param_name: genome.fna.sa
outputs:
- align.sam:
type: file
align.sam:
param_name: output_sam
and samtools_sam_to_bam:
......@@ -383,33 +406,25 @@ and samtools_sam_to_bam:
- samtools_sam_to_bam:
inputs:
- align.sam:
type: file
additional_inputs: null
align.sam:
param_name: align_sam
outputs:
- align.bam:
type: file
final_result: true
align.bam:
param_name: align_bam
final_result: true
The outputs of each step chain into the inputs of the next step for these jobs.
bwa_align_single creates an align.sai file which is used as an input for
bwa_sai_to_sam. bwa_sai_to_sam creates an align.sam file which is used as an
input for samtools_sam_to_bam. When creating jobs for a module, all inputs,
additional_inputs, and outputs for the entire module must be passed in.
In this case, we would need to pass in 2 inputs:
In this case, we would need to pass in 7 inputs:
.. code-block:: python
{
"ref_genome": "file-name",
"download_1.fastq": "file-name"
}
5 additional inputs:
.. code-block:: python
{
"download_1.fastq": "file-name",
"ref_genome.amb": "file-name",
"ref_genome.ann": "file-name",
"ref_genome.bwt": "file-name",
......
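The count of 7 inputs can be reproduced with a short Python sketch. It assumes
that a file produced by an earlier job in the module does not have to be
supplied from outside; the job lists below are copied by hand from the three
yaml snippets above, and ``external_inputs`` is a hypothetical helper, not part
of the project's API.

```python
# Index files shared by the two bwa jobs.
BWA_INDEX = [
    "ref_genome.amb",
    "ref_genome.ann",
    "ref_genome.bwt",
    "ref_genome.pac",
    "ref_genome.sa",
]

# Hand-copied keys from the module yaml, in execution order.
module_jobs = [
    ("bwa_align_single", {
        "inputs": ["ref_genome", "download_1.fastq"] + BWA_INDEX,
        "outputs": ["align.sai"],
    }),
    ("bwa_sai_to_sam", {
        "inputs": ["ref_genome", "align.sai", "download_1.fastq"] + BWA_INDEX,
        "outputs": ["align.sam"],
    }),
    ("samtools_sam_to_bam", {
        "inputs": ["align.sam"],
        "outputs": ["align.bam"],
    }),
]


def external_inputs(jobs):
    """List inputs that no earlier job in the module produces, in first-use order."""
    produced = set()
    required = []
    for _name, job in jobs:
        for key in job["inputs"]:
            if key not in produced and key not in required:
                required.append(key)
        # Outputs become available to every later job in the module.
        produced.update(job["outputs"])
    return required


needed = external_inputs(module_jobs)
print(len(needed))  # 7
```

Here align.sai and align.sam are excluded because they are created inside the
module, leaving the reference genome, the fastq file, and the five index files.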