Commit e97eed13 authored by aknecht2

Added initial full documentation for workflow_module, subject to revision.

parent 579c35f8
@@ -22,6 +22,11 @@ class WorkflowModule(object):
:type workflow_jobs: dict
:param debug: A flag to print additional info.
:type debug: bool
The workflow module handles the inner workings of an individual module
yaml file. Provided with the correct workflow_jobs and a list of
files, the workflow module creates all jobs in the correct order for
the target module.
"""
def __init__(self, module_yaml, workflow_jobs, debug = False):
@@ -52,8 +57,10 @@
def get_job_list(self, markers):
"""
:param markers: The splits to take for the current module.
:type markers: dict
:returns: A list of all jobs that will be run for this workflow with
the provided markers.
This function returns the list of jobs that will be run provided
the markers. If markers is not specified, it simply returns the
@@ -169,6 +176,7 @@ class WorkflowModule(object):
:type markers: dict
:param job_name: The name of the job to get params for.
:type job_name: str
:returns: A dictionary containing inputs, additional_inputs, and outputs.
Returns the inputs, additional_inputs, and outputs defined for the
specified job & marker combination from the module. The values returned
@@ -185,8 +193,8 @@ class WorkflowModule(object):
"""
:param markers: The input markers.
:type markers: dict
:returns: The names of the jobs that will be run for this module with
the provided markers.
"""
job_list = self.get_job_list(markers)
job_names = []
@@ -205,13 +213,23 @@ class WorkflowModule(object):
:param markers: The splits to take for the current module.
:type markers: dict
:param inputs: A dictionary mapping the logical file names defined in
the workflow module -> the full file name.
:type inputs: dict
:param additional_inputs: A dictionary mapping the logical file names
defined in the workflow module -> the full file name.
:type additional_inputs: dict
:param outputs: A dictionary mapping the logical file names defined in
the workflow module -> the full file name.
:type outputs: dict
:returns: None
Adds all the jobs in the correct order for the current workflow.
The inputs, additional_inputs, and outputs need to be passed in
for all jobs in the module. Consider a small example of three jobs
run back to back, with the output of each piped into the input of
the next. In that case we only need to pass in a single input,
because the inputs of the later jobs use the outputs of the previous
jobs. We still need to pass in three outputs, one for each job.
"""
valid, msg = self._check_params(master_files, markers, inputs, additional_inputs, outputs)
if valid:
@@ -380,9 +398,9 @@ class WorkflowModule(object):
"""
:param markers: The splits to take for the current workflow.
:type markers: dict
:returns: A list of all output files marked with final_result: True
Gets all the final_results for the workflow with the target markers.
That is, all the outputs that are marked with final_result: True.
"""
job_list = self.get_job_list(markers)
final_results = []
......
Workflow Module
=================
The structure of workflow modules is defined in yaml files, which are loaded
through the :py:class:`~chipathlon.workflow_module.WorkflowModule` class.
The module yaml files can be found in chipathlon/jobs/modules/.
Here is the align.yaml module in full. This module serves as a good example
for talking through the features workflow modules have.

.. code-block:: yaml

    align:
      - bwa[tool]:
          - single[read_end]:
              - bwa_align_single:
                  inputs:
                    - ref_genome:
                        type: file
                    - download_1.fastq:
                        type: file
                  additional_inputs:
                    - ref_genome.amb:
                        type: file
                    - ref_genome.ann:
                        type: file
                    - ref_genome.bwt:
                        type: file
                    - ref_genome.pac:
                        type: file
                    - ref_genome.sa:
                        type: file
                  outputs:
                    - align.sai:
                        type: file
              - bwa_sai_to_sam:
                  inputs:
                    - ref_genome:
                        type: file
                    - align.sai:
                        type: file
                    - download_1.fastq:
                        type: file
                  additional_inputs:
                    - ref_genome.amb:
                        type: file
                    - ref_genome.ann:
                        type: file
                    - ref_genome.bwt:
                        type: file
                    - ref_genome.pac:
                        type: file
                    - ref_genome.sa:
                        type: file
                  outputs:
                    - align.sam:
                        type: file
          - paired[read_end]:
              - bwa_align_paired:
                  inputs:
                    - ref_genome:
                        type: file
                    - download_1.fastq:
                        type: file
                    - download_2.fastq:
                        type: file
                  additional_inputs:
                    - ref_genome.amb:
                        type: file
                    - ref_genome.ann:
                        type: file
                    - ref_genome.bwt:
                        type: file
                    - ref_genome.pac:
                        type: file
                    - ref_genome.sa:
                        type: file
                  outputs:
                    - align.sam:
                        type: stdout
      - bowtie2[tool]:
          - single[read_end]:
              - bowtie2_align_single:
                  inputs:
                    - ref_genome_prefix:
                        type: string
                    - download_1.fastq:
                        type: file
                  additional_inputs:
                    - ref_genome:
                        type: file
                    - ref_genome.1.bt2:
                        type: file
                    - ref_genome.2.bt2:
                        type: file
                    - ref_genome.3.bt2:
                        type: file
                    - ref_genome.4.bt2:
                        type: file
                    - ref_genome.rev.1.bt2:
                        type: file
                    - ref_genome.rev.2.bt2:
                        type: file
                  outputs:
                    - align.sam:
                        type: file
                    - align.quality:
                        type: stderr
          - paired[read_end]:
              - bowtie2_align_paired:
                  inputs:
                    - ref_genome_prefix:
                        type: string
                    - download_1.fastq:
                        type: file
                    - download_2.fastq:
                        type: file
                  additional_inputs:
                    - ref_genome:
                        type: file
                    - ref_genome.1.bt2:
                        type: file
                    - ref_genome.2.bt2:
                        type: file
                    - ref_genome.3.bt2:
                        type: file
                    - ref_genome.4.bt2:
                        type: file
                    - ref_genome.rev.1.bt2:
                        type: file
                    - ref_genome.rev.2.bt2:
                        type: file
                  outputs:
                    - align.sam:
                        type: file
                    - align.quality:
                        type: stderr
      - samtools_sam_to_bam:
          inputs:
            - align.sam:
                type: file
          additional_inputs: null
          outputs:
            - align.bam:
                type: file
                final_result: true

Job Markup
^^^^^^^^^^^
Ignoring everything else, let's first look at an individual job:
.. code-block:: yaml

    - bwa_align_single:
        inputs:
          - ref_genome:
              type: file
          - download_1.fastq:
              type: file
        additional_inputs:
          - ref_genome.amb:
              type: file
          - ref_genome.ann:
              type: file
          - ref_genome.bwt:
              type: file
          - ref_genome.pac:
              type: file
          - ref_genome.sa:
              type: file
        outputs:
          - align.sai:
              type: file

In this snippet, we define a few things. The job to be run in this case is
bwa_align_single, which should line up with an existing WorkflowJob. For
this job, we define all the inputs, additional_inputs, and outputs required for
the job. Inputs are files or dynamic arguments required by a job. Arguments are
normally loaded through the workflow job; however, some arguments need to be
calculated and passed in at runtime. Additional inputs are files required by a
job but not explicitly referenced as arguments. These files still need to be
marked as inputs for Pegasus. Finally, outputs are all the files created by
the job.
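To make the distinction concrete, here is a minimal sketch of how a job's
inputs, additional_inputs, and outputs might be wired into a Pegasus job.
This is illustrative only, not chipathlon's actual code; it assumes the
Pegasus DAX3 Python API, and the file names are placeholders.

.. code-block:: python

    from Pegasus.DAX3 import File, Job, Link

    job = Job("bwa_align_single")

    # Inputs: referenced on the command line AND registered for staging.
    ref_genome = File("ref_genome.fa")
    fastq = File("download_1.fastq")
    job.addArguments(ref_genome, fastq)
    job.uses(ref_genome, link=Link.INPUT)
    job.uses(fastq, link=Link.INPUT)

    # Additional inputs: never added as arguments, but Pegasus still needs to
    # know about them so the files get staged alongside the job.
    for index_name in ["ref_genome.fa.amb", "ref_genome.fa.ann", "ref_genome.fa.bwt"]:
        job.uses(File(index_name), link=Link.INPUT)

    # Outputs: files created by the job, registered so later jobs can use them.
    align_sai = File("align.sai")
    job.uses(align_sai, link=Link.OUTPUT, transfer=True)
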
The job bwa_align_single aligns a single-end read fastq file to the target
genome. We need to pass both the reference genome and the fastq file as inputs.
The genome indices need to be passed as additional_inputs since they
are not used as arguments but are still required as input by bwa. Finally,
we write a single output file, align.sai. Let's consider a similar job, this
time using bowtie2 and paired-end reads:
.. code-block:: yaml

    - bowtie2_align_paired:
        inputs:
          - ref_genome_prefix:
              type: string
          - download_1.fastq:
              type: file
          - download_2.fastq:
              type: file
        additional_inputs:
          - ref_genome:
              type: file
          - ref_genome.1.bt2:
              type: file
          - ref_genome.2.bt2:
              type: file
          - ref_genome.3.bt2:
              type: file
          - ref_genome.4.bt2:
              type: file
          - ref_genome.rev.1.bt2:
              type: file
          - ref_genome.rev.2.bt2:
              type: file
        outputs:
          - align.sam:
              type: file
          - align.quality:
              type: stderr

bowtie2_align_paired aligns two paired-end read fastq files to the target
genome, and has a few interesting differences. Just as before, we define all
the inputs, additional_inputs, and outputs required for the job. However,
in this case not all arguments are type=file. The first input
"ref_genome_prefix" is defined as type string. bowtie2 requires the prefix of
the genome to be passed in as an argument, not the main genome fasta file
itself. As such, this value needs to be calculated at runtime based on which
genome is used for bowtie2.
The second output "align.quality" has type=stderr, which will redirect
stderr to a new file. stdout can be redirected the same way with type=stdout.
A sketch of how these output types might map onto a Pegasus job follows the
summary below. To sum up:
| **Inputs** Can have a type=[file, string, numeric, list]
| **Additional Inputs** Can have a type=[file, list]
| **Outputs** Must have a type=file
|
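As a rough illustration of the output types, the sketch below shows how a
type=file output and a type=stderr output could both be attached to a
Pegasus job. Again, this assumes the Pegasus DAX3 Python API and is not the
module's actual implementation.

.. code-block:: python

    from Pegasus.DAX3 import File, Job, Link

    job = Job("bowtie2_align_paired")

    align_sam = File("align.sam")          # type: file   -> written directly by the tool
    align_quality = File("align.quality")  # type: stderr -> captured from the job's stderr

    # Both files are tracked as outputs so later steps can consume them.
    job.uses(align_sam, link=Link.OUTPUT, transfer=True)
    job.uses(align_quality, link=Link.OUTPUT, transfer=True)

    # A type=stderr output additionally redirects the stream to the file;
    # a type=stdout output would use job.setStdout(...) in the same way.
    job.setStderr(align_quality)
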
Logical Splits (Markers)
^^^^^^^^^^^^^^^^^^^^^^^^^
When executing a particular module, sometimes there are multiple paths that
lead to the same result. In the case of the alignment module, we need to run
different jobs depending on whether we use bwa or bowtie2, and whether or not
we are using paired-end reads. However, we always start from fastq files and
always end with a bam file. A module should represent a single logical unit
of a workflow, and markers allow for handling edge cases within an individual
module. The markup for defining these markers can be seen in the first few
lines of the align yaml file:
.. code-block:: yaml

    align:
      - bwa[tool]:
          - single[read_end]:

The square brackets denote the name of the marker; the value of the marker is
the string outside the brackets. So, if we have tool=bwa we
should follow the list of jobs defined under bwa[tool], and if we then have
read_end=single, we should follow the list of jobs defined under
single[read_end].
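The selection logic can be pictured as a recursive walk over the parsed yaml:
at each marker entry, descend only into the branch whose value matches the
provided markers, and collect plain job entries along the way. The sketch
below is a simplified, hypothetical version of that traversal, not the actual
implementation in :py:class:`~chipathlon.workflow_module.WorkflowModule`.

.. code-block:: python

    import re

    def select_jobs(entries, markers):
        """Collect job entries, following only the marker branches that match."""
        jobs = []
        for entry in entries:
            if isinstance(entry, dict):
                for key, value in entry.items():
                    marker = re.match(r"^(.+)\[(.+)\]$", key)
                    if marker:
                        marker_value, marker_name = marker.groups()
                        # Only descend into branches selected by the markers.
                        if markers.get(marker_name) == marker_value:
                            jobs.extend(select_jobs(value, markers))
                    else:
                        # A normal job entry with its inputs/outputs definition.
                        jobs.append({key: value})
            else:
                # A bare job name with no parameter markup.
                jobs.append(entry)
        return jobs

    # With a simplified, names-only version of the align module:
    align = [
        {"bwa[tool]": [
            {"single[read_end]": ["bwa_align_single", "bwa_sai_to_sam"]},
            {"paired[read_end]": ["bwa_align_paired"]},
        ]},
        {"bowtie2[tool]": [
            {"single[read_end]": ["bowtie2_align_single"]},
            {"paired[read_end]": ["bowtie2_align_paired"]},
        ]},
        "samtools_sam_to_bam",
    ]
    print(select_jobs(align, {"tool": "bwa", "read_end": "single"}))
    # ['bwa_align_single', 'bwa_sai_to_sam', 'samtools_sam_to_bam']
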
Splits can be taken at any point in a workflow, and common jobs can be included
before or after any markers. In the case of the align module, the
samtools_sam_to_bam job is always run no matter what the markers are.
Indentation becomes an important identifier for which markers you are currently
in. Here's the full align.yaml including only job names, with no params, to
help visualize what's happening:
.. code-block:: yaml

    align:
      - bwa[tool]:
          - single[read_end]:
              - bwa_align_single
              - bwa_sai_to_sam
          - paired[read_end]:
              - bwa_align_paired
      - bowtie2[tool]:
          - single[read_end]:
              - bowtie2_align_single
          - paired[read_end]:
              - bowtie2_align_paired
      - samtools_sam_to_bam

It becomes easy to see here that the samtools_sam_to_bam job is at the same
level as the bwa[tool] and bowtie2[tool] entries. In the case that we are
using bwa and have single-end reads, we will run a total of three jobs:
bwa_align_single, bwa_sai_to_sam, then samtools_sam_to_bam. If we are using
bowtie2 and have paired-end reads, we will run a total of two jobs:
bowtie2_align_paired, samtools_sam_to_bam. Remember, when provided markers
we only follow the list of jobs that match those markers.
Keeping a job at the same level of indentation without any markers will cause
it to always run. You can do this before or after any markers, at any level
in the module. We could add an additional job that always runs when
tool=bwa, regardless of the read_end type, like this:
.. code-block:: yaml

    align:
      - bwa[tool]:
          - this_job_will_always_run
          - single[read_end]:
              - bwa_align_single
              - bwa_sai_to_sam
          - paired[read_end]:
              - bwa_align_paired

Keep in mind, whenever you are creating jobs for a module during workflow
generation, you will need to pass in all the markers required for that module.
In this case, we will need to pass in a dictionary that looks like:
.. code-block:: python

    {
        "tool": "bwa",
        "read_end": "paired"
    }

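For reference, loading the module and asking for its job list might look
roughly like the following. The constructor and get_job_list signatures come
from the class shown above; the yaml path, and the assumption that module_yaml
is a path and workflow_jobs a dict of loaded WorkflowJob objects keyed by
name, are illustrative.

.. code-block:: python

    from chipathlon.workflow_module import WorkflowModule

    # Placeholder; normally this holds the loaded WorkflowJob objects by name.
    workflow_jobs = {}

    align_module = WorkflowModule("chipathlon/jobs/modules/align.yaml", workflow_jobs)

    markers = {"tool": "bwa", "read_end": "paired"}
    job_list = align_module.get_job_list(markers)
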
Dependency Chaining
^^^^^^^^^^^^^^^^^^^^^
Outputs from previous jobs can be used as inputs / additional_inputs for future
steps without additional effort. Each of the files defined in the module
should have a unique name. This unique name can then be used in later steps
of the same module. Let's again consider the align module, and let's assume
we are using tool=bwa and read_end=single. We will run a total of three jobs,
bwa_align_single:
.. code-block:: yaml

    - bwa_align_single:
        inputs:
          - ref_genome:
              type: file
          - download_1.fastq:
              type: file
        additional_inputs:
          - ref_genome.amb:
              type: file
          - ref_genome.ann:
              type: file
          - ref_genome.bwt:
              type: file
          - ref_genome.pac:
              type: file
          - ref_genome.sa:
              type: file
        outputs:
          - align.sai:
              type: file

bwa_sai_to_sam:
.. code-block:: yaml

    - bwa_sai_to_sam:
        inputs:
          - ref_genome:
              type: file
          - align.sai:
              type: file
          - download_1.fastq:
              type: file
        additional_inputs:
          - ref_genome.amb:
              type: file
          - ref_genome.ann:
              type: file
          - ref_genome.bwt:
              type: file
          - ref_genome.pac:
              type: file
          - ref_genome.sa:
              type: file
        outputs:
          - align.sam:
              type: file

and samtools_sam_to_bam:
.. code-block:: yaml

    - samtools_sam_to_bam:
        inputs:
          - align.sam:
              type: file
        additional_inputs: null
        outputs:
          - align.bam:
              type: file
              final_result: true

The outputs of each step chain into the inputs of the next step for these jobs.
bwa_align_single creates an align.sai file which is used as an input for
bwa_sai_to_sam. bwa_sai_to_sam creates an align.sam file which is used as an
input for samtools_sam_to_bam. When creating jobs for a module, all inputs,
additional_inputs, and outputs for the entire module must be passed in.
In this case, we would need to pass in 2 inputs:
.. code-block:: python

    {
        "ref_genome": "file-name",
        "download_1.fastq": "file-name"
    }

5 additional inputs:
.. code-block:: python

    {
        "ref_genome.amb": "file-name",
        "ref_genome.ann": "file-name",
        "ref_genome.bwt": "file-name",
        "ref_genome.pac": "file-name",
        "ref_genome.sa": "file-name"
    }

and 3 outputs:
.. code-block:: python

    {
        "align.sai": "file-name",
        "align.sam": "file-name",
        "align.bam": "file-name"
    }

We don't need to include align.sai or align.sam as inputs because they can
be chained from previous outputs. The bwa_align_single and bwa_sai_to_sam
jobs also share many of the same input files since they reference the same
unique file names. Using the same name for outputs can cause errors though,
as files can get overwritten.
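One way to picture the chaining is that, when a job's logical file names are
resolved, anything produced by an earlier job in the same module is looked up
first, and only then do we fall back to the dictionaries passed in by the
caller. The helper below is purely illustrative; chipathlon's actual
resolution happens inside WorkflowModule, and the names produced_outputs and
supplied_inputs are hypothetical.

.. code-block:: python

    def resolve_file(logical_name, produced_outputs, supplied_inputs):
        """Map a job's logical file name to a full file name.

        produced_outputs: logical name -> full name for outputs of earlier
                          jobs in this module (e.g. "align.sai", "align.sam").
        supplied_inputs:  logical name -> full name passed in by the caller.
        """
        if logical_name in produced_outputs:
            return produced_outputs[logical_name]  # chained from a previous job
        return supplied_inputs[logical_name]       # must be provided up front

    # bwa_sai_to_sam needs align.sai, which bwa_align_single already produced:
    produced = {"align.sai": "run1/align.sai"}
    supplied = {"ref_genome": "genomes/ref_genome.fa",
                "download_1.fastq": "run1/download_1.fastq"}
    print(resolve_file("align.sai", produced, supplied))         # run1/align.sai
    print(resolve_file("download_1.fastq", produced, supplied))  # run1/download_1.fastq
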
Workflow Module Class
^^^^^^^^^^^^^^^^^^^^^^
.. autoclass:: chipathlon.workflow_module.WorkflowModule
    :members: