Commit 404e9d87 authored by aknecht2

Started updating workflow_job & workflow_module documentation.

parent edb538c3
understand their structure. Here's the bedtools_bam_to_bed.yaml in full:

.. code-block:: yaml

    bedtools_bam_to_bed:
      inputs:
        input_bam:
          type: file
          file_type: bam
      outputs:
        output_bed:
          type: stdout
          file_type: bed
      command: bedtools
            changeable: false
            required: true
            has_value: true
            default: $input_bam
      walltime: 60
      memory: 2000
      cores: 1
      nodes: 1
We can think of this file in three major sections: inputs & outputs, arguments,
and resources.
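As a rough illustration of those three sections, the parsed form of such a file is just nested dictionaries. The following sketch uses a hypothetical `validate_job` helper (not chipathlon's actual loader) to check that a job definition has the expected keys:

```python
# Minimal sketch (not chipathlon's real validator): check that a parsed
# workflow job definition has the expected sections.
def validate_job(name, job):
    """Return a list of problems found in one job definition."""
    problems = []
    # inputs & outputs section
    for section in ("inputs", "outputs"):
        if not isinstance(job.get(section), dict):
            problems.append(f"{name}: missing {section}")
    # command and resources section
    for key in ("command", "walltime", "memory", "cores", "nodes"):
        if key not in job:
            problems.append(f"{name}: missing {key}")
    return problems

# bedtools_bam_to_bed.yaml above, as a YAML parser would produce it:
job = {
    "inputs": {"input_bam": {"type": "file", "file_type": "bam"}},
    "outputs": {"output_bed": {"type": "stdout", "file_type": "bed"}},
    "command": "bedtools",
    "walltime": 60,
    "memory": 2000,
    "cores": 1,
    "nodes": 1,
}
print(validate_job("bedtools_bam_to_bed", job))  # []
```

A real loader would also validate the argument list, but the dictionary shape is the same.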
Inputs & Outputs
^^^^^^^^^^^^^^^^^^
Each input and output should be defined along with their type and file_type if
applicable. These type definitions can be found in the configuration file
chipathlon/conf.py. Inputs are not necessarily required to be files -- in some
cases you will need to pass in a prefix or other argument that is calculated
at run time. Outputs are always files and represent the files created by this
particular job. These inputs and outputs are passed into the workflow job at
generation time and can be used in arguments, though they do not have to be.
Some input or output files need to be included but are not explicitly
referenced by the arguments. Take bowtie2_align_single.yaml for example:
.. code-block:: yaml

    bowtie2_align_single:
      inputs:
        genome_prefix:
          type: string
        fastq1:
          type: file
          file_type: fastq
        base_genome_file:
          type: file
          file_type: genome_index
        genome.1.bt2:
          type: file
          file_type: bowtie2_genome
        genome.2.bt2:
          type: file
          file_type: bowtie2_genome
        genome.3.bt2:
          type: file
          file_type: bowtie2_genome
        genome.4.bt2:
          type: file
          file_type: bowtie2_genome
        genome.rev.1.bt2:
          type: file
          file_type: bowtie2_genome
        genome.rev.2.bt2:
          type: file
          file_type: bowtie2_genome
      outputs:
        output_sam:
          type: file
          file_type: sam
        fastq_quality:
          type: stderr
          file_type: quality
      command: bowtie2
      arguments:
        - "-x":
            type: string
            changeable: false
            required: true
            has_value: true
            default: $genome_prefix
        - "-U":
            type: file
            changeable: false
            required: true
            has_value: true
            default: $fastq1
        - "-S":
            type: file
            changeable: false
            required: true
            has_value: true
            default: $output_sam
        - "-p":
            type: numeric
            changeable: true
            required: false
            has_value: true
            default: 8
      walltime: 1440
      memory: 8000
      cores: 8
      nodes: 1
Inputs are defined for each of the required genome index files, but only the
prefix is used as an argument. All files used by a particular job must be
included so that Pegasus can appropriately transfer them. In some cases,
jobs need a list of inputs defined for them. Here is a snippet from
gem_callpeak.yaml:
.. code-block:: yaml

    inputs:
      prefix:
        type: string
      chrom.sizes:
        type: file
        file_type: chrom_sizes
      control.bed:
        type: file
        file_type: bed
      signal.bed:
        type: file
        file_type: bed
      genome:
        type: string
      chr_fasta:
        type: file_list
        file_type: chr_fasta
The gem peak caller requires including all individual chromosome fasta files.
The number of chromosomes changes depending on the organism, so it is
impossible to manually define an input for each individual chromosome file.
In this case, we use the file_list type to pass in any number of files
for a particular input.
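At generation time, a file_list input boils down to one argument carrying several values. The helper below is a hypothetical simplification of that expansion, not chipathlon's actual code:

```python
# Sketch (hypothetical helper): a file_list input holds many file names
# for one argument, so it expands to a space-separated value, while a
# plain file input passes through unchanged.
def expand_file_list(value):
    if isinstance(value, list):
        return " ".join(value)
    return value

chr_fasta = ["chr1.fa", "chr2.fa", "chrX.fa"]
print(expand_file_list(chr_fasta))    # chr1.fa chr2.fa chrX.fa
print(expand_file_list("genome.fa"))  # genome.fa
```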
Outputs can be redirected from stdout or stderr by setting their type to
stdout or stderr respectively. You can see this in bowtie2_align_single.yaml:
.. code-block:: yaml

    outputs:
      output_sam:
        type: file
        file_type: sam
      fastq_quality:
        type: stderr
        file_type: quality
Here, we redirect stderr into the fastq_quality output. Note that the actual
names of created files are not handled by the workflow_job; they are based on
the names passed in from the workflow modules.
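At execution time, a stderr-type output amounts to opening the target file and pointing the process's stderr at it. This is a sketch of that mechanism only; the file name and the fake "aligner" command are illustrative, not chipathlon's real runner:

```python
# Sketch: redirect a subprocess's stderr into an output file, the way a
# stderr-type output is captured. The child process here just stands in
# for a tool (like bowtie2) that reports quality info on stderr.
import os
import subprocess
import sys
import tempfile

path = os.path.join(tempfile.mkdtemp(), "align.quality")
with open(path, "w") as quality:
    subprocess.run(
        [sys.executable, "-c", "import sys; sys.stderr.write('20 reads; ok\\n')"],
        stderr=quality,  # type: stderr -> redirect into the output file
        check=True,
    )
with open(path) as fh:
    print(fh.read().strip())  # 20 reads; ok
```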
Finally, the command definition is the actual executable to be run. It is
expected that chipathlon runs out of the pre-built conda environment, so any
executable installed in your conda environment will be runnable. You can
also modify your sites.xml file to add more path variables.
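Since each command must be resolvable on the environment's PATH, a quick preflight check can confirm that the tools your jobs name actually exist. This is a generic sketch, not part of chipathlon:

```python
# Sketch: confirm each job's command is on PATH (e.g. inside the conda
# env) before building a workflow around it.
import shutil

for command in ("bedtools", "bowtie2", "samtools"):
    located = shutil.which(command)
    print(f"{command} -> {located or 'NOT FOUND'}")
```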
Arguments
^^^^^^^^^^^
Each argument that is passed on the command line to an executable needs to be
defined here. This includes all positional and keyword arguments as well as
their values. The arguments are defined as a list to maintain order,
and will be added in the order you define them in. Even though order does not
matter for keyword arguments, we maintain a list structure to support both
positional and keyword arguments. Each argument is defined as a dictionary
with a few properties. These properties determine how the arguments are
actually added to the command. Here's a snippet from cp.yaml:
.. code-block:: yaml

    arguments:
      - "$input_file":
          type: file
          changeable: false
          required: true
          has_value: false
      - "$output_file":
          type: file
          changeable: false
          required: true
          has_value: false
In this case the cp.yaml file is a mapper to the cp command. The cp command
takes two arguments, an input file and an output location. So, here we define
our two arguments to pass in: "$input_file" and "$output_file". In the case
of positional arguments like these, the root key of the dictionary is the
value passed on the command line. In the case of keyword arguments, the root
key of the dictionary is the key passed on the command line, and must have
a value provided. Here's a snippet from bwa_align_single.yaml:
.. code-block:: yaml

    - "-q":
        type: numeric
        changeable: true
        required: false
        has_value: true
        default: 5
    - "-l":
        type: numeric
        changeable: true
        required: false
        has_value: true
        default: 32
    - "-k":
        type: numeric
        changeable: true
        required: false
        has_value: true
        default: 2
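The way these definitions turn into an actual command line can be sketched as follows. `build_argv` and `resolve` are hypothetical simplifications (they ignore changeable/required handling), not chipathlon's real code:

```python
# Sketch: render an argument list into an argv. $-prefixed tokens refer
# to the job's inputs/outputs; anything else is a literal value.
def resolve(token, values):
    if isinstance(token, str) and token.startswith("$"):
        return str(values[token[1:]])
    return str(token)

def build_argv(command, arguments, values):
    argv = [command]
    for arg in arguments:
        (key, props), = arg.items()
        if props["has_value"]:
            # keyword argument: the key, then its (possibly resolved) value
            argv += [key, resolve(props["default"], values)]
        else:
            # positional argument: the key itself is the value
            argv.append(resolve(key, values))
    return argv

# The two positional arguments from cp.yaml:
cp_args = [
    {"$input_file": {"type": "file", "changeable": False,
                     "required": True, "has_value": False}},
    {"$output_file": {"type": "file", "changeable": False,
                      "required": True, "has_value": False}},
]
print(build_argv("cp", cp_args, {"input_file": "a.bam", "output_file": "b.bam"}))
# ['cp', 'a.bam', 'b.bam']

# A keyword argument like "-q" from bwa_align_single.yaml:
kw_args = [{"-q": {"type": "numeric", "changeable": True,
                   "required": False, "has_value": True, "default": 5}}]
print(build_argv("bwa", kw_args, {}))
# ['bwa', '-q', '5']
```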
For each argument, you should define four
properties: type, changeable, required, and has_value, and can optionally
.. code-block:: yaml

    - single[read_end]:
        - bwa_align_single:
            inputs:
              ref_genome:
                param_name: base_genome_file
              download_1.fastq:
                param_name: fastq1
              ref_genome.amb:
                param_name: genome.fna.amb
              ref_genome.ann:
                param_name: genome.fna.ann
              ref_genome.bwt:
                param_name: genome.fna.bwt
              ref_genome.pac:
                param_name: genome.fna.pac
              ref_genome.sa:
                param_name: genome.fna.sa
            outputs:
              align.sai:
                param_name: output_sai
        - bwa_sai_to_sam:
            inputs:
              ref_genome:
                param_name: base_genome_file
              align.sai:
                param_name: input_sai
              download_1.fastq:
                param_name: base_fastq
              ref_genome.amb:
                param_name: genome.fna.amb
              ref_genome.ann:
                param_name: genome.fna.ann
              ref_genome.bwt:
                param_name: genome.fna.bwt
              ref_genome.pac:
                param_name: genome.fna.pac
              ref_genome.sa:
                param_name: genome.fna.sa
            outputs:
              align.sam:
                param_name: output_sam
    - paired[read_end]:
        - bwa_align_paired:
            inputs:
              ref_genome:
                param_name: base_genome_file
              download_1.fastq:
                param_name: fastq1
              download_2.fastq:
                param_name: fastq2
              ref_genome.amb:
                param_name: genome.fna.amb
              ref_genome.ann:
                param_name: genome.fna.ann
              ref_genome.bwt:
                param_name: genome.fna.bwt
              ref_genome.pac:
                param_name: genome.fna.pac
              ref_genome.sa:
                param_name: genome.fna.sa
            outputs:
              align.sai:
                param_name: output_sai
    - bowtie2[tool]:
        - single[read_end]:
            - bowtie2_align_single:
                inputs:
                  ref_genome_prefix:
                    param_name: genome_prefix
                  download_1.fastq:
                    param_name: fastq1
                  ref_genome:
                    param_name: base_genome_file
                  ref_genome.1.bt2:
                    param_name: genome.1.bt2
                  ref_genome.2.bt2:
                    param_name: genome.2.bt2
                  ref_genome.3.bt2:
                    param_name: genome.3.bt2
                  ref_genome.4.bt2:
                    param_name: genome.4.bt2
                  ref_genome.rev.1.bt2:
                    param_name: genome.rev.1.bt2
                  ref_genome.rev.2.bt2:
                    param_name: genome.rev.2.bt2
                outputs:
                  align.sam:
                    param_name: output_sam
                  align.quality:
                    param_name: fastq_quality
        - paired[read_end]:
            - bowtie2_align_paired:
                inputs:
                  ref_genome_prefix:
                    param_name: genome_prefix
                  download_1.fastq:
                    param_name: fastq1
                  download_2.fastq:
                    param_name: fastq2
                  ref_genome:
                    param_name: base_genome_file
                  ref_genome.1.bt2:
                    param_name: genome.1.bt2
                  ref_genome.2.bt2:
                    param_name: genome.2.bt2
                  ref_genome.3.bt2:
                    param_name: genome.3.bt2
                  ref_genome.4.bt2:
                    param_name: genome.4.bt2
                  ref_genome.rev.1.bt2:
                    param_name: genome.rev.1.bt2
                  ref_genome.rev.2.bt2:
                    param_name: genome.rev.2.bt2
                outputs:
                  align.sam:
                    param_name: output_sam
                  align.quality:
                    param_name: fastq_quality
    - samtools_sam_to_bam:
        inputs:
          align.sam:
            param_name: align_sam
        outputs:
          align.bam:
            param_name: align_bam
            final_result: true
Job Markup
^^^^^^^^^^^
Ignoring everything else, let's first look at an individual job:
.. code-block:: yaml

    - bwa_align_single:
        inputs:
          ref_genome:
            param_name: base_genome_file
          download_1.fastq:
            param_name: fastq1
          ref_genome.amb:
            param_name: genome.fna.amb
          ref_genome.ann:
            param_name: genome.fna.ann
          ref_genome.bwt:
            param_name: genome.fna.bwt
          ref_genome.pac:
            param_name: genome.fna.pac
          ref_genome.sa:
            param_name: genome.fna.sa
        outputs:
          align.sai:
            param_name: output_sai
In this snippet, we define a few things. The job to be run in this case is
bwa_align_single, which should line up with an existing WorkflowJob. For this
job we define all inputs and outputs required for the job. In general,
inputs and outputs are files, however some are strings or numeric arguments
that need to be loaded at run time. The param_name should match the name of
the parameter defined in the workflow_job yaml. Each key in inputs/outputs
defines the name of the file that will be used by the job. Output files will
be created with the specified file name as a suffix; prefixes for output
files are generated based on the tools applied to them up to the current
point in the workflow.
The job bwa_align_single aligns a single-end read fastq file to the target
genome. We need to pass both the reference genome and fastq file as inputs.
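Because every param_name must line up with a parameter in the corresponding workflow job yaml, a simple cross-check can catch mismatches early. `unknown_params` below is a hypothetical check (not chipathlon's API), using the bwa_align_single names from the snippets above:

```python
# Sketch: verify that every param_name a module references exists in the
# corresponding workflow job definition.
def unknown_params(module_step, job_params):
    """Collect (file_name, param_name) pairs the job does not define."""
    bad = []
    for job_name, sections in module_step.items():
        known = job_params[job_name]
        for section in ("inputs", "outputs"):
            for file_name, param in sections.get(section, {}).items():
                if param not in known:
                    bad.append((file_name, param))
    return bad

# Parameter names defined by the bwa_align_single workflow job:
job_params = {
    "bwa_align_single": {
        "base_genome_file", "fastq1", "genome.fna.amb", "genome.fna.ann",
        "genome.fna.bwt", "genome.fna.pac", "genome.fna.sa", "output_sai",
    },
}

# An abbreviated module step referencing that job:
module_step = {
    "bwa_align_single": {
        "inputs": {"ref_genome": "base_genome_file",
                   "download_1.fastq": "fastq1"},
        "outputs": {"align.sai": "output_sai"},
    },
}

print(unknown_params(module_step, job_params))  # []
```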