Commit b8f106c8 authored by aknecht2

Updated examples documentation, including downloadable files.

parent fd3337e3
@@ -6,7 +6,17 @@ a run file, and a param file. The config file is used to specify system
information -- paths to required software, environment variables for pegasus
and so on. The run file is used to specify the actual files to process and
what software tools to use on them. Finally, the param file is used to
override any default params for the jobs in the workflow.
override any default params for the jobs in the workflow. Each of the
examples below discusses all three of these files and provides download
links to each.
Getting Started
^^^^^^^^^^^^^^^^
:download:`Config <examples/small_test_config.yaml>`
:download:`Run <examples/small_test_run.yaml>`
:download:`Param <examples/small_test_param.yaml>`
**Config**
@@ -27,6 +37,16 @@ override any default params for the jobs in the workflow.
PATH: "/home/swanson/aknecht/.conda/envs/ih_env/bin:/bin/:/usr/bin/:/usr/local/bin/"
PEGASUS_HOME: "/usr/"
If an email is specified in the config file, a notification is sent to that
address once the workflow is complete. The pegasus_home definition corresponds
to the pegasus install location. It is required so that the pegasus email
script (in pegasus/notification/email) can be found and executed successfully.
The profile information in the config file is passed through to the pegasus
`site catalog <https://pegasus.isi.edu/documentation/site.php>`_, so any
pegasus `profile <https://pegasus.isi.edu/documentation/profiles.php>`_
information can be supplied. The required information depends on the
system you are submitting to.
**Run**
.. code-block:: yaml
@@ -45,7 +65,7 @@ override any default params for the jobs in the workflow.
idr: &id002
- ENCFF001NIP
- ENCFF001NIS
peak: spp
peak: macs2
peak_type: narrow
signals: &id003
- ENCFF001NIP
@@ -55,10 +75,49 @@ override any default params for the jobs in the workflow.
controls: *id001
file_type: fastq
idr: *id002
peak: spp
peak: macs2
peak_type: narrow
signals: *id003
The run file defines all genomic information required by the workflow, and
the accession files to process. Genomic information is defined for each
assembly used in the workflow, and must include a chromosome sizes file.
For each alignment tool used, a path to the base file must be specified.
In this case, we use both bowtie2 and bwa, so we specify a path for each.
In the second section, runs are defined as a list containing all information
necessary for processing:
* align
The alignment tool to use. Should be either bwa or bowtie2. If starting
from bam files, the alignment tool is not required.
* assembly
The assembly for alignment. Some peak callers use the chromosome file,
so assembly is required.
* file_type
Defines the type of files that processing initially begins with. Should be
either fastq or bam.
* peak
The tool used for peak calling. Should be one of [spp, gem, macs2,
peakranger, ccat, zerone, music].
* peak_type
The type of peak calling to perform. The peak type is tool dependent,
as tools support different peak calling types. Usually peak_type is narrow
or broad.
* signals
The list of chip data accession numbers to process.
* controls
The list of control inputs to process.
* idr
A pair of signal accessions to use for idr. Idr can only be run on pairs
of files. Idr is optional.
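As a rough illustration of the optional fields above, the sketch below shows
what a single run starting from bam files might look like, with align and idr
left out. The accession names are placeholders rather than real accessions,
and the entry is an assumption based on the field descriptions, not a tested
configuration.

.. code-block:: yaml

    runs:
      # Starting from bam files, so no "align" key is given.
      - assembly: mm9
        file_type: bam
        peak: macs2
        peak_type: narrow
        signals:
          - SIGNAL_ACCESSION_1    # placeholder, replace with a real accession
          - SIGNAL_ACCESSION_2    # placeholder
        controls:
          - CONTROL_ACCESSION_1   # placeholder
        # "idr" is omitted because it is optional.

The mm9 assembly would still need an entry in the genomes section of the run
file, including its chromosome sizes file.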
When creating runs, you'll often want to investigate the same files
with multiple different peak calling and alignment tools. In the case above,
the two runs defined are identical except for the alignment tool -- one uses
bwa and the other uses bowtie2. To avoid retyping a lot of information, lists
can be marked with ids (YAML anchors) using the & symbol. Later on in the
file, the list can be referenced (as a YAML alias) using the * symbol.
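Stripped down to a single key, the marking and referencing looks roughly like
the sketch below. The anchor name ``shared_signals`` is chosen here purely for
illustration (the example file uses ``id003``), and the other keys a real run
needs are left out for brevity.

.. code-block:: yaml

    runs:
      - align: bwa
        # "&shared_signals" marks this list so it can be reused later.
        signals: &shared_signals
          - ENCFF001NIP
          - ENCFF001NIS
      - align: bowtie2
        # "*shared_signals" expands to the same two-item list defined above.
        signals: *shared_signals

An id must be defined with & before the first place it is referenced with \*,
which is why the full list appears in the first run.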
**Param**
.. code-block:: yaml
@@ -66,18 +125,26 @@ override any default params for the jobs in the workflow.
macs2_callpeak:
arguments:
"-g": "mm"
bwa_align_single:
music_punctate:
arguments:
"-q": 5
"-l": 32
"-k": 2
"-t": 1
bwa_align_paired:
"--mapp": "/work/ladunga/SHARED/workflows/mm9_50bp"
music_narrow:
arguments:
"-t": 1
samtools_sam_to_bam:
walltime: 60
memory: 16000
"--mapp": "/work/ladunga/SHARED/workflows/mm9_50bp"
music_broad:
arguments:
"--mapp": "/work/ladunga/SHARED/workflows/mm9_50bp"
picard_mark_duplicates:
arguments:
"REMOVE_DUPLICATES=": "false"
The param file allows you to adjust both arguments and requested resources for
any job in the workflow. In this case, we are processing mouse files, so we
specify "-g": "mm" for macs2 peak calling. The music peak caller requires
additional information to run successfully (even though we are not using it),
so a "--mapp" path is given for each music job. Finally, we set
"REMOVE_DUPLICATES=": "false" for picard_mark_duplicates so that duplicates
are kept rather than removed.
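The current example only overrides arguments. For requested resources, a
sketch along the following lines may be a reasonable starting point. It reuses
the ``walltime`` and ``memory`` keys from the earlier revision of this param
file visible in the diff above. Whether those keys and values still apply,
and what units they use, is an assumption that should be checked against the
job defaults.

.. code-block:: yaml

    samtools_sam_to_bam:
      # Resource overrides sit directly under the job name.
      walltime: 60
      memory: 16000
    picard_mark_duplicates:
      # Argument overrides are nested under "arguments".
      arguments:
        "REMOVE_DUPLICATES=": "false"

Resource keys sit at the job level while command-line options go under
``arguments``, so both kinds of override can be combined for the same job.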
**Generation**
To generate the workflow, pass these input files into the :ref:`chip-gen`
script, like so:
......
notify:
pegasus_home: "/usr/share/pegasus/"
email: "avi@kurtknecht.com"
profile:
pegasus:
style: "glite"
condor:
grid_resource: "pbs"
universe: "vanilla"
batch_queue: "batch"
env:
PYTHONPATH: "/home/swanson/aknecht/.conda/envs/ih_env/lib/python2.7/site-packages/"
PATH: "/home/swanson/aknecht/.conda/envs/ih_env/bin:/bin/:/usr/bin/:/usr/local/bin/"
PEGASUS_HOME: "/usr/"
macs2_callpeak:
arguments:
"-g": "mm"
music_punctate:
arguments:
"--mapp": "/work/ladunga/SHARED/workflows/mm9_50bp"
music_narrow:
arguments:
"--mapp": "/work/ladunga/SHARED/workflows/mm9_50bp"
music_broad:
arguments:
"--mapp": "/work/ladunga/SHARED/workflows/mm9_50bp"
picard_mark_duplicates:
arguments:
"REMOVE_DUPLICATES=": "false"
genomes:
mm9:
bowtie2: /work/ladunga/SHARED/mouse/mm9/mm9.genome.fa
bwa: /work/ladunga/SHARED/mouse/mm9/mm9.genome.fa
chrom.sizes: /work/ladunga/SHARED/mouse/mm9/mm9.chrom.sizes
runs:
- align: bwa
assembly: mm9
controls: &id001
- ENCFF001NIM
file_type: fastq
idr: &id002
- ENCFF001NIP
- ENCFF001NIS
peak: macs2
peak_type: narrow
signals: &id003
- ENCFF001NIP
- ENCFF001NIS
- align: bowtie2
assembly: mm9
controls: *id001
file_type: fastq
idr: *id002
peak: macs2
peak_type: narrow
signals: *id003