Examples
==========

Generating a workflow requires three files: a config file, a run file, and a
param file.  The config file specifies system information such as paths to
required software and environment variables for pegasus.  The run file
specifies the files to process and the software tools to use on them.  The
param file overrides any default params for the jobs in the workflow.  Each
example below discusses all three files and provides download links to each.

Getting Started
^^^^^^^^^^^^^^^^
:download:`Config <examples/small_test_config.yaml>`

:download:`Run <examples/small_test_run.yaml>`

:download:`Param <examples/small_test_param.yaml>`

**Config**

.. code-block:: yaml

    notify:
      pegasus_home: "/usr/share/pegasus/"
      email: "avi@kurtknecht.com"
    profile:
      pegasus:
        style: "glite"
      condor:
        grid_resource: "pbs"
        universe: "vanilla"
        batch_queue: "batch"
      env:
        PYTHONPATH: "/home/swanson/aknecht/.conda/envs/ih_env/lib/python2.7/site-packages/"
        PATH: "/home/swanson/aknecht/.conda/envs/ih_env/bin:/bin/:/usr/bin/:/usr/local/bin/"
        PEGASUS_HOME: "/usr/"

Specifying an email in the config file sends an email to the target
address once the workflow is complete.  The pegasus_home definition corresponds
to the pegasus install location; it is needed so that the pegasus email
script (in pegasus/notification/email) can be found and executed.
The profile section of the config file is passed through to the pegasus
`sites catalog <https://pegasus.isi.edu/documentation/site.php>`_, so any
pegasus `profile <https://pegasus.isi.edu/documentation/profiles.php>`_
information can be supplied there.  The information required depends on the
system you are submitting to.

**Run**

.. code-block:: yaml

    genomes:
      mm9:
        bowtie2: /work/ladunga/SHARED/mouse/mm9/mm9.genome.fa
        bwa: /work/ladunga/SHARED/mouse/mm9/mm9.genome.fa
        chrom.sizes: /work/ladunga/SHARED/mouse/mm9/mm9.chrom.sizes
    runs:
    - align: bwa
      assembly: mm9
      controls: &id001
      - ENCFF001NIM
      file_type: fastq
      idr: &id002
      - ENCFF001NIP
      - ENCFF001NIS
      peak: macs2
      peak_type: narrow
      signals: &id003
      - ENCFF001NIP
      - ENCFF001NIS
    - align: bowtie2
      assembly: mm9
      controls: *id001
      file_type: fastq
      idr: *id002
      peak: macs2
      peak_type: narrow
      signals: *id003

The run file defines all genomic information required by the workflow, and
the accession files to process.  Genomic information is defined for each
assembly used in the workflow, and must include a chromosome sizes file.
For each alignment tool used, a path to the base file must be specified.
In this case, we use both bowtie2 & bwa, so we specify a path for both.
In the second section, runs are defined as a list containing all information
necessary for processing:

* align
    The alignment tool to use, should be either bwa or bowtie2. If starting
    from bam files the alignment tool is not required.
* assembly
    The assembly for alignment.  Some peak callers use the chromosome file,
    so assembly is required.
* file_type
    Defines the type of files that processing initially begins with.  Should be
    either fastq or bam.
* peak
    The tool used for peak calling.  Should be one of [spp, gem, macs2,
    peakranger, ccat, zerone, music].
* peak_type
    The type of peak calling to perform.  The peak type is tool dependent,
    as tools support different peak calling types.  Usually peak_type is narrow
    or broad.
* signals
    The list of chip data accession numbers to process.
* controls
    The list of control inputs to process.
* idr
    A pair of signal accessions to use for idr.  Idr can only be run on pairs
    of files.  Idr is optional.
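
As a rough sketch of the optional keys described above, a run that starts from
bam files might look like the following.  This assumes that align and idr can
simply be left out when they are not needed; the accession names are
hypothetical placeholders, not real files.

.. code-block:: yaml

    runs:
    # Starting from bam files, so no align key is given (assumption:
    # the key is simply omitted).
    - assembly: mm9
      file_type: bam
      peak: macs2
      peak_type: narrow
      # Hypothetical placeholder accessions.
      signals:
      - SIGNAL_ACCESSION
      controls:
      - CONTROL_ACCESSION
      # idr is optional and is left out here.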

When creating runs, you will often want to investigate the same files
with several different peak calling and alignment tools.  In the case above,
the two runs defined are identical except for the alignment tool -- one uses
bwa and the other uses bowtie2.  To avoid retyping a lot of information, a list
can be marked with an id using the & symbol (a YAML anchor).  Later in the
file, the list can be referenced using the * symbol (a YAML alias).
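
The & and * mechanism is standard YAML, so it can be shown in isolation.  The
fragment below is an abbreviated illustration, not a complete run file: most
run keys are omitted, and the id name chip_signals is an arbitrary label
chosen for this sketch.

.. code-block:: yaml

    runs:
    - align: bwa
      # "&chip_signals" marks this list with an id.
      signals: &chip_signals
      - ENCFF001NIP
      - ENCFF001NIS
    - align: bowtie2
      # "*chip_signals" refers back to the list marked above, so both
      # runs see the same two accessions.
      signals: *chip_signals

An id must be defined before it is referenced, which is why the first run
carries the & markers and the second run carries the * references.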

**Param**

.. code-block:: yaml

    macs2_callpeak:
      arguments:
        "-g": "mm"
    music_punctate:
      arguments:
        "--mapp": "/work/ladunga/SHARED/workflows/mm9_50bp"
    music_narrow:
      arguments:
        "--mapp": "/work/ladunga/SHARED/workflows/mm9_50bp"
    music_broad:
      arguments:
        "--mapp": "/work/ladunga/SHARED/workflows/mm9_50bp"
    picard_mark_duplicates:
      arguments:
        "REMOVE_DUPLICATES=": "false"

The param file allows you to adjust both arguments and requested resources for
any job in the workflow.  In this case, we are processing mouse files, so we
set "-g": "mm" for macs2 peak calling.  The music peak caller requires
additional information to run successfully, so its arguments are specified
even though we are not using it.  Finally, we specify not to remove duplicates.
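
As a small illustration of the same override, assuming the input files were
human rather than mouse, the macs2 genome size shortcut would change from
"mm" to "hs" (both are shortcuts accepted by macs2's -g option).  This is a
hypothetical variation on the param file above, not part of the example
download.

.. code-block:: yaml

    macs2_callpeak:
      arguments:
        # "hs" is the macs2 shortcut for the human genome size.
        "-g": "hs"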

**Generation**

To generate the workflow, pass these input files into the :ref:`chip-gen`
script, like so:

.. code-block:: bash

    chip-gen \
      --dir DIRECTORY_NAME \
      --host DB_HOST \
      --username USERNAME \
      --password PASSWORD \
      --param param.yaml \
      --conf config.yaml \
      --run run.yaml

This will generate all files necessary to run the workflow in the specified
directory under a date-time stamped folder.  The structure will look like this:

.. code-block:: bash

    directory_name/
        date-timestamp/
          input/
            chipathlon.dax
            conf.rc
            db_meta/
            notify.sh
            sites.xml
            submit.sh
          output/
          work/

From here, you can use the submit.sh script to actually submit the workflow!
submit.sh creates status.sh & remove.sh, which are scripts used to check the
status of the workflow and remove the workflow respectively.  Upon completion
of the workflow the notify.sh script is used to email the address specified
in your configuration.