Examples
==========

Generating a workflow requires five input files, which you will need to
create:

*   **Config File**
    A few pieces of info need to be defined in here, specifically the bin path
    to the chipathlon environment, the bin path to the idr environment, and
    the email address to message when the workflow is complete.
*   **Param File**
    Allows the user to overwrite options for many of the software tools being
    used.  Most numeric arguments have defaults that can be changed by the
    end-user.
*   **Run File**
    Describes the actual files to process, which alignment and peak calling
    tools should be used on them, and whether or not to run idr.
*   **Properties File**
    One of the files required by pegasus.  For more information see their
    `properties documentation <https://pegasus.isi.edu/documentation/properties.php>`_.
*   **Sites File**
    One of the files required by pegasus.  For more information see their
    `sites catalog documentation <https://pegasus.isi.edu/documentation/site.php>`_.
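
Before generating anything, it can help to confirm that all five files are in
place.  The following is only a minimal sketch: the file names are borrowed
from the downloadable examples on this page and are an assumption about how
you name your own files, and ``missing_inputs`` is a hypothetical helper, not
part of chipathlon.

.. code-block:: python

    from pathlib import Path

    # Hypothetical file names, taken from the example downloads on this page.
    # Substitute the names of your own files.
    REQUIRED_FILES = {
        "config": "small_test_config.yaml",
        "param": "small_test_param.yaml",
        "run": "small_test_run.yaml",
        "properties": "small_test_properties.txt",
        "sites": "small_test_sites.xml",
    }


    def missing_inputs(directory, required=REQUIRED_FILES):
        """Return the sorted roles (e.g. "run") whose file is not in directory."""
        base = Path(directory)
        return sorted(role for role, name in required.items()
                      if not (base / name).is_file())


    if __name__ == "__main__":
        missing = missing_inputs(".")
        if missing:
            print("Missing input files:", ", ".join(missing))
        else:
            print("All five input files are present.")

Run it from the directory that holds your input files; an empty result means
every file was found.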

The information in the properties file is highly specific to the environment
you are submitting on.  Additionally, genomic information is expected to be
downloaded and built for the target genome you are interested in, along with a
chromosome sizes file.

Supported Tools
^^^^^^^^^^^^^^^^

Alignment:

* `bwa <http://bio-bwa.sourceforge.net>`_
* `bowtie2 <http://bowtie-bio.sourceforge.net/bowtie2/index.shtml>`_

Peak Calling:

* `spp <https://github.com/hms-dbmi/spp>`_ (narrow, broad)
* `zerone <https://omictools.com/zerone-tool>`_ (broad)
* `macs2 <https://github.com/taoliu/MACS>`_ (narrow, broad)
* `gem <http://groups.csail.mit.edu/cgs/gem/>`_ (narrow)
* `peakranger <http://ranger.sourceforge.net/manual1.18.html>`_ (narrow)
* `ccat <http://ranger.sourceforge.net/manual1.18.html>`_ (broad)
* `music <https://github.com/gersteinlab/MUSIC>`_ (narrow, punctate, broad)
* `pepr <https://github.com/shawnzhangyx/PePr>`_ (narrow)
* `hiddendomains <http://hiddendomains.sourceforge.net/>`_ (broad)

Getting Started
^^^^^^^^^^^^^^^^
:download:`Config <examples/small_test_config.yaml>`

:download:`Run <examples/small_test_run.yaml>`

:download:`Param <examples/small_test_param.yaml>`

:download:`Properties <examples/small_test_properties.txt>`

:download:`Sites <examples/small_test_sites.xml>`

**Config**

.. code-block:: yaml

    chipathlon_bin: /home/swanson/aknecht/.conda/envs/chip/bin
    idr_bin: /home/swanson/aknecht/.conda/envs/idr/bin
    pegasus_home: /usr/share/pegasus/
    email: YOUREMAIL@DOMAIN.com

The top two lines define the bin paths to the chipathlon and idr environments.
The paths depend on where you created your environments, but if you followed
the installation instructions they will be in the .conda folder in your home
directory.  The workflow uses these two paths to locate all of the software it
executes.  If an email address is specified in the config file, a message is
sent to that address once the workflow is complete.  The pegasus_home
definition corresponds to the pegasus install location.  It is needed so that
the pegasus email script (in pegasus/notification/email) can be found and
executed.
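
As a quick sanity check, the config can be scanned for the four keys shown
above.  This is a sketch under stated assumptions: it treats all four keys as
required (the text above only implies this), and it reads flat ``key: value``
lines by hand instead of using a real YAML parser, which is enough for a
config of this shape but not for YAML in general.  Both function names are
hypothetical.

.. code-block:: python

    # Assumption: every key shown in the example config is required.
    REQUIRED_KEYS = ("chipathlon_bin", "idr_bin", "pegasus_home", "email")


    def parse_flat_config(text):
        """Parse flat 'key: value' lines; a stand-in for a real YAML parser."""
        config = {}
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Split on the first colon only, so values may contain colons.
            key, sep, value = line.partition(":")
            if sep:
                config[key.strip()] = value.strip()
        return config


    def missing_keys(config, required=REQUIRED_KEYS):
        """Return the required keys that are absent or empty, in order."""
        return [key for key in required if not config.get(key)]


    if __name__ == "__main__":
        with open("small_test_config.yaml") as handle:
            print(missing_keys(parse_flat_config(handle.read())))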

**Run**

.. code-block:: yaml

    genomes:
      mm9:
        bowtie2: /work/ladunga/SHARED/mouse/mm9/mm9.genome.fa
        bwa: /work/ladunga/SHARED/mouse/mm9/mm9.genome.fa
        chrom.sizes: /work/ladunga/SHARED/mouse/mm9/mm9.chrom.sizes
    runs:
    - align: bwa
      assembly: mm9
      controls: &id001
      - ENCFF001NIM
      file_type: fastq
      idr: &id002
      - ENCFF001NIP
      - ENCFF001NIS
      peak: macs2
      peak_type: narrow
      signals: &id003
      - ENCFF001NIP
      - ENCFF001NIS
    - align: bowtie2
      assembly: mm9
      controls: *id001
      file_type: fastq
      idr: *id002
      peak: macs2
      peak_type: narrow
      signals: *id003

The run file defines all genomic information required by the workflow, and
the accession files to process.  Genomic information is defined for each
assembly used in the workflow, and must include a chromosome sizes file.
For each alignment tool used, a path to the base file must be specified.
In this case, we use both bowtie2 and bwa, so we specify a path for each.
In the second section, runs are defined as a list in which each entry contains
all information necessary for processing:

* align
    The alignment tool to use; should be either bwa or bowtie2. If starting
    from bam files, the alignment tool is not required.
* assembly
    The assembly for alignment.  Some peak callers use the chromosome sizes
    file, so assembly is required.
* file_type
    Defines the type of files that processing initially begins with.  Should be
    either fastq or bam.
* peak
    The tool used for peak calling.  The supported tools section above lists
    all peak calling tools and their supported peak types.
* peak_type
    The type of peak calling to perform.  The peak type is tool dependent,
    as tools support different peak calling types.  Usually peak_type is narrow
    or broad.
* signals
    The list of chip data accession numbers to process.
* controls
    The list of control inputs to process.
* idr
    A pair of signal accessions to use for idr.  Idr can only be run on pairs
    of files.  Idr is optional.
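
The rules in the list above can be turned into a small check on each run
entry.  This is a hedged sketch, not how chipathlon validates run files: the
tool and peak type table is copied from the supported tools section, the run
is assumed to be already loaded into a plain dictionary, and ``check_run`` is
a hypothetical name.

.. code-block:: python

    ALIGNERS = {"bwa", "bowtie2"}

    # Peak callers and peak types, as listed under Supported Tools.
    PEAK_TYPES = {
        "spp": {"narrow", "broad"},
        "zerone": {"broad"},
        "macs2": {"narrow", "broad"},
        "gem": {"narrow"},
        "peakranger": {"narrow"},
        "ccat": {"broad"},
        "music": {"narrow", "punctate", "broad"},
        "pepr": {"narrow"},
        "hiddendomains": {"broad"},
    }


    def check_run(run):
        """Return a list of problems found in one entry of the runs list."""
        problems = []
        file_type = run.get("file_type")
        if file_type not in ("fastq", "bam"):
            problems.append("file_type should be fastq or bam")
        # The aligner may be omitted only when starting from bam files.
        if file_type != "bam" and run.get("align") not in ALIGNERS:
            problems.append("align should be bwa or bowtie2")
        if not run.get("assembly"):
            problems.append("assembly is required")
        peak = run.get("peak")
        if peak not in PEAK_TYPES:
            problems.append("unsupported peak caller: %r" % (peak,))
        elif run.get("peak_type") not in PEAK_TYPES[peak]:
            problems.append("%s does not support peak_type %r"
                            % (peak, run.get("peak_type")))
        signals = run.get("signals") or []
        if not signals:
            problems.append("signals should list at least one accession")
        idr = run.get("idr")
        if idr is not None:  # idr is optional
            if len(idr) != 2:
                problems.append("idr needs exactly two accessions")
            elif not set(idr) <= set(signals):
                problems.append("idr accessions should also appear in signals")
        return problems

An empty list means the entry passed every check; for example, a run with
``peak: gem`` and ``peak_type: broad`` is reported because gem is listed as
narrow only.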

When creating runs, you will often want to investigate the same files with
several different peak calling and alignment tools.  In the case above, the
two runs are identical except for the alignment tool: one uses bwa and the
other uses bowtie2.  To avoid retyping a lot of information, a list can be
marked with an id using the & symbol and a unique identifier (a YAML anchor).
Later on in the file, the list can be referenced using the * symbol (a YAML
alias).  Since we are only changing the alignment tool, there is no need to
type out all the samples a second time.

**Param**

.. code-block:: yaml

    macs2_callpeak:
      arguments:
        "-g": "mm"
    bwa_align_single:
      arguments:
        "-l": 20
        "-q": 6
    music_punctate:
      arguments:
        "--mapp": "/work/ladunga/SHARED/workflows/mm9_50bp"
    music_narrow:
      arguments:
        "--mapp": "/work/ladunga/SHARED/workflows/mm9_50bp"
    music_broad:
      arguments:
        "--mapp": "/work/ladunga/SHARED/workflows/mm9_50bp"
    picard_mark_duplicates:
      arguments:
        "REMOVE_DUPLICATES=": "false"

The param file allows you to adjust both arguments and requested resources for
any job in the workflow.  In this case, we are processing mouse files, so we
specify "-g": "mm" for macs2 peak calling.  The bwa_align_single entry
overrides the -l and -q arguments for single-end bwa alignment.  The music
peak caller requires additional information to run successfully, so a --mapp
path is given for each music job even though we are not using music in this
run.  Finally, we tell picard_mark_duplicates not to remove duplicates.

**Properties**

.. code-block:: text

    pegasus.catalog.site = XML
    pegasus.catalog.site.file = small_test_sites.xml

    pegasus.condor.logs.symlink = false
    pegasus.transfer.links = true
    pegasus.data.configuration = sharedfs

For more information on the properties file, consult the pegasus
`properties documentation <https://pegasus.isi.edu/documentation/properties.php>`_.

**Sites**

.. code-block:: xml

    <?xml version="1.0" ?>
    <sitecatalog version="4.0" xmlns="http://pegasus.isi.edu/schema/sitecatalog" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://pegasus.isi.edu/schema/sitecatalog http://pegasus.isi.edu/schema/sc-4.0.xsd">
      <site arch="x86_64" handle="local" os="LINUX">
        <directory path="/lustre/work/ladunga/SHARED/workflows/new_tests/full_test/work" type="shared-scratch">
          <file-server operation="all" url="file:///lustre/work/ladunga/SHARED/workflows/new_tests/full_test/work"/>
        </directory>
        <directory path="/lustre/work/ladunga/SHARED/workflows/new_tests/full_test/output" type="local-storage">
          <file-server operation="all" url="file:///lustre/work/ladunga/SHARED/workflows/new_tests/full_test/output"/>
        </directory>

        <profile key="change.dir" namespace="pegasus">true</profile>
        <profile key="transfer.threads" namespace="pegasus">4</profile>
        <profile key="universe" namespace="condor">vanilla</profile>
        <profile key="grid_resource" namespace="condor">pbs</profile>
        <profile key="batch_queue" namespace="condor">batch</profile>
        <profile key="style" namespace="pegasus">glite</profile>
      </site>
    </sitecatalog>

For more information on the sites file, consult the pegasus
`sites catalog documentation <https://pegasus.isi.edu/documentation/site.php>`_.

**Generation**

To generate the workflow, pass these input files into the :ref:`chip-gen`
script, like so:

.. code-block:: bash

    chip-gen \
      --dir DIRECTORY_NAME \
      --host DB_HOST \
      --param param.yaml \
      --conf config.yaml \
      --run run.yaml \
      --properties properties.txt \
      --execute-site local \
      --output-site local
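
If you generate several workflows from a script, the same invocation can be
assembled as an argument list.  The sketch below uses only the flags shown in
the command above; ``chip_gen_command`` is a hypothetical helper, the argument
values in the example are placeholders, and ``shlex.join`` requires
Python 3.8 or newer.

.. code-block:: python

    import shlex


    def chip_gen_command(directory, host, param, conf, run, properties,
                         execute_site="local", output_site="local"):
        """Build the chip-gen argument list using the flags shown above."""
        return [
            "chip-gen",
            "--dir", directory,
            "--host", host,
            "--param", param,
            "--conf", conf,
            "--run", run,
            "--properties", properties,
            "--execute-site", execute_site,
            "--output-site", output_site,
        ]


    if __name__ == "__main__":
        # Placeholder values; replace them with your own paths and host.
        command = chip_gen_command("DIRECTORY_NAME", "DB_HOST", "param.yaml",
                                   "config.yaml", "run.yaml", "properties.txt")
        # Print the command for review; pass the list to subprocess.run to
        # execute it.
        print(shlex.join(command))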

This will generate all files necessary to run the workflow in the specified
directory under a date-time stamped folder.  The structure will look like this:

.. code-block:: text

    directory_name/
      date-timestamp/
        input/
          chipathlon.dax
          db_meta/
          notify.sh
          submit.sh
        output/
        work/

From here, use the submit.sh script to submit the workflow.  submit.sh creates
status.sh and remove.sh, which are scripts used to check the status of the
workflow and remove the workflow, respectively.  Upon completion of the
workflow, the notify.sh script is used to email the address specified in your
configuration.
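
Because every generation creates a new date-time stamped folder, it can be
handy to look up the most recent one.  The sketch below is an illustration
only: ``latest_submit_script`` is a hypothetical helper, and it picks the
newest folder by modification time because the exact timestamp format of the
folder name is not assumed here.

.. code-block:: python

    from pathlib import Path


    def latest_submit_script(base_dir):
        """Return submit.sh from the most recently modified run folder, or None."""
        # Only consider folders that follow the layout <run>/input/submit.sh.
        runs = [path for path in Path(base_dir).iterdir()
                if (path / "input" / "submit.sh").is_file()]
        if not runs:
            return None
        newest = max(runs, key=lambda path: path.stat().st_mtime)
        return newest / "input" / "submit.sh"


    if __name__ == "__main__":
        # Placeholder for the directory passed to chip-gen via --dir.
        print(latest_submit_script("directory_name"))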