+++
title = "How to submit an OSG job with HTCondor"
description = "How to submit an OSG job with HTCondor"
+++

{{% notice info%}}Jobs can be submitted to the OSG from Crane, so
there is no need to log on to a different submit host or get a grid
certificate!
{{% /notice %}}

###  What is HTCondor?

The [HTCondor](http://research.cs.wisc.edu/htcondor)
project provides software for scheduling individual applications and
workflows, and for sites to manage their resources.  It is designed to
enable High Throughput Computing (HTC) on large collections of distributed
resources for users and serves as the job scheduler used on the OSG.
 Jobs are submitted from the Crane login node to the
OSG using an HTCondor submission script.  For those who are used to
submitting jobs with Slurm, there are a few key differences to be aware
of:

### When using HTCondor

- All files (scripts, code, executables, libraries, etc.) that are
  needed by the job are transferred to the remote compute site when
  the job is scheduled.  Therefore, all of the files required by the
  job must be specified in the HTCondor submit script.  Paths can be
  absolute or relative to the local directory from which the job is
  submitted.  The main executable (specified on the `Executable` line
  of the submit script) is transferred automatically with the job. 
  All other files need to be listed on the `transfer_input_files`
  line (see example below). 
- All files that are created by the job on the remote host will be
  transferred automatically back to the submit host when the job has
  completed.  This includes temporary/scratch and intermediate files
  that are not removed by your job.  If you do not want to keep these
  files, clean up the work space on the remote host by removing them
  before the job exits (this can be done using a wrapper script, for
  example).  Specific output file names can be specified with the
  `transfer_output_files` option.  If these files do not exist on the
  remote host when the job exits, then the job will not complete
  successfully (it will be placed in the *held* state).
- HTCondor scripts can queue (submit) as many jobs as you like.  All
  jobs queued from a single submit script will be identical except for
  the `Arguments` used.  For example, a submit script can queue 5 jobs
  with one set of arguments (`Queue 5`) and then 1 job with a second
  set of arguments (`Queue`).  By default, `Queue` without a number
  submits 1 job.
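
The cleanup mentioned in the list above can be done with a small wrapper script. The following is only a sketch: the file name `wrapper.sh` and the scratch-file patterns (`*.tmp`, `scratch_*.dat`) are assumptions chosen for illustration and are not part of HTCondor, so adjust them to whatever your program actually leaves behind.

```shell
#!/bin/bash
# wrapper.sh (hypothetical name): run the real program, then delete scratch
# files so they are not transferred back to the submit host.

run_and_clean() {
    local status=0
    # Run the real program with its arguments and remember its exit status.
    "$@" || status=$?
    # Remove temporary/intermediate files from the job's working directory.
    # These patterns are placeholders, not HTCondor conventions.
    rm -f -- *.tmp scratch_*.dat
    # Report the program's own exit status back to HTCondor.
    return "$status"
}

# Only act when a command was given, e.g.: ./wrapper.sh ./a.out file_name 12
if [ "$#" -gt 0 ]; then
    run_and_clean "$@"
fi
```

To use a wrapper like this, it would be named on the `Executable` line of the submit script, with the real program and its arguments given on the `Arguments` line and the real program listed on the `transfer_input_files` line.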

For more information and advanced usage, see the
[HTCondor Manual](http://research.cs.wisc.edu/htcondor/manual/v8.3/index.html).

### Creating an HTCondor Script

HTCondor, much like Slurm, needs a script that tells it how to run the
user's job. The example below is a basic script, saved in a file named
`applejob.txt`, that can be used to handle most jobs submitted to
HTCondor.

{{% panel theme="info" header="Example of an HTCondor script" %}}
{{< highlight batch >}}
# with executable, stdout, stderr and log
Universe = vanilla
Executable = a.out
Arguments = file_name 12
Output = a.out.out
Error = a.out.err
Log = a.out.log
Queue
{{< /highlight >}}
{{% /panel %}}

The table below explains the various attributes/keywords used in the above script.

| Attribute/Keyword |  Explanation                                                                              |
| ----------------- | ----------------------------------------------------------------------------------------- |
| #                 | Lines starting with '#' are considered as comments by HTCondor.                            |
| Universe          | selects the runtime environment (the term used in the HTCondor documentation) in which HTCondor runs the job.  The vanilla universe is where most jobs should be run. |
| Executable        | is the name of the executable you want to run on HTCondor.                                |
| Arguments         | are the command line arguments for your program. For example, to run `ls -l /` on HTCondor, the Executable would be `ls` and the Arguments would be `-l /`. |
| Output            | is the file where the information printed to stdout will be sent.                         |
| Error             | is the file where the information printed to stderr will be sent.                         |
| Log               | is the file where HTCondor records information about your job, such as whether the job is running, whether it was halted or, if running in the standard universe, whether it was checkpointed or moved. |
| Queue             | is the command to send the job to HTCondor's scheduler.                                   |
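
When a job needs more than its executable, the same keywords can be combined with the file transfer options described earlier. The sketch below writes such a submit file from the shell. The file names `applejob_transfer.txt`, `input.dat`, `params.cfg` and `results.dat` are placeholders invented for this illustration; replace them with the files your job really reads and writes.

```shell
# Write a hypothetical submit file. The quoted 'EOF' keeps the shell from
# altering the text between the markers.
cat > applejob_transfer.txt <<'EOF'
# executable plus extra input files and one named output file
Universe = vanilla
Executable = a.out
Arguments = input.dat 12
# Files (placeholder names) sent to the remote site along with a.out
transfer_input_files = input.dat, params.cfg
# Only this file (placeholder name) is brought back when the job exits
transfer_output_files = results.dat
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Output = a.out.out
Error = a.out.err
Log = a.out.log
Queue
EOF
```

Keep in mind that if `results.dat` is not produced by the job, the job will be placed in the held state, as noted above.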


Suppose you would like to submit a job, e.g. a Monte Carlo simulation,
where the same program needs to be run several times with the same
parameters.  The script above can be used with the following modification.

Modify the `Queue` command by giving it the number of times the job must
be run (and hence queued in HTCondor). Thus if the `Queue` command is
changed to `Queue 5`, a.out will be run 5 times with the exact same
parameters.
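
Because all five runs would otherwise write to the same `Output` and `Error` files, it can help to put the `$(Process)` macro in the file names; HTCondor replaces it with the job's number within the submission, starting at 0. The following is a minimal sketch, and the submit file name `applejob_mc.txt` is an assumption made for this example.

```shell
# Write a hypothetical submit file that runs a.out five times. The quoted
# 'EOF' stops the shell from treating $(Process) as a command substitution.
cat > applejob_mc.txt <<'EOF'
# five identical runs; $(Process) becomes 0, 1, 2, 3, 4
Universe = vanilla
Executable = a.out
Arguments = file_name 12
Output = a.out.$(Process).out
Error = a.out.$(Process).err
Log = a.out.log
Queue 5
EOF

# List the stdout files the five jobs would write. This loop only mimics
# the substitution; HTCondor performs it itself at submit time.
for p in 0 1 2 3 4; do
    echo "a.out.$p.out"
done
```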

In another scenario, if you would like to submit the same job but with
different parameters, HTCondor accepts files with multiple `Queue`
statements. Only the parameters that differ need to be changed in the
HTCondor script before each `Queue` statement.

Please see "A simple example" in the next chapter for the detailed use of
`$(Process)`.

{{% panel theme="info" header="Another example of an HTCondor script" %}}
{{< highlight batch >}}
# with executable, stdout, stderr and log
#and multiple Argument parameters
Universe = vanilla
Executable = a.out
Arguments = file_name 10
Output = a.out.$(Process).out
Error = a.out.$(Process).err
Log = a.out.$(Process).log
Queue
Arguments = file_name 20
Queue
Arguments = file_name 30
Queue
{{< /highlight >}}
{{% /panel %}}

### How to Submit and View Your Job

The steps below describe how to submit a job and other important job
management tasks that you may need in order to monitor and/or control
the submitted job:

1.  How to submit a job to OSG - assuming that you saved your HTCondor
    script in a file named applejob.txt

    {{< highlight bash >}}[apple@login.crane ~] $ condor_submit applejob.txt{{< /highlight >}}

    You will see the following output after submitting the job:
    {{% panel theme="info" header="Example of condor_submit" %}}
    Submitting job(s)
    ......
    6 job(s) submitted to cluster 1013038
    {{% /panel %}}

2.  How to view your job status - to view the status of your
    submitted jobs, use the following shell command.
    *Please note that by providing a user name as an argument to the
    `condor_q` command you can limit the list of submitted jobs to the
    ones that are owned by the named user.*


    {{< highlight bash >}}[apple@login.crane ~] $ condor_q apple{{< /highlight >}}

    The panel below shows typical output. The ST column represents
    the status of the job (H: Held and I: Idle or waiting).

    {{% panel theme="info" header="Example of condor_q" %}}
    -- Schedd: login.crane.hcc.unl.edu : <129.93.227.113:9619?...
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
    1013034.4   apple       3/26 16:34   0+00:21:00 H  0   0.0  sjrun.py INPUT/INP
    1013038.0   apple       4/3  11:34   0+00:00:00 I  0   0.0  sjrun.py INPUT/INP
    1013038.1   apple       4/3  11:34   0+00:00:00 I  0   0.0  sjrun.py INPUT/INP
    1013038.2   apple       4/3  11:34   0+00:00:00 I  0   0.0  sjrun.py INPUT/INP
    1013038.3   apple       4/3  11:34   0+00:00:00 I  0   0.0  sjrun.py INPUT/INP
    ...
    16 jobs; 0 completed, 0 removed, 12 idle, 0 running, 4 held, 0 suspended
    {{% /panel %}}

3.  How to release a job - in a few cases a job may get held because of
    reasons such as authentication failure or other non-fatal errors. In
    those cases you may use the shell command below to release the job
    from the held status so that it can be rescheduled by HTCondor.

    *Release one job:*
    {{< highlight bash >}}[apple@login.crane ~] $ condor_release 1013034.4{{< /highlight >}}

    *Release all jobs of a user apple:*
    {{< highlight bash >}}[apple@login.crane ~] $ condor_release apple{{< /highlight >}}

4.  How to delete a submitted job - if you want to delete a submitted
    job, you may use the shell commands listed below.

    *Delete one job:*
    {{< highlight bash >}}[apple@login.crane ~] $ condor_rm 1013034.4{{< /highlight >}}

    *Delete all jobs of a user apple:*
    {{< highlight bash >}}[apple@login.crane ~] $ condor_rm apple{{< /highlight >}}

5.  How to get help for an HTCondor command

    You can use `man` to get a detailed explanation of an HTCondor command.

    {{% panel theme="info" header="Example of help of condor_q" %}}
    [apple@glidein ~]man condor_q
    {{% /panel %}}

    {{% panel theme="info" header="Output of `man condor_q`" %}}
    just-man-pages/condor_q(1)                          just-man-pages/condor_q(1)
    Name
           condor_q Display information about jobs in queue
    Synopsis
           condor_q [ -help ]
           condor_q  [  -debug  ]  [  -global ] [ -submitter submitter ] [ -name name ] [ -pool centralmanagerhost-
           name[:portnumber] ] [ -analyze ] [ -run ] [ -hold ] [ -globus ] [ -goodput ] [ -io ] [ -dag ] [ -long  ]
           [  -xml  ]  [ -attributes Attr1 [,Attr2 ... ] ] [ -format fmt attr ] [ -autoformat[:tn,lVh] attr1 [attr2
           ...]  ] [ -cputime ] [ -currentrun ] [ -avgqueuetime ] [ -jobads file ] [ -machineads file ] [  -stream-
           results ] [ -wide ] [ {cluster | cluster.process | owner | -constraint expression ... } ]
    Description
           condor_q displays information about jobs in the Condor job queue. By default, condor_q queries the local
           job queue but this behavior may be modified by specifying:
              * the -global option, which queries all job queues in the pool
              * a schedd name with the -name option, which causes the queue of the named schedd to be queried
    {{% /panel %}}


    Next: [A simple example of submitting an HTCondor job]({{< relref "a_simple_example_of_submitting_an_htcondor_job" >}})
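
As a recap of steps 1 and 2 above, the output of `condor_submit` and `condor_q` can also be processed in a script. The helper functions below are a hypothetical sketch (the names `cluster_id` and `count_by_status` are not HTCondor commands) and they assume the exact output layout shown in the panels above, which may differ between HTCondor versions.

```shell
# Print the cluster number from condor_submit output read on stdin,
# e.g. from the line "6 job(s) submitted to cluster 1013038".
cluster_id() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\).*/\1/p'
}

# Count jobs per status code from condor_q-style output read on stdin.
# Assumes the layout shown above, where ST is the sixth whitespace-separated
# field of each job line (the date and time count as two fields).
count_by_status() {
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { n[$6]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

For example, `condor_submit applejob.txt | cluster_id` would print the cluster number of the new jobs, and `condor_q apple | count_by_status` would print one line per status code with the number of jobs in that state.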