Verified Commit f42cc2a3 authored by Adam Caprez's avatar Adam Caprez

Add OSG section.

parent f7eb5ef3
1 merge request: !54 Add OSG section.
+++
title = "The Open Science Grid"
description = "How to utilize the Open Science Grid (OSG)."
weight = "40"
+++
If you find that you are not getting access to the volume of computing
resources needed for your research through HCC, you might also consider
submitting your jobs to the Open Science Grid (OSG).
### What is the Open Science Grid?
The [Open Science Grid](http://opensciencegrid.org) advances
science through open distributed computing. The OSG is a
multi-disciplinary partnership to federate local, regional, community
and national cyber infrastructures to meet the needs of research and
academic communities at all scales. HCC participates in the OSG as a
resource provider and a resource user. We provide HCC users with a
gateway to running jobs on the OSG.
The map below shows the Open Science Grid sites located across the U.S.
{{< figure src="/images/17044917.png" >}}
This help document is divided into four sections, namely:
- [Characteristics of an OSG friendly job]({{< relref "characteristics_of_an_osg_friendly_job" >}})
- [How to submit an OSG Job with HTCondor]({{< relref "how_to_submit_an_osg_job_with_htcondor" >}})
- [A simple example of submitting an HTCondor job]({{< relref "a_simple_example_of_submitting_an_htcondor_job" >}})
- [Using Distributed Environment Modules on OSG]({{< relref "using_distributed_environment_modules_on_osg" >}})
+++
title = "A simple example of submitting an HTCondor job"
description = "A simple example of submitting an HTCondor job."
+++
This page describes a complete example of submitting an HTCondor job.
1. SSH to Tusker or Crane
{{% panel theme="info" header="ssh command" %}}
[apple@localhost]$ ssh apple@crane.unl.edu
{{% /panel %}}
{{% panel theme="info" header="output" %}}
[apple@login.crane~]$
{{% /panel %}}
2. Write a simple Python program in a file named "hello.py" that we wish
to run using HTCondor
{{% panel theme="info" header="edit a python code named 'hello.py'" %}}
[apple@login.crane ~]$ vim hello.py
{{% /panel %}}
Then in the edit window, please input the code below:
{{% panel theme="info" header="hello.py" %}}
#!/usr/bin/env python
import sys
import time
i=1
while i<=6:
    print i
    i+=1
    time.sleep(1)
print 2**8
print "hello world received argument = " +sys.argv[1]
{{% /panel %}}
This program prints 1 through 6 on stdout, one number per second, then
prints the number 256, and finally prints
`hello world received argument = <command line argument sent to hello.py>`.
3. Write an HTCondor submit script named "hello.submit"
{{% panel theme="info" header="hello.submit" %}}
Universe = vanilla
Executable = hello.py
Output = OUTPUT/hello.out.$(Cluster).$(Process).txt
Error = OUTPUT/hello.error.$(Cluster).$(Process).txt
Log = OUTPUT/hello.log.$(Cluster).$(Process).txt
notification = Never
Arguments = $(Process)
PeriodicRelease = ((JobStatus==5) && (CurrentTime - EnteredCurrentStatus) > 30)
OnExitRemove = (ExitStatus == 0)
Queue 4
{{% /panel %}}
4. Create an OUTPUT directory to receive all output files that are
generated by your job (the OUTPUT folder is used in the submit script
above)
{{% panel theme="info" header="create output directory" %}}
[apple@login.crane ~]$ mkdir OUTPUT
{{% /panel %}}
5. Submit your job
{{% panel theme="info" header="condor_submit" %}}
[apple@login.crane ~]$ condor_submit hello.submit
{{% /panel %}}
{{% panel theme="info" header="Output of submit" %}}
Submitting job(s)
....
4 job(s) submitted to cluster 1013054.
{{% /panel %}}
6. Check the status of your jobs with `condor_q`
{{% panel theme="info" header="condor_q" %}}
[apple@login.crane ~]$ condor_q
{{% /panel %}}
{{% panel theme="info" header="Output of `condor_q`" %}}
-- Schedd: login.crane.hcc.unl.edu : <129.93.227.113:9619?...
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
720587.0 logan 12/15 10:48 33+14:41:17 H 0 0.0 continuous.cron 20
720588.0 logan 12/15 10:48 200+02:40:08 H 0 0.0 checkprogress.cron
1012864.0 jthiltge 2/15 16:48 0+00:00:00 H 0 0.0 test.sh
1013054.0 jennyshao 4/3 17:58 0+00:00:00 R 0 0.0 hello.py 0
1013054.1 jennyshao 4/3 17:58 0+00:00:00 R 0 0.0 hello.py 1
1013054.2 jennyshao 4/3 17:58 0+00:00:00 I 0 0.0 hello.py 2
1013054.3 jennyshao 4/3 17:58 0+00:00:00 I 0 0.0 hello.py 3
7 jobs; 0 completed, 0 removed, 0 idle, 4 running, 3 held, 0 suspended
{{% /panel %}}
Listed below are the three job statuses that appear in the above
output
| Symbol | Representation |
|--------|------------------|
| H | Held |
| R | Running |
| I | Idle and waiting |
7. Explanation of `$(Cluster)` and `$(Process)` in the HTCondor script
`$(Cluster)` and `$(Process)` are variables that HTCondor makes
available within the submit script. `$(Cluster)` is the prefix of your
job ID, and `$(Process)` ranges from `0` to `N - 1`, where `N` is the
number given to `Queue`. If you submit a single job, `$(Cluster)` alone
identifies it; otherwise, each job ID combines `$(Cluster)` and
`$(Process)` (for example, `1013054.2`).
In this example, `$(Cluster)` is "1013054" and `$(Process)` ranges from
"0" to "3" for the above HTCondor script.
In the majority of cases, these variables are used to modify the
behavior of each individual task of the HTCondor submission. For
example, one may vary the input or output files or the parameters of
the program being run. In this example we simply pass `$(Process)` as
the argument that `hello.py` reads from `sys.argv[1]`.
The lines of interest for this discussion from the HTCondor script
"hello.submit" are listed in the code section below:
{{% panel theme="info" header="for `$(Process)`" %}}
Output = OUTPUT/hello.out.$(Cluster).$(Process).txt
Arguments = $(Process)
Queue 4
{{% /panel %}}
The line of interest for this discussion from file "hello.py" is
listed in the code section below:
{{% panel theme="info" header="for `$(Process)`" %}}
print "hello world received argument = " +sys.argv[1]
{{% /panel %}}
8. Viewing the results of your job
After your job has completed, you can use the Linux `cat` or `vim`
command to view the job output.
For example, in the file name `hello.out.1013054.2.txt`, "1013054" is
`$(Cluster)` and "2" is `$(Process)`. The output looks like this:
{{% panel theme="info" header="example of one output file `hello.out.1013054.2.txt`" %}}
1
2
3
4
5
6
256
hello world received argument = 2
{{% /panel %}}
9. Please see the link below for one more example:
http://research.cs.wisc.edu/htcondor/tutorials/intl-grid-school-3/submit_first.html
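The `$(Cluster)` and `$(Process)` substitution described in step 7 can be sketched in a few lines of Python. This is only an illustration of the naming pattern, not how HTCondor itself expands macros: the helper names `expand_macros` and `expected_outputs` are hypothetical, and the sketch handles only the exact spellings `$(Cluster)` and `$(Process)`.

```python
def expand_macros(template, cluster, process):
    """Substitute $(Cluster) and $(Process) in a submit-file value."""
    return (template.replace("$(Cluster)", str(cluster))
                    .replace("$(Process)", str(process)))


def expected_outputs(template, cluster, queue_count):
    """List the expanded value for each job created by `Queue <queue_count>`."""
    return [expand_macros(template, cluster, process)
            for process in range(queue_count)]


if __name__ == "__main__":
    # Values taken from the example above: cluster 1013054 and "Queue 4".
    template = "OUTPUT/hello.out.$(Cluster).$(Process).txt"
    for name in expected_outputs(template, 1013054, 4):
        print(name)
```

Running it lists the four output file names, one per value of `$(Process)`, which is a quick way to check what to look for in the OUTPUT directory.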
Next: [Using Distributed Environment Modules on OSG]({{< relref "using_distributed_environment_modules_on_osg" >}})
+++
title = "Characteristics of an OSG friendly job"
description = "Characteristics of an OSG friendly job"
+++
The OSG is a Distributed High Throughput Computing (DHTC) environment,
which means that users can access compute cores on over 100 different
computing sites across the nation with a single job submission. This
also means that your jobs must fit a set of criteria in order to be
eligible to run on OSG. The table below provides rule-of-thumb
characteristics that can help you decide whether using OSG for a
given job is a viable option.
| Variable | Suggested Values |
| -------------------------------------- | ------------------------------------------------------------------------------------- |
| Memory/Process | <= 2GB |
| Type of job | serial (i.e. mostly single core) |
| Network traffic<br>(input or output files) | <= 2GB each side |
| Running Time | Relatively short (see "OSG Job Runtime" below) |
| Software Licensing | Open source, non-restrictive licensing that allows running code on 3rd party machines |
| Runtime Disk Usage | <= 10GB |
| Binary Type | Portable RHEL6/7 |
| Total CPU Time (of job workflow) | Large, typically >= 1000 hours |
### OSG Job Runtime
The relatively short runtime is necessary due to job pre-emption. Jobs
belonging to resource owners on the machine where your job is running
may pre-empt (or kill) your job unexpectedly. When this happens, your
job's progress is not automatically saved, and it will have to start
over from the beginning. For this reason, it is good practice to build
automatic checkpointing into your job, or break a large job into
multiple small jobs if it is at all possible.
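One way to break a large job into multiple small jobs is to give each job a slice of the total work, selected by an index passed on the command line. The sketch below is a minimal illustration under that assumption; the function name `chunk_bounds` and the idea of passing the job index as an argument are hypothetical and not part of any HCC or OSG tooling.

```python
def chunk_bounds(total_items, n_jobs, process):
    """Return the half-open range [start, stop) of items handled by one job.

    The items are spread as evenly as possible: the first
    (total_items % n_jobs) jobs each receive one extra item.
    """
    if not 0 <= process < n_jobs:
        raise ValueError("process must be in range(n_jobs)")
    base, extra = divmod(total_items, n_jobs)
    start = process * base + min(process, extra)
    stop = start + base + (1 if process < extra else 0)
    return start, stop


if __name__ == "__main__":
    # Example: 10 work items split across 4 short jobs.
    for process in range(4):
        print(process, chunk_bounds(10, 4, process))
```

Each small job then only loses its own slice of work if it is pre-empted, rather than the progress of the whole computation.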
Next: [How to submit an OSG Job with HTCondor]({{< relref "how_to_submit_an_osg_job_with_htcondor" >}})
+++
title = "How to submit an OSG job with HTCondor"
description = "How to submit an OSG job with HTCondor"
+++
{{% notice info%}}Jobs can be submitted to the OSG from Crane or Tusker, so
there is no need to log on to a different submit host or get a grid
certificate!
{{% /notice %}}
### What is HTCondor?
The [HTCondor](http://research.cs.wisc.edu/htcondor)
project provides software for scheduling individual applications and
workflows, and for sites to manage their resources. It is designed to
enable High Throughput Computing (HTC) on large collections of
distributed resources and serves as the job scheduler used on the OSG.
Jobs are submitted from either the Crane or Tusker login nodes to the
OSG using an HTCondor submission script. For those who are used to
submitting jobs with Slurm, there are a few key differences to be aware
of:
### When using HTCondor
- All files (scripts, code, executables, libraries, etc) that are
needed by the job are transferred to the remote compute site when
the job is scheduled. Therefore, all of the files required by the
job must be specified in the HTCondor submit script. Paths can be
absolute or relative to the local directory from which the job is
submitted. The main executable (specified on the `Executable` line
of the submit script) is transferred automatically with the job.
All other files need to be listed on the `transfer_input_files`
line (see example below).
- All files that are created by
the job on the remote host will be transferred automatically back to
the submit host when the job has completed. This includes
temporary/scratch and intermediate files that are not removed by
your job. If you do not want to keep these files, clean up the work
space on the remote host by removing these files before the job
exits (this can be done using a wrapper script, for example).
Specific output file names can be specified with the
`transfer_output_files` option. If these files do
not exist on the remote
host when the job exits, then the job will not complete successfully
(it will be placed in the *held* state).
- HTCondor scripts can queue
(submit) as many jobs as you like. All jobs queued from a single
submit script will be identical except for the `Arguments` used.
For example, a submit script can queue 5 jobs with a first set of
arguments (`Queue 5`) and 1 job with a second set of arguments
(`Queue`). When `Queue` is not followed by a number, it submits
1 job.
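The points above can be tied together with a small Python sketch that writes out a submit description. It is an illustration only: the helper `build_submit` and the file names `run.sh`, `params.dat`, `lookup.tbl` and `result.dat` are hypothetical, and the generated text is a minimal example rather than a recommended template.

```python
def build_submit(executable, input_files, output_files, queues):
    """Return the text of a simple HTCondor submit description.

    `queues` is a list of (arguments, count) pairs; each pair becomes an
    Arguments line followed by a Queue statement.
    """
    lines = [
        "Universe = vanilla",
        "Executable = " + executable,
        # Every file other than the executable must be listed explicitly.
        "transfer_input_files = " + ", ".join(input_files),
        # Only these files are brought back from the remote host.
        "transfer_output_files = " + ", ".join(output_files),
        "Output = job.$(Cluster).$(Process).out",
        "Error = job.$(Cluster).$(Process).err",
        "Log = job.$(Cluster).log",
    ]
    for arguments, count in queues:
        lines.append("Arguments = " + arguments)
        # A bare Queue submits one job; Queue N submits N jobs.
        lines.append("Queue" if count == 1 else "Queue {}".format(count))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = build_submit(
        "run.sh",
        ["params.dat", "lookup.tbl"],
        ["result.dat"],
        [("params.dat 10", 5), ("params.dat 20", 1)],
    )
    print(text, end="")
```

The printed text queues 5 jobs with the first set of arguments and 1 job with the second, matching the 5 + 1 pattern described in the last bullet.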
For more information and advanced usage, see the
[HTCondor Manual](http://research.cs.wisc.edu/htcondor/manual/v8.3/index.html).
### Creating an HTCondor Script
HTCondor, much like Slurm, needs a script that tells it what to run and
how to run it. The example below is a basic script, saved in a file
named 'applejob.txt', that can be used to handle most jobs submitted to
HTCondor.
{{% panel theme="info" header="Example of a HTCondor script" %}}
{{< highlight batch >}}
#with executable, stdout, stderr and log
Universe = vanilla
Executable = a.out
Arguments = file_name 12
Output = a.out.out
Error = a.out.err
Log = a.out.log
Queue
{{< /highlight >}}
{{% /panel %}}
The table below explains the various attributes/keywords used in the above script.
| Attribute/Keyword | Explanation |
| ----------------- | ----------------------------------------------------------------------------------------- |
| # | Lines starting with '#' are treated as comments by HTCondor. |
| Universe | is the runtime environment in which HTCondor runs the job. The vanilla universe is where most jobs should be run. |
| Executable | is the name of the executable you want to run on HTCondor. |
| Arguments | are the command line arguments for your program. For example, to run `ls -l /` on HTCondor, the Executable would be `ls` and the Arguments would be `-l /`. |
| Output | is the file where the information printed to stdout will be sent. |
| Error | is the file where the information printed to stderr will be sent. |
| Log | is the file where information about your HTCondor job will be sent, such as whether the job is running, whether it was halted or, if running in the standard universe, whether it was check-pointed or moved. |
| Queue | is the command to send the job to HTCondor's scheduler. |
Suppose you would like to submit a job, e.g. a Monte-Carlo simulation,
where the same program needs to be run several times with the same
parameters. The script above can be used with one modification:
give the `Queue` command the number of times the job must
be run (and hence queued in HTCondor). Thus if the `Queue` command is
changed to `Queue 5`, a.out will be run 5 times with the exact same
parameters.
In another scenario, if you would like to submit the same job but with
different parameters, HTCondor accepts files with multiple `Queue`
statements. Only the parameters that differ need to be
changed in the HTCondor script before each `Queue`.
Please see "A simple example of submitting an HTCondor job" for the
detailed use of `$(Process)`.
{{% panel theme="info" header="Another Example of a HTCondor script" %}}
{{< highlight batch >}}
#with executable, stdout, stderr and log
#and multiple Argument parameters
Universe = vanilla
Executable = a.out
Arguments = file_name 10
Output = a.out.$(Process).out
Error = a.out.$(Process).err
Log = a.out.$(Process).log
Queue
Arguments = file_name 20
Queue
Arguments = file_name 30
Queue
{{< /highlight >}}
{{% /panel %}}
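To make the effect of multiple `Queue` statements concrete, the sketch below walks through a simple submit description and lists which arguments each queued job would receive. It is a deliberately small reader, not a full parser of the HTCondor submit language: the function name `queued_jobs` is hypothetical, only `Arguments` and plain `Queue` or `Queue N` lines are interpreted, and it assumes that process numbers keep counting up across `Queue` statements within one file.

```python
def queued_jobs(submit_text):
    """Return [(process, arguments), ...] for a simple submit description."""
    arguments = ""
    process = 0
    jobs = []
    for raw in submit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=" in line:
            key, value = line.split("=", 1)
            if key.strip().lower() == "arguments":
                # The latest Arguments line applies to every later Queue.
                arguments = value.strip()
            continue
        words = line.split()
        if words[0].lower() == "queue":
            count = int(words[1]) if len(words) > 1 else 1
            for _ in range(count):
                jobs.append((process, arguments))
                process += 1
    return jobs


if __name__ == "__main__":
    example = """
    Universe = vanilla
    Executable = a.out
    Arguments = file_name 10
    Queue
    Arguments = file_name 20
    Queue
    Arguments = file_name 30
    Queue
    """
    for process, arguments in queued_jobs(example):
        print(process, arguments)
```

For the script shown above this lists three jobs, one per `Queue` statement, each with its own `Arguments` value.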
### How to Submit and View Your job
The steps below describe how to submit a job and other important job
management tasks that you may need in order to monitor and/or control
the submitted job:
1. How to submit a job to OSG - assuming that your HTCondor script is
saved in a file named applejob.txt
{{< highlight bash >}}[apple@login.crane ~] $ condor_submit applejob.txt{{< /highlight >}}
You will see output like the following after submitting the job
{{% panel theme="info" header="Example of condor_submit" %}}
Submitting job(s)
......
6 job(s) submitted to cluster 1013038
{{% /panel %}}
2. How to view your job status - to view the status of your
submitted jobs, use the following shell command
*Please note that by providing a user name as an argument to the
`condor_q` command you can limit the list of submitted jobs to the
ones that are owned by the named user*
{{< highlight bash >}}[apple@login.crane ~] $ condor_q apple{{< /highlight >}}
The code section below shows typical output. The column ST shows
the status of each job (H: Held and I: Idle
or waiting)
{{% panel theme="info" header="Example of condor_q" %}}
-- Schedd: login.crane.hcc.unl.edu : <129.93.227.113:9619?...
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1013034.4 apple 3/26 16:34 0+00:21:00 H 0 0.0 sjrun.py INPUT/INP
1013038.0 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
1013038.1 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
1013038.2 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
1013038.3 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
...
16 jobs; 0 completed, 0 removed, 12 idle, 0 running, 4 held, 0 suspended
{{% /panel %}}
3. How to release a job - in a few cases a job may be held for
reasons such as authentication failure or other non-fatal errors. In
those cases you may use the shell command below to release the job
from the held status so that it can be rescheduled by HTCondor.
*Release one job:*
{{< highlight bash >}}[apple@login.crane ~] $ condor_release 1013034.4{{< /highlight >}}
*Release all jobs of a user apple:*
{{< highlight bash >}}[apple@login.crane ~] $ condor_release apple{{< /highlight >}}
4. How to delete a submitted job - if you want to delete a submitted
job, you may use the shell commands listed below
*Delete one job:*
{{< highlight bash >}}[apple@login.crane ~] $ condor_rm 1013034.4{{< /highlight >}}
*Delete all jobs of a user apple:*
{{< highlight bash >}}[apple@login.crane ~] $ condor_rm apple{{< /highlight >}}
5. How to get help for an HTCondor command
You can use `man` to get a detailed explanation of an HTCondor command
{{% panel theme="info" header="Example of help of condor_q" %}}
[apple@glidein ~]$ man condor_q
{{% /panel %}}
{{% panel theme="info" header="Output of `man condor_q`" %}}
just-man-pages/condor_q(1) just-man-pages/condor_q(1)
Name
condor_q Display information about jobs in queue
Synopsis
condor_q [ -help ]
condor_q [ -debug ] [ -global ] [ -submitter submitter ] [ -name name ] [ -pool centralmanagerhost-
name[:portnumber] ] [ -analyze ] [ -run ] [ -hold ] [ -globus ] [ -goodput ] [ -io ] [ -dag ] [ -long ]
[ -xml ] [ -attributes Attr1 [,Attr2 ... ] ] [ -format fmt attr ] [ -autoformat[:tn,lVh] attr1 [attr2
...] ] [ -cputime ] [ -currentrun ] [ -avgqueuetime ] [ -jobads file ] [ -machineads file ] [ -stream-
results ] [ -wide ] [ {cluster | cluster.process | owner | -constraint expression ... } ]
Description
condor_q displays information about jobs in the Condor job queue. By default, condor_q queries the local
job queue but this behavior may be modified by specifying:
* the -global option, which queries all job queues in the pool
* a schedd name with the -name option, which causes the queue of the named schedd to be queried
{{% /panel %}}
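When many jobs are queued, it can help to summarize `condor_q` output by status. The sketch below counts jobs per ST value from text like the `condor_q` example above. It assumes the column layout shown on this page (ID, OWNER, SUBMITTED date and time, RUN_TIME, ST, ...); other `condor_q` versions or options may print a different layout, and the name `count_states` is hypothetical.

```python
from collections import Counter

# Status letters documented on this page; others are reported as-is.
STATUS_NAMES = {"H": "held", "R": "running", "I": "idle"}


def count_states(condor_q_text):
    """Count job rows per status in condor_q output of the form shown above."""
    counts = Counter()
    for line in condor_q_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        # Job rows start with an ID of the form <cluster>.<process>.
        cluster, dot, process = fields[0].partition(".")
        if not (cluster.isdigit() and dot and process.isdigit()):
            continue
        # fields: ID, OWNER, date, time, RUN_TIME, ST, ...
        state = fields[5]
        counts[STATUS_NAMES.get(state, state)] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = """\
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1013034.4 apple 3/26 16:34 0+00:21:00 H 0 0.0 sjrun.py INPUT/INP
1013038.0 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
1013038.1 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
"""
    print(count_states(sample))
```

In practice the text could come from saving the output of `condor_q` to a file and reading it back in.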
Next: [A simple example of submitting an HTCondor job]({{< relref "a_simple_example_of_submitting_an_htcondor_job" >}})
+++
title = "Using Distributed Environment Modules on OSG"
description = "Using Distributed Environment Modules on OSG"
+++
Many commonly used software packages and libraries are provided on the
OSG through the `module` command. OSG modules are made available
through the OSG Application Software Installation Service (OASIS). The
set of modules provided on OSG can differ from those on the HCC
clusters. To switch to the OSG modules environment on an HCC machine:
{{< highlight bash >}}
[apple@login.crane~]$ source osg_oasis_init
{{< /highlight >}}
Use the `module avail` command to see what software and libraries are
available:
{{< highlight bash >}}
[apple@login.crane~]$ module avail
------------------- /cvmfs/oasis.opensciencegrid.org/osg/modules/modulefiles/Core --------------------
abyss/2.0.2 gnome_libs/1.0 pegasus/4.7.1
ant/1.9.4 gnuplot/4.6.5 pegasus/4.7.3
ANTS/1.9.4 graphviz/2.38.0 pegasus/4.7.4 (D)
ANTS/2.1.0 (D) grass/6.4.4 phenix/1.10
apr/1.5.1 gromacs/4.6.5 poppler/0.24.1 (D)
aprutil/1.5.3 gromacs/5.0.0 (D) poppler/0.32
arc-lite/2015 gromacs/5.0.5.cuda povray/3.7
atlas/3.10.1 gromacs/5.0.5 proj/4.9.1
atlas/3.10.2 (D) gromacs/5.1.2-cuda proot/2014
autodock/4.2.6 gsl/1.16 protobuf/2.5
{{< /highlight >}}
Loading modules is done with the `module load` command:
{{< highlight bash >}}
[apple@login.crane~]$ module load python/2.7
{{< /highlight >}}
There are two things required in order to use modules in your HTCondor
job.
1. Create a *wrapper script* for the job. This script will be the
executable for your job and will load the module before running the
main application.
2. Include the following requirements in the HTCondor submission
script:
{{< highlight batch >}}Requirements = (HAS_MODULES =?= TRUE){{< /highlight >}}
or
{{< highlight batch >}}Requirements = [Other requirements ] && (HAS_MODULES =?= TRUE){{< /highlight >}}
### A simple example using modules on OSG
The following example will demonstrate how to use modules on OSG with an
R script that implements a Monte-Carlo estimation of Pi (`mcpi.R`).
First, create a file called `mcpi.R`:
{{% panel theme="info" header="mcpi.R" %}}{{< highlight R >}}
montecarloPi <- function(trials) {
  count = 0
  for(i in 1:trials) {
    if((runif(1,0,1)^2 + runif(1,0,1)^2)<1) {
      count = count + 1
    }
  }
  return((count*4)/trials)
}
montecarloPi(1000)
{{< /highlight >}}{{% /panel %}}
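For readers more familiar with Python than R, the same estimation can be sketched as follows. This is only an illustration of the method and is not used by the job in this example: points are drawn uniformly in the unit square, and the fraction that falls inside the quarter circle of radius 1 approximates Pi/4. The function name `montecarlo_pi` and its `seed` parameter are hypothetical.

```python
import random


def montecarlo_pi(trials, seed=None):
    """Estimate Pi from `trials` random points in the unit square."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        # The point lies inside the quarter circle when x^2 + y^2 < 1.
        if x * x + y * y < 1:
            count += 1
    return 4.0 * count / trials


if __name__ == "__main__":
    print(montecarlo_pi(1000))
```

With only 1000 trials a single estimate is fairly noisy, which is why the example averages the results of many independent jobs.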
Next, create a wrapper script called `R-wrapper.sh` to load the required
modules (`R` and `libgfortran`), and execute the R script:
{{% panel theme="info" header="R-wrapper.sh" %}}{{< highlight bash >}}
#!/bin/bash
EXPECTED_ARGS=1
if [ $# -ne $EXPECTED_ARGS ]; then
  echo "Usage: R-wrapper.sh file.R"
  exit 1
else
  module load R
  module load libgfortran
  Rscript "$1"
fi
{{< /highlight >}}{{% /panel %}}
This script takes the name of the R script (`mcpi.R`) as its argument
and executes it in batch mode (using the `Rscript` command) after
loading the `R` and `libgfortran` modules.
Make the script executable:
{{< highlight bash >}}[apple@login.crane~]$ chmod a+x R-wrapper.sh{{< /highlight >}}
Finally, create the HTCondor submit script, `R.submit`:
{{% panel theme="info" header="R.submit" %}}{{< highlight batch >}}
universe = vanilla
log = mcpi.log.$(Cluster).$(Process)
error = mcpi.err.$(Cluster).$(Process)
output = mcpi.out.$(Cluster).$(Process)
executable = R-wrapper.sh
transfer_input_files = mcpi.R
arguments = mcpi.R
Requirements = (HAS_MODULES =?= TRUE)
queue 100
{{< /highlight >}}{{% /panel %}}
This script will queue 100 identical jobs to estimate the value of Pi.
Notice that the wrapper script is transferred automatically with the
job because it is listed as the executable. However, the R script
(`mcpi.R`) must be listed after `transfer_input_files` in order to be
transferred with the job.
Submit the jobs with the `condor_submit` command:
{{< highlight bash >}}[apple@login.crane~]$ condor_submit R.submit{{< /highlight >}}
Check on the status of your jobs with `condor_q`:
{{< highlight bash >}}[apple@login.crane~]$ condor_q{{< /highlight >}}
When your jobs have completed, find the average estimate for Pi from all
100 jobs:
{{< highlight bash >}}
[apple@login.crane~]$ grep "\[1\]" mcpi.out.* | awk '{sum += $2} END { print "Average =", sum/NR}'
Average = 3.13821
{{< /highlight >}}
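The same averaging step can also be sketched in Python, which may be easier to extend than the one-line pipeline. The sketch assumes each output file contains a line of the form `[1] <value>`, as printed by R for the result of `montecarloPi(1000)`; the helper names `parse_estimates` and `average` are hypothetical.

```python
import glob
import re

# Matches R's printed result line, e.g. "[1] 3.14".
R_VALUE = re.compile(r"^\[1\]\s+(\S+)")


def parse_estimates(texts):
    """Extract the numeric value from every "[1] <value>" line."""
    values = []
    for text in texts:
        for line in text.splitlines():
            match = R_VALUE.match(line.strip())
            if match:
                values.append(float(match.group(1)))
    return values


def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("no estimates found")
    return sum(values) / len(values)


if __name__ == "__main__":
    texts = []
    for path in sorted(glob.glob("mcpi.out.*")):
        with open(path) as handle:
            texts.append(handle.read())
    estimates = parse_estimates(texts)
    if estimates:
        print("Average =", average(estimates))
    else:
        print("No estimates found in mcpi.out.* files")
```

Run it from the directory that holds the `mcpi.out.*` files once the jobs have finished.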
static/images/17044917.png

193 KiB
