diff --git a/content/osg/_index.md b/content/osg/_index.md new file mode 100644 index 0000000000000000000000000000000000000000..034162b2d822a44bfe370ae3fc52b5ac79229e31 --- /dev/null +++ b/content/osg/_index.md @@ -0,0 +1,30 @@ ++++ +title = "The Open Science Grid" +description = "How to utilize the Open Science Grid (OSG)." +weight = "40" ++++ + +If you find that you are not getting access to the volume of computing +resources needed for your research through HCC, you might also consider +submitting your jobs to the Open Science Grid (OSG). + +### What is the Open Science Grid? + +The [Open Science Grid](http://opensciencegrid.org) advances +science through open distributed computing. The OSG is a +multi-disciplinary partnership to federate local, regional, community +and national cyber infrastructures to meet the needs of research and +academic communities at all scales. HCC participates in the OSG as a +resource provider and a resource user. We provide HCC users with a +gateway to running jobs on the OSG. + +The map below shows the Open Science Grid sites located across the U.S. 
+ +{{< figure src="/images/17044917.png" >}} + +This help document is divided into four sections, namely: + +- [Characteristics of an OSG friendly job]({{< relref "characteristics_of_an_osg_friendly_job" >}}) +- [How to submit an OSG Job with HTCondor]({{< relref "how_to_submit_an_osg_job_with_htcondor" >}}) +- [A simple example of submitting an HTCondor job]({{< relref "a_simple_example_of_submitting_an_htcondor_job" >}}) +- [Using Distributed Environment Modules on OSG]({{< relref "using_distributed_environment_modules_on_osg" >}}) diff --git a/content/osg/a_simple_example_of_submitting_an_htcondor_job.md b/content/osg/a_simple_example_of_submitting_an_htcondor_job.md new file mode 100644 index 0000000000000000000000000000000000000000..9398f48c5eb20ffd3b375f277cb475e7469f3f27 --- /dev/null +++ b/content/osg/a_simple_example_of_submitting_an_htcondor_job.md @@ -0,0 +1,167 @@ ++++ +title = "A simple example of submitting an HTCondor job" +description = "A simple example of submitting an HTCondor job." ++++ + +This page describes a complete example of submitting an HTCondor job. + +1. SSH to Tusker or Crane + + {{% panel theme="info" header="ssh command" %}} + [apple@localhost]ssh apple@crane.unl.edu + {{% /panel %}} + + {{% panel theme="info" header="output" %}} + [apple@login.crane~]$ + {{% /panel %}} + +2. 
Write a simple Python program in a file "hello.py" that we wish to + run using HTCondor + + {{% panel theme="info" header="edit a python code named 'hello.py'" %}} + [apple@login.crane ~]$ vim hello.py + {{% /panel %}} + + Then in the edit window, please input the code below: + + {{% panel theme="info" header="hello.py" %}} + #!/usr/bin/env python + import sys + import time + i=1 + while i<=6: + print i + i+=1 + time.sleep(1) + print 2**8 + print "hello world received argument = " +sys.argv[1] + {{% /panel %}} + + This program will print 1 through 6 on stdout, then print the number + 256, and finally print `hello world received argument = <Command + Line Argument Sent to the hello.py>` + + + +3. Write an HTCondor submit script named "hello.submit" + + {{% panel theme="info" header="hello.submit" %}} + Universe = vanilla + Executable = hello.py + Output = OUTPUT/hello.out.$(Cluster).$(Process).txt + Error = OUTPUT/hello.error.$(Cluster).$(Process).txt + Log = OUTPUT/hello.log.$(Cluster).$(Process).txt + notification = Never + Arguments = $(Process) + PeriodicRelease = ((JobStatus==5) && (CurrentTime - EnteredCurrentStatus) > 30) + OnExitRemove = (ExitStatus == 0) + Queue 4 + {{% /panel %}} + +4. Create an OUTPUT directory to receive all output files that + are generated by your job (the OUTPUT folder is used in the submit + script above) + + {{% panel theme="info" header="create output directory" %}} + [apple@login.crane ~]$ mkdir OUTPUT + {{% /panel %}} + +5. Submit your job + + {{% panel theme="info" header="condor_submit" %}} + [apple@login.crane ~]$ condor_submit hello.submit + {{% /panel %}} + + {{% panel theme="info" header="Output of submit" %}} + Submitting job(s) + + .... + 4 job(s) submitted to cluster 1013054. + {{% /panel %}} + +6. 
Check the status of your jobs with `condor_q` + + {{% panel theme="info" header="condor_q" %}} + [apple@login.crane ~]$ condor_q + {{% /panel %}} + + {{% panel theme="info" header="Output of `condor_q`" %}} + -- Schedd: login.crane.hcc.unl.edu : <129.93.227.113:9619?... + ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD + 720587.0 logan 12/15 10:48 33+14:41:17 H 0 0.0 continuous.cron 20 + 720588.0 logan 12/15 10:48 200+02:40:08 H 0 0.0 checkprogress.cron + 1012864.0 jthiltge 2/15 16:48 0+00:00:00 H 0 0.0 test.sh + 1013054.0 jennyshao 4/3 17:58 0+00:00:00 R 0 0.0 hello.py 0 + 1013054.1 jennyshao 4/3 17:58 0+00:00:00 R 0 0.0 hello.py 1 + 1013054.2 jennyshao 4/3 17:58 0+00:00:00 I 0 0.0 hello.py 2 + 1013054.3 jennyshao 4/3 17:58 0+00:00:00 I 0 0.0 hello.py 3 + 7 jobs; 0 completed, 0 removed, 0 idle, 4 running, 3 held, 0 suspended + {{% /panel %}} + + Listed below are the three job statuses that appear in the ST column + of the above output + + | Symbol | Representation | + |--------|------------------| + | H | Held | + | R | Running | + | I | Idle and waiting | + + +7. Explanation of `$(Cluster)` and `$(Process)` in the HTCondor script + + `$(Cluster)` and `$(Process)` are variables that are available in the + variable name space of the HTCondor script. `$(Cluster)` is the + prefix of your job ID, and `$(Process)` varies from `0` through one + less than the number of jobs requested with `Queue`. If your job is a + single job, then `$(Cluster) =` `<job ID>`; otherwise, your job ID is + the combination of `$(Cluster)` and `$(Process)`. + + In this example, `$(Cluster)`="1013054" and `$(Process)` varies from "0" + to "3" for the above HTCondor script. + In the majority of cases one will use these variables to modify + the behavior of each individual task of the HTCondor submission; for + example, one may vary the input/output files or parameters for the run + program. In this example we are simply passing `$(Process)` as an + argument, which is read as `sys.argv[1]` in `hello.py`. 
+ The lines of interest for this discussion from the HTCondor + script "hello.submit" are listed below in the code section: + + {{% panel theme="info" header="for `$(Process)`" %}} + Output = OUTPUT/hello.out.$(Cluster).$(Process).txt + Arguments = $(Process) + Queue 4 + {{% /panel %}} + + The line of interest for this discussion from file "hello.py" is + listed in the code section below: + + {{% panel theme="info" header="for `$(Process)`" %}} + print "hello world received argument = " +sys.argv[1] + {{% /panel %}} + +8. Viewing the results of your job + + After your job is completed you may use the Linux "cat" or "vim" + command to view the job output. + + For example, in the file name `hello.out.1013054.2.txt`, "1013054" is + `$(Cluster)` and "2" is `$(Process)`. The output looks like this: + + **example of one output file "hello.out.1013054.2.txt"** + {{% panel theme="info" header="example of one output file `hello.out.1013054.2.txt`" %}} + 1 + 2 + 3 + 4 + 5 + 6 + 256 + hello world received argument = 2 + {{% /panel %}} + +9. Please see the link below for one more example: + + http://research.cs.wisc.edu/htcondor/tutorials/intl-grid-school-3/submit_first.html + +Next: [Using Distributed Environment Modules on OSG]({{< relref "using_distributed_environment_modules_on_osg" >}}) diff --git a/content/osg/characteristics_of_an_osg_friendly_job.md b/content/osg/characteristics_of_an_osg_friendly_job.md new file mode 100644 index 0000000000000000000000000000000000000000..c99fc6ed6438446c2b62834e7aa9b232cf24520d --- /dev/null +++ b/content/osg/characteristics_of_an_osg_friendly_job.md @@ -0,0 +1,39 @@ ++++ +title = "Characteristics of an OSG friendly job" +description = "Characteristics of an OSG friendly job" ++++ + +The OSG is a Distributed High Throughput Computing (DHTC) environment, +which means that users can access compute cores on over 100 different +computing sites across the nation with a single job submission. 
This +also means that your jobs must fit a set of criteria in order to be +eligible to run on OSG. The list below provides some rule of thumb +characteristics that can help us make a decision if using OSG for a +given job is a viable option. + + +| Characteristics of an OSG friendly job | +| -------------------------------------- | + +| Variable | Suggested Values | +| -------------------------------------- | ------------------------------------------------------------------------------------- | +| Memory/Process | <= 2GB | +| Type of job | serial (i.e. mostly single core) | +| Network traffic<br>(input or output files) | <= 2GB each side | +| Running Time | Open Source non restrictive licensing that allows running code on 3rd party machines | +| Runtime Disk Usage | <= 10GB | +| Binary Type | Portable RHEL6/7 | +| Total CPU Time (of job workflow) | Large, typically >= 1000 hours | + +### OSG Job Runtime + +The relatively short runtime is necessary due to job pre-emption. Jobs +belonging to resource owners on the machine where your job is running +may pre-empt (or kill) your job unexpectedly. When this happens, your +job's progress is not automatically saved, and it will have to start +over from the beginning. For this reason, it is good practice to build +automatic checkpointing into your job, or break a large job into +multiple small jobs if it is at all possible. 
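The checkpointing advice above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the file name `checkpoint.json`, the helper names, and the toy "work" (summing integers) are all hypothetical and not part of the HCC or OSG tooling. It shows only the save/resume logic; making the checkpoint file available to a job that restarts on a different machine depends on your HTCondor file transfer settings, which are not shown here.

```python
import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical file name, chosen for this sketch


def load_checkpoint(path=CHECKPOINT_FILE):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def save_checkpoint(state, path=CHECKPOINT_FILE):
    """Write the state to a temporary file, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    # Renaming replaces the old checkpoint in one step, so a job killed
    # mid-write does not leave a half-written checkpoint behind.
    os.replace(tmp_path, path)


def run(total_steps, checkpoint_every=100, path=CHECKPOINT_FILE):
    """Sum the integers 0..total_steps-1, resuming from the last checkpoint."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], total_steps):
        state["total"] += step  # stand-in for one unit of real work
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state["total"]
```

If the job is pre-empted and started again, `run` picks up at the last saved `next_step` instead of step 0, so at most `checkpoint_every` steps of work are repeated.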
+ +Next: [How to submit an OSG Job with HTCondor]({{< relref "how_to_submit_an_osg_job_with_htcondor" >}}) + diff --git a/content/osg/how_to_submit_an_osg_job_with_htcondor.md b/content/osg/how_to_submit_an_osg_job_with_htcondor.md new file mode 100644 index 0000000000000000000000000000000000000000..2e962d0e9a4692f5108754ebd574aa20a33307fe --- /dev/null +++ b/content/osg/how_to_submit_an_osg_job_with_htcondor.md @@ -0,0 +1,216 @@ ++++ +title = "How to submit an OSG job with HTCondor" +description = "How to submit an OSG job with HTCondor" ++++ + +{{% notice info%}}Jobs can be submitted to the OSG from Crane or Tusker, so +there is no need to logon to a different submit host or get a grid +certificate! +{{% /notice %}} + +### What is HTCondor? + +The [HTCondor](http://research.cs.wisc.edu/htcondor) +project provides software to schedule individual applications, +workflows, and for sites to manage resources. It is designed to enable +High Throughput Computing (HTC) on large collections of distributed +resources for users and serves as the job scheduler used on the OSG. + Jobs are submitted from either the Crane or Tusker login nodes to the +OSG using an HTCondor submission script. For those who are used to +submitting jobs with SLURM, there are a few key differences to be aware +of: + +### When using HTCondor + +- All files (scripts, code, executables, libraries, etc) that are + needed by the job are transferred to the remote compute site when + the job is scheduled. Therefore, all of the files required by the + job must be specified in the HTCondor submit script. Paths can be + absolute or relative to the local directory from which the job is + submitted. The main executable (specified on the `Executable` line + of the submit script) is transferred automatically with the job. + All other files need to be listed on the `transfer_input_files` + line (see example below). 
+- All files that are created by + the job on the remote host will be transferred automatically back to + the submit host when the job has completed. This includes + temporary/scratch and intermediate files that are not removed by + your job. If you do not want to keep these files, clean up the work + space on the remote host by removing these files before the job + exits (this can be done using a wrapper script for example). + Specific output file names can be specified with the + `transfer_output_files` option. If these files do + not exist on the remote + host when the job exits, then the job will not complete successfully + (it will be placed in the *held* state). +- HTCondor scripts can queue + (submit) as many jobs as you like. All jobs queued from a single + submit script will be identical except for the `Arguments` used. + For example, a submit script might queue 5 jobs with the first + set of specified arguments, and 1 job with the second set of + arguments. By default, `Queue` when it is not followed by a number + will submit 1 job. + +For more information and advanced usage, see the +[HTCondor Manual](http://research.cs.wisc.edu/htcondor/manual/v8.3/index.html). + +### Creating an HTCondor Script + +HTCondor, much like Slurm, needs a script to tell it how to do what the +user wants. The example below is a basic script, saved in a file named, +say, 'applejob.txt', that can be used to handle most jobs submitted to +HTCondor. + +{{% panel theme="info" header="Example of a HTCondor script" %}} +{{< highlight batch >}} +#with executable, stdout, stderr and log +Universe = vanilla +Executable = a.out +Arguments = file_name 12 +Output = a.out.out +Error = a.out.err +Log = a.out.log +Queue +{{< /highlight >}} +{{% /panel %}} + +The table below explains the various attributes/keywords used in the above script. 
+ +| Attribute/Keyword | Explanation | +| ----------------- | ----------------------------------------------------------------------------------------- | +| # | Lines starting with '#' are considered as comments by HTCondor. | +| Universe | is the way HTCondor manages different ways it can run, or what is called in the HTCondor documentation, a runtime environment. The vanilla universe is where most jobs should be run. | +| Executable | is the name of the executable you want to run on HTCondor. | +| Arguments | are the command line arguments for your program. For example, if one was to run `ls -l /` on HTCondor. The Executable would be `ls` and the Arguments would be `-l /`. | +| Output | is the file where the information printed to stdout will be sent. | +| Error | is the file where the information printed to stderr will be sent. | +| Log | is the file where information about your HTCondor job will be sent. Information like if the job is running, if it was halted or, if running in the standard universe, if the file was check-pointed or moved. | +| Queue | is the command to send the job to HTCondor's scheduler. | + + +Suppose you would like to submit a job e.g. a Monte-Carlo simulation, +where the same program needs to be run several times with the same +parameters the script above can be used with the following modification. + +Modify the `Queue` command by giving it the number of times the job must +be run (and hence queued in HTCondor). Thus if the `Queue` command is +changed to `Queue 5`, a.out will be run 5 times with the exact same +parameters. + +In another scenario if you would like to submit the same job but with +different parameters, HTCondor accepts files with multiple `Queue` +statements. Only the parameters that need to be changed should be +changed in the HTCondor script before calling the `Queue`. 
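When there are many parameter sets, a submit file with one `Queue` statement per set of `Arguments` can also be generated rather than typed by hand. The sketch below is one possible way to do that in Python; the helper name `build_submit` is hypothetical, and it only emits the submit commands already described in the table above (`Universe`, `Executable`, `Arguments`, `Output`, `Error`, `Log`, `Queue`).

```python
def build_submit(executable, argument_sets, universe="vanilla"):
    """Return submit-file text with one Queue statement per argument set.

    `build_submit` is a hypothetical helper for illustration only; it is
    not part of HTCondor.
    """
    lines = [
        "Universe   = " + universe,
        "Executable = " + executable,
        # $(Process) keeps the files of the individual jobs apart.
        "Output     = " + executable + ".$(Process).out",
        "Error      = " + executable + ".$(Process).err",
        "Log        = " + executable + ".$(Process).log",
    ]
    for arguments in argument_sets:
        # Only the setting that changes is repeated before each Queue.
        lines.append("Arguments  = " + arguments)
        lines.append("Queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_submit("a.out", ["file_name 10", "file_name 20", "file_name 30"]))
```

Writing the returned text to a file (for example `applejob.txt`) gives a script that can be passed to `condor_submit` in the usual way.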
+ +Please see "A simple example of submitting an HTCondor job" on the next +page for the detailed use of `$(Process)`. + +{{% panel theme="info" header="Another Example of a HTCondor script" %}} +{{< highlight batch >}} +#with executable, stdout, stderr and log +#and multiple Argument parameters +Universe = vanilla +Executable = a.out +Arguments = file_name 10 +Output = a.out.$(Process).out +Error = a.out.$(Process).err +Log = a.out.$(Process).log +Queue +Arguments = file_name 20 +Queue +Arguments = file_name 30 +Queue +{{< /highlight >}} +{{% /panel %}} + +### How to Submit and View Your job + +The steps below describe how to submit a job and other important job +management tasks that you may need in order to monitor and/or control +the submitted job: + +1. How to submit a job to OSG - assuming that you saved your HTCondor + script in a file named applejob.txt + + {{< highlight bash >}}[apple@login.crane ~] $ condor_submit applejob.txt{{< /highlight >}} + + You will see the following output after submitting the job + {{% panel theme="info" header="Example of condor_submit" %}} + Submitting job(s) + ...... + 6 job(s) submitted to cluster 1013038 + {{% /panel %}} + +2. How to view your job status - to view the job status of your + submitted jobs use the following shell command + *Please note that by providing a user name as an argument to the + `condor_q` command you can limit the list of submitted jobs to the + ones that are owned by the named user* + + + {{< highlight bash >}}[apple@login.crane ~] $ condor_q apple{{< /highlight >}} + + The code section below shows a typical output. You may notice that + the column ST represents the status of the job (H: Held and I: Idle + or waiting) + + {{% panel theme="info" header="Example of condor_q" %}} + -- Schedd: login.crane.hcc.unl.edu : <129.93.227.113:9619?... 
+ ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD + 1013034.4 apple 3/26 16:34 0+00:21:00 H 0 0.0 sjrun.py INPUT/INP + 1013038.0 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP + 1013038.1 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP + 1013038.2 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP + 1013038.3 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP + ... + 16 jobs; 0 completed, 0 removed, 12 idle, 0 running, 4 held, 0 suspended + {{% /panel %}} + +3. How to release a job - in a few cases a job may get held because of + reasons such as authentication failure or other non-fatal errors, in + those cases you may use the shell command below to release the job + from the held status so that it can be rescheduled by the HTCondor. + + *Release one job:* + {{< highlight bash >}}[apple@login.crane ~] $ condor_release 1013034.4{{< /highlight >}} + + *Release all jobs of a user apple:* + {{< highlight bash >}}[apple@login.crane ~] $ condor_release apple{{< /highlight >}} + +4. How to delete a submitted job - if you want to delete a submitted + job you may use the shell commands as listed below + + *Delete one job:* + {{< highlight bash >}}[apple@login.crane ~] $ condor_rm 1013034.4{{< /highlight >}} + + *Delete all jobs of a user apple:* + {{< highlight bash >}}[apple@login.crane ~] $ condor_rm apple{{< /highlight >}} + +5. 
How to get help for an HTCondor command + + You can use `man` to get a detailed explanation of an HTCondor command + + {{% panel theme="info" header="Example of help of condor_q" %}} + [apple@glidein ~]man condor_q + {{% /panel %}} + + {{% panel theme="info" header="Output of `man condor_q`" %}} + just-man-pages/condor_q(1) just-man-pages/condor_q(1) + Name + condor_q Display information about jobs in queue + Synopsis + condor_q [ -help ] + condor_q [ -debug ] [ -global ] [ -submitter submitter ] [ -name name ] [ -pool centralmanagerhost- + name[:portnumber] ] [ -analyze ] [ -run ] [ -hold ] [ -globus ] [ -goodput ] [ -io ] [ -dag ] [ -long ] + [ -xml ] [ -attributes Attr1 [,Attr2 ... ] ] [ -format fmt attr ] [ -autoformat[:tn,lVh] attr1 [attr2 + ...] ] [ -cputime ] [ -currentrun ] [ -avgqueuetime ] [ -jobads file ] [ -machineads file ] [ -stream- + results ] [ -wide ] [ {cluster | cluster.process | owner | -constraint expression ... } ] + Description + condor_q displays information about jobs in the Condor job queue. 
By default, condor_q queries the local + job queue but this behavior may be modified by specifying: + * the -global option, which queries all job queues in the pool + * a schedd name with the -name option, which causes the queue of the named schedd to be queried + {{% /panel %}} + + + Next: [A simple example of submitting an HTCondor job]({{< relref "a_simple_example_of_submitting_an_htcondor_job" >}}) diff --git a/content/osg/using_distributed_environment_modules_on_osg.md b/content/osg/using_distributed_environment_modules_on_osg.md new file mode 100644 index 0000000000000000000000000000000000000000..bcad43ded8fd9da4e62eefec95c7f70100fc29f7 --- /dev/null +++ b/content/osg/using_distributed_environment_modules_on_osg.md @@ -0,0 +1,138 @@ ++++ +title = "Using Distributed Environment Modules on OSG" +description = "Using Distributed Environment Modules on OSG" ++++ + +Many commonly used software packages and libraries are provided on the +OSG through the `module` command. OSG modules are made available +through the OSG Application Software Installation Service (OASIS). The +set of modules provided on OSG can differ from those on the HCC +clusters. 
To switch to the OSG modules environment on an HCC machine: + +{{< highlight bash >}} +[apple@login.crane~]$ source osg_oasis_init +{{< /highlight >}} + +Use the module avail command to see what software and libraries are +available: + +{{< highlight bash >}} +[apple@login.crane~]$ module avail +------------------- /cvmfs/oasis.opensciencegrid.org/osg/modules/modulefiles/Core -------------------- + + abyss/2.0.2 gnome_libs/1.0 pegasus/4.7.1 + ant/1.9.4 gnuplot/4.6.5 pegasus/4.7.3 + ANTS/1.9.4 graphviz/2.38.0 pegasus/4.7.4 (D) + ANTS/2.1.0 (D) grass/6.4.4 phenix/1.10 + apr/1.5.1 gromacs/4.6.5 poppler/0.24.1 (D) + aprutil/1.5.3 gromacs/5.0.0 (D) poppler/0.32 + arc-lite/2015 gromacs/5.0.5.cuda povray/3.7 + atlas/3.10.1 gromacs/5.0.5 proj/4.9.1 + atlas/3.10.2 (D) gromacs/5.1.2-cuda proot/2014 + autodock/4.2.6 gsl/1.16 protobuf/2.5 +{{< /highlight >}} + +Loading modules is done with the `module load` command: + +{{< highlight bash >}} +[apple@login.crane~]$ module load python/2.7 +{{< /highlight >}} + +There are two things required in order to use modules in your HTCondor +job. + +1. Create a *wrapper script* for the job. This script will be the + executable for your job and will load the module before running the + main application. +2. Include the following requirements in the HTCondor submission + script: + + {{< highlight batch >}}Requirements = (HAS_MODULES =?= TRUE){{< /highlight >}} + + or + + {{< highlight batch >}}Requirements = [Other requirements ] && (HAS_MODULES =?= TRUE){{< /highlight >}} + +### A simple example using modules on OSG + +The following example will demonstrate how to use modules on OSG with an +R script that implements a Monte-Carlo estimation of Pi (`mcpi.R`). 
+ +First, create a file called `mcpi.R`: + +{{% panel theme="info" header="mcpi.R" %}}{{< highlight R >}} +montecarloPi <- function(trials) { + count = 0 + for(i in 1:trials) { + if((runif(1,0,1)^2 + runif(1,0,1)^2)<1) { + count = count + 1 + } + } + return((count*4)/trials) +} + +montecarloPi(1000) +{{< /highlight >}}{{% /panel %}} + +Next, create a wrapper script called `R-wrapper.sh` to load the required +modules (`R` and `libgfortran`), and execute the R script: + +{{% panel theme="info" header="R-wrapper.sh" %}}{{< highlight bash >}} +#!/bin/bash + +EXPECTED_ARGS=1 + +if [ $# -ne $EXPECTED_ARGS ]; then + echo "Usage: R-wrapper.sh file.R" + exit 1 +else + module load R + module load libgfortran + Rscript $1 +fi +{{< /highlight >}}{{% /panel %}} + +This script takes the name of the R script (`mcpi.R`) as its argument +and executes it in batch mode (using the `Rscript` command) after +loading the `R` and `libgfortran` modules. + +Make the script executable: + +{{< highlight bash >}}[apple@login.crane~]$ chmod a+x R-wrapper.sh{{< /highlight >}} + +Finally, create the HTCondor submit script, `R.submit`: + +{{% panel theme="info" header="R.submit" %}}{{< highlight batch >}} +universe = vanilla +log = mcpi.log.$(Cluster).$(Process) +error = mcpi.err.$(Cluster).$(Process) +output = mcpi.out.$(Cluster).$(Process) +executable = R-wrapper.sh +transfer_input_files = mcpi.R +arguments = mcpi.R + +Requirements = (HAS_MODULES =?= TRUE) +queue 100 +{{< /highlight >}}{{% /panel %}} + +This script will queue 100 identical jobs to estimate the value of Pi. +Notice that the wrapper script is transferred automatically with the +job because it is listed as the executable. However, the R script +(`mcpi.R`) must be listed after `transfer_input_files` in order to be +transferred with the job. 
+ +Submit the jobs with the `condor_submit` command: + +{{< highlight bash >}}[apple@login.crane~]$ condor_submit R.submit{{< /highlight >}} + +Check on the status of your jobs with `condor_q`: + +{{< highlight bash >}}[apple@login.crane~]$ condor_q{{< /highlight >}} + +When your jobs have completed, find the average estimate for Pi from all +100 jobs: + +{{< highlight bash >}} +[apple@login.crane~]$ grep -F "[1]" mcpi.out.* | awk '{sum += $2} END { print "Average =", sum/NR}' +Average = 3.13821 +{{< /highlight >}} diff --git a/static/images/17044917.png b/static/images/17044917.png new file mode 100644 index 0000000000000000000000000000000000000000..62dcb2f6f2f512a16965d6d1ac5776a1fb591dc4 Binary files /dev/null and b/static/images/17044917.png differ