+++
title = "Sandstone"
description = "How to use HCC's sandstone environment"
weight = "45"
+++
### Overview
The HCC Sandstone environment is a GUI interface to the Crane cluster featuring a file browser, a text editor, a web terminal, and a SLURM script helper.
To log in to the Sandstone environment, go to [crane.unl.edu](https://crane.unl.edu) in your web browser and sign in using your HCC login credentials and Duo authentication.
Upon login, you will land at the File Browser.
### File Browser
The file browser allows you to view, access, and transfer files on Crane. On the left side you will have your available spaces, both your home and work directories. In the upper right of the page, you have buttons to upload files, create a file, and create a directory.
{{< figure src="/images/SandstonefileBrowserOver.png">}}
Clicking on either box under "My Spaces" will change your current directory to either your home or work directory and display your user/group usage and quotas. You can then navigate directories by clicking through them in a similar manner as you would with Windows or MacOS.
{{< figure src="/images/SandstonefileOptions.png">}}
Clicking on a file or directory will bring up its permissions and available actions, such as editing, duplicating, moving, deleting, or downloading it.
### Editor
The editor is a basic text editor that lets you open and edit multiple files at once. A small file browser on the left side provides access to more files, and similar file actions are available above this mini file browser.
{{< figure src="/images/Sandstoneeditor.png">}}
Like most text editors, basic functions exist to undo and redo changes, find and replace, and most importantly, to save the file.
{{< figure src="/images/SandstoneedtiorDropDown.png">}}
### Terminal
The terminal gives you access to the Linux command line on Crane, similar to what you would have if you connected to Crane directly via SSH. Once past the login and quote screen, you can enter commands and interact as you would with a standard terminal.
{{< figure src="/images/SandstoneTerminal.png">}}
### Slurm Assist
Slurm Assist is a tool to help create and run SLURM submit scripts. First, select a base profile from the profile dropdown menu; the corresponding options and directives will appear automatically. The options are editable to better fit your specific job, with more details available in our submitting jobs documentation. After the directives are filled out, add the commands that start your job in the script section. To save the job, select 'save script for later' and save the script in a known location.
{{< figure src="/images/SandstoneSASettings.png">}}
From here, you can also schedule the script you just created by selecting "Schedule Job". A confirmation will appear with the Job ID and instructions on how to view the status of your job.
{{< figure src="/images/SandstoneJobConf.png">}}
{{< figure src="/images/SandstoneSAStatus.png">}}
You can view the progress of jobs from Slurm Assist on the status page. Here you will see each job's state, ID, name, group name, runtime, and start and end times.
{{< figure src="/images/SandstoneSAStatusPage.png">}}
{{< figure src="/images/SandstoneSAStatuses.png">}}
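As a rough sketch, a script saved from Slurm Assist might resemble a minimal SLURM submit script like the one below. The directive values and job command here are illustrative placeholders, not necessarily what the tool emits:

```bash
#!/bin/sh
#SBATCH --job-name=example        # illustrative job name
#SBATCH --time=00:30:00           # illustrative time limit
#SBATCH --mem-per-cpu=1024        # illustrative memory request (MB)

echo "hello from job $SLURM_JOB_ID"
```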
+++
title = "Condor Jobs on HCC"
description = "How to run jobs using Condor on HCC machines"
weight = "54"
+++
This quick start demonstrates how to run multiple copies of a Fortran/C program
using Condor on HCC supercomputers. The sample codes and submit scripts
can be downloaded from [condor_dir.zip](/attachments/3178558.zip).
#### Login to an HCC Cluster
Log in to an HCC cluster through PuTTY ([For Windows Users]({{< relref "/quickstarts/connecting/for_windows_users">}})) or Terminal ([For Mac/Linux Users]({{< relref "/quickstarts/connecting/for_maclinux_users">}})) and make a subdirectory called `condor_dir` under the `$WORK` directory. In the subdirectory `condor_dir`, create job subdirectories that hold the input data files. Here we create two job subdirectories, `job_0` and `job_1`, and put a data file (`data.dat`) in each. The data file in `job_0` contains a column of integers from 1 to 5; the data file in `job_1` contains the integers from 6 to 10.
{{< highlight bash >}}
$ cd $WORK
$ mkdir condor_dir
$ cd condor_dir
$ mkdir job_0
$ mkdir job_1
{{< /highlight >}}
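For reference, the two data files described above could be created from inside `condor_dir` with `printf` (one sketch among many ways to do it):

```bash
# Write the integers 1-5 and 6-10 into the two job subdirectories
printf '%d\n' 1 2 3 4 5 > job_0/data.dat
printf '%d\n' 6 7 8 9 10 > job_1/data.dat
```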
In the subdirectory `condor_dir`, save all the relevant code. Here we
include two demo programs, `demo_f_condor.f90` and `demo_c_condor.c`,
that compute the sum of the data stored in each job subdirectory
(`job_0` and `job_1`). The parallelization scheme is as follows.
First, the master node sends out copies of the executable from the
`condor_dir` subdirectory along with a copy of the data file in each
job subdirectory. The number of executable copies is specified in the
submit script (`queue`) and usually matches the number of job
subdirectories. Next, the workload is distributed among a pool of
worker nodes; at any given time, the number of available worker nodes
may vary. Each worker node executes its job independently of the
others, and the output files are stored separately in each job
subdirectory. No additional coding is needed to turn the serial code
"parallel": parallelization here is achieved entirely through the
submit script.
{{%expand "demo_f_condor.f90" %}}
{{< highlight fortran >}}
Program demo_f_condor
  implicit none
  integer, parameter :: N = 5
  real*8 w
  integer i
  common/sol/ x
  real*8 x
  real*8, dimension(N) :: y_local
  real*8, dimension(N) :: input_data

  open(10, file='data.dat')
  do i = 1,N
    read(10,*) input_data(i)
  enddo
  do i = 1,N
    w = input_data(i)*1d0
    call proc(w)
    y_local(i) = x
    write(6,*) 'i,x = ', i, y_local(i)
  enddo
  write(6,*) 'sum(y) =', sum(y_local)
  Stop
End Program

Subroutine proc(w)
  real*8, intent(in) :: w
  common/sol/ x
  real*8 x
  x = w
  Return
End Subroutine
{{< /highlight >}}
{{% /expand %}}
{{%expand "demo_c_condor.c" %}}
{{< highlight c >}}
//demo_c_condor
#include <stdio.h>

double proc(double w){
    double x;
    x = w;
    return x;
}

int main(int argc, char* argv[]){
    int N = 5;
    double w;
    int i;
    double x;
    double y_local[N];
    double sum;
    double input_data[N];
    FILE *fp;

    fp = fopen("data.dat","r");
    for (i = 1; i <= N; i++){
        fscanf(fp, "%lf", &input_data[i-1]);
    }
    for (i = 1; i <= N; i++){
        w = input_data[i-1]*1e0;
        x = proc(w);
        y_local[i-1] = x;
        printf("i,x= %d %lf\n", i, y_local[i-1]);
    }
    sum = 0e0;
    for (i = 1; i <= N; i++){
        sum = sum + y_local[i-1];
    }
    printf("sum(y)= %lf\n", sum);
    return 0;
}
{{< /highlight >}}
{{% /expand %}}
---
#### Compiling the Code
The compiled executable needs to match the "standard" environment of the
worker node. The easiest way is to use the compilers installed on the
HCC supercomputer directly, without loading extra modules. The standard
compiler on HCC supercomputers is the GNU Compiler Collection; the
version can be looked up with `gcc -v` or `gfortran -v`.
{{< highlight bash >}}
$ gfortran demo_f_condor.f90 -o demo_f_condor.x
$ gcc demo_c_condor.c -o demo_c_condor.x
{{< /highlight >}}
#### Creating a Submit Script
Create a submit script to request 2 jobs (`queue`). The name of each job
subdirectory is specified in the `initialdir` line. The
`$(process)` macro appends an integer to the job subdirectory
name `job_`; the numbers run from `0` to `queue - 1`. The name of the input
data file is specified in the `transfer_input_files` line.
{{% panel header="`submit_f.condor`"%}}
{{< highlight bash >}}
universe = grid
grid_resource = pbs
batch_queue = guest
should_transfer_files = yes
when_to_transfer_output = on_exit
executable = demo_f_condor.x
output = Fortran_$(process).out
error = Fortran_$(process).err
initialdir = job_$(process)
transfer_input_files = data.dat
queue 2
{{< /highlight >}}
{{% /panel %}}
{{% panel header="`submit_c.condor`"%}}
{{< highlight bash >}}
universe = grid
grid_resource = pbs
batch_queue = guest
should_transfer_files = yes
when_to_transfer_output = on_exit
executable = demo_c_condor.x
output = C_$(process).out
error = C_$(process).err
initialdir = job_$(process)
transfer_input_files = data.dat
queue 2
{{< /highlight >}}
{{% /panel %}}
#### Submit the Job
The job can be submitted through the command `condor_submit`. The job
status can be monitored by entering `condor_q` followed by the
username.
{{< highlight bash >}}
$ condor_submit submit_f.condor
$ condor_submit submit_c.condor
$ condor_q <username>
{{< /highlight >}}
Replace `<username>` with your HCC username.
#### Sample Output
In the job subdirectory `job_0`, the sum from 1 to 5 is computed and
printed to the `.out` file. In the job subdirectory `job_1`, the sum
from 6 to 10 is computed and printed to the `.out` file.
{{%expand "Fortran_0.out" %}}
{{< highlight batchfile>}}
i,x = 1 1.0000000000000000
i,x = 2 2.0000000000000000
i,x = 3 3.0000000000000000
i,x = 4 4.0000000000000000
i,x = 5 5.0000000000000000
sum(y) = 15.000000000000000
{{< /highlight >}}
{{% /expand %}}
{{%expand "Fortran_1.out" %}}
{{< highlight batchfile>}}
i,x = 1 6.0000000000000000
i,x = 2 7.0000000000000000
i,x = 3 8.0000000000000000
i,x = 4 9.0000000000000000
i,x = 5 10.000000000000000
sum(y) = 40.000000000000000
{{< /highlight >}}
{{% /expand %}}
+++
title = "Monitoring Jobs"
description = "How to find out information about running and completed jobs."
+++
Careful examination of running times, memory usage, and output files will
allow you to verify that a job completed correctly and give you a good idea
of what memory and time limits to request in the future.
### Monitoring Completed Jobs:
To see the runtime and memory usage of a job that has completed, use the
sacct command:
{{< highlight bash >}}
sacct
{{< /highlight >}}
Lists all jobs by the current user and displays information such as
JobID, JobName, State, and ExitCode.
{{< figure src="/images/21070053.png" height="150" >}}
Coupling this command with the `--format` flag will allow you to see more
than the default information about a job. Fields to display should be
given as a comma-separated list (without spaces) after the `--format`
flag. For example, to see the elapsed time and maximum memory used by
a job, this command can be used:
{{< highlight bash >}}
sacct --format JobID,JobName,Elapsed,MaxRSS
{{< /highlight >}}
{{< figure src="/images/21070054.png" height="150" >}}
Additional arguments and format field information can be found in
[the SLURM documentation](https://slurm.schedmd.com/sacct.html).
### Monitoring Running Jobs:
There are two ways to monitor running jobs: the `top` command and the
cgroup files. `top` is helpful when monitoring multi-process jobs,
whereas the cgroup files provide information on memory usage. Both of
these tools require an interactive job on the same node as the job
being monitored.
{{% notice warning %}}
If the job to be monitored is using all available resources for a node,
the user will not be able to obtain a simultaneous interactive job.
{{% /notice %}}
After the job to be monitored is submitted and has begun to run, request
an interactive job on the same node using the srun command:
{{< highlight bash >}}
srun --jobid=<JOB_ID> --pty bash
{{< /highlight >}}
Where `<JOB_ID>` is replaced by the job id for the monitored job as
assigned by SLURM.
Alternatively, you can request the interactive job by node name as follows:
{{< highlight bash >}}
srun --nodelist=<NODE_ID> --pty bash
{{< /highlight >}}
Where `<NODE_ID>` is replaced by the name of the node the monitored
job is running on. This can be found in the `squeue` output under the
`NODELIST` column.
{{< figure src="/images/21070055.png" width="700" >}}
Once the interactive job begins, you can run top to view the processes
on the node you are on:
{{< figure src="/images/21070056.png" height="400" >}}
The output of top displays each running process on the node. From the above
image, we can see the various MATLAB processes being run by user
cathrine98. To filter the list by user, type `u` followed
by the username of the user who owns the processes. To exit this screen,
press `q`.
While a job is running, a cgroup folder is created containing much
of the information used by sacct. These files can provide a live
overview of the resources used by a running job. To access the cgroup
files, you will need to be in an interactive job on the same node as the
monitored job. To view specific files and information, use one of the
following commands:
##### To view current memory usage:
{{< highlight bash >}}
less /cgroup/memory/slurm/uid_<UID>/job_<SLURM_JOB_ID>/memory.usage_in_bytes
{{< /highlight >}}
Where `<UID>` is replaced by your UID and `<SLURM_JOB_ID>` is
replaced by the monitored job's Job ID as assigned by Slurm.
{{% notice note %}}
To find your uid, use the command `id -u`. Your UID never changes and is
the same on all HCC clusters (*not* on Anvil, however!).
{{% /notice %}}
##### To view maximum memory usage from start of job to current point:
{{< highlight bash >}}
cat /cgroup/memory/slurm/uid_${UID}/job_${SLURM_JOBID}/memory.max_usage_in_bytes
{{< /highlight >}}
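The cgroup files report raw byte counts. As a convenience sketch, a short awk pipeline can convert such a value to MiB; here a sample number stands in for the file contents:

```bash
# On the cluster, replace `echo ...` with
# `cat /cgroup/memory/slurm/uid_${UID}/job_${SLURM_JOBID}/memory.usage_in_bytes`
echo 1073741824 | awk '{ printf "%.1f MiB\n", $1 / (1024 * 1024) }'
# prints: 1024.0 MiB
```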
+++
title = "Partitions"
description = "Listing of partitions on Crane and Rhino."
scripts = ["https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/jquery.tablesorter.min.js", "https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/widgets/widget-pager.min.js","https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/widgets/widget-filter.min.js","/js/sort-table.js"]
css = ["http://mottie.github.io/tablesorter/css/theme.default.css","https://mottie.github.io/tablesorter/css/theme.dropbox.css", "https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/css/jquery.tablesorter.pager.min.css","https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/css/filter.formatter.min.css"]
+++
Partitions are used on Crane and Rhino to distinguish different
resources. You can view the partitions with the command `sinfo`.
### Crane:
[Full list for Crane]({{< relref "crane_available_partitions" >}})
### Rhino:
[Full list for Rhino]({{< relref "rhino_available_partitions" >}})
#### Priority for short jobs
To run short jobs for testing and development work, a job can specify a
different quality of service (QoS). The *short* QoS increases a job's
priority so it will run as soon as possible.
| SLURM Specification |
|----------------------- |
| `#SBATCH --qos=short` |
{{% panel theme="warning" header="Limits per user for 'short' QoS" %}}
- 6 hour job run time
- 2 jobs of 16 CPUs or fewer
- No more than 256 CPUs in use for *short* jobs from all users
{{% /panel %}}
### Limitations of Jobs
Overall limits on maximum job wall time, CPUs, etc. are set for
all jobs, both with the default settings (when the `--qos=` option is omitted)
and for "short" jobs (described above), on Crane and Rhino.
The limits are shown in the following table.
| | SLURM Specification | Max Job Run Time | Max CPUs per User | Max Jobs per User |
| ------- | -------------------- | ---------------- | ----------------- | ----------------- |
| Default | Leave blank | 7 days | 2000 | 1000 |
| Short | #SBATCH --qos=short | 6 hours | 16 | 2 |
Please also note that the memory and
local hard drive limits are subject to the physical limitations of the
nodes, described in the resources capabilities section of the
[HCC Documentation]({{< relref "/#resource-capabilities" >}})
and the partition sections above.
### Owned Partitions
Partitions marked as owned by a group accept jobs only from specific
groups, which are manually added to the list of groups allowed to
submit jobs to the partition. If you are unable to
submit jobs to a partition and feel that you should be able to, please
contact {{< icon name="envelope" >}}[hcc-support@unl.edu](mailto:hcc-support@unl.edu).
### Guest Partition
The `guest` partition can be used by users and groups that do not own
dedicated resources on Crane or Rhino. Jobs running in the `guest` partition
will run on the owned resources with Intel OPA interconnect. The jobs
are preempted when the resources are needed by the resource owners and
are restarted on another node.
### tmp_anvil Partition
We have put the Anvil nodes that are not running OpenStack in this
partition. They have Intel Xeon E5-2650 v3 2.30GHz processors (2
CPUs/20 cores) and 256GB memory per node. However, they have no
InfiniBand or OPA interconnect, so they are suitable for serial or
single-node parallel jobs. The nodes in this partition may be drained
and moved to our OpenStack cloud without advance notice when more
cloud resources are needed.
### Use of Infiniband or OPA
Crane nodes in the batch partition use either InfiniBand or Intel
Omni-Path interconnects. Most users don't need to worry about which one
they get; the scheduler will automatically assign jobs to either.
However, if you want to use one of the interconnects exclusively,
the SLURM constraint keyword is available. Here are the examples:
{{% panel theme="info" header="SLURM Specification: Omni-Path" %}}
{{< highlight bash >}}
#SBATCH --constraint=opa
{{< /highlight >}}
{{% /panel %}}
{{% panel theme="info" header="SLURM Specification: Infiniband" %}}
{{< highlight bash >}}
#SBATCH --constraint=ib
{{< /highlight >}}
{{% /panel %}}
+++
title = "Available Partitions for Crane"
description = "List of available partitions for crane.unl.edu."
scripts = ["https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/jquery.tablesorter.min.js", "https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/widgets/widget-pager.min.js","https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/widgets/widget-filter.min.js","/js/sort-table.js"]
css = ["http://mottie.github.io/tablesorter/css/theme.default.css","https://mottie.github.io/tablesorter/css/theme.dropbox.css", "https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/css/jquery.tablesorter.pager.min.css","https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/css/filter.formatter.min.css"]
+++
### Crane:
{{< table url="http://crane-head.unl.edu:8192/slurm/partitions/json" >}}
+++
title = "Available Partitions for Rhino"
description = "List of available partitions for rhino.unl.edu."
scripts = ["https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/jquery.tablesorter.min.js", "https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/widgets/widget-pager.min.js","https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/js/widgets/widget-filter.min.js","/js/sort-table.js"]
css = ["http://mottie.github.io/tablesorter/css/theme.default.css","https://mottie.github.io/tablesorter/css/theme.dropbox.css", "https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/css/jquery.tablesorter.pager.min.css","https://cdnjs.cloudflare.com/ajax/libs/jquery.tablesorter/2.31.1/css/filter.formatter.min.css"]
+++
### Rhino:
{{< table url="http://rhino-head.unl.edu:8192/slurm/partitions/json" >}}
+++
title = "Submitting an Interactive Job"
description = "How to run an interactive job on HCC resources."
+++
{{% notice info %}}
The `/home` directories are read-only on the worker nodes. You will need
to compile or run your processing in `/work`.
{{% /notice %}}
Submitting an interactive job is done with the command `srun`.
{{< highlight bash >}}
$ srun --pty $SHELL
{{< /highlight >}}
or to allocate 4 cores per node:
{{< highlight bash >}}
$ srun --nodes=1 --ntasks-per-node=4 --mem-per-cpu=1024 --pty $SHELL
{{< /highlight >}}
Submitting an interactive job is useful if you require extra resources
to run some processing by hand. It is also very useful for debugging your
processing.
An interactive job is scheduled onto a worker node just like a regular
job. You can provide options to the interactive job just as you would a
regular SLURM job.
### Priority for short jobs
To run short jobs for testing and development work, a job can specify a
different quality of service (QoS). The *short* QoS increases a job's
priority so it will run as soon as possible.
| SLURM Specification |
|---------------------|
| `--qos=short` |
{{% panel theme="warning" header="Limits per user for 'short' QoS" %}}
- 6 hour job run time
- 2 jobs of 16 CPUs or fewer
- No more than 256 CPUs in use for *short* jobs from all users
{{% /panel %}}
{{% panel theme="info" header="Using the short QoS" %}}
{{< highlight bash >}}
srun --qos=short --nodes=1 --ntasks-per-node=1 --mem-per-cpu=1024 --pty $SHELL
{{< /highlight >}}
{{% /panel %}}
+++
title = "Submitting an MPI Job"
description = "How to submit an MPI job on HCC resources."
+++
This script requests 16 cores on nodes with InfiniBand:
{{% panel theme="info" header="mpi.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=1024
#SBATCH --time=03:15:00
#SBATCH --error=/work/[groupname]/[username]/job.%J.err
#SBATCH --output=/work/[groupname]/[username]/job.%J.out
module load compiler/gcc/8.2 openmpi/2.1
mpirun /home/[groupname]/[username]/mpiprogram
{{< /highlight >}}
{{% /panel %}}
The above job will allocate 16 cores on the default partition. The 16
cores could be on any of the nodes in the partition, even split between
multiple nodes.
### Advanced Submission
Some users may prefer to specify more details. This will allocate 32
tasks, 16 on each of two nodes:
{{% panel theme="info" header="mpi.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=1024
#SBATCH --time=03:15:00
#SBATCH --error=/work/[groupname]/[username]/job.%J.err
#SBATCH --output=/work/[groupname]/[username]/job.%J.out
module load compiler/gcc/8.2 openmpi/2.1
mpirun /home/[groupname]/[username]/mpiprogram
{{< /highlight >}}
{{% /panel %}}
+++
title = "Submitting CUDA or OpenACC Jobs"
description = "How to submit GPU (CUDA/OpenACC) jobs on HCC resources."
+++
### Available GPUs
Crane has four types of GPUs available in the **gpu** partition. The
type of GPU is configured as a SLURM feature, so you can specify a type
of GPU in your job resource requirements if necessary.
| Description | SLURM Feature | Available Hardware |
| -------------------- | ------------- | ---------------------------- |
| Tesla K20, non-IB | gpu_k20 | 3 nodes - 2 GPUs with 4 GB mem per node |
| Tesla K20, with IB   | gpu_k20       | 3 nodes - 3 GPUs with 4 GB mem per node |
| Tesla K40, with IB | gpu_k40 | 5 nodes - 4 K40M GPUs with 11 GB mem per node<br> 1 node - 2 K40C GPUs |
| Tesla P100, with OPA | gpu_p100 | 2 nodes - 2 GPUs with 12 GB per node |
| Tesla V100, with 10GbE | gpu_v100 | 1 node - 4 GPUs with 16 GB per node |
To run your job on the next available GPU regardless of type, add the
following options to your srun or sbatch command:
{{< highlight batch >}}
--partition=gpu --gres=gpu
{{< /highlight >}}
To run on a specific type of GPU, you can constrain your job to require
a feature. To run on K40 GPUs for example:
{{< highlight batch >}}
--partition=gpu --gres=gpu --constraint=gpu_k40
{{< /highlight >}}
{{% notice info %}}
You may request multiple GPUs by changing the `--gres` value to
`--gres=gpu:2`. Note that this value is **per node**. For example,
`--nodes=2 --gres=gpu:2` will request 2 nodes with 2 GPUs each, for a
total of 4 GPUs.
{{% /notice %}}
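For instance, the per-node behavior described in the note above could appear in a submit script as follows (directive values are illustrative):

```bash
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --gres=gpu:2   # 2 GPUs on each of 2 nodes = 4 GPUs total
```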
### Compiling
Compilation of CUDA or OpenACC jobs must be performed on the GPU nodes.
Therefore, you must run an [interactive job]({{< relref "submitting_an_interactive_job" >}})
to compile. An example command to compile in the **gpu** partition could be:
{{< highlight batch >}}
$ srun --partition=gpu --gres=gpu --mem-per-cpu=1024 --ntasks-per-node=6 --nodes=1 --pty $SHELL
{{< /highlight >}}
The above command will start a shell on a GPU node with 6 cores and 6GB
of RAM in order to compile a GPU job. It is also useful if you want to
run a test GPU job interactively.
### Submitting Jobs
CUDA and OpenACC submissions require running on GPU nodes.
{{% panel theme="info" header="cuda.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --time=03:15:00
#SBATCH --mem-per-cpu=1024
#SBATCH --job-name=cuda
#SBATCH --partition=gpu
#SBATCH --gres=gpu
#SBATCH --error=/work/[groupname]/[username]/job.%J.err
#SBATCH --output=/work/[groupname]/[username]/job.%J.out
module load cuda/8.0
./cuda-app.exe
{{< /highlight >}}
{{% /panel %}}
OpenACC submissions require loading the PGI compiler (which is currently
required to compile as well).
{{% panel theme="info" header="openacc.submit" %}}
{{< highlight batch >}}
#!/bin/sh
#SBATCH --time=03:15:00
#SBATCH --mem-per-cpu=1024
#SBATCH --job-name=cuda-acc
#SBATCH --partition=gpu
#SBATCH --gres=gpu
#SBATCH --error=/work/[groupname]/[username]/job.%J.err
#SBATCH --output=/work/[groupname]/[username]/job.%J.out
module load cuda/8.0 compiler/pgi/16
./acc-app.exe
{{< /highlight >}}
{{% /panel %}}
+++
title = "Submitting HTCondor Jobs"
description = "How to submit HTCondor Jobs on HCC resources."
+++
If you require features of HTCondor, such as DAGMan or Pegasus,
[HTCondor](http://research.cs.wisc.edu/htcondor/) can
submit jobs through its PBS integration. This is done by adding
`grid_resource = pbs` to the submit file. An example
submission script is below:
{{% panel theme="info" header="submit.condor" %}}
{{< highlight batch >}}
universe = grid
grid_resource = pbs
executable = test.sh
output = stuff.out
error = stuff.err
log = stuff.log
batch_queue = guest
queue
{{< /highlight >}}
{{% /panel %}}
The above script will translate the condor submit file into a SLURM
submit file, and execute the `test.sh` executable on a worker node.
{{% notice warning %}}
The `/home` directories are read-only on the worker nodes. You
have to submit your jobs from the `/work` directory, just as you would
with SLURM.
{{% /notice %}}
### Using Pegasus
If you are using [Pegasus](http://pegasus.isi.edu),
instructions on using the *glite* interface (as shown above) are
available in the
[User Guide](http://pegasus.isi.edu/wms/docs/latest/execution_environments.php#glite).
+++
title = "The Open Science Grid"
description = "How to utilize the Open Science Grid (OSG)."
weight = "40"
+++
If you find that you are not getting access to the volume of computing
resources needed for your research through HCC, you might also consider
submitting your jobs to the Open Science Grid (OSG).
### What is the Open Science Grid?
The [Open Science Grid](http://opensciencegrid.org) advances
science through open distributed computing. The OSG is a
multi-disciplinary partnership to federate local, regional, community
and national cyber infrastructures to meet the needs of research and
academic communities at all scales. HCC participates in the OSG as a
resource provider and a resource user. We provide HCC users with a
gateway to running jobs on the OSG.
The map below shows the Open Science Grid sites located across the U.S.
{{< figure src="/images/17044917.png" >}}
This help document is divided into four sections:
- [Characteristics of an OSG friendly job]({{< relref "characteristics_of_an_osg_friendly_job" >}})
- [How to submit an OSG Job with HTCondor]({{< relref "how_to_submit_an_osg_job_with_htcondor" >}})
- [A simple example of submitting an HTCondor job]({{< relref "a_simple_example_of_submitting_an_htcondor_job" >}})
- [Using Distributed Environment Modules on OSG]({{< relref "using_distributed_environment_modules_on_osg" >}})
+++
title = "A simple example of submitting an HTCondor job"
description = "A simple example of submitting an HTCondor job."
+++
This page describes a complete example of submitting an HTCondor job.
1. SSH to Crane
{{% panel theme="info" header="ssh command" %}}
[apple@localhost]$ ssh apple@crane.unl.edu
{{% /panel %}}
{{% panel theme="info" header="output" %}}
[apple@login.crane ~]$
{{% /panel %}}
2. Write a simple python program in a file "hello.py" that we wish to
run using HTCondor
{{% panel theme="info" header="edit a python code named 'hello.py'" %}}
[apple@login.crane ~]$ vim hello.py
{{% /panel %}}
Then in the edit window, please input the code below:
{{% panel theme="info" header="hello.py" %}}
#!/usr/bin/env python
import sys
import time

i = 1
while i <= 6:
    print i
    i += 1
    time.sleep(1)
print 2**8
print "hello world received argument = " + sys.argv[1]
{{% /panel %}}
This program will print 1 through 6 on stdout, then print the number
256, and finally print `hello world received argument = <command-line
argument passed to hello.py>`
3. Write an HTCondor submit script named "hello.submit"
{{% panel theme="info" header="hello.submit" %}}
Universe = vanilla
Executable = hello.py
Output = OUTPUT/hello.out.$(Cluster).$(Process).txt
Error = OUTPUT/hello.error.$(Cluster).$(Process).txt
Log = OUTPUT/hello.log.$(Cluster).$(Process).txt
notification = Never
Arguments = $(Process)
PeriodicRelease = ((JobStatus==5) && (CurrentTime - EnteredCurrentStatus) > 30)
OnExitRemove = (ExitStatus == 0)
Queue 4
{{% /panel %}}
4. Create an OUTPUT directory to receive all output files
   generated by your job (the OUTPUT folder is referenced in the submit
   script above)
{{% panel theme="info" header="create output directory" %}}
[apple@login.crane ~]$ mkdir OUTPUT
{{% /panel %}}
5. Submit your job
{{% panel theme="info" header="condor_submit" %}}
[apple@login.crane ~]$ condor_submit hello.submit
{{% /panel %}}
{{% panel theme="info" header="Output of submit" %}}
Submitting job(s)
....
4 job(s) submitted to cluster 1013054.
{{% /panel %}}
6. Check the job status with `condor_q`
{{% panel theme="info" header="condor_q" %}}
[apple@login.crane ~]$ condor_q
{{% /panel %}}
{{% panel theme="info" header="Output of `condor_q`" %}}
-- Schedd: login.crane.hcc.unl.edu : <129.93.227.113:9619?...
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
720587.0 logan 12/15 10:48 33+14:41:17 H 0 0.0 continuous.cron 20
720588.0 logan 12/15 10:48 200+02:40:08 H 0 0.0 checkprogress.cron
1012864.0 jthiltge 2/15 16:48 0+00:00:00 H 0 0.0 test.sh
1013054.0 jennyshao 4/3 17:58 0+00:00:00 R 0 0.0 hello.py 0
1013054.1 jennyshao 4/3 17:58 0+00:00:00 R 0 0.0 hello.py 1
1013054.2 jennyshao 4/3 17:58 0+00:00:00 I 0 0.0 hello.py 2
1013054.3 jennyshao 4/3 17:58 0+00:00:00 I 0 0.0 hello.py 3
7 jobs; 0 completed, 0 removed, 0 idle, 4 running, 3 held, 0 suspended
{{% /panel %}}
Listed below are the three job statuses observed in the
above output
| Symbol | Representation |
|--------|------------------|
| H | Held |
| R | Running |
| I | Idle and waiting |
7. Explanation of the `$(Cluster)` and `$(Process)` in HTCondor script
`$(Cluster)` and `$(Process)` are variables available in the
HTCondor script's name space. `$(Cluster)` is the prefix of your job ID,
and `$(Process)` runs from `0` to `Queue - 1`. If you submit a single
job, the job ID is simply `$(Cluster)`; otherwise, the job ID combines
`$(Cluster)` and `$(Process)`.
In this example, `$(Cluster)`="1013054" and `$(Process)` varies from "0"
to "3" for the above HTCondor script.
In the majority of cases these variables are used to modify the
behavior of each individual task of the HTCondor submission; for
example, one may vary the input/output files or parameters for each
run. In this example we simply pass `$(Process)` as an argument,
read as `sys.argv[1]` in `hello.py`.
The lines of interest for this discussion from the HTCondor
script "hello.submit" are listed below in the code section:
{{% panel theme="info" header="for `$(Process)`" %}}
Output= hello.out.$(Cluster).$(Process).txt
Arguments = $(Process)
Queue 4
{{% /panel %}}
The line of interest for this discussion from "hello.py" is
listed in the code section below:
{{% panel theme="info" header="for `$(Process)`" %}}
print "hello world received argument = " +sys.argv[1]
{{% /panel %}}
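Because `$(Process)` arrives as an ordinary command-line argument, a task script can also use it to select a per-task input. The sketch below is hypothetical (the `input_for_task` helper and file names are illustrative, not part of the tutorial's `hello.py`):

```python
import sys

def input_for_task(process_id, inputs):
    """Map an HTCondor $(Process) value ("0".."N-1") to one per-task input."""
    return inputs[int(process_id)]

if __name__ == "__main__" and len(sys.argv) > 1:
    files = ["INPUT/run0.dat", "INPUT/run1.dat",
             "INPUT/run2.dat", "INPUT/run3.dat"]
    # HTCondor passes $(Process) as the first argument, e.g. "2"
    print("task input:", input_for_task(sys.argv[1], files))
```

Each queued task then reads a different input file even though all tasks share one submit script.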
8. Viewing the results of your job
After your job has completed, you may use a Linux command such as `cat`
or `vim` to view the job output.
For example, in the file name `hello.out.1013054.2.txt`, "1013054" is
`$(Cluster)` and "2" is `$(Process)`. The output looks like:
{{% panel theme="info" header="example of one output file `hello.out.1013054.2.txt`" %}}
1
2
3
4
5
6
256
hello world received argument = 2
{{% /panel %}}
9. Please see the link below for one more example:
http://research.cs.wisc.edu/htcondor/tutorials/intl-grid-school-3/submit_first.html
Next: [Using Distributed Environment Modules on OSG]({{< relref "using_distributed_environment_modules_on_osg" >}})
+++
title = "How to submit an OSG job with HTCondor"
description = "How to submit an OSG job with HTCondor"
+++
{{% notice info%}}Jobs can be submitted to the OSG from Crane, so
there is no need to log on to a different submit host or obtain a grid
certificate!
{{% /notice %}}
### What is HTCondor?
The [HTCondor](http://research.cs.wisc.edu/htcondor)
project provides software to schedule individual applications and
workflows, and to help sites manage resources. It is designed to enable
High Throughput Computing (HTC) on large collections of distributed
resources and serves as the job scheduler used on the OSG.
Jobs are submitted from the Crane login node to the
OSG using an HTCondor submission script. For those who are used to
submitting jobs with SLURM, there are a few key differences to be aware
of:
### When using HTCondor
- All files (scripts, code, executables, libraries, etc) that are
needed by the job are transferred to the remote compute site when
the job is scheduled. Therefore, all of the files required by the
job must be specified in the HTCondor submit script. Paths can be
absolute or relative to the local directory from which the job is
submitted. The main executable (specified on the `Executable` line
of the submit script) is transferred automatically with the job.
All other files need to be listed on the `transfer_input_files`
line (see example below).
- All files that are created by
the job on the remote host will be transferred automatically back to
the submit host when the job has completed. This includes
temporary/scratch and intermediate files that are not removed by
your job. If you do not want to keep these files, clean up the work
space on the remote host by removing these files before the job
exits (this can be done using a wrapper script for example).
Specific output file names can be specified with the
`transfer_output_files` option. If these files do
not exist on the remote
host when the job exits, then the job will not complete successfully
(it will be placed in the *held* state).
- HTCondor scripts can queue
(submit) as many jobs as you like. All jobs queued from a single
submit script will be identical except for the `Arguments` used.
The submit script in the example below queues 5 jobs with the first
set of specified arguments, and 1 job with the second set of
arguments. By default, `Queue` when it is not followed by a number
will submit 1 job.
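The points above can be combined into a single submit description. The sketch below is illustrative: the file names (`run.sh`, `params.dat`, `result.dat`) are hypothetical, `params.dat` is transferred alongside the automatically-transferred executable, and `result.dat` is the output file expected back. It queues 5 jobs with the first set of arguments and 1 job with the second set:
{{% panel theme="info" header="Sketch of a submit script with file transfer" %}}
{{< highlight batch >}}
Universe              = vanilla
Executable            = run.sh
transfer_input_files  = params.dat
transfer_output_files = result.dat
Output                = run.$(Process).out
Error                 = run.$(Process).err
Log                   = run.log
# 5 jobs with the first set of arguments, 1 with the second
Arguments             = 1.0
Queue 5
Arguments             = 2.0
Queue
{{< /highlight >}}
{{% /panel %}}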
For more information and advanced usage, see the
[HTCondor Manual](http://research.cs.wisc.edu/htcondor/manual/v8.3/index.html).
### Creating an HTCondor Script
HTCondor, much like Slurm, needs a script to tell it what the
user wants to do. The example below is a basic script, saved in a file
such as 'applejob.txt', that can be used to handle most jobs submitted
to HTCondor.
{{% panel theme="info" header="Example of a HTCondor script" %}}
{{< highlight batch >}}
#with executable, stdin, stderr and log
Universe = vanilla
Executable = a.out
Arguments = file_name 12
Output = a.out.out
Error = a.out.err
Log = a.out.log
Queue
{{< /highlight >}}
{{% /panel %}}
The table below explains the various attributes/keywords used in the above script.
| Attribute/Keyword | Explanation |
| ----------------- | ----------------------------------------------------------------------------------------- |
| # | Lines starting with '#' are considered as comments by HTCondor. |
| Universe | is the way HTCondor manages different ways it can run, or what is called in the HTCondor documentation, a runtime environment. The vanilla universe is where most jobs should be run. |
| Executable | is the name of the executable you want to run on HTCondor. |
| Arguments | are the command line arguments for your program. For example, if one were to run `ls -l /` on HTCondor, the Executable would be `ls` and the Arguments would be `-l /`. |
| Output | is the file where the information printed to stdout will be sent. |
| Error | is the file where the information printed to stderr will be sent. |
| Log | is the file where information about your HTCondor job will be sent. Information like if the job is running, if it was halted or, if running in the standard universe, if the file was check-pointed or moved. |
| Queue | is the command to send the job to HTCondor's scheduler. |
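As a concrete instance of the `ls -l /` example from the table, a minimal submit description could look like the sketch below (the output file names are illustrative):
{{% panel theme="info" header="Minimal submit script for `ls -l /`" %}}
{{< highlight batch >}}
Universe   = vanilla
Executable = /bin/ls
Arguments  = -l /
Output     = ls.out
Error      = ls.err
Log        = ls.log
Queue
{{< /highlight >}}
{{% /panel %}}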
Suppose you would like to submit a job, e.g. a Monte-Carlo simulation,
where the same program needs to be run several times with the same
parameters. The script above can be used with the following modification:
give the `Queue` command the number of times the job must
be run (and hence queued in HTCondor). Thus, if the `Queue` command is
changed to `Queue 5`, a.out will be run 5 times with the exact same
parameters.
In another scenario, if you would like to submit the same job but with
different parameters, HTCondor accepts files with multiple `Queue`
statements. Only the parameters that need to change should be
reassigned in the HTCondor script before each `Queue` statement.
Please see "A simple example" in the next chapter for details on the use
of `$(Process)`.
{{% panel theme="info" header="Another Example of a HTCondor script" %}}
{{< highlight batch >}}
#with executable, stdin, stderr and log
#and multiple Argument parameters
Universe = vanilla
Executable = a.out
Arguments = file_name 10
Output = a.out.$(Process).out
Error = a.out.$(Process).err
Log = a.out.$(Process).log
Queue
Arguments = file_name 20
Queue
Arguments = file_name 30
Queue
{{< /highlight >}}
{{% /panel %}}
### How to Submit and View Your job
The steps below describe how to submit a job and other important job
management tasks that you may need in order to monitor and/or control
the submitted job:
1. How to submit a job to OSG - assuming that you named your HTCondor
script applejob.txt
{{< highlight bash >}}[apple@login.crane ~] $ condor_submit applejob.txt{{< /highlight >}}
You will see the following output after submitting the job
{{% panel theme="info" header="Example of condor_submit" %}}
Submitting job(s)
......
6 job(s) submitted to cluster 1013038
{{% /panel %}}
2. How to view your job status - to view the job status of your
submitted jobs use the following shell command
*Please note that by providing a user name as an argument to the
`condor_q` command you can limit the list of submitted jobs to the
ones that are owned by the named user*
{{< highlight bash >}}[apple@login.crane ~] $ condor_q apple{{< /highlight >}}
The code section below shows a typical output. You may notice that
the column ST represents the status of the job (H: Held and I: Idle
or waiting)
{{% panel theme="info" header="Example of condor_q" %}}
-- Schedd: login.crane.hcc.unl.edu : <129.93.227.113:9619?...
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1013034.4 apple 3/26 16:34 0+00:21:00 H 0 0.0 sjrun.py INPUT/INP
1013038.0 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
1013038.1 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
1013038.2 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
1013038.3 apple 4/3 11:34 0+00:00:00 I 0 0.0 sjrun.py INPUT/INP
...
16 jobs; 0 completed, 0 removed, 12 idle, 0 running, 4 held, 0 suspended
{{% /panel %}}
3. How to release a job - in a few cases a job may get held because of
authentication failure or other non-fatal errors. In
those cases you may use the shell command below to release the job
from the held status so that it can be rescheduled by HTCondor.
*Release one job:*
{{< highlight bash >}}[apple@login.crane ~] $ condor_release 1013034.4{{< /highlight >}}
*Release all jobs of a user apple:*
{{< highlight bash >}}[apple@login.crane ~] $ condor_release apple{{< /highlight >}}
4. How to delete a submitted job - if you want to delete a submitted
job, you may use the shell commands listed below
*Delete one job:*
{{< highlight bash >}}[apple@login.crane ~] $ condor_rm 1013034.4{{< /highlight >}}
*Delete all jobs of a user apple:*
{{< highlight bash >}}[apple@login.crane ~] $ condor_rm apple{{< /highlight >}}
5. How to get help with HTCondor commands
You can use `man` to get a detailed explanation of an HTCondor command
{{% panel theme="info" header="Example of help of condor_q" %}}
[apple@glidein ~]$ man condor_q
{{% /panel %}}
{{% panel theme="info" header="Output of `man condor_q`" %}}
just-man-pages/condor_q(1) just-man-pages/condor_q(1)
Name
condor_q Display information about jobs in queue
Synopsis
condor_q [ -help ]
condor_q [ -debug ] [ -global ] [ -submitter submitter ] [ -name name ] [ -pool centralmanagerhost-
name[:portnumber] ] [ -analyze ] [ -run ] [ -hold ] [ -globus ] [ -goodput ] [ -io ] [ -dag ] [ -long ]
[ -xml ] [ -attributes Attr1 [,Attr2 ... ] ] [ -format fmt attr ] [ -autoformat[:tn,lVh] attr1 [attr2
...] ] [ -cputime ] [ -currentrun ] [ -avgqueuetime ] [ -jobads file ] [ -machineads file ] [ -stream-
results ] [ -wide ] [ {cluster | cluster.process | owner | -constraint expression ... } ]
Description
condor_q displays information about jobs in the Condor job queue. By default, condor_q queries the local
job queue but this behavior may be modified by specifying:
* the -global option, which queries all job queues in the pool
* a schedd name with the -name option, which causes the queue of the named schedd to be queried
{{% /panel %}}
Next: [A simple example of submitting an HTCondor job]({{< relref "a_simple_example_of_submitting_an_htcondor_job" >}})
+++
title = "Quickstarts"
weight = "10"
+++
The quick start guides require that you already have an HCC account. You
can get an HCC account by applying on the
[HCC website](http://hcc.unl.edu/newusers/)
{{% children %}}
+++
title = "How to Connect"
description = "What is a cluster and what is HPC"
weight = "9"
+++
High-Performance Computing is the use of groups of computers to solve computations a user or group would not be able to solve in a reasonable time-frame on their own desktop or laptop. This is often achieved by splitting one large job amongst numerous cores or 'workers'. This is similar to how a skyscraper is built by numerous individuals rather than a single person. Many fields take advantage of HPC including bioinformatics, chemistry, materials engineering, and newer fields such as educational psychology and philosophy.
{{< figure src="/images/cluster.png" height="450" >}}
HPC clusters consist of four primary parts, the login node, management node, workers, and a central storage array. All of these parts are bound together with a scheduler such as HTCondor or SLURM.
<br><br>
#### Login Node:
Users will automatically land on the login node when they log in to the clusters. You will [submit jobs]({{< ref "/guides/submitting_jobs" >}}) using one of the schedulers and pull the results of your jobs. Jobs run directly on the login node will be stopped so that others can use the login node to submit jobs.
<br><br>
#### Management Node:
The management node does as its name suggests: it manages the cluster and provides a central point for administering the rest of the systems.
<br><br>
#### Worker Nodes:
The worker nodes are what run and process the jobs submitted through the schedulers. The schedulers make efficient use of the cluster by packing in as many jobs as the requested resources allow across the nodes. They also enforce fair use by making sure no single user or group uses the entire cluster at once, leaving room for others.
<br><br>
#### Central Storage Array:
The central storage array allows all of the nodes within the cluster to access the same files without needing to transfer them around. HCC has three arrays mounted on the clusters, with more details [here]({{< ref "/guides/handling_data" >}}).
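The "split one large job amongst numerous cores" idea above can be sketched in miniature on a single machine with Python's `multiprocessing` module; a cluster scheduler performs the same kind of splitting across whole nodes:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """One 'worker' computes its share of the overall total."""
    lo, hi = chunk
    return sum(range(lo, hi))

if __name__ == "__main__":
    # Split summing 0..99999 into 4 chunks, one per worker process
    chunks = [(i * 25000, (i + 1) * 25000) for i in range(4)]
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)  # equals sum(range(100000)), i.e. 4999950000
```

On a cluster, each "chunk" would instead be a separate scheduled job running on a worker node.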
+++
title = "Reusing SSH connections in Linux/Mac"
description = "Reusing connections makes it easier to use multiple terminals"
weight = "37"
+++
To make it more convenient for users who use multiple terminal sessions
simultaneously, SSH can reuse an existing connection if connecting from
Linux or Mac. After the initial login, subsequent terminals can use
that connection, eliminating the need to enter the username and password
each time for every connection. To enable this feature, add the
following lines to your `~/.ssh/config` file:
{{% panel header="`~/.ssh/config`"%}}
{{< highlight bash >}}
Host *
ControlMaster auto
ControlPath /tmp/%r@%h:%p
ControlPersist 2h
{{< /highlight >}}
{{% /panel %}}
{{% notice info%}}
You may not have an existing `~/.ssh/config` file. If not, simply create the
file and set the permissions appropriately first:
`touch ~/.ssh/config && chmod 600 ~/.ssh/config`
{{% /notice %}}
This will enable connection reuse when connecting to any host via SSH or
SCP.
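Once multiplexing is enabled, OpenSSH's control commands can be used to inspect or close the shared master connection (shown here against Crane as an example host; substitute your own username):
{{% panel theme="info" header="Managing the shared connection" %}}
{{< highlight bash >}}
# Check whether a master connection is currently active
ssh -O check <username>@crane.unl.edu

# Cleanly shut down the shared master connection when finished
ssh -O exit <username>@crane.unl.edu
{{< /highlight >}}
{{% /panel %}}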