+++
title = "Condor Jobs on HCC"
description = "How to run jobs using Condor on HCC machines"
weight = "54"
+++

This quick start demonstrates how to run multiple copies of a Fortran/C program
using Condor on HCC supercomputers. The sample codes and submit scripts
can be downloaded from [condor_dir.zip](/attachments/3178558.zip).

#### Log In to an HCC Cluster

Log in to an HCC cluster through PuTTY ([For Windows Users]({{< relref "/Quickstarts/connecting/for_windows_users">}})) or Terminal ([For Mac/Linux Users]({{< relref "/Quickstarts/connecting/for_maclinux_users">}})) and make a subdirectory called `condor_dir` under the `$WORK` directory. In the subdirectory `condor_dir`, create job subdirectories that host the input data files. Here we create two job subdirectories, `job_0` and `job_1`, and put a data file (`data.dat`) in each subdirectory. The data file in `job_0` has a column of data listing the integers from 1 to 5. The data file in `job_1` has an integer list from 6 to 10.

{{< highlight bash >}}
$ cd $WORK
$ mkdir condor_dir
$ cd condor_dir
$ mkdir job_0
$ mkdir job_1
{{< /highlight >}}
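
The data files can be created with any text editor, or generated directly
from the shell. For example, a quick way (assuming the standard `seq`
utility, which prints one integer per line):

{{< highlight bash >}}
$ seq 1 5 > job_0/data.dat
$ seq 6 10 > job_1/data.dat
{{< /highlight >}}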

In the subdirectory `condor_dir`, save all the relevant codes. Here we
include two demo programs, `demo_f_condor.f90` and `demo_c_condor.c`,
that compute the sum of the data stored in each job subdirectory
(`job_0` and `job_1`). The parallelization scheme here is as follows.
First, the master node sends out copies of the executable from the
`condor_dir` subdirectory, along with the data file from each job
subdirectory. The number of executable copies is specified in the submit
script (`queue`), and it usually matches the number of job
subdirectories. Next, the workload is distributed among a pool of worker
nodes. At any given time, the number of available worker nodes may vary.
Each worker node executes its job independently of the other worker
nodes, and the output files are stored separately in the corresponding
job subdirectories. No additional coding is needed to turn the serial
code "parallel"; the parallelization is achieved entirely through the
submit script.
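
After the source files are added, the layout of `condor_dir` should look
roughly like this (the executables and submit scripts are created in the
steps below):

{{< highlight bash >}}
condor_dir/
├── demo_c_condor.c
├── demo_f_condor.f90
├── job_0/
│   └── data.dat
└── job_1/
    └── data.dat
{{< /highlight >}}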

{{%expand "demo_condor.f90" %}}
{{< highlight fortran >}}
Program demo_f_condor
    implicit none
    integer, parameter :: N = 5
    real*8 w
    integer i
    common/sol/ x
    real*8 x
    real*8, dimension(N) :: y_local
    real*8, dimension(N) :: input_data
    
    open(10, file='data.dat')
    
    do i = 1,N
        read(10,*) input_data(i)
    enddo
    
    do i = 1,N
        w = input_data(i)*1d0
        call proc(w)
        y_local(i) = x      
        write(6,*) 'i,x = ', i, y_local(i)
    enddo
    write(6,*) 'sum(y) =',sum(y_local)
Stop
End Program
Subroutine proc(w)
    real*8, intent(in) :: w
    common/sol/ x
    real*8 x
    
    x = w
    
Return
End Subroutine
{{< /highlight >}}
{{% /expand %}}


{{%expand "demo_c_condor.c" %}}
{{< highlight c >}}
//demo_c_condor
#include <stdio.h>

double proc(double w){
        double x;       
        x = w;  
        return x;
}

int main(int argc, char* argv[]){
    int N=5;
    double w;
    int i;
    double x;
    double y_local[N];
    double sum; 
    double input_data[N];
    FILE *fp;
    fp = fopen("data.dat","r");
    for (i = 1; i <= N; i++){
        fscanf(fp, "%lf", &input_data[i-1]);
    }
    fclose(fp);
    
    for (i = 1; i <= N; i++){        
        w = input_data[i-1]*1e0;
        x = proc(w);
        y_local[i-1] = x;
        printf("i,x= %d %lf\n", i, y_local[i-1]) ;
    }
    
    sum = 0e0;
    for (i = 1; i<= N; i++){
        sum = sum + y_local[i-1];   
    }
    
    printf("sum(y)= %lf\n", sum);    
return 0;
}
{{< /highlight >}}
{{% /expand %}}

---

#### Compiling the Code

The compiled executable needs to match the "standard" environment of the
worker node. The easiest way is to use the compilers installed on the HCC
supercomputer directly, without loading extra modules. The standard
compiler on the HCC supercomputers is the GNU Compiler Collection (GCC);
the version can be checked with the commands `gcc -v` or `gfortran -v`.


{{< highlight bash >}}
$ gfortran demo_f_condor.f90 -o demo_f_condor.x
$ gcc demo_c_condor.c -o demo_c_condor.x
{{< /highlight >}}
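
Before submitting, the executables can be quickly tested against one of
the data sets; each program reads `data.dat` from its current working
directory, so an illustrative check might be:

{{< highlight bash >}}
$ cd job_0
$ ../demo_f_condor.x
$ ../demo_c_condor.x
$ cd ..
{{< /highlight >}}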

#### Creating a Submit Script

Create a submit script to request 2 jobs (`queue 2`). The names of the job
subdirectories are specified in the line `initialdir`. The
`$(process)` macro assigns an integer to the job subdirectory
name `job_`; the numbers run from `0` to `queue-1`. The name of the input
data file is specified in the line `transfer_input_files`.

{{% panel header="`submit_f.condor`"%}}
{{< highlight bash >}}
universe = grid
grid_resource = pbs
batch_queue = guest
should_transfer_files = yes
when_to_transfer_output = on_exit
executable = demo_f_condor.x
output = Fortran_$(process).out
error = Fortran_$(process).err
initialdir = job_$(process)
transfer_input_files = data.dat
queue 2
{{< /highlight >}}
{{% /panel %}}

{{% panel header="`submit_c.condor`"%}}
{{< highlight bash >}}
universe = grid
grid_resource = pbs
batch_queue = guest
should_transfer_files = yes
when_to_transfer_output = on_exit
executable = demo_c_condor.x
output = C_$(process).out
error = C_$(process).err
initialdir = job_$(process)
transfer_input_files = data.dat
queue 2
{{< /highlight >}}
{{% /panel %}}

#### Submit the Job

The jobs can be submitted with the command `condor_submit`. The job
status can be monitored by entering `condor_q` followed by your
username.

{{< highlight bash >}}
$ condor_submit submit_f.condor
$ condor_submit submit_c.condor
$ condor_q <username>
{{< /highlight >}}

Replace `<username>` with your HCC username.
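
Once the jobs finish, the output and error files named in the submit
scripts (`output` and `error` lines) appear inside the corresponding job
subdirectories, so the results can be inspected with, for example:

{{< highlight bash >}}
$ cat job_0/Fortran_0.out
$ cat job_1/Fortran_1.out
{{< /highlight >}}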

#### Sample Output

In the job subdirectory `job_0`, the sum from 1 to 5 is computed and
printed to the `.out` file. In the job subdirectory `job_1`, the sum
from 6 to 10 is computed and printed to the `.out` file. 

{{%expand "Fortran_0.out" %}}
{{< highlight batchfile >}}
 i,x =            1   1.0000000000000000     
 i,x =            2   2.0000000000000000     
 i,x =            3   3.0000000000000000     
 i,x =            4   4.0000000000000000     
 i,x =            5   5.0000000000000000     
 sum(y) =   15.000000000000000     
{{< /highlight >}}
{{% /expand %}}

{{%expand "Fortran_1.out" %}}
{{< highlight batchfile >}}
 i,x =            1   6.0000000000000000     
 i,x =            2   7.0000000000000000     
 i,x =            3   8.0000000000000000     
 i,x =            4   9.0000000000000000     
 i,x =            5   10.000000000000000     
 sum(y) =   40.000000000000000     
{{< /highlight >}}
{{% /expand %}}
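
If desired, the sums can be cross-checked directly against the input data
from the shell, for example with a short `awk` one-liner:

{{< highlight bash >}}
$ awk '{ total += $1 } END { print total }' job_0/data.dat
$ awk '{ total += $1 } END { print total }' job_1/data.dat
{{< /highlight >}}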