+++
title = "MPI Jobs on HCC"
description = "How to compile and run MPI programs on HCC machines"
weight = "52"
+++

This quick start demonstrates how to implement a parallel (MPI)
Fortran/C program on HCC supercomputers. The sample codes and submit
scripts can be downloaded from [mpi_dir.zip](/attachments/mpi_dir.zip).

#### Log in to an HCC Cluster

[Connect to an HCC cluster]({{< relref "../../connecting/" >}}) and make a subdirectory
called `mpi_dir` under your `$WORK` directory.

{{< highlight bash >}}
$ cd $WORK
$ mkdir mpi_dir
{{< /highlight >}}

In the subdirectory `mpi_dir`, save all the relevant codes. Here we
include two demo programs, `demo_f_mpi.f90` and `demo_c_mpi.c`, that
compute the sum from 1 to 20 through parallel processes. A
straightforward parallelization scheme is used for demonstration
purposes. First, the computation workload is divided equally among a
certain number of cores (as specified by `--ntasks` in the submit
script); each core derives its own index range from its rank `myid`.
Then, each core computes a partial summation as output. Finally, the
master core (i.e. `myid=0`) collects the outputs from all cores and
performs an overall summation. For easy comparison with the serial code
([Fortran/C on HCC]({{< relref "fortran_c_on_hcc">}})), the added lines
in the parallel code (MPI) are marked with `!=` or `//=`.

{{%expand "demo_f_mpi.f90" %}}
{{< highlight fortran >}}
Program demo_f_mpi
!====== MPI =====
    use mpi     
!================
    implicit none
    integer, parameter :: N = 20
    real*8 w
    integer i
    common/sol/ x
    real*8 x
    real*8, dimension(N) :: y 
!============================== MPI =================================
    integer ind
    real*8, dimension(:), allocatable :: y_local                    
    integer numnodes,myid,rc,ierr,start_local,end_local,N_local     
    real*8 allsum                                                   
!====================================================================
    
!============================== MPI =================================
    call mpi_init( ierr )                                           
    call mpi_comm_rank ( mpi_comm_world, myid, ierr )               
    call mpi_comm_size ( mpi_comm_world, numnodes, ierr )           
    N_local = N/numnodes                                            
    allocate ( y_local(N_local) )                                   
    start_local = N_local*myid + 1                                  
    end_local =  N_local*myid + N_local                             
!====================================================================
    do i = start_local, end_local
        w = i*1d0
        call proc(w)
        ind = i - N_local*myid
        y_local(ind) = x
!       y(i) = x
!       write(6,*) 'i, y(i)', i, y(i)
    enddo   
!       write(6,*) 'sum(y) =',sum(y)
!============================================== MPI =====================================================
    call mpi_reduce( sum(y_local), allsum, 1, mpi_real8, mpi_sum, 0, mpi_comm_world, ierr )             
    call mpi_gather ( y_local, N_local, mpi_real8, y, N_local, mpi_real8, 0, mpi_comm_world, ierr )     
                                                                                                        
    if (myid == 0) then                                                                                 
        write(6,*) '-----------------------------------------'                                          
        write(6,*) '*Final output from... myid=', myid                                                  
        write(6,*) 'numnodes =', numnodes                                                               
        write(6,*) 'mpi_sum =', allsum  
        write(6,*) 'y=...'
        do i = 1, N
            write(6,*) y(i)
        enddo                                                                                       
        write(6,*) 'sum(y)=', sum(y)                                                                
    endif                                                                                               
                                                                                                        
    deallocate( y_local )                                                                               
    call mpi_finalize(rc)                                                                               
!========================================================================================================
    
Stop
End Program
Subroutine proc(w)
    real*8, intent(in) :: w
    common/sol/ x
    real*8 x
    
    x = w
    
Return
End Subroutine
{{< /highlight >}}
{{% /expand %}}

{{%expand "demo_c_mpi.c" %}}
{{< highlight c >}}
//demo_c_mpi
#include <stdio.h>
//======= MPI ========
#include "mpi.h"    
#include <stdlib.h>   
//====================

double proc(double w){
        double x;       
        x = w;  
        return x;
}

int main(int argc, char* argv[]){
    int N=20;
    double w;
    int i;
    double x;
    double y[N];
    double sum;
//=============================== MPI ============================
    int ind;                                                    
    double *y_local;                                            
    int numnodes,myid,rc,ierr,start_local,end_local,N_local;    
    double allsum;                                              
//================================================================
//=============================== MPI ============================
    MPI_Init(&argc, &argv);
    MPI_Comm_rank( MPI_COMM_WORLD, &myid );
    MPI_Comm_size ( MPI_COMM_WORLD, &numnodes );
    N_local = N/numnodes;
    y_local=(double *) malloc(N_local*sizeof(double));
    start_local = N_local*myid + 1;
    end_local = N_local*myid + N_local;
//================================================================
    
    for (i = start_local; i <= end_local; i++){        
        w = i*1e0;
        x = proc(w);
        ind = i - N_local*myid;
        y_local[ind-1] = x;
//      y[i-1] = x;
//      printf("i,x= %d %lf\n", i, y[i-1]) ;
    }
    sum = 0e0;
    for (i = 1; i<= N_local; i++){
        sum = sum + y_local[i-1];   
    }
//  printf("sum(y)= %lf\n", sum);    
//====================================== MPI ===========================================
    MPI_Reduce( &sum, &allsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
    MPI_Gather( &y_local[0], N_local, MPI_DOUBLE, &y[0], N_local, MPI_DOUBLE, 0, MPI_COMM_WORLD );
    
    if (myid == 0){
        printf("-----------------------------------\n");
        printf("*Final output from... myid= %d\n", myid);
        printf("numnodes = %d\n", numnodes);
        printf("mpi_sum = %lf\n", allsum);
        printf("y=...\n");
        for (i = 1; i <= N; i++){
            printf("%lf\n", y[i-1]);
        }
        sum = 0e0;
        for (i = 1; i <= N; i++){
            sum = sum + y[i-1];
        }

        printf("sum(y) = %lf\n", sum);
    }
    
    free( y_local );
    MPI_Finalize ();
//======================================================================================        

return 0;
}
{{< /highlight >}}
{{% /expand %}}
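
The index arithmetic shared by both demo programs can be previewed without MPI. The following is a minimal sketch in plain shell arithmetic that mirrors `N_local`, `start_local`, and `end_local` from the listings above. The helper name `local_range` is hypothetical and is not part of `mpi_dir.zip`; the values `N = 20` and 5 ranks are taken from the demo setup.

```shell
#!/bin/sh
# Sketch: mirror the demo's index arithmetic for a single rank.
# local_range is a hypothetical helper, not part of mpi_dir.zip.
local_range() {
    # $1 = myid, $2 = N, $3 = numnodes
    n_local=$(( $2 / $3 ))                    # integer division, as in the demos
    start_local=$(( n_local * $1 + 1 ))       # first index owned by this rank
    end_local=$(( n_local * $1 + n_local ))   # last index owned by this rank
    echo "$start_local $end_local"
}

# Print the index range handled by each of the 5 ranks for N = 20.
for myid in 0 1 2 3 4; do
    echo "myid=$myid: $(local_range "$myid" 20 5)"
done
# myid=0 handles 1 to 4, myid=1 handles 5 to 8, and so on up to myid=4 with 17 to 20.
```

Each rank owns a contiguous block of four indices, which is why `y_local` is allocated with `N_local` elements and addressed with the shifted index `ind`.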

---

#### Compiling the Code

Compiling an MPI code requires first loading a compiler "engine"
such as `gcc`, `intel`, or `pgi`, and then loading an MPI wrapper,
`openmpi`. Here we use the GNU Compiler Collection, `gcc`, for
demonstration.

{{< highlight bash >}}
$ module load compiler/gcc/6.1 openmpi/2.1
$ mpif90 demo_f_mpi.f90 -o demo_f_mpi.x  
$ mpicc demo_c_mpi.c -o demo_c_mpi.x
{{< /highlight >}}

The above commands load the `gcc` compiler with the `openmpi` wrapper.
The compile commands `mpif90` and `mpicc` then build the codes into
`.x` files (executables).
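
If a compile command fails with "command not found", the modules were most likely not loaded in the current shell. A minimal sketch for checking that the wrappers are on the `PATH`, assuming a POSIX shell; the helper name `have_tool` is hypothetical:

```shell
#!/bin/sh
# Sketch: confirm the MPI wrappers are reachable after `module load`.
# have_tool is a hypothetical helper name.
have_tool() {
    # Succeeds only if the named command can be found by the shell.
    command -v "$1" >/dev/null 2>&1
}

for tool in mpif90 mpicc mpirun; do
    if have_tool "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found (has the module been loaded?)"
    fi
done
```

Running this before and after `module load` shows whether the wrappers became available.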

#### Creating a Submit Script

Create a submit script to request 5 cores (with `--ntasks`). The
parallel execution command `mpirun` must precede the program name
(e.g. `./demo_f_mpi.x`) on the last line of the script.

{{% panel header="`submit_f.mpi`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --ntasks=5
#SBATCH --mem-per-cpu=1024
#SBATCH --time=00:01:00
#SBATCH --job-name=Fortran
#SBATCH --error=Fortran.%J.err
#SBATCH --output=Fortran.%J.out

mpirun ./demo_f_mpi.x 
{{< /highlight >}}
{{% /panel %}}

{{% panel header="`submit_c.mpi`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --ntasks=5
#SBATCH --mem-per-cpu=1024
#SBATCH --time=00:01:00
#SBATCH --job-name=C
#SBATCH --error=C.%J.err
#SBATCH --output=C.%J.out

mpirun ./demo_c_mpi.x 
{{< /highlight >}}
{{% /panel %}}
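
Both demos compute `N_local = N/numnodes` with integer division, so a task count that does not divide `N = 20` evenly leaves the trailing terms unassigned. With `--ntasks=6`, for example, `N_local` is 3 and only indices 1 to 18 are covered. A minimal sketch for listing the task counts that divide `N` evenly; the helper name `valid_ntasks` is hypothetical:

```shell
#!/bin/sh
# Sketch: list the --ntasks values that split N terms evenly.
# valid_ntasks is a hypothetical helper, not part of mpi_dir.zip.
valid_ntasks() {
    # $1 = N; prints every count from 1 to N that divides N with no remainder
    t=1
    out=""
    while [ "$t" -le "$1" ]; do
        if [ $(( $1 % t )) -eq 0 ]; then
            out="$out $t"
        fi
        t=$(( t + 1 ))
    done
    echo "${out# }"   # drop the leading space
}

valid_ntasks 20   # prints: 1 2 4 5 10 20
```

The value 5 used in the submit scripts is one of these, which is why every term from 1 to 20 is computed.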

#### Submit the Job

The job can be submitted through the command `sbatch`. The job status
can be monitored by entering `squeue` with the `-u` option.

{{< highlight bash >}}
$ sbatch submit_f.mpi
$ sbatch submit_c.mpi
$ squeue -u <username>
{{< /highlight >}}

Replace `<username>` with your HCC username.

#### Sample Output

The sum from 1 to 20 is computed and printed to the `.out` file (see
below). The outputs from the 5 cores are collected and processed by the
master core (i.e. `myid=0`).

{{%expand "Fortran.out" %}}
{{< highlight batchfile>}}
 -----------------------------------------
 *Final output from... myid=           0
 numnodes =           5
 mpi_sum =   210.00000000000000     
 y=...
   1.0000000000000000     
   2.0000000000000000     
   3.0000000000000000     
   4.0000000000000000     
   5.0000000000000000     
   6.0000000000000000     
   7.0000000000000000     
   8.0000000000000000     
   9.0000000000000000     
   10.000000000000000     
   11.000000000000000     
   12.000000000000000     
   13.000000000000000     
   14.000000000000000     
   15.000000000000000     
   16.000000000000000     
   17.000000000000000     
   18.000000000000000     
   19.000000000000000     
   20.000000000000000     
 sum(y)=   210.00000000000000     
{{< /highlight >}}
{{% /expand %}} 

{{%expand "C.out" %}}
{{< highlight batchfile>}}
-----------------------------------
*Final output from... myid= 0
numnodes = 5
mpi_sum = 210.000000
y=...
1.000000
2.000000
3.000000
4.000000
5.000000
6.000000
7.000000
8.000000
9.000000
10.000000
11.000000
12.000000
13.000000
14.000000
15.000000
16.000000
17.000000
18.000000
19.000000
20.000000
sum(y) = 210.000000
{{< /highlight >}}
{{% /expand %}}
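
The value 210 reported as `mpi_sum` can be cross-checked without MPI. The following is a minimal sketch that reproduces the per-rank partial sums the reduction adds up and compares the total with the closed form N(N+1)/2. The helper name `partial_sum` is hypothetical; `N = 20` and 5 ranks are taken from the demo setup.

```shell
#!/bin/sh
# Sketch: reproduce the per-rank partial sums combined by the reduction.
# partial_sum is a hypothetical helper, not part of mpi_dir.zip.
partial_sum() {
    # $1 = myid, $2 = N, $3 = numnodes
    n_local=$(( $2 / $3 ))
    i=$(( n_local * $1 + 1 ))
    end_local=$(( n_local * $1 + n_local ))
    s=0
    while [ "$i" -le "$end_local" ]; do
        s=$(( s + i ))
        i=$(( i + 1 ))
    done
    echo "$s"
}

total=0
myid=0
while [ "$myid" -lt 5 ]; do
    p=$(partial_sum "$myid" 20 5)
    echo "myid=$myid partial sum = $p"
    total=$(( total + p ))
    myid=$(( myid + 1 ))
done
echo "total = $total, closed form N(N+1)/2 = $(( 20 * 21 / 2 ))"
```

The five partial sums are 10, 26, 42, 58, and 74, which add up to 210 and agree with both `mpi_sum` and `sum(y)` in the outputs above.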