+++
title = "MPI Jobs on HCC"
description = "How to compile and run MPI programs on HCC machines"
weight = "52"
+++

This quick start demonstrates how to implement a parallel (MPI) Fortran/C program on HCC supercomputers. The sample codes and submit scripts can be downloaded from [mpi_dir.zip](/attachments/mpi_dir.zip).

#### Login to an HCC Cluster

[Connect to an HCC cluster]({{< relref "../../connecting/" >}}) and make a subdirectory called `mpi_dir` under your `$WORK` directory.

{{< highlight bash >}}
$ cd $WORK
$ mkdir mpi_dir
{{< /highlight >}}

In the subdirectory `mpi_dir`, save all the relevant codes. Here we include two demo programs, `demo_f_mpi.f90` and `demo_c_mpi.c`, that compute the sum from 1 to 20 through parallel processes. A straightforward parallelization scheme is used for demonstration purposes. First, the master core (i.e. `myid=0`) distributes an equal computation workload to a certain number of cores (as specified by `--ntasks` in the submit script). Then, each worker core computes a partial summation as output. Finally, the master core collects the outputs from all worker cores and performs an overall summation. For easy comparison with the serial code ([Fortran/C on HCC]({{< relref "fortran_c_on_hcc" >}})), the added lines in the parallel code (MPI) are marked with "!=" or "//=".
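For example, with `--ntasks=5` each core receives `N_local = 20/5 = 4` indices: the core with `myid=0` handles `i = 1` to `4`, `myid=1` handles `i = 5` to `8`, and so on, up to `myid=4`, which handles `i = 17` to `20`. Note that the demos assume `N` is evenly divisible by the number of tasks, since `N_local = N/numnodes` uses integer division.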
{{%expand "demo_f_mpi.f90" %}}
{{< highlight fortran >}}
Program demo_f_mpi
!====== MPI =====
    use mpi
!================
    implicit none
    integer, parameter :: N = 20
    real*8 w
    integer i
    common/sol/ x
    real*8 x
    real*8, dimension(N) :: y
!============================== MPI =================================
    integer ind
    real*8, dimension(:), allocatable :: y_local
    integer numnodes,myid,rc,ierr,start_local,end_local,N_local
    real*8 allsum
!====================================================================

!============================== MPI =================================
    call mpi_init( ierr )
    call mpi_comm_rank ( mpi_comm_world, myid, ierr )
    call mpi_comm_size ( mpi_comm_world, numnodes, ierr )

    N_local = N/numnodes
    allocate ( y_local(N_local) )
    start_local = N_local*myid + 1
    end_local   = N_local*myid + N_local
!====================================================================

    do i = start_local, end_local
        w = i*1d0
        call proc(w)
        ind = i - N_local*myid
        y_local(ind) = x
!       y(i) = x
!       write(6,*) 'i, y(i)', i, y(i)
    enddo
!   write(6,*) 'sum(y) =',sum(y)

!============================================== MPI =====================================================
    call mpi_reduce( sum(y_local), allsum, 1, mpi_real8, mpi_sum, 0, mpi_comm_world, ierr )
    call mpi_gather ( y_local, N_local, mpi_real8, y, N_local, mpi_real8, 0, mpi_comm_world, ierr )

    if (myid == 0) then
        write(6,*) '-----------------------------------------'
        write(6,*) '*Final output from... myid=', myid
        write(6,*) 'numnodes =', numnodes
        write(6,*) 'mpi_sum =', allsum
        write(6,*) 'y=...'
        do i = 1, N
            write(6,*) y(i)
        enddo
        write(6,*) 'sum(y)=', sum(y)
    endif

    deallocate( y_local )
    call mpi_finalize(rc)
!========================================================================================================

    Stop
End Program

Subroutine proc(w)
    real*8, intent(in) :: w
    common/sol/ x
    real*8 x

    x = w

    Return
End Subroutine
{{< /highlight >}}
{{% /expand %}}

{{%expand "demo_c_mpi.c" %}}
{{< highlight c >}}
//demo_c_mpi
#include <stdio.h>
//======= MPI ========
#include "mpi.h"
#include <stdlib.h>
//====================

double proc(double w){
    double x;
    x = w;
    return x;
}

int main(int argc, char* argv[]){
    int N=20;
    double w;
    int i;
    double x;
    double y[N];
    double sum;
//=============================== MPI ============================
    int ind;
    double *y_local;
    int numnodes,myid,rc,ierr,start_local,end_local,N_local;
    double allsum;
//================================================================

//=============================== MPI ============================
    MPI_Init(&argc, &argv);
    MPI_Comm_rank( MPI_COMM_WORLD, &myid );
    MPI_Comm_size ( MPI_COMM_WORLD, &numnodes );

    N_local = N/numnodes;
    y_local = (double *) malloc(N_local*sizeof(double));
    start_local = N_local*myid + 1;
    end_local = N_local*myid + N_local;
//================================================================

    for (i = start_local; i <= end_local; i++){
        w = i*1e0;
        x = proc(w);
        ind = i - N_local*myid;
        y_local[ind-1] = x;
//      y[i-1] = x;
//      printf("i,x= %d %lf\n", i, y[i-1]) ;
    }
    sum = 0e0;
    for (i = 1; i <= N_local; i++){
        sum = sum + y_local[i-1];
    }
//  printf("sum(y)= %lf\n", sum);

//====================================== MPI ===========================================
    MPI_Reduce( &sum, &allsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
    MPI_Gather( &y_local[0], N_local, MPI_DOUBLE, &y[0], N_local, MPI_DOUBLE, 0, MPI_COMM_WORLD );

    if (myid == 0){
        printf("-----------------------------------\n");
        printf("*Final output from... myid= %d\n", myid);
        printf("numnodes = %d\n", numnodes);
        printf("mpi_sum = %lf\n", allsum);
        printf("y=...\n");
        for (i = 1; i <= N; i++){
            printf("%lf\n", y[i-1]);
        }
        sum = 0e0;
        for (i = 1; i <= N; i++){
            sum = sum + y[i-1];
        }
        printf("sum(y) = %lf\n", sum);
    }

    free( y_local );
    MPI_Finalize ();
//======================================================================================

    return 0;
}
{{< /highlight >}}
{{% /expand %}}

---

#### Compiling the Code

Compiling an MPI code requires first loading a compiler "engine" such as `gcc`, `intel`, or `pgi` and then loading the MPI wrapper `openmpi`. Here we will use the GNU Compiler Collection, `gcc`, for demonstration.

{{< highlight bash >}}
$ module load compiler/gcc/6.1 openmpi/2.1
$ mpif90 demo_f_mpi.f90 -o demo_f_mpi.x
$ mpicc demo_c_mpi.c -o demo_c_mpi.x
{{< /highlight >}}

The above commands load the `gcc` compiler with the `openmpi` wrapper. The compiling commands `mpif90` and `mpicc` are then used to compile the codes into `.x` files (executables).
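If you want to confirm which underlying compiler the MPI wrappers call, Open MPI's wrapper compilers typically accept a `--showme` option that prints the full compile/link command without running it. This is a quick sanity check, assuming the `openmpi` module provides the standard Open MPI wrappers:

{{< highlight bash >}}
$ mpicc --showme    # print the gcc command line the wrapper would run
$ mpif90 --showme   # same for the Fortran wrapper
{{< /highlight >}}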
#### Creating a Submit Script

Create a submit script to request 5 cores (with `--ntasks`). On the last line of the script, the parallel execution command `mpirun ./` is placed before the program name.

{{% panel header="`submit_f.mpi`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --ntasks=5
#SBATCH --mem-per-cpu=1024
#SBATCH --time=00:01:00
#SBATCH --job-name=Fortran
#SBATCH --error=Fortran.%J.err
#SBATCH --output=Fortran.%J.out

mpirun ./demo_f_mpi.x
{{< /highlight >}}
{{% /panel %}}

{{% panel header="`submit_c.mpi`"%}}
{{< highlight bash >}}
#!/bin/sh
#SBATCH --ntasks=5
#SBATCH --mem-per-cpu=1024
#SBATCH --time=00:01:00
#SBATCH --job-name=C
#SBATCH --error=C.%J.err
#SBATCH --output=C.%J.out

mpirun ./demo_c_mpi.x
{{< /highlight >}}
{{% /panel %}}

#### Submit the Job

The job can be submitted through the command `sbatch`. The job status can be monitored by entering `squeue` with the `-u` option.

{{< highlight bash >}}
$ sbatch submit_f.mpi
$ sbatch submit_c.mpi
$ squeue -u <username>
{{< /highlight >}}

Replace `<username>` with your HCC username.

#### Sample Output

The sum from 1 to 20 is computed and printed to the `.out` file (see below). The outputs from the 5 cores are collected and processed by the master core (i.e. `myid=0`).

{{%expand "Fortran.out" %}}
{{< highlight batchfile >}}
-----------------------------------------
*Final output from... myid= 0
numnodes = 5
mpi_sum = 210.00000000000000
y=...
1.0000000000000000
2.0000000000000000
3.0000000000000000
4.0000000000000000
5.0000000000000000
6.0000000000000000
7.0000000000000000
8.0000000000000000
9.0000000000000000
10.000000000000000
11.000000000000000
12.000000000000000
13.000000000000000
14.000000000000000
15.000000000000000
16.000000000000000
17.000000000000000
18.000000000000000
19.000000000000000
20.000000000000000
sum(y)= 210.00000000000000
{{< /highlight >}}
{{% /expand %}}

{{%expand "C.out" %}}
{{< highlight batchfile >}}
-----------------------------------
*Final output from... myid= 0
numnodes = 5
mpi_sum = 210.000000
y=...
1.000000
2.000000
3.000000
4.000000
5.000000
6.000000
7.000000
8.000000
9.000000
10.000000
11.000000
12.000000
13.000000
14.000000
15.000000
16.000000
17.000000
18.000000
19.000000
20.000000
sum(y) = 210.000000
{{< /highlight >}}
{{% /expand %}}
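The output shown above is written to the files named by the `--output` lines of the submit scripts. Once your jobs complete, you can view the files directly, for example (replace `<jobid>` with the job number reported by `sbatch`):

{{< highlight bash >}}
$ ls Fortran.*.out C.*.out
$ cat Fortran.<jobid>.out
$ cat C.<jobid>.out
{{< /highlight >}}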