14.2 More on Collective Communication

Unlike point-to-point communication, collective communication involves every process in a communication group. In Chapter 13, you saw two examples of collective communication functions, MPI_Bcast and MPI_Reduce (along with MPI_Allreduce). There are two advantages to collective communication functions. First, they allow you to express a complex operation using simpler semantics. Second, the implementation may be able to optimize the operations in ways not available with simple point-to-point operations.

Collective operations fall into three categories: a barrier synchronization function, global communication or data movement functions (e.g., MPI_Bcast), and global reduction or collective computation functions (e.g., MPI_Reduce). There is only one barrier synchronization function, MPI_Barrier. It serves to synchronize the processes; no data is exchanged. This function is described in Chapter 17. All of the other collective functions are nonsynchronous. That is, a collective function can return as soon as its role in the communication process is complete. Unlike point-to-point operations, which offer several communication modes, collective functions support only this nonsynchronous mode.

While collective functions don't have to wait for the corresponding functions to execute on other nodes, they may block while waiting for space in system buffers. Thus, collective functions come only in blocking versions.

The requirements that all collective functions be blocking and nonsynchronous, and that they support only one communication mode, simplify the semantics of collective operations. There are other simplifications in the same vein: no tag argument is used, the amount of data sent must exactly match the amount of data received, and every process must call the function with the same arguments.
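Since MPI_Barrier takes nothing but a communicator, using it is straightforward. Here is a minimal sketch of our own (it is not one of the examples from Chapter 17) in which no process can reach its second fprintf call until every process has made its first:

#include "mpi.h"
#include <stdio.h>

int main( int argc, char * argv[] )
{
   int processId;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &processId);

   fprintf(stderr, "Process %d: before the barrier\n", processId);

   /* No process returns from this call until every process in
      MPI_COMM_WORLD has entered it. */
   MPI_Barrier(MPI_COMM_WORLD);

   fprintf(stderr, "Process %d: after the barrier\n", processId);

   MPI_Finalize();
   return 0;
}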

14.2.1 Gather and Scatter

After MPI_Bcast and MPI_Reduce, the two most useful collective operations are MPI_Gather and MPI_Scatter.

14.2.1.1 MPI_Gather

MPI_Gather, as the name implies, gathers information from all the processes in a communication group. It takes eight arguments. The first three arguments define the information that is sent: the starting address of the send buffer, the number of items in the buffer, and the data type of the items. The next three arguments, which define the received information, are the address of the receive buffer, the number of items to receive from each individual process, and the data type. The seventh argument is the rank of the receiving or root process. And the last argument is the communicator. Keep in mind that each process, including the root process, sends the contents of its send buffer to the receiver. The root process receives the messages, which are stored in rank order in the receive buffer. While this may seem similar to MPI_Reduce, notice that the data is simply received; it is not combined or reduced in any way.

Figure 14-1 shows the flow of data with a gather operation. The root or receiver can be any of the processes. Each process sends a different piece of the data, shown in the figure as the different x's.

Figure 14-1. Gathering data


Here is an example using MPI_Gather.

#include "mpi.h"

#include <stdio.h>

   

int main( int argc, char * argv[  ] )

{

   int processId;

   int a[4] = {0, 0, 0, 0};

   int b[4] = {0, 0, 0, 0};

   

   MPI_Init(&argc, &argv);

   MPI_Comm_rank(MPI_COMM_WORLD, &processId);

   

   if (processId = = 0) a[0] = 1; 

   if (processId = = 1) a[1] = 2; 

   if (processId = = 2) a[2] = 3; 

   if (processId = = 3) a[3] = 5; 

   

   if (processId = = 0)

      fprintf(stderr, "Before: b[  ] = [%d, %d, %d, %d]\n", b[0], b[1], b[2], 

b[3]);

   

   MPI_Gather(&a[processId], 1, MPI_INT, b, 1, MPI_INT, 0, MPI_COMM_WORLD);

   

   if (processId = = 0)

      fprintf(stderr, "After:  b[  ] = [%d, %d, %d, %d]\n", b[0], b[1], b[2], 

b[3]);

 

   MPI_Finalize( );

   return 0;

}

While this is a somewhat contrived example, you can clearly see how MPI_Gather works. Pay particular attention to the arguments in this example. Note that both the address of the item sent and the address of the receive buffer (in this case just an array name) are used. Here is the output:

[sloanjd@amy C12]$ mpirun -np 4 gath
Before: b[] = [0, 0, 0, 0]
After:  b[] = [1, 2, 3, 5]

MPI_Gather has a couple of useful variants. MPI_Gatherv, which is used when the amount of data varies from process to process, replaces the single receive count with an integer array of per-process counts and takes an additional argument, an integer array of displacements giving where each process's data is placed in the receive buffer. MPI_Allgather functions just like MPI_Gather except that all processes receive copies of the gathered data. There is also an MPI_Allgatherv.
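For example, here is a minimal MPI_Gatherv sketch of our own (the variable names, the assumption of at most eight processes, and the choice of having process i contribute i + 1 integers are purely illustrative):

/* An MPI_Gatherv sketch: process i contributes i + 1 copies of its rank.
   recvCounts and displs are significant only at the root, but it is
   harmless to compute them everywhere. Assumes at most 8 processes. */
#include "mpi.h"
#include <stdio.h>

int main( int argc, char * argv[] )
{
   int processId, numProcesses, i;
   int sendData[8], recvCounts[8], displs[8], recvData[64];

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &processId);
   MPI_Comm_size(MPI_COMM_WORLD, &numProcesses);

   for (i = 0; i <= processId; i++) sendData[i] = processId;

   for (i = 0; i < numProcesses; i++) {
      recvCounts[i] = i + 1;              /* how many items rank i sends            */
      displs[i] = (i * (i + 1)) / 2;      /* where rank i's items start in recvData */
   }

   MPI_Gatherv(sendData, processId + 1, MPI_INT,
               recvData, recvCounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

   if (processId == 0) {
      for (i = 0; i < (numProcesses * (numProcesses + 1)) / 2; i++)
         fprintf(stderr, "%d ", recvData[i]);
      fprintf(stderr, "\n");
   }

   MPI_Finalize();
   return 0;
}

Run with four processes, the root prints 0 1 1 2 2 2 3 3 3 3.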

14.2.1.2 MPI_Scatter

MPI_Scatter is the dual or inverse of MPI_Gather. If you compare Figure 14-2 to Figure 14-1, the only difference is the direction of the arrows. The arguments to MPI_Scatter are the same as MPI_Gather. You can think of MPI_Scatter as splitting the send buffer and sending a piece to each receiver, i.e., each receiver receives a unique piece of data. MPI_Scatter also has a vector variant MPI_Scatterv.

Figure 14-2. Scattering data


Here is another contrived example, this time with MPI_Scatter:

#include "mpi.h"

#include <stdio.h>

   

int main( int argc, char * argv[  ] )

{

   int processId, b;

   int a[4] = {0, 0, 0, 0};

   

   MPI_Init(&argc, &argv);

   MPI_Comm_rank(MPI_COMM_WORLD, &processId);

   

   if (processId = = 0) { a[0] = 1; a[1] = 2; a[2] = 3; a[3] = 5; }

   

   MPI_Scatter(a, 1, MPI_INT, &b, 1, MPI_INT, 0, MPI_COMM_WORLD);

   

   fprintf(stderr, "Process %d: b = %d \n", processId, b);

   

   MPI_Finalize( );

   return 0;

}

Notice that we are sending an array but receiving its individual elements in this example. In summary, MPI_Gather sends an element and receives an array while MPI_Scatter sends an array and receives an element.
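The vector variant works the same way in reverse. Here is a comparable sketch of our own for MPI_Scatterv, in which the root sends i + 1 integers to process i (again, the names and sizes are only illustrative, and at most eight processes are assumed):

/* An MPI_Scatterv sketch: the root sends i + 1 integers to process i.
   sendData, sendCounts, and displs are significant only at the root. */
#include "mpi.h"
#include <stdio.h>

int main( int argc, char * argv[] )
{
   int processId, numProcesses, i;
   int sendData[64], sendCounts[8], displs[8], recvData[8];

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &processId);
   MPI_Comm_size(MPI_COMM_WORLD, &numProcesses);

   for (i = 0; i < numProcesses; i++) {
      sendCounts[i] = i + 1;              /* items destined for rank i   */
      displs[i] = (i * (i + 1)) / 2;      /* where rank i's items start  */
   }
   for (i = 0; i < (numProcesses * (numProcesses + 1)) / 2; i++)
      sendData[i] = i;

   MPI_Scatterv(sendData, sendCounts, displs, MPI_INT,
                recvData, processId + 1, MPI_INT, 0, MPI_COMM_WORLD);

   fprintf(stderr, "Process %d received %d item(s), starting with %d\n",
           processId, processId + 1, recvData[0]);

   MPI_Finalize();
   return 0;
}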

As with point-to-point communication, there are additional collective operations (for example, MPI_Alltoall, MPI_Reduce_scatter, and MPI_Scan). It is even possible to define your own reduction operator (using MPI_Op_create) for use with MPI_Reduce, etc. Because the amount and nature of the data that you need to share varies with the nature of the problem, it is worth becoming familiar with MPI's collective functions. You may need them sooner than you think.
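To give you a taste of user-defined operators, here is one last sketch of our own (the operator absMax, its variable names, and the test data are our own illustration). It creates an operator that keeps the value with the largest absolute value and uses it with MPI_Reduce:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

/* User-defined reduction: elementwise, keep the value with the
   larger absolute value. The argument list is fixed by MPI. */
void absMax(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
   int i;
   int *in    = (int *) invec;
   int *inout = (int *) inoutvec;

   for (i = 0; i < *len; i++)
      if (abs(in[i]) > abs(inout[i])) inout[i] = in[i];
}

int main( int argc, char * argv[] )
{
   int processId, value, result;
   MPI_Op myOp;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &processId);

   value = (processId % 2) ? -processId : processId;   /* some signed test data */

   MPI_Op_create(absMax, 1, &myOp);     /* 1 asserts the operation is commutative */
   MPI_Reduce(&value, &result, 1, MPI_INT, myOp, 0, MPI_COMM_WORLD);
   MPI_Op_free(&myOp);

   if (processId == 0)
      fprintf(stderr, "Value with the largest magnitude: %d\n", result);

   MPI_Finalize();
   return 0;
}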
