
16.4 Rereading Code

There are three basic, escalating strategies for locating errors: rereading code, printing information at key points, and using a symbolic debugger. There is an interesting correspondence between these debugging strategies and search strategies, i.e., linear search, binary search, and indexed search. When reading code, we are searching linearly for the error. Printing works best when we take a binary approach, placing output near the middle of the suspect region and repeatedly halving it. Through the breakpoints a symbolic debugger provides, we are often able to move directly to a questionable line of code.

Rereading (or reading for the first time in some cases) means looking at the code really hard with the hope the error will jump out at you. This is the best approach for new code since you are likely to find a number of errors as well as other opportunities to improve the code. It also works well when you have a pretty good idea of where the problem is. If it is a familiar error, if you have just changed a small segment of code, or if the error could only have come from one small segment of code, rereading is a viable approach.

Rereading relies on your repeatedly asking the question, "If I were a computer, what would I do?" You can still play this game with a cluster; you just have to pretend to be several computers at once and keep everything straight. With a cluster, the order of operations is crucial. If you take this approach, you'll need to take extra care not to trace past a point in one process that depends on another process without first confirming that the other process will do its part. An example may help explain what I mean.

As previously noted, one problem you may encounter with a parallel program is deadlock. For example, if two processes are waiting to receive from each other before sending to each other, both will be stalled. It is very easy when manually tracing a process to skim right over the receive call, assuming the other process has sent the necessary information. Making that type of assumption is what you must guard against when pretending to be a cluster of computers. Here is an example:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
   int datum1 = 19, datum2 = 23, datum3 = 27;
   int datum4, datum5, datum6;
   int noProcesses, processId;
   MPI_Status status;

   /* MPI setup */
   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
   MPI_Comm_rank(MPI_COMM_WORLD, &processId);

   if (processId == 0)           /* for rank 0 */
   {  /* blocks until rank 2 sends with tag 3 */
      MPI_Recv(&datum4, 1, MPI_INT, 2, 3, MPI_COMM_WORLD, &status);
      MPI_Send(&datum1, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      fprintf(stderr, "Received: %d\n", datum4);
   }
   else if (processId == 1)      /* for rank 1 */
   {  /* blocks until rank 0 sends with tag 0 */
      MPI_Recv(&datum5, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
      MPI_Send(&datum2, 1, MPI_INT, 2, 1, MPI_COMM_WORLD);
      fprintf(stderr, "Received: %d\n", datum5);
   }
   else                          /* for rank 2 */
   {  /* blocks until rank 1 sends with tag 1 */
      MPI_Recv(&datum6, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
      MPI_Send(&datum3, 1, MPI_INT, 0, 3, MPI_COMM_WORLD);
      fprintf(stderr, "Received: %d\n", datum6);
   }

   MPI_Finalize();
   return 0;
}

This code doesn't do anything worthwhile other than illustrate deadlock. It is designed to be run with three processes. You'll notice that each process waits for another process to send it information before it sends its own information. Thus process 0 is waiting for process 2, which is waiting for process 1, which is waiting for process 0. If you run this program, nothing happens; it simply hangs.
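When you trace a program like this by hand, it can help to write down, for each rank, which rank it must hear from before it can do anything else. The following is a minimal sketch of that bookkeeping in plain C (no MPI calls). The `waits_for` table and the function `is_stuck` are hypothetical names introduced only for this illustration, and the model assumes each rank blocks on at most one receive before its first send and that a rank marked `-1` (one that sends first) eventually delivers what its waiter needs.

```c
/* waits_for[r] is the rank that rank r must receive from before it can
   send anything, or -1 if rank r sends first and so is free to proceed.
   (Hypothetical model for hand tracing; entries must lie in -1..nprocs-1.) */

/* Returns 1 if rank `start` can never proceed, 0 if its chain of
   dependencies ends at a rank that is free to send. */
int is_stuck(const int waits_for[], int nprocs, int start)
{
    int current = start;
    int steps;

    /* Follow the chain of "who am I waiting for?" at most nprocs times.
       With only nprocs ranks, a chain that long without reaching a free
       rank must have revisited some rank, so it contains a cycle. */
    for (steps = 0; steps < nprocs; steps++) {
        current = waits_for[current];
        if (current < 0)
            return 0;   /* reached a rank that sends first */
    }
    return 1;           /* never reached a free rank: deadlock */
}
```

For the program above, the receives give the table `{2, 0, 1}` (rank 0 receives from rank 2, rank 1 from rank 0, rank 2 from rank 1), and `is_stuck` returns 1 for every rank. If rank 0 were changed to send before it receives, the table would become `{-1, 0, 1}` and `is_stuck` would return 0 for every rank.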

While this example is fairly straightforward and something you could probably diagnose just by reading the source, other cases of deadlock can be quite subtle and extraordinarily difficult to find that way.

Deadlock is one of the most common problems you'll face with parallel code. Another common problem is mismatching parameters in function calls, particularly MPI functions. This is something that you can check carefully while rereading your code.
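One way to make that check systematic is to write each send and receive down as a small record and compare the pairs field by field. The sketch below is plain C rather than MPI, and the `Envelope` type and both function names are hypothetical, introduced only for illustration. It deliberately ignores communicators, datatypes, and the wildcards `MPI_ANY_SOURCE` and `MPI_ANY_TAG`, so treat it as a reading aid rather than a description of how MPI itself matches messages.

```c
/* One side of a point-to-point call, as you would read it off the source.
   For a send, `peer` is the destination rank; for a receive, `peer` is
   the source rank. (Hypothetical record type for hand checking.) */
typedef struct {
    int caller;   /* rank that makes the call */
    int peer;     /* destination (send) or source (receive) */
    int tag;      /* message tag argument */
    int count;    /* number of elements argument */
} Envelope;

/* Returns 1 if the receive names the sender, the send names the
   receiver, and the tags agree; 0 otherwise. */
int envelopes_match(Envelope send, Envelope recv)
{
    return send.peer == recv.caller
        && recv.peer == send.caller
        && send.tag == recv.tag;
}

/* Returns 1 if the receive buffer is large enough for the message.
   A receive may ask for more elements than are sent, but not fewer. */
int message_fits(Envelope send, Envelope recv)
{
    return send.count <= recv.count;
}
```

In the deadlock example, rank 2's send is `{2, 0, 3, 1}` and rank 0's receive is `{0, 2, 3, 1}`, so `envelopes_match` returns 1. Had the receive been written with tag 2 instead of 3, it would return 0, which is exactly the kind of slip that is easy to read past.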
