16.4 Rereading Code
There are three basic escalating
strategies for locating errors: rereading code, printing
information at key points, and using a symbolic debugger. There is an
interesting correspondence between these debugging strategies and
search strategies, i.e., linear search, binary search, and indexed
search. When reading code we are searching linearly for the error.
Printing works best when we take a binary approach. Through the
breakpoints a symbolic debugger provides, we are often able to move
directly to a questionable line of code.
Rereading (or reading for the first time, in some cases) means looking
at the code really hard in the hope that the error will jump out at you.
This is the best approach for new code since you are likely to find a
number of errors as well as other opportunities to improve the code.
It also works well when you have a pretty good idea of where the
problem is. If it is a familiar error, if you have just changed a
small segment of code, or if the error could only have come from one
small segment of code, rereading is a viable approach.
Rereading relies on your repeatedly asking the question,
"If I were a computer, what would I
do?" You can still play this game with a cluster; you just have to
pretend to be several computers at once and keep
everything straight. With a cluster, the order of operations is
crucial. If you take this approach, you'll need to
take extra care to ensure that you don't jump beyond
a point in one process that relies on another process without
ensuring the other process will do its part. An example may help
explain what I mean.
As previously noted, one problem you may encounter with a parallel
program is deadlock. For example, if two processes are waiting to
receive from each other before sending to each other, both will be
stalled. It is very easy when manually tracing a process to skim
right over the receive call, assuming the other process has sent the
necessary information. Making that type of assumption is what you
must guard against when pretending to be a cluster of computers. Here
is an example:
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int datum1 = 19, datum2 = 23, datum3 = 27;
    int datum4, datum5, datum6;
    int noProcesses, processId;
    MPI_Status status;

    /* MPI setup */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
    MPI_Comm_rank(MPI_COMM_WORLD, &processId);

    if (processId == 0)             /* for rank 0 */
    {
        MPI_Recv(&datum4, 1, MPI_INT, 2, 3, MPI_COMM_WORLD, &status);
        MPI_Send(&datum1, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        fprintf(stderr, "Received: %d\n", datum4);
    }
    else if (processId == 1)        /* for rank 1 */
    {
        MPI_Recv(&datum5, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&datum2, 1, MPI_INT, 2, 1, MPI_COMM_WORLD);
        fprintf(stderr, "Received: %d\n", datum5);
    }
    else                            /* for rank 2 */
    {
        MPI_Recv(&datum6, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
        MPI_Send(&datum3, 1, MPI_INT, 0, 3, MPI_COMM_WORLD);
        fprintf(stderr, "Received: %d\n", datum6);
    }

    MPI_Finalize();
    return 0;
}
This code doesn't do anything worthwhile other than
illustrate deadlock. It is designed to be run with three processes.
You'll notice that each process waits for another
process to send it information before it sends its own information.
Thus process 0 is waiting for process 1, which is waiting for process
2, which is waiting for process 0. If you run this program, nothing
happens: it simply hangs.
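One way to break a circular wait like this (a sketch, not part of the original example) is to make one process in the cycle send before it receives. If rank 0 sends first, rank 1's receive can complete, rank 1 can then send to rank 2, and rank 2's send finally satisfies rank 0's receive. Only rank 0's branch needs to change:

    if (processId == 0)   /* for rank 0: send first to break the cycle */
    {
        MPI_Send(&datum1, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&datum4, 1, MPI_INT, 2, 3, MPI_COMM_WORLD, &status);
        fprintf(stderr, "Received: %d\n", datum4);
    }

The branches for ranks 1 and 2 are unchanged. Tracing this version by hand, pretending to be all three processes at once, shows that every receive now has a matching send that can actually execute.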
While this example is fairly straightforward and something that you
probably could diagnose simply by reading the source, other examples
of deadlock can be quite subtle and extraordinarily difficult to
diagnose simply by looking at the source code.
Deadlock is one of the most common problems you'll
face with parallel code. Another common problem is mismatched
parameters in function calls, particularly MPI functions. This is
something that you can check carefully while rereading your
code.
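For instance (a hypothetical fragment, not taken from the example above), a receive whose tag doesn't match the corresponding send will never complete, because MPI only delivers messages whose source and tag match what the receiver asked for:

    /* Hypothetical mismatch: the sender uses tag 1, but the receiver
       waits on tag 2. The receive blocks forever because no message
       with tag 2 ever arrives. */
    if (processId == 0)
        MPI_Send(&datum1, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);          /* tag 1 */
    else if (processId == 1)
        MPI_Recv(&datum4, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, &status); /* tag 2 */

The same kind of silent trouble comes from mismatched counts or datatypes, so when rereading, check each send against its matching receive parameter by parameter.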