
17.5 Profilers

Thus far we have been looking at timing code manually. While this provides a lot of control, it is labor intensive. The alternative to manual timing is to use a profiler. A profiler attempts to capture the profile of a program in execution, that is, a set of timings that maps out where time is spent within the program. With manual timing you have to decide in advance which part of a program to focus on; a profiler typically provides information for the entire application in one fell swoop. While a profiler may give more results than you actually need, you are less likely to overlook a hotspot in your code, particularly when working with very complicated programs. Most profilers are easy to use and may give you some control over how much information is collected. And most profilers not only collect information but also provide a mechanism for analyzing the results. Graphical output is common, a big help with large, complicated programs.

There are a number of profilers available, particularly if you include commercial products. They differ in several ways, but choosing among them usually comes down to how fine a granularity you want and how much detail you need. Choices include information on a line-by-line basis, information based on basic blocks, or information based on function calls or modules. Profilers may provide timing information or simply count the number of times each statement is executed.

There are two basic categories of profilers: active and passive. (Some profilers, such as gprof, have features that span both categories.) A passive profiler gathers information without modifying the code. For example, a passive profiler might collect information by repeatedly sampling the program counter while the program is running. By installing an interrupt service routine that wakes up periodically and examines the program counter, the profiler can construct a statistical profile of the program.
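
To make the sampling idea concrete, here is a minimal sketch of a do-it-yourself passive profiler, assuming x86-64 Linux and glibc (the register name REG_RIP and the signal-handler details are platform specific, and the bucket scheme is invented for illustration). It arms an interval timer that raises SIGPROF periodically and uses the handler to build a crude histogram of program-counter samples; a real profiler such as gprof maps such samples back to function addresses.

#define _GNU_SOURCE                    /* exposes REG_RIP in <ucontext.h> */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define NBUCKETS 1024
static unsigned long hits[NBUCKETS];   /* histogram of sampled PCs */

/* Runs each time the profiling timer fires: record (a hash of) the
   program counter at the moment of interruption. */
static void sampler(int sig, siginfo_t *si, void *ctx)
{
    unsigned long pc = ((ucontext_t *)ctx)->uc_mcontext.gregs[REG_RIP];
    (void)sig; (void)si;               /* unused */
    hits[(pc >> 4) % NBUCKETS]++;
}

int main(void)
{
    struct sigaction sa;
    struct itimerval it = { { 0, 10000 }, { 0, 10000 } };  /* 10 ms period */
    volatile double x = 0.0;
    long i;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sampler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);     /* ITIMER_PROF delivers SIGPROF */
    setitimer(ITIMER_PROF, &it, NULL);

    for (i = 0; i < 100000000L; i++)   /* the "application" being profiled */
        x += i * 0.5;

    for (i = 0; i < NBUCKETS; i++)     /* dump the nonempty buckets */
        if (hits[i] > 0)
            printf("bucket %ld: %lu samples\n", i, hits[i]);
    return 0;
}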

While passive profilers are less intrusive, they have a couple of problems. First, they tend to provide a flat view of the functions within an application. For example, if you have a function that is called by several different functions, a passive profiler will give you an idea of how often the function is called, but no information about which functions are making the calls. Second, passive profilers are inherently statistical. The key parameters are how often you sample and how many samples are taken; the quality of your results depends on getting them right.

Active profilers add code to the program to collect profile information. Code can be added either directly to the source code or to the object code generated by the compiler. While active profilers avoid some of the problems passive profilers face, they introduce their own set of problems. Perhaps the most significant of these is the confounding effect that executing the added instructions has on the timing measurements.
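
To see what "adding code" means concretely, here is a minimal sketch of active instrumentation using gcc's -finstrument-functions option, which works much like the hidden counting call that -pg inserts into every function. The hook names below are fixed by gcc; what the hooks record (here, a running count and a trace to stderr) is up to you. The sketch also illustrates the overhead problem: the hooks execute on every single call.

#include <stdio.h>

/* Hooks called automatically at every function entry and exit when the
   code under test is compiled with gcc's -finstrument-functions option.
   The attribute keeps the hooks themselves from being instrumented. */

static unsigned long entries;          /* total function entries observed */

void __attribute__((no_instrument_function))
__cyg_profile_func_enter(void *fn, void *call_site)
{
    entries++;
    fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
}

void __attribute__((no_instrument_function))
__cyg_profile_func_exit(void *fn, void *call_site)
{
    (void)call_site;                   /* unused here */
    fprintf(stderr, "exit  %p\n", fn);
}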

In this section, we'll look at two profilers, gprof and gcov. These tools modify the application so that it counts how often functions are called and how often individual lines are executed, respectively. (Additionally, gprof uses passive sampling to estimate execution times.) Both tools use compiler options to add the profiling code to the object code generated by the compiler. While both were designed for serial programming and were not intended for use with parallel programs, they can, with a little added effort, be used with parallel programs. We'll examine their use with serial code first and then look at how they can be used with parallel code.

17.5.1 gprof

gprof generates profiles and call graphs (a report of which routines call which). There is both a Berkeley Unix gprof and a GNU gprof that provide essentially the same information. gprof is available on most systems today. (If your system doesn't have gprof, you might look to see if prof is installed. While it's not as versatile as gprof, you'll still be able to generate the flat profiles described in this section.)

Using gprof is remarkably straightforward. You'll need to compile and link the program you want to profile using the -pg option. (Compiler optimizations shouldn't cause much trouble with gprof.) Next, run the program. This creates a file, gmon.out, containing the profile information. If the file already exists from a previous run, it is overwritten. Finally, run the gprof program to convert the binary data in gmon.out into a readable format.

Here is an example:

[sloanjd@amy PROFILE]$ gcc demo.c -pg -o demo
[sloanjd@amy PROFILE]$ ./demo
[sloanjd@amy PROFILE]$ gprof -b demo > demo.gprof

This example uses the -b option to suppress the annotations that gprof includes by default. You may want to leave this option off the first time you run gprof, but once you are familiar with the output, you'll want to use it. gprof has other options that you might want to investigate if you find yourself using it a lot, such as options to exclude particular functions from the reports. In this example, I've also redirected the output to a file to make it easier to examine.
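
For example, once you know you want only the flat profile, you can add -p, an option we'll use again when profiling MPI programs below, to suppress the call graph (the output filename here is arbitrary):

[sloanjd@amy PROFILE]$ gprof -bp demo > demo.flat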

Here is the code that I profiled.

main()
{  int i;
   for (i=0; i<200; i++)
   {  foo();
      bar();
      baz(); }
}

foo()
{  int j;
   for (j=0; j<250; j++) bar();
}

bar()
{  int k;
   for (k=0; k<50; k++);
}

baz()
{  int m;
   for (m=0; m<100; m++) bang();
}

bang()
{  int n;
   for (n=0; n<200; n++) bar();
}

As you can see, it doesn't do anything worthwhile, but it provides a reasonable demonstration of gprof's capabilities.

The output from gprof consists of three parts: a flat profile, a call graph, and a function index. Here is the first part of the output, the flat profile:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 96.97      6.09     6.09  4050200     0.00     0.00  bar
  1.27      6.17     0.08    20000     0.00     0.30  bang
  1.27      6.25     0.08      200     0.40    30.87  baz
  0.48      6.28     0.03      200     0.15     0.53  foo

This is not difficult to interpret. The routines are profiled one per line, with their names given in the last column, name. The % time column gives the percentage of execution time spent in each routine. self seconds gives the time spent in an individual routine, while cumulative seconds provides the total for the named routine along with the ones above it in the table. calls reports the number of times the function was called, while self ms/call gives the average time spent in the routine per call. total ms/call gives the average time per call spent in the function and its descendents. The seconds columns are given in seconds; the per-call columns are given in milliseconds.
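
Notice that bar shows 0.00 under self ms/call even though it dominates the profile. That is just rounding: 6.09 seconds spread over 4,050,200 calls works out to roughly 0.0015 ms per call, which displays as 0.00 at two decimal places.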

It should be clear that optimizing or eliminating calls to bar would provide the greatest improvement. What's not clear is the interrelationship among the calls. Since bar is called by several functions, reducing the number of calls those functions make will help as well. To see this, you need more information, which is provided by a call graph.

Here is the call graph for this code:

                        Call graph

granularity: each sample hit covers 4 byte(s) for 0.16% of 6.28 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    0.00    6.28                 main [1]
                0.08    6.09     200/200         baz [2]
                0.03    0.08     200/200         foo [5]
                0.00    0.00     200/4050200     bar [4]
-----------------------------------------------
                0.08    6.09     200/200         main [1]
[2]     98.3    0.08    6.09     200         baz [2]
                0.08    6.01   20000/20000       bang [3]
-----------------------------------------------
                0.08    6.01   20000/20000       baz [2]
[3]     97.0    0.08    6.01   20000         bang [3]
                6.01    0.00 4000000/4050200     bar [4]
-----------------------------------------------
                0.00    0.00     200/4050200     main [1]
                0.08    0.00   50000/4050200     foo [5]
                6.01    0.00 4000000/4050200     bang [3]
[4]     97.0    6.09    0.00 4050200         bar [4]
-----------------------------------------------
                0.03    0.08     200/200         main [1]
[5]      1.7    0.03    0.08     200         foo [5]
                0.08    0.00   50000/4050200     bar [4]
-----------------------------------------------

The first thing you'll notice is that the numbers in the % time column don't add up to 100 percent. The reason is that each entry includes the total time spent in the function and its children. The next two columns give the actual time spent in the function (self) and in the functions it calls (children). The called column gives the number of times the function was called. When two numbers separated by a slash appear in this column, the first is the number of calls involving that particular parent or child, and the second is the total number of calls to the function. For example, consider the next-to-last line in section [4].

                6.01    0.00 4000000/4050200     bang [3]

Since section [4] describes calls to bar, this line tells us that bang makes 4,000,000 of the 4,050,200 total calls made to bar.
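
These counts are easy to verify from the source. main calls bar directly 200 times. foo is called 200 times and calls bar 250 times on each call, for 50,000 calls. baz is called 200 times and calls bang 100 times on each call, so bang runs 20,000 times; each run calls bar 200 times, for 4,000,000 calls. The total, 200 + 50,000 + 4,000,000 = 4,050,200, matches the profile.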

Each function in the table is assigned a unique index number, which is printed next to the function's name to make it easier to look up individual functions. A function index is also included at the end of the report.

Index by function name

   [3] bang                    [2] baz
   [4] bar                     [5] foo

This is relatively useless for a small program but can be helpful with larger programs.

As noted above, the number of function calls is determined actively, so these counts should be exact. The percentage of time in each function is determined passively, with a sampling rate that, for short-running programs, may not capture enough data to be statistically reliable. You will also have noticed that the basic unit of granularity for gprof is the routine. Depending on the code, this may not provide enough information. For example, if you are looking at one large routine, it may not be immediately obvious whether the poor performance you are seeing is the result of the computational complexity of your code or of poor memory utilization. To improve resolution and gain more information, you could break the program into more routines. Or you could use gcov.
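
For example, here is an illustrative sketch of that kind of restructuring (the routine and its phases are invented for the example, not taken from the demo program). By giving each phase of a large routine its own function, each phase gets its own line in gprof's flat profile.

/* Illustrative only: splitting one monolithic routine into phases so
   gprof can attribute time to each phase separately. */
#define N 1000000
static double a[N], b[N];
static int idx[N];

void gather_phase(void)                /* memory-bound: indirect loads */
{
    int i;
    for (i = 0; i < N; i++)
        a[i] = b[idx[i]];
}

void compute_phase(void)               /* CPU-bound: arithmetic */
{
    int i;
    for (i = 0; i < N; i++)
        a[i] = a[i] * a[i] + 1.0;
}

void big_routine(void)                 /* formerly one large loop */
{
    gather_phase();
    compute_phase();
}

With this split, the flat profile shows whether gather_phase (memory traffic) or compute_phase (arithmetic) dominates, a distinction the original single routine would have hidden.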

17.5.2 gcov

gcov is a test-coverage program that is often used as a profiler. When a test suite is designed for a program, one of the objectives is to exercise all parts of the code. A test-coverage program records the number of times each line of code is executed; the idea is that you can use this coverage data to ensure that your test suite is adequate. Of course, this is also exactly the sort of data you might want when profiling code.

gcov is part of the gcc development packages, so if you have gcc on your system, you should have gcov. It does not work with other compilers, but similar programs may be available for them (e.g., tcov with Solaris).

To use gcov, you need to compile the code with two options, -fprofile-arcs and -ftest-coverage. The first tells the compiler to add the code needed to count how often each branch (arc) in the program is executed; the second tells it to write out the extra files describing the program's structure that gcov needs.

[sloanjd@amy PROFILE]$ gcc -fprofile-arcs -ftest-coverage demo.c -o demo

You should avoid optimization when using gcov, since optimizations that rewrite code will make the results difficult to interpret. When the code is compiled, two new files are created with the same base name as your program but with the extensions .bb and .bbg. The first contains a list of source files with the line numbers corresponding to basic blocks; the second is used to reconstruct the program's flow graph.

Once you have compiled the code, you'll need to run it.

[sloanjd@amy PROFILE]$ ./demo

This will create yet another file with the run data. It will have the same name as your program but with a .da extension. This is created in the directory where the original program was compiled.

Finally, you can run gcov to generate your report.

[sloanjd@amy PROFILE]$ gcov demo
100.00% of 13 source lines executed in file demo.c
Creating demo.c.gcov.

You'll notice that the command reports what percentage of the source is actually executed, something you would want to know if you are using it as a coverage tool rather than a profiler. The actual report is created in a file with the extension .gcov.

Here is an example for the demonstration program that we looked at earlier.

                main()
           2    {  int i;
         201       for (i=0; i<200; i++)
         200       {  foo();
         200          bar();
         200          baz(); }
                }

                foo()
         400    {  int j;
         200       for (j=0; j<250; j++) bar();
                }

                bar()
     8100400    {  int k;
     4050200       for (k=0; k<50; k++);
                }

                baz()
         400    {  int m;
         200       for (m=0; m<100; m++) bang();
                }

                bang()
       40000    {  int n;
       20000       for (n=0; n<200; n++) bar();
                }

As you can see, each line is annotated with a count giving the number of times it was executed. Take care: if you execute the program multiple times, these counts accumulate across runs. If that isn't what you want, delete or rename the data file between runs.

17.5.3 Profiling Parallel Programs with gprof and gcov

Depending on the version of MPI you are using, you may be able to use gprof and gcov with your MPI programs. However, you'll need to make some adjustments. If we naively try to profile an MPI program with gprof or gcov, we run into problems. Here is an example using rect from Chapter 14:

[sloanjd@amy PROFILE]$ mpicc -pg rect.c -o rect
[sloanjd@amy PROFILE]$ mpirun -np 4 rect
Enter number of chunk to divide problem into:
7
Enter number of steps per chunk:
20
Enter low end of interval:
2.0
Enter high end of interval:
5.0
Area for process 1, is: 2.107855
Area for process 2, is: 2.999984
Area for process 3, is: 4.049546
Area for process 1, is: 5.256543
Area for process 2, is: 6.620975
Area for process 3, is: 8.142841
Area for process 1, is: 9.822141
The area from 2.000000 to 5.000000 is: 38.999885

Everything appears to work. All the expected files are created. Here is the flat profile.

[sloanjd@amy PROFILE]$ gprof -bp rect
Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00       40     0.00     0.00  f
  0.00      0.00     0.00        2     0.00     0.00  chunkArea
  0.00      0.00     0.00        1     0.00     0.00  _GLOBAL__I_main

We've obviously profiled something. The question is what. Because we are using NFS, all of the processes write their logfiles to the same directory. Since the files overwrite each other, what we have is the data from the last process to finish. From the data, we can see that this is either process 2 or process 3.

If we are going to avoid this problem, we need to use the local filesystem on each node. For example, we could copy the compiled code over to /tmp on each machine and run it from there.[2]

[2] This example uses scp to copy the file. If you have installed the C3 tools, discussed in Chapter 9, then you can use cpush instead.

[sloanjd@amy PROFILE]$ cp rect /tmp/
[sloanjd@amy PROFILE]$ scp rect oscarnode1:/tmp/
rect                100% |*****************************|   372 KB    00:00
[sloanjd@amy PROFILE]$ scp rect oscarnode2:/tmp/
rect                100% |*****************************|   372 KB    00:00
[sloanjd@amy PROFILE]$ scp rect oscarnode3:/tmp/
rect                100% |*****************************|   372 KB    00:00

[sloanjd@amy PROFILE]$ cd /tmp
[sloanjd@amy tmp]$ mpirun -np 4 rect
Enter number of chunk to divide problem into:
7
Enter number of steps per chunk:
20
Enter low end of interval:
2.0
Enter high end of interval:
5.0
Area for process 1, is: 2.107855
Area for process 2, is: 2.999984
Area for process 3, is: 4.049546
Area for process 1, is: 5.256543
Area for process 2, is: 6.620975
Area for process 3, is: 8.142841
Area for process 1, is: 9.822141
The area from 2.000000 to 5.000000 is: 38.999885

Here is the flat profile for oscarnode1:

[sloanjd@oscarnode1 tmp]$ gprof -bp rect
Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00       60     0.00     0.00  f
  0.00      0.00     0.00        3     0.00     0.00  chunkArea
  0.00      0.00     0.00        1     0.00     0.00  _GLOBAL__I_main

And here is the flat profile for oscarnode2:

[sloanjd@oscarnode2 tmp]$ gprof -bp rect
Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00       40     0.00     0.00  f
  0.00      0.00     0.00        2     0.00     0.00  chunkArea
  0.00      0.00     0.00        1     0.00     0.00  _GLOBAL__I_main

If you compare these to the program's output, you'll see that the calls to chunkArea now make sense: it is called three times by process 1 on oscarnode1 and twice by process 2 on oscarnode2. (We could also look at the head node, amy, but because of the master/slave design of the program, there isn't much to see.)

Unfortunately, using gcov with a parallel program is even more complicated. You'll recall that the compile options used with gcov create two additional files and that a datafile is created when the program is run. To run gcov, you'll need all of these files as well as the executable and the original source code file. To further complicate matters, the datafile is created in the directory where the source was compiled.

To use gcov on a cluster, first copy the source code file to /tmp or a similar directory. Next, compile the program with the appropriate switches. Once you've done this, copy the compiled code, the original source code, the .bb file, and the .bbg file to each node in the cluster. Now you can run the program. A datafile will be created in /tmp on each node. Once you have done this, you can log onto each node and run gcov. For example,

[sloanjd@oscarnode1 tmp]$ gcov rect
 66.67% of 57 source lines executed in file rect.c
Creating rect.c.gcov.

Don't forget that the gcov datafile accumulates information across runs. If you want fresh data, you'll need to delete it from each node. This is a lot of copying, so you'll want to automate it as much as possible. You'll also need to clean up after you are done. In this example, I copied rect, rect.c, rect.bb, and rect.bbg to each node. Fortunately, for this demonstration I only needed to copy them to a few nodes.
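
Here is a minimal sketch of that automation, assuming the node names used above and that the files have been staged in the compile directory; adjust the node list and filenames for your cluster.

for node in oscarnode1 oscarnode2 oscarnode3
do
    scp rect rect.c rect.bb rect.bbg ${node}:/tmp/
    ssh ${node} rm -f /tmp/rect.da      # start each run with fresh data
done

As noted earlier, if you have the C3 tools installed, cpush can replace the scp loop.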

A couple of warnings are in order. All of this is based on the assumption that each MPI process will be able to access and write to the local filesystem. With MPI, there is no guarantee this is the case. The approach outlined here seems to work with LAM/MPI and MPICH, but if you are using some other version of MPI, all bets are off.
