
15.2 Problem Decomposition

When decomposing a program, we will talk in terms of tasks. The meaning of this word may vary slightly depending upon context. Typically, a task is a portion of a program that can be executed as a unit. It may be used to mean the part of a program that can become an independent process, or it may be used to mean a piece of the work that such a process will execute. It should be clear from context which meaning is intended.

Let's begin by looking at some of the issues involved in decomposing a problem into parallelizable parts. The first issue we must face is task granularity. Depending on the problem, a task may be broken into very small pieces (fine granularity), into relatively large pieces (coarse granularity), or into a mixture of pieces of varying sizes.

Granularity, in one sense, establishes a limit on how many compute nodes or processors you may be able to use effectively. For example, if you are multiplying two 10 by 10 matrices, the product has 100 entries, each the dot product of one row and one column. If you treat the calculation of a single entry as the smallest unit of work, you won't be able to divide this problem into more than 100 pieces. Consequently, having more than 100 processors won't allow you to do the multiplication any faster. In practice, the number of processors you can effectively use will be lower. It is essential to realize that there are a number of trade-offs that must be balanced when dividing a problem. In particular, coarse granularity tends to limit communication overhead but may result in increased idle time and poor processor utilization. We will discuss each of these concerns in detail in this chapter.
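As a concrete illustration, here is a minimal sketch (mine, not code from this book) of a fine-grained decomposition of this problem using MPI. Each of the 100 entry calculations is treated as an indivisible task, and the tasks are dealt out round-robin among however many processes are available; the matrix contents are stand-in values, and collecting the results is omitted.

#include <mpi.h>

#define N 10

int main(int argc, char *argv[])
{
    double a[N][N], b[N][N], c[N][N] = {{0.0}};
    int rank, size, task, i, j, k;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Every rank fills a and b; stand-in values only. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
        }

    /* Tasks 0 through 99: task t computes entry (t / N, t % N) of c.
       The tasks are dealt out round-robin among the processes, so no
       more than N * N = 100 processes can ever be kept busy. */
    for (task = rank; task < N * N; task += size) {
        i = task / N;
        j = task % N;
        for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

    /* Collecting the pieces of c back together is omitted here. */

    MPI_Finalize();
    return 0;
}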

We can also speak of the degree of concurrency, i.e., the number of tasks that can execute at the same time. Realize that this will vary during program execution depending on the point you are at in the program. Thus, it is often more meaningful to talk about the maximum or the average degree of concurrency of a program. Generally, both the maximum and the average degree of concurrency are larger with fine-grained than with coarse-grained decompositions.

A data (or task) dependency graph (or diagram) is one way of visually representing a program. This can be helpful when investigating and describing potential concurrency. The idea is to break the algorithm into pieces of code or tasks based on the data required by that task. A graph is then drawn for the algorithm that shows the set of tasks as nodes connected by arrows indicating the flow of data between connected pairs of tasks.

Figure 15-1 is a data dependency graph for the numerical integration program developed in Chapter 13. The amount of detail will vary in these graphs depending on your purpose. In this case, I've kept things very simple. If you desire, you can increase the detail to the point of having a single node for each instruction and arrows for each variable.[1]

[1] Some authors distinguish between data and task dependency graphs and between dependencies and interactions. Feel free to adjust your graphs as you see fit.

The idea is that graphs such as these help you think about and locate potential concurrencies in your code. If you have two blocks of code or tasks that don't depend on each other and have everything they need to execute, these are potentially parallelizable tasks. Data flow graphs can be used for both data and task partitioning. Data flow graphs should also provide you with some idea of the critical paths through code, i.e., those paths that will likely take the longest amount of time to complete. You won't be able to shorten the runtime of a program to less time than it takes to complete the critical path. In other words, if you want to shorten the runtime of a program, you must shorten the critical path.

Figure 15-1. Data flow for numerical integration


There are some limitations to this approach. You'll need to give loops some thought when drawing these graphs, since the body of a loop is a potentially parallelizable piece of code. The essential step in parallelizing the numerical integration problem in Chapter 13 was packaging pieces of the loop, i.e., the calculations of the areas of the individual rectangles, as individual tasks. You should also realize that the graph provides no information about the relative execution time of each task. Finally, and perhaps most important, the graph doesn't clearly indicate how idle time might show up. Depending on how we code the task Consolidate Results in Figure 15-1, most of the Calculate Chunk blocks may be idle while waiting for an opportunity to report their results. (Moreover, depending on how they are coded, the individual Calculate Chunk tasks may not all take the same amount of time.)

15.2.1 Decomposition Strategies

There are several different decomposition strategies worth considering. Roughly speaking, they fall into two categories: data decomposition, sometimes called data partitioning, and control decomposition, also called task partitioning. With data decomposition, the data is broken into individual chunks and distributed among processes that are essentially similar. With control decomposition, the problem is divided in such a way that each process is doing a different calculation. In practice, many algorithms show characteristics of both strategies.

15.2.1.1 Data decomposition

Data decomposition is generally much easier to program than control decomposition and is usually the best approach when trying to adapt serial algorithms for parallel use. Data decomposition also tends to scale very well, a crucial consideration when dealing with problems that may grow.

The numerical integration program from the last chapter used data decomposition. Each process had a different set of bounds, so the area that each calculated was different, but the procedure was the same.
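To make the idea concrete, here is a minimal sketch along those lines. It is not the Chapter 13 code itself; the integrand, the interval, and the number of rectangles are just placeholders. Every process runs the same loop, but each derives its own bounds from its rank, which is the essence of data decomposition here.

#include <stdio.h>
#include <mpi.h>

double f(double x) { return x * x; }          /* hypothetical integrand */

int main(int argc, char *argv[])
{
    int rank, size, i, steps = 1000;          /* rectangles per process */
    double a = 0.0, b = 2.0;                  /* overall interval */
    double lo, hi, width, x, local = 0.0, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process gets an equal slice of [a, b] based on its rank. */
    lo = a + rank * (b - a) / size;
    hi = a + (rank + 1) * (b - a) / size;
    width = (hi - lo) / steps;

    for (i = 0; i < steps; i++) {
        x = lo + (i + 0.5) * width;           /* midpoint rule */
        local += f(x) * width;
    }

    /* Combine the partial areas on process 0. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("approximate area: %f\n", total);

    MPI_Finalize();
    return 0;
}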

One of the most common approaches to data decomposition is a divide-and-conquer strategy. This works particularly well with recursive algorithms. If a problem can be treated as a set of independent subproblems, it is an ideal candidate for data decomposition. Consider the problem of finding the largest value in a large collection of data. The data could be divided into different sets, the largest in each set could be found, and finally, this collection of largest values could be examined. Finding the largest value in each of the smaller sets could be handled by a different processor. Finding the final answer is an ideal use of MPI_Reduce. This is a pretty trivial example of how divide and conquer works.
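Here is a minimal sketch of that idea (my own example, not code from this book). The chunk size is arbitrary, and each process simply fabricates its share of the data rather than receiving it from a file or another process; what matters is the pattern of a local search followed by MPI_Reduce with MPI_MAX.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define CHUNK 10000   /* hypothetical size of each process's share */

int main(int argc, char *argv[])
{
    int rank, i;
    double data[CHUNK], local_max, global_max;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Stand-in for distributing real data: each rank fabricates its chunk. */
    srand(rank + 1);
    for (i = 0; i < CHUNK; i++)
        data[i] = rand() / (double) RAND_MAX;

    /* Find the largest value in this process's set. */
    local_max = data[0];
    for (i = 1; i < CHUNK; i++)
        if (data[i] > local_max)
            local_max = data[i];

    /* Combine the per-process maxima into a single result on process 0. */
    MPI_Reduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("largest value: %f\n", global_max);

    MPI_Finalize();
    return 0;
}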

For a more involved example, consider the merge sort algorithm.[2] The serial algorithm takes a set of data, divides it into smaller sets of data, sorts these smaller individual data sets, and then merges the sorted sets back together. To sort a smaller set of data, merge sort uses the same strategy recursively. Eventually, the smaller sets of data are reduced to sets of single items, which are obviously sorted. Merging sorted data is straightforward since you only have to compare the first items in the two groups, take the smaller, and repeat until you've worked your way through both sets.

[2] The sorting algorithm described here is just one possible approach, not necessarily the best. Sorting in a parallel environment is particularly difficult and is an area of active, ongoing research.

In a parallel environment, you'll want to divide the data equally among the available processors, but you probably won't want to continue dividing up the data beyond that point because of the communications overhead. Once you have the data distributed, you'll need to sort it locally on each individual processor. You could use the serial version of merge sort or some other serial sorting algorithm.

Merging the data back together will be more problematic. Not only will you need code to merge two data sets, but you'll need to develop a communications strategy to do this efficiently. If you use a single process to collect and merge data, you will have a large amount of idle time. A more appropriate strategy is to have pairs of processes merge their data, i.e., one sends its data and dies while the other receives the sent data and merges that data with its own. Repeat this strategy with the remaining processes until only a single process remains. It will have all the data sorted.

For example, if you have eight processes, processes 0 and 1, processes 2 and 3, processes 4 and 5, and processes 6 and 7 could all merge their data at the same time. Next, processes 0 and 2 and processes 4 and 6 could merge their data simultaneously. Finally, processes 0 and 4 could merge their data. This strategy, shown in Figure 15-2, has three sets of parallel merges or stages. This is much more efficient than having process 0 merge its data repeatedly with each of the other seven processes sequentially, a seven-stage procedure.

Figure 15-2. Merging data


With this strategy, for instance, 1,024 processes could merge their data in 10 stages. It would take 1,023 stages with a single receiving process, roughly 100 times as long.
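The sketch below shows just the communication pattern from Figure 15-2. It assumes the number of processes is a power of two and that every process starts with the same tiny, fabricated amount of locally sorted data; it is meant to illustrate the pairwise strategy, not to be a production sort. At each stage, half the remaining processes send their sorted runs to a partner and drop out, while the survivors merge and continue.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Merge two sorted runs a[0..na-1] and b[0..nb-1] into out. */
static void merge(const double *a, int na, const double *b, int nb, double *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

static int cmp(const void *x, const void *y)
{
    double d = *(const double *) x - *(const double *) y;
    return (d > 0) - (d < 0);
}

#define CHUNK 4   /* items per process; kept tiny for illustration */

int main(int argc, char *argv[])
{
    int rank, size, n = CHUNK, step, i;
    double *mine, *theirs, *merged;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* assumed to be a power of 2 */

    /* Fabricate and locally sort this process's data. */
    mine = malloc(CHUNK * sizeof(double));
    srand(rank + 1);
    for (i = 0; i < n; i++)
        mine[i] = rand() / (double) RAND_MAX;
    qsort(mine, n, sizeof(double), cmp);

    for (step = 1; step < size; step *= 2) {
        if (rank % (2 * step) != 0) {        /* sender: ship data and stop */
            MPI_Send(mine, n, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
            break;
        }
        /* Receiver: take the partner's run and merge it with our own. */
        theirs = malloc(n * sizeof(double));
        merged = malloc(2 * n * sizeof(double));
        MPI_Recv(theirs, n, MPI_DOUBLE, rank + step, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        merge(mine, n, theirs, n, merged);
        free(theirs);
        free(mine);
        mine = merged;
        n *= 2;
    }

    if (rank == 0)
        printf("process 0 holds %d sorted items\n", n);

    free(mine);
    MPI_Finalize();
    return 0;
}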

15.2.1.2 Control decomposition

With control decomposition, each processor has different tasks. One common model for control decomposition is pipelining or stream parallelism. With pipelining, each task, except the first and last, plays the role of both producer and consumer. A task receives or consumes data, processes that data, and then sends the results on to the next consumer. For example, consider a set of processes designed to manipulate a video stream. The first process might crop a frame, the second might adjust brightness within the frame, the third might adjust color levels, etc. Each process does something different and will require radically different code.
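A minimal sketch of this producer-consumer pattern follows; the "frames" are just arrays of doubles and the per-stage transformation is a placeholder, since real video code would be far more involved. Each rank but the first receives a frame from its upstream neighbor, does its own step, and each rank but the last passes the result downstream.

#include <mpi.h>

#define FRAME_SIZE 1024     /* hypothetical frame size */
#define NFRAMES    100      /* hypothetical stream length */

int main(int argc, char *argv[])
{
    int rank, size, f, i;
    double frame[FRAME_SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (f = 0; f < NFRAMES; f++) {
        if (rank > 0)                         /* consume from upstream */
            MPI_Recv(frame, FRAME_SIZE, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        else
            for (i = 0; i < FRAME_SIZE; i++)  /* first stage: generate frame */
                frame[i] = f + i;

        for (i = 0; i < FRAME_SIZE; i++)      /* this stage's processing step */
            frame[i] *= 1.01;                 /* stand-in for crop, brightness, ... */

        if (rank < size - 1)                  /* produce for downstream */
            MPI_Send(frame, FRAME_SIZE, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD);
        /* The last stage would write the finished frame out here. */
    }

    MPI_Finalize();
    return 0;
}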

Note that the second process must wait for the first process to finish before it can begin since the second process consumes and processes the data produced by the first. Similarly, the third process can't begin until the second sends its data, and so on. Getting enough data into the system so that all processes are active is referred to as priming the pipeline. Figure 15-3 shows how processes overlap.

Figure 15-3. Ideal process overlap


You must have a lot more data than processes for this approach to be efficient. Otherwise, the idle time at both the beginning and the end will render the approach pointless. Granularity is a key consideration here. If the granularity is coarse, priming the pipeline is particularly costly.

A second issue with pipelining is balance. Each of the processes must run for about the same amount of time. If one process takes much longer than the other processes, they will be idle much of the time and overall efficiency will be lost. (This is a likely problem with video processing, for example, as described.) Figure 15-4 shows the effect of having one process take longer. Note the idle time.

Figure 15-4. Process overlap with idle time


However, even though task 2 takes twice as long as the other tasks in this four-task example, there is still a speedup using the pipeline.

A number of algorithms fall between these two extremes. That is, they appear to have elements of both strategies. For example, a common approach in artificial intelligence is to describe algorithms in terms of a search space. A fundamental part of a chess-playing program is to examine a number of different moves to see which appears to be the best. Since it evaluates each move in the same manner, it is reasonable to approach this as a data decomposition problem. Each process will be given a different board configuration, i.e., a different set of data. But once the data has been distributed, the different processes go their different ways. One process may terminate quickly, having determined that the board position is abysmal (a technique known as pruning), while another may be following a hot move recursively through several levels.
