# The Algorithm Structure Design Space in Parallel Programming

• Print
This chapter is from the book

## 4.7 THE RECURSIVE DATA PATTERN

#### Problem

Suppose the problem involves an operation on a recursive data structure (such as a list, tree, or graph) that appears to require sequential processing. How can operations on these data structures be performed in parallel?

#### Context

Some problems with recursive data structures naturally use the divide-and-conquer strategy described in the Divide and Conquer pattern with its inherent potential for concurrency. Other operations on these data structures, however, seem to have little if any potential for concurrency because it appears that the only way to solve the problem is to sequentially move through the data structure, computing a result at one element before moving on to the next. Sometimes, however, it is possible to reshape the operations in a way that a program can operate concurrently on all elements of the data structure.

An example from [J92] illustrates the situation: Suppose we have a forest of rooted directed trees (defined by specifying, for each node, its immediate ancestor, with a root node's ancestor being itself) and want to compute, for each node in the forest, the root of the tree containing that node. To do this in a sequential program, we would probably trace depth-first through each tree from its root to its leaf nodes; as we visit each node, we have the needed information about the corresponding root. Total running time of such a program for a forest of N nodes would be O(N). There is some potential for concurrency (operating on subtrees concurrently), but there is no obvious way to operate on all elements concurrently, because it appears that we cannot find the root for a particular node without knowing its parent's root.

However, a rethinking of the problem exposes additional concurrency: We first define for each node a "successor", which initially will be its parent and ultimately will be the root of the tree to which the node belongs. We then calculate for each node its "successor's successor". For nodes one "hop" from the root, this calculation does not change the value of its successor (because a root's parent is itself). For nodes at least two "hops" away from a root, this calculation makes the node's successor its parent's parent. We repeat this calculation until it converges (that is, the values produced by one step are the same as those produced by the preceding step), at which point every node's successor is the desired value. Fig. 4.21 shows an example requiring three steps to converge. At each step we can operate on all N nodes in the tree concurrently, and the algorithm converges in at most log N steps.

Figure 4.21 Finding roots in a forest. Solid lines represent the original parent-child relationships among nodes; dashed lines point from nodes to their successors.

What we have done is transform the original sequential calculation (find roots for nodes one "hop" from a root, then find roots for nodes two "hops" from a root, etc.) into a calculation that computes a partial result (successor) for each node and then repeatedly combines these partial results, first with neighboring results, then with results from nodes two hops away, then with results from nodes four hops away, and so on. This strategy can be applied to other problems that at first appear unavoidably sequential; the Examples section presents other examples. This technique is sometimes referred to as pointer jumping or recursive doubling.

An interesting aspect of this restructuring is that the new algorithm involves substantially more total work than the original sequential one (O(N log N) versus O(N)), but the restructured algorithm contains potential concurrency that if fully exploited reduces total running time to O(log N) (versus O(N)). Most strategies and algorithms based on this pattern similarly trade off an increase in total work for a potential decrease in execution time. Notice also that the exploitable concurrency can be extremely fine-grained (as in the previous example), which may limit the situations in which this pattern yields an efficient algorithm. Nevertheless, the pattern can still serve as an inspiration for lateral thinking about how to parallelize problems that at first glance appear to be inherently sequential.

#### Forces

• Recasting the problem to transform an inherently sequential traversal of the recursive data structure into one that allows all elements to be operated upon concurrently does so at the cost of increasing the total work of the computation. This must be balanced against the improved performance available from running in parallel.

• This recasting may be difficult to achieve (because it requires looking at the original problem from an unusual perspective) and may lead to a design that is difficult to understand and maintain.

• Whether the concurrency exposed by this pattern can be effectively exploited to improve performance depends on how computationally expensive the operation is and on the cost of communication relative to computation on the target parallel computer system.

#### Solution

The most challenging part of applying this pattern is restructuring the operations over a recursive data structure into a form that exposes additional concurrency. General guidelines are difficult to construct, but the key ideas should be clear from the examples provided with this pattern.

After the concurrency has been exposed, it is not always the case that this concurrency can be effectively exploited to speed up the solution of a problem. This depends on a number of factors including how much work is involved as each element of the recursive data structure is updated and on the characteristics of the target parallel computer.

#### Data decomposition

In this pattern, the recursive data structure is completely decomposed into individual elements and each element is assigned to a separate UE. Ideally each UE would be assigned to a different PE, but it is also possible to assign multiple UEs to each PE. If the number of UEs per PE is too large, however, the overall performance will be poor because there will not be enough concurrency to overcome the increase in the total amount of work.

For example, consider the root-finding problem described earlier. We'll ignore overhead in our computations. If N = 1024 and t is the time to perform one step for one data element, then the running time of a sequential algorithm will be about 1024t. If each UE is assigned its own PE, then the running time of the parallel algorithm will be around (log N)t or 10t. If only two PEs are available for the parallel algorithm, however, then all N log N or 10240 computation steps must be performed on the two PEs, and the execution time will be at least 5120t, considerably more than the sequential algorithm.

#### Structure

Typically the result of applying this pattern is an algorithm whose top-level structure is a sequential composition in the form of a loop, in which each iteration can be described as "perform this operation simultaneously on all (or selected) elements of the recursive data structure". Typical operations include "replace each element's successor with its successor's successor" (as in the example in the Context section) and "replace a value held at this element with the sum of the current value and the value of the predecessor's element."

#### Synchronization

Algorithms that fit this pattern are described in terms of simultaneously updating all elements of the data structure. Some target platforms (for example, SIMD architectures such as the early Connection Machines) make this trivial to accomplish by assigning each data element to a separate PE (possibly a logical PE) and executing instructions in a lockstep fashion at each PE. MIMD platforms with the right supporting programming environments (for example, High Performance Fortran [HPF97]) provide similar semantics.

If the target platform doesn't provide the required synchronization implicitly, it will be necessary to introduce the synchronization explicitly. For example, if the operation performed during a loop iteration contains the assignment

```next [k] = next [next[k]]
```

then the parallel algorithm must ensure that next [k] is not updated before other UEs that need its value for their computation have received it. One common technique is to introduce a new variable, say next2, at each element. Even-numbered iterations then read next but update next2, while odd-numbered iterations read next2 and update next. The necessary synchronization is accomplished by placing a barrier (as described in the Implementation Mechanisms design space) between each successive pair of iterations. Notice that this can substantially increase the overhead associated with the parallel algorithm, which can overwhelm any speedup derived from the additional concurrency. This is most likely to be a factor if the calculation required for each element is trivial (which, alas, for many of the examples it is).

If there are fewer PEs than data elements, the program designer must decide whether to assign each data element to a UE and assign multiple UEs to each PE (thereby simulating some of the parallelism) or whether to assign multiple data elements to each UE and process them serially. The latter is less straightforward (requiring an approach similar to that sketched previously, in which variables involved in the simultaneous update are duplicated), but can be more efficient.

#### Partial sums of a linked list

In this example, adopted from Hillis and Steele [HS86], the problem is to compute the prefix sums of all the elements in a linked list in which each element contains a value x. In other words, after the computation is complete, the first element will contain x0, the second will contain x0 + x1 the third x0 + x1 + x2, etc.

shows pseudocode for the basic algorithm. Fig. 4.23 shows the evolution of the computation where xi is the initial value of the (i + 1)-th element in the list.

Figure 4.23 Steps in finding partial sums of a list. Straight arrows represent links between elements; curved arrows indicate additions.

This example can be generalized by replacing addition with any associative operator and is sometime known as a prefix scan. It can be used in a variety of situations, including solving various types of recurrence relations.

#### Known uses

Algorithms developed with this pattern are a type of data parallel algorithm. They are widely used on SIMD platforms and to a lesser extent in languages such as High Performance Fortran [HPF97]. These platforms support the fine-grained concurrency required for the pattern and handle synchronization automatically because every computation step (logically if not physically) occurs in lockstep on all the processors. Hillis and Steele [HS86] describe several interesting applications of this pattern, including finding the end of a linked list, computing all partial sums of a linked list, region labeling in two-dimensional images, and parsing.

#### Example 4.22. Pseudocode for finding partial sums of a list

```for all k in parallel
{
temp[k] = next[k];
while temp[k] != null
{
x[temp[k]] = x[k] + x[temp[k]];
temp[k] = temp [temp [k] ];
}
}
```

In combinatorial optimization, problems involving traversing all nodes in a graph or tree can often be solved with this pattern by first finding an ordering on the nodes to create a list. Euler tours and ear decomposition [EG88] are well-known techniques to compute this ordering.

JáJá [J92] also describes several applications of this pattern: finding the roots of trees in a forest of rooted directed trees, computing partial sums on a set of rooted directed trees (similar to the preceding example with linked lists), and list-ranking (determining for each element of the list its distance from the start/end of the list).

#### Related Patterns

With respect to the actual concurrency, this pattern is very much like the Geometric Decomposition pattern, a difference being that in this pattern the data structure containing the elements to be operated on concurrently is recursive (at least conceptually). What makes it different is the emphasis on fundamentally rethinking the problem to expose fine-grained concurrency.