- Using Multiple Processes to Improve System Productivity
- Multiple Users Utilizing a Single System
- Improving Machine Efficiency Through Consolidation
- Using Parallelism to Improve the Performance of a Single Task
- Parallelization Patterns
- How Dependencies Influence the Ability to Run Code in Parallel
- Identifying Parallelization Opportunities
- Summary
How Dependencies Influence the Ability to Run Code in Parallel
Dependencies within an application (or the calculation it performs) define whether the application can possibly run in parallel. There are two types of dependency: loop- or data-carried dependencies and memory-carried dependencies.
With a loop-carried dependency, the next calculation in a loop cannot be performed until the results of the previous iteration are known. A good example of this is the loop to calculate whether a point is in the Mandelbrot set. Listing 3.4 shows this loop.
Listing 3.4. Code to Determine Whether a Point Is in the Mandelbrot Set
int inSet(double ix, double iy)
{
    int iterations = 0;
    double x = ix, y = iy, x2 = x*x, y2 = y*y;
    while ( (x2 + y2 < 4) && (iterations < 1000) )
    {
        y = 2 * x * y + iy;
        x = x2 - y2 + ix;
        x2 = x * x;
        y2 = y * y;
        iterations++;
    }
    return iterations;
}
Each iteration of the loop depends on the results of the previous iteration. The loop terminates either when 1,000 iterations have been completed or when the point escapes a circle of radius two centered on the origin. It is not possible to predict how many iterations this loop will complete. There is also insufficient work for each iteration of the loop to be split over multiple threads. Hence, this loop must be performed serially.
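The dependency, however, is confined to the iterations for a single point; different points do not depend on one another. So, although the loop inside inSet() must run serially, a sketch like the following (using an OpenMP directive, with hypothetical grid dimensions and coordinate mapping) could still evaluate many points concurrently.

int inSet(double ix, double iy);   /* as defined in Listing 3.4 */

#define WIDTH  64                  /* hypothetical grid dimensions */
#define HEIGHT 64

int counts[HEIGHT][WIDTH];

void computeGrid(void)
{
    /* Each point is independent of the others, so the rows can be divided
       among threads even though the loop inside inSet() is serial. */
    #pragma omp parallel for
    for (int row = 0; row < HEIGHT; row++)
    {
        for (int col = 0; col < WIDTH; col++)
        {
            double x = -2.0 + 4.0 * col / WIDTH;    /* map column to real axis */
            double y = -2.0 + 4.0 * row / HEIGHT;   /* map row to imaginary axis */
            counts[row][col] = inSet(x, y);
        }
    }
}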
Memory-carried dependencies are more subtle. These represent the situation where a memory access must be ordered with respect to another memory access to the same location. Consider the snippet of code shown in Listing 3.5.
Listing 3.5. Code Demonstrating Ordering Constraints
int val = 0;

void g()
{
    val = 1;
}

void h()
{
    val = val + 2;
}
If the routines g() and h() are executed by different threads, then the result depends on the order in which the two routines are executed. If g() is executed followed by h(), then val will hold the result 3. If they are executed in the opposite order, then val will contain the result 1. This is an example of a memory-carried dependence because, to produce the correct answer, the operations need to be performed in the correct order.
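If the intended result is 3, the ordering has to be enforced explicitly once the routines run on different threads. The following sketch shows one possible way of doing that using POSIX threads, by joining the first thread before starting the second; the wrapper names are invented for the example.

#include <pthread.h>
#include <stdio.h>

int val = 0;

void g() { val = 1; }
void h() { val = val + 2; }

void *threadG(void *arg) { g(); return NULL; }
void *threadH(void *arg) { h(); return NULL; }

int main(void)
{
    pthread_t tg, th;

    pthread_create(&tg, NULL, threadG, NULL);
    pthread_join(tg, NULL);            /* ensure g() completes first */

    pthread_create(&th, NULL, threadH, NULL);
    pthread_join(th, NULL);            /* wait for h() to finish */

    printf("val = %d\n", val);         /* deterministically prints 3 */
    return 0;
}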
Antidependencies and Output Dependencies
Suppose one task, A, needs the data produced by another task, B; A depends on B and cannot start until B completes and releases the data needed by A. This is often referred to as a true dependency. Typically, B writes some data, and A needs to read that data. There are other combinations of two threads reading and writing data. Table 3.1 illustrates the four ways that tasks might have a dependency.
Table 3.1. Possible Ordering Constraints
|                   | Second task: read                        | Second task: write                          |
|-------------------|------------------------------------------|---------------------------------------------|
| First task: read  | Read after read (RAR): no dependency     | Write after read (WAR): antidependency      |
| First task: write | Read after write (RAW): true dependency  | Write after write (WAW): output dependency  |
When both threads perform read operations, there is no dependency between them, and the same result is produced regardless of the order the threads run in.
With an antidependency, or write after read, one task has to read the data before the second task can overwrite it. With an output dependency, or write after write, one of the two tasks has to provide the final result, and the order in which the two tasks write their results is critical. These two types of dependency can be most clearly illustrated using serial code.
In the code shown in Listing 3.6, there is an antidependency on the variable data1. The first statement needs to complete before the second statement because the second statement reuses the variable data1.
Listing 3.6. An Example of an Antidependency
void antiDependency()
{
    result1 = calculation( data1 );   // Needs to complete first
    data1 = result2 + 1;              // Will overwrite data1
}
If one of the statements was modified to use an alternative or temporary variable, for example, data1_prime, then both statements could proceed in any order. Listing 3.7 shows this modified code.
Listing 3.7. Fixing an Antidependency
void antiDependency()
{
    data1_prime = data1;              // Local copy of data1
    result1 = calculation( data1_prime );
    data1 = result2 + 1;              // No longer has an antidependency
}
The code shown in Listing 3.8 demonstrates an output dependency on the variable data1. The second statement needs to complete after the first statement only because they both write to the same variable.
Listing 3.8. An Output Dependency
void outputDependency()
{
    data1 = result1 + 2;
    data1 = result2 + 2;              // Overwrites the same variable
}
If the first target variable was renamed data1_prime, then both statements could proceed in any order. Listing 3.9 shows this fix.
Listing 3.9. Fixing an Output Dependency
void outputDependency()
{
    data1_prime = result1 + 2;        // Result written to a renamed variable
    data1 = result2 + 2;              // No longer has an output dependency
}
What is important about these two situations is that both output dependencies and antidependencies can be avoided by renaming the data being written, so that the final write operation goes to a different place. This might involve taking a copy of the object and having each task work on its own copy, or it might be a matter of duplicating a subset of the active variables. In the worst case, it could be resolved by having both tasks work independently and then adding a short section of code that sets the variables to the correct final state.
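As a sketch of this "work on a private copy" idea applied to the antidependency of Listing 3.6, the fragment below duplicates data1 before the two statements run, after which they no longer conflict and could execute in either order or on different threads. The variable and function names follow the earlier listings; the types are assumed for illustration.

// Sketch only: types are assumed; calculation() is the routine from Listing 3.6.
double data1, data1_prime;
double result1, result2;
double calculation( double d );

void taskA()
{
    result1 = calculation( data1_prime );  // Reads only the private copy
}

void taskB()
{
    data1 = result2 + 1;                   // Free to overwrite data1
}

void runTasks()
{
    data1_prime = data1;   // Duplicate the data before the tasks run
    // taskA() and taskB() no longer conflict and could run in parallel
    taskA();
    taskB();
}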
Using Speculation to Break Dependencies
In some instances, there is a clear potential dependency between different tasks. This dependency means it is impossible to use a traditional parallelization approach where the work is split between the two threads. Even in these situations, it can be possible to extract some parallelism at the expense of performing some unnecessary work. Consider the code shown in Listing 3.10.
Listing 3.10. Code with Potential for Speculative Execution
int doWork( int x, int y )
{
    int value = longCalculation( x, y );
    if (value > threshold)
    {
        return value + secondLongCalculation( x, y );
    }
    else
    {
        return value;
    }
}
In this example, it is not known whether the second long calculation will be needed until the first one has completed. However, it would be possible to compute the second long calculation speculatively, at the same time as the first calculation is performed, and then, depending on the result of the first calculation, either use or discard the second value. Listing 3.11 shows the resulting code parallelized using pseudoparallelization directives.
Listing 3.11. Speculatively Parallelized Code
int doWork(int x, int y)
{
    int value1, value2;
    #pragma start parallel region
    {
        #pragma perform parallel task
        {
            value1 = longCalculation( x, y );
        }
        #pragma perform parallel task
        {
            value2 = secondLongCalculation( x, y );
        }
    }
    #pragma wait for parallel tasks to complete
    if (value1 > threshold)
    {
        return value1 + value2;
    }
    else
    {
        return value1;
    }
}
The #pragma directives in the previous code are very similar to those that are actually used in OpenMP, which we will discuss in Chapter 7, "OpenMP and Automatic Parallelization." The first directive tells the compiler that the following block of code contains statements that will be executed in parallel. The two #pragma directives in the parallel region indicate the two tasks to be performed in parallel. A final directive indicates that the code cannot exit the parallel region until both tasks have completed.
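For comparison, a version using real OpenMP directives might look like the following sketch. The sections construct runs the two blocks as independent tasks, and the implicit barrier at the end of the construct plays the role of the "wait for parallel tasks to complete" directive. The threshold variable and the two calculation routines are assumed to be defined elsewhere.

#include <omp.h>

extern int threshold;                        /* assumed to exist */
int longCalculation( int x, int y );         /* assumed to exist */
int secondLongCalculation( int x, int y );   /* assumed to exist */

int doWork(int x, int y)
{
    int value1, value2;
    #pragma omp parallel sections
    {
        #pragma omp section
        { value1 = longCalculation( x, y ); }

        #pragma omp section
        { value2 = secondLongCalculation( x, y ); }
    }   /* implicit barrier: both sections have completed here */

    if (value1 > threshold)
    {
        return value1 + value2;
    }
    else
    {
        return value1;
    }
}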
Of course, it is important to consider whether the parallelization will cost more time than it saves. There are two key reasons why the parallel implementation could be slower than the serial code.
- The overhead of setting up the parallel tasks and synchronizing after they complete is close in magnitude to the time taken by the work itself.
- The second long calculation takes longer than the first long calculation, and its results are rarely used.
It is possible to put together an approximate model of this situation. Suppose the first calculation takes T1 seconds and the second calculation takes T2 seconds; also suppose that the probability that the second calculation is actually needed is P. Then the total runtime for the serial code would be T1 + P * T2.
For the parallel code, assume that the calculations take the same time as they do in the serial case and the probability remains unchanged, but there is also an overhead from synchronization, S. Then the time taken by the parallel code is S + max(T1, T2).
Figure 3.22 shows the two situations.
Figure 3.22 Parallelization using speculative execution
We can further deconstruct this to identify the constraints on the two situations where the parallel version is faster than the serial version:
- If T1 > T2, then for the speculation to be profitable, S + T1 < T1 + P*T2, or S < P*T2. In other words, the synchronization cost needs to be less than the average amount of time contributed by the second calculation. If the second calculation is rarely needed, P*T2 is small, so the synchronization overhead of computing it speculatively must also be very small.
- If T2 > T1 (as shown in Figure 3.22), then for speculation to be profitable, S + T2 < T1 + P*T2, or P > (T2 + S - T1)/T2. This is a more complex result because the second task takes longer than the first, so the speculation starts off with a longer runtime than the original serial code. Because T2 > T1, the quantity T2 + S - T1 is always greater than zero; it represents the overhead introduced by parallelization. For the parallel code to be profitable, this overhead has to be lower than the cost contributed by executing T2, so the probability of executing T2 has to be greater than the ratio of the additional cost to the original cost. As the additional cost introduced by the parallel code gets closer to the cost of executing T2, T2 needs to be executed increasingly frequently to make the parallelization profitable. A small numeric check of these conditions is sketched after this list.
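The conditions are easy to check numerically. The sketch below, using purely invented timings, compares the serial cost T1 + P*T2 against the speculative cost S + max(T1, T2).

#include <stdio.h>

/* Illustrative numbers only: times in milliseconds, probability between 0 and 1. */
int main(void)
{
    double T1 = 10.0;   /* first calculation */
    double T2 = 4.0;    /* second, speculatively executed, calculation */
    double P  = 0.5;    /* probability the second result is actually needed */
    double S  = 1.0;    /* synchronization overhead */

    double serial      = T1 + P * T2;              /* 12.0 ms */
    double maxT        = (T1 > T2) ? T1 : T2;
    double speculative = S + maxT;                 /* 11.0 ms */

    printf("serial = %.1f ms, speculative = %.1f ms\n", serial, speculative);
    if (speculative < serial)
    {
        /* With T1 > T2 this is equivalent to the condition S < P*T2 (1 < 2). */
        printf("speculation is profitable\n");
    }
    return 0;
}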
The previous approach is known as speculative execution: work is performed before it is known to be needed, and the results are thrown away if they turn out not to be needed. There is also value speculation, where execution proceeds using a speculated (predicted) value of an input. Consider the code shown in Listing 3.12.
Listing 3.12. Code with Opportunity for Value Speculation
int doWork(int x, int y)
{
    int value = longCalculation( x, y );
    return secondLongCalculation( value );
}
In this instance, the second calculation depends on the value of the first calculation. If the value of the first calculation was predictable, then it might be profitable to speculate on the value of the first calculation and perform the two calculations in parallel. Listing 3.13 shows the code parallelized using value speculation and pseudoparallelization directives.
Listing 3.13. Parallelization Using Value Speculation
int doWork(int x, int y)
{
    int value1, value2;
    static int lastValue;
    #pragma start parallel region
    {
        #pragma perform parallel task
        {
            value1 = longCalculation( x, y );
        }
        #pragma perform parallel task
        {
            value2 = secondLongCalculation( lastValue );
        }
    }
    #pragma wait for parallel tasks to complete
    if (value1 == lastValue)
    {
        return value2;
    }
    else
    {
        lastValue = value1;
        return secondLongCalculation( value1 );
    }
}
The cost calculation for this kind of speculation is very similar to the one performed for the speculative execution example. Once again, assume that T1 and T2 represent the costs of the two routines. In this instance, P represents the probability that the speculation is incorrect, and S represents the synchronization overhead. Figure 3.23 shows the costs of value speculation.
Figure 3.23 Parallelization using value speculation
The original code takes T1 + T2 seconds to complete. The parallel code takes max(T1, T2) + S + P*T2 seconds. For the parallelization to be profitable, one of the following conditions needs to be true:
- If T1 > T2, then for the speculation to be profitable, T1 + S + P*T2 < T1 + T2, or S < (1 - P)*T2. If the speculation is mostly correct, the synchronization cost just needs to be less than the cost of performing T2. If the speculation is often wrong, then the synchronization cost needs to be much smaller than T2, since T2 will frequently be re-executed to correct the misspeculation.
- If T2 > T1, then for the speculation to be profitable, T2 + S + P*T2 < T1 + T2, or S < T1 - P*T2. The synchronization cost needs to be less than the cost of T1, reduced by the expected cost of re-executing T2 when the speculation is incorrect. A worked numeric example follows this list.
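As a purely illustrative example with invented timings, suppose T1 = 10 seconds, T2 = 4 seconds, S = 1 second, and the prediction of the first calculation's result is wrong one time in four (P = 0.25). The serial code takes T1 + T2 = 14 seconds, while the value-speculating code takes max(T1, T2) + S + P*T2 = 10 + 1 + 1 = 12 seconds. The speculation is profitable here because T1 > T2 and the corresponding condition S < (1 - P)*T2 holds: 1 < 0.75 * 4 = 3.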
As can be seen from the preceding discussion, speculative computation can lead to a performance gain but can also lead to a slowdown; hence, care needs to be taken in using it only where it is appropriate and likely to provide a performance gain.
Critical Paths
One way of looking at parallelization is by examining the critical paths in the application. A critical path is the sequence of dependent steps that determines the minimum time in which the task can complete. A serial program might complete tasks A, B, C, and D. Not all of the tasks need to have dependencies: B might depend on the results of A, and D might depend on the results of B and C, but C might not depend on any previous results. This kind of data can be displayed on a graph such as the one in Figure 3.24.
Figure 3.24 Critical paths
It is relatively straightforward to determine the critical path once the dependencies between the tasks and their durations are known. From this graph, it is apparent that task C could be performed in parallel with tasks A and B. Given timing data, it would be possible to estimate the expected performance of this parallelization strategy.
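Given the dependency graph and an estimated duration for each task, the minimum completion time can be computed mechanically as the length of the longest path through the graph. The following sketch does this for the A, B, C, D example; the durations are invented purely for illustration.

#include <stdio.h>

/* Tasks A, B, C, D as in Figure 3.24: B depends on A; D depends on B and C;
   C has no dependencies. The durations are hypothetical. */
enum { A, B, C, D, NTASKS };

int main(void)
{
    double duration[NTASKS] = { 3.0, 4.0, 5.0, 2.0 };
    double finish[NTASKS];

    /* Earliest finish time = own duration + latest finish of the prerequisites. */
    finish[A] = duration[A];
    finish[C] = duration[C];
    finish[B] = finish[A] + duration[B];
    double latest = (finish[B] > finish[C]) ? finish[B] : finish[C];
    finish[D] = latest + duration[D];

    /* finish[D] is the length of the critical path: A -> B -> D = 9.0 here,
       so running C in parallel with A and B hides its cost entirely. */
    printf("minimum completion time = %.1f\n", finish[D]);
    return 0;
}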