
4.2 Creating a Naive Parallel Version

In many other types of parallel programming, you would be wise to stare at your code and plot various approaches and alternative algorithms before you even consider starting to type. With OpenACC, the low effort and quick feedback allow you to dive right in and try some things without much risk of wasted effort.

4.2.1 Find the Hot Spot

Almost always the first thing to do is find the hot spot: the point of highest numerical intensity in your code. A profiler like those you’ve read about will quickly locate and rank these spots. Often, as is the case here, it is obvious where to start. A large loop is a big flag, and you have two of them within the main loop. This is where we focus.

4.2.2 Is It Safe to Use kernels?

The biggest hammer in your toolbox is the kernels directive. Refer to Chapter 1 for full details on kernels. Don’t resist the urge to put it in front of some large, nested loop. One nice feature about this directive is that it is safe out of the box; until you start to override its default behavior with additional directives, the compiler will be able to see whether there are any code-breaking dependencies, and it will make sure that the device has access to all the required data.

4.2.3 OpenACC Implementations

Let’s charge ahead and put kernels directives in front of the two big loops. The C and Fortran codes become the code shown in Listings 4.5 and 4.6.

Listing 4.5. C Laplace code main loop with kernels directives

while ( worst_dt > TEMP_TOLERANCE ) {

   #pragma acc kernels
   for(i = 1; i <= HEIGHT; i++) {
      for(j = 1; j <= WIDTH; j++) {
         Temperature[i][j] = 0.25 * (Temperature_previous[i+1][j]
                                   + Temperature_previous[i-1][j]
                                   + Temperature_previous[i][j+1]
                                   + Temperature_previous[i][j-1]);
      }
   }

   worst_dt = 0.0;

   #pragma acc kernels
   for(i = 1; i <= HEIGHT; i++){
      for(j = 1; j <= WIDTH; j++){
         worst_dt = fmax( fabs(Temperature[i][j]-
                               Temperature_previous[i][j]),
                          worst_dt);
         Temperature_previous[i][j] = Temperature[i][j];
      }
   }

   if((iteration % 100) == 0) {
      track_progress(iteration);
   }

   iteration++;
}


Listing 4.6. Fortran Laplace code main loop with kernels directives

do while ( worst_dt > temp_tolerance )

   !$acc kernels
   do j=1,width
      do i=1,height
         temperature(i,j) = 0.25*(temperature_previous(i+1,j)&
                                + temperature_previous(i-1,j)&
                                + temperature_previous(i,j+1)&
                                + temperature_previous(i,j-1))
      enddo
   enddo
   !$acc end kernels

   worst_dt = 0.0

   !$acc kernels
   do j=1,width
      do i=1,height
         worst_dt = max( abs(temperature(i,j) - &
                             temperature_previous(i,j)), &
                         worst_dt )
         temperature_previous(i,j) = temperature(i,j)
      enddo
   enddo
   !$acc end kernels

   if( mod(iteration,100).eq.0 ) then
      call track_progress(temperature, iteration)
   endif

   iteration = iteration+1

enddo


Compilation is also straightforward: all you need to do is activate the directives. Using the PGI compiler, for example, for the C version:

pgcc -acc laplace.c

Or for the Fortran version:

pgf90 -acc laplace.f90

If you do this, the executable pops right out and you can be on your way. However, you probably want to verify that your directives actually did something. OpenACC’s defense against compiling a loop with dependencies or other issues is to simply ignore the directives and deliver a “correct,” if unaccelerated, executable. With the PGI compiler, you can request feedback on the C OpenACC compilation by using this:

pgcc -acc -Minfo=acc laplace.c

Here it is for Fortran:

pgf90 -acc -Minfo=acc laplace.f90

Similar options are available for other compilers. Among the informative output, you see the “Accelerator kernel generated” message for both of your kernels-enabled loops. You may also notice that a reduction was automatically generated for worst_dt. It was nice of the compiler to catch that and generate the reduction automatically. So far so good.

If you run this executable, you will get something like this:

. . .
. . .
---------- Iteration number: 3200 ------------
. . .[998,998]: 99.18  [999,999]: 99.56  [1000,1000]: 99.86
---------- Iteration number: 3300 ------------
. . .[998,998]: 99.19  [999,999]: 99.56  [1000,1000]: 99.87

Max error at iteration 3372 was 0.009995
Total time was 35.258830 seconds.

This was executed on an NVIDIA K80, the fastest GPU available at the time of this writing. For our efforts thus far, we have managed to slow down the code by about 70 percent, which is not impressive at all.
