Home > Articles > Programming > Graphic Programming

  • Print
  • + Share This
From the author of Conclusion


Given GPUs' stellar performance in performing reductions, such as computing the sum of an input array (as described in Chapter 12, "Reduction," in The CUDA Handbook), we might expect them to be equally good at computing histograms, which at first blush seems like a similar calculation. Unfortunately, the data-dependent behavior due to contention is hard to avoid, and strategies to detect degenerate input data reduce average-case performance. As a result, the best algorithm depends on the application: Applications that require level performance regardless of input data prefer per-thread implementations, whereas applications that are unlikely to encounter degenerate input data prefer per-block implementations.

One avenue that we did not explore in this article was putting the privatized histograms in global memory. Provided the working set of histogram elements fit in the L1 and L2 caches, this approach may work better than putting the histograms in shared memory due to higher occupancy; the per-thread implementations described in this article are below 10% occupancy on both SM 2.0 and SM 3.0. But for reasonable-seeming thread and block counts, the amount of intermediate data would be comparable to the amount of input data for most applications. For example, 128 blocks of 32 threads each would require three passes (clearing to zero, incrementing in the innermost loop, followed by a final reduction) over 4MB of intermediate data.

To improve performance of this and similar workloads, NVIDIA could add native hardware support for shared memory reductions. This feature already exists for global memory: When programmers invoke intrinsics for atomic operations, the compiler translates the intrinsic into a different machine instruction, depending on whether the return value is used (typically for synchronization). If the compiler could emit machine instructions for shared memory reduction—even if only for integer addition—then the per-block formulations of this algorithm likely would be the best choice for all applications.

  • + Share This
  • 🔖 Save To Your Account