We arrived at three implementations of 8-bit histogram computation:
- Naïve global atomics that accumulate directly into the final output (histogram per grid)
- Per-block histograms that use shared memory atomics (histogram per block)
- Privatized histograms with periodic merge (histogram per thread)
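The middle strategy above can be sketched as follows. This is an illustrative reconstruction, not the text's exact code: each block accumulates a private 256-bin histogram in shared memory, then contributes one global atomic per bin rather than one per pixel.

```cuda
// Sketch of the histogram-per-block strategy (names are illustrative).
// Shared-memory atomics absorb most of the contention; only 256 global
// atomics per block touch the final output, regardless of input size.
__global__ void histogramPerBlock(
    unsigned int *gHist,            // final 256-bin histogram
    const unsigned char *in, size_t N )
{
    __shared__ unsigned int sHist[256];
    for ( int i = threadIdx.x; i < 256; i += blockDim.x )
        sHist[i] = 0;
    __syncthreads();
    for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x;
          i < N;
          i += blockDim.x*gridDim.x )
        atomicAdd( &sHist[in[i]], 1 );      // shared-memory atomic
    __syncthreads();
    // Merge this block's private histogram into the global one
    for ( int i = threadIdx.x; i < 256; i += blockDim.x )
        atomicAdd( &gHist[i], sHist[i] );
}
```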
All implementations benefited from reading the input with 32-bit memory operations and from unrolling the inner loop four times, in the spirit of Listing 3; all of the performance results reported here reflect these optimizations.
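The read-widening and unrolling optimization can be sketched as below. This is a hypothetical fragment, assuming the input is 4-byte aligned and its size is a multiple of 4; it replaces the byte-at-a-time inner loop of a per-block kernel with one 32-bit load per four pixels.

```cuda
// Sketch: increment four bins from one packed 32-bit word.
// sHist is the block's shared 256-bin histogram.
__device__ void incPixels4( unsigned int *sHist, unsigned int v )
{
    atomicAdd( &sHist[  v        & 0xff ], 1 );
    atomicAdd( &sHist[ (v >>  8) & 0xff ], 1 );
    atomicAdd( &sHist[ (v >> 16) & 0xff ], 1 );
    atomicAdd( &sHist[  v >> 24         ], 1 );
}

// Widened inner loop: each iteration loads one word (four pixels).
// Assumes `in` is 4-byte aligned and N is a multiple of 4.
__device__ void innerLoop32( unsigned int *sHist,
                             const unsigned char *in, size_t N )
{
    const unsigned int *in32 = (const unsigned int *) in;
    for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x;
          i < N/4;
          i += blockDim.x*gridDim.x )
        incPixels4( sHist, in32[i] );
}
```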
We ran the kernels on large inputs (at least 4096×4096) with the --random parameter varying from 256 (any 8-bit value may appear in the input) down to 1 (the input data is all zeros). By decreasing the number of possible input values, we highlight how vulnerable each approach is to data-dependent performance degradation caused by atomic contention. The privatized histogram-per-thread approach is designed to deliver level performance regardless of the input, but it pays for that robustness with lower performance on better-behaved input data.
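The effect of the --random parameter can be illustrated with a host-side sketch; the actual test program's input generation is not shown here, and this function is a hypothetical stand-in. With `range` possible values, each byte is drawn from [0, range), so range==1 produces all zeros and funnels every atomic increment onto bin 0.

```c
#include <stdlib.h>

/* Hypothetical sketch of test-input generation: `range` plays the
 * role of the --random parameter. range==256 allows any 8-bit value;
 * range==1 makes the input all zeros, maximizing contention. */
void fillRandom( unsigned char *p, size_t N, int range )
{
    for ( size_t i = 0; i < N; i++ )
        p[i] = (unsigned char) (rand() % range);
}
```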
Figure 7 shows the results for a GT200 (SM 1.3). Since SM 1.x hardware doesn't have enough shared memory to run the privatized per-thread algorithm, we report the performance of three variants: per-grid; per-block using atomics to accumulate into the final output; and per-block using reduction to accumulate into the final output. (The reduction strategy doesn't benefit SM 2.x or SM 3.x hardware, whose global atomics are much faster.) The per-grid implementation, which uses global atomics to operate directly on the output histogram, is so slow that its relative performance is impossible to make out from the chart; to get an idea of its relative performance, refer to Table 3.
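The reduction-based merge used on SM 1.x can be sketched as below; this is an assumed reconstruction, not the text's exact code. Each block first writes its private 256-bin histogram to its own row of a numBlocks×256 scratch array in global memory; a second kernel then sums each bin's column with plain loads and stores, avoiding the slow SM 1.x global atomics entirely.

```cuda
// Sketch of the atomic-free merge for SM 1.x: one thread per bin sums
// that bin across all per-block partial histograms.
// Launch with 256 threads total, e.g. histogramMergeReduce<<<1,256>>>.
__global__ void histogramMergeReduce(
    unsigned int *gHist,             // final 256-bin histogram
    const unsigned int *perBlock,    // numBlocks x 256 partials
    int numBlocks )
{
    int bin = blockIdx.x*blockDim.x + threadIdx.x;
    if ( bin < 256 ) {
        unsigned int sum = 0;
        for ( int b = 0; b < numBlocks; b++ )
            sum += perBlock[b*256 + bin];
        gHist[bin] = sum;            // plain store, no atomics
    }
}
```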
Figure 7 Histogram performance—SM 1.3.
Figure 8 shows the results for a GF100 (SM 2.0). The per-thread implementation is level at about 9 billion pixels per second; the per-block implementation is faster for 256 and 128 possible values, but crosses over at 64 possible values and degrades rapidly from there. The per-grid implementation appears on the chart, but it is uncompetitive at every input value count.
Figure 8 Histogram performance—SM 2.0.
Figure 9 shows the results for a GK104 (SM 3.0). Here, the per-grid performance is much more competitive; it is actually faster than the per-thread implementation for 256 possible input values. The most noteworthy result, however, is that the GK104 runs the per-thread implementation significantly slower than the older GF100 (about 6.6 billion pixels per second versus 9 billion). The two architectures are extremely different (Kepler has fewer SMs and more execution units per SM), and the low-occupancy structure of the per-thread kernel is not a good architectural fit for Kepler.
Figure 9 Histogram performance—SM 3.0.