An alternative strategy is to allocate a histogram per block, and have the threads in the block use shared memory atomics to increment the histogram elements.
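The per-block strategy might be sketched as a kernel like the following (the names and grid-stride structure are illustrative assumptions, not the article's actual listing):

```cuda
// Sketch of a histogram-per-block kernel. Each block accumulates into
// a shared memory histogram with atomicAdd(), then flushes its partial
// counts to the global histogram with one global atomic per bin.
__global__ void histogramPerBlock(unsigned int *gHist,
                                  const unsigned char *in, size_t N)
{
    __shared__ unsigned int sHist[256];

    // Zero the block's shared histogram cooperatively.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        sHist[i] = 0;
    __syncthreads();

    // Grid-stride loop: shared memory atomics absorb the contention here.
    for (size_t i = blockIdx.x * (size_t) blockDim.x + threadIdx.x;
         i < N;
         i += (size_t) blockDim.x * gridDim.x)
        atomicAdd(&sHist[in[i]], 1);
    __syncthreads();

    // One global atomic per bin per block, rather than one per input byte.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&gHist[i], sHist[i]);
}
```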
Shared memory atomics, as described in The CUDA Handbook Chapter 8, "Streaming Multiprocessors," are implemented with a combination of hardware and software: on a shared memory load, the hardware may optionally "lock" the shared memory location. Because the locks are a limited resource, this instruction returns a predicate that indicates whether the thread acquired the lock. The compiler then emits a series of predicated instructions that either perform the operation and relinquish the lock, or loop to acquire it, as shown in Listing 4.
Listing 4. Microcode for shared memory increment.
/*0138*/      LDSLK P1, R7, [R6];
/*0148*/  @P1 IADD R7, R7, R4;
/*0150*/  @P1 STSCUL P2, [R6], R7;
/*0158*/ @!P2 BRA 0x138;
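In C-like pseudocode, the emitted sequence behaves roughly as follows (ldslk() and stscul() are hypothetical stand-ins for the LDSLK and STSCUL instructions, which have no CUDA C equivalent):

```cuda
// Pseudocode rendering of Listing 4 (not compilable CUDA C).
bool done = false;
while (!done) {                        // @!P2 BRA 0x138
    unsigned int v;
    bool locked = ldslk(addr, &v);     // LDSLK: load and try to lock
    if (locked) {                      // @P1 predication
        v += increment;                // IADD: the requested operation
        done = stscul(addr, v);        // STSCUL: store and unlock
    }
}
```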
Each SM has 1,024 locks, selected by bits 2–9 of the shared memory address; if all the threads are trying to acquire the same lock, most will loop while the thread that holds it performs its operation. The result is the same type of data-dependent performance exhibited by global memory atomics, though the effect is somewhat muted because the contention is spread across multiple thread blocks.
Padding the histogram size to 257 results in a modest decrease in contention for degenerate inputs, presumably because it causes different blocks in the same SM to acquire different shared memory locks. However, threads within a given block still contend for the same lock.
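The padding amounts to nothing more than a widened declaration, along these lines (a sketch, not the article's listing):

```cuda
// Sketch: pad each block's shared histogram from 256 to 257 counters.
// The extra element is never used; it only offsets each co-resident
// block's histogram base address, so the same bin in different blocks
// maps to a different shared memory lock.
__shared__ unsigned int sHist[257];
```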
Table 4 summarizes how this algorithm degrades due to contention as the number of distinct input values drops from 256 (a fully random sampling) to 1 (all zeros). The histogram-per-block implementation turns out to suffer even more from contention than the histogram-per-grid implementation of Listings 2 and 3, which fires atomics directly into the output histogram.
One optimization that benefited SM 1.x-class hardware, whose global memory atomics were very slow, but does not help on more recent hardware, is to perform a reduction over the per-block histograms after writing them to global memory. The benefits of this optimization are summarized in the "Performance" section of this article.
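Such a reduction pass might look like the following sketch (the kernel name, perBlockHists layout, and numBlocks parameter are assumptions, not from the article):

```cuda
// Hypothetical reduction pass: perBlockHists holds numBlocks histograms
// of 256 counters each, written earlier by the histogram kernel. Each
// thread sums one bin across all blocks -- no atomics are needed,
// because every thread owns its output element.
__global__ void reducePerBlockHistograms(unsigned int *gHist,
                                         const unsigned int *perBlockHists,
                                         int numBlocks)
{
    int bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin < 256) {
        unsigned int sum = 0;
        for (int b = 0; b < numBlocks; b++)
            sum += perBlockHists[b * 256 + bin];
        gHist[bin] = sum;
    }
}
```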