Notes
[1] For a very small problem, the overhead of delegating and then waiting on the work might dominate the runtime.
[2] Because CUDA makes no guarantees about the order in which threads execute, the output histogram must be zero-initialized in host code before this kernel is invoked.
[3] Improved hardware support for global atomics is only a partial explanation for these performance increases, of course.
[4] The INTDIVIDE_CEILING macro, defined in chUtil.h, computes the smallest integer that is greater than or equal to the quotient of its two operands (that is, the quotient rounded up).
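
As an illustration of the rounding described in note [4], the following is a minimal sketch of one common way to write a ceiling division for non-negative integers. The names INT_DIVIDE_CEILING_SKETCH and blocks_needed are hypothetical and chosen for this example only; the actual definition of INTDIVIDE_CEILING in chUtil.h may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for INTDIVIDE_CEILING; not copied from chUtil.h.
 * Assumes a >= 0 and b > 0. Note that (a) + (b) - 1 can overflow when a
 * is close to the maximum value of its type. */
#define INT_DIVIDE_CEILING_SKETCH(a, b) (((a) + (b) - 1) / (b))

/* Number of fixed-size blocks needed to cover n elements, for example the
 * number of thread blocks needed to cover n inputs (hypothetical helper). */
static size_t blocks_needed(size_t n, size_t block_size)
{
    assert(block_size > 0);  /* division by zero is undefined */
    return INT_DIVIDE_CEILING_SKETCH(n, block_size);
}
```

With these definitions, blocks_needed(1000, 256) yields 4 and blocks_needed(1025, 256) yields 5, whereas plain integer division would truncate to 3 and 4 and leave the trailing elements uncovered.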