10.12 Optimal Block Sizing and Performance
When the texture coordinates are generated in the “obvious” way, such as in tex2d_addressing.cu:
    row = blockIdx.y*blockDim.y + threadIdx.y;
    col = blockIdx.x*blockDim.x + threadIdx.x;
    ...
    tex2D( tex, (float) col, (float) row );
then texturing performance is dependent on the block size.
To find the optimal thread block size, the tex2D_shmoo.cu and surf2Dmemset_shmoo.cu programs time their performance over thread blocks whose width and height vary from 4..64, inclusive. Some combinations of width and height are not valid, because the resulting block would contain too many threads.
For this exercise, the texturing kernel is designed to do as little work as possible (maximizing exposure to the performance of the texture hardware) while still "fooling" the compiler into issuing the code. Each thread computes the floating-point sum of the values it reads and writes that sum only if the output parameter is non-NULL. The trick is that we never actually pass a non-NULL pointer to this kernel. The kernel is structured this way because if it never wrote any output at all, the compiler would see that it performed no useful work and would emit microcode that did not perform the texturing operations.
extern "C" __global__ void
TexSums( float *out, size_t Width, size_t Height )
{
    float sum = 0.0f;
    for ( int row = blockIdx.y*blockDim.y + threadIdx.y;
              row < Height;
              row += blockDim.y*gridDim.y ) {
        for ( int col = blockIdx.x*blockDim.x + threadIdx.x;
                  col < Width;
                  col += blockDim.x*gridDim.x ) {
            sum += tex2D( tex, (float) col, (float) row );
        }
    }
    if ( out ) {
        out[blockIdx.x*blockDim.x+threadIdx.x] = sum;
    }
}
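The shmoo's timing harness is not reproduced here; a minimal host-side sketch of how one block shape might be timed with CUDA events follows. The function name, grid size, and surrounding setup (the texture reference bound to a 4K×4K float array) are our assumptions, not code from tex2D_shmoo.cu.

```cuda
#include <cuda_runtime.h>

// Hypothetical harness that times TexSums for one (w, h) block shape.
// Assumes the texture reference 'tex' is already bound elsewhere.
float
TimeTexSums( int w, int h, size_t Width, size_t Height )
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );
    dim3 block( w, h );
    dim3 grid( 32, 8 );   // grid size is arbitrary here; the kernel's
                          // grid-stride loops cover the whole texture
    cudaEventRecord( start );
    TexSums<<<grid, block>>>( NULL, Width, Height );  // NULL: no output written
    cudaEventRecord( stop );
    cudaEventSynchronize( stop );
    cudaEventElapsedTime( &ms, start, stop );
    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    return ms;
}
```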
Even with our "trick," there is a risk that the compiler will emit code that checks the out parameter and exits the kernel early if it is equal to NULL. In that case, we would have to synthesize some output that would not affect performance too much (for example, have each thread block compute a reduction of its threads' sums in shared memory and write the result to out). But by compiling the program with the --keep option and using cuobjdump --dump-sass to examine the microcode, we can confirm that the compiler does not check out until after the doubly nested for loop has executed.
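If a future compiler ever did hoist the NULL check, the fallback described above might look something like the following sketch: a shared-memory reduction across the block, with thread 0 writing one value per block. The kernel name and the power-of-two, at-most-256-thread block assumption are ours.

```cuda
// Hypothetical variant of TexSums that always writes output: each block
// reduces its threads' partial sums in shared memory, and thread 0
// writes one float per block. Assumes blockDim.x*blockDim.y is a
// power of two and does not exceed 256.
extern "C" __global__ void
TexSumsReduce( float *out, size_t Width, size_t Height )
{
    __shared__ float sums[256];
    float sum = 0.0f;
    for ( int row = blockIdx.y*blockDim.y + threadIdx.y;
              row < Height;
              row += blockDim.y*gridDim.y ) {
        for ( int col = blockIdx.x*blockDim.x + threadIdx.x;
                  col < Width;
                  col += blockDim.x*gridDim.x ) {
            sum += tex2D( tex, (float) col, (float) row );
        }
    }
    int tid = threadIdx.y*blockDim.x + threadIdx.x;
    sums[tid] = sum;
    __syncthreads();
    // Tree reduction over the block's partial sums
    for ( int active = (blockDim.x*blockDim.y) >> 1; active > 0; active >>= 1 ) {
        if ( tid < active ) {
            sums[tid] += sums[tid+active];
        }
        __syncthreads();
    }
    if ( tid == 0 ) {
        out[blockIdx.y*gridDim.x + blockIdx.x] = sums[0];
    }
}
```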
Results
On a GeForce GTX 280 (GT200), the optimal block size was found to be 128 threads, which delivered 35.7 GB/s of bandwidth. Thread blocks of size 32W×4H were about the same speed as 16W×8H or 8W×16H, all traversing a 4K×4K texture of float in 1.88 ms. On a Tesla M2050, the optimal block size was found to be 192 threads, which delivered 35.4 GB/s of bandwidth. As with the GT200, differently shaped thread blocks ran at about the same speed, with 6W×32H, 16W×12H, and 8W×24H blocks delivering about the same performance.
The shmoo over 2D surface memset was less conclusive: block sizes of at least 128 threads generally performed well, provided the thread count was evenly divisible by the warp size of 32. The fastest 2D surface memset performance reported on a cg1.4xlarge instance without ECC enabled was 48 GB/s.
For float-valued data on both boards we tested, the peak bandwidths delivered by texturing and surface write are about ¼ and ½, respectively, of the achievable peaks for global load/store.