10.9 2D Texturing: Copy Avoidance
When CUDA was first introduced, CUDA kernels could read from CUDA arrays only via texture. Applications could write to CUDA arrays only with memory copies; in order for CUDA kernels to write data that would then be read through texture, they had to write to device memory and then perform a device→array memcpy. Since then, two mechanisms have been added that remove this step for 2D textures:
- a 2D texture can be bound to a pitch-allocated range of linear device memory, and
- surface load/store intrinsics enable CUDA kernels to write to CUDA arrays directly.
3D texturing from device memory and 3D surface load/store are not supported.
For applications that read most or all of the texture contents with a regular access pattern (such as a video codec), or applications that must work on Tesla-class hardware, it is best to keep the data in device memory. For applications that perform random (but localized) access when texturing, it is probably best to keep the data in CUDA arrays and use surface read/write intrinsics.
10.9.1 2D Texturing From Device Memory
Texturing from 2D device memory is available on all CUDA platforms. Texturing from device memory does not confer any of the benefits of “block linear” addressing – a cache line fill into the texture cache pulls in a horizontal span of texels, not a 2D or 3D block of them – but unless the application performs random access into the texture, the benefits of avoiding a copy from device memory to a CUDA array likely outweigh the penalties of losing block linear addressing.
To bind a 2D texture reference to a device memory range, call cudaBindTexture2D():
CUDART_CHECK(cudaBindTexture2D( NULL,
                                &tex,
                                texDevice,
                                &channelDesc,
                                inWidth,
                                inHeight,
                                texPitch ));
The above call binds the texture reference tex to the 2D device memory range given by texDevice / texPitch. The base address and pitch must conform to hardware-specific alignment constraints[2]: the base address must be aligned with respect to cudaDeviceProp.textureAlignment, and the pitch must be aligned with respect to cudaDeviceProp.texturePitchAlignment[3].
The microdemo tex2d_addressing_device.cu is identical to tex2d_addressing.cu, except that it uses device memory to hold the texture data. The two programs are designed to be so similar that they can be compared side by side to see the differences. A device pointer/pitch tuple is declared instead of a CUDA array:
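As a minimal illustration of the alignment constraints just described, the snippet below queries the two relevant cudaDeviceProp fields for device 0. This is a hedged sketch: cudaMallocPitch() normally satisfies these constraints automatically, so an application only needs these values if it chooses base addresses and pitches itself.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    // Query the alignment requirements that a manually chosen
    // base address and pitch would have to satisfy.
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );
    printf( "textureAlignment:      %zu bytes\n", prop.textureAlignment );
    printf( "texturePitchAlignment: %zu bytes\n", prop.texturePitchAlignment );
    return 0;
}
```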
< cudaArray *texArray = 0;
> T *texDevice = 0;
> size_t texPitch;
cudaMallocPitch() is called instead of cudaMallocArray(). cudaMallocPitch() delegates selection of the base address and pitch to the driver, so the code will continue working on future generations of hardware (which have a tendency to increase alignment requirements):
< CUDART_CHECK(cudaMallocArray( &texArray,
<                               &channelDesc,
<                               inWidth,
> CUDART_CHECK(cudaMallocPitch( &texDevice,
>                               &texPitch,
>                               inWidth*sizeof(T),
                                inHeight));
Next, cudaBindTexture2D() is called instead of cudaBindTextureToArray().
< CUDART_CHECK(cudaBindTextureToArray(tex, texArray));
> CUDART_CHECK(cudaBindTexture2D( NULL,
>                                 &tex,
>                                 texDevice,
>                                 &channelDesc,
>                                 inWidth,
>                                 inHeight,
>                                 texPitch ));
The final difference is that instead of freeing the CUDA array, cudaFree() is called on the pointer returned by cudaMallocPitch().
< cudaFreeArray( texArray );
> cudaFree( texDevice );
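The differences above can be assembled into one sequence. The sketch below – a hedged illustration, not the microdemo's actual source – allocates pitched device memory, binds it to a 2D float texture reference, and frees it when done. The names tex, inWidth, and inHeight follow the surrounding text; error checking is abbreviated for brevity.

```cuda
#include <cuda_runtime.h>

// 2D texture reference, read in kernels via tex2D( tex, x, y ).
texture<float, 2> tex;

void bindPitchLinearTexture( int inWidth, int inHeight )
{
    float *texDevice = 0;
    size_t texPitch;
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();

    // Driver picks a base address and pitch that satisfy the
    // hardware's alignment constraints.
    cudaMallocPitch( &texDevice, &texPitch,
                     inWidth*sizeof(float), inHeight );
    cudaBindTexture2D( NULL, &tex, texDevice, &channelDesc,
                       inWidth, inHeight, texPitch );

    // ... write into texDevice, launch kernels that read via tex2D() ...

    cudaFree( texDevice );
}
```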
10.9.2 2D Surface Read/Write
As with 1D surface read/write, Fermi-class hardware enables kernels to write directly into CUDA arrays with intrinsic surface read/write functions:
template<class Type>
Type surf2Dread( surface<void, 2> surfRef,
                 int x, int y,
                 boundaryMode = cudaBoundaryModeTrap );
template<class Type>
void surf2Dwrite( Type data,
                  surface<void, 2> surfRef,
                  int x, int y,
                  boundaryMode = cudaBoundaryModeTrap );
The surface reference declaration and corresponding CUDA kernel for 2D surface memset, given in surf2Dmemset.cu, are as follows:
surface<void, 2> surf2D;

template<typename T>
__global__ void
surf2Dmemset_kernel( T value,
                     int xOffset, int yOffset,
                     int Width, int Height )
{
    for ( int row = blockIdx.y*blockDim.y + threadIdx.y;
          row < Height;
          row += blockDim.y*gridDim.y )
    {
        for ( int col = blockIdx.x*blockDim.x + threadIdx.x;
              col < Width;
              col += blockDim.x*gridDim.x )
        {
            surf2Dwrite( value,
                         surf2D,
                         (xOffset+col)*sizeof(T),
                         yOffset+row );
        }
    }
}
Remember, the X offset parameter to surf2Dwrite() is given in bytes.
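For context, the kernel above needs host-side setup before it can be launched: the CUDA array must be created with the surface load/store flag and bound to the surface reference. The sketch below illustrates this under stated assumptions – the element type (float), fill value, and launch configuration are arbitrary, and error checking is omitted.

```cuda
// Assumes surf2D and surf2Dmemset_kernel are declared as in the text.
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray *array = 0;

// cudaArraySurfaceLoadStore is required for surface read/write access.
cudaMallocArray( &array, &desc, Width, Height, cudaArraySurfaceLoadStore );
cudaBindSurfaceToArray( surf2D, array );

// Fill the entire Width x Height array with 3.141f; the kernel
// scales the X offset by sizeof(T) internally.
surf2Dmemset_kernel<<<dim3(8,8), dim3(16,16)>>>(
    3.141f,        // value
    0, 0,          // xOffset, yOffset (in elements)
    Width, Height );
```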
[2] CUDA arrays must conform to the same constraints, but in that case the base address and pitch are managed by CUDA and hidden along with the memory layout.
[3] In the driver API, the corresponding device attribute queries are CU_DEVICE_ATTRIBUTE_TEXTURE_ALIGNMENT and CU_DEVICE_ATTRIBUTE_TEXTURE_PITCH_ALIGNMENT.