10.4 Texture As A Read Path
When using texture as a read path (that is, using the texturing hardware strictly to get around awkward coalescing constraints or to take advantage of the texture cache, as opposed to accessing hardware features such as linear interpolation), many texturing features are unavailable. The highlights of this usage of texture are as follows:
- The texture reference must be bound to device memory with cudaBindTexture() or cuTexRefSetAddress().
- The tex1Dfetch() intrinsic must be used. It takes a 27-bit integer index.
- tex1Dfetch() optionally can convert the texture contents to floating-point values. Unsigned integers are converted to floating-point values in the range [0.0, 1.0], signed integers to the range [-1.0, 1.0], and 16-bit floating-point values are promoted to float.
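For reference, the promotion arithmetic can be sketched on the host as follows. The function names are ours, not a CUDA API; the formulas follow the normalized-float conversion described in the CUDA Programming Guide:

```cpp
#include <algorithm>

// Host-side sketch of the texture unit's integer-to-float promotion
// (names are ours): unsigned types map onto [0.0, 1.0] by dividing by
// the type's maximum value; signed types map onto [-1.0, 1.0], with
// the most negative value clamped to -1.0.
inline float PromoteUChar( unsigned char c )   { return c / 255.0f; }
inline float PromoteUShort( unsigned short s ) { return s / 65535.0f; }
inline float PromoteSChar( signed char c )     { return std::max( c / 127.0f, -1.0f ); }
inline float PromoteSShort( short s )          { return std::max( s / 32767.0f, -1.0f ); }
```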
The benefits of reading device memory via tex1Dfetch() are twofold. First, memory read via texture does not have to conform to the coalescing constraints that apply when reading global memory. Second, the texture cache can be a useful complement to the other hardware resources, even the L2 cache on Fermi-class hardware.
When an out-of-range index is passed to tex1Dfetch(), it returns 0.
10.4.1 Increasing Effective Address Coverage
Since the 27-bit index specifies which texture element to fetch, and the texture elements may be up to 16 bytes in size, a texture being read via tex1Dfetch() can cover up to 31 bits (2^(27+4)) worth of memory[1]. One way to increase the amount of data being effectively covered by a texture is to use wider texture elements than the actual data size. For example, the application can texture from float4 instead of float, then select the appropriate element of the float4 depending on the least significant bits of the desired index. Similar techniques can be applied to integer data, especially 8- or 16-bit data where global memory transactions are always uncoalesced.
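The index arithmetic for the float4-instead-of-float technique is simple enough to sketch on the host (the struct and function names here are ours):

```cpp
#include <cstddef>

// Split a scalar element index into the float4 element to fetch and the
// component (0 = .x, 1 = .y, 2 = .z, 3 = .w) to select afterward.
// Fetching float4 instead of float quadruples the addressable range.
struct Float4Index {
    size_t fetchIndex;  // index to pass to tex1Dfetch()
    int component;      // which component of the returned float4
};

inline Float4Index SplitIndex( size_t scalarIndex )
{
    Float4Index fi;
    fi.fetchIndex = scalarIndex >> 2;
    fi.component = (int) (scalarIndex & 3);
    return fi;
}
```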
Alternatively, applications can alias multiple textures over different segments of the device memory, and perform predicated texture fetches from each texture in such a way that only one of them is “live.”
Microdemo: tex1dfetch_big.cu
This program illustrates using tex1Dfetch() to read from large arrays using both multiple components per texture, and multiple textures. It is invoked as follows:
tex1dfetch_big <NumMegabytes>
The application allocates the specified number of megabytes of device memory (or mapped pinned host memory, if the device memory allocation fails), fills the memory with random numbers, and uses 1-, 2- and 4-component textures to compute checksums on the data. Up to four textures of int4 can be used, enabling the application to texture from up to 8192M of memory.
For clarity, tex1dfetch_big.cu does not perform any fancy parallel reduction techniques: each thread writes back an intermediate sum, and the final checksums are accumulated on the CPU.
The application defines the 27-bit hardware limits:
#define CUDA_LG_MAX_TEX1DFETCH_INDEX 27
#define CUDA_MAX_TEX1DFETCH_INDEX \
    (((size_t)1<<CUDA_LG_MAX_TEX1DFETCH_INDEX)-1)
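As a sanity check, these macros give a maximum index of 2^27-1 and, since sizeof(int4) is 16, a maximum of 2^31 bytes (2GB) per int4 texture. The microdemo calls the latter quantity CUDA_MAX_BYTES_INT4; a host-side sketch:

```cpp
#include <cstddef>

#define CUDA_LG_MAX_TEX1DFETCH_INDEX 27
#define CUDA_MAX_TEX1DFETCH_INDEX \
    (((size_t)1<<CUDA_LG_MAX_TEX1DFETCH_INDEX)-1)

// Bytes covered by one int4 texture: 2^27 elements * 2^4 bytes = 2^31 bytes.
#define CUDA_MAX_BYTES_INT4 \
    ((size_t)1<<(CUDA_LG_MAX_TEX1DFETCH_INDEX+4))
```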
and four textures of int4:
texture<int4, 1, cudaReadModeElementType> tex4_0;
texture<int4, 1, cudaReadModeElementType> tex4_1;
texture<int4, 1, cudaReadModeElementType> tex4_2;
texture<int4, 1, cudaReadModeElementType> tex4_3;
A device function tex4Fetch() takes an index and teases it apart into a texture ordinal and a 27-bit index to pass to tex1Dfetch().
__device__ int4 tex4Fetch( size_t index )
{
    int texID = (int) (index >> CUDA_LG_MAX_TEX1DFETCH_INDEX);
    int i = (int) (index & CUDA_MAX_TEX1DFETCH_INDEX);
    int4 i4;

    if ( texID == 0 ) {
        i4 = tex1Dfetch( tex4_0, i );
    }
    else if ( texID == 1 ) {
        i4 = tex1Dfetch( tex4_1, i );
    }
    else if ( texID == 2 ) {
        i4 = tex1Dfetch( tex4_2, i );
    }
    else if ( texID == 3 ) {
        i4 = tex1Dfetch( tex4_3, i );
    }
    return i4;
}
This device function should compile to a small amount of code that uses four predicated TEX instructions, only one of which is “live.” If random access is desired, the application also can use predication to select from the .x, .y, .z or .w component of the int4 return value.
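The component selection can be sketched on the host as follows (the struct name is ours; on the GPU, the compiler can turn the conditionals into predicated moves rather than branches):

```cpp
// Select one 32-bit component of a 4-element vector by index (0..3),
// mirroring selection of .x/.y/.z/.w from an int4 on the device.
struct Int4Host { int x, y, z, w; };

inline int SelectComponent( const Int4Host &v, int c )
{
    int r = v.x;
    if ( c == 1 ) r = v.y;
    if ( c == 2 ) r = v.z;
    if ( c == 3 ) r = v.w;
    return r;
}
```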
Binding the textures, shown in Listing 10-2, is a slightly tricky business. This code creates two small arrays, texSizes[] and texBases[], and sets them up to cover the device memory range. The loop in lines 18-22 ensures that all four textures have a valid binding, even if fewer than four are needed to map the device memory.
 1  int iTexture;
 2  cudaChannelFormatDesc int4Desc = cudaCreateChannelDesc<int4>();
 3  size_t numInt4s = numBytes / sizeof(int4);
 4  int numTextures = (numInt4s+CUDA_MAX_TEX1DFETCH_INDEX)>>
 5      CUDA_LG_MAX_TEX1DFETCH_INDEX;
 6  size_t Remainder = numBytes & (CUDA_MAX_BYTES_INT4-1);
 7  if ( ! Remainder ) {
 8      Remainder = CUDA_MAX_BYTES_INT4;
 9  }
10
11  size_t texSizes[4];
12  char *texBases[4];
13  for ( iTexture = 0; iTexture < numTextures; iTexture++ ) {
14      texBases[iTexture] = deviceTex+iTexture*CUDA_MAX_BYTES_INT4;
15      texSizes[iTexture] = CUDA_MAX_BYTES_INT4;
16  }
17  texSizes[iTexture-1] = Remainder;
18  while ( iTexture < 4 ) {
19      texBases[iTexture] = texBases[iTexture-1];
20      texSizes[iTexture] = texSizes[iTexture-1];
21      iTexture++;
22  }
23  cudaBindTexture( NULL, tex4_0, texBases[0], int4Desc, texSizes[0] );
24  cudaBindTexture( NULL, tex4_1, texBases[1], int4Desc, texSizes[1] );
25  cudaBindTexture( NULL, tex4_2, texBases[2], int4Desc, texSizes[2] );
26  cudaBindTexture( NULL, tex4_3, texBases[3], int4Desc, texSizes[3] );
Listing 10-2. tex1dfetch_big.cu (excerpt)
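The texture-count and remainder arithmetic at the top of Listing 10-2 can be checked with a host-side sketch (CUDA_MAX_BYTES_INT4 is the 2GB-per-texture limit; the function name is ours):

```cpp
#include <cstddef>

#define CUDA_LG_MAX_TEX1DFETCH_INDEX 27
#define CUDA_MAX_TEX1DFETCH_INDEX \
    (((size_t)1<<CUDA_LG_MAX_TEX1DFETCH_INDEX)-1)
#define CUDA_MAX_BYTES_INT4 \
    ((size_t)1<<(CUDA_LG_MAX_TEX1DFETCH_INDEX+4))

// Mirror of Listing 10-2's computation: how many int4 textures are needed
// to cover numBytes, and how many bytes the last texture covers.
void ComputeCoverage( size_t numBytes, int *numTextures, size_t *remainder )
{
    size_t numInt4s = numBytes / 16;  // sizeof(int4) == 16
    *numTextures = (int) ((numInt4s+CUDA_MAX_TEX1DFETCH_INDEX) >>
                          CUDA_LG_MAX_TEX1DFETCH_INDEX);
    *remainder = numBytes & (CUDA_MAX_BYTES_INT4-1);
    if ( ! *remainder ) {
        *remainder = CUDA_MAX_BYTES_INT4;
    }
}
```

For 512MB, one texture covers everything; for 8192MB, all four textures are needed and the last one covers a full 2GB.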
Once compiled and run, the application can be invoked with different sizes to see the effects. On a CG1 instance running in Amazon’s EC2 cloud compute offering, invocations with 512M, 768M, 1280M and 8192M worked as follows:
$ ./tex1dfetch_big 512
Expected checksum: 0x7b7c8cd3
tex1 checksum:     0x7b7c8cd3
tex2 checksum:     0x7b7c8cd3
tex4 checksum:     0x7b7c8cd3

$ ./tex1dfetch_big 768
Expected checksum: 0x559a1431
tex1 checksum:     (not performed)
tex2 checksum:     0x559a1431
tex4 checksum:     0x559a1431

$ ./tex1dfetch_big 1280
Expected checksum: 0x66a4f9d9
tex1 checksum:     (not performed)
tex2 checksum:     (not performed)
tex4 checksum:     0x66a4f9d9

$ ./tex1dfetch_big 8192
Device alloc of 8192 Mb failed, trying mapped host memory
Expected checksum: 0xf049c607
tex1 checksum:     (not performed)
tex2 checksum:     (not performed)
tex4 checksum:     0xf049c607
Each int4 texture can “only” read 2G, so invoking the program with numbers greater than 8192 causes it to fail.
This application highlights the demand for indexed textures, where the texture to fetch from can be specified as a parameter at runtime; CUDA does not expose support for this feature.
10.4.2 Texturing From Host Memory
Using texture as a read path, applications can read from host memory by allocating mapped pinned memory, fetching the device pointer, then specifying that device pointer to cudaBindTexture() or cuTexRefSetAddress().
The capability is there, but reading host memory via texture is slow. Tesla-class hardware can texture over PCI Express at about 2GB/s, and Fermi-class hardware is much slower. You need some other reason to do it, such as code simplicity.
Microdemo: tex1dfetch_int2float.cu
This code fragment uses texture as a read path, together with texturing from host memory, to confirm that the TexPromoteToFloat() functions work properly. The CUDA kernel that we will use for this purpose is a straightforward, blocking-agnostic implementation of a memcpy function that reads from the texture and writes to device memory:
texture<signed char, 1, cudaReadModeNormalizedFloat> tex;

extern "C" __global__ void
TexReadout( float *out, size_t N )
{
    for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x;
                 i < N;
                 i += gridDim.x*blockDim.x ) {
        out[i] = tex1Dfetch( tex, i );
    }
}
Since promoting integers to floating point only works on 8- and 16-bit values, we can test every possible conversion by allocating a small buffer, texturing from it, and confirming that the output meets our expectations. Listing 10-3 gives an excerpt from tex1dfetch_int2float.cu. In lines 9-20, two host buffers are allocated: inHost holds the input buffer of 256 or 65536 input values, and foutHost holds the corresponding float-valued outputs. The device pointers corresponding to these mapped host pointers are fetched into inDevice and foutDevice.
In lines 22-24, the input values are initialized to every possible value of the type to be tested. In lines 27-31, the input device pointer is then bound to the texture reference using cudaBindTexture().
Line 32 invokes the kernel to read each value from the input texture and write as output the values returned by tex1Dfetch(). In this case, both the input and output buffers reside in mapped host memory. Because the kernel is writing directly to host memory, we must call cudaDeviceSynchronize() in line 33 to make sure there are no race conditions between the CPU and GPU. In lines 35-38, we call the TexPromoteToFloat() specialization corresponding to the type being tested and confirm that its output is equal to the value returned by the kernel. If all tests pass, the function returns true; if any API functions or comparisons fail, it returns false.
 1  template<class T>
 2  void
 3  CheckTexPromoteToFloat( size_t N )
 4  {
 5      T *inHost, *inDevice;
 6      float *foutHost, *foutDevice;
 7      cudaError_t status;
 8
 9      CUDART_CHECK(cudaHostAlloc( (void **) &inHost,
10          N*sizeof(T),
11          cudaHostAllocMapped));
12      CUDART_CHECK(cudaHostGetDevicePointer( (void **) &inDevice,
13          inHost,
14          0 ));
15      CUDART_CHECK(cudaHostAlloc( (void **) &foutHost,
16          N*sizeof(float),
17          cudaHostAllocMapped));
18      CUDART_CHECK(cudaHostGetDevicePointer( (void **) &foutDevice,
19          foutHost,
20          0 ));
21
22      for ( int i = 0; i < N; i++ ) {
23          inHost[i] = (T) i;
24      }
25      memset( foutHost, 0, N*sizeof(float) );
26
27      CUDART_CHECK( cudaBindTexture( NULL,
28          tex,
29          inDevice,
30          cudaCreateChannelDesc<T>(),
31          N*sizeof(T)));
32      TexReadout<<<2,384>>>( foutDevice, N );
33      CUDART_CHECK(cudaDeviceSynchronize());
34
35      for ( int i = 0; i < N; i++ ) {
36          printf( "%.2f ", foutHost[i] );
37          assert( foutHost[i] == TexPromoteToFloat( (T) i ) );
38      }
39      printf( "\n" );
40  Error:
41      cudaFreeHost( inHost );
42      cudaFreeHost( foutHost );
43  }
Listing 10-3. tex1dfetch_int2float.cu (excerpt)
[1] Both Tesla- and Fermi-class hardware have the same 27-bit limit, so there is not yet any way to query a device for the limit.