8.3. Floating-Point Support
Fast native floating-point hardware is the raison d’être for GPUs, and in many ways they are equal to or superior to CPUs in their floating-point implementation. Denormals are supported at full speed,^{9} directed rounding may be specified on a per-instruction basis, and the Special Function Units deliver high-performance approximation functions for six popular single-precision transcendentals. In contrast, x86 CPUs implement denormals in microcode that runs perhaps 100x slower than operating on normalized floating-point operands. Rounding direction is specified by a control word that takes dozens of clock cycles to change, and the only transcendental approximation functions in the SSE instruction set are for reciprocal and reciprocal square root, which give 12-bit approximations that must be refined with a Newton-Raphson iteration before being used.
Since GPUs’ greater core counts are offset somewhat by their lower clock frequencies, developers can expect at most a 10x (or thereabouts) speedup on a level playing field. If a paper reports a 100x or greater speedup from porting an optimized CPU implementation to CUDA, chances are one of the above-described “instruction set mismatches” played a role.
8.3.1. Formats
Figure 8.2 depicts the three IEEE standard floating-point formats supported by CUDA: double precision (64-bit), single precision (32-bit), and half precision (16-bit). The values are divided into three fields: sign, exponent, and mantissa. For double, single, and half, the exponent fields are 11, 8, and 5 bits in size, respectively; the corresponding mantissa fields are 52, 23, and 10 bits.
Figure 8.2 Floating-point formats.
The exponent field changes the interpretation of the floating-point value. The most common (“normal”) representation encodes an implicit 1 bit into the mantissa and multiplies that value by 2^{e−bias}, where bias is the value added to the actual exponent before encoding into the floating-point representation. The bias for single precision, for example, is 127.
Table 8.7 summarizes how floating-point values are encoded. For most exponent values (so-called “normal” floating-point values), the mantissa is assumed to have an implicit 1, and it is multiplied by the biased value of the exponent. The maximum exponent value is reserved for infinity and Not-A-Number values. Dividing by zero (or overflowing a division) yields infinity; performing an invalid operation (such as taking the square root or logarithm of a negative number) yields a NaN. The minimum exponent value is reserved for values too small to represent with the implicit leading 1. As the so-called denormals^{10} get closer to zero, they lose bits of effective precision, a phenomenon known as gradual underflow. Table 8.8 gives the encodings and values of certain extreme values for the three formats.
Table 8.7 Floating-Point Representations

DOUBLE PRECISION

| EXPONENT | MANTISSA | VALUE | CASE NAME |
| --- | --- | --- | --- |
| 0 | 0 | ±0 | Zero |
| 0 | Nonzero | ±2^{−1022}(0.mantissa) | Denormal |
| 1 to 2046 | Any | ±2^{e−1023}(1.mantissa) | Normal |
| 2047 | 0 | ±∞ | Infinity |
| 2047 | Nonzero | | Not-A-Number |

SINGLE PRECISION

| EXPONENT | MANTISSA | VALUE | CASE NAME |
| --- | --- | --- | --- |
| 0 | 0 | ±0 | Zero |
| 0 | Nonzero | ±2^{−126}(0.mantissa) | Denormal |
| 1 to 254 | Any | ±2^{e−127}(1.mantissa) | Normal |
| 255 | 0 | ±∞ | Infinity |
| 255 | Nonzero | | Not-A-Number |

HALF PRECISION

| EXPONENT | MANTISSA | VALUE | CASE NAME |
| --- | --- | --- | --- |
| 0 | 0 | ±0 | Zero |
| 0 | Nonzero | ±2^{−14}(0.mantissa) | Denormal |
| 1 to 30 | Any | ±2^{e−15}(1.mantissa) | Normal |
| 31 | 0 | ±∞ | Infinity |
| 31 | Nonzero | | Not-A-Number |
Table 8.8 Floating-Point Extreme Values

DOUBLE PRECISION

| | HEXADECIMAL | EXACT VALUE |
| --- | --- | --- |
| Smallest denormal | 0...0001 | 2^{−1074} |
| Largest denormal | 000F...F | 2^{−1022}(1−2^{−52}) |
| Smallest normal | 0010...0 | 2^{−1022} |
| 1.0 | 3FF0...0 | 1 |
| Maximum integer | 4340...0 | 2^{53} |
| Largest normal | 7FEF...F | 2^{1024}(1−2^{−53}) |
| Infinity | 7FF0...0 | Infinity |

SINGLE PRECISION

| | HEXADECIMAL | EXACT VALUE |
| --- | --- | --- |
| Smallest denormal | 00000001 | 2^{−149} |
| Largest denormal | 007FFFFF | 2^{−126}(1−2^{−23}) |
| Smallest normal | 00800000 | 2^{−126} |
| 1.0 | 3F800000 | 1 |
| Maximum integer | 4B800000 | 2^{24} |
| Largest normal | 7F7FFFFF | 2^{128}(1−2^{−24}) |
| Infinity | 7F800000 | Infinity |

HALF PRECISION

| | HEXADECIMAL | EXACT VALUE |
| --- | --- | --- |
| Smallest denormal | 0001 | 2^{−24} |
| Largest denormal | 03FF | 2^{−14}(1−2^{−10}) |
| Smallest normal | 0400 | 2^{−14} |
| 1.0 | 3C00 | 1 |
| Maximum integer | 6800 | 2^{11} |
| Largest normal | 7BFF | 2^{16}(1−2^{−11}) |
| Infinity | 7C00 | Infinity |
Rounding
The IEEE standard provides for four round modes.

• Round-to-nearest-even (also called “round-to-nearest”)
• Round toward zero (also called “truncate” or “chop”)
• Round down (or “round toward negative infinity”)
• Round up (or “round toward positive infinity”)

Round-to-nearest, where intermediate values are rounded to the nearest representable floating-point value after each operation, is by far the most commonly used round mode. Round up and round down (the “directed rounding modes”) are used for interval arithmetic, where a pair of floating-point values are used to bracket the intermediate result of a computation. To correctly bracket a result, the lower and upper values of the interval must be rounded toward negative infinity (“down”) and toward positive infinity (“up”), respectively.
The C language does not provide any way to specify round modes on a per-instruction basis, and CUDA hardware does not provide a control word to implicitly specify rounding modes. Consequently, CUDA provides a set of intrinsics to specify the round mode of an operation, as summarized in Table 8.9.
Table 8.9 Intrinsics for Rounding

INTRINSIC                   OPERATION
__fadd_[rn|rz|ru|rd]        Addition (single precision)
__fmul_[rn|rz|ru|rd]        Multiplication (single precision)
__fmaf_[rn|rz|ru|rd]        Fused multiply-add (single precision)
__frcp_[rn|rz|ru|rd]        Reciprocal (single precision)
__fdiv_[rn|rz|ru|rd]        Division (single precision)
__fsqrt_[rn|rz|ru|rd]       Square root (single precision)
__dadd_[rn|rz|ru|rd]        Addition (double precision)
__dmul_[rn|rz|ru|rd]        Multiplication (double precision)
__fma_[rn|rz|ru|rd]         Fused multiply-add (double precision)
__drcp_[rn|rz|ru|rd]        Reciprocal (double precision)
__ddiv_[rn|rz|ru|rd]        Division (double precision)
__dsqrt_[rn|rz|ru|rd]       Square root (double precision)
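To make the directed rounding modes concrete, here is a minimal sketch of interval arithmetic using the intrinsics from Table 8.9. The kernel name, the use of float2 to hold an interval’s lower and upper bounds, and the assumption that both input intervals are positive are illustrative choices, not part of the text.

#include <stdio.h>

// Sketch: bracketing a sum and a product with directed rounding.
// float2.x holds the lower bound, float2.y the upper bound (assumed positive).
__global__ void IntervalOps( float2 *sum, float2 *prod, float2 a, float2 b )
{
    sum->x  = __fadd_rd( a.x, b.x );    // lower bound rounds toward -infinity
    sum->y  = __fadd_ru( a.y, b.y );    // upper bound rounds toward +infinity
    prod->x = __fmul_rd( a.x, b.x );    // endpoint products suffice for positive intervals
    prod->y = __fmul_ru( a.y, b.y );
}

int main()
{
    float2 *d, h[2];
    cudaMalloc( (void **) &d, 2*sizeof(float2) );
    IntervalOps<<<1,1>>>( d, d+1, make_float2(1.1f, 1.2f), make_float2(2.3f, 2.4f) );
    cudaMemcpy( h, d, 2*sizeof(float2), cudaMemcpyDeviceToHost );
    printf( "sum  = [%.9g, %.9g]\n", h[0].x, h[0].y );
    printf( "prod = [%.9g, %.9g]\n", h[1].x, h[1].y );
    cudaFree( d );
    return 0;
}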
Conversion
In general, developers can convert between different floatingpoint representations and/or integers using standard C constructs: implicit conversion or explicit typecasts. If necessary, however, developers can use the intrinsics listed in Table 8.10 to perform conversions that are not in the C language specification, such as those with directed rounding.
Table 8.10 Intrinsics for Conversion

INTRINSIC                       OPERATION
__float2int_[rn|rz|ru|rd]       float to int
__float2uint_[rn|rz|ru|rd]      float to unsigned int
__int2float_[rn|rz|ru|rd]       int to float
__uint2float_[rn|rz|ru|rd]      unsigned int to float
__float2ll_[rn|rz|ru|rd]        float to 64-bit int
__ll2float_[rn|rz|ru|rd]        64-bit int to float
__ull2float_[rn|rz|ru|rd]       unsigned 64-bit int to float
__double2float_[rn|rz|ru|rd]    double to float
__double2int_[rn|rz|ru|rd]      double to int
__double2uint_[rn|rz|ru|rd]     double to unsigned int
__double2ll_[rn|rz|ru|rd]       double to 64-bit int
__double2ull_[rn|rz|ru|rd]      double to 64-bit unsigned int
__int2double_rn                 int to double
__uint2double_rn                unsigned int to double
__ll2double_[rn|rz|ru|rd]       64-bit int to double
__ull2double_[rn|rz|ru|rd]      unsigned 64-bit int to double
Because half is not standardized in the C programming language, CUDA uses unsigned short in the interfaces for __half2float() and __float2half(). __float2half() only supports the round-to-nearest rounding mode.
float __half2float( unsigned short );
unsigned short __float2half( float );
8.3.2. Single Precision (32-Bit)
Single-precision floating-point support is the workhorse of GPU computation. GPUs have been optimized to natively deliver high performance on this data type,^{11} not only for core standard IEEE operations such as addition and multiplication, but also for nonstandard operations such as approximations to transcendentals such as sin() and log(). The 32-bit values are held in the same register file as integers, so coercion between single-precision floating-point values and 32-bit integers (with __float_as_int() and __int_as_float()) is free.
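As a brief illustration of this coercion, the following kernel fragment (the name and the sign-flipping operation are purely illustrative) manipulates a float’s bits directly; no conversion instruction is generated.

// Sketch: flipping the sign of each element by manipulating the float's bits.
// The coercion intrinsics simply reinterpret the 32 bits in a register.
__global__ void SignFlip( float *out, const float *in, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n ) {
        unsigned int bits = (unsigned int) __float_as_int( in[i] );
        bits ^= 0x80000000u;                    // toggle the sign bit
        out[i] = __int_as_float( (int) bits );
    }
}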
Addition, Multiplication, and Multiply-Add
The compiler automatically translates +, –, and * operators on floating-point values into addition, multiplication, and multiply-add instructions. The __fadd_rn() and __fmul_rn() intrinsics may be used to suppress fusion of addition and multiplication operations into multiply-add instructions.
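The following sketch contrasts the two forms (the kernel names are illustrative): the first kernel leaves the compiler free to contract the expression into a multiply-add, while the second keeps the multiply and the add as two separately rounded operations.

// Sketch: the first kernel allows the compiler to contract a*b+c into a
// multiply-add instruction; the second forces two separately rounded operations.
__global__ void MulAddContracted( float *out, float a, float b, float c )
{
    *out = a*b + c;                              // may compile to a multiply-add
}

__global__ void MulAddSeparate( float *out, float a, float b, float c )
{
    *out = __fadd_rn( __fmul_rn( a, b ), c );    // multiply and add rounded separately
}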
Reciprocal and Division
For devices of compute capability 2.x and higher, the division operator is IEEE-compliant when the code is compiled with --prec-div=true. For devices of compute capability 1.x, or for devices of compute capability 2.x when the code is compiled with --prec-div=false, the division operator and __fdividef(x,y) have the same accuracy, but for 2^{126} < y < 2^{128}, __fdividef(x,y) delivers a result of zero, whereas the division operator delivers the correct result. Also, for 2^{126} < y < 2^{128}, if x is infinity, __fdividef(x,y) returns NaN, while the division operator returns infinity.
Transcendentals (SFU)
The Special Function Units (SFUs) in the SMs implement very fast versions of six common transcendental functions.
• Sine and cosine
• Logarithm and exponential
• Reciprocal and reciprocal square root
Table 8.11, excerpted from the paper on the Tesla architecture,^{12} summarizes the supported operations and corresponding precision. The SFUs do not implement full precision, but they are reasonably good approximations of these functions and they are fast. For CUDA ports that are significantly faster than an optimized CPU equivalent (say, 25x or more), the code most likely relies on the SFUs.
Table 8.11 SFU Accuracy

| FUNCTION | ACCURACY (GOOD BITS) | ULP ERROR |
| --- | --- | --- |
| 1/x | 24.02 | 0.98 |
| 1/sqrt(x) | 23.40 | 1.52 |
| 2^{x} | 22.51 | 1.41 |
| log2 x | 22.57 | n/a |
| sin/cos | 22.47 | n/a |
The SFUs are accessed with the intrinsics given in Table 8.12. Specifying the --use_fast_math compiler option will cause the compiler to substitute conventional C runtime calls with the corresponding SFU intrinsics listed in Table 8.12.
Table 8.12 SFU Intrinsics

| INTRINSIC | OPERATION |
| --- | --- |
| __cosf(x) | cos x |
| __exp10f(x) | 10^{x} |
| __expf(x) | e^{x} |
| __fdividef(x,y) | x/y |
| __log2f(x) | log2 x |
| __log10f(x) | log10 x |
| __powf(x,y) | x^{y} |
| __sinf(x) | sin x |
| __sincosf(x,sptr,cptr) | *sptr=sin(x); *cptr=cos(x); |
| __tanf(x) | tan x |
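As an illustrative (not authoritative) example, the kernel below evaluates a Gaussian using __fdividef() and __expf(); compiling the ordinary expression expf(-x*x/(2*sigma*sigma)) with --use_fast_math would cause the compiler to make a similar substitution automatically. The kernel and parameter names are assumptions.

// Sketch: evaluating exp(-x*x/(2*sigma*sigma)) with the SFU approximations.
// __fdividef() and __expf() trade a few low-order bits of accuracy for speed.
__global__ void FastGaussian( float *out, const float *x, float sigma, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n ) {
        float t = __fdividef( x[i]*x[i], 2.0f*sigma*sigma );
        out[i] = __expf( -t );
    }
}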
Miscellaneous
__saturate(x) returns 0 if x<0, 1 if x>1, and x otherwise.
8.3.3. Double Precision (64-Bit)
Double-precision floating-point support was added to CUDA with SM 1.3 (first implemented in the GeForce GTX 280), and much improved double-precision support (both functionality and performance) became available with SM 2.0. CUDA’s hardware support for double precision features full-speed denormals and, starting in SM 2.x, a native fused multiply-add instruction (FMA), compliant with IEEE 754-2008, that performs only one rounding step. Besides being an intrinsically useful operation, FMA enables full accuracy on certain functions that are converged using Newton-Raphson iteration.
As with single-precision operations, the compiler automatically translates standard C operators into multiplication, addition, and multiply-add instructions. The __dadd_rn() and __dmul_rn() intrinsics may be used to suppress fusion of addition and multiplication operations into multiply-add instructions.
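To illustrate why a fused multiply-add matters for Newton-Raphson refinement, here is a hedged sketch (not the hardware’s or the compiler’s actual division algorithm): starting from a single-precision reciprocal approximation, each iteration computes the residual with fma(), which would otherwise be largely lost to cancellation, and folds it back into the estimate. The function name is illustrative.

// Sketch: refining a reciprocal with Newton-Raphson iterations.
// fma() computes the residual 1 - a*y with a single rounding, so the
// correction is not lost to cancellation; each iteration roughly doubles
// the number of correct bits.
__device__ double RefinedReciprocal( double a )
{
    double y = (double) __frcp_rn( (float) a );   // ~24-bit starting estimate
    for ( int i = 0; i < 2; i++ ) {
        double e = fma( -a, y, 1.0 );             // residual, rounded once
        y = fma( y, e, y );                       // y += y*e
    }
    return y;
}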
8.3.4. Half Precision (16-Bit)
With 5 bits of exponent and 10 bits of significand, half values have enough precision for HDR (high dynamic range) images and can be used to hold other types of values that do not require float precision, such as angles. Half-precision values are intended for storage, not computation, so the hardware only provides instructions to convert to/from 32-bit.^{13} These instructions are exposed as the __halftofloat() and __floattohalf() intrinsics.
float __halftofloat( unsigned short );
unsigned short __floattohalf( float );
These intrinsics use unsigned short because the C language has not standardized the half floatingpoint type.
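A minimal sketch of the intended usage pattern, using the intrinsics as declared above (the kernel name and the in-place update are illustrative assumptions): values live in memory as 16-bit quantities and are widened to float only for arithmetic.

// Sketch: half precision used purely as a storage format. Elements are
// widened to float for the arithmetic and narrowed again before the store.
__global__ void ScaleHalfArray( unsigned short *data, float scale, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n ) {
        float x = __halftofloat( data[i] );       // 16-bit -> 32-bit
        data[i] = __floattohalf( x*scale );       // compute in float, store as half
    }
}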
8.3.5. Case Study: float→half Conversion
Studying the float→half conversion operation is a useful way to learn the details of floatingpoint encodings and rounding. Because it’s a simple unary operation, we can focus on the encoding and rounding without getting distracted by the details of floatingpoint arithmetic and the precision of intermediate representations.
When converting from float to half, the correct output for any float too large to represent is half infinity. Any float too small to represent as a half (even a denormal half) must be clamped to 0.0. The maximum float that rounds to half 0.0 is 0x32FFFFFF, or about 2.98×10^{−8}, while the smallest float that rounds to half infinity is 65520.0. float values inside this range can be converted to half by propagating the sign bit, rebiasing the exponent (since float has an 8-bit exponent biased by 127 and half has a 5-bit exponent biased by 15), and rounding the float mantissa to the nearest half mantissa value. Rounding is straightforward in all cases except when the input value falls exactly between the two possible output values. When this is the case, the IEEE standard specifies rounding to the “nearest even” value. In decimal arithmetic, this would mean rounding 1.5 to 2.0, but also rounding 2.5 to 2.0 and (for example) rounding 0.5 to 0.0.
Listing 8.3 shows a C routine that exactly replicates the float-to-half conversion operation, as implemented by CUDA hardware. The variables exp and mag contain the input exponent and “magnitude,” the mantissa and exponent together with the sign bit masked off. Many operations, such as comparisons and rounding operations, can be performed on the magnitude without separating the exponent and mantissa.
The macro LG_MAKE_MASK, used in Listing 8.3, creates a mask with a given bit count: #define LG_MAKE_MASK(bits) ((1<<bits)-1). A volatile union is used to treat the same 32-bit value as float and unsigned int; idioms such as *((float *) (&u)) are not portable. The routine first propagates the input sign bit and masks it off the input.
After extracting the magnitude and exponent, the function deals with the special case when the input float is INF or NaN, and does an early exit. Note that INF is signed, but NaN has a canonical unsigned value. Lines 50–80 clamp the input float value to the minimum or maximum values that correspond to representable half values and recompute the magnitude for clamped values. Don’t be fooled by the elaborate code constructing f32MinRInfin and f32MaxRf16_zero; those are constants with the values 0x477ff000 and 0x32ffffff, respectively.
The remainder of the routine deals with the cases of output normal and denormal (input denormals are clamped in the preceding code, so mag corresponds to a normal float). As with the clamping code, f32Minf16Normal is a constant, and its value is 0x38ffffff.
To construct a normal, the new exponent must be computed (lines 92 and 93) and the correctly rounded 10 bits of mantissa shifted into the output. To construct a denormal, the implicit 1 must be OR’d into the output mantissa and the resulting mantissa shifted by the amount corresponding to the input exponent. For both normals and denormals, rounding of the output mantissa is accomplished in two steps. The first step adds a mask of 1s that ends just short of the output’s LSB, as seen in Figure 8.3.
Figure 8.3 Rounding mask (half).
This operation increments the output mantissa if bit 12 of the input is set; if the input mantissa is all 1s, the overflow causes the output exponent to correctly increment. If we added one more 1 to the MSB of this adjustment, we’d have elementary school–style rounding where the tiebreak goes to the larger number. Instead, to implement round-to-nearest-even, we conditionally increment the output mantissa if the LSB of the 10-bit output is set (Figure 8.4). Note that these steps can be performed in either order or can be reformulated in many different ways.
Figure 8.4 Round-to-nearest-even (half).
Listing 8.3. ConvertFloatToHalf().
/*
 * exponent shift and mantissa bit count are the same.
 * When we are shifting, we use [f16|f32]ExpShift
 * When referencing the number of bits in the mantissa,
 * we use [f16|f32]MantissaBits
 */
const int f16ExpShift = 10;
const int f16MantissaBits = 10;

const int f16ExpBias = 15;
const int f16MinExp = -14;
const int f16MaxExp = 15;
const int f16SignMask = 0x8000;

const int f32ExpShift = 23;
const int f32MantissaBits = 23;
const int f32ExpBias = 127;
const int f32SignMask = 0x80000000;

unsigned short
ConvertFloatToHalf( float f )
{
    /*
     * Use a volatile union to portably coerce
     * 32-bit float into 32-bit integer
     */
    volatile union {
        float f;
        unsigned int u;
    } uf;
    uf.f = f;

    // return value: start by propagating the sign bit.
    unsigned short w = (uf.u >> 16) & f16SignMask;

    // Extract input magnitude and exponent
    unsigned int mag = uf.u & ~f32SignMask;
    int exp = (int) (mag >> f32ExpShift) - f32ExpBias;

    // Handle float32 Inf or NaN
    if ( exp == f32ExpBias+1 ) {        // INF or NaN
        if ( mag & LG_MAKE_MASK(f32MantissaBits) )
            return 0x7fff;              // NaN
        // INF - propagate sign
        return w|0x7c00;
    }

    /*
     * clamp float32 values that are not representable by float16
     */
    {
        // min float32 magnitude that rounds to float16 infinity
        unsigned int f32MinRInfin = (f16MaxExp+f32ExpBias) << f32ExpShift;
        f32MinRInfin |= LG_MAKE_MASK( f16MantissaBits+1 ) <<
                        (f32MantissaBits-f16MantissaBits-1);

        if ( mag > f32MinRInfin )
            mag = f32MinRInfin;
    }
    {
        // max float32 magnitude that rounds to float16 0.0
        unsigned int f32MaxRf16_zero = f16MinExp+f32ExpBias-
                                       (f32MantissaBits-f16MantissaBits-1);
        f32MaxRf16_zero <<= f32ExpShift;
        f32MaxRf16_zero |= LG_MAKE_MASK( f32MantissaBits );

        if ( mag < f32MaxRf16_zero )
            mag = f32MaxRf16_zero;
    }

    /*
     * compute exp again, in case mag was clamped above
     */
    exp = (mag >> f32ExpShift) - f32ExpBias;

    // min float32 magnitude that converts to float16 normal
    unsigned int f32Minf16Normal = ((f16MinExp+f32ExpBias)<<f32ExpShift);
    f32Minf16Normal |= LG_MAKE_MASK( f32MantissaBits );

    if ( mag >= f32Minf16Normal ) {
        //
        // Case 1: float16 normal
        //

        // Modify exponent to be biased for float16, not float32
        mag += (unsigned int) ((f16ExpBias-f32ExpBias)<<f32ExpShift);

        int RelativeShift = f32ExpShift-f16ExpShift;

        // add rounding bias
        mag += LG_MAKE_MASK(RelativeShift-1);

        // round-to-nearest even
        mag += (mag >> RelativeShift) & 1;

        w |= mag >> RelativeShift;
    }
    else {
        /*
         * Case 2: float16 denormal
         */

        // mask off exponent bits - now fraction only
        mag &= LG_MAKE_MASK(f32MantissaBits);

        // make implicit 1 explicit
        mag |= (1<<f32ExpShift);

        int RelativeShift = f32ExpShift-f16ExpShift+f16MinExp-exp;

        // add rounding bias
        mag += LG_MAKE_MASK(RelativeShift-1);

        // round-to-nearest even
        mag += (mag >> RelativeShift) & 1;

        w |= mag >> RelativeShift;
    }
    return w;
}
In practice, developers should convert float to half by using the __floattohalf() intrinsic, which the compiler translates to a single F2F machine instruction. This sample routine is provided purely to aid in understanding floating-point layout and rounding; also, examining all the special-case code for INF/NaN and denormal values helps to illustrate why these features of the IEEE spec have been controversial since its inception: They make hardware slower, more costly, or both due to increased silicon area and engineering effort for validation.
In the code accompanying this book, the ConvertFloatToHalf() routine in Listing 8.3 is incorporated into a program called float_to_float16.cu that tests its output for every 32-bit floating-point value.
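For readers who want to experiment without the book’s source code, a test harness along the same lines might look like the sketch below. The kernel name and chunking strategy are assumptions, and it links against ConvertFloatToHalf() from Listing 8.3 compiled as host code; the conversion intrinsic is used as declared earlier in this section.

#include <stdio.h>

unsigned short ConvertFloatToHalf( float f );   // host routine from Listing 8.3

// Convert a contiguous range of 32-bit patterns on the GPU.
__global__ void ConvertRange( unsigned short *out, unsigned int base, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n )
        out[i] = __floattohalf( __int_as_float( (int) (base + i) ) );
}

int main()
{
    const int chunk = 1<<20;
    unsigned short *d_out, *h_out = new unsigned short[chunk];
    cudaMalloc( (void **) &d_out, chunk*sizeof(unsigned short) );
    unsigned long long mismatches = 0;
    for ( unsigned long long base = 0; base < (1ULL<<32); base += chunk ) {
        ConvertRange<<<chunk/256, 256>>>( d_out, (unsigned int) base, chunk );
        cudaMemcpy( h_out, d_out, chunk*sizeof(unsigned short),
                    cudaMemcpyDeviceToHost );
        for ( int i = 0; i < chunk; i++ ) {
            union { unsigned int u; float f; } v;
            v.u = (unsigned int) base + i;
            if ( h_out[i] != ConvertFloatToHalf( v.f ) )
                mismatches++;
        }
    }
    printf( "%llu mismatches\n", mismatches );
    cudaFree( d_out );
    delete[] h_out;
    return 0;
}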
8.3.6. Math Library
CUDA includes a built-in math library modeled on the C runtime library, with a few small differences: CUDA hardware does not include a rounding mode register (instead, the round mode is encoded on a per-instruction basis),^{14} so functions such as rint() that reference the current rounding mode always round-to-nearest. Additionally, the hardware does not raise floating-point exceptions; results of aberrant operations, such as taking the square root of a negative number, are encoded as NaNs.
Table 8.13 lists the math library functions and the maximum error in ulps for each function. Most functions that operate on float have an “f” appended to the function name—for example, the functions that compute the sine function are as follows.
double sin( double angle );
float sinf( float angle );

These are denoted in Table 8.13 as, for example, sin[f].
Table 8.13 Math Library

| FUNCTION | OPERATION | EXPRESSION | ULP ERROR (32-BIT) | ULP ERROR (64-BIT) |
| --- | --- | --- | --- | --- |
| x+y | Addition | x+y | 0^{1} | 0 |
| x*y | Multiplication | x*y | 0^{1} | 0 |
| x/y | Division | x/y | 2^{2} | 0 |
| 1/x | Reciprocal | 1/x | 1^{2} | 0 |
| acos[f](x) | Inverse cosine | cos^{−1} x | 3 | 2 |
| acosh[f](x) | Inverse hyperbolic cosine | | 4 | 2 |
| asin[f](x) | Inverse sine | sin^{−1} x | 4 | 2 |
| asinh[f](x) | Inverse hyperbolic sine | | 3 | 2 |
| atan[f](x) | Inverse tangent | tan^{−1} x | 2 | 2 |
| atan2[f](y,x) | Inverse tangent of y/x | | 3 | 2 |
| atanh[f](x) | Inverse hyperbolic tangent | tanh^{−1} x | 3 | 2 |
| cbrt[f](x) | Cube root | | 1 | 1 |
| ceil[f](x) | “Ceiling,” nearest integer greater than or equal to x | ⌈x⌉ | 0 | 0 |
| copysign[f](x,y) | Sign of y, magnitude of x | | n/a | n/a |
| cos[f](x) | Cosine | cos x | 2 | 1 |
| cosh[f](x) | Hyperbolic cosine | | 2 | |
| cospi[f](x) | Cosine, scaled by π | cos πx | 2 | |
| erf[f](x) | Error function | | 3 | 2 |
| erfc[f](x) | Complementary error function | | 6 | 4 |
| erfcinv[f](y) | Inverse complementary error function | Return x for which y=1−erff(x) | 7 | 8 |
| erfcx[f](x) | Scaled error function | e^{x^2}·erfc(x) | 6 | 3 |
| erfinv[f](y) | Inverse error function | Return x for which y=erff(x) | 3 | 5 |
| exp[f](x) | Natural exponent | e^{x} | 2 | 1 |
| exp10[f](x) | Exponent (base 10) | 10^{x} | 2 | 1 |
| exp2[f](x) | Exponent (base 2) | 2^{x} | 2 | 1 |
| expm1[f](x) | Natural exponent, minus one | e^{x}−1 | 1 | 1 |
| fabs[f](x) | Absolute value | ∣x∣ | 0 | 0 |
| fdim[f](x,y) | Positive difference | x−y (x>y); +0 (x≤y); NaN (x or y NaN) | 0 | 0 |
| floor[f](x) | “Floor,” nearest integer less than or equal to x | ⌊x⌋ | 0 | 0 |
| fma[f](x,y,z) | Multiply-add | xy+z | 0 | 0 |
| fmax[f](x,y) | Maximum | x (x>y or isNaN(y)); y otherwise | 0 | 0 |
| fmin[f](x,y) | Minimum | x (x<y or isNaN(y)); y otherwise | 0 | 0 |
| fmod[f](x,y) | Floating-point remainder | | 0 | 0 |
| frexp[f](x,exp) | Fractional component | | 0 | 0 |
| hypot[f](x,y) | Length of hypotenuse | | 3 | 2 |
| ilogb[f](x) | Get exponent | | 0 | 0 |
| isfinite(x) | Nonzero if x is not ±INF | | n/a | n/a |
| isinf(x) | Nonzero if x is ±INF | | n/a | n/a |
| isnan(x) | Nonzero if x is a NaN | | n/a | n/a |
| j0[f](x) | Bessel function of the first kind (n=0) | J_{0}(x) | 9^{3} | 7^{3} |
| j1[f](x) | Bessel function of the first kind (n=1) | J_{1}(x) | 9^{3} | 7^{3} |
| jn[f](n,x) | Bessel function of the first kind | J_{n}(x) | * | * |
| ldexp[f](x,exp) | Scale by power of 2 | x·2^{exp} | 0 | 0 |
| lgamma[f](x) | Logarithm of gamma function | ln(Γ(x)) | 6^{4} | 4^{4} |
| llrint[f](x) | Round to long long | | 0 | 0 |
| llround[f](x) | Round to long long | | 0 | 0 |
| lrint[f](x) | Round to long | | 0 | 0 |
| lround[f](x) | Round to long | | 0 | 0 |
| log[f](x) | Natural logarithm | ln(x) | 1 | 1 |
| log10[f](x) | Logarithm (base 10) | log_{10} x | 3 | 1 |
| log1p[f](x) | Natural logarithm of x+1 | ln(x+1) | 2 | 1 |
| log2[f](x) | Logarithm (base 2) | log_{2} x | 3 | 1 |
| logb[f](x) | Get exponent | | 0 | 0 |
| modf[f](x,iptr) | Split fractional and integer parts | | 0 | 0 |
| nan[f](cptr) | Returns NaN | NaN | n/a | n/a |
| nearbyint[f](x) | Round to integer | | 0 | 0 |
| nextafter[f](x,y) | Returns the FP value closest to x in the direction of y | | n/a | n/a |
| normcdf[f](x) | Normal cumulative distribution | | 6 | 5 |
| normcdfinv[f](x) | Inverse normal cumulative distribution | | 5 | 8 |
| pow[f](x,y) | Power function | x^{y} | 8 | 2 |
| rcbrt[f](x) | Inverse cube root | | 2 | 1 |
| remainder[f](x,y) | Remainder | | 0 | 0 |
| remquo[f](x,y,iptr) | Remainder (also returns quotient) | | 0 | 0 |
| rsqrt[f](x) | Reciprocal square root | | 2 | 1 |
| rint[f](x) | Round to nearest int | | 0 | 0 |
| round[f](x) | Round to nearest int | | 0 | 0 |
| scalbln[f](x,n) | Scale x by 2^{n} (n is long int) | x·2^{n} | 0 | 0 |
| scalbn[f](x,n) | Scale x by 2^{n} (n is int) | x·2^{n} | 0 | 0 |
| signbit(x) | Nonzero if x is negative | | n/a | n/a |
| sin[f](x) | Sine | sin x | 2 | 1 |
| sincos[f](x,s,c) | Sine and cosine | *s=sin(x); *c=cos(x); | 2 | 1 |
| sincospi[f](x,s,c) | Sine and cosine, scaled by π | *s=sin(πx); *c=cos(πx); | 2 | 1 |
| sinh[f](x) | Hyperbolic sine | | 3 | 1 |
| sinpi[f](x) | Sine, scaled by π | sin πx | 2 | 1 |
| sqrt[f](x) | Square root | | 3^{5} | 0 |
| tan[f](x) | Tangent | tan x | 4 | 2 |
| tanh[f](x) | Hyperbolic tangent | | 2 | 1 |
| tgamma[f](x) | True gamma function | Γ(x) | 11 | 8 |
| trunc[f](x) | Truncate (round to integer toward zero) | | 0 | 0 |
| y0[f](x) | Bessel function of the second kind (n=0) | Y_{0}(x) | 9^{3} | 7^{3} |
| y1[f](x) | Bessel function of the second kind (n=1) | Y_{1}(x) | 9^{3} | 7^{3} |
| yn[f](n,x) | Bessel function of the second kind | Y_{n}(x) | ** | ** |
* For the Bessel functions jnf(n,x) and jn(n,x), for n=128 the maximum absolute error is 2.2×10^{−6} and 5×10^{−12}, respectively.
** For the Bessel function ynf(n,x), the error is ⌈2 + 2.5n⌉ ulps for |x| < n; otherwise, the maximum absolute error is 2.2×10^{−6} for n=128. For yn(n,x), the maximum absolute error is 5×10^{−12}.
1. On SM 1.x class hardware, the precision of addition and multiplication operations that are merged into FMAD instructions will suffer due to truncation of the intermediate mantissa.
2. On SM 2.x and later hardware, developers can reduce this error rate to 0 ulps by specifying --prec-div=true.
3. For float, the error is 9 ulps for |x| < 8; otherwise, the maximum absolute error is 2.2×10^{−6}. For double, the error is 7 ulps for |x| < 8; otherwise, the maximum absolute error is 5×10^{−12}.
4. The error for lgammaf() is greater than 6 ulps inside the interval [−10.001, −2.264]. The error for lgamma() is greater than 4 ulps inside the interval [−11.001, −2.2637].
5. On SM 2.x and later hardware, developers can reduce this error rate to 0 ulps by specifying --prec-sqrt=true.
Conversion to Integer
According to the C runtime library definition, the nearbyint() and rint() functions round a floating-point value to the nearest integer using the “current rounding direction,” which in CUDA is always round-to-nearest-even. In the C runtime, nearbyint() and rint() differ only in their handling of the INEXACT exception. But since CUDA does not raise floating-point exceptions, the functions behave identically.
round() implements elementary school–style rounding: For floating-point values halfway between integers, the input is always rounded away from zero. NVIDIA recommends against using this function because it expands to eight instructions as opposed to one for rint() and its variants. trunc() truncates or “chops” the floating-point value, rounding toward zero. It compiles to a single instruction.
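A small sketch of the difference between these rounding functions (the kernel name is illustrative, and device-side printf is used purely for demonstration):

#include <stdio.h>

// Sketch: contrasting round-to-nearest-even (rintf), round-half-away-from-zero
// (roundf), and truncation (truncf). Device-side printf requires SM 2.0 or later.
__global__ void RoundingDemo()
{
    printf( "rintf(0.5)=%g  rintf(1.5)=%g\n", rintf(0.5f), rintf(1.5f) );       // 0 2
    printf( "roundf(0.5)=%g roundf(1.5)=%g\n", roundf(0.5f), roundf(1.5f) );    // 1 2
    printf( "truncf(-1.7)=%g truncf(1.7)=%g\n", truncf(-1.7f), truncf(1.7f) );  // -1 1
}

int main()
{
    RoundingDemo<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}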
Fractions and Exponents
float frexpf(float x, int *eptr);
frexpf() breaks the input into a floating-point significand in the range [0.5, 1.0) and an integer exponent such that
x = Significand · 2^{Exponent}
float logbf( float x );
logbf() extracts the exponent from x and returns it as a floatingpoint value. It is equivalent to floorf(log2f(x)), except it is faster. If x is a denormal, logbf() returns the exponent that x would have if it were normalized.
float ldexpf( float x, int exp );
float scalbnf( float x, int n );
float scalblnf( float x, long n );
ldexpf(), scalbnf(), and scalblnf() all compute x·2^{n} by direct manipulation of floating-point exponents.
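For example, a round trip through frexpf() and ldexpf() reconstructs the original value exactly; the kernel below is an illustrative sketch.

#include <stdio.h>

// Sketch: decomposing a value with frexpf() and reconstructing it with ldexpf().
__global__ void FrexpDemo( float x )
{
    int e;
    float sig = frexpf( x, &e );        // x = sig * 2^e, sig in [0.5, 1.0)
    printf( "%g = %g * 2^%d; ldexpf() rebuilds %g\n", x, sig, e, ldexpf( sig, e ) );
}

int main()
{
    FrexpDemo<<<1,1>>>( 48.0f );        // expect 48 = 0.75 * 2^6
    cudaDeviceSynchronize();
    return 0;
}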
Floating-Point Remainder
modff() breaks the input into fractional and integer parts.
float modff( float x, float *intpart );
The return value is the fractional part of x, with the same sign.
remainderf(x,y) computes the floating-point remainder of dividing x by y. The return value is x−n*y, where n is x/y, rounded to the nearest integer. If x/y is exactly halfway between two integers, n is chosen to be even.
float remquof(float x, float y, int *quo);
computes the remainder and passes back the lower bits of the integral quotient x/y, with the same sign as x/y.
Bessel Functions
The Bessel functions of order n relate to the differential equation

x^{2}(d^{2}y/dx^{2}) + x(dy/dx) + (x^{2} − n^{2})y = 0
n can be a real number, but for purposes of the C runtime, it is a nonnegative integer.
The solution to this second-order ordinary differential equation combines Bessel functions of the first kind and of the second kind: y(x) = c_{1}J_{n}(x) + c_{2}Y_{n}(x).
The math runtime functions jn[f]() and yn[f]() compute J_{n}(x) and Y_{n}(x), respectively. j0f(), j1f(), y0f(), and y1f() compute these functions for the special cases of n=0 and n=1.
Gamma Function
The gamma function Γ is an extension of the factorial function, with its argument shifted down by 1, to real numbers. It has a variety of definitions, one of which is as follows.

Γ(x) = ∫_{0}^{∞} t^{x−1}e^{−t} dt
The function grows so quickly that the return value loses precision for relatively small input values, so the library provides the lgamma() function, which returns the natural logarithm of the gamma function, in addition to the tgamma() (“true gamma”) function.
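For example, a quantity such as a binomial coefficient, whose factorials would overflow float almost immediately, can be evaluated in log space with lgammaf(); the kernel below is an illustrative sketch, not library code.

// Sketch: evaluating a binomial coefficient in log space. Forming the
// factorials directly with tgammaf() would overflow float for modest n.
__global__ void Binomial( float *out, int n, int k )
{
    // C(n,k) = Gamma(n+1) / (Gamma(k+1) * Gamma(n-k+1))
    float logC = lgammaf( n+1.0f ) - lgammaf( k+1.0f ) - lgammaf( n-k+1.0f );
    *out = expf( logC );
}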
8.3.7. Additional Reading
Goldberg’s survey (with the captivating title “What Every Computer Scientist Should Know About Floating-Point Arithmetic”) is a good introduction to the topic.
http://download.oracle.com/docs/cd/E19957-01/806-3568/ncg_goldberg.html
Nathan Whitehead and Alex Fit-Florea of NVIDIA have coauthored a white paper entitled “Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs.”
http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
Increasing Effective Precision
Dekker and Kahan developed methods to almost double the effective precision of floating-point hardware using pairs of numbers in exchange for a slight reduction in exponent range (due to intermediate underflow and overflow at the far ends of the range). Some papers on this topic include the following; a sketch of the error-free building blocks they share appears after the references.
Dekker, T.J. A floating-point technique for extending the available precision. Numer. Math. 18, 1971, pp. 224–242.
Linnainmaa, S. Software for doubled-precision floating-point computations. ACM TOMS 7, 1981, pp. 272–283.
Shewchuk, J.R. Adaptive precision floatingpoint arithmetic and fast robust geometric predicates. Discrete & Computational Geometry 18, 1997, pp. 305–363.
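Referenced above: a minimal sketch of the error-free transforms these papers share, expressed as CUDA device functions. The function names are mine; the algorithms are the standard two-sum (Knuth) and the FMA-based two-product, and the sketch assumes the compiler does not reassociate floating-point expressions (nvcc’s default behavior).

// Sketch: the error-free transforms at the heart of float-float arithmetic.
// TwoSum and TwoProd return a pair (hi, lo) whose exact sum equals a+b or a*b,
// respectively; hi is the rounded result and lo is the rounding error.
__device__ float2 TwoSum( float a, float b )
{
    float hi = a + b;
    float bb = hi - a;
    float lo = (a - (hi - bb)) + (b - bb);
    return make_float2( hi, lo );
}

__device__ float2 TwoProd( float a, float b )
{
    float hi = a*b;
    float lo = fmaf( a, b, -hi );   // exact rounding error of the product
    return make_float2( hi, lo );
}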
Some GPU-specific work on this topic has been done by Andrew Thall, Da Graça, and Defour.
Da Graça, Guillaume, and David Defour. Implementation of float-float operators on graphics hardware. 7th Conference on Real Numbers and Computers, RNC7 (2006).
http://hal.archives-ouvertes.fr/docs/00/06/33/56/PDF/float-float.pdf
Thall, Andrew. Extended-precision floating-point numbers for GPU computation. 2007.