
8.3. Floating-Point Support

Fast native floating-point hardware is the raison d’être for GPUs, and in many ways they are equal to or superior to CPUs in their floating-point implementation. Denormals are supported at full speed,9 directed rounding may be specified on a per-instruction basis, and the Special Function Units deliver high-performance approximations to six popular single-precision transcendental functions. In contrast, x86 CPUs implement denormals in microcode that runs perhaps 100x slower than operating on normalized floating-point operands. Rounding direction is specified by a control word that takes dozens of clock cycles to change, and the only transcendental approximation functions in the SSE instruction set are for reciprocal and reciprocal square root, which give 12-bit approximations that must be refined with a Newton-Raphson iteration before being used.

Since GPUs’ greater core counts are offset somewhat by their lower clock frequencies, developers can expect at most a 10x (or thereabouts) speedup on a level playing field. If a paper reports a 100x or greater speedup from porting an optimized CPU implementation to CUDA, chances are one of the above-described “instruction set mismatches” played a role.

8.3.1. Formats

Figure 8.2 depicts the three IEEE standard floating-point formats supported by CUDA: double precision (64-bit), single precision (32-bit), and half precision (16-bit). The values are divided into three fields: sign, exponent, and mantissa. For double, single, and half, the exponent fields are 11, 8, and 5 bits in size, respectively; the corresponding mantissa fields are 52, 23, and 10 bits.

Figure 8.2 Floating-point formats.

The exponent field changes the interpretation of the floating-point value. The most common (“normal”) representation encodes an implicit 1 bit into the mantissa and multiplies that value by 2^(e-bias), where bias is the value added to the actual exponent before encoding it into the floating-point representation. The bias for single precision, for example, is 127.

Table 8.7 summarizes how floating-point values are encoded. For most exponent values (so-called “normal” floating-point values), the mantissa is assumed to have an implicit 1, and it is multiplied by 2 raised to the unbiased exponent (e-bias). The maximum exponent value is reserved for infinity and Not-A-Number values. Dividing by zero (or overflowing a division) yields infinity; performing an invalid operation (such as taking the square root or logarithm of a negative number) yields a NaN. The minimum exponent value is reserved for values too small to represent with the implicit leading 1. As the so-called denormals10 get closer to zero, they lose bits of effective precision, a phenomenon known as gradual underflow. Table 8.8 gives the encodings and values of certain extreme values for the three formats.

Table 8.7 Floating-Point Representations

DOUBLE PRECISION

EXPONENT     MANTISSA    VALUE                       CASE NAME
0            0           ±0                          Zero
0            Nonzero     ±2^-1022 [0.mantissa]       Denormal
1 to 2046    Any         ±2^(e-1023) [1.mantissa]    Normal
2047         0           ±∞                          Infinity
2047         Nonzero                                 Not-A-Number

SINGLE PRECISION

EXPONENT     MANTISSA    VALUE                       CASE NAME
0            0           ±0                          Zero
0            Nonzero     ±2^-126 [0.mantissa]        Denormal
1 to 254     Any         ±2^(e-127) [1.mantissa]     Normal
255          0           ±∞                          Infinity
255          Nonzero                                 Not-A-Number

HALF PRECISION

EXPONENT     MANTISSA    VALUE                       CASE NAME
0            0           ±0                          Zero
0            Nonzero     ±2^-14 [0.mantissa]         Denormal
1 to 30      Any         ±2^(e-15) [1.mantissa]      Normal
31           0           ±∞                          Infinity
31           Nonzero                                 Not-A-Number

Table 8.8 Floating-Point Extreme Values

DOUBLE PRECISION

                    HEXADECIMAL    EXACT VALUE
Smallest denormal   0...0001       2^-1074
Largest denormal    000F...F       2^-1022 (1-2^-52)
Smallest normal     0010...0       2^-1022
1.0                 3FF0...0       1
Maximum integer     4340...0       2^53
Largest normal      7FEF...F       2^1024 (1-2^-53)
Infinity            7FF0...0       Infinity

SINGLE PRECISION

                    HEXADECIMAL    EXACT VALUE
Smallest denormal   00000001       2^-149
Largest denormal    007FFFFF       2^-126 (1-2^-23)
Smallest normal     00800000       2^-126
1.0                 3F800000       1
Maximum integer     4B800000       2^24
Largest normal      7F7FFFFF       2^128 (1-2^-24)
Infinity            7F800000       Infinity

HALF PRECISION

                    HEXADECIMAL    EXACT VALUE
Smallest denormal   0001           2^-24
Largest denormal    03FF           2^-14 (1-2^-10)
Smallest normal     0400           2^-14
1.0                 3C00           1
Maximum integer     6800           2^11
Largest normal      7BFF           2^16 (1-2^-11)
Infinity            7C00           Infinity

Rounding

The IEEE standard provides for four round modes.

  • Round-to-nearest-even (also called “round-to-nearest”)
  • Round toward zero (also called “truncate” or “chop”)
  • Round down (or “round toward negative infinity”)
  • Round up (or “round toward positive infinity”)

Round-to-nearest, where intermediate values are rounded to the nearest representable floating-point value after each operation, is by far the most commonly used round mode. Round up and round down (the “directed rounding modes”) are used for interval arithmetic, where a pair of floating-point values are used to bracket the intermediate result of a computation. To correctly bracket a result, the lower and upper values of the interval must be rounded toward negative infinity (“down”) and toward positive infinity (“up”), respectively.

The C language does not provide any way to specify round modes on a per-instruction basis, and CUDA hardware does not provide a control word to implicitly specify rounding modes. Consequently, CUDA provides a set of intrinsics to specify the round mode of an operation, as summarized in Table 8.9.

Table 8.9 Intrinsics for Rounding

INTRINSIC                 OPERATION
__fadd_[rn|rz|ru|rd]      Addition (single)
__fmul_[rn|rz|ru|rd]      Multiplication (single)
__fmaf_[rn|rz|ru|rd]      Fused multiply-add (single)
__frcp_[rn|rz|ru|rd]      Reciprocal (single)
__fdiv_[rn|rz|ru|rd]      Division (single)
__fsqrt_[rn|rz|ru|rd]     Square root (single)
__dadd_[rn|rz|ru|rd]      Addition (double)
__dmul_[rn|rz|ru|rd]      Multiplication (double)
__fma_[rn|rz|ru|rd]       Fused multiply-add (double)
__drcp_[rn|rz|ru|rd]      Reciprocal (double)
__ddiv_[rn|rz|ru|rd]      Division (double)
__dsqrt_[rn|rz|ru|rd]     Square root (double)

Conversion

In general, developers can convert between different floating-point representations and/or integers using standard C constructs: implicit conversion or explicit typecasts. If necessary, however, developers can use the intrinsics listed in Table 8.10 to perform conversions that are not in the C language specification, such as those with directed rounding.

Table 8.10 Intrinsics for Conversion

INTRINSIC                       OPERATION
__float2int_[rn|rz|ru|rd]       float to int
__float2uint_[rn|rz|ru|rd]      float to unsigned int
__int2float_[rn|rz|ru|rd]       int to float
__uint2float_[rn|rz|ru|rd]      unsigned int to float
__float2ll_[rn|rz|ru|rd]        float to 64-bit int
__ll2float_[rn|rz|ru|rd]        64-bit int to float
__ull2float_[rn|rz|ru|rd]       unsigned 64-bit int to float
__double2float_[rn|rz|ru|rd]    double to float
__double2int_[rn|rz|ru|rd]      double to int
__double2uint_[rn|rz|ru|rd]     double to unsigned int
__double2ll_[rn|rz|ru|rd]       double to 64-bit int
__double2ull_[rn|rz|ru|rd]      double to 64-bit unsigned int
__int2double_rn                 int to double
__uint2double_rn                unsigned int to double
__ll2double_[rn|rz|ru|rd]       64-bit int to double
__ull2double_[rn|rz|ru|rd]      unsigned 64-bit int to double

Because half is not standardized in the C programming language, CUDA uses unsigned short in the interfaces for __half2float() and __float2half(). __float2half() only supports the round-to-nearest rounding mode.

float __half2float( unsigned short );
unsigned short __float2half( float );

8.3.2. Single Precision (32-Bit)

Single-precision floating-point support is the workhorse of GPU computation. GPUs have been optimized to natively deliver high performance on this data type,11 not only for core standard IEEE operations such as addition and multiplication, but also for nonstandard operations such as approximations to transcendentals such as sin() and log(). The 32-bit values are held in the same register file as integers, so coercion between single-precision floating-point values and 32-bit integers (with __float_as_int() and __int_as_float()) is free.

Addition, Multiplication, and Multiply-Add

The compiler automatically translates +, –, and * operators on floating-point values into addition, multiplication, and multiply-add instructions. The __fadd_rn() and __fmul_rn() intrinsics may be used to suppress fusion of addition and multiplication operations into multiply-add instructions.

Reciprocal and Division

For devices of compute capability 2.x and higher, the division operator is IEEE-compliant when the code is compiled with --prec-div=true. For devices of compute capability 1.x, or for devices of compute capability 2.x when the code is compiled with --prec-div=false, the division operator and __fdividef(x,y) have the same accuracy, but for 2^126 < y < 2^128, __fdividef(x,y) delivers a result of zero, whereas the division operator delivers the correct result. Also, for 2^126 < y < 2^128, if x is infinity, __fdividef(x,y) returns NaN, while the division operator returns infinity.

Transcendentals (SFU)

The Special Function Units (SFUs) in the SMs implement very fast versions of six common transcendental functions.

  • Sine and cosine
  • Logarithm and exponential
  • Reciprocal and reciprocal square root

Table 8.11, excerpted from the paper on the Tesla architecture,12 summarizes the supported operations and corresponding precision. The SFUs do not implement full precision, but they are reasonably good approximations of these functions and they are fast. For CUDA ports that are significantly faster than an optimized CPU equivalent (say, 25x or more), the code most likely relies on the SFUs.

Table 8.11 SFU Accuracy

FUNCTION     ACCURACY (GOOD BITS)    ULP ERROR
1/x          24.02                   0.98
1/sqrt(x)    23.40                   1.52
2^x          22.51                   1.41
log2 x       22.57                   n/a
sin/cos      22.47                   n/a

The SFUs are accessed with the intrinsics given in Table 8.12. Specifying the --fast-math compiler option will cause the compiler to substitute conventional C runtime calls with the corresponding SFU intrinsics.

Table 8.12 SFU Intrinsics

INTRINSIC                 OPERATION
__cosf(x)                 cos x
__exp10f(x)               10^x
__expf(x)                 e^x
__fdividef(x,y)           x/y
__log2f(x)                log2 x
__log10f(x)               log10 x
__powf(x,y)               x^y
__sinf(x)                 sin x
__sincosf(x,sptr,cptr)    *sptr=sin(x); *cptr=cos(x);
__tanf(x)                 tan x

Miscellaneous

__saturate(x) returns 0 if x<0, 1 if x>1, and x otherwise.

8.3.3. Double Precision (64-Bit)

Double-precision floating-point support was added to CUDA with SM 1.3 (first implemented in the GeForce GTX 280), and much improved double-precision support (both functionality and performance) became available with SM 2.0. CUDA’s hardware support for double precision features full-speed denormals and, starting in SM 2.x, a native fused multiply-add (FMA) instruction, compliant with IEEE 754-2008, that performs only one rounding step. Besides being an intrinsically useful operation, FMA enables full accuracy on certain functions that are computed with Newton-Raphson iteration.

As with single-precision operations, the compiler automatically translates standard C operators into multiplication, addition, and multiply-add instructions. The __dadd_rn() and __dmul_rn() intrinsics may be used to suppress fusion of addition and multiplication operations into multiply-add instructions.

8.3.4. Half Precision (16-Bit)

With 5 bits of exponent and 10 bits of significand, half values have enough precision for HDR (high dynamic range) images and can be used to hold other types of values that do not require float precision, such as angles. Half precision values are intended for storage, not computation, so the hardware only provides instructions to convert to/from 32-bit.13 These instructions are exposed as the __half2float() and __float2half() intrinsics.

float __half2float( unsigned short );
unsigned short __float2half( float );

These intrinsics use unsigned short because the C language has not standardized the half floating-point type.

8.3.5. Case Study: float-to-half Conversion

Studying the float-to-half conversion operation is a useful way to learn the details of floating-point encodings and rounding. Because it’s a simple unary operation, we can focus on the encoding and rounding without getting distracted by the details of floating-point arithmetic and the precision of intermediate representations.

When converting from float to half, the correct output for any float too large to represent is half infinity. Any float too small to represent as a half (even a denormal half) must be clamped to 0.0. The maximum float that rounds to half 0.0 is 0x32FFFFFF, or about 2.98×10^-8, while the smallest float that rounds to half infinity is 65520.0. float values inside this range can be converted to half by propagating the sign bit, rebiasing the exponent (since float has an 8-bit exponent biased by 127 and half has a 5-bit exponent biased by 15), and rounding the float mantissa to the nearest half mantissa value. Rounding is straightforward in all cases except when the input value falls exactly between the two possible output values. When this is the case, the IEEE standard specifies rounding to the “nearest even” value. In decimal arithmetic, this would mean rounding 1.5 to 2.0, but also rounding 2.5 to 2.0 and (for example) rounding 0.5 to 0.0.

Listing 8.3 shows a C routine that exactly replicates the float-to-half conversion operation, as implemented by CUDA hardware. The variables exp and mag contain the input exponent and “magnitude,” the mantissa and exponent together with the sign bit masked off. Many operations, such as comparisons and rounding operations, can be performed on the magnitude without separating the exponent and mantissa.

The macro LG_MAKE_MASK, used in Listing 8.3, creates a mask with a given bit count: #define LG_MAKE_MASK(bits) ((1<<(bits))-1). A volatile union is used to treat the same 32-bit value as float and unsigned int; idioms such as *((float *) (&u)) are not portable. The routine first propagates the input sign bit and masks it off the input.

After extracting the magnitude and exponent, the function deals with the special case when the input float is INF or NaN, and does an early exit. Note that INF is signed, but NaN has a canonical unsigned value. Lines 50–80 clamp the input float value to the minimum or maximum values that correspond to representable half values and recompute the magnitude for clamped values. Don’t be fooled by the elaborate code constructing f32MinRInfin and f32MaxRf16_zero; those are constants with the values 0x477ff000 and 0x32ffffff, respectively.

The remainder of the routine deals with the cases of output normal and denormal (input denormals are clamped in the preceding code, so mag corresponds to a normal float). As with the clamping code, f32Minf16Normal is a constant, and its value is 0x38ffffff.

To construct a normal, the new exponent must be computed (lines 92 and 93) and the correctly rounded 10 bits of mantissa shifted into the output. To construct a denormal, the implicit 1 must be OR’d into the output mantissa and the resulting mantissa shifted by the amount corresponding to the input exponent. For both normals and denormals, the rounding of the output mantissa is accomplished in two steps. The rounding is accomplished by adding a mask of 1’s that ends just short of the output’s LSB, as seen in Figure 8.3.

Figure 8.3 Rounding mask (half).

This operation increments the output mantissa if bit 12 of the input is set; if the input mantissa is all 1’s, the overflow causes the output exponent to correctly increment. If we added one more 1 to the MSB of this adjustment, we’d have elementary school–style rounding where the tiebreak goes to the larger number. Instead, to implement round-to-nearest even, we conditionally increment the output mantissa if the LSB of the 10-bit output is set (Figure 8.4). Note that these steps can be performed in either order or can be reformulated in many different ways.

Figure 8.4 Round-to-nearest-even (half).

Listing 8.3. ConvertToHalf().

/*
 * exponent shift and mantissa bit count are the same.
 *    When we are shifting, we use [f16|f32]ExpShift
 *    When referencing the number of bits in the mantissa,
 *        we use [f16|f32]MantissaBits
 */
const int f16ExpShift = 10;
const int f16MantissaBits = 10;

const int f16ExpBias = 15;
const int f16MinExp = -14;
const int f16MaxExp = 15;
const int f16SignMask = 0x8000;

const int f32ExpShift = 23;
const int f32MantissaBits = 23;
const int f32ExpBias = 127;
const int f32SignMask = 0x80000000;

unsigned short
ConvertFloatToHalf( float f )
{
    /*
     * Use a volatile union to portably coerce
     * 32-bit float into 32-bit integer
     */
    volatile union {
        float f;
        unsigned int u;
    } uf;
    uf.f = f;

    // return value: start by propagating the sign bit.
    unsigned short w = (uf.u >> 16) & f16SignMask;

    // Extract input magnitude and exponent
    unsigned int mag = uf.u & ~f32SignMask;
    int exp = (int) (mag >> f32ExpShift) - f32ExpBias;

    // Handle float32 Inf or NaN
    if ( exp == f32ExpBias+1 ) {    // INF or NaN

        if ( mag & LG_MAKE_MASK(f32MantissaBits) )
            return 0x7fff; // NaN

        // INF - propagate sign
        return w|0x7c00;
    }

    /*
     * clamp float32 values that are not representable by float16
     */
    {
        // min float32 magnitude that rounds to float16 infinity

        unsigned int f32MinRInfin = (f16MaxExp+f32ExpBias) <<
            f32ExpShift;
        f32MinRInfin |= LG_MAKE_MASK( f16MantissaBits+1 ) <<
            (f32MantissaBits-f16MantissaBits-1);

        if (mag > f32MinRInfin)
            mag = f32MinRInfin;
    }

    {
        // max float32 magnitude that rounds to float16 0.0

        unsigned int f32MaxRf16_zero = f16MinExp+f32ExpBias-
            (f32MantissaBits-f16MantissaBits-1);
        f32MaxRf16_zero <<= f32ExpShift;
        f32MaxRf16_zero |= LG_MAKE_MASK( f32MantissaBits );

        if (mag < f32MaxRf16_zero)
            mag = f32MaxRf16_zero;
    }

    /*
     * compute exp again, in case mag was clamped above
     */
    exp = (mag >> f32ExpShift) - f32ExpBias;

    // min float32 magnitude that converts to float16 normal
    unsigned int f32Minf16Normal = ((f16MinExp+f32ExpBias)<<
        f32ExpShift);
    f32Minf16Normal |= LG_MAKE_MASK( f32MantissaBits );
    if ( mag >= f32Minf16Normal ) {
        //
        // Case 1: float16 normal
        //

        // Modify exponent to be biased for float16, not float32
        mag += (unsigned int) ((f16ExpBias-f32ExpBias)<<
            f32ExpShift);

        int RelativeShift = f32ExpShift-f16ExpShift;

        // add rounding bias
        mag += LG_MAKE_MASK(RelativeShift-1);

        // round-to-nearest even
        mag += (mag >> RelativeShift) & 1;

        w |= mag >> RelativeShift;
    }
    else {
        /*
         * Case 2: float16 denormal
         */

        // mask off exponent bits - now fraction only
        mag &= LG_MAKE_MASK(f32MantissaBits);

        // make implicit 1 explicit
        mag |= (1<<f32ExpShift);

        int RelativeShift = f32ExpShift-f16ExpShift+f16MinExp-exp;

        // add rounding bias
        mag += LG_MAKE_MASK(RelativeShift-1);

        // round-to-nearest even
        mag += (mag >> RelativeShift) & 1;

        w |= mag >> RelativeShift;
    }
    return w;
}

In practice, developers should convert float to half by using the __float2half() intrinsic, which the compiler translates to a single F2F machine instruction. This sample routine is provided purely to aid in understanding floating-point layout and rounding. Examining all the special-case code for INF/NaN and denormal values also helps to illustrate why these features of the IEEE spec have been controversial since its inception: They make hardware slower, more costly, or both, due to increased silicon area and engineering effort for validation.

In the code accompanying this book, the ConvertFloatToHalf() routine in Listing 8.3 is incorporated into a program called float_to_float16.cu that tests its output for every 32-bit floating-point value.

8.3.6. Math Library

CUDA includes a built-in math library modeled on the C runtime library, with a few small differences: CUDA hardware does not include a rounding mode register (instead, the round mode is encoded on a per-instruction basis),14 so functions such as rint() that reference the current rounding mode always round-to-nearest. Additionally, the hardware does not raise floating-point exceptions; results of aberrant operations, such as taking the square root of a negative number, are encoded as NaNs.

Table 8.13 lists the math library functions and the maximum error in ulps for each function. Most functions that operate on float have an “f” appended to the function name—for example, the functions that compute the sine function are as follows.

double sin( double angle );
float sinf( float angle );

Table 8.13 Math Library

FUNCTION              OPERATION                                             EXPRESSION                          ULP ERROR
                                                                                                                32     64

x+y                   Addition                                              x+y                                 0¹     0
x*y                   Multiplication                                        x*y                                 0¹     0
x/y                   Division                                              x/y                                 2²     0
1/x                   Reciprocal                                            1/x                                 1²     0
acos[f](x)            Inverse cosine                                        cos^-1 x                            3      2
acosh[f](x)           Inverse hyperbolic cosine                             cosh^-1 x                           4      2
asin[f](x)            Inverse sine                                          sin^-1 x                            4      2
asinh[f](x)           Inverse hyperbolic sine                               sinh^-1 x                           3      2
atan[f](x)            Inverse tangent                                       tan^-1 x                            2      2
atan2[f](y,x)         Inverse tangent of y/x                                tan^-1 (y/x)                        3      2
atanh[f](x)           Inverse hyperbolic tangent                            tanh^-1 x                           3      2
cbrt[f](x)            Cube root                                             x^(1/3)                             1      1
ceil[f](x)            “Ceiling,” nearest integer greater than or equal to x ⌈x⌉                                 0      0
copysign[f](x,y)      Sign of y, magnitude of x                                                                 n/a    n/a
cos[f](x)             Cosine                                                cos x                               2      1
cosh[f](x)            Hyperbolic cosine                                     cosh x                              2      –
cospi[f](x)           Cosine, scaled by π                                   cos πx                              2      –
erf[f](x)             Error function                                        erf(x)                              3      2
erfc[f](x)            Complementary error function                          erfc(x)                             6      4
erfcinv[f](y)         Inverse complementary error function                  Return x for which y=1-erf(x)       7      8
erfcx[f](x)           Scaled complementary error function                   e^(x^2) erfc(x)                     6      3
erfinv[f](y)          Inverse error function                                Return x for which y=erf(x)         3      5
exp[f](x)             Natural exponent                                      e^x                                 2      1
exp10[f](x)           Exponent (base 10)                                    10^x                                2      1
exp2[f](x)            Exponent (base 2)                                     2^x                                 2      1
expm1[f](x)           Natural exponent, minus one                           e^x - 1                             1      1
fabs[f](x)            Absolute value                                        |x|                                 0      0
fdim[f](x,y)          Positive difference                                   x-y if x>y; +0 if x<=y; NaN if      0      0
                                                                            x or y is NaN
floor[f](x)           “Floor,” nearest integer less than or equal to x      ⌊x⌋                                 0      0
fma[f](x,y,z)         Multiply-add                                          xy + z                              0      0
fmax[f](x,y)          Maximum                                               x if x>y or isNaN(y); y otherwise   0      0
fmin[f](x,y)          Minimum                                               x if x<y or isNaN(y); y otherwise   0      0
fmod[f](x,y)          Floating-point remainder                                                                  0      0
frexp[f](x,exp)       Fractional component                                                                      0      0
hypot[f](x,y)         Length of hypotenuse                                  sqrt(x^2+y^2)                       3      2
ilogb[f](x)           Get exponent                                                                              0      0
isfinite(x)           Nonzero if x is not ±INF                                                                  n/a    n/a
isinf(x)              Nonzero if x is ±INF                                                                      n/a    n/a
isnan(x)              Nonzero if x is a NaN                                                                     n/a    n/a
j0[f](x)              Bessel function of the first kind (n=0)               J0(x)                               9³     7³
j1[f](x)              Bessel function of the first kind (n=1)               J1(x)                               9³     7³
jn[f](n,x)            Bessel function of the first kind                     Jn(x)                               *      *
ldexp[f](x,exp)       Scale by power of 2                                   x·2^exp                             0      0
lgamma[f](x)          Logarithm of gamma function                           ln(Γ(x))                            6⁴     4⁴
llrint[f](x)          Round to long long                                                                        0      0
llround[f](x)         Round to long long                                                                        0      0
lrint[f](x)           Round to long                                                                             0      0
lround[f](x)          Round to long                                                                             0      0
log[f](x)             Natural logarithm                                     ln x                                1      1
log10[f](x)           Logarithm (base 10)                                   log10 x                             3      1
log1p[f](x)           Natural logarithm of x+1                              ln(x + 1)                           2      1
log2[f](x)            Logarithm (base 2)                                    log2 x                              3      1
logb[f](x)            Get exponent                                                                              0      0
modf[f](x,iptr)       Split fractional and integer parts                                                        0      0
nan[f](cptr)          Returns NaN                                           NaN                                 n/a    n/a
nearbyint[f](x)       Round to integer                                                                          0      0
nextafter[f](x,y)     Returns the FP value closest to x in the direction                                        n/a    n/a
                      of y
normcdf[f](x)         Normal cumulative distribution                                                            6      5
normcdfinv[f](x)      Inverse normal cumulative distribution                                                    5      8
pow[f](x,y)           Power function                                        x^y                                 8      2
rcbrt[f](x)           Inverse cube root                                     x^(-1/3)                            2      1
remainder[f](x,y)     Remainder                                                                                 0      0
remquo[f](x,y,iptr)   Remainder (also returns quotient)                                                         0      0
rsqrt[f](x)           Reciprocal square root                                1/sqrt(x)                           2      1
rint[f](x)            Round to nearest int                                                                      0      0
round[f](x)           Round to nearest int                                                                      0      0
scalbln[f](x,n)       Scale x by 2^n (n is long int)                        x·2^n                               0      0
scalbn[f](x,n)        Scale x by 2^n (n is int)                             x·2^n                               0      0
signbit(x)            Nonzero if x is negative                                                                  n/a    n/a
sin[f](x)             Sine                                                  sin x                               2      1
sincos[f](x,s,c)      Sine and cosine                                       *s=sin(x);*c=cos(x);                2      1
sincospi[f](x,s,c)    Sine and cosine, scaled by π                          *s=sin(πx);*c=cos(πx);              2      1
sinh[f](x)            Hyperbolic sine                                       sinh x                              3      1
sinpi[f](x)           Sine, scaled by π                                     sin πx                              2      1
sqrt[f](x)            Square root                                           sqrt(x)                             3⁵     0
tan[f](x)             Tangent                                               tan x                               4      2
tanh[f](x)            Hyperbolic tangent                                    tanh x                              2      1
tgamma[f](x)          True gamma function                                   Γ(x)                                11     8
trunc[f](x)           Truncate (round to integer toward zero)                                                   0      0
y0[f](x)              Bessel function of the second kind (n=0)              Y0(x)                               9³     7³
y1[f](x)              Bessel function of the second kind (n=1)              Y1(x)                               9³     7³
yn[f](n,x)            Bessel function of the second kind                    Yn(x)                               **     **

* For the Bessel functions jnf(n,x) and jn(n,x), for n=128 the maximum absolute error is 2.2×10^-6 and 5×10^-12, respectively.

** For the Bessel function ynf(n,x), the error is ⌈2 + 2.5n⌉ ulps for |x| < n; otherwise, the maximum absolute error is 2.2×10^-6 for n=128. For yn(n,x), the maximum absolute error is 5×10^-12.

  1. On SM 1.x class hardware, the precision of addition and multiplication operations that are merged into FMAD instructions will suffer due to truncation of the intermediate mantissa.
  2. On SM 2.x and later hardware, developers can reduce this error rate to 0 ulps by specifying --prec-div=true.
  3. For float, the error is 9 ulps for |x|<8; otherwise, the maximum absolute error is 2.2×10^-6. For double, the error is 7 ulps for |x|<8; otherwise, the maximum absolute error is 5×10^-12.
  4. The error for lgammaf() is greater than 6 ulps inside the interval (-10.001, -2.264). The error for lgamma() is greater than 4 ulps inside the interval (-11.001, -2.2637).
  5. On SM 2.x and later hardware, developers can reduce this error rate to 0 ulps by specifying --prec-sqrt=true.

These are denoted in Table 8.13 as, for example, sin[f].

Conversion to Integer

According to the C runtime library definition, the nearbyint() and rint() functions round a floating-point value to the nearest integer using the “current rounding direction,” which in CUDA is always round-to-nearest-even. In the C runtime, nearbyint() and rint() differ only in their handling of the INEXACT exception. But since CUDA does not raise floating-point exceptions, the functions behave identically.

round() implements elementary school–style rounding: For floating-point values halfway between integers, the input is always rounded away from zero. NVIDIA recommends against using this function because it expands to eight instructions as opposed to one for rint() and its variants. trunc() truncates or “chops” the floating-point value, rounding toward zero. It compiles to a single instruction.

Fractions and Exponents

float frexpf(float x, int *eptr);

frexpf() breaks the input into a floating-point significand in the range [0.5, 1.0) and an integral exponent of 2, such that

x = Significand · 2^Exponent

float logbf( float x );

logbf() extracts the exponent from x and returns it as a floating-point value. It is equivalent to floorf(log2f(x)), except it is faster. If x is a denormal, logbf() returns the exponent that x would have if it were normalized.

float ldexpf( float x, int exp );
float scalbnf( float x, int n );
float scalblnf( float x, long n );

ldexpf(), scalbnf(), and scalblnf() all compute x·2^n by direct manipulation of floating-point exponents.

Floating-Point Remainder

modff() breaks the input into fractional and integer parts.

float modff( float x, float *intpart );

The return value is the fractional part of x, with the same sign.

remainderf(x,y) computes the floating-point remainder of dividing x by y. The return value is x-n*y, where n is x/y rounded to the nearest integer. When x/y is exactly halfway between two integers, n is chosen to be even.

float remquof(float x, float y, int *quo);

remquof() computes the remainder and passes back the lower bits of the integral quotient x/y, with the same sign as x/y.

Bessel Functions

The Bessel functions of order n relate to the differential equation

x^2 y'' + x y' + (x^2 - n^2) y = 0

n can be a real number, but for purposes of the C runtime, it is a nonnegative integer.

The solution to this second-order ordinary differential equation combines Bessel functions of the first kind and of the second kind.

y(x) = c1 Jn(x) + c2 Yn(x)

The math runtime functions jn[f]() and yn[f]() compute Jn(x) and Yn(x), respectively. j0f(), j1f(), y0f(), and y1f() compute these functions for the special cases of n=0 and n=1.

Gamma Function

The gamma function Γ is an extension of the factorial function, with its argument shifted down by 1, to real numbers. It has a variety of definitions, one of which is as follows.

Γ(x) = ∫0→∞ t^(x-1) e^(-t) dt

The function grows so quickly that the return value loses precision for relatively small input values, so the library provides the lgamma() function, which returns the natural logarithm of the gamma function, in addition to the tgamma() (“true gamma”) function.

8.3.7. Additional Reading

Goldberg’s survey (with the captivating title “What Every Computer Scientist Should Know About Floating Point Arithmetic”) is a good introduction to the topic.

http://download.oracle.com/docs/cd/E19957-01/806-3568/ncg_goldberg.html

Nathan Whitehead and Alex Fit-Florea of NVIDIA have coauthored a white paper entitled “Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs.”

http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf

Increasing Effective Precision

Dekker and Kahan developed methods to almost double the effective precision of floating-point hardware using pairs of numbers in exchange for a slight reduction in exponent range (due to intermediate underflow and overflow at the far ends of the range). Some papers on this topic include the following.

Dekker, T.J. A floating-point technique for extending the available precision. Numer. Math. 18, 1971, pp. 224–242.

Linnainmaa, S. Software for doubled-precision floating-point computations. ACM TOMS 7, pp. 272–283 (1981).

Shewchuk, J.R. Adaptive precision floating-point arithmetic and fast robust geometric predicates. Discrete & Computational Geometry 18, 1997, pp. 305–363.

Some GPU-specific work on this topic has been done by Andrew Thall, Da Graça, and Defour.

Da Graça, Guillaume, and David Defour. Implementation of float-float operators on graphics hardware. 7th Conference on Real Numbers and Computers, RNC7 (2006).

http://hal.archives-ouvertes.fr/docs/00/06/33/56/PDF/float-float.pdf

Thall, Andrew. Extended-precision floating-point numbers for GPU computation. 2007.

http://andrewthall.org/papers/df64_qf128.pdf
