
8.2. Integer Support

The SMs have the full complement of 32-bit integer operations.

  • Addition with optional negation of an operand for subtraction
  • Multiplication and multiply-add
  • Integer division
  • Logical operations
  • Condition code manipulation
  • Conversion to/from floating point
  • Miscellaneous operations (e.g., SIMD instructions for narrow integers, population count, find first zero)

CUDA exposes most of this functionality through standard C operators. Nonstandard operations, such as 24-bit multiplication, may be accessed using inline PTX assembly or intrinsic functions.
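As a rough illustration of the two access paths, the sketch below computes a 24-bit product once with the __umul24() intrinsic and once with inline PTX (the mul24.lo.u32 instruction). The kernel, the device wrapper umul24Ptx(), and the host helper are hypothetical names chosen for this example, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// 24-bit multiply through inline PTX: mul24.lo returns the low 32 bits
// of the product of the low 24 bits of each operand.
__device__ unsigned int umul24Ptx(unsigned int a, unsigned int b)
{
    unsigned int d;
    asm("mul24.lo.u32 %0, %1, %2;" : "=r"(d) : "r"(a), "r"(b));
    return d;
}

// Hypothetical kernel: out[0] uses the intrinsic, out[1] the inline PTX.
__global__ void mul24BothWays(unsigned int a, unsigned int b, unsigned int *out)
{
    out[0] = __umul24(a, b);
    out[1] = umul24Ptx(a, b);
}

// Hypothetical host helper (error checking omitted for brevity).
static void runMul24BothWays(unsigned int a, unsigned int b, unsigned int result[2])
{
    unsigned int *d = nullptr;
    cudaMalloc(&d, 2 * sizeof(unsigned int));
    mul24BothWays<<<1, 1>>>(a, b, d);
    cudaMemcpy(result, d, 2 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

Both paths should agree for any inputs; the intrinsic is usually the more readable choice, with inline PTX reserved for operations that have no intrinsic.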

8.2.1. Multiplication

Multiplication is implemented differently on Tesla- and Fermi-class hardware. Tesla implements a 24-bit multiplier, while Fermi implements a 32-bit multiplier. As a consequence, full 32-bit multiplication on SM 1.x hardware requires four instructions. Performance-sensitive code targeting Tesla-class hardware therefore benefits from using the intrinsics for 24-bit multiply. Table 8.4 shows the intrinsics related to multiplication.

Table 8.4 Multiplication Intrinsics

| INTRINSIC | DESCRIPTION |
|---|---|
| __[u]mul24 | Returns the least significant 32 bits of the product of the 24 least significant bits of the integer parameters. The 8 most significant bits of the inputs are ignored. |
| __[u]mulhi | Returns the most significant 32 bits of the product of the inputs. |
| __[u]mul64hi | Returns the most significant 64 bits of the product of the 64-bit inputs. |
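To make the descriptions in Table 8.4 concrete, the following minimal sketch evaluates the ordinary operator*, __umul24(), and __umulhi() on the same pair of unsigned inputs. The kernel name mulDemo and the host helper runMulDemo are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: three views of the product of the same inputs.
__global__ void mulDemo(unsigned int a, unsigned int b, unsigned int *out)
{
    out[0] = a * b;           // low 32 bits of the full 32x32-bit product
    out[1] = __umul24(a, b);  // low 32 bits of (a & 0xFFFFFF) * (b & 0xFFFFFF)
    out[2] = __umulhi(a, b);  // high 32 bits of the 64-bit product
}

// Hypothetical host helper (error checking omitted for brevity).
static void runMulDemo(unsigned int a, unsigned int b, unsigned int result[3])
{
    unsigned int *d = nullptr;
    cudaMalloc(&d, 3 * sizeof(unsigned int));
    mulDemo<<<1, 1>>>(a, b, d);
    cudaMemcpy(result, d, 3 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

With a = 0x01000003 and b = 5, the 24-bit multiply ignores bit 24 of a and yields 15, while operator* yields 0x0500000F. Together, out[0] and out[2] give the low and high halves of the full 64-bit product.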

8.2.2. Miscellaneous (Bit Manipulation)

The CUDA compiler implements a number of intrinsics for bit manipulation, as summarized in Table 8.5. On SM 2.x and later architectures, these intrinsics map to single instructions. On pre-Fermi architectures, they are valid but may compile into many instructions. When in doubt, disassemble and look at the microcode! The 64-bit variants have “ll” (two ells for “long long”) appended to the intrinsic name: __clzll(), __ffsll(), __popcll(), and __brevll().

Table 8.5 Bit Manipulation Intrinsics

| INTRINSIC | SUMMARY | DESCRIPTION |
|---|---|---|
| __brev(x) | Bit reverse | Reverses the order of bits in a word |
| __byte_perm(x,y,s) | Permute bytes | Returns a 32-bit word whose bytes were selected from the two inputs according to the selector parameter s |
| __clz(x) | Count leading zeros | Returns the number of zero bits (0–32) before the most significant set bit |
| __ffs(x) | Find first set bit | Returns the position of the least significant set bit. The least significant bit is position 1. For an input of 0, __ffs() returns 0. |
| __popc(x) | Population count | Returns the number of set bits |
| __[u]sad(x,y,z) | Sum of absolute differences | Adds abs(x-y) to z and returns the result |
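The sketch below exercises several of the intrinsics in Table 8.5 on a single input so their conventions (for example, the 1-based result of __ffs()) can be checked directly. The kernel name bitDemo and the host helper runBitDemo are hypothetical; the selector 0x0123 passed to __byte_perm() is one illustrative choice that reverses the byte order of x. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: applies five bit-manipulation intrinsics to x.
__global__ void bitDemo(unsigned int x, unsigned int *out)
{
    out[0] = (unsigned int)__popc(x);      // number of set bits
    out[1] = (unsigned int)__clz((int)x);  // leading zeros; 32 when x == 0
    out[2] = (unsigned int)__ffs((int)x);  // 1-based lowest set bit; 0 when x == 0
    out[3] = __brev(x);                    // bit 0 <-> bit 31, bit 1 <-> bit 30, ...
    out[4] = __byte_perm(x, 0u, 0x0123u);  // selector 0x0123 reverses the bytes of x
}

// Hypothetical host helper (error checking omitted for brevity).
static void runBitDemo(unsigned int x, unsigned int result[5])
{
    unsigned int *d = nullptr;
    cudaMalloc(&d, 5 * sizeof(unsigned int));
    bitDemo<<<1, 1>>>(x, d);
    cudaMemcpy(result, d, 5 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

For x = 0x50 (bits 4 and 6 set), the results are 2 set bits, 25 leading zeros, a first set bit at position 5, and a bit-reversed value of 0x0A000000.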

8.2.3. Funnel Shift (SM 3.5)

GK110 added a 64-bit “funnel shift” instruction that concatenates two 32-bit values together (the least significant and most significant halves are specified as separate 32-bit inputs, but the hardware operates on an aligned register pair), shifts the resulting 64-bit value left or right, and then returns the most significant (for left shift) or least significant (for right shift) 32 bits.

Funnel shift may be accessed with the intrinsics given in Table 8.6. These intrinsics are implemented as inline device functions (using inline PTX assembler) in sm_35_intrinsics.h. By default, the shift count is masked to its least significant 5 bits, so it wraps modulo 32; the _lc and _rc intrinsics instead clamp the shift count to the range 0..32.

Table 8.6 Funnel Shift Intrinsics

| INTRINSIC | DESCRIPTION |
|---|---|
| __funnelshift_l(lo, hi, sh) | Concatenates [hi:lo] into a 64-bit quantity, shifts it left by (sh&31) bits, and returns the most significant 32 bits |
| __funnelshift_lc(lo, hi, sh) | Concatenates [hi:lo] into a 64-bit quantity, shifts it left by min(sh,32) bits, and returns the most significant 32 bits |
| __funnelshift_r(lo, hi, sh) | Concatenates [hi:lo] into a 64-bit quantity, shifts it right by (sh&31) bits, and returns the least significant 32 bits |
| __funnelshift_rc(lo, hi, sh) | Concatenates [hi:lo] into a 64-bit quantity, shifts it right by min(sh,32) bits, and returns the least significant 32 bits |
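The difference between the wrapping and clamping variants only shows up for shift counts of 32 or more. The sketch below runs all four intrinsics on the same inputs so the two behaviors can be compared. It assumes compilation for SM 3.5 or later, and it passes the low word first, which is how the CUDA headers declare these intrinsics. The kernel name funnelModes and the host helper are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: the same [hi:lo] pair through all four funnel shifts.
__global__ void funnelModes(unsigned int lo, unsigned int hi, unsigned int sh,
                            unsigned int *out)
{
    out[0] = __funnelshift_l(lo, hi, sh);   // left,  count = sh & 31
    out[1] = __funnelshift_lc(lo, hi, sh);  // left,  count = min(sh, 32)
    out[2] = __funnelshift_r(lo, hi, sh);   // right, count = sh & 31
    out[3] = __funnelshift_rc(lo, hi, sh);  // right, count = min(sh, 32)
}

// Hypothetical host helper (error checking omitted for brevity).
static void runFunnelModes(unsigned int lo, unsigned int hi, unsigned int sh,
                           unsigned int result[4])
{
    unsigned int *d = nullptr;
    cudaMalloc(&d, 4 * sizeof(unsigned int));
    funnelModes<<<1, 1>>>(lo, hi, sh, d);
    cudaMemcpy(result, d, 4 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

With a shift count of exactly 32, the wrapping left shift behaves like a shift by 0 and returns hi, while the clamping left shift moves the whole low word into the result and returns lo. For counts below 32, the two variants agree.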

Applications for funnel shift include the following.

  • Multiword shift operations
  • Memory copies between misaligned buffers using aligned loads and stores
  • Rotate

To right-shift data sizes greater than 64 bits, use repeated __funnelshift_r() calls, operating from the least significant to the most significant word. The most significant word of the result is computed using operator>>, which shifts in zero or sign bits as appropriate for the integer type. To left-shift data sizes greater than 64 bits, use repeated __funnelshift_l() calls, operating from the most significant to the least significant word. The least significant word of the result is computed using operator<<. If the hi and lo parameters are the same, the funnel shift effects a rotate operation.
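The multiword right shift and the rotate described above can be sketched as follows for a 128-bit value held in four 32-bit words, with w[0] the least significant. The function names shr128 and rotl32, the kernel, and the host helper are hypothetical; the shift count is assumed to be less than 32, the code assumes SM 3.5 or later, and the intrinsics take the low word first, as declared in the CUDA headers. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Right-shift a 128-bit value in place by sh bits (assumes 0 <= sh < 32).
// Working from the least significant word upward means each step reads
// only words that have not been overwritten yet.
__device__ void shr128(unsigned int w[4], unsigned int sh)
{
    w[0] = __funnelshift_r(w[0], w[1], sh);
    w[1] = __funnelshift_r(w[1], w[2], sh);
    w[2] = __funnelshift_r(w[2], w[3], sh);
    w[3] = w[3] >> sh;  // unsigned type, so zeros are shifted in
}

// Rotate left: funnel shift with the same word as both halves.
__device__ unsigned int rotl32(unsigned int x, unsigned int sh)
{
    return __funnelshift_l(x, x, sh);
}

// Hypothetical kernel: data[0..3] is shifted in place, data[4] is rotated.
__global__ void shiftDemo(unsigned int *data, unsigned int sh)
{
    shr128(data, sh);
    data[4] = rotl32(data[4], sh);
}

// Hypothetical host helper (error checking omitted for brevity).
static void runShiftDemo(unsigned int data[5], unsigned int sh)
{
    unsigned int *d = nullptr;
    cudaMalloc(&d, 5 * sizeof(unsigned int));
    cudaMemcpy(d, data, 5 * sizeof(unsigned int), cudaMemcpyHostToDevice);
    shiftDemo<<<1, 1>>>(d, sh);
    cudaMemcpy(data, d, 5 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

A left shift would mirror this structure, calling __funnelshift_l() from the most significant word down and finishing with operator<< on the least significant word. For a signed most significant word, operator>> on an int would shift in sign bits instead of zeros.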
