Home > Store

Inner Loops: A Sourcebook for Fast 32-bit Software Development

Register your product to gain access to bonus material or receive a coupon.

Inner Loops: A Sourcebook for Fast 32-bit Software Development


  • Sorry, this book is no longer in print.
Not for Sale


  • Copyright 1997
  • Dimensions: 7-3/8" x 9-1/8"
  • Pages: 392
  • Edition: 1st
  • Book
  • ISBN-10: 0-201-47960-5
  • ISBN-13: 978-0-201-47960-7

Includes coverage of Pentium II - referred to in the chapters on Pentium Pro and MMX.

No speed limits have been posted on the PC performance track, yet much software runs in the slow lane, functioning at 10 to 50 percent of its potential speed. The cause of these slowdowns? Bottlenecking on time-critical inner loops.

Inner Loops: A Sourcebook for Fast 32-bit Software Development gives the green light to optimal PC performance with practical advice and a strategic sampling of important algorithms. Focused directly on the 32-bit future of PC computing, Inner Loops explores the new rules and opportunities of a wide-open memory space, parallel instruction execution, and clock speeds in the hundreds of megahertz. You'll be taken through:

  • a thorough review of 32-bit code optimization for the 486, Pentium, and Pentium Pro
  • making the transition from 16-bit to 32-bit assembly language
  • principles of C and assembly language optimization
  • tips for fast 32-bit software design
  • real-world examples of top-speed inner loops for several important PC algorithms
  • what MMX, the Intel multimedia extensions, mean for speed

Author Rick Booth backs up his theory of speed with practical examples and source code, including such topics as:

  • Fast memory moves
  • Random numbers
  • Hashing
  • Huffman compression
  • Sorting
  • Matrix math
  • JPEG's inner loop
Many chapters contain high-performance demos, which are also found on the CD. These include one of the fastest sort engines possible, a top-speed Huffman compression system, and JPEG's decompression inner loop tuned for top performance.

Consultant and developer Rick Booth is a 17-year veteran of the video game and digital video industries.




Click below for Supplements related to this title:

CD Contents

Untitled Document This file contains the CD Contents from the book Inner Loops: A Sourcebook for Fast 32-bit Software Development

Sample Content

Downloadable Sample Chapter


Two, four, [six,] eight. Who do we appreciate?

--Common cheer, in apparent anticipation of MMXparallelism

In early 1996, Intel announced the first major architectural extensionof the 80x86 instruction set since the introduction of 32-bit registersand their associated instructions with the 386. Called MMX, the eight new64-bit registers and their associated instruction set are to be graftedonto future Pentium and Pentium Pro processors. Although the term "MMX"appears to have been inspired by the expression "multimedia extensions,"Intel claims that it doesn't stand for anything in particular. But whateverthe name does or doesn't mean, it's a major boost to the processor's power--particularlywith regard to multimedia applications!

As of this writing, in mid 1996, the chips aren't out. But a fair amountof detailed documentation is. This chapter is a preliminary look at whatthe new chips should have to offer. In particular, this chapter looks at

* General processor enhancements

* The MMX registers

* The MMX instructions

* MMX instruction timing and pairing

* Software emulation of MMX

The presence or absence of MMX can be detected using information returnedby the cpuid instruction (which was new with the Pentium). Andif software doesn't find MMX, it needs fast inner loops to even come closeto making up for its absence.

General Processor Enhancements

Almost overlooked in the celebration of the new MMX features is thefact that Pentium Pros and, especially, Pentiums produced with MMX willhave several significant improvements to their general architecture thatwill benefit general program execution.

16K Primary Cache

The first great and general gift of MMX is a 16K primary data and codecache. Each cache line is still 32 bytes, but instead of the former, slightlydangerous two-way set-associative structure (where just three memory referencesin a cycle can produce worst-case cache behavior), the cache is four-wayset-associative. This not only means that there's a lot more cache to goaround but also that it's much less likely that situations will arise wherecache lines get thrown out just before they're needed (although, in truth,this was a rather rare occurrence on the Pentium anyhow). Having the extra8K is just a great thing.

Faster Instructions (0FH Prefix)

The following common instructions (plus some uncommon ones) should runone cycle faster on a Pentium with MMX, because, the claim is made, instructionswith the 0FH prefix are no longer penalized one cycle by this "prefix":

* bswap

* setCC

* shld

* shrd

* movsx

* movzx

This doubles the speed of bswap and setCC and makesthem much more attractive in inner loops.

16-Bit Instructions More Pairable

It used to be that instructions that operated on 16-bit data had toexecute in the
U pipe (the first position in a pair). On a Pentium with MMX, 16-bit instructionscan also occupy the second position, in the V pipe. It is not entirelyclear from the documentation what the performance penalty is for pairingtwo 16-bit instructions, or, for that matter, pairing a 16-bit instructionin the V pipe. My best guess, subject to verification, is that there'sa one-cycle penalty for each 16-bit instruction.

Pairable Immediate/Displacement Instructions

On a regular Pentium, this instruction is not pairable, because it containsboth an immediate value and a displacement:

mov byte ptr [edi+7],99

On a Pentium with MMX, however, it can be placed in the first positionof a pair, and hence in the U pipe. In effect, this particular instruction'seffective speed is thereby doubled from one cycle to half a cycle (whenpaired with another half-cycle instruction).

Merged Write Buffers

On the Pentium, there is a two-deep write buffer, or queue, associatedwith each "pipe." If you have to do four writes in rapid successionto noncached memory, the best way to do so is to issue two of them fromthe U pipe and the other two from
the V pipe. If they all come from, say, the U pipe, the queue will jamup and the processor will be delayed, even though the V pipe write bufferis empty. Basically, the right hand can't help out the left, even if itdoes know what it's doing.

On the Pentium with MMX, however, the two separate write buffers havebeen merged into a single, four-deep write buffer. Now you don't have toworry about which pipe is writing to memory.

Return Stack Buffer

The return stack buffer, which made its debut in the Pentium Pro, isalso present in Pentiums with MMX. Consider the following routine:




int ret;

ret = NextByte();

ret += NextByte();

ret += NextByte();

return ret;


A regular Pentium would try in vain to predict the return address ofthe NextByte() routine. Because there are three separate invocationsand three separate return addresses, the branch table prediction mechanismwould get totally confused and guess wrong every time. (A Pentium thinksof a ret instruction as a form of a branch.)

On a Pentium with MMX, the processor actually keeps a limited list ofstacked return addresses inside the chip, as well as on the external stackin memory. So when it sees a ret, it doesn't have to rely on thebranch predictor but references the internal stack instead, for a near-perfectprediction. The routine shown above will execute about a dozen cycles fasterunder MMX.

Faster Cache Delivery

On both the 486 and the Pentium, there is a marked delay between thetime a given cache miss completes and the time another reference to thesame cache line can complete. A typical scenario would be

mov al,[esi+0] ; 7;primary cache miss

mov bl,[esi+4] ; 7;same-line delay

;;;;14 cycles

The claim is made that on processors with MMX, cache line data willbe available for a second hit (and beyond) as soon as it arrives in thecache line, and you don't have to wait until the cache line is full anymore.If this is really true, then the preceding code should execute more likethis under MMX:

mov al,[esi+0] ; 9;primary cache miss

mov bl,[esi+4] ; 0;no delay in same 8 bytes

;;;;9 cycles

The difference in the timing of the first instruction is due to thereported larger delay associated with a cache miss under MMX.

If your cache miss hits somewhere in the first eight bytes of a cacheline, then the cache line will be delivered in eight-byte chunks from lowto high. If it hits in the top eight bytes, then it delivers in high-to-loworder. This way, applications that reference data serially in memory, whetherfrom low address to high or the other way around, will get data deliveredin the order in which it will most likely be needed. If it's really true,this will be a significant boon to applications that rip through memoryvery quickly.

Some Minuses

The delay for a primary cache miss usually measures in at about sixcycles on the Pentium (plus the normal one cycle for a memory reference).On a Pentium with MMX, it is claimed to be eight. And it's ten on a PentiumPro with MMX. So a cache miss is a little more expensive. On the plus side,with 16K of data cache and 16K of code cache, you should be missing muchless often.

If, on an MMX chip, you have two back-to-back branching statements andthey both end in the same four-byte-aligned, four-byte chunk of memory,the second branch will probably be mispredicted. If, as is usually thecase, the branches are short and take up only two bytes, then there's abouta 50-50 chance that back-to-back branches will have this problem in anygiven instance. The best, but very awkward, solution is to force the secondbranch to be a six-byte-long branch instead (with a 32-bit offset ratherthan 8-bit). It takes four more bytes, but it executes in the same time.This is, frankly, hard to do with some assemblers. MASM likes to optimizebranch sizes, so you could end up putting the branch in with data statements.It may be easier to put a pairable no-op such as mov ebx,ebx inbetween them for spacing. It will only end up costing a cycle on the Pentiumif the branch mispredicts.

The MMX Registers

The eight 64-bit MMX registers correspond exactly to the stack of eightfloating-point registers available in the floating-point unit. In fact,they correspond so exactly that if you read an MMX register that hasn'tbeen initialized, you'll see the mantissa of the last floating-point numberstored there. And if you don't take the proper precautions, you can makethe floating-point unit rather sick by interspersing MMX instructions withfloating-point instructions, since they use the same register bits.

Generally speaking, the floating-point instructions and the MMX instructionsshould be used in separate groups of instructions, because it takes a reported"up to 50 cycles" to revert from MMX code to floating-point code.Floating-point instructions and MMX instructions cannot be freely intermixed.

Each of the eight MMX registers, named mm0 through mm7,can be thought of and treated in one of three basic ways during the executionof any individual instruction:

* An MMX register can be eight independent 8-bit bytes.

* An MMX register can be four independent 16-bit words.

* An MMX register can be two independent 32-bit doublewords.

These are illustrated in Figure 9-1. Thepowerful aspect of MMX is that it can operate, usually in a single cycle,on all the independent units in a register in parallel, performing a varietyof logical, arithmetic, and housekeeping operations.

As an illustration of just how much MMX can do in a single cycle, considerthe following two instructions, which can pair together and add 16 individualbytes to 16 other individual bytes in just one cycle:

paddb mm0,mm1 ; 1;packed add mm1 to mm0

paddb mm2,mm3 ; 0;packed add mm3 to mm2

;;;;1 cycle

That's an order of magnitude faster than the main integer processorcan do the same thing.

When you want to revert to using floating-point instructions, you haveto issue an emms instruction to make the registers safe once againfor floating-point operations.

The MMX Instructions

In some ways, the MMX instructions are reminiscent of the floating-pointinstruction set. They don't set the processor's flags, or indeed any flags.There is no performance penalty for fetching one operand from memory insteadof from an MMX register. You can only generate arithmetic or logical resultsin registers, not in memory operands. And you can't store a computationalresult back to memory until the second cycle after its computation. Butother than these similarities, there is a much different--and very fast--integermachine under the hood.

Data Movement Instructions

The main data movement instruction is movq, which transfersquadwords (eight bytes) between MMX registers and memory as well as betweentwo MMX registers. Another data movement instruction, movd, transfersdoublewords back and forth between the low halves of MMX registers andthe processor's own 32-bit registers (like ebx).

Packing and Unpacking Instructions

Suppose you had a large array of signed words that you wanted to convertto signed bytes instead. You would probably want a word with value 300to be converted to
a byte of value 127, the largest possible signed byte. Likewise, ­144should become ­128. Well, MMX is good at pulling off this kind of transformation.It has instructions for packing words into bytes with or without saturation(signed or unsigned), which is the process of setting a smaller data itemto its largest or smallest possible value when it can't adequately representthe larger data item it came from.

The three basic packing instructions are

* packssdw Pack with signed saturation, dwords to words

* packsswb Pack with signed saturation, words to bytes

* packuswb Pack with unsigned saturation, words to bytes

To pack eight signed words located at [esi] into eight signedbytes in register mm0, you would do this:

movq mm0,[esi] ; ;load 4 words into mm0

packsswb mm0,[esi+8] ; ;pack with 4 words in mem

When unpacking the same data (the reverse process), you might expectto have a signed and an unsigned unpack instruction. But you don't. Instead,you take two MMX registers and then intermix them into one register, usingeither the four high bytes of both registers or the four low bytes of bothregisters. The eight bytes in the destination register thus come alternatelyfrom one source register and then
the other. If you want to do sign extension, one of the operands has tobe precalculated to hold sign extension bytes.

The unpacking instructions, which operate on bytes, words, and doublewordsto produce words, doublewords, and quadwords, respectively, are

* punpcklbw Unpack from low halves, bytes to words

* punpckhbw Unpack from high halves, bytes to words

* punpcklwd Unpack from low halves, words to doublewords

* punpckhwd Unpack from high halves, words to doublewords

* punpckldq Unpack from low halves, doublewords to quadwords

* punpckhdq Unpack from high halves, doublewords to quadwords

Here is how you would zero-extend eight bytes in mm0 into eightwords in mm0 and mm1, assuming quadzero is azero quadword memory variable:

movq mm1,mm0 ; ;copy the register

punpcklbw mm0,quadzero ; ;expand 4 low bytes

punpckhbw mm1,quadzero ; ;expand 4 high bytes

Addition and Subtraction

Addition and subtraction come with the option of value saturation, muchlike the way the packing instructions saturate packed values. If you addtwo MMX registers via byte-wise addition, you can choose no saturationoption (values wrap) or saturation according to either signed or unsignedarithmetic. The addition operations are

* paddb Add by bytes, ignoring overflows (wraps)

* paddsb Add by bytes, signed bytes with saturation

* paddusb Add by bytes, unsigned bytes with saturation

* paddw Add by words, ignoring overflows (wraps)

* paddsw Add by words, signed words with saturation

* paddusw Add by words, unsigned words with saturation

* paddd Add by doublewords, ignoring overflows

There are subtraction operations exactly equivalent to each of the additionoperations.

Bitwise Logical Operations

There are four 64-bit bitwise logical operators. There is no distinctionbetween byte, word, doubleword, and quadword forms, because a bitwise operatorworks the same way regardless. The operators are

* pand A 64-bit and operation

* por A 64-bit or operation

* pxor A 64-bit xor operation

* pnand Inverts the bits of the destination, then does a 64-bitand

Sometimes you may want to combine two quadwords into a single quadword,using a mask of bits to indicate which of the two quadwords the correspondingbit comes from. To do this, you and the mask with one quadwordand then invert the bits of the mask (logical not) and andit with the other one; then you or the two intermediates together.The pnand operation is specifically there to serve the invert-and-andfunction in one step.

Mask-Making Instructions

Suppose you have some sort of graphic image stored in memory at onebyte per pixel, and you want the bytes with value 211 in the image to betransparent when you overlay the image on a bitmap. Let [esi]be the pointer to a piece of the image, let [edi] be the pointerto the area where you would like to overlay it, and let mmx211be a memory quadword filled with bytes of value 211. Then this sequencewould perform the overlay with transparency:

movq mm0,[esi] ; ;fetch image to overlay

movq mm1,[edi] ; ;fetch background

movq mm2,mmx211 ; ;fetch bytes for mask test

pcmpeqb mm2,mm0 ; ;make transparent spots 0ffH

pand mm1,mm2 ; ;mask only visible background

pnand mm2,mm0 ; ;mask only visible image

por mm1,mm2 ; ;combine background and image

movq [edi],mm1 ; ;and store the result

The comparison instruction in this case put 0ffH in any byteposition that matched exactly in both operands, or zero otherwise. Thereis another comparison operation that deposits 0ffH if the bytein a given position in the first operand is greater than the byte in thesecond operand at that position, according to signed arithmetic. Theseinstructions can operate on bytes, words, or doublewords. There are sixforms:

* pcmpeqb Compare by bytes. If equal, set byte to 0ffH;otherwise, set to 0

* pcmpeqw Compare by words. If equal, set word to 0ffffH;otherwise, set to 0

* pcmpeqd Compare by dwords. If equal, set dword to 0ffffffffH;otherwise, set to 0

* pcmpgtb Compare by bytes. If greater than, set byte to 0ffH;otherwise, set to 0

* pcmpgtw Compare by words. If greater than, set word to 0ffffH;otherwise, set to 0

* pcmpgtd Compare by dwords. If greater than, set dword to0ffffffffH; otherwise, set to 0


Although there is no 64-bit addition instruction in MMX, there isa pair of 64-bit shifts. On the other hand, there are no shifts that operateon bytes. Although the missing byte shifts are to be slightly mourned,the full 64-bit shift is useful for coping with data that doesn't lie onnatural eight-byte boundaries.

Here are the shift operations:

* psllw Shift left logical by words

* pslld Shift left logical by doublewords

* psslq Shift left logical by quadwords

* psrlw Shift right logical by words

* psrld Shift right logical by doublewords

* psrlq Shift right logical by quadwords

* psraw Shift right arithmetic by words

* psrad Shift right by doublewords

The amount of the shift can be specified as an immediate value:

psllw mm0,7 ; ;shift each word by 7

Or the amount(s) can be read from another quadword:

psllw mm0,mm1 ; ;shift by mm1 amounts

Note that in the preceding example, mm1 could possibly containfour different values in its four component words, in which case the fourwords of mm0 would each be shifted by a different amount.


All MMX multiplication is based on the notion of multiplying the foursigned words of one quadword by the four signed words of another quadword.Of course, multiplying two 16-bit values gives a 32-bit value, so the threeresult options are as follows: take the low 16 bits of all four results;take the high 16 bits of all four results; or take two 32-bit sums of theresults of two pairs of multiplications.

The instructions are

* pmulhw Multiply and return the high word

* pmullw Multiply and return the low word

* pmaddwd Multiply and return two doubleword multiplicationsums

Compared to conventional integer multiplication, the MMX forms are blazinglyfast.

MMX Instruction Timing and Pairing

MMX instructions are fast. Other than the emms instruction--whichends an MMX session (and is implied to take up to 50 cycles)--and the multiplicationinstructions, all the MMX instructions take just one cycle, and most arepairable. The multiplication instructions take three cycles to complete,but because they are heavily pipelined, you can issue a multiplicationinstruction on every cycle, and as long as you don't try to read the resultsuntil three cycles later, you won't encounter a stall.

Besides the typical Pentium pairing constraints (no read or write afterwrite, and so on), the biggest single constraint on paired instructionsis that only the first instruction of a pair (the one in the U pipe) canread or write to memory or to a processor 32-bit register. Less inhibiting,you can't issue two multiplication instructions in the same cycle, andyou can't issue two pack/unpack/shift instructions in the same cycle (whereaspacking and unpacking are really types of shift operations and presumablyuse the shifter's engine). And finally, an MMX instruction can pair witha regular processor instruction as long as that regular instruction doesn'taccess memory. For instance, add edx,esi would pair with an MMXinstruction just fine.

According to Intel's documentation, if you write to video memory overa PCI bus, you are likely to see a performance improvement writing fromthe MMX registers over writing the same data with 32-bit writes. And ifyou're doing a large memory-to-memory copy (not in a cache), you mightdo well to write a loop that loads and stores 64 bits at a time. On somePentiums, they say, a performance gain of up to 30 percent has been seen.(This is comparable to the performance gain seen when using the floating-pointunit to do memory transfers, as described in Chapter 10.)

One final thing to remember about MMX timing, already mentioned, isthat an MMX register isn't ready to be written to memory for an extra cycleafter its last computation completes.

Software Emulation of MMX

If software is to be built that can opt to use MMX if present--or performequivalent actions if it is not--the first question that comes to mindis, "How much faster is MMX-enabled code than work-alike code on,say, a Pentium?" It depends on what you're doing.

Packing with Signed Saturation

First consider the problem of packing an array of words into bytes usingsigned saturation, represented by this MMX code:

movq mm0,[esi] ; 1;load 4 words

packsswb mm0,[esi+8] ; 1;pack with 4 more words

movq [edi],mm0 ; 1;store

;;;;3 cycles

Listing 9-1 shows the fastest equivalent Pentium code I could thinkof to do the same thing. At 24 cycles, it takes eight times longer. The64K table should not be a cache miss problem unless the data is very badlybehaved and much of it is wildly out of the saturation range.

Listing 9-1 Pentium Equivalent of MMX packsswb

;on entry:

; ebx - pointer to 64K saturation table

; ecx - 0

; edx - 0

; esi - source pointer

; edi - destination pointer


mov cx,[esi+6] ; 2;pick up 4th word

mov dx,[esi+4] ; 2;pick up 3rd word

mov ah,[ecx+ebx] ; 1;saturate 4th byte

mov cx,[esi+2] ; 2;pick up 2nd word

mov al,[edx+ebx] ; 0;saturate 3rd byte

mov dx,[esi+0] ; 2;pick up 1st word

shl eax,16 ; 1;boost al,ah up 16 bits

mov ah,[ecx+ebx] ; 1;saturate 2nd byte

mov cx,[esi+14] ; 2;pick up 8th word

mov al,[edx+ebx] ; 0;saturate 1st byte

mov dx,[esi+12] ; 2;pick up 7th word

mov [edi],eax ; 0;store 4 saturated bytes

mov ah,[ecx+ebx] ; 1;saturate 8th byte

mov cx,[esi+10] ; 2;pick up 6th word

mov al,[edx+ebx] ; 0;saturate 7th byte

mov dx,[esi+8] ; 2;pick up 5th word

shl eax,16 ; 1;boost al,ah up 16 bits

mov ah,[ecx+ebx] ; 1;saturate 6th byte

mov al,[edx+ebx] ; 1;saturate 5th byte

mov [edi],eax ; 1;store 4 saturated bytes

;;;;24 cycles

Packing with Unsigned Saturation

Unsigned saturation can run about twice as fast on a Pentium if thereare few bytes to be saturated. If you're willing to use an algorithm thatmodifies the source array in order to gain a few cycles, you can writean MMX instruction equivalent for packuswb in 13 cycles. Listing9-2 shows the first half of the algorithm. The second half repeats eightbytes higher and saves a cycle by overlapping with the first half.

Listing 9-2 Unsigned Saturation of Four Bytes, as in packuswb

;on entry:

; ebp - 0ff00ff00H

; esi - source pointer

; edi - destination pointer

mov eax,[esi+0] ; 1;get first 2 words

mov ebx,[esi+4] ; 0;get second 2 words

.if eax & ebp ; 1;check for saturation

.if eax & 0ff00H ; ;maybe saturate byte 1

mov al,0ffH ; ;

.endif ; ;

.if eax & 0ff000000H ; ;maybe saturate byte 2

mov byte ptr [esi+2],0ffH ; ;modify source!

.endif ; ;

.endif ; ;

.if ebx & ebp ; 1;check for saturation

.if ebx & 0ff00H ; ;maybe saturate 3

mov bl,0ffH ; ;

.endif ; ;

.if ebx & 0ff000000H ; ;maybe saturate 4

mov byte ptr [esi+6],0ffH ; ;modify source!

.endif ; ;

.endif ; ;

and eax,0ffffH ; 1;clear top of eax

mov bh,[esi+6] ; 0;bx has byte pair

shl ebx,16 ; 1;ebx ready for or

mov ah,[esi+2] ; 0;ax has byte pair

or eax,ebx ; 1;compose 4 bytes

mov [edi],eax ; 1;and deposit

;;;;7 cycles


Pipelined parallel multiplication is obviously a large win. It takesa Pentium
44 cycles to execute four 16-bit signed multiplication instructions, whereasMMX can cut the effective cost down to half a cycle if the instructionis paired and the results are not accessed for three cycles. You can multiplythe elements of two memory word arrays into a third word array using MMXat an overall cost approaching 0.75 cycles per array element (assumingreasonably large arrays in the primary cache and overlapped load-multiply-storesequences). In fact, multiplication is so fast in this instance that itcan be rate-limited at perhaps half its top speed if the arrays aren'tin cache. The best a raw Pentium can do, in contrast, is about 15 cyclesper element, although it can deliver full 32-bit results.

Performing 16-bit array multiplication with 32-bit results is actuallya very interesting optimization problem. The following sequence of instructionsmultiplies two quadwords to produce two quadwords:

movq mm0,[esi] ; ;fetch from input array 1

movq mm1,[edx] ; ;fetch from input array 2

movq mm2,mm0 ; ;copy array 1 data

pmullw mm0,mm1 ; ;multiply low into mm0

pmulhw mm1,mm2 ; ;multiply high into mm1

movq mm2,mm0 ; ;duplicate low result

punpcklwd mm0,mm1 ; ;unpack low quadword

punpckhwd mm2,mm1 ; ;unpack high quadword

movq [edi],mm0 ; ;store 2 low doublewords

movq [edi+8],mm2 ; ;store 2 high doublewords

The tricky part about optimizing code like this is in obeying all thespecial constraints, like one memory operation per cycle, the three-cyclewait for multiplications to complete, and the one-cycle delay before unloadingresults to memory. Code like this is a perfect candidate for tandem loops.

Listing 9-3 shows an example of a high-efficiency loop that producesa 32-bit array from two input 16-bit arrays. Not shown (and not even written)is the entry and exit code, both of which are tedious problems in and ofthemselves. The two instructions of loop overhead are nestled in betweenotherwise unpaired MMX instructions, and there is only one stall cycle,due to a multiplication result that was read early. The end result is whatappears to be a 12-cycle loop that produces eight 32-bit results. That's1.5 cycles per result, a clean factor of 10 faster than a plain Pentium.

Listing 9-3 16 * 16 * 32-Bit Multiplication Inner Loop

.repeat ; ;

movq [edi+ecx*2-16],mm3 ; 1;2

movq mm2,mm0 ; 0;1

movq [edi+ecx*2-8],mm5 ; 1;2

pmullw mm0,mm1 ; 0;1

movq mm3,[esi+ecx+8] ; 1;2 <- loop 2 start

pmulhw mm1,mm2 ; 0;1

movq mm4,[edx+ecx+8] ; 1;2

movq mm5,mm3 ; 0;2

pmullw mm3,mm4 ; 1;2

movq mm2,mm0 ; 0;1

pmulhw mm4,mm5 ; 1;2

punpcklwd mm0,mm1 ; 0;1

movq mm5,mm3 ; 2;2 <- pmullw stall

punpckhwd mm2,mm1 ; 0;1

movq [edi+ecx*2],mm0 ; 1;1

punpcklwd mm3,mm4 ; 0;2

movq [edi+ecx*2+8],mm2 ; 1;1

punpckhwd mm5,mm4 ; 0;2

movq mm0,[esi+ecx+16] ; 1;1 <- loop 1 start

add ecx,16 ; 0;

movq mm1,[edx+ecx+16] ; 1;1

.until ZERO? ; 0;

;;;;12 cycles

Overlaying Graphics

A very common procedure in graphic animation is to overlay one graphicon top of another. Typically both objects are stored as bitmaps, and thereis one unique "color" in the graphic image that indicates transparency.For this example, assume the bitmaps consist of one byte per pixel (thevery common 256-color-lookup screen mode) and that the transparent "color"is any byte in the object that has a value of zero. Furthermore, assumethat the alignment of the graphic object in memory is different from whatits alignment will be when it is overlaid. This is a reasonable generalassumption, since smooth animation often requires placing objects at everypossible alignment.

Listing 9-4 shows a reasonable Pentium solution to the problem. It isnatural to want to align the reads and writes of the background data ondoubleword boundaries, but this generally means that corresponding doublewordreads from the object being overlaid will probably not be doubleword-aligned.There is a strong disincentive to read the object data a doubleword ata time at arbitrary alignment, as this will generally cause a three-cycledelay. (To be honest, though, if there is a high probability that doublewordswill be consistently transparent or opaque while reading through the data,there are somewhat more efficient solutions that do take the three-cyclepenalty.) Randomly mixed transparent and opaque bytes, though, will doquite well with the loop in Listing 9-4, which if further unrolled willaverage, at best, about three cycles per pixel but will run closer to fouror five if branch misprediction becomes frequent (due to frequent transparent-to-opaquechanges).

Listing 9-4 A Pentium Graphic Overlay Loop

.repeat ; ;

mov eax,[edi+ecx] ; 1;pick up background

mov bl,[esi+ecx+0] ; 0;

.if bl ; 1;decide about 1st byte

mov al,bl ; 1;overlay it

.endif ; ;

mov bl,[esi+ecx+1] ; 0;

.if bl ; 1;decide about 2nd byte

mov ah,bl ; 1;overlay it

.endif ; ;

mov bl,[esi+ecx+2] ; 0;

rol eax,16 ; 1;

.if bl ; 1;decide about 3rd byte

mov al,bl ; 1;overlay it

.endif ; ;

mov bl,[esi+ecx+3] ; 0;

.if bl ; 1;decide about 4th byte

mov ah,bl ; 1;overlay it

.endif ; ;

rol eax,16 ; 1;

mov [edi+ecx],eax ; 1;deposit overlayed data

add ecx,4 ; 0;advance counter/pointer

.until ZERO? ; 1;and loop

;;;;13 cycles

Now, how fast is MMX? The MMX is so fast that a three-cycle penaltyfor a misaligned read is much more troublesome than it would be in thePentium loop. The object of the loop design, therefore, is to read boththe object bitmap and the background bitmap along quadword-aligned boundariesand use internal shifts to register the data correctly. Listing 9-5 showsthe basic loop sequence that achieves this.

Listing 9-5 Overlaying an Image with Transparency

on entry:

; mm7 - contains 0

; mm6 - contains amount by which to shift left

; mm5 - contains amount by which to shift right

; esi - points to aligned object data

; edi - points to aligned background data

movq mm0,[esi] ; ;fetch 8 pixels of sprite

movq mm2,mm0 ; ;and make a copy

psllq mm2,mm6 ; ;shift both left

psrlq mm0,mm5 ; ;and right

por mm2,mm1 ; ;build shifted qword

movq mm3,mm7 ; ;copy a qword of zeroes

pcmpeqb mm3,mm2 ; ;make a transparency mask

pnand mm3,[edi] ; ;mask against background

por mm3,mm2 ; ;combine fgnd,bgnd

movq [edx],mm3 ; ;and output it

By interleaving two copies of the loop, it is possible to get completepairing and a tandem loop that can be unrolled at ten cycles per tandem(double) iteration, as shown in Listing 9-6.

Listing 9-6 Fast Image Overlay Loop

.repeat ; ;

movq mm0,[esi+ecx] ; 1;1 <- loop 1 start

por mm4,mm0 ; 0;2

movq [edx+ecx-16],mm3 ; 1;1

movq mm3,mm7 ; 0;2

pcmpeqb mm3,mm4 ; 1;2

movq mm2,mm0 ; 0;1

pnand mm3,[edi+ecx] ; 1;2

psllq mm2,mm6 ; 0;1

por mm3,mm4 ; 1;2

psrlq mm0,mm5 ; 0;1

movq mm1,[esi+ecx+8] ; 1;2 <- loop 2 start

por mm2,mm1 ; 0;1

movq [edx+ecx],mm3 ; 1;2

movq mm3,mm7 ; 0;1

movq mm4,mm1 ; 1;2

pcmpeqb mm3,mm2 ; 0;1

pnand mm3,[edi+ecx] ; 1;1

psllq mm4,mm6 ; 0;2

psrlq mm1,mm5 ; 1;2

por mm3,mm2 ; 0;1

add ecx,16 ; 1;

.until ZERO? ; 0;

;;;;11 cycles

In just over five cycles per eight bytes, the MMX performs the overlayin about 0.7 cycles per byte, despite the work of shifting to achieve alignment.This is typically about four to five times faster than the Pentium-onlyclass of solutions. Parallelism does make a difference.


The MMX enhancements to the Pentium and Pentium Pro appear to be ableto speed up various multimedia-related operations by a factor of five toten over base Pentium performance. Put another way, if software that isheavily dependent on MMX to achieve high performance has to run on an ordinaryPentium or Pentium Pro in emulation, then it may end up running severaltimes slower.

It's a good bet that many, if not most, optimized MMX processing loopswill really be two loops operating in tandem to achieve instruction pairing.In many instances the eight MMX registers are just enough to supply twosets of registers for two parallel loops.


Intel Corporation. March 1996. Intel Architecture MMX TechnologyDeveloper's Manual. Santa Clara, Calif.: Intel Corporation. The developer'smanual gives optimization advice and practical code examples.

Intel Corporation. March 1996. Intel Architecture MMX TechnologyProgrammer's Reference Manual. Santa Clara, Calif.: Intel Corporation.The instructions are fully specified in the programmer's manual.


Submit Errata

More Information

Unlimited one-month access with your purchase
Free Safari Membership