Home > Articles > Hardware

This chapter is from the book

3.3 The PowerPC 970FX

3.3.1 At a Glance

In this section, we will look at details of the PowerPC 970FX. Although several parts of the discussion could apply to other PowerPC processors, we will not attempt to identify such cases. Table 3–4 lists the important technical specifications of the 970FX.

Table 3–4. The PowerPC 970FX at a Glance

Feature

Details

Architecture

64-bit PowerPC AS, [a] with support for 32-bit operating system bridge facility

Extensions

Vector/SIMD Multimedia extension (VMX [b] )

Processor clock frequency

Up to 2.7GHz [c]

Front-side bus frequency

Integer fraction of processor clock frequency

Data-bus width

128 bits

Address-bus width

42 bits

Maximum addressable physical memory

4TB (242 bytes)

Address translation

65-bit virtual addresses, 42-bit real addresses, support for large (16MB) virtual memory pages, a 1024-entry translation lookaside buffer (TLB), and a 64-entry segment lookaside buffer (SLB)

Endianness

Big-endian; optional little-endian facility not implemented

L1 I-cache

64KB, direct-mapped, with parity

L1 D-cache

32KB, two-way set-associative, with parity

L2 cache

512KB, eight-way set-associative, with ECC, fully inclusive of L1 D-cache

L3 cache

None

Cache line width

128 bytes for all caches

Instruction buffer

32 entries

Instructions/cycle

Up to five (up to four nonbranch + up to one branch)

General-purpose registers

32x64-bit

Floating-point registers

32x64-bit

Vector registers

32x128-bit

Load/Store Units

Two units, with 64-bit data paths

Fixed-Point Units

Two asymmetrical [d] 64-bit units

Floating-Point Units

Two 64-bit units, with support for IEEE-754 double-precision floating-point, hardware fused multiply-add, and square root

Vector units

A 128-bit unit

Condition Register Unit

For performing logical operations on the Condition Register (CR)

Execution pipeline

Ten execution pipelines, with up to 25 stages in a pipeline, and up to 215 instructions in various stages of execution at a time

Power management

Multiple software-initialized power-saving modes, PowerTune frequency and voltage scaling

3.3.2 Caches

A multilevel cache hierarchy is a common aspect of modern processors. A cache can be defined as a small chunk of very fast memory that stores recently used data, instructions, or both. Information is typically added and removed from a cache in aligned quanta called cache lines. The 970FX contains several caches and other special-purpose buffers to improve memory performance. Figure 3–4 shows a conceptual diagram of these caches and buffers.

singhfig3-4.gif

Figure 3–4 Cache(s)970FXL1 and L2 cachesCaches and buffers in the 970FX

3.3.2.1 L1 and L2 Caches

The level 1 (L1) cache is closest to the processor. Memory-resident information must be loaded into this cache before the processor can use it, unless that portion of memory is marked noncacheable. For example, when a load instruction is being executed, the processor refers to the L1 cache to see if the data in question is already held by a currently resident cache line. If so, the data is simply loaded from the L1 cache—an L1 cache hit. This operation takes only a few processor cycles as compared to a few hundred cycles for accessing main memory. [23] If there is an L1 miss, the processor checks the next level in the cache hierarchy: the level 2 (L2) cache. An L2 hit would cause the cache line containing the data to be loaded into the L1 cache and then into the appropriate register. The 970FX does not have level 3 (L3) caches, but if it did, similar steps would be repeated for the L3 cache. If none of the caches contains the requested data, the processor must access main memory.

As a cache line's worth of data is loaded into L1, a resident cache line must be flushed to make room for the new cache line. The 970FX uses a pseudo-least-recently-used (LRU) algorithm [24] to determine which cache line to evict. Unless instructed otherwise, the evicted cache line is sent to the L2 cache, which makes L2 a victim cache. Table 3–5 shows the important properties of the 970FX's caches.

Table 3–5. 970FX Caches

Property

L1 I-cache

L1 D-cache

L2 Cache

Size

64KB

32KB

512KB

Type

Instructions

Data

Data and instructions

Associativity

Direct-mapped

Two-way set-associative

Eight-way set-associative

Line size

128 bytes

128 bytes

128 bytes

Sector size

32 bytes

Number of cache lines

512

256

4096

Number of sets

512

128

512

Granularity

1 cache line

1 cache line

1 cache line

Replacement policy

LRU

LRU

Store policy

Write-through, with no allocate-on-store-miss

Write-back, with allocate-on-store-miss

Index

Effective address

Effective address

Physical address

Tags

Physical address

Physical address

Physical address

Inclusivity

Inclusive of L1 D-cache

Hardware coherency

No

Yes

Yes, standard MERSI cache-coherency protocol

Enable bit

Yes

Yes

No

Reliability, availability, and serviceability (RAS)

Parity, with invalidate-on-error for data and tags

Parity, with invalidate-on-error for data and tags

ECC on data, parity on tags

Cache locking

No

No

No

Demand load latencies (typical)

3, 5, 4, 5 cycles for GPRs, FPRs, VPERM, and VALU, respectively [a]

11, 12, 11, 11 cycles for GPRs, FPRs, VPERM, and VALU, respectivelya

You can retrieve processor cache information using the sysctl command on Mac OS X as shown in Figure 3–5. Note that the hwprefs command is part of Apple's CHUD Tools package.

Example 3–5. Retrieving processor cache information using the sysctl command

$ sudo hwprefs machine_type # Power Mac G5 Dual 2.5GHz
PowerMac7,3
$ sysctl -a hw
...
hw.cachelinesize: 128
hw.l1icachesize: 65536
hw.l1dcachesize: 32768
hw.l2settings = 2147483648
hw.l2cachesize: 524288
...
$ sudo hwprefs machine_type # Power Mac G5 Quad 2.5GHz
PowerMac11,2
$ sysctl -a hw
...
hw.cachelinesize = 128
hw.l1icachesize = 65536
hw.l1dcachesize = 32768
hw.l2settings = 2147483648
hw.l2cachesize = 1048576
...

3.3.2.2 Cache Properties

Let us look more closely at some of the cache-related terminology used in Table 3–5.

Associativity

As we saw earlier, the granularity of operation for a cache—that is, the unit of memory transfers in and out of a cache—is a cache line (also called a block). The cache line size on the 970FX is 128 bytes for both the L1 and L2 caches. The associativity of a cache is used to determine where to place a cache line's worth of memory in the cache.

If a cache is m-way set-associative, then the total space in the cache is conceptually divided into sets, with each set containing m cache lines. In a set-associative cache, a block of memory can be placed only in certain locations in the cache: It is first mapped to a set in the cache, after which it can be stored in any of the cache lines within that set. Typically, given a memory block with address B, the target set is calculated using the following modulo operation:

target set = B MOD {number of sets in cache}

A direct-mapped cache is equivalent to a one-way set-associative cache. It has the same number of sets as cache lines. This means a memory block with address B can exist only in one cache line, which is calculated as the following:

target cache line = B MOD {number of cache lines in cache}

Store Policy

A cache's store policy defines what happens when an instruction writes to memory. In a write-through design, such as the 970FX L1 D-cache, information is written to both the cache line and to the corresponding block in memory. There is no L1 D-cache allocation on write misses—the affected block is modified only in the lower level of the cache hierarchy and is not loaded into L1. In a write-back design, such as the 970FX L2 cache, information is written only to the cache line—the affected block is written to memory only when the cache line is replaced.

MERSI

Only the L2 cache is physically mapped, although all caches use physical address tags. Stores are always sent to the L2 cache in addition to the L1 cache, as the L2 cache is the data coherency point. Coherent memory systems aim to provide the same view of all devices accessing the memory. For example, it must be ensured that processors in a multiprocessor system access the correct data—whether the most up-to-date data resides in main memory or in another processor's cache. Maintaining such coherency in hardware introduces a protocol that requires the processor to "remember" the state of the sharing of cache lines. [25] The L2 cache implements the MERSI cache-coherency protocol, which has the following five states.

  1. Modified—This cache line is modified with respect to the rest of the memory subsystem.
  2. Exclusive—This cache line is not cached in any other cache.
  3. Recent—The current processor is the most recent reader of this shared cache line.
  4. Shared—This cache line was cached by multiple processors.
  5. Invalid—This cache line is invalid.

RAS

The caches incorporate parity-based error detection and correction mechanisms. Parity bits are additional bits used along with normal information to detect and correct errors in the transmission of that information. In the simplest case, a single parity bit is used to detect an error. The basic idea in such parity checking is to add an extra bit to each unit of information—say, to make the number of 1s in each unit either odd or even. Now, if a single error (actually, an odd number of errors) occurs during information transfer, the parity-protected information unit would be invalid. In the 970FX's L1 cache, parity errors are reported as cache misses and therefore are implicitly handled by refetching the cache line from the L2 cache. Besides parity, the L2 cache implements an error detection and correction scheme that can detect double errors and correct single errors by using a Hamming code. [26] When a single error is detected during an L2 fetch request, the bad data is corrected and actually written back to the L2 cache. Thereafter, the good data is refetched from the L2 cache.

3.3.3 Memory Management Unit (MMU)

During virtual memory operation, software-visible memory addresses must be translated to real (or physical) addresses, both for instruction accesses and for data accesses generated by load/store instructions. The 970FX uses a two-step address translation mechanism [27] based on segments and pages. In the first step, a software-generated 64-bit effective address (EA) is translated to a 65-bit virtual address (VA) using the segment table, which lives in memory. Segment table entries (STEs) contain segment descriptors that define virtual addresses of segments. In the second step, the virtual address is translated to a 42-bit real address (RA) using the hashed page table, which also lives in memory.

The 32-bit PowerPC architecture provides 16 segment registers through which the 4GB virtual address space can be divided into 16 segments of 256MB each. The 32-bit PowerPC implementations use these segment registers to generate VAs from EAs. The 970FX includes a transitional bridge facility that allows a 32-bit operating system to continue using the 32-bit PowerPC implementation's segment register manipulation instructions. Specifically, the 970FX allows software to associate segments 0 through 15 with any of the 237 available virtual segments. In this case, the first 16 entries of the segment lookaside buffer (SLB), which is discussed next, act as the 16 segment registers.

3.3.3.1 SLB and TLB

We saw that the segment table and the page table are memory-resident. It would be prohibitively expensive if the processor were to go to main memory not only for data fetching but also for address translation. Caching exploits the principle of locality of memory. If caching is effective, then address translations will also have the same locality as memory. The 970FX includes two on-chip buffers for caching recently used segment table entries and page address translations: the segment lookaside buffer (SLB) and the translation lookaside buffer (TLB), respectively. The SLB is a 64-entry, fully associative cache. The TLB is a 1024-entry, four-way set-associative cache with parity protection. It also supports large pages (see Section 3.3.3.4).

3.3.3.2 Address Translation

Figure 3–6 depicts address translation in the 970FX MMU, including the roles of the SLB and the TLB. The 970FX MMU uses 64-bit or 32-bit effective addresses, 65-bit virtual addresses, and 42-bit physical addresses. The presence of the DART introduces another address flavor, the I/O address, which is an address in a 32-bit address space that maps to a larger physical address space.

singhfig3-6.gif

Figure 3–6 Address translation in the 970FX MMU

The 65-bit extended address space is divided into pages. Each page is mapped to a physical page. A 970FX page table can be as large as 231 bytes (2GB), containing up to 224 (16 million) page table entry groups (PTEGs), where each PTEG is 128 bytes.

As Figure 3–6 shows, during address translation, the MMU converts program-visible effective addresses to real addresses in physical memory. It uses a part of the effective address (the effective segment ID) to locate an entry in the segment table. It first checks the SLB to see if it contains the desired STE. If there is an SLB miss, the MMU searches for the STE in the memory-resident segment table. If the STE is still not found, a memory access fault occurs. If the STE is found, a new SLB entry is allocated for it. The STE represents a segment descriptor, which is used to generate the 65-bit virtual address. The virtual address has a 37-bit virtual segment ID (VSID). Note that the page index and the byte offset in the virtual address are the same as in the effective address. The concatenation of the VSID and the page index forms the virtual page number (VPN), which is used for looking up in the TLB. If there is a TLB miss, the memory-resident page table is looked up to retrieve a page table entry (PTE), which contains a real page number (RPN). The RPN, along with the byte offset carried over from the effective address, forms the physical address.

3.3.3.3 Caching the Caches: ERATs

Information from the SLB and the TLB may be cached in two effective-to-real address translation caches (ERATs)—one for instructions (I-ERAT) and another for data (D-ERAT). Both ERATs are 128-entry, two-way set-associative caches. Each ERAT entry contains effective-to-real address translation information for a 4KB block of storage. Both ERATs contain invalid information upon power-on. As shown in Figure 3–6, the ERATs represent a shortcut path to the physical address when there is a match for the effective address in the ERATs.

3.3.3.4 Large Pages

Large pages are meant for use by high-performance computing (HPC) applications. The typical page size of 4KB could be detrimental to memory performance in certain circumstances. If an application's locality of reference is too wide, 4KB pages may not capture the locality effectively enough. If too many TLB misses occur, the consequent TLB entry allocations and the associated delays would be undesirable. Since a large page represents a much larger memory range, the number of TLB hits should increase, as the TLB would now cache translations for larger virtual memory ranges.

It is an interesting problem for the operating system to make large pages available to applications. Linux provides large-page support through a pseudo file system (hugetlbfs) that is backed by large pages. The superuser must explicitly configure some number of large pages in the system by preallocating physically contiguous memory. Thereafter, the hugetlbfs instance can be mounted on a directory, which is required if applications intend to use the mmap() system call to access large pages. An alternative is to use shared memory calls—shmat() and shmget(). Files may be created, deleted, mmap()'ed, and munmap()'ed on hugetlbfs. It does not support reads or writes, however. AIX also requires separate, dedicated physical memory for large-page use. An AIX application can use large pages either via shared memory, as on Linux, or by requesting that the application's data and heap segments be backed by large pages.

Note that whereas the 970FX TLB supports large pages, the ERATs do not; large pages require multiple entries—corresponding to each referenced 4KB block of a large page—in the ERATs. Cache-inhibited accesses to addresses in large pages are not permitted.

3.3.3.5 No Support for Block Address Translation Mechanism

The 970FX does not support the Block Address Translation (BAT) mechanism that is supported in earlier PowerPC processors such as the G4. BAT is a software-controlled array used for mapping large—often much larger than a page—virtual address ranges into contiguous areas of physical memory. The entire map will have the same attributes, including access protection. Thus, the BAT mechanism is meant to reduce address translation overhead for large, contiguous regions of special-purpose virtual address spaces. Since BAT does not use pages, such memory cannot be paged normally. A good example of a scenario where BAT is useful is that of a region of framebuffer memory, which could be memory-mapped effectively via BAT. Software can select block sizes ranging from 128KB to 256MB.

On PowerPC processors that implement BAT, there are four BAT registers each for data (DBATs) and instructions (IBATs). A BAT register is actually a pair of upper and lower registers, which are accessible from supervisor mode. The eight pairs are named DBAT0U-DBAT3U, DBAT0L-DBAT3L, IBAT0U-IBAT3U, and IBAT0L-IBAT3L. The contents of a BAT register include a block effective page index (BEPI), a block length (BL), and a block real page number (BRPN). During BAT translation, a certain number of high-order bits of the EA—as specified by BL—are matched against each BAT register. If there is a match, the BRPN value is used to yield the RA from the EA. Note that BAT translation is used over page table translation for storage locations that have mappings in both a BAT register and the page table.

3.3.4 Miscellaneous Internal Buffers and Queues

The 970FX contains several miscellaneous buffers and queues internal to the processor, most of which are not visible to software. Examples include the following:

  • A 4-entry (128 bytes per entry) Instruction Prefetch Queue logically above the L1 I-cache
  • Fetch buffers in the Instruction Fetch Unit and the Instruction Decode Unit
  • An 8-entry Load Miss Queue (LMQ) that tracks loads that missed the L1 cache and are waiting to receive data from the processor's storage subsystem
  • A 32-entry Store Queue (STQ) [28] for holding stores that can be written to cache or memory later
  • A 32-entry Load Reorder Queue (LRQ) in the Load/Store Unit (LSU) that holds physical addresses for tracking the order of loads and watching for hazards
  • A 32-entry Store Reorder Queue (SRQ) in the LSU that holds physical addresses and tracks all active stores
  • A 32-entry Store Data Queue (SDQ) in the LSU that holds a double word of data
  • A 12-entry Prefetch Filter Queue (PFQ) for detecting data streams for prefetching
  • An 8-entry (64 bytes per entry) fully associative Store Queue for the L2 cache controller

3.3.5 Prefetching

Cache miss rates can be reduced through a technique called prefetching—that is, fetching information before the processor requests it. The 970FX prefetches instructions and data to hide memory latency. It also supports software-initiated prefetching of up to eight data streams called hardware streams, four of which can optionally be vector streams. A stream is defined as a sequence of loads that reference more than one contiguous cache line.

The prefetch engine is a functionality of the Load/Store Unit. It can detect sequential access patterns in ascending or descending order by monitoring loads and recording cache line addresses when there are cache misses. The 970FX does not prefetch store misses.

Let us look at an example of the prefetch engine's operation. Assuming no prefetch streams are active, the prefetch engine will act when there is an L1 D-cache miss. Suppose the miss was for a cache line with address A; then the engine will create an entry in the Prefetch Filter Queue (PFQ) [29] with the address of either the next or the previous cache line—that is, either A + 1 or A – 1. It guesses the direction (up or down) based on whether the memory access was located in the top 25% of the cache line (guesses down) or the bottom 75% of the cache line (guesses up). If there is another L1 D-cache miss, the engine will compare the line address with the entries in the PFQ. If the access is indeed sequential, the line address now being compared must be either A + 1 or A – 1. Alternatively, the engine could have incorrectly guessed the direction, in which case it would create another filter entry for the opposite direction. If the guessed direction was correct (say, up), the engine deems it a sequential access and allocates a stream entry in the Prefetch Request Queue (PRQ) [30] using the next available stream identifier. Moreover, the engine will initiate prefetching for cache line A + 2 to L1 and cache line A + 3 to L2. If A + 2 is read, the engine will cause A + 3 to be fetched to L1 from L2, and A + 4, A + 5, and A + 6 to be fetched to L2. If further sequential demand-reads occur (for A + 3 next), this pattern will continue until all streams are assigned. The PFQ is updated using an LRU algorithm.

The 970FX allows software to manipulate the prefetch mechanism. This is useful if the programmer knows data access patterns ahead of time. A version of the data-cache-block-touch (dcbt) instruction, which is one of the storage control instructions, can be used by a program to provide hints that it intends to read from a specified address or data stream in the near future. Consequently, the processor would initiate a data stream prefetch from a particular address.

Note that if you attempt to access unmapped or protected memory via software-initiated prefetching, no page faults will occur. Moreover, these instructions are not guaranteed to succeed and can fail silently for a variety of reasons. In the case of success, no result is returned in any register—only the cache block is fetched. In the case of failure, no cache block is fetched, and again, no result is returned in any register. In particular, failure does not affect program correctness; it simply means that the program will not benefit from prefetching.

Prefetching continues until a page boundary is reached, at which point the stream will have to be reinitialized. This is so because the prefetch engine does not know about the effective-to-real address mapping and can prefetch only within a real page. This is an example of a situation in which large pages—with page boundaries that are 16MB apart—will fare better than 4KB pages.

On a Mac OS X system with AltiVec hardware, you can use the vec_dst() AltiVec function to initiate data read of a line into cache, as shown in the pseudocode in Figure 3–7.

Example 3–7. Data prefetching in AltiVec

while (/* data processing loop */) {

    /* prefetch */
    vec_dst(address + prefetch_lead, control, stream_id);

    /* do some processing */

    /* advance address pointer */
}

/* stop the stream */
vec_dss(stream_id);

The address argument to vec_dst() is a pointer to a byte that lies within the first cache line to be fetched; the control argument is a word whose bits specify the block size, the block count, and the distance between the blocks; and the stream_id specifies the stream to use.

3.3.6 Registers

The 970FX has two privilege modes of operation: a user mode (problem state) and a supervisor mode (privileged state). The former is used by user-space applications, whereas the latter is used by the Mac OS X kernel. When the processor is first initialized, it comes up in supervisor mode, after which it can be switched to user mode via the Machine State Register (MSR).

The set of architected registers can be divided into three levels (or models) in the PowerPC architecture:

  1. User Instruction Set Architecture (UISA)
  2. Virtual Environment Architecture (VEA)
  3. Operating Environment Architecture (OEA)

The UISA and VEA registers can be accessed by software through either user-level or supervisor-level privileges, although there are VEA registers that cannot be written to by user-level instructions. OEA registers can be accessed only by supervisor-level instructions.

3.3.6.1 UISA and VEA Registers

Figure 3–8 shows the UISA and VEA registers of the 970FX. Their purpose is summarized in Table 3–6. Note that whereas the general-purpose registers are all 64-bit wide, the set of supervisor-level registers contains both 32-bit and 64-bit registers.

singhfig3-8.gif

Figure 3–8 PowerPCregistersPowerPC UISA and VEA registers

Table 3–6. UISA and VEA Registers

Name

Width

Count

Notes

General-Purpose Registers (GPRs)

64-bit

32

GPRs are used as source or destination registers for fixed-point operations—e.g., by fixed-point load/store instructions. You also use GPRs while accessing special-purpose registers (SPRs). Note that GPR0 is not hardwired to the value 0, as is the case on several RISC architectures.

Floating-Point Registers (FPRs)

64-bit

32

FPRs are used as source or destination registers for floating-point instructions. You also use FPRs to access the Floating-Point Status and Control Register (FPSCR). An FPR can hold integer, single-precision floating-point, or double-precision floating-point values.

Vector Registers (VRs)

128-bit

32

VRs are used as vector source or destination registers for vector instructions.

Integer Exception Register (XER)

32-bit

1

The XER is used to indicate carry conditions and overflows for integer operations. It is also used to specify the number of bytes to be transferred by a load-string-word-indexed (lswx) or store-string-word-indexed (stswx) instruction.

Floating-Point Status and Control Register (FPSCR)

32-bit

1

The FPSCR is used to record floating-point exceptions and the result type of a floating-point operation. It is also used to toggle the reporting of floating-point exceptions and to control the floating-point rounding mode.

Vector Status and Control Register (VSCR)

32-bit

1

Only two bits of the VSCR are defined: the saturate (SAT) bit and the non-Java mode (NJ) bit. The SAT bit indicates that a vector saturating-type instruction generated a saturated result. The NJ bit, if cleared, enables a Java-IEEE-C9X-compliant mode for vector floating-point operations that handles denormalized values in accordance with these standards. When the NJ bit is set, a potentially faster mode is selected, in which the value 0 is used in place of denormalized values in source or result vectors.

Condition Register (CR)

32-bit

1

The CR is conceptually divided into eight 4-bit fields (CR0–CR7). These fields store results of certain fixed-point and floating-point operations. Some branch instructions can test individual CR bits.

Vector Save/Restore Register (VRSAVE)

32-bit

1

The VRSAVE is used by software while saving and restoring VRs across context-switching events. Each bit of the VRSAVE corresponds to a VR and specifies whether that VR is in use or not.

Link Register (LR)

64-bit

1

The LR can be used to return from a subroutine—it holds the return address after a branch instruction if the link (LK) bit in that branch instruction's encoding is 1. It is also used to hold the target address for the branch-conditional-to-Link-Register (bclrx) instruction. Some instructions can automatically load the LR to the instruction following the branch.

Count Register (CTR)

64-bit

1

The CTR can be used to hold a loop count that is decremented during execution of branch instructions. The branch-conditional-to-Count-Register (bcctrx) instruction branches to the target address held in this register.

Timebase Registers (TBL, TBU)

32-bit

2

The Timebase (TB) Register, which is the concatenation of the 32-bit TBU and TBL registers, contains a periodically incrementing 64-bit unsigned integer.

Processor registers are used with all normal instructions that access memory. In fact, there are no computational instructions in the PowerPC architecture that modify storage. For a computational instruction to use a storage operand, it must first load the operand into a register. Similarly, if a computational instruction writes a value to a storage operand, the value must go to the target location via a register. The PowerPC architecture supports the following addressing modes for such instructions.

  • Register Indirect—The effective address EA is given by (rA | 0).
  • Register Indirect with Immediate Index—EA is given by (rA | 0) + offset, where offset can be zero.
  • Register Indirect with Index—EA is given by (rA | 0) + rB.

The UISA-level performance-monitoring registers provide user-level read access to the 970FX's performance-monitoring facility. They can be written only by a supervisor-level program such as the kernel or a kernel extension.

The Timebase Register

The Timebase (TB) provides a long-period counter driven by an implementation-dependent frequency. The TB is a 64-bit register containing an unsigned 64-bit integer that is incremented periodically. Each increment adds 1 to bit 63 (the lowest-order bit) of the TB. The maximum value that the TB can hold is 264 – 1, after which it resets to zero without generating any exception. The TB can either be incremented at a frequency that is a function of the processor clock frequency, or it can be driven by the rising edge of the signal on the TB enable (TBEN) input pin. [31] In the former case, the 970FX increments the TB once every eight full frequency processor clocks. It is the operating system's responsibility to initialize the TB. The TB can be read—but not written to—from user space. The program shown in Figure 3–9 retrieves and prints the TB.

Example 3–9. Retrieving and displaying the Timebase Register

   // timebase.c

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

u_int64_t mftb64(void);
void mftb32(u_int32_t *, u_int32_t *);

int
main(void)
{
    u_int64_t tb64;
    u_int32_t tb32u, tb32l;

    tb64 = mftb64();
    mftb32(&tb32u, &tb32l);

    printf("%llx %x%08x\n", tb64, tb32l, tb32u);
    exit(0);
}

// Requires a 64-bit processor

   // The TBR can be read in a single instruction (TBU || TBL)
u_int64_t
mftb64(void)
{
    u_int64_t tb64;

    __asm("mftb %0\n\t"
          : "=r" (tb64)
          :
    );

    return tb64;
}
// 32-bit or 64-bit
void
mftb32(u_int32_t *u, u_int32_t *l)
{
    u_int32_t tmp;

    __asm(
    "loop:              \n\t"
        "mftbu    %0    \n\t"
        "mftb     %1    \n\t"
        "mftbu    %2    \n\t"
        "cmpw     %2,%0 \n\t"
        "bne      loop  \n\t"
        : "=r"(*u), "=r"(*l), "=r"(tmp)
        :
    );
}
$ gcc -Wall -o timebase timebase.c
$ ./timebase; ./timebase; ./timebase; ./timebase; ./timebase
b6d10de300000001 b6d10de4000002d3
b6d4db7100000001 b6d4db72000002d3
b6d795f700000001 b6d795f8000002d3
b6da5a3000000001 b6da5a31000002d3
b6dd538c00000001 b6dd538d000002d3

Note in Figure 3–9 that we use inline assembly rather than create a separate assembly source file. The GNU assembler inline syntax is based on the template shown in Figure 3–10.

Example 3–10. Code template for inline assembly in the GNU assembler

__asm__ volatile(
        "assembly statement 1\n"
        "assembly statement 2\n"
        ...
        "assembly statement N\n"
    :   outputs, if any
    :   inputs, if any
    :   clobbered registers, if any
);

We will come across other examples of inline assembly in this book.

3.3.6.2 OEA Registers

The OEA registers are shown in Figure 3–11. Examples of their use include the following.

  • The bit-fields of the Machine State Register (MSR) are used to define the processor's state. For example, MSR bits are used to specify the processor's computation mode (32-bit or 64-bit), to enable or disable power management, to determine whether the processor is in privileged (supervisor) or nonprivileged (user) mode, to enable single-step tracing, and to enable or disable address translation. The MSR can be explicitly accessed via the move-to-MSR (mtmsr), move-to-MSR-double (mtmsrd), and move-from-MSR (mfmsr) instructions. It is also modified by the system-call (sc) and return-from-interrupt-double (rfid) instructions.
  • The Hardware-Implementation-Dependent (HID) registers allow very fine-grained control of the processor's features. Bit-fields in the various HID registers can be used to enable, disable, or alter the behavior of processor features such as branch prediction mode, data prefetching, instruction cache, and instruction prefetch mode and also to specify which data cache replacement algorithm to use (LRU or first-in first-out [FIFO]), whether the Timebase is externally clocked, and whether large pages are disabled.
  • The Storage Description Register (SDR1) is used to hold the page table base address.
singhfig3-11.gif

Figure 3–11 PowerPC OEA registers

3.3.7 Rename Registers

The 970FX implements a substantial number of rename registers, which are used to handle register-name dependencies. Instructions can depend on one another from the point of view of control, data, or name. Consider two instructions, say, I1 and I2, in a program, where I2 comes after I1.

I1
...
Ix
...
I2

In a data dependency, I2 either uses a result produced by I1, or I2 has a data dependency on an instruction Ix, which in turn has a data dependency on I1. In both cases, a value is effectively transmitted from I1 to I2.

In a name dependency, I1 and I2 use the same logical resource or name, such as a register or a memory location. In particular, if I2 writes to the same register that is either read from or written to by I1, then I2 would have to wait for I1 to execute before it can execute. These are known as write-after-read (WAR) and write-after-write (WAW) hazards.

I1 reads (or writes) <REGISTER X>
...
I2 writes <REGISTER X>

In this case, the dependency is not "real" in that I2 does not need I1's result. One solution to handle register-name dependencies is to rename the conflicting register used in the instructions so that they become independent. Such renaming could be done in software (statically, by the compiler) or in hardware (dynamically, by logic in the processor). The 970FX uses pools of physical rename registers that are assigned to instructions during the mapping stage in the processor pipeline and released when they are no longer needed. In other words, the processor internally renames architected registers used by instructions to physical registers. This makes sense only when the number of physical registers is (substantially) larger than the number of architected registers. For example, the PowerPC architecture has 32 GPRs, but the 970FX implementation has a pool of 80 physical GPRs, from which the 32 architected GPRs are assigned. Let us consider a specific example, say, of a WAW hazard, where renaming is helpful.

; before renaming
r20 U2190.GIF r21 + r22 ; r20 is written to
...
r20 U2190.GIF r23 + r24 ; r20 is written to... WAW hazard here
r25 U2190.GIF r20 + r26 ; r20 is read from

; after renaming
r20 U2190.GIF r21 + r22 ; r20 is written to
...
r64 U2190.GIF r23 + 424 ; r20 is renamed to r64... no WAW hazard now
r25 U2190.GIF r64 + r26 ; r20 is renamed to r64

Renaming is also beneficial to speculative execution, since the processor can use the extra physical registers to reduce the amount of architected register state it must save to recover from incorrectly speculated execution.

Table 3–7 lists the available renamed registers in the 970FX. The table also mentions emulation registers, which are available to cracked and microcoded instructions, which, as we will see in Section 3.3.9.1, are processes by which complex instructions are broken down into simpler instructions.

Table 3–7. Rename Register Resources

Resource

Architected (Logical Resource)

Emulation (Logical Resource)

Rename Pool (Physical Resource)

GPRs

32x64-bit

4x64-bit

80x64-bit.

VRSAVE

1x32-bit

Shared with the GPR rename pool.

FPRs

32x64-bit

1x64-bit

80x64-bit.

FPSCR

1x32-bit

One rename per active instruction group using a 20-entry buffer.

LR

1x64-bit

16x64-bit.

CTR

1x64-bit

LR and CTR share the same rename pool.

CR

8x4-bit

1x4-bit

32x4-bit.

XER

1x32-bit

24x2-bit. Only two bits—the overflow bit OV and the carry bit CA—are renamed from a pool of 24 2-bit registers.

VRs

32x128-bit

80x128-bit.

VSCR

1x32-bit

20x1-bit. Of the VSCR's two defined bits, only the SAT bit is renamed from a pool of 20 1-bit registers.

3.3.8 Instruction Set

All PowerPC instructions are 32 bits wide regardless of whether the processor is in 32-bit or 64-bit computation mode. All instructions are word aligned, which means that the two lowest-order bits of an instruction address are irrelevant from the processor's standpoint. There are several instruction formats, but bits 0 through 5 of an instruction word always specify the major opcode. PowerPC instructions typically have three operands: two source operands and one result. One of the source operands may be a constant or a register, but the other operands are usually registers.

We can broadly divide the instruction set implemented by the 970FX into the following instruction categories: fixed-point, floating-point, vector, control flow, and everything else.

3.3.8.1 Fixed-Point Instructions

Operands of fixed-point instructions can be bytes (8-bit), half words (16-bit), words (32-bit), or double words (64-bit). This category includes the following instruction types:

  • Fixed-point load and store instructions for moving values between the GPRs and storage
  • Fixed-point load-multiple-word (lmw) and store-multiple-word (stmw), which can be used for restoring or saving up to 32 GPRs in a single instruction
  • Fixed-point load-string-word-immediate (lswi), load-string-word-indexed (lswx), store-string-word-immediate (stswi), and store-string-word-indexed (stswx), which can be used to fetch and store fixed- and variable-length strings, with arbitrary alignments
  • Fixed-point arithmetic instructions, such as add, divide, multiply, negate, and subtract
  • Fixed-point compare instructions, such as compare-algebraic, compare-algebraic-immediate, compare-algebraic-logical, and compare-algebraic-logical-immediate
  • Fixed-point logical instructions, such as and, and-with-complement, equivalent, or, or-with-complement, nor, xor, sign-extend, and count-leading-zeros (cntlzw and variants)
  • Fixed-point rotate and shift instructions, such as rotate, rotate-and-mask, shift-left, and shift-right
  • Fixed-point move-to-system-register (mtspr), move-from-system-register (mfspr), move-to-MSR (mtmsr), and move-from-MSR (mfmsr), which allow GPRs to be used to access system registers

3.3.8.2 Floating-Point Instructions

Floating-point operands can be single-precision (32-bit) or double-precision (64-bit) floating-point quantities. However, floating-point data is always stored in the FPRs in double-precision format. Loading a single-precision value from storage converts it to double precision, and storing a single-precision value to storage actually rounds the FPR-resident double-precision value to single precision. The 970FX complies with the IEEE 754 standard [32] for floating-point arithmetic. This instruction category includes the following types:

  • Floating-point load and store instructions for moving values between the FPRs and storage
  • Floating-point comparison instructions
  • Floating-point arithmetic instructions, such as add, divide, multiply, multiply-add, multiply-subtract, negative-multiply-add, negative-multiply-subtract, negate, square-root, and subtract
  • Instructions for manipulating the FPSCR, such as move-to-FPSCR, move-from-FPSCR, set-FPSCR-bit, clear-FPSCR-bit, and copy-FPSCR-field-to-CR
  • PowerPC optional floating-point instructions, namely: floating-square-root (fsqrt), floating-square-root-single (fsqrts), floating-reciprocal-estimate-single (fres), floating-reciprocal-square-root-estimate (frsqrte), and floating-point-select (fsel)

Example 3–12. Precision of the floating-point-estimate instruction on the G4 and the G5

   // frsqrte.c

#include <stdio.h>
#include <stdlib.h>

double
frsqrte(double n)
{
    double s;

    asm(
        "frsqrte %0, %1"
        : "=f" (s)  /* out */
        : "f"  (n)  /* in */
    );

    return s;
}

int
main(int argc, char **argv)
{
    printf("%8.8f\n", frsqrte(strtod(argv[1], NULL)));
    return 0;
}

$ machine
ppc7450
$ gcc -Wall -o frsqrte frsqrte.c
$ ./frsqrte 0.5
1.39062500

$ machine
ppc970
$ gcc -Wall -o frsqrte frsqrte.c
$ ./frsqrte 0.5
1.37500000

3.3.8.3 Vector Instructions

Vector instructions execute in the 128-bit VMX execution unit. We will look at some of the VMX details in Section 3.3.10. The 970FX VMX implementation contains 162 vector instructions in various categories.

3.3.8.4 Control-Flow Instructions

A program's control flow is sequential—that is, its instructions logically execute in the order they appear—until a control-flow change occurs either explicitly (because of an instruction that modifies the control flow of a program) or as a side effect of another event. The following are examples of control-flow changes:

  • An explicit branch instruction, after which execution continues at the target address specified by the branch
  • An exception, which could represent an error, a signal external to the processor core, or an unusual condition that sets a status bit but may or may not cause an interrupt [33]
  • A trap, which is an interrupt caused by a trap instruction
  • A system call, which is a form of software-only interrupt caused by the system-call (sc) instruction

Each of these events could have handlers—pieces of code that handle them. For example, a trap handler may be executed when the conditions specified in the trap instruction are satisfied. When a user-space program executes an sc instruction with a valid system call identifier, a function in the operating system kernel is invoked to provide the service corresponding to that system call. Similarly, control flow also changes when the program is returning from such handlers. For example, after a system call finishes in the kernel, execution continues in user space—in a different piece of code.

The 970FX supports absolute and relative branching. A branch could be conditional or unconditional. A conditional branch can be based on any of the bits in the CR being 1 or 0. We earlier came across the special-purpose registers LR and CTR. LR can hold the return address on a procedure call. A leaf procedure—one that does not call another procedure—does not need to save LR and therefore can return faster. CTR is used for loops with a fixed iteration limit. It can be used to branch based on its contents—the loop counter—being zero or nonzero, while decrementing the counter automatically. LR and CTR are also used to hold target addresses of conditional branches for use with the bclr and bcctr instructions, respectively.

3.3.8.5 Miscellaneous Instructions

The 970FX includes various other types of instructions, many of which are used by the operating system for low-level manipulation of the processor. Examples include the following types:

  • Instructions for processor management, including direct manipulation of some SPRs
  • Instructions for controlling caches, such as for touching, zeroing, and flushing a cache; requesting a store; and requesting a prefetch stream to be initiated—for example: instruction-cache-block-invalidate (icbi), data-cache-block-touch (dcbt), data-cache-block-touch-for-store (dcbtst), data-cache-block-set-to-zero (dcbz), data-cache-block-store (dcbst), and data-cache-block-flush (dcbf)
  • Instructions for loading and storing conditionally, such as load-word-and-reserve-indexed (lwarx), load-double-word-and-reserve-indexed (ldarx), store-word-conditional-indexed (stwcx.), and store-double-word-conditional-indexed (stdcx.)
  • Instructions for memory synchronization, [34] such as enforce-in-order-execution-of-i/o (eieio), synchronize (sync), and special forms of sync (lwsync and ptesync)
  • Instructions for manipulating SLB and TLB entries, such as slb-invalidate-all (slbia), slb-invalidate-entry (slbie), tlb-invalidate-entry (tlbie), and tlb-synchronize (tlbsync)

3.3.9 The 970FX Core

The 970FX core is depicted in Figure 3–13. We have come across several of the core's major components earlier in this chapter, such as the L1 caches, the ERATs, the TLB, the SLB, register files, and register-renaming resources.

singhfig3-13.gif

Figure 3–13 The core of the 970FX

The 970FX core is designed to achieve a high degree of instruction parallelism. Some of its noteworthy features include the following.

  • It has a highly superscalar 64-bit design, with support for the 32-bit operating-system-bridge [35] facility.
  • It performs dynamic "cracking" of certain instructions into two or more simpler instructions.
  • It performs highly speculative execution of instructions along with aggressive branch prediction and dynamic instruction scheduling.
  • It has twelve logically separate functional units and ten execution pipelines.
  • It has two Fixed-Point Units (FXU1 and FXU2). Both units are capable of basic arithmetic, logical, shifting, and multiplicative operations on integers. However, only FXU1 is capable of executing divide instructions, whereas only FXU2 can be used in operations involving special purpose registers.
  • It has two Floating-Point Units (FPU1 and FPU2). Both units are capable of performing the full supported set of floating-point operations.
  • It has two Load/Store Units (LSU1 and LSU2).
  • It has a Condition Register Unit (CRU) that executes CR logical instructions.
  • It has a Branch Execution Unit (BRU) that computes branch address and branch direction. The latter is compared with the predicted direction. If the prediction was incorrect, the BRU redirects instruction fetching.
  • It has a Vector Processing Unit (VPU) with two subunits: a Vector Arithmetic and Logical Unit (VALU) and a Vector Permute Unit (VPERM). The VALU has three subunits of its own: a Vector Simple-Integer [36] Unit (VX), a Vector Complex-Integer Unit (VC), and a Vector Floating-Point Unit (VF).
  • It can perform 64-bit integer or floating-point operations in one clock cycle.
  • It has deeply pipelined execution units, with pipeline depths of up to 25 stages.
  • It has reordering issue queues that allow for out-of-order execution.
  • Up to 8 instructions can be fetched in each cycle from the L1 instruction cache.
  • Up to 8 instructions can be issued in each cycle.
  • Up to 5 instructions can complete in each cycle.
  • Up to 215 instructions can be in flight—that is, in various stages of execution (partially executed)—at any time.

3.3.9.1 Instruction Pipeline

In this section, we will discuss how the 970FX processes instructions. The overall instruction pipeline is shown in Figure 3–14. Let us look at the important stages of this pipeline.

singhfig3-14.gif

Figure 3–14 The 970FX instruction pipeline

IFAR, ICA [37]

Based on the address in the Instruction Fetch Address Register (IFAR), the instruction-fetch-logic fetches eight instructions every cycle from the L1 I-cache into a 32-entry instruction buffer. The eight-instruction block, so fetched, is 32-byte aligned. Besides performing IFAR-based demand fetching, the 970FX prefetches cache lines into a 4x128-byte Instruction Prefetch Queue. If a demand fetch results in an I-cache miss, the 970FX checks whether the instructions are in the prefetch queue. If the instructions are found, they are inserted into the pipeline as if no I-cache miss had occurred. The cache line's critical sector (eight words) is written into the I-cache.

D0

There is logic to partially decode (predecode) instructions after they leave the L2 cache and before they enter the I-cache or the prefetch queue. This process adds five extra bits to each instruction to yield a 37-bit instruction. An instruction's predecode bits mark it as illegal, microcoded, conditional or unconditional branch, and so on. In particular, the bits also specify how the instruction is to be grouped for dispatching.

D1, D2, D3

The 970FX splits complex instructions into two or more internal operations, or iops. The iops are more RISC-like than the instructions they are part of. Instructions that are broken into exactly two iops are called cracked instructions, whereas those that are broken into three or more iops are called microcoded instructions because the processor emulates them using microcode.

Fetched instructions go to a 32-instruction fetch buffer. Every cycle, up to five instructions are taken from this buffer and sent through a decode pipeline that is either inline (consisting of three stages, namely, D1, D2, and D3), or template-based if the instruction needs to be microcoded. The template-based decode pipeline generates up to four iops per cycle that emulate the original instruction. In any case, the decode pipeline leads to the formation of an instruction dispatch group.

Given the out-of-order execution of instructions, the processor needs to keep track of the program order of all instructions in various stages of execution. Rather than tracking individual instructions, the 970FX tracks instructions in dispatch groups. The 970FX forms such groups containing one to five iops, each occupying an instruction slot (0 through 4) in the group. Dispatch group formation [38] is subject to a long list of rules and conditions such as the following.

  • The iops in a group must be in program order, with the oldest instruction being in slot 0.
  • A group may contain up to four nonbranch instructions and optionally a branch instruction. When a branch is encountered, it is the last instruction in the current group, and a new group is started.
  • Slot 4 can contain only branch instructions. In fact, no-op (no-operation) instructions may have to be inserted in the other slots to force a branch instruction to fall in slot 4.
  • An instruction that is a branch target is always at the start of a group.
  • A cracked instruction takes two slots in a group.
  • A microcoded instruction takes an entire group by itself.
  • An instruction that modifies an SPR with no associated rename register terminates a group.
  • No more than two instructions that modify the CR may be in a group.

XFER

The iops wait for resources to become free in the XFER stage.

GD, DSP, WRT, GCT, MAP

After group formation, the execution pipeline divides into multiple pipelines for the various execution units. Every cycle, one group of instructions can be sent (or dispatched) to the issue queues. Note that instructions in a group remain together from dispatch to completion.

As a group is dispatched, several operations occur before the instructions actually execute. Internal group instruction dependencies are determined (GD). Various internal resources are assigned, such as issue queue slots, rename registers and mappers, and entries in the load/store reorder queues. In particular, each iop in the group that returns a result must be assigned a register to hold the result. Rename registers are allocated in the dispatch phase before the instructions enter the issue queues (DSP, MAP).

To track the groups themselves, the 970FX uses a global completion table (GCT) that stores up to 20 entries in program order—that is, up to 20 dispatch groups can be in flight concurrently. Since each group can have up to 5 iops, as many as 100 iops can be tracked in this manner. The WRT stage represents the writes to the GCT.

ISS, RF

After all the resources that are required to execute the instructions are available, the instructions are sent (ISS) to appropriate issue queues. Once their operands appear, the instructions start to execute. Each slot in a group feeds separate issue queues for various execution units. For example, the FXU/LSU and the FPU draw their instructions from slots { 0, 3 } and { 1, 2 }, respectively, of an instruction group. If one pair goes to the FXU/LSU, the other pair goes to the FPU. The CRU draws its instructions from the CR logical issue queue that is fed from instruction slots 0 and 1. As we saw earlier, slot 4 of an instruction group is dedicated to branch instructions. AltiVec instructions can be issued to the VALU and the VPERM issue queues from any slot except slot 4. Table 3–8 shows the 970FX issue queue sizes—each execution unit listed has one issue queue.

Table 3–8. Sizes of the Various 970FX Issue Queues

Execution Unit

Queue Size (Instructions)

LSU0/FXU0 [a]

18

LSU1/FXU1 [b]

18

FPU0

10

FPU1

10

BRU

12

CRU

10

VALU

20

VPERM

16

The FXU/LSU and FPU issue queues have odd and even halves that are hardwired to receive instructions only from certain slots of a dispatch group, as shown in Figure 3–15.

singhfig3-15.gif

Figure 3–15 The FPU and FXU/LSU issue queues in the 970FX

As long as an issue queue contains instructions that have all their data dependencies resolved, an instruction moves every cycle from the queue into the appropriate execution unit. However, there are likely to be instructions whose operands are not ready; such instructions block in the queue. Although the 970FX will attempt to execute the oldest instruction first, it will reorder instructions within a queue's context to avoid stalling. Ready-to-execute instructions access their source operands by reading the corresponding register file (RF), after which they enter the execution unit pipelines. Up to ten operations can be issued in a cycle—one to each of the ten execution pipelines. Note that different execution units may have varying numbers of pipeline stages.

We have seen that instructions both issue and execute out of order. However, if an instruction has finished execution, it does not mean that the program will "know" about it. After all, from the program's standpoint, instructions must execute in program order. The 970FX differentiates between an instruction finishing execution and an instruction completing. An instruction may finish execution (speculatively, say), but unless it completes, its effect is not visible to the program. All pipelines terminate in a common stage: the group completion stage (CP). When groups complete, many of their resources are released, such as load reorder queue entries, mappers, and global completion table entries. One dispatch group may be "retired" per cycle.

Accounting for 215 In-Flight Instructions

We can account for the theoretical maximum of 215 in-flight instructions by looking at Figure 3–4—specifically, the areas marked 1 through 6.

  1. The Instruction Fetch Unit has a fetch/overflow buffer that can hold 16 instructions.
  2. The instruction fetch buffer in the decode/dispatch unit can hold 32 instructions.
  3. Every cycle, up to 5 instructions are taken from the instruction fetch buffer and sent through a three-stage instruction decode pipeline. Therefore, up to 15 instructions can be in this pipeline.
  4. There are four dispatch buffers, each holding a dispatch group of up to five operations. Therefore, up to 20 instructions can be held in these buffers.
  5. The global completion table can track up to 20 dispatch groups after they have been dispatched, corresponding to up to 100 instructions in the 970FX core.
  6. The store queue can hold up to 32 stores.

Thus, the theoretical maximum number of in-flight instructions can be calculated as the sum 16 + 32 + 15 + 20 + 100 + 32, which is 215.

3.3.9.2 Branch Prediction

Branch prediction is a mechanism wherein the processor attempts to keep the pipeline full, and therefore improve overall performance, by fetching instructions in the hope that they will be executed. In this context, a branch is a decision point for the processor: It must predict the outcome of the branch—whether it will be taken or not—and accordingly prefetch instructions. As shown in Figure 3–15, the 970FX scans fetched instructions for branches. It looks for up to two branches per cycle and uses multistrategy branch prediction logic to predict their target addresses, directions, or both. Consequently, up to 2 branches are predicted per cycle, and up to 16 predicted branches can be in flight.

All conditional branches are predicted, based on whether the 970FX fetches instructions beyond a branch and speculatively executes them. Once the branch instruction itself executes in the BRU, its actual outcome is compared with its predicted outcome. If the prediction was incorrect, there is a severe penalty: Any instructions that may have speculatively executed are discarded, and instructions in the correct control-flow path are fetched.

The 970FX's dynamic branch prediction hardware includes three branch history tables (BHTs), a link stack, and a count cache. Each BHT has 16K 1-bit entries.

The first BHT is the local predictor table. Its 16K entries are indexed by branch instruction addresses. Each 1-bit entry indicates whether the branch should be taken or not. This scheme is "local" because each branch is tracked in isolation.

The second BHT is the global predictor table. It is used by a prediction scheme that takes into account the execution path taken to reach the branch. An 11-bit vector—the global history vector—represents the execution path. The bits of this vector represent the previous 11 instruction groups fetched. A particular bit is 1 if the next group was fetched sequentially and is 0 otherwise. A given branch's entry in the global predictor table is at a location calculated by performing an XOR operation between the global history vector and the branch instruction address.

The third BHT is the selector table. It tracks which of the two prediction schemes is to be favored for a given branch. The BHTs are kept up to date with the actual outcomes of executed branch instructions.

The link stack and the count cache are used by the 970FX to predict branch target addresses of branch-conditional-to-link-register (bclr, bclrl) and branch-conditional-to-count-register (bcctr, bcctrl) instructions, respectively.

So far, we have looked at dynamic branch prediction. The 970FX also supports static prediction wherein the programmer can use certain bits in a conditional branch operand to statically override dynamic prediction. Specifically, two bits called the "a" and "t" bits are used to provide hints regarding the branch's direction, as shown in Table 3–9.

Table 3–9. Static Branch Prediction Hints

"a" Bit

"t" Bit

Hint

0

0

Dynamic branch prediction is used.

0

1

Dynamic branch prediction is used.

1

0

Dynamic branch prediction is disabled; static prediction is "not taken"; specified by a "-" suffix to a branch conditional mnemonic.

1

1

Dynamic branch prediction is disabled; static prediction is "taken"; specified by a "+" suffix to a branch conditional mnemonic.

3.3.9.3 Summary

Let us summarize the instruction parallelism achieved by the 970FX. In every cycle of the 970FX, the following events occur.

  • Up to eight instructions are fetched.
  • Up to two branches are predicted.
  • Up to five iops (one group) are dispatched.
  • Up to five iops are renamed.
  • Up to ten iops are issued from the issue queues.
  • Up to five iops are completed.

3.3.10 AltiVec

The 970FX includes a dedicated vector-processing unit and implements the VMX instruction set, which is an AltiVec [39] interchangeable extension to the PowerPC architecture. AltiVec provides a SIMD-style 128-bit [40] vector-processing unit.

3.3.10.1 Vector Computing

SIMD stands for single-instruction, multiple-data. It refers to a set of operations that can efficiently handle large quantities of data in parallel. SIMD operations do not necessarily require more or wider registers, although more is better. SIMD essentially better uses registers and data paths. For example, a non-SIMD computation would typically use a hardware register for each data element, even if the register could hold multiple such elements. In contrast, SIMD would use a register to hold multiple data elements—as many as would fit—and would perform the same operation on all elements through a single instruction. Thus, any operation that can be parallelized in this manner stands to benefit from SIMD. In AltiVec's case, a vector instruction can perform the same operation on all constituents of a vector. Note that AltiVec instructions work on fixed-length vectors.

Several processor architectures have similar extensions. Table 3–10 lists some well-known examples.

Table 3–10. Examples of Processor Multimedia-Extensions

Processor Family

Manufacturers

Multimedia Extension Sets

Alpha

Hewlett-Packard (Digital Equipment Corporation)

MVI

AMD

Advanced Micro Devices (AMD)

3DNow!

MIPS

Silicon Graphics Incorporated (SGI)

MDMX, MIPS-3D

PA-RISC

Hewlett-Packard

MAX, MAX2

PowerPC

IBM, Motorola

VMX/AltiVec

SPARC V9

Sun Microsystems

VIS

x86

Intel, AMD, Cyrix

MMX, SSE, SSE2, SSE3

AltiVec can greatly improve the performance of data movement, benefiting applications that do processing of vectors, matrices, arrays, signals, and so on. As we saw in Chapter 2, Apple provides portable APIs—through the Accelerate framework (Accelerate.framework)—for performing vector-optimized operations. [42] Accelerate is an umbrella framework that contains the vecLib and vImage [43] subframeworks. vecLib is targeted for performing numerical and scientific computing—it provides functionality such as BLAS, LAPACK, digital signal processing, dot products, linear algebra, and matrix operations. vImage provides vector-optimized APIs for working with image data. For example, it provides functions for alpha compositing, convolutions, format conversion, geometric transformations, histograms operations, and morphological operations.

AltiVec has wide-ranging applications since areas such as high-fidelity audio, video, videoconferencing, graphics, medical imaging, handwriting analysis, data encryption, speech recognition, image processing, and communications all use algorithms that can benefit from vector processing.

Figure 3–16 shows a trivial AltiVec C program.

Example 3–16. A trivial AltiVec program

   // altivec.c

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    // "vector" is an AltiVec keyword
    vector float v1, v2, v3;

    v1 = (vector float)(1.0, 2.0, 3.0, 4.0);
    v2 = (vector float)(2.0, 3.0, 4.0, 5.0);

    // vector_add() is a compiler built-in function
    v3 = vector_add(v1, v2);

    // "%vf" is a vector-formatting string for printf()
    printf("%vf\n", v3);

    exit(0);
}

$ gcc -Wall -faltivec -o altivec altivec.c
$ ./altivec
3.000000 5.000000 7.000000 9.000000

As also shown in Figure 3–16, the -faltivec option to GCC enables AltiVec language extensions.

3.3.10.2 The 970FX AltiVec Implementation

The 970FX AltiVec implementation consists of the following components:

  • A vector register file (VRF) consisting of 32 128-bit architected vector registers (VR0–VR31)
  • 48 128-bit rename registers for allocation in the dispatch phase
  • A 32-bit Vector Status and Control Register (VSCR)
  • A 32-bit Vector Save/Restore Register (VRSAVE)
  • A Vector Permute Unit (VPERM) that benefits the implementation of operations such as arbitrary byte-wise data organization, table lookups, and packing/unpacking of data
  • A Vector Arithmetic and Logical Unit (VALU) that contains three parallel subunits: the Vector Simple-Integer Unit (VX), the Vector Complex-Integer Unit (VC), and the Vector Floating-Point Unit (VF)

The CR is also modified as a result of certain vector instructions.

The 32-bit VRSAVE serves a special purpose: Each of its bits indicates whether the corresponding vector register is in use or not. The processor maintains this register so that it does not have to save and restore every vector register every time there is an exception or a context switch. Frequently saving or restoring 32 128-bit registers, which together constitute 512 bytes, would be severely detrimental to cache performance, as other, perhaps more critical data would need to be evicted from cache.

Let us extend our example program from Figure 3–16 to examine the value in the VRSAVE. Figure 3–17 shows the extended program.

Example 3–17. Displaying the contents of the VRSAVE

   // vrsave.c

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

void prbits(u_int32_t);
u_int32_t read_vrsave(void);

// Print the bits of a 32-bit number
void
prbits32(u_int32_t u)
{
    u_int32_t i = 32;

    for (; i--; putchar(u & 1 << i ? '1' : '0'));

    printf("\n");
}

// Retrieve the contents of the VRSAVE
u_int32_t
read_vrsave(void)
{
    u_int32_t v;
    __asm("mfspr %0,VRsave\n\t"
          : "=r"(v)
          :
    );

    return v;
}

int
main()
{
    vector float v1, v2, v3;

    v1 = (vector float)(1.0, 2.0, 3.0, 4.0);
    v2 = (vector float)(2.0, 3.0, 4.0, 5.0);

    v3 = vec_add(v1, v2);

    prbits32(read_vrsave());

    exit(0);
}


$ gcc -Wall -faltivec -o vrsave vrsave.c
$ ./vrsave
11000000000000000000000000000000

We see in Figure 3–17 that two high-order bits of the VRSAVE are set and the rest are cleared. This means the program uses two VRs: VR0 and VR1. You can verify this by looking at the assembly listing for the program.

The VPERM execution unit can do merge, permute, and splat operations on vectors. Having a separate permute unit allows data-reorganization instructions to proceed in parallel with vector arithmetic and logical instructions. The VPERM and VALU both maintain their own copies of the VRF that are synchronized on the half cycle. Thus, each receives its operands from its own VRF. Note that vector loads, stores, and data stream instructions are handled in the usual LSU pipes. Although no AltiVec instructions are cracked or microcoded, vector store instructions logically break down into two components: a vector part and an LSU part. In the group formation stage, a vector store is a single entity occupying one slot. However, once the instruction is issued, it occupies two issue queue slots: one in the vector store unit and another in the LSU. Address generation takes place in the LSU. There is a slot for moving the data out of the VRF in the vector unit. This is not any different from scalar (integer and floating-point) stores, in whose case address generation still takes place in the LSU, and the respective execution unit—integer or floating-point—is used for accessing the GPR file (GPRF) or the FPR file (FPRF).

AltiVec instructions were designed to be pipelined easily. The 970FX can dispatch up to four vector instructions every cycle—regardless of type—to the issue queues. Any vector instruction can be dispatched from any slot of the dispatch group except the dedicated branch slot 4.

3.3.10.3 AltiVec Instructions

AltiVec adds 162 vector instructions to the PowerPC architecture. Like all other PowerPC instructions, AltiVec instructions have 32-bit-wide encodings. To use AltiVec, no context switching is required. There is no special AltiVec operating mode—AltiVec instructions can be used along with regular PowerPC instructions in a program. AltiVec also does not interfere with floating-point registers.

The following points are noteworthy regarding AltiVec vectors.

  • A vector is 128 bits wide.
  • A vector can be comprised of one of the following: 16 bytes, 8 half words, 4 words (integers), or 4 single-precision floating-point numbers.
  • The largest vector element size is hardware-limited to 32 bits; the largest adder in the VALU is 32 bits wide. Moreover, the largest multiplier array is 24 bits wide, which is good enough for only a single-precision floating-point mantissa. [44]
  • A given vector's members can be all unsigned or all signed quantities.
  • The VALU behaves as multiple ALUs based on the vector element size.

Instructions in the AltiVec instruction set can be broadly classified into the following categories:

  • Vector load and store instructions
  • Instructions for reading from or writing to the VSCR
  • Data stream manipulation instructions, such as data-stream-touch (dst), data-stream-stop (dss), and data-stream-stop-all (dssall)
  • Vector fixed-point arithmetic and comparison instructions
  • Vector logical, rotate, and shift instructions
  • Vector pack, unpack, merge, splat, and permute instructions
  • Vector floating-point instructions

3.3.11 Power Management

The 970FX supports power management features such as the following.

  • It can dynamically stop the clocks of some of its constituents when they are idle.
  • It can be programmatically put into predefined power-saving modes such as doze, nap, and deep nap.
  • It includes PowerTune, a processor-level power management technology that supports scaling of processor and bus clock frequencies and voltage.

3.3.11.1 PowerTune

PowerTune allows clock frequencies to be dynamically controlled and even synchronized across multiple processors. PowerTune frequency scaling occurs in the processor core, the busses, the bridge, and the memory controller. Allowed frequencies range from f—the full nominal frequency—to f/2, f/4, and f/64. The latter corresponds to the deep nap power-saving mode. If an application does not require the processor's maximum available performance, frequency and voltage can be changed system-wide—without stopping the core execution units and without disabling interrupts or bus snooping. All processor logic, except the bus clocks, remains active. Moreover, the frequency change is very rapid. Since power has a quadratic dependency on voltage, reducing voltage has a desirable effect on power dissipation. Consequently, the 970FX has much lower typical power consumption than the 970, which did not have PowerTune.

3.3.11.2 Power Mac G5 Thermal and Power Management

In the Power Mac G5, Apple combines the power management capabilities of the 970FX/970MP with a network of fans and sensors to contain heat generation, power consumption, and noise levels. Examples of hardware sensors include those for fan speed, temperature, current, and voltage. The system is divided into discrete cooling zones with independently controlled fans. Some Power Mac G5 models additionally contain a liquid cooling system that circulates a thermally conductive fluid to transfer heat away from the processors into a radiant grille. As air passes over the grille's cooling fins, the fluid's temperature decreases. [46]

Operating system support is required to make the Power Mac G5's thermal management work properly. Mac OS X regularly monitors various temperatures and power consumption. It also communicates with the fan control unit (FCU). If the FCU does not receive feedback from the operating system, it will spin the fans at maximum speed.

A liquid-cooled dual-processor 2.5GHz Power Mac has the following fans:

  • CPU A PUMP
  • CPU A INTAKE
  • CPU A EXHAUST
  • CPU B PUMP
  • CPU B INTAKE
  • CPU B EXHAUST
  • BACKSIDE
  • DRIVE BAY
  • SLOT

Additionally, the Power Mac has sensors for current, voltage, and temperature, as listed in Table 3–11.

Table 3–11. Power Mac G5 Sensors: An Example

Sensor Type

Sensor Location/Name

Ammeter

CPU A AD7417 [a] AD2

Ammeter

CPU A AD7417 AD4

Ammeter

CPU B AD7417 AD2

Ammeter

CPU B AD7417 AD4

Switch

Power Button

Thermometer

BACKSIDE

Thermometer

U3 HEATSINK

Thermometer

DRIVE BAY

Thermometer

CPU A AD7417 AMB

Thermometer

CPU A AD7417 AD1

Thermometer

CPU B AD7417 AMB

Thermometer

CPU B AD7417 AD1

Thermometer

MLB INLET AMB

Voltmeter

CPU A AD7417 AD3

Voltmeter

CPU B AD7417 AD3

We will see in Chapter 10 how to programmatically retrieve the values of various sensors from the kernel.

3.3.12 64-bit Architecture

We saw earlier that the PowerPC architecture was designed with explicit support for 64- and 32-bit computing. PowerPC is, in fact, a 64-bit architecture with a 32-bit subset. A particular PowerPC implementation may choose to implement only the 32-bit subset, as is the case with the G3 and G4 processor families used by Apple. The 970FX implements both the 64-bit and 32-bit forms [47] dynamic computation modes [48] of the PowerPC architecture. The modes are dynamic in that you can switch between the two dynamically by setting or clearing bit 0 of the MSR.

3.3.12.1 64-bit Features

The key aspects of the 970FX's 64-bit mode are as follows:

  • 64-bit registers: [49] the GPRs, CTR, LR, and XER
  • 64-bit addressing, including 64-bit pointers, which allow one program's address space to be larger than 4GB
  • 32-bit and 64-bit programs, which can execute side by side
  • 64-bit integer and logical operations, with fewer instructions required to load and store 64-bit quantities [50]
  • Fixed instruction size—32 bits—in both 32- and 64-bit modes
  • 64-bit-only instructions such as load-word-algebraic (lwa), load-word-algebraic-indexed (lwax), and "double-word" versions of several instructions

Although a Mac OS X process must be 64-bit itself to be able to directly access more than 4GB of virtual memory, having support in the processor for more than 4GB of physical memory benefits both 64-bit and 32-bit applications. After all, physical memory backs virtual memory. Recall that the 970FX can track a large amount of physical memory—42 bits worth, or 4TB. Therefore, as long as there is enough RAM, much greater amounts of it can be kept "alive" than is possible with only 32 bits of physical addressing. This is beneficial to 32-bit applications because the operating system can now keep more working sets in RAM, reducing the number of page-outs—even though a single 32-bit application will still "see" only a 4GB address space.

3.3.12.2 The 970FX as a 32-bit Processor

Just as the 64-bit PowerPC is not an extension of the 32-bit PowerPC, the latter is not a performance-limited version of the former—there is no penalty for executing in 32-bit-only mode on the 970FX. There are, however, some differences. Important aspects of running the 970FX in 32-bit mode include the following.

  • The sizes of the floating-point and AltiVec registers are the same across 32-bit and 64-bit implementations. For example, an FPR is 64 bits wide and a VR is 128 bits wide on both the G4 and the G5.
  • The 970FX uses the same resources—registers, execution units, data paths, caches, and busses—in 64- and 32-bit modes.
  • Fixed-point logical, rotate, and shift instructions behave the same in both modes.
  • Fixed-point arithmetic instructions (except the negate instruction) actually produce the same result in 64- and 32-bit modes. However, the carry (CA) and overflow (OV) fields of the XER register are set in a 32-bit-compatible way in 32-bit mode.
  • Load/store instructions ignore the upper 32 bits of an effective address in 32-bit mode. Similarly, branch instructions deal with only the lower 32 bits of an effective address in 32-bit mode.

3.3.13 Softpatch Facility

The 970FX provides a facility called softpatch, which is a mechanism that allows software to work around bugs in the processor core and to otherwise debug the core. This is achieved either by replacing an instruction with a substitute microcoded instruction sequence or by making an instruction cause a trap to software through a softpatch exception.

The 970FX's Instruction Fetch Unit contains a seven-entry array with content-addressable memory (CAM). This array is called the Instruction Match CAM (IMC). Additionally, the 970FX's instruction decode unit contains a microcode softpatch table. The IMC array has eight rows. The first six IMC entries occupy one row each, whereas the seventh entry occupies two rows. Of the seven entries, the first six are used to match partially (17 bits) over an instruction's major opcode (bits 0 through 5) and extended opcode (bits 21 through 31). The seventh entry matches in its entirety: a 32-bit full instruction match. As instructions are fetched from storage, they are matched against the IMC entries by the Instruction Fetch Unit's matching facility. If matched, the instruction's processing can be altered based on other information in the matched entry. For example, the instruction can be replaced with microcode from the instruction decode unit's softpatch table.

The 970FX provides various other tracing and performance-monitoring facilities that are beyond the scope of this chapter.

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020