
The Future of CPUs: What's After Multi-Core?

Date: Oct 27, 2006


As Moore's Law continues to hold, IC designers are finding that they have more and more silicon real estate to play with. David Chisnall hazards some guesses as to what they might do with it.

Predicting the Future

It’s very easy to make accurate predictions about the future of technology. Stuff will get smaller, faster, and cheaper. This has been true for centuries and is unlikely to change—at least until we start running out of oil. Making interesting and accurate predictions is somewhat more difficult.

One trick employed by many futurists is to predict as many things as possible, and then remind people of the correct predictions when they happen, while brushing the less accurate predictions under the carpet. This approach works, to an extent, but isn’t much fun.

One good technique in the computing world is to look at what’s happening in the mainframe and supercomputer communities and predict that the same sorts of things will happen in the personal computer arena. This rule was driven home to me when I attended a talk by an IBM engineer about his company’s new virtualization technology. He commented that his company had an advantage over other people working in the area: Whenever they were stuck, they could go along the hall to the mainframe division and ask how they solved the same problem a couple of decades ago.

This trend is a good guide to the future: Things always trickle down eventually from the high end to the consumer bracket.

Another trend is that the high end is constantly getting smaller. SGI's mistake was not to realize this truth. Ten years or so ago, SGI was the company to go to for high-end graphics hardware. They still retain this niche; their latest hardware allows a large number of GPUs to share the same RAM and therefore work together tightly. The difference now is that the GPUs are developed by NVIDIA. The engineers who went on to found NVIDIA used to work for SGI, but SGI's management didn't want them to produce a consumer-level graphics accelerator, since it would compete with the company's high-end hardware. They left, formed their own company, and now own about 20% of a market that is orders of magnitude larger than the entire market in which SGI competes. Worse from SGI's perspective is that many of the people who needed high-end hardware a decade ago now barely tax their consumer-grade equipment.

New uses for high-end equipment constantly emerge, but eventually the consumer segment catches up.

The State of the Art

Dual-core CPUs were first made a commercial success by IBM with the POWER4 series some years ago. The idea is simple: Most big-iron machines have a large number of CPUs, and if you put more than one core in each package, you can reduce the physical size of the machine.

These days, Intel and AMD have jumped on the dual-core bandwagon and are racing toward quad-cores and beyond. This is a logical development, according to Moore’s Law—one of the most misquoted observations in computing. Moore’s Law states that the number of transistors that can be put on a CPU for a fixed financial investment doubles every 12–24 months. (The exact time period varies, depending on when you ask Gordon Moore, but is usually quoted as 18 months.) If you want to spend more money, you can add more transistors; the Extreme Edition Pentiums do this to have more cache, for example.

The question becomes what to do with these spare transistors. The Pentium II, released in 1997, used 7.5 million transistors. The Itanium 2, released in 2004, used 592 million. Most of these were cache. Adding cache to a CPU is a nice, easy way of using up transistors. Cache is very simple, and adding more is only slightly more complicated than copying-and-pasting part of the chip design a few times. Unfortunately, it starts to get into diminishing returns fairly quickly. Once the entire working set of a process fits in cache, adding more provides no benefit.

The next trick is to add more cores, effectively duplicating the entire CPU. Looking at the transistor counts for the last two CPUs, we see that in 2004 it would have been possible to produce an 80-core Pentium II. Within a decade, it will be economically feasible to produce a single chip with 5,000 P6 cores. Unfortunately, the power requirements of such a chip would mean that it would require its own electricity substation—not to mention a steady supply of liquid nitrogen to cool it. It also looks as though memory technology will only have advanced to the point where a few percent of those cores could be kept fed with data. Each could have its own memory bus, but then you've got a minimum of 10,000 pins—64,000 for a 64-bit memory bus. Even designing the package in which such a chip would be distributed is a significant engineering challenge; designing a motherboard that would connect each RAM channel to a memory bank is a problem that would give most PCB designers recurring nightmares.
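
The arithmetic above is easy to check. The following back-of-the-envelope sketch in C uses the transistor counts quoted earlier; the 18-month and 24-month doubling periods are assumptions that bracket the figures usually quoted for Moore's Law, which is roughly where the 5,000-core estimate falls.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double pentium2_transistors = 7.5e6;  /* Pentium II, 1997 */
        const double itanium2_transistors = 592e6;  /* Itanium 2, 2004  */

        /* How many Pentium II-class cores fit in the 2004 transistor budget? */
        double cores_2004 = itanium2_transistors / pentium2_transistors;
        printf("P6-class cores in the 2004 budget: %.0f\n", cores_2004);

        /* Extrapolate the budget ten years forward for two doubling periods. */
        double periods_in_years[] = { 1.5, 2.0 };
        for (int i = 0; i < 2; i++) {
            double cores_later = cores_2004 * pow(2.0, 10.0 / periods_in_years[i]);
            printf("A decade on, with a %.1f-year doubling period: %.0f cores\n",
                   periods_in_years[i], cores_later);
        }
        return 0;
    }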

Throwing more cache onto chips worked for a little while. Throwing more cores on will work for a little while longer. Eventually, however, a more intelligent solution will be required.

RISC Versus CISC

One of the big debates in the 1980s and 1990s was whether the CISC or RISC approach to instruction set design was correct. The idea behind RISC was to provide a small set of simple instructions that could be combined to perform complex operations, while the CISC camp wanted individual instructions powerful enough to be used on their own.

The embodiment of the CISC design was the VAX. Writing VAX assembly was not much different from writing high-level code. In later VAX systems, many of these instructions were microcoded, which meant that they were decomposed into sequences of simpler internal operations that were then executed by the hardware.

After the VAX, Digital created the Alpha, a chip at the opposite extreme. The Alpha had a very small instruction set, but it ran incredibly fast. For many years, it was the fastest microprocessor that money could buy. Even now, several of the top 500 supercomputers are Alpha-based, in spite of the fact that the chip hasn’t been in active development for five years.

In the early years, RISC did very well. Compiler writers loved the chips; they could easily remember the instruction sets, and it was easier to map complex language constructs onto sequences of RISC instructions than to try to map them to CISC instructions.

The first cracks in the RISC philosophy appeared with improvements in the way division was handled. Early RISC chips didn't have a divide instruction; some didn't even have a multiply instruction. Instead, these operations were built up from sequences of more primitive ones, such as shifts and subtractions. This wasn't a problem for software developers; they would just copy the sequence of instructions for a divide from the architecture handbook, put it in a macro somewhere, and then use it as if they had a divide instruction. Then someone worked out a more efficient way of implementing division directly in hardware.
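
To make that concrete, here is a minimal sketch of the kind of shift-and-subtract routine a handbook might have supplied in place of a divide instruction. It's plain restoring division written in C rather than any particular vendor's macro, and the function name is mine.

    #include <stdint.h>

    /* Restoring division built from shifts, compares, and subtractions.
     * The divisor must be nonzero; no hardware divide is used. */
    uint32_t soft_udiv(uint32_t dividend, uint32_t divisor)
    {
        uint32_t quotient = 0;
        uint32_t remainder = 0;

        for (int bit = 31; bit >= 0; bit--) {
            /* Shift the next bit of the dividend into the running remainder. */
            remainder = (remainder << 1) | ((dividend >> bit) & 1);

            /* If the divisor fits, subtract it and set this quotient bit. */
            if (remainder >= divisor) {
                remainder -= divisor;
                quotient |= (uint32_t)1 << bit;
            }
        }
        return quotient;  /* on exit, remainder == dividend % divisor */
    }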

The next generation of CPUs with divide instructions could execute the operation in fewer cycles, while those without took the same number of cycles to execute the series of instructions used as a substitute. This has been taken even further with Intel’s latest Core micro-architecture. Some sequences of simple x86 operations are now combined into a single instruction that’s executed internally.

Some components of the RISC philosophy live on. It's still widely regarded as a good idea for instruction sets to be orthogonal, for example, because providing multiple ways of doing the same thing is a waste of silicon. The idea of a simple instruction set is being eroded, however. Even modern PowerPC and SPARC chips that are marketed as RISC processors wouldn't be recognized as RISC by those who invented the term.

SIMD and More

Single instruction, multiple data (SIMD) instructions had long been a relatively common feature in the HPC arena, but the Pentium with MMX was the first x86 chip to incorporate them. These instructions do exactly what their name suggests, providing a way to perform the same operation on multiple sets of data. While a traditional (scalar) instruction might subtract one number from another, the SIMD equivalent could subtract four numbers from four other numbers, performing the same operation on four inputs at once. This kind of thing is used a lot in image and video processing.
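
As a rough illustration of the difference, here is a scalar loop next to its SIMD equivalent, written with SSE compiler intrinsics (SSE rather than the integer-only MMX, simply because floating-point makes for a clearer example; the function names are mine).

    #include <xmmintrin.h>  /* SSE intrinsics; compile with, e.g., gcc -msse */

    /* Scalar version: four subtractions, one at a time. */
    void sub4_scalar(const float *a, const float *b, float *out)
    {
        for (int i = 0; i < 4; i++)
            out[i] = a[i] - b[i];
    }

    /* SIMD version: a single instruction subtracts all four pairs at once. */
    void sub4_simd(const float *a, const float *b, float *out)
    {
        __m128 va = _mm_loadu_ps(a);            /* load four floats */
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(out, _mm_sub_ps(va, vb)); /* four subtractions in one go */
    }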

SIMD instructions are relatively cheap to add to CPUs, and provide a good return on investment. If your process is taking 10% of your CPU and you upgrade, you’re unlikely to notice that it’s now taking only 5%. But if you have a process that uses 100% of your CPU, you’re very likely to notice if it takes 2 minutes to run instead of 10. Most of the applications that benefit from SIMD fall into the latter category—activities that use a lot of CPU power—so allowing them to run faster provides a perceptible improvement.

Beyond SIMD, some processors have instructions tailored to specific algorithms. The VIA C3, for example, had instructions dedicated to accelerating AES encryption. The C7 added more for accelerating SHA-1 and SHA-256 hashing. Like graphics, cryptographic algorithms are typically CPU-bound. They’re likely to be even more important in the future, as more data is sent over the network and more machines are mobile. It’s not uncommon for a laptop hard drive to be AES-encrypted in case of theft; accelerating AES on these machines makes anything that uses the disk faster.

Hardware acceleration for cryptography is not new. Several companies produce cryptographic accelerators that sit on a PCI card.

Graphics Processing Units (GPUs)

While only a few computers have a dedicated cryptography card, most have a graphics coprocessor. Each generation of GPUs is more general than the last. These days, the GPU is used a lot in HPC applications, because it has enormous computational throughput. In effect, a GPU is a superscalar streaming vector processor: it handles a number of streams of SIMD instructions in parallel, very quickly.

Architecturally, a GPU has a lot in common with a Pentium 4. Both use very long pipelines to allow them to have a lot of instructions in flight at once, and both perform very badly if a branch is predicted incorrectly. This was a problem for the Pentium 4, since in general-purpose code a branch occurs, on average, about every seven instructions. It's not such a problem for a GPU, which is designed to execute specific operations that don't involve much branching.

The current situation in the PC world is very similar to that of 20 years ago. Back then, it wasn't uncommon for a computer to have several processors, which is why we call the processor the central processing unit (CPU). The CPU took on general-purpose calculations and coordinated the activities of the other processors. Commonly, workstations and high-end PCs also had a floating-point unit (FPU) that handled floating-point operations. Starting with the 80486, the FPU was on the same die as the CPU. Another common addition was a memory management unit (MMU). This unit handled the translation between real and virtual memory; these days, you'd be hard-pressed to find a CPU without an MMU built in. A modern computer has a CPU and a highly parallel coprocessor: the GPU. It doesn't take much of a leap to imagine that Intel will eventually start adding a GPU core or two to its CPUs.

At this point, you're probably thinking that this would limit the possibility of upgrades, so it's worth taking a step back to see where processors are going. In 2005, Apple's laptop sales passed its desktop sales for the first time. The rest of the industry is following. The rate of growth of mobile GPU sales exceeded that of desktop GPUs by a significant margin, and Intel is the largest player in both the CPU and GPU markets. Very few people upgrade the GPU in their laptops.

What Else Can Go on a CPU?

Floating-point coprocessors, memory management units, and vector processors have already been added to modern CPUs. Digital signal processors (DSPs) have been added to a number of embedded CPUs, and it seems likely that they'll find their way into consumer CPUs soon.

The first use of additional transistors was to add more execution units, making deeper pipelines and wider superscalar architectures, and then more cache. Now we’re adding entire homogeneous processing units. Each of these only scales to a certain extent, though. The step from one to two cores is a huge improvement; it’s rare for even a CPU-bound process on my computer to be allocated more than 75% of the CPU, and far more common for it to be at around 50%, with the rest shared between other apps and the kernel.

Going from two to four cores is going to be a smaller improvement, but still significant. When you get up to 32 or 64 cores, things get more interesting. It’s almost impossible to write threaded code that scales to this degree and isn’t too buggy to use. It’s easier with an asynchronous message-passing approach, but the popular desktop-development APIs aren’t designed around this model. And, realistically, very little desktop software will need this kind of power. Some will—video editing, for example, can eat about as much CPU power as you throw at it for the foreseeable future. The shrinkage of the high end will continue. These days, many people wouldn’t notice much difference between a 1 GHz Athlon and a 3 GHz Core 2 Duo most of the time. The number of people who need the fastest computer available is already quite small. The number who even need a medium-speed machine is going to shrink.
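
As a sketch of what the message-passing style looks like with today's desktop APIs, here is a small POSIX threads example in which worker threads pull self-contained messages from a shared mailbox rather than manipulating shared state directly; all of the names are illustrative.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct message { int payload; struct message *next; };

    static struct message *mailbox;  /* simple linked-list message queue */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t ready = PTHREAD_COND_INITIALIZER;

    static void post(int payload)    /* producer: send a message */
    {
        struct message *m = malloc(sizeof *m);
        m->payload = payload;
        pthread_mutex_lock(&lock);
        m->next = mailbox;
        mailbox = m;
        pthread_cond_signal(&ready);
        pthread_mutex_unlock(&lock);
    }

    static void *worker(void *arg)   /* consumer: receive and process */
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (mailbox == NULL)
                pthread_cond_wait(&ready, &lock);
            struct message *m = mailbox;
            mailbox = m->next;
            pthread_mutex_unlock(&lock);

            int payload = m->payload;
            free(m);
            if (payload < 0)         /* negative payload means "shut down" */
                break;
            printf("processed %d\n", payload);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t workers[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&workers[i], NULL, worker, NULL);
        for (int i = 0; i < 16; i++)
            post(i);                 /* the work itself */
        for (int i = 0; i < 4; i++)
            post(-1);                /* one shutdown message per worker */
        for (int i = 0; i < 4; i++)
            pthread_join(workers[i], NULL);
        return 0;
    }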

As mobile computing and datacenter density continue to grow, power consumption is going to become more and more important. Imagine a 32-core CPU that allows you to turn off cores when they're not in use. While running on battery, you may well find that you need only two or three of them.

Heterogeneous Cores

If you’re using only a small number of the cores on your processor, you might start to wonder why they even exist. The few applications that need all of the cores turned on could run faster on dedicated hardware, so why don’t you use that instead?

We're starting to see this trend already. One example is Apple's Core Video, which will run on the CPU if it needs to, on the CPU's vector unit if it has one, or on the GPU if that would be faster. OpenSSL will use a cryptographic card if one exists, or fall back to the CPU if not. The existence of general-purpose abstract interfaces to this kind of functionality makes it much easier to implement in hardware; only a very small change is required to take advantage of the new functionality. We saw the same thing with OpenGL: moving transform and lighting calculations onto the graphics hardware required new drivers to be written, but no modification to existing application code. Most importantly, since dedicated silicon is more efficient than general-purpose hardware at its intended task, the power usage is likely to be lower.
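
OpenSSL's EVP layer is a good example of such an abstract interface: the caller asks for AES through a generic API, and if an accelerator engine is registered the same code uses it; otherwise the software implementation runs. Here is a minimal sketch against the modern OpenSSL API, with error handling omitted and the wrapper function name my own.

    #include <openssl/evp.h>

    /* Encrypt inlen bytes with AES-128-CBC through the generic EVP API.
     * out must have room for inlen plus one block of padding.
     * Whether this runs in software or on an accelerator is up to OpenSSL. */
    int encrypt_buffer(const unsigned char key[16], const unsigned char iv[16],
                       const unsigned char *in, int inlen, unsigned char *out)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int len = 0, total = 0;

        EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, key, iv);
        EVP_EncryptUpdate(ctx, out, &len, in, inlen);
        total = len;
        EVP_EncryptFinal_ex(ctx, out + total, &len);
        total += len;

        EVP_CIPHER_CTX_free(ctx);
        return total;  /* number of ciphertext bytes written */
    }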

If you have silicon to spare, why not add a GPU onto the die? A cryptographic accelerator? Dedicated hardware for other computationally expensive algorithms? When they're not in use, you could turn them off. When they're needed, they still draw less power than running the same algorithms on general-purpose hardware. The first step here was the integration of FPUs and then SIMD units. The next step will likely be the integration of a GPU on-die. Beyond that, it's likely to be a matter of which algorithms will benefit the most from dedicated hardware. In some cases, we'll simply see extensions to the basic instruction set (as happened with floating-point and SIMD instructions) to provide operations that will help a few categories of algorithms. Eventually, we're likely to see these evolve beyond simple instructions.

One idea I’ve seen that seems quite appealing is to put a field-programmable gate array on-die. This would allow a lot of flexibility in operation, but is likely to come with a significant power cost.

Hardware IPC

One problem with all of these execution units is communicating among them. Moving data between the SIMD unit and the scalar part of a modern CPU is relatively expensive; moving data between the CPU and the GPU, even more so. Communicating between cores typically involves going via a shared cache, or via main memory if the cores don’t share a common cache.

The Transputer, produced in the 1980s, addressed a similar problem. A Transputer system had a large number of cheap, relatively independent processing units, each with four interconnects that allowed it to talk to other processing units in close proximity very quickly. AMD's HyperTransport is similar, although it's generally used to implement shared memory rather than as a message-passing interface.

The closest descendant of the Transputer these days is the Cell. This design has a set of synergistic processing units (SPUs). Apart from having the highest buzzword density of any processor to date, these are interesting in the way that they process data. Most CPUs have a very fine-grained load-and-store mechanism. They load a word (typically 64 bits of data these days) from memory, process it, and write it out. This is a simplification; in practice, they'll typically interact with a layer of cache, which will look to a lower layer if it can't provide the data required. The Cell is different; rather than providing a transparent cache, the Cell has a small amount of local memory. It loads a large chunk of data into this space in a single DMA transfer from main memory, and then processes it. On the plus side, this means that you never get a cache miss because all of your data is in "cache." The only difficulty is that you have to work out a way of partitioning your problem to allow it to be solved one small block at a time.
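
The shape of that partitioning problem can be illustrated without the real Cell SDK. In the generic C sketch below, memcpy stands in for the DMA transfers an SPU would issue and a fixed-size buffer stands in for its local store; all of the names are mine.

    #include <stddef.h>
    #include <string.h>

    #define BLOCK_SIZE 4096                        /* bytes of "local store" used */

    static unsigned char local_store[BLOCK_SIZE];  /* stand-in for on-chip memory */

    /* Whatever per-byte work the algorithm needs; here, just invert the bits. */
    static void process_block(unsigned char *buf, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            buf[i] ^= 0xFF;
    }

    /* Walk a large buffer in main memory one block at a time:
     * stage a block in, process it locally, and write it back out. */
    void process_stream(unsigned char *main_mem, size_t total)
    {
        for (size_t off = 0; off < total; off += BLOCK_SIZE) {
            size_t n = total - off < BLOCK_SIZE ? total - off : BLOCK_SIZE;
            memcpy(local_store, main_mem + off, n);  /* "DMA in"             */
            process_block(local_store, n);           /* never misses "cache" */
            memcpy(main_mem + off, local_store, n);  /* "DMA out"            */
        }
    }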

Once an SPU has completed processing a block of data, it might send it back to main memory. Another option is to pass it on to another SPU. This approach is potentially very interesting, but it creates some significant layout problems when you try scaling it to large numbers of cores. Each core will both consume and produce data, and most will pass their output on to another core for further processing. The problem is that the number of potential recipients grows with the number of cores: while it's easy to send a message to, say, the nearest four cores, sending it any further away is more difficult. This is even more complex in a system with heterogeneous cores, because some tasks will need to be run on specific areas of the chip.

The Future

In the next 10 years, it's likely that hardware designers are going to have more silicon real estate to play with than they know what to do with. The most imaginative designs may well be the ones that yield the best performance, and only time will tell what these might be. I've made some guesses, but I've been watching the industry long enough to know that any predictions of this nature are likely to be wide of the mark.
