Scalable Is (Still) the New Fast
For a long time, one fast processor was better than two slower ones. This rule comes from the fact that it’s trivial to make one CPU act like two slower ones (any multitasking operating system does it all the time), but much more difficult to make two CPUs act like one.
The change came when people started caring about power. Intel engineers tell a story from the Pentium era of a company in New York that couldn’t upgrade its desktops to the new chip because the building’s power distribution couldn’t handle the increased load. Power has become a lot more relevant in recent years. With the Pentium 4 topping 100 watts, the chip simply wasn’t feasible in a lot of applications. A laptop with a 100W CPU would be lucky to work for half an hour between charges, and a datacenter full of such chips would have serious problems with power and cooling. In California, datacenters account for around 10% of total power usage these days, and this share is expected to keep growing.
Even home users are starting to notice the power usage of their computers. With electricity costing around 10 cents per kilowatt-hour (kWh), a 100W CPU that’s left on for eight hours a day will cost around $15 more per year than a 50W version, and energy prices keep increasing. With CRTs being replaced by TFTs, the CPU becomes a major contributor to the total power drain of the machine.
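The $15 figure can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted above; the function name and the 365-day year are assumptions made for illustration.

```python
def annual_cost_usd(watts, hours_per_day, usd_per_kwh, days_per_year=365):
    """Yearly electricity cost of a device drawing `watts` while switched on."""
    # Convert watts to kilowatts, then multiply by hours of use per year.
    kwh_per_year = watts / 1000 * hours_per_day * days_per_year
    return kwh_per_year * usd_per_kwh


# Figures from the text: 10 cents per kWh, eight hours a day, 100W vs. 50W.
extra = annual_cost_usd(100, 8, 0.10) - annual_cost_usd(50, 8, 0.10)
print(f"Extra cost of the 100W CPU: ${extra:.2f} per year")
```

The difference works out to 146 kWh per year, or $14.60 at the quoted price.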
All other things being equal, a dual-core 500 MHz CPU will consume less power than a single-core 1 GHz CPU, largely because a slower clock allows a lower supply voltage. It’s no surprise, then, that the industry is heading toward lots of simpler cores, rather than smaller numbers of highly clocked ones.
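One way to see where the saving comes from is the usual first-order model of CMOS dynamic power, P = C × V² × f. The sketch below is an illustration, not a measurement: the function name, the unit capacitance, and the operating points (1.0 V at 1 GHz, 0.5 V at 500 MHz) are assumptions, and leakage power is ignored entirely.

```python
def dynamic_power(cores, freq_ghz, volts, capacitance=1.0):
    """First-order CMOS dynamic power, P = cores * C * V^2 * f (arbitrary units)."""
    return cores * capacitance * volts ** 2 * freq_ghz


# Hypothetical operating points: assume the supply voltage can scale
# linearly with the clock. Real chips hit a voltage floor well before this.
single_fast = dynamic_power(cores=1, freq_ghz=1.0, volts=1.0)
dual_slow = dynamic_power(cores=2, freq_ghz=0.5, volts=0.5)

# Same two slow cores, but with the voltage left at the 1 GHz level.
dual_slow_same_voltage = dynamic_power(cores=2, freq_ghz=0.5, volts=1.0)

print(single_fast, dual_slow, dual_slow_same_voltage)
```

Under these assumptions the dual-core part draws a quarter of the power of the fast single core. With the voltage held constant the two come out equal, which suggests that the saving in this model comes from the voltage reduction that a slower clock permits, not from the lower frequency itself.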
Another factor is the movement of computation onto the network. Recent years have seen rapid growth in web services and applications. Since these systems are multi-user, they’re inherently parallel. And because every request has to cross the network, latency is large and processing time is a small fraction of the total response time, making slower, more parallel chips a big win.
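A rough calculation shows why a slower core costs so little here. All the numbers below are hypothetical, chosen only to illustrate the shape of the argument: a 100 ms network round trip, 5 ms of processing on a fast core, and 10 ms on a core running at half the speed.

```python
def response_time_ms(network_ms, processing_ms):
    """Total time the user waits: network round trip plus server processing."""
    return network_ms + processing_ms


# Hypothetical figures, not measurements.
NETWORK_MS = 100.0
FAST_CORE_MS = 5.0
SLOW_CORE_MS = 10.0  # half the clock speed, so twice the processing time

fast = response_time_ms(NETWORK_MS, FAST_CORE_MS)
slow = response_time_ms(NETWORK_MS, SLOW_CORE_MS)
slowdown = slow / fast - 1

print(f"fast core: {fast} ms, slow core: {slow} ms, slowdown: {slowdown:.1%}")
```

With these figures, halving the core speed lengthens the response seen by the user by less than 5%, while two such cores can serve two requests at once.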
In addition to multi-core, much of the industry is pushing simultaneous multithreading (SMT) and similar solutions. The basic idea is to have two or more copies of the register file, allowing multiple threads to be loaded into the CPU at once. These threads can be switched without any of the overhead normally associated with a context switch.
The granularity of this approach varies a lot from implementation to implementation. At its coarsest, it helps hide the cost of cache misses: when one thread encounters a cache miss, the other one can be swapped in and run while the first waits for memory. At the finest level, it can help make full use of a CPU’s execution units, by issuing instructions from the second thread to execution units not used by the first.
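The coarse-grained case can be sketched with a toy cycle counter. This is a deliberately simplified model, not a description of any real CPU: each thread is a list of operations, "c" for a one-cycle compute step and "m" for a load that misses the cache, and the miss penalty, function names, and instruction streams are all made up for illustration. Switching threads is assumed to be free, as the duplicated register file is meant to allow.

```python
MISS_PENALTY = 4  # hypothetical stall, in cycles, after a cache miss


def run_serial(threads):
    """One hardware context: threads run back to back and stall on every miss."""
    return sum(1 + (MISS_PENALTY if op == "m" else 0)
               for thread in threads for op in thread)


def run_switch_on_miss(threads):
    """One hardware context per thread: switch when the current thread misses."""
    pc = [0] * len(threads)        # index of the next op in each thread
    ready_at = [0] * len(threads)  # first cycle at which each thread may issue
    cycle = 0
    current = 0
    while any(pc[i] < len(t) for i, t in enumerate(threads)):
        # Unfinished threads that are not waiting on a miss.
        runnable = [i for i, t in enumerate(threads)
                    if pc[i] < len(t) and ready_at[i] <= cycle]
        if runnable:
            if current not in runnable:
                current = runnable[0]  # zero-cost switch to a ready thread
            op = threads[current][pc[current]]
            pc[current] += 1
            if op == "m":
                ready_at[current] = cycle + 1 + MISS_PENALTY
        cycle += 1  # one op issued, or the core sat idle this cycle
    # Account for a miss that is still outstanding when the last op issues.
    return max(cycle, *ready_at)


# Two made-up instruction streams, each containing one cache miss.
thread_a = ["c", "m", "c", "c"]
thread_b = ["c", "c", "m", "c"]

serial_cycles = run_serial([thread_a, thread_b])
switched_cycles = run_switch_on_miss([thread_a, thread_b])
print(serial_cycles, switched_cycles)
```

For these two streams the model counts 16 cycles when the threads run one after the other and 10 when the core switches on a miss, because part of each thread's stall is overlapped with useful work from the other.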
This trend is not just found in the datacenter; the fastest computer in the world is currently Blue Gene/L, which uses 700 MHz PowerPC 440 CPUs. While each core is quite slow (the 400 series is designed for embedded applications), the machine makes up for this by having 131,072 of them.