But I Can already Do That…
Most articles about GCD have been filled with comments about how it's possible to do the same thing with existing technologies, or how other languages can already do the same thing. This is true. There is (almost) nothing new, conceptually, in GCD. That said, the same is largely true of OS X as a whole; it is mostly a set of implementations of ideas from the 1980s and earlier. The same is true of most modern operating systems and languages, but that doesn't mean that they are not worth using.
Closures are certainly not new. They were a core part of the Lisp language back in 1958. Similar functionality has been available in C as a GNU extension in the form of nested functions, although these had a few more limitations. That doesn't mean that they are not useful; quite the reverse.
Thread pools, likewise, are not a new concept and neither is the concept of working with queues. The very fact that GCD can be ported to FreeBSD without needing any kernel modifications shows that it was possible to implement something equivalent (although slightly less efficient) in your own code already.
So, what is the novelty in GCD? To answer that, let's first take a look at how you would generally write similar code without using it. The simplest way of indicating that tasks can be run in parallel is to create a new thread for each one. There are some big differences between the cost of adding a task to a queue. Critics of GCD have argued that it's a work-around for Apple's inferior threading system and so isn't required in their favorite OS, so when I look at the costs of both operations I'll just look at the minimum cost and avoid anything Darwin-specific. When you create a new thread, on a typical UNIX-like system, you need to:
- Allocate at least one page for the stack. This will cause a (slow) TLB fault when the thread starts.
- Allocate at least one page for the thread-local storage (TLS) segment and copy its data across. The Darwin linker doesn't support TLS segments, but the pthread_key family of calls incurs similar overhead.
- Create the in-kernel data structures required for the thread and insert it into the run queue (this involves a context switch into kernel space and back).
This is a fairly significant amount of overhead. In contrast, Apple describes the cost of adding a task to a queue as requiring fewer than 15 instructions. This is somewhat misleading. One of those 15 instructions has the LOCK prefix, and so is about the slowest x86 instruction possible, and if there is contention then one of these instructions is a jump, so these 15 instructions may be executed more than once. In common cases, however, the cost is only slightly more than the cost of a function call, and less than the cost of a system call.
That's only half of the story though. If you have eight threads and only one CPU, then the kernel will be periodically swapping them in and out. This has a relatively low minimum overhead; the kernel needs to service a timer interrupt, save the registers from one thread, load them from another, and then return to user space. There are a few other hidden costs, however. Each thread is working on something different and, in a well-designed program, will have mostly disjoint working sets (if they didn't, then you'd get a lot of overhead from locking). That means that each thread will be want different bits of memory to be in the cache and TLB, and every time you swap between them you will get churn on both of these, which will slow everything down.
Compare that with the cost of swapping between eight work queues in one thread. GCD will run a block from one queue to completion, then run one from another, and so on. The overhead for going from one block to the next is about the same as a function call. When one block finishes, its TLB and cache lines aren't needed anymore, so displacing them is not a problem.
This model may seem familiar, as it's basically cooperative multitasking. One way of thinking about this bit of GCD is as an N:M threading model, where the kernel threads are preemptive and the user space threads are cooperative. Cooperative multitasking gives better throughput, but has the disadvantage that one task can kill responsiveness for the entire system. In this case, the tasks are all owned by one process, so the worst that can happen is the process stalls itself. GCD can work around this by spawning a few more kernel threads if the cooperative threads are not completing blocks fast enough.