Writing Concurrent Systems, Part 1: Locking


Date: Sep 13, 2010

Article is provided courtesy of Addison-Wesley Professional.

Return to the article

Writing concurrent code is essential for taking advantage of modern multicore computers. In the first of a three-part series, David Chisnall looks at the simplest mechanism for supporting concurrency: explicit locking.

Buying a computer with only one core is becoming increasingly difficult. The latest generation of ARM designs support up to four cores, so even most mobile phones are likely to be multicore in the next couple of years.

With this fact in mind, it's becoming increasingly important to write code that takes advantage of more than one thread of execution, if you want to make full use of the processing power of a modern computer. It's not always important. If your code uses only 10% of one core, making it concurrent won't buy you anything. You'll have more overhead and something that's harder to debug, and at best it will end up using 5% of each of two cores; more likely, it will use 6–7% of each of two cores.

On the other hand, if your code is using a significant proportion of one core, concurrency is important. Even if the code uses less than 100% of one core, concurrency is important if that code is likely to be on a system with other similar processes. For example, imagine that you have three single-threaded processes running on a dual-core system, each wanting 60% of the CPU. No possible arrangement allows all three to get all the CPU time they want, even though you have more than enough processing power in total. But if you split one of the processes into two threads, each of which uses only 40% of the CPU, then you can fit all of the processes onto your dual-core system.

Concurrency isn't just important for scalability; it also can help improve latency. A process that uses only 10% of the CPU may spend most of its time waiting for I/O. If you program in an asynchronous style, with or without true concurrency, it's easier for your process to do some work on the CPU while waiting for data from the network or the disk.

Over the course of this three-part series, we'll take a look at some approaches to concurrency. This first article considers the simplest form of synchronization: explicit locking.

Big Giant Lock

When people first started wanting to run UNIX kernels on multiprocessor machines, they had to take an inherently serial blob of code and make it support parallel execution. The bottom half of the kernel wasn't such a large problem; it was used to being interrupted periodically by higher-priority interrupt handlers, so true concurrency was only a slight change.

The top half, which handled system calls, was a bigger problem. An early UNIX system had one process running and able to make system calls. When running on a multiprocessor system, however, suddenly two or more processes could call into the kernel at once.

The solution was to wrap the entire system-call handler in a giant lock. When you made a system call, you acquired a lock; when you exited the system call, you released the lock. Only one process at a time could make system calls. This approach had one massive advantage over other approaches: It was very easy to implement.

The "giant lock" idea worked better than you might expect. Most processes only spend around 10–50% of their time in system calls—often even less. With only 2–4 cores, you didn't see much of a performance problem. Because most processes were trying to make system calls at different times, few were held up waiting for the lock.

This approach is quite common even now for libraries. Before multithreading became common, a lot of libraries were designed with some global state. To support multiple threads, they use a global lock, which is acquired and released in every call that touches the global state. Only one thread can be executing in the library at once, but linked programs should be doing most of their work outside of the library.

Finer Granularity

Finer Granularity

Once you have a lock around everything, the next step is to try to move things out of the lock. In a kernel, for example, you can interact with the filesystem and networking subsystems independently. The network stack creating a TCP/IP packet from a stream of bytes for one process shouldn't affect a request from another process to load some data from disk. You can add two locks, one for each subsystem, and have the relevant calls hold the subsystem lock only while they perform work entirely on that subsystem.

This approach has the nice advantage that it can be done incrementally. For example, you might start with a lock for the entire networking subsystem, and then split that giant lock into one lock for each of the layers in the stack. You can also subdivide horizontally; for example, having one lock per connection per layer of the stack.

The more locks you have, the higher the potential for concurrency—but also the more time you'll spend acquiring and releasing those locks, which is a relatively expensive operation. It's very easy to get carried away and discover that your code is now happily spread across a 64-processor machine, with each processor spending 90% of its time acquiring and releasing locks.

There's no benefit from having two locks if you need both of them in order to do anything. Try to make your locks as independent as possible. Ideally, you should never need more than one lock at a time.

Release When Possible

Release When Possible

Often, you don't need to keep a lock while you're doing some work. For example, consider a typical producer/consumer system. The producer inserts some work into a queue, and the consumer takes it off. If you protect the queue with a lock, both parts can run entirely independently except for a brief period.

One example of such a system is a media playback framework. When you read a media stream, you first need to demultiplex it. Then you get a set of audio and video packets. You pass off these packets to decoders, which produce uncompressed frames or samples, which in turn get passed off to the display or to speakers.

Each of these steps is independent. The audio and video decoders can each be decoding a sample while the demultiplexer and the playback components run. All of these processes can run with no locks held. They only need to grab a lock when they interact with the queue.

The important thing about this model is that the separate threads (or co-routines multiplexed among a smaller number of threads) don't need to hold locks most of the time, because they're not touching any global state. In fact, they're behaving a lot more like processes than like threads. In this model, you're using threads for the efficiency and convenience of having a shared address space, but most of the time you're actually working in completely independent processes.

This design makes it much easier to reason about the interactions between bits of your program, because you have completely isolated components, which just use locking for communication, as a lighter-weight alternative to passing data via pipes or similar.

Locking Order and Deadlocks

Locking Order and Deadlocks

The biggest problem with using locks for synchronization is deadlock. In the simplest case, you have two bits of code, each of which holds one lock, but each one needs the other one's lock in order to proceed.

If each bit of code needs only one lock at a time, deadlock will not occur, but unfortunately this isn't always possible. The simplest way of avoiding a deadlock when you need more than one lock is to enforce a locking order, in which you can acquire two locks only by doing so in a specified order.

In this situation, if you acquire one lock and fail to acquire the other, then you must release the first lock and try again. The problem with this plan is that it may not be possible to release the first lock safely. You might need the first lock to start an operation and the second lock to complete it. If you have to release the first lock, then you also have to undo the associated work. Effectively, you need to implement transactions. (We'll look at transactions in part 3 of this series.)

Although this case is the simplest, it can become arbitrarily complex. One classic example is the "dining philosophers" problem. Each philosopher has a chopstick on either side of him and requires two to eat. They all start by picking up the chopstick on their right. Now they're deadlocked—no one can proceed until someone puts down his chopstick.

You can also easily enter a livelock condition from here. If each philosopher puts down the chopstick on his right and tries to pick up the one on the left, then they're stuck that way. They may put the left one down and pick up the one on the right again, oscillating between two states without actually managing to eat.

This problem could be avoided if each chopstick had a number, and they all had to be picked up in order. In this case, all of the philosophers would try to pick up the one on their right, except for the one sitting between the chopsticks with the highest number on one side and the lowest number on the other side. He would have to wait until the person using the chopstick on his left had finished before he could pick up the one on his right. Or he would get there first. In both cases, at least one person would be able to pick up both chopsticks. Once he'd finished eating, at least one more person would be able to start.

Lightweight Locking

Lightweight Locking

The locks that I've talked about so far have been mutual exclusion locks (usually abbreviated to mutexes), occasionally called non-counting semaphores. These are conceptually trivial. They support two operations: lock and unlock. If you try to lock a locked mutex, you wait until it's unlocked. Most implementations provide a few extra features, such as testing whether a mutex is locked or providing a timeout.

If you read Apple's marketing material for Grand Central Dispatch, you'll discover that the new-and-improved synchronization primitives are clever, initially using atomic operations in userspace, and only waiting in the kernel if the lock is contended. If you use pretty much any other platform, system-provided mutexes have been implemented this way for around a decade.

In general, there are three cases for a lock operation: The lock may be free, it may be locked temporarily, or it may be locked for a long time. It's difficult to tell the latter two cases apart, although you can monitor the contention on a lock and use past behavior to try to predict how the lock will behave in the future.

In well-designed code, the first case is the most common. Locks are only used to protect resources that are held briefly and then released. Therefore, it's very important that this case should be fast. Most mutex implementations use a one-word flag in memory to indicate whether the lock is held. This can be tested with an atomic test-and-set operation. If you managed to acquire the lock in this way, it was already unlocked, so you only used a few cycles. If it fails, then you can call down into the kernel and block until the lock is released. When unlocking, you need to notify the kernel to wake up any threads that were blocked while waiting for the lock. (If you're clever, you can test whether this action is needed in userspace, calling the kernel only if a wakeup call is needed.)

In some situations, you might never want to sleep. Real-time systems need to guarantee a maximum response time; one of the ways that they do this is by placing an upper bound on the amount of time during which you're allowed to hold a lock. The Symbian nanokernel is an example of this kind of design.

If your own code has some locks that are mostly likely to be uncontended, and they're never held for more than a few instructions, you might want to use a lightweight spinlock. A spinlock is a specific implementation of a mutex that repeatedly tests whether it can be locked, rather than relinquishing control to the scheduler.

Unfortunately, you can't implement a spinlock in portable C99, because it requires the use of some atomic operations. You can implement one using GCC intrinsic functions, however. The __sync_bool_compare_and_swap() function takes three arguments: The first is an address, the second is the expected value, and the third is the new value. If the value in memory at the address to which the first argument points is equal to the value of the second argument, that value is replaced by the value in the third argument.

This design is implemented using the target architecture's native compare-and-exchange operation, which will ensure that the entire operation is atomic. From this operation, you can build a simple spinlock, like this:

static inline void spinlock(volatile int *lock)
    while(!__sync_bool_compare_and_swap(lock, 0, 1))

This code tries to store the value 1 in the integer if the current value is 0. If this doesn't work, it calls the sched_yield() function, which is part of the pthreads API, and tells the scheduler to run other threads. You can improve this example slightly by having it try the atomic operation a few times before calling sched_yield(). To unlock the spinlock, you don't need an atomic operation:

static inline void spinunlock(volatile int *lock)
    *lock = 0;

The lock is already held, so there's no potential for races. If the write happens at the same time as an atomic test and set, then either it will sneak in just before (in which case the lock operation will succeed) or just after (in which case it will spin). It doesn't even matter if this change requires multiple instructions, because you're only changing a single bit in memory. If you wanted to expand this example into a counting semaphore, you would need an atomic operation.

Specialized Lock Types

Specialized Lock Types

Most threading systems include some special kinds of locks. The POSIX thread API contains one that is particularly useful: read-write locks. You typically end up implementing something like this yourself if you don't get one from your standard APIs.

Read-write locks are designed for the situation in which some tasks can be performed in parallel, but others can't. Reading and writing to a complex data structure is a typical example, hence the name.

Consider a structure like a balanced tree. Any number of threads can simultaneously read values from the tree; however, inserting or removing a node may result in some readers following dangling pointers.

Read-write locks are designed to address this issue. When you acquire the lock, you do so as either a reader or a writer. Any number of readers can hold the lock at the same time, but only one writer. If a writer holds the lock, no reader can.

These locks are typically designed to be very fast in read mode, which is sometimes called shared mode (in the Win32 API, for example). If you have a resource that supports multiple readers but only one writer, a read-write lock will give you a performance advantage versus using a mutex and only allowing one reader or one writer. Because the API version is typically heavily optimized, it's also likely to be much faster than rolling your own.

Doing Without Locks

Doing Without Locks

Locking is simple at the coarse granularity or at the edge of a system. It becomes increasingly complex to get right when you have to deal with lots of locks, but can still be useful if you're careful. In general, however, locks are most useful for implementing higher-level synchronization models.

In part 2 of this series, we'll take a look at some data structures that support concurrent users without locking. These are generally quite complex to understand, and even harder to prove correct, but when implemented correctly can be very efficient indeed.

800 East 96th Street, Indianapolis, Indiana 46240