- Jun 15, 2007
One of the main focuses for DragonFly is clustering. As commodity hardware has dropped in price, clusters have become a very cost-effective way of achieving high performance and reliability. In many ways, individual computers are beginning to look more and more like small clusters. Memory in an Opteron system, for example, is split between the CPUs, adding an extra layer of indirection when accessing memory that is not local to the CPU on which a thread is running. The bandwidth between CPUs is higher, and the latency lower, than between cluster nodes, but the principle is the same.
One of the biggest problems in this situation is cache coherency. If two processors are accessing the same data in memory, each will have a copy of it in its cache. If one changes the data, the other needs to know to invalidate its copy, or it will keep working with stale data. SMP machines typically handle this in hardware for their internal caches, but those are not the only caches an operating system has to deal with. Life would be simple if every filesystem operation, for example, involved a round trip to the disk, but it would also be painfully slow (particularly if the disk in question is on a remote fileserver). For this reason, the operating system keeps disk caches in memory, and possibly caches of remote disks on the local disk. Keeping these in sync can be a major challenge, as Matt explained.
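The write-invalidate idea behind all of this can be sketched in a few lines. The following is purely illustrative (the classes and method names are hypothetical, not DragonFly code): each node keeps a local cache over a shared backing store, and a write tells every peer to drop its now-stale copy.

```python
# Toy write-invalidate protocol: peer caches over a shared backing store.
# All names here are illustrative, not DragonFly APIs.

class Node:
    def __init__(self, store, peers):
        self.store = store      # shared backing store (a dict stands in)
        self.peers = peers      # other Node objects caching the same store
        self.cache = {}

    def read(self, key):
        if key not in self.cache:        # cache miss: fetch from the store
            self.cache[key] = self.store[key]
        return self.cache[key]

    def write(self, key, value):
        self.store[key] = value
        self.cache[key] = value
        for peer in self.peers:          # tell peers their copies are stale
            peer.cache.pop(key, None)

store = {"x": 1}
a, b = Node(store, []), Node(store, [])
a.peers.append(b)
b.peers.append(a)

b.read("x")          # b now caches x == 1
a.write("x", 2)      # invalidates b's cached copy
print(b.read("x"))   # 2: b re-fetches after the invalidation
```

Real coherency protocols (MESI and friends, or a cluster cache-coherency layer) are far more subtle, but the invalidate-on-write shape is the same.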
DragonFly BSD aims to support Single System Image clustering. The concept will be familiar to VMS users, but is relatively rare in the UNIX world. The idea is that a group of nodes connected together will appear to be a single large SMP machine. All processing resources, memory, disks, and peripherals will be shared.
Of course, a true SMP system will be faster in many cases, but for any workload where a cluster is likely to be appropriate (i.e. anything not bound by IPC latency), deploying on DragonFly should be no more difficult than deploying on a large SMP machine.
DragonFly was one of the first operating systems to see a serious attempt at porting Sun's ZFS, and for a while it looked as though the port would see some heavy use, but this was not to be. Matt spoke briefly on the limitations of ZFS, saying:
ZFS is designed more for a client-server architecture: it assumes that some machines are responsible for managing storage while others are simply consumers of storage services. The aim for the new (and, as yet, unnamed) DragonFly filesystem is that each node in a cluster will have some local storage, but the cluster as a whole will see a single storage pool. This will allow nodes to use local storage for speed and remote storage for reliability, with cached copies of data migrating around the cluster in a manner somewhat reminiscent of CMU's Andrew File System.
The design of the filesystem is tied closely to the caching model for cluster operation, but it incorporates a number of features that are not specific to cluster use-cases. These include the ability to create an arbitrary number of snapshots, much as ZFS does, and fast crash recovery. One particularly interesting feature is infinite logless replication, which would make it easy for a laptop user to keep, say, an external hard drive in sync with the laptop's internal disk. Combined with snapshots, this could make backups very easy: snapshot the external drive every day, and have the laptop sync with it whenever it is plugged in. As with many other areas of DragonFly BSD, the filesystem will be worth watching over the next few years.
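The cheap-snapshots idea can be illustrated with a toy copy-on-write store. This is a hedged sketch of the general technique, not the actual filesystem design: because writes never overwrite an old version, taking a snapshot is just a matter of remembering which version was current.

```python
# Toy copy-on-write store: snapshots are O(1) because old versions
# are never overwritten. Illustrative only, not the real filesystem.

class COWStore:
    def __init__(self):
        self.versions = [{}]             # each version is left untouched forever

    def write(self, key, value):
        head = dict(self.versions[-1])   # copy-on-write the current version
        head[key] = value
        self.versions.append(head)

    def snapshot(self):
        return len(self.versions) - 1    # a snapshot is just a version index

    def read(self, key, snap=None):
        version = self.versions[-1 if snap is None else snap]
        return version.get(key)

s = COWStore()
s.write("a", 1)
snap = s.snapshot()
s.write("a", 2)
print(s.read("a"))        # 2: the current version
print(s.read("a", snap))  # 1: the world as of the snapshot
```

A real filesystem shares unchanged blocks between versions rather than copying whole tables, but the read-through-a-version-index behaviour is the same.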
Concurrency is important on a small scale as well, and this typically means threads. FreeBSD 4.x used a pure user-space approach to threading: blocking system calls exposed by the C standard library were replaced under the hood by non-blocking versions, and an alarm signal was used to switch between running threads within the same process. This worked, but it meant that all threads in a single process had to run on a single CPU, eliminating much of the point of threading. This is typically referred to as an N:1 model: N threads in userspace map to a single scheduling entity in the kernel.
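The user-space half of an N:1 model can be sketched with generators standing in for threads. This is only a shape sketch: FreeBSD 4.x preempted threads via SIGALRM rather than relying on voluntary yields, but the multiplexing of many threads onto one flow of control is the same idea.

```python
# Toy N:1 user-space scheduler: many cooperative "threads" (generators)
# multiplexed onto a single flow of control. Illustrative only.

def worker(name, out, steps):
    for i in range(steps):
        out.append((name, i))
        yield                    # yield is the voluntary context switch

def run(threads):
    ready = list(threads)        # the user-space run queue
    while ready:
        t = ready.pop(0)         # round-robin: resume the next thread
        try:
            next(t)
            ready.append(t)      # still runnable: back of the queue
        except StopIteration:
            pass                 # thread finished

out = []
run([worker("A", out, 2), worker("B", out, 2)])
print(out)   # [('A', 0), ('B', 0), ('A', 1), ('B', 1)]
```

Note that nothing here ever involves the kernel: that is both the strength (cheap switches) and the weakness (one CPU, and one blocking call stalls everything) of the model.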
With FreeBSD 5.x, the team tried to move to an N:M model, based on the scheduler activations work done at the University of Washington in the '90s and popularized by Solaris. The idea was that the kernel would schedule one thread per processor, and a userspace library would multiplex more threads on top of these if required. The advantage of this approach was that it scaled very well, because switching between threads in the same process on the same CPU did not require a context switch into the kernel.
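A toy version of the N:M arrangement (again purely illustrative) extends the same idea: several kernel-scheduled threads all run the same user-space scheduler loop, pulling cooperative threads from a shared run queue.

```python
# Toy N:M scheduler: M kernel threads pull N cooperative user threads
# (generators) from a shared run queue. Illustrative only.
import queue
import threading

def run_nm(green_threads, m=2):
    runq = queue.Queue()
    for g in green_threads:
        runq.put(g)

    def kernel_thread():                 # each worker is a "kernel thread"
        while True:
            try:
                g = runq.get_nowait()
            except queue.Empty:
                return                   # no runnable user threads left
            try:
                next(g)
                runq.put(g)              # yielded: any worker may resume it
            except StopIteration:
                pass                     # user thread finished

    workers = [threading.Thread(target=kernel_thread) for _ in range(m)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

results = []

def green(name):
    results.append(name + ":start")
    yield                                # may resume on a different worker
    results.append(name + ":end")

run_nm([green("A"), green("B"), green("C")], m=2)
print(sorted(results))
```

The hard parts that sank KSE in practice are exactly what this sketch omits: upcalls to tell userspace when a thread blocks in the kernel, signal delivery, and debugger support.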
In recent releases, FreeBSD and Solaris have both moved to a 1:1 model, in which the kernel is aware of every user-space thread, leaving NetBSD as the only major UNIX sticking with an N:M model. DragonFly has used a 1:1 model from the outset, and Matt pointed out the benefits of this approach.
This does not, of course, preclude the development of an N:M threading library on DragonFly BSD, and the Erlang virtual machine is an example of one. Erlang, unlike C, makes it easy to have thousands or even millions of threads. Doing this with a 1:1 model would incur huge memory usage and context-switching overhead, so the Erlang VM implements its own threading model, using one kernel thread per CPU and multiplexing its lightweight processes across them with the kevent mechanism.
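The one-thread-multiplexing-many-I/O-sources pattern can be sketched with Python's selectors module, which picks the platform's event mechanism (kqueue/kevent on the BSDs, epoll on Linux). The example below is a minimal sketch, not Erlang's actual scheduler: two socket pairs stand in for many lightweight processes waiting for messages.

```python
# Toy single-thread event loop: one selector (kqueue on BSD, epoll on
# Linux) multiplexes many I/O sources. Illustrative only.
import selectors
import socket

sel = selectors.DefaultSelector()
received = []

def on_readable(conn):
    received.append(conn.recv(1024))     # a "process" got its message
    sel.unregister(conn)

# Two socket pairs stand in for many lightweight Erlang-style processes.
pairs = [socket.socketpair() for _ in range(2)]
for recv_end, send_end in pairs:
    sel.register(recv_end, selectors.EVENT_READ, on_readable)
    send_end.send(b"ping")

# The event loop: block until sources are ready, dispatch callbacks.
while len(received) < len(pairs):
    for key, _ in sel.select():
        key.data(key.fileobj)            # key.data is the callback

for recv_end, send_end in pairs:
    recv_end.close()
    send_end.close()

print(received)   # [b'ping', b'ping']
```

Run one such loop per CPU on its own kernel thread and you have the rough shape of the Erlang VM's scheduler: cheap user-level "threads" on top of a small, fixed number of kernel threads.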