How It Works: Filesystems
When you use a computer to store a document, you typically put it in a file. This isn’t a very good abstraction, but it’s familiar. You give this file a name (and maybe some other metadata) and put it into a directory. You might set some permissions on it, too.
Your operating system sees a disk as a block device, a linear array of 512-byte blocks that must be read or written in one go. Unless you’re a FORTH programmer, this doesn’t look at all like what you expect from persistent storage. The filesystem sits between you and the disk and translates between the two models.
In the UNIX sense of the word, a file is an array of bytes. For most filesystems, it’s an array of disk blocks with some associated metadata. The main job of any filesystem is finding which blocks belong to a given file and which belong to no files (and so can be used for new files or appended to an existing file).
MS-DOS used a filesystem called FAT, for the file allocation tables that were at its heart. In spite of the many serious design flaws with FAT, it has steadfastly refused to die. The file allocation tables themselves are a simple array containing the index of the next block in the file. Once you have the first block for any file, you find the next one by looking at the index in the FAT.
One obvious flaw in this approach is that it gives O(n) behavior when seeking to a specific location in a file. If you have 4KB clusters on disk, and want to seek to the end of a 200MB file, you’ll need more than 50,000 lookups in the FAT. If every one of these lookups requires a disk access, this process will take around seven minutes. Even with one disk access per thousand lookups, ait can take half a second. For this reason, good performance requires keeping the FAT in memory.
In the UNIX world, things are done somewhat differently. Every file has a master inode. This is a data structure containing a load of metadata such as the permissions, creation time, and so on, and a short list of blocks. For small files, this is all that’s needed. For larger files, there’s an overflow capability in which a second disk block is used just to store a list of blocks. For really huge files, two or three layers of indirection can be used, so the inode contains an address of a block containing addresses of blocks containing addresses of blocks containing the file. At most, three disk reads are needed to find any address in a file stored like this.
The BeOS file system (BFS) used a slight variation of this system. Instead of storing a set of blocks, it stored a set of block ranges. If the file was stored contiguously on disk, only a single range was required, no matter how big the file became. In the worst case, this technique took twice as much space as storing individual blocks, since two integers had to be stored for a single block.
The inverse part of this is storing the free space. With FAT, files and free space were stored in the same way—free blocks were flagged in the FAT with a special identifier. Finding a free block meant performing a linear search through the table until this identifier was encountered. Since there was no efficient way of finding a group of blocks of a specific size, this led to the problem of fragmentation, the word used to describe what happens when files are not stored contiguously on disk, but rather are stored in fragments. Fragmentation is a problem for mechanical disks, since seeking is quite slow. A mechanical disk might have a seek time of 5 ms. If you need to seek every 512 bytes you read, even if reading an individual block takes no time, the maximum transfer rate you can get is 100 KB/s. In linear reads, the same disk can probably get around 50 MB/s. For this reason, fragmentation is a major problem for filesystems.
While the design of FAT seems silly here, it’s worth noting that it predates MS-DOS and actually was designed for Microsoft BASIC. For storing BASIC programs (rarely more than a few KB each) on low-density floppy disks, FAT isn’t such a bad design.
Storing free space efficiently is a difficult problem, because you need to be able to update the free space map quickly and also find a block of space of a specific efficiently. ZFS has quite a neat solution to this puzzle. A ZFS storage volume is split into chunks, within which free space is tracked independently. Whenever a space is allocated or freed, the range of space and the operation are written to a log. This is a very fast operation, since it’s just writing three values to the disk. Periodically, the log is read, and an AVL tree (a self-balancing tree structure) is created in memory. The tree can be searched easily for free space of a specific size, and can be written out to disk in the same form as the log, with all of the duplicate entries removed. For example, if you allocate block 12 and then free block 12, this pair of operations will cancel each other out in the tree and not be recorded when the tree is converted to log form.