Home > Articles > Operating Systems, Server

How It Works: Filesystems

David Chisnall
  • PrintPrint
  • Share ThisShare This
  • DiscussDiscuss
Close WindowDavid Chisnall

David Chisnall

Learn more…

Inside Modern X11 Programming
Sep 18, 2009
Making JavaScript Fast, Part 2
Sep 15, 2009
Security in Your Pocket: OpenBSD on ARM
Sep 11, 2009
Making JavaScript Fast, Part 1
Sep 8, 2009
The Failure of the GPL
Aug 31, 2009
How Not To Optimize
Aug 21, 2009
A Half-Way Step to Apple’s Source Code: An Interview with David Chisnall
Jun 5, 2009
Advanced Flow Control for Objective-C
Jun 5, 2009
Erica Sadun on the iPhone SDK, OS X, and the Computing Landscape
Jun 5, 2009
From NeXTSTEP to Cocoa: Erik Buck on the Development of Cocoa and Objective-C
Jun 5, 2009
Fun with the Objective-C Runtime
Jun 5, 2009
Marcus Zarra and Matt Long on Core Animation
Jun 5, 2009
Steve Kochan on the Evolution of Objective-C
Jun 5, 2009
The Technology NeXT Gave the World
Jun 5, 2009
Where the Web and the Desktop Meet: An Interview with Lee Barney
Jun 5, 2009
Pandora: An Open Console
Jun 2, 2009
The Future of Wireless Networking
May 15, 2009
GNU or Linux?
May 11, 2009
Debugging C-Family Languages
Mar 27, 2009
How Small Is Your PC? The Rise of Netbooks and Other Small Form-Factor PCs
Mar 23, 2009
David Chisnall's CPU Feature Wishlist
Mar 13, 2009
The Dynamic Languages Renaissance
Jan 30, 2009
Robert Seacord on the CERT C Secure Coding Standard
Dec 15, 2008
Objective-C for C++ Programmers, Part 3
Nov 21, 2008
Objective-C for C++ Programmers, Part 2
Nov 14, 2008
Objective-C for C++ Programmers, Part 1
Nov 7, 2008
Writing Insecure C, Part 3
Oct 24, 2008
Writing Insecure C, Part 2
Oct 17, 2008
Writing Insecure C, Part 1
Oct 10, 2008
iRex iLiad e-Reader: Linux's Answer to the Kindle?
Aug 29, 2008
How It Works: Filesystems
Jun 13, 2008
How the LLVM Compiler Infrastructure Works
May 23, 2008
How It Works: Virtual Memory
May 21, 2008
What Is C For?
May 16, 2008
The Future of eBooks
Apr 25, 2008
Imagining an Open Network
Apr 18, 2008
Understanding How Xen Approaches Device Drivers
Mar 21, 2008
Examining the Legendary HURD Kernel
Mar 14, 2008
Competition Among Open Source Compilers
Feb 1, 2008
Inside Your OS: What is a Process Scheduler, and How Does it Work?
Jan 25, 2008
Bad UI of the Week: Read This (OK/Cancel)
Jan 18, 2008
The End of the Desktop Era
Jan 11, 2008
The What and Why of Open IM
Dec 28, 2007
A Look at the Modern X Server
Dec 21, 2007
The Future of Digital Media
Dec 14, 2007
The Future of Identity
Dec 7, 2007
Bad UI of the Week: Ask Forgiveness, Not Permission
Nov 21, 2007
Copyright Versus Free Software
Nov 16, 2007
Is Computer Science Dying?
Nov 9, 2007
A Brief History of Programming, Part 2
Nov 2, 2007
A Brief History of Programming, Part 1
Oct 26, 2007
The 700MHz Question: Will the Wireless Spectrum Auction Lead to Innovation or More of the Same?
Sep 28, 2007
Bad UI of the Week: The Menu Bar
Aug 24, 2007
The Dark Corners of x86
Aug 17, 2007
Bad UI of the Week: The Cross-Platform User Interface
Aug 17, 2007
Bad UI of the Week: The Mythical "is Like" Operator
Aug 10, 2007
Bad UI of the Week: Don't Make Me Tell You Twice...
Aug 3, 2007
Bad UI of the Week: Kettles and Washing Machines
Jul 27, 2007
The BBC iPlayer Controversy Explained
Jul 20, 2007
Bad UI of the Week: The Mitten Mouse
Jul 20, 2007
Bad User Interface of the Week: File It Under “Bad”
Jul 13, 2007
Bad User Interface of the Week: The DVD
Jul 6, 2007
A Roundup of Free Operating Systems
Jun 22, 2007
DragonFly BSD: UNIX for Clusters?
Jun 15, 2007
CPU Wars, Part 3: Put Your Left ARM In
May 18, 2007
CPU Wars, Part 2: POWER to the People
May 11, 2007
CPU Wars, Part 1: When the Chips Are Down
May 4, 2007
ZFS Uncovered
Apr 6, 2007
Vector Programming with GCC
Mar 30, 2007
Free Software Versus Open Source Software
Mar 16, 2007
What Programming Languages Should You Know?
Mar 9, 2007
Standardizing UNIX
Feb 2, 2007
Prolog: Logic Programming for Rapid Development
Jan 26, 2007
POSIX Parallel Programming, Part 3: Threads
Jan 19, 2007
POSIX Parallel Programming, Part 2: Message Passing
Jan 12, 2007
POSIX Parallel Programming, Part 1
Jan 5, 2007
The Nokia 770 Revisited
Dec 29, 2006
The Open Source Desktop Myth
Dec 22, 2006
Separating Style and Content: LaTeX and Typesetting
Dec 1, 2006
GNUstep: A Free Software alternative to OpenStep
Nov 10, 2006
Behind the Scenes of Objective-C 2.0
Nov 3, 2006
The Future of CPUs: What's After Multi-Core?
Oct 27, 2006
What Makes a Good Programming Language?
Oct 20, 2006
Emulation: Role-Playing for Computers
Oct 13, 2006
NetBSD: Not Just for Toasters
Oct 6, 2006
POSIX Asynchronous I/O
Sep 22, 2006
Breaking Down GPL Version 3
Aug 18, 2006
The Role of Binary Drivers in a Free OS
Aug 4, 2006
Security Is a UI Problem
Jul 28, 2006
Debunking the Myth of High-level Languages
Jul 14, 2006
A Taste of Erlang, a Dynamic, Asynchronous Message-Passing Language
Jun 30, 2006
Alternatives to LAMP
Jun 2, 2006
BSD Packaging Systems
May 26, 2006
DRM: Digital Rights or Digital Restrictions?
May 4, 2006
Introducing OpenBSD 3.9
Apr 28, 2006
The Need for Virtualization and Xen
Mar 31, 2006
Making Effective Software TCO Calculations
Mar 24, 2006
10 Things I Hate About U(NIX) Revisited: Readers Speak
Mar 17, 2006
Comparing Open Source Licenses: GPL vs. BSDL
Feb 3, 2006
BSD: The Other Free UNIX Family
Jan 20, 2006
Measuring the Effectiveness of Application Security Policies
Jan 13, 2006
The Cost of Free Software
Dec 9, 2005
Nokia 770 Internet Tablet Week-long Test Drive
Nov 18, 2005
10 Things I Hate About (U)NIX
Nov 4, 2005
The Lure of Open Source Software: Why Consider It for Your Business?
Oct 14, 2005

Sorry, this author hasn't posted any blogs.

David Chisnall takes a look at one of the least glamorous parts of an operating system - the filesystem - to see what it needs to do, and how a good or bad design can affect user experience.

When you use a computer to store a document, you typically put it in a file. This isn’t a very good abstraction, but it’s familiar. You give this file a name (and maybe some other metadata) and put it into a directory. You might set some permissions on it, too.

Your operating system sees a disk as a block device, a linear array of 512-byte blocks that must be read or written in one go. Unless you’re a FORTH programmer, this doesn’t look at all like what you expect from persistent storage. The filesystem sits between you and the disk and translates between the two models.

Storing Files

In the UNIX sense of the word, a file is an array of bytes. For most filesystems, it’s an array of disk blocks with some associated metadata. The main job of any filesystem is finding which blocks belong to a given file and which belong to no files (and so can be used for new files or appended to an existing file).

MS-DOS used a filesystem called FAT, for the file allocation tables that were at its heart. In spite of the many serious design flaws with FAT, it has steadfastly refused to die. The file allocation tables themselves are a simple array containing the index of the next block in the file. Once you have the first block for any file, you find the next one by looking at the index in the FAT.

One obvious flaw in this approach is that it gives O(n) behavior when seeking to a specific location in a file. If you have 4KB clusters on disk, and want to seek to the end of a 200MB file, you’ll need more than 50,000 lookups in the FAT. If every one of these lookups requires a disk access, this process will take around seven minutes. Even with one disk access per thousand lookups, ait can take half a second. For this reason, good performance requires keeping the FAT in memory.

In the UNIX world, things are done somewhat differently. Every file has a master inode. This is a data structure containing a load of metadata such as the permissions, creation time, and so on, and a short list of blocks. For small files, this is all that’s needed. For larger files, there’s an overflow capability in which a second disk block is used just to store a list of blocks. For really huge files, two or three layers of indirection can be used, so the inode contains an address of a block containing addresses of blocks containing addresses of blocks containing the file. At most, three disk reads are needed to find any address in a file stored like this.

The BeOS file system (BFS) used a slight variation of this system. Instead of storing a set of blocks, it stored a set of block ranges. If the file was stored contiguously on disk, only a single range was required, no matter how big the file became. In the worst case, this technique took twice as much space as storing individual blocks, since two integers had to be stored for a single block.

The inverse part of this is storing the free space. With FAT, files and free space were stored in the same way—free blocks were flagged in the FAT with a special identifier. Finding a free block meant performing a linear search through the table until this identifier was encountered. Since there was no efficient way of finding a group of blocks of a specific size, this led to the problem of fragmentation, the word used to describe what happens when files are not stored contiguously on disk, but rather are stored in fragments. Fragmentation is a problem for mechanical disks, since seeking is quite slow. A mechanical disk might have a seek time of 5 ms. If you need to seek every 512 bytes you read, even if reading an individual block takes no time, the maximum transfer rate you can get is 100 KB/s. In linear reads, the same disk can probably get around 50 MB/s. For this reason, fragmentation is a major problem for filesystems.

While the design of FAT seems silly here, it’s worth noting that it predates MS-DOS and actually was designed for Microsoft BASIC. For storing BASIC programs (rarely more than a few KB each) on low-density floppy disks, FAT isn’t such a bad design.

Storing free space efficiently is a difficult problem, because you need to be able to update the free space map quickly and also find a block of space of a specific efficiently. ZFS has quite a neat solution to this puzzle. A ZFS storage volume is split into chunks, within which free space is tracked independently. Whenever a space is allocated or freed, the range of space and the operation are written to a log. This is a very fast operation, since it’s just writing three values to the disk. Periodically, the log is read, and an AVL tree (a self-balancing tree structure) is created in memory. The tree can be searched easily for free space of a specific size, and can be written out to disk in the same form as the log, with all of the duplicate entries removed. For example, if you allocate block 12 and then free block 12, this pair of operations will cancel each other out in the tree and not be recorded when the tree is converted to log form.

  • Share ThisShare This
  • Your Account

Discussions

Make a New Comment

You must log in in order to post a comment.

Related Resources

Dustin SullivanIf You Are New to Mac/Objective-C Programming...
By Dustin Sullivan on June 5, 2009 No Comments

We recently sat down with several top Objective-C and Cocoa developers to talk about that state of the iPhone and OS X markets as we approach this year's WWDC.  As we were wrapping up, we threw one last question at them out of curiosity, and we thought you'd like to see what some of them said.

It's Here; Put Away Your Pre-Conceptions on What an OS Must Be: Part V
By John Traenkenschuh on May 27, 2009 No Comments

It's been a long while since you had a chance to be excited about a new version of an 'old' OS.  Now is your chance.

It's Here; Put Away Your Pre-Conceptions on What an OS Must Be: Part IV
By John Traenkenschuh on May 27, 20095 Comments

Graphical User Interfaces were important.  So was cost control.  Just what must an OS be?

See All Related Blogs

There are currently no related articles. Please check back later.

Informit Network