Overview of Linux Journaling Filesystems
In This Chapter
Finding, Checking, and Mounting Filesystems
Introduction to Journaling Filesystems
Advantages of Journaling Filesystems
Chapter 1, "Introduction to Filesystems," introduced "journaling filesystem" as the general term for a specific type of local filesystem that helps reduce system restart times. Journaling filesystems do this by reducing the amount of information that a system has to examine to verify the integrity of a filesystem.
This chapter begins by providing a detailed overview of how Linux identifies filesystems, verifies that they are usable, and makes them available to users. This discussion builds on the basics of standard Linux/Unix filesystem organization explained in Chapter 1. Understanding how Linux makes filesystems available to users and what's involved in verifying the structure of a "standard" Linux filesystem (ext2fs) provides a firm foundation for discussing journaling filesystems. The remainder of the chapter explains the differences between journaling and non-journaling local filesystems, shows how journaling filesystems track filesystem changes, and highlights the major reasons why these types of filesystems are becoming more common on today's computer systems.
Finding, Checking, and Mounting Filesystems
To explain many of the performance and design advantages of different types of filesystems, it's useful to first understand how a Linux system identifies the filesystems available to it; determines their types; and mounts, supports, and uses them. This section provides a general overview of these topics, which are relevant to any type of filesystem regardless of whether it is local or available over a network.
As discussed in Chapter 1, filesystems are the mechanism for successfully storing and retrieving data on a computer system. The data structures that define the organization of a filesystem must be correct when a filesystem is being used. To users, filesystems are hierarchical collections of files and directories. To Linux and other Unix systems, filesystems consist of many inodes that contain information about files and directories (known as filesystem metadata, data about data) and the data blocks that actually contain the directory entries and file data. It's easy to see the confusion that could arise if multiple inodes in a filesystem thought that some specific data block was a part of the file that they represented.
Suppose that you were editing a status report for your manager and I was working on a file containing my collection of ribald drinking songs. If the inode that identified the blocks in your status report and the one that pointed to the blocks in my drinking song archive each claimed that a specific data block belonged to its file, one of us would be surprised when we actually looked at our file.
Filesystems whose internal data structures are correct are referred to as being consistent. It is always the responsibility of the system that hosts a filesystem (that is, on which the filesystem is physically stored) to verify the consistency of that filesystem before making it available to the operating system and to users. This is true regardless of whether the filesystem is a standard local filesystem, a journaling filesystem, or a networked filesystem. In the case of networked filesystems, the server that exports the networked filesystem and manages the physical media on which it is stored must verify its consistency before making it available over the network.
The primary characteristics of consistent filesystems are the following:
A bit in the filesystem's superblock is set to indicate that the filesystem was successfully unmounted when the system was last shut down.
All the filesystem metadata is correct.
Verifying the consistency of a filesystem would be fast if those two points could be verified quickly. Unfortunately, verifying that filesystem metadata is correct actually involves checking a number of different points:
Each allocation unit (whether it is a block or an extent) belongs only to a single file or directory, or is marked as being unused. The list of which blocks are allocated and unused (free) in a filesystem is usually stored in a bitmap for that filesystem, where each bit represents a specific data block. Filesystems that allocate and manage extents rather than just blocks also maintain information about free extents and their size and range.
No file or directory contains a data block marked as being unused in the filesystem bitmap.
Each file or directory in the filesystem is referenced in some other directory in that filesystem. From a user's point of view, this means that there is a directory path to each file or directory in the filesystem.
Each file has only as many parent directories as the reference count in its inode indicates. Although each file exists only in a single physical location on the disk, multiple directories can contain references to the inode that holds information about this file. These references are known as hard links. The file can therefore be accessed through any of these directories, and deleting it from any of these directories decrements the link count. A file is actually deleted only when its link count is 0; in other words, when it is no longer referenced by any directory.
Verifying all these relationships may take a while if it's necessary to manually check each of them. Whether this consistency check is necessary is the fundamental difference between journaling and non-journaling filesystems.
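To make these checks concrete, here is a minimal sketch in Python of what verifying them means on a toy, in-memory model of a filesystem. The data layout (a dictionary of inodes, a list standing in for the free-block bitmap, and a dictionary of directory entries) and the function name check_metadata are assumptions made purely for illustration; the model tracks only file inodes and ignores the blocks used by the directories themselves, and it is not how any real filesystem checker stores its data.

```python
from collections import Counter


def check_metadata(inodes, free_bitmap, directories):
    """Return a list of problems found in a toy filesystem model.

    inodes:      {inode_number: {"blocks": [block numbers], "links": int}}
    free_bitmap: list of booleans, True where a block is marked unused
    directories: {directory_path: {entry_name: inode_number}}
    """
    problems = []

    # Each block may belong to only one inode, and an allocated block
    # must not be marked as unused in the bitmap.
    owners = {}
    for ino, meta in sorted(inodes.items()):
        for blk in meta["blocks"]:
            if blk in owners:
                problems.append(
                    f"block {blk} claimed by inodes {owners[blk]} and {ino}")
            else:
                owners[blk] = ino
            if free_bitmap[blk]:
                problems.append(f"block {blk} of inode {ino} is marked unused")

    # Every inode must be referenced by some directory, exactly as many
    # times as its link count says.
    refs = Counter(target
                   for entries in directories.values()
                   for target in entries.values())
    for ino, meta in sorted(inodes.items()):
        if refs[ino] == 0:
            problems.append(f"inode {ino} is not referenced by any directory")
        elif refs[ino] != meta["links"]:
            problems.append(
                f"inode {ino} has link count {meta['links']} "
                f"but {refs[ino]} directory references")
    return problems


# Hypothetical example: the status report and the song archive both claim
# block 6, block 8 is in use but marked free, and inode 15 is orphaned.
inodes = {
    12: {"blocks": [5, 6], "links": 1},
    13: {"blocks": [6, 7], "links": 1},
    14: {"blocks": [8], "links": 2},
    15: {"blocks": [9], "links": 1},
}
free_bitmap = [False] * 10
free_bitmap[8] = True
directories = {
    "/": {"report": 12, "songs": 13, "notes": 14},
    "/tmp": {"notes-link": 14},
}

for problem in check_metadata(inodes, free_bitmap, directories):
    print(problem)
```

Even in this tiny model, every inode and every block reference has to be visited, which hints at why the same work on a real disk takes so long.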
Locating and Identifying Filesystems
When your system boots, the boot block on your primary disk identifies the root filesystem and the location of the kernel to boot. As discussed in the previous section, when your system boots, it needs to verify the consistency of each of its local filesystems. The root filesystem initially is mounted read-only for standard processes so that its consistency can be verified. After this is done, it is remounted in read-write mode, and your system verifies the existence and consistency of any other filesystems that it will be using. The list of filesystems available to your system is contained in the file /etc/fstab.
Each line in /etc/fstab provides information about one of the filesystems that should be available to your system. A sample section of an /etc/fstab file looks like the following (it will be different on each computer system):
LABEL=/         /             ext2      defaults                    1 1
LABEL=/boot     /boot         ext2      defaults                    1 2
LABEL=/home     /home         ext2      exec,dev,suid,rw,usrquota   1 2
/dev/cdrom      /mnt/cdrom    iso9660   noauto,owner,ro             0 0
/dev/fd0        /mnt/floppy   auto      noauto,owner                0 0
/dev/hdb1       /opt          xfs       defaults                    1 2
/dev/hdb6       /opt2         reiserfs  exec,dev,suid,rw,notail     1 2
LABEL=/tmp      /tmp          ext2      defaults                    1 2
LABEL=/usr      /usr          ext2      defaults                    1 2
LABEL=/var      /var          ext2      defaults                    1 2
none            /proc         proc      defaults                    0 0
none            /dev/pts      devpts    gid=5,mode=620              0 0
/dev/hda7       swap          swap      defaults                    0 0
/dev/cdrom1     /mnt/cdrom1   iso9660   noauto,owner,kudzu,ro       0 0
/dev/cdrom2     /mnt/cdrom2   iso9660   noauto,owner,kudzu,ro       0 0
/dev/lvm/vol1   /books        reiserfs  exec,dev,suid,rw,notail     1 2
/dev/lvm/vol2   /proj         xfs       defaults                    1 2
The fields in each /etc/fstab entry (that is, each line) are the following:
The first field is the device or remote filesystem to be mounted. This is usually the Linux device file for the partition to be mounted but also can be an entry of the form hostname:directory for networked filesystems such as NFS. Ext2 filesystems also can be identified by the name that they were assigned in the filesystem volume label when the filesystem was created. For example, the entry LABEL=/ in the example /etc/fstab file could be replaced with /dev/hda5 because that is the disk partition where my root filesystem actually lives. However, using labels is more flexible than using specific partition device files because the device file associated with a specific partition may change if the disk containing that partition is moved to another system or if other disks are added to an existing system.
The second field is the directory on which the specified filesystem should be mounted. Note that virtual filesystems that are not stored on a device, such as the Linux /proc filesystem, are still mounted on a directory; as in the sample /etc/fstab file, it is their first field, not this one, that contains the entry none. Disk partitions that are formatted as swap space contain the entry swap in this field.
The third field identifies the type of filesystem. Common entries in this field are ext2 (the standard Linux local filesystem type), vfat (a Microsoft Windows partition), iso9660 (the standard CD-ROM filesystem), nfs (networked filesystems using Sun's NFS protocol), and swap (swap space). After reading this book, you will also want to use entries such as ext3 (the journaling equivalent of the ext2 filesystem), jfs (a journaling filesystem originally from IBM), reiserfs (the journaling filesystem built into the 2.4 or better Linux kernels shipped with most Linux distributions), and xfs (a journaling filesystem originally from Silicon Graphics). If a filesystem is not currently used but you want to keep an entry for it in /etc/fstab, you can put the word "ignore" in this field, and that filesystem will not be mounted, checked for consistency, and so on.
The types of filesystems that are compiled into your kernel are listed in the file /proc/filesystems, but this can be misleading. Your kernel usually also supports other types of filesystems, but as loadable modules rather than being hard wired into the kernel. For example, the /proc/filesystems file on my system contains the following entries:
Filesystems whose types are prefaced by a nodev entry are not associated with physical devices but are used internally by applications and the operating system.
Several fairly fundamental types of filesystems, such as iso9660 (CD-ROM) filesystems aren't listed. It would be highly unlikely that I wouldn't want support for CD-ROM filesystems, but because I use CDs with filesystems on them infrequently, I specified that they be supported as a loadable module when I configured and built the kernel for this system. This helps keep the kernel as small as possible without sacrificing performance. Devices supported through loadable modules work slightly slower than devices directly supported in the kernel, largely because of the overhead of locating, loading, and placing external calls to the module.
The fourth field contains a comma-separated list of any options to the mount command that should be used when the filesystem is mounted. Many mount options are filesystem-specific, but some common generic ones are the following:
async: Writes to the filesystem should be done asynchronously.
auto: The filesystem should be automatically mounted when detected or when a command such as mount -a is executed.
defaults: Use the default options: async, auto, dev, exec, nouser, rw, suid.
dev: Character and block special device files located on the filesystem are interpreted as devices.
exec: You can execute programs, scripts, or anything else whose permissions indicate that it is executable from that filesystem.
gid=value: Set the group ID of the mounted filesystem to the specified numeric group ID when the filesystem is mounted.
noauto: Don't automatically mount when a filesystem is detected or when the command mount -a is issued. Usually used with removable media such as floppies and CD-ROMs.
nouser: You must be root to mount the filesystem; the filesystem can't be mounted by any nonroot user.
owner: A nonroot user can mount the filesystem if that user owns the device file associated with it.
ro: Mount the filesystem read-only.
rw: Mount the filesystem read-write.
suid: Allow the set-user-ID and set-group-ID permission bits on programs in the filesystem to take effect, so that such a program runs with the user or group ID of its owner rather than that of the user who executes it. Be careful when using this option with imported filesystems that you don't actually administer, because running a program that sets the UID to root is a common way of hacking into a system.
uid=value: Set the user ID of the mounted filesystem to the specified numeric user ID when the filesystem is mounted.
For more information on generic options available to the mount command, see the man page for the mount command in section 8 of the online Linux manual. When discussing each of the filesystems covered in this book, I'll also explain any filesystem-specific mount options associated with that filesystem.
Like the fsck command discussed in the next section, the mount command executes filesystem-specific versions of mount whenever necessary. For example, when filesystems of types smb, smbfs, ncp, or ncpfs are mounted (which use filesystem adapters to access DOS Server Message Block and NetWare Core Protocol filesystems), the mount command attempts to execute files in /sbin with the names /sbin/mount.smb, /sbin/mount.smbfs, /sbin/mount.ncp, and /sbin/mount.ncpfs, respectively.
The fifth field is used by the dump command, a standard Linux/Unix filesystem backup command, to identify filesystems that should be backed up when the dump command is executed. If the fifth field contains a 0 (or is missing), the dump command assumes that the filesystem associated with that /etc/fstab entry does not need to be backed up.
The sixth field is used by the Linux/Unix filesystem consistency checker (discussed in the next section) to identify filesystems whose consistency should be verified when the system is rebooted, and the order in which the consistency of those filesystems should be checked. If the sixth field contains a 0 (or is missing), the fsck program assumes that the filesystem associated with that /etc/fstab entry does not need to be checked.
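The six fields described above can be pulled apart with a few lines of Python. The following is only a sketch: the names FstabEntry and parse_fstab_line are my own inventions for this example, the expansion of defaults simply follows the list of generic options given earlier, and the real mount and fsck programs do considerably more validation than this.

```python
from typing import NamedTuple, Optional

# What "defaults" stands for, per the list of generic mount options.
DEFAULTS = ["async", "auto", "dev", "exec", "nouser", "rw", "suid"]


class FstabEntry(NamedTuple):
    device: str       # field 1: device file, LABEL=name, or host:directory
    mountpoint: str   # field 2: directory on which to mount
    fstype: str       # field 3: filesystem type
    options: list     # field 4: mount options, with "defaults" expanded
    dump: int         # field 5: 0 means the dump command skips it
    passno: int       # field 6: 0 means fsck skips it


def parse_fstab_line(line: str) -> Optional[FstabEntry]:
    """Parse one /etc/fstab line; return None for blank or comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}")
    device, mountpoint, fstype, raw_options = fields[:4]
    options = []
    for opt in raw_options.split(","):
        options.extend(DEFAULTS if opt == "defaults" else [opt])
    # A missing fifth or sixth field is treated the same as 0.
    dump = int(fields[4]) if len(fields) > 4 else 0
    passno = int(fields[5]) if len(fields) > 5 else 0
    return FstabEntry(device, mountpoint, fstype, options, dump, passno)


entry = parse_fstab_line("LABEL=/home /home ext2 exec,dev,suid,rw,usrquota 1 2")
print(entry.mountpoint, entry.fstype, entry.options, entry.passno)
```

Returning a named tuple keeps the field order identical to the order in the file, which makes it easy to compare a parsed entry against the line it came from.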
Verifying Filesystem Consistency
Verifying that a standard filesystem is consistent requires a special utility that checks each filesystem to guarantee that all the items listed in the previous section are true. This is known as the fsck (filesystem consistency check) utility. (Truly ancient Unix folks like myself will fondly remember its conceptual parents, dcheck, icheck, and ncheck.) Each type of filesystem has its own version of fsck that understands the organization of a specific type of filesystem. On Linux systems, all these versions of fsck live in the directory /sbin, and typically have names of the form fsck.filesystem-type. For example, listing this directory on one of my Linux systems shows versions of fsck for ext2, ext3, minix, msdos, ReiserFS, and XFS filesystems:
[wvh@distfs /sbin]$ ls -al *fsck*
-rwxr-xr-x 2 root root  40316 Jul 12  2000 dosfsck
-rwxr-xr-x 3 root root 451240 Jun 24 04:54 e2fsck
-rwxr-xr-x 1 root root  16572 Jun 24 04:54 fsck
lrwxrwxrwx 3 root root 451240 Jul  1 19:48 fsck.ext2
lrwxrwxrwx 3 root root 451240 Jul  1 19:48 fsck.ext3
-rwxr-xr-x 1 root root  16380 Apr  8 10:12 fsck.minix
-rwxr-xr-x 2 root root  40316 Jul 12  2000 fsck.msdos
lrwxrwxrwx 1 root root     10 Jun 30 13:55 fsck.reiserfs -> reiserfsck
-rwxr-xr-x 1 root root   2408 Jun  1 21:47 fsck.xfs
-rwxr-xr-x 1 root root 221820 Mar  5 13:19 reiserfsck
When you restart a computer system, the fsck utility reads the list of filesystems that should be mounted and the type of each filesystem from the text file /etc/fstab, as explained in the previous section.
As mentioned in the previous section, the last field in each line of /etc/fstab indicates whether a filesystem should be checked for consistency and the order in which that consistency check should be done. A value of 0 in this field indicates that the consistency of that filesystem should not be checked. Other numeric values indicate which fsck "pass" the filesystems should be checked in. Because fsck can take a while on large or complex filesystems, multiple copies of fsck can run at the same time and can therefore check different filesystems in parallel. In the example /etc/fstab shown in the previous section, the root filesystem / would be checked first, and then the other filesystems on different disks would be checked in parallel. Filesystems on a single disk are checked sequentially; checking them in parallel would be slow and a waste of perfectly good disk head movement.
The fsck program then checks whether the filesystem's clean bit is set. If this bit is set, fsck does no further checking and proceeds to the next filesystem. If this bit is not set, fsck begins the potentially laborious process of verifying (and correcting) that filesystem by executing the specific version of fsck associated with that type of filesystem.
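The ordering just described can be sketched as a small planning function. This is an illustration only: the function name fsck_plan is hypothetical, the clean flag is supplied by hand here (a real fsck reads it from the filesystem's superblock), and the sketch ignores the rule about checking filesystems on the same disk sequentially.

```python
from itertools import groupby


def fsck_plan(filesystems):
    """Group filesystems by fsck pass number, in the order they are checked.

    filesystems: iterable of (device, fstype, passno, clean) tuples, where
    passno is the sixth /etc/fstab field and clean is an assumed stand-in
    for the clean bit stored in the superblock.
    """
    # A pass number of 0 means "never check this filesystem".
    to_check = sorted((fs for fs in filesystems if fs[2] > 0),
                      key=lambda fs: fs[2])
    plan = []
    for passno, group in groupby(to_check, key=lambda fs: fs[2]):
        steps = []
        for device, fstype, _, clean in group:
            if clean:
                steps.append((device, "skip (clean bit set)"))
            else:
                # Hand off to the filesystem-specific checker in /sbin.
                steps.append((device, f"/sbin/fsck.{fstype} {device}"))
        plan.append((passno, steps))
    return plan


# Devices and types borrowed from the sample /etc/fstab; the clean flags
# are made up for the example.
sample = [
    ("/dev/hda5", "ext2", 1, True),
    ("/dev/cdrom", "iso9660", 0, True),
    ("/dev/hdb1", "xfs", 2, False),
    ("/dev/hdb6", "reiserfs", 2, True),
]
for passno, steps in fsck_plan(sample):
    print("pass", passno)
    for device, action in steps:
        print("   ", device, "->", action)
```

The root filesystem ends up alone in pass 1, and everything with a 2 in the sixth field lands in the second group, where the checks could in principle run side by side.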
Verifying the Consistency of a Non-Journaling (ext2) Filesystem
On non-journaling filesystems, the fsck program has to actually check the integrity of and relationships between all the inodes and data blocks in the filesystem. This can take a substantial amount of time, especially for larger non-journaling filesystems. Eliminating this sort of delay when restarting a system is one of the primary motivations for adopting journaling filesystems. However, even more important than eliminating the time required to do this sort of exhaustive checking of a filesystem is eliminating the need for this sort of consistency checking. If you eliminate the need for this sort of checking, you get reduced startup time for free. More about this later.
This section discusses the version of fsck used to check and repair the consistency of ext2 filesystems. This is the program /sbin/e2fsck, usually executed by the /sbin/fsck wrapper program as fsck.ext2.
All versions of fsck work by making a number of different passes over the filesystem they are checking. Checking the consistency of a filesystem in multiple steps has some distinct advantages:
Enabling a single pass to focus on verifying or collecting a specific aspect of filesystem consistency. This simplifies the operation of the program (as well as simplifying debugging and development!).
Minimizing the resources necessary to perform any single consistency check. Memory allocation by fsck tends to grow during the first few passes, as additional information about the filesystem is collected.
Simplifying collecting the information necessary for full consistency checking. Each subsequent pass can capitalize on the information collected in the previous one.
The ext2 version of fsck is designed to minimize both head movement during disk reads and the number of times that data has to be read from the disk. The five passes of the ext2 version of fsck are the following:
Pass 1 verifies the consistency of all the inodes in the specified filesystem. This pass checks whether the mode of the file or directory associated with that inode is valid, all block references contain valid block numbers, the size and block count fields of the inode are correct, and no data block is associated with multiple inodes. During this pass, fsck gathers three types of information about the filesystem:
It builds bitmaps that identify the inodes in the filesystem that are in use, are directories, are regular files, and so on.
It builds bitmaps that identify the blocks in the filesystem that are in use and any that are claimed by multiple inodes.
It identifies the data blocks associated with each inode that represents a directory.
Pass 2 checks each directory in the filesystem, identifying them using the directory bitmap constructed in pass 1 (rather than having to reread the disk). For each directory, pass 2 verifies that
The length of the directory entry and file/directory name are both valid.
The inode number is greater than 1 and less than the total number of inodes in the filesystem.
The inode number refers to an inode actually in use (as determined by checking the bitmap of used inodes constructed in pass 1).
The first entry in the directory is ., and the inode associated with that entry is the inode of that directory.
The second entry in the directory is .. (the parent directory).
Pass 3 verifies the directory structure within the filesystem by marking the root inode for the filesystem as done, and then examining every other directory inode in the filesystem, using the information about its parent inode that was collected in pass 2 to walk up the filesystem from that inode until it reaches a directory inode that is marked done. If this is unsuccessful or a directory inode is visited twice when tracing up the filesystem, the fsck application disconnects the inode from the filesystem and connects it to the lost+found directory located at the top of each filesystem for this purpose.
Pass 4 checks the reference counts for all inodes in the filesystem, comparing the link counts calculated in pass 1 to values computed during passes 2 and 3. Any files that have a link count of zero are connected to the lost+found directory.
Pass 5 verifies the summary information about the filesystem contained in the superblock against information calculated during the previous passes and compares the block and inode bitmaps constructed in previous passes against those located in the filesystem header. If these differ, pass 5 overwrites the on-disk bitmaps with those constructed during this run of fsck.
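The directory walk performed in pass 3 is the easiest of these passes to picture in code. The sketch below is a simplified model rather than e2fsck's actual implementation: it assumes that the parent of every directory inode has already been collected into a dictionary (as pass 2 does), uses 2 as the root inode number, and simply reports which directories would have to be reconnected to lost+found. The function name find_disconnected is hypothetical.

```python
def find_disconnected(parents, root=2):
    """Return directory inodes that cannot be traced back to the root.

    parents maps each directory inode number to the inode number named
    by its '..' entry. The root directory is its own parent.
    """
    done = {root}      # directories known to be reachable from the root
    lost = []          # directories to reconnect to lost+found
    for start in sorted(parents):
        seen = []
        ino = start
        while ino not in done:
            seen.append(ino)
            parent = parents[ino]
            broken = parent not in parents   # '..' points at no known directory
            looped = parent in seen          # we have walked in a circle
            if parent not in done and (broken or looped):
                lost.append(ino)             # reattach this one to lost+found
                break
            ino = parent
        # Everything on the path now reaches the root, either directly or
        # through the directory that was just reconnected.
        done.update(seen)
    return lost


# Hypothetical example: 20 has a parent that does not exist, and 30 and 31
# name each other as parents.
parents = {2: 2, 11: 2, 12: 11, 20: 99, 21: 20, 30: 31, 31: 30}
print(find_disconnected(parents))
```

Marking every directory on a successful walk as done is what keeps this cheap: later walks stop as soon as they reach any directory that has already been verified.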
The easiest way to think of the bitmaps that fsck constructs is as arrays in which each bit is a boolean value associated with the inode or block whose number corresponds to that bit's offset into the bitmap.
If the bitmap that identifies blocks claimed by multiple inodes is not empty, fsck invokes three "sub-steps" of pass 1, known as passes 1B, 1C, and 1D. Pass 1B rescans the data blocks associated with each inode in the filesystem and builds a complete list of the blocks that are claimed by multiple inodes and the inodes that claim each of them. Pass 1C traverses the entire filesystem hierarchy to identify the path to each file or directory associated with an inode that claims a disk block also claimed by another. Pass 1D prompts the user as to how each duplication should be resolved: either by copying the duplicated block and giving each file or directory a copy, or by simply deleting the affected file or directory.
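To illustrate how such bitmaps are used, the following sketch builds an "in use" bitmap and a "claimed more than once" bitmap from a table of inodes, and then makes a second scan to find out who the claimants are, in the spirit of pass 1 and pass 1B. The Bitmap class, the function names, and the dictionary standing in for the on-disk inode table are all assumptions made for this example; e2fsck's own data structures are more elaborate.

```python
class Bitmap:
    """A fixed-size array of bits, one per inode or block number."""

    def __init__(self, size):
        self.size = size
        self.bits = bytearray((size + 7) // 8)

    def set(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)

    def test(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))


def scan_blocks(inode_blocks, nblocks):
    """First scan: record used blocks and blocks claimed more than once."""
    used = Bitmap(nblocks)
    dup = Bitmap(nblocks)
    for ino in sorted(inode_blocks):
        for blk in inode_blocks[ino]:
            if used.test(blk):
                dup.set(blk)      # somebody already owns this block
            else:
                used.set(blk)
    return used, dup


def claimants(inode_blocks, dup):
    """Second scan: list every inode that claims each duplicated block."""
    result = {}
    for ino in sorted(inode_blocks):
        for blk in inode_blocks[ino]:
            if dup.test(blk):
                result.setdefault(blk, []).append(ino)
    return result


# Hypothetical inode table: inodes 12 and 13 both claim block 6.
inode_blocks = {12: [5, 6], 13: [6, 7]}
used, dup = scan_blocks(inode_blocks, nblocks=16)
print(claimants(inode_blocks, dup))
```

The second scan is needed because the first one only learns that a block is duplicated when its second claimant turns up, by which time the identity of the first claimant has not been recorded anywhere.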
Note that all the information verified in pass 2 can be checked by examining either the information collected in pass 1 or an inode. Pass 2 therefore does not require any disk I/O other than reading the directory inodes in the filesystem. Pass 2 itself collects the inode number of the parent inode for each directory inode in the filesystem (the inode referenced by the .. entry) but does not do any verification of those values.
Depending on the option with which you've called fsck, it either corrects problems automatically, prompts whether it should correct problems that it has detected, or simply reports problems without doing anything about them. One of the more interesting aspects of fsck is how it handles files and directories that are detected but that are either not linked into the filesystem anywhere or located in a directory whose inode or indirect blocks were damaged so severely that they could not be corrected. Files and directories of this sort are moved to that filesystem's lost+found directory, which is a subdirectory of the root of all ufs, ext2, and standard Unix-like filesystems.
The lost+found directory is created by the mkfs command when you create these types of filesystems (it usually gets inode 11, though this isn't mandatory) and initially consists of an empty directory that has a relatively large number of preallocated directory entries (16384 for ext2 filesystems, by default). These directory slots are preallocated because of the chance that it may be necessary to link many files and subdirectories into this directory in the event of a major filesystem problem. It certainly would be "disappointing" to be running fsck, have it detect major filesystem corruption, find many disconnected files and directories, and have no free directory entries to which to link them. Directory entries are preallocated in lost+found directories because, in terms of anti-Darwinian survival traits, allocating additional indirect blocks to an existing directory when trying to resolve massive filesystem problems is right up there with adjusting the trigger mechanism on a shotgun while staring down the barrel.
If changes have been made to the root filesystem when it is checked by fsck, the system is automatically rebooted at this point. If changes were made to any other filesystem, the filesystem is simply marked as clean and can thus be mounted after the consistency of all filesystems has been checked and repaired.
Getting Information About Filesystems
If you're interested enough in filesystems to read this book, you're probably already familiar with the Linux and Unix commands used to retrieve information about the size and contents of a filesystem. However, just in case, this section provides a quick overview of the commands commonly used to retrieve this information and their most popular options.
Using the df Command
The df (disk free) command provides information about one or more mounted filesystems. By default, without any arguments, it displays the following information about all mounted filesystems:
The device associated with the filesystem
The total size of the filesystem in 1 KB blocks
The amount of space used on the filesystem
The amount of space on the filesystem still available to users
The percentage of the filesystem currently used
The directory on which the filesystem is mounted
When administering a Linux system, you may occasionally see the df command report that 100 percent of a filesystem is in use even though you can still read and write files on that filesystem. This is because most filesystems reserve a certain amount of space for use during crises such as when a filesystem fills up. Otherwise, any user who was editing a file when someone else accidentally filled up the filesystem could easily lose his work. The amount of space reserved for such unhappy occasions is set by the command used to create the filesystem. For example, ext2 filesystems reserve 5 percent of their space by default when they are created. (You can change this percentage using the mke2fs program's -m option.) For information about the amount of space reserved by other types of filesystems, see the man page for the application used to create filesystems of that type.
The following is sample output from running the df command on one of my systems:
[wvh@journal wvh]$ df
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/hda5             12389324   2373784   9386196  21% /
/dev/hda1                54416      4357     47250   9% /boot
/dev/hdc1              5119940     32840   5087100   1% /reiser_test
/dev/hdc2              5115336       144   5115192   1% /xfs_test
/dev/test/vol1         3145628     32840   3112788   2% /lvm_reiser_test
/dev/test/vol2         5238080       144   5237936   1% /lvm_xfs_test
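The reserved space discussed earlier is visible in this output if you do the arithmetic: on the ext2 filesystems, Used plus Available is less than the total. The short sketch below works this out for the numbers shown. The function name df_figures is my own, and the rounding rule (the percentage is taken against used plus available space and rounded up) is an assumption about how df arrives at its Use% column, though it does reproduce every value in the sample.

```python
def df_figures(total_kb, used_kb, avail_kb):
    """Return (reserved_kb, use_percent) for one line of df output."""
    # Whatever is neither used nor available to users is held in reserve.
    reserved_kb = total_kb - used_kb - avail_kb
    # Percentage of the space users can see, rounded up to a whole number.
    visible_kb = used_kb + avail_kb
    if visible_kb == 0:
        return reserved_kb, 0
    use_percent = (used_kb * 100 + visible_kb - 1) // visible_kb
    return reserved_kb, use_percent


# The ext2 root filesystem from the sample output.
reserved, percent = df_figures(12389324, 2373784, 9386196)
print(reserved, percent)   # prints: 629344 21

# The ReiserFS test filesystem reserves nothing.
print(df_figures(5119940, 32840, 5087100))   # prints: (0, 1)
```

For /dev/hda5, the 629344 KB held back is roughly 5 percent of the total, which matches the ext2 default. It also explains how a filesystem can show 100 percent in use while root can still write to it.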
When used on other types of Unix systems, the df command may report filesystem sizes in terms of different-sized blocks. For this reason, it's a good idea to get into the habit of executing the df command on any Unix system as "df -k", which forces the output to be given in terms of 1K blocks (even though this is the default on Linux systems). You don't want to find yourself accidentally miscalculating the amount of space remaining on a filesystem just because the df command for that version of Unix uses a different default block size.
Some other options often used with the df command are shown in Table 3.1.
Table 3.1 Options used with the df command
Option  Description
-h      Displays size information in human-readable form, such as 1.2G, 53M, 348K, and so on
-i      Displays size and usage information in terms of inodes rather than data blocks
-l      Limits the df command to only displaying information about local filesystems
-m      Displays size information in megabytes
These are my favorite df options or the ones that I've seen people using with some frequency. For complete information about all the options available to the df command, see the online manual page.
You can follow the df command and its options with the mountpoint of a specific filesystem if you want information only about that filesystem. You also can provide the name of the device on which a specific filesystem is located to get information about that filesystem regardless of whether it is mounted.
Using the du Command
The df command is primarily used by system administrators to verify that the amount of space remaining on various partitions is sufficient for the needs of their users. Both users and system administrators often use the related du (disk usage) command to find out the amount of space used by specific files and directories. Although you can obtain this information about files by simply using ls, the du command provides some convenient options for summarizing the disk usage associated with all the files and subdirectories of a specific directory. These options make it easy to identify users (or system directories) using an inappropriately large amount of disk space. You can largely eliminate the ability of users to use more disk space than they "should" by using quotas on the filesystems where users can create files and directories, but that's another topic that I will discuss later in this section.
Popular options for the du command are shown in Table 3.2.
Table 3.2 Options used with the du command
Option  Description
-c      Lists the disk usage of the specified files and directories and then displays a total of those values.
-h      Displays size in human-readable form such as 1.2G, 53M, 348K, and so on.
-k      Displays size information in terms of 1K blocks.
-l      Counts the size of each hard link to another file. For example, if a certain directory contains six hard links to a specific file, using the -l option counts the size of the file each time a link is encountered and adds that to the total disk space associated with a directory. By default, multiple hard links to a file are ignored in a disk usage summary, and the size of that file is only added to the disk usage total once.
-L      Follows symbolic links when calculating sizes. By default, the size of a symbolic link is the size of the link itself, which is essentially the length of the name of the file or directory to which the link points.
-S      Does not include the size of subdirectories. This option is useful to determine the amount of disk space consumed by all the files in the current directory only.
-s      Only prints a summary of the disk usage for files and directories specified as arguments to the du command. By default (with no arguments), the du -s command summarizes disk usage in the current directory.
For a complete list of all the options available for use with the du command (including some truly arcane ones!), see the online manual page.
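The way du treats hard links is easier to appreciate with a small experiment. The following sketch walks a directory tree and totals file sizes, counting a file with several hard links only once unless asked to do otherwise, much as du behaves without and with its -l option. It is a simplification and not a reimplementation of du: the function name disk_usage is hypothetical, and it adds up the apparent size of each file (st_size) rather than the disk blocks actually allocated to it.

```python
import os
import tempfile


def disk_usage(top, count_links=False):
    """Total the sizes of all files below top, in bytes.

    With count_links=False, a file reached through several hard links is
    counted once; with count_links=True, every link is counted.
    """
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            # lstat does not follow symbolic links, so a symbolic link
            # contributes only the size of the link itself.
            st = os.lstat(os.path.join(dirpath, name))
            if not count_links and st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)   # identifies the inode
                if key in seen:
                    continue
                seen.add(key)
            total += st.st_size
    return total


# Create a 1000-byte file with one extra hard link and compare the totals.
with tempfile.TemporaryDirectory() as top:
    original = os.path.join(top, "original")
    with open(original, "wb") as f:
        f.write(b"x" * 1000)
    os.link(original, os.path.join(top, "hardlink"))
    print(disk_usage(top), disk_usage(top, count_links=True))
```

The device and inode number pair is what identifies a file uniquely, which is why the sketch remembers that pair rather than any of the file's names.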