5.4 Causes of Read/Write File System Latency
There are several causes of latency in the file system read/write data path. The simplest is the latency incurred by waiting for physical I/O at the backend of the file system. File systems, however, rarely pass logical requests straight through to the backend, so latency can be incurred in several other ways. For example, one logical I/O event can be fractured into two physical I/O events, resulting in the latency penalty of two disk operations. Figure 5.3 shows the layers that could contribute latency.
Figure 5.3 Layers for Observing File System I/O
Common sources of latency in the file system stack include:
- Disk I/O wait (or network/filer latency for NFS)
- Block or metadata cache misses
- I/O breakup (logical I/Os being fractured into multiple physical I/Os)
- Locking in the file system
- Metadata updates
5.4.1 Disk I/O Wait
Disk I/O wait is the most commonly assumed type of latency problem. If the underlying storage is in the synchronous path of a file system operation, then it affects file-system-level latency. For each logical operation, there could be zero (a hit in the block cache), one, or even multiple physical operations.
This iowait.d script uses the file name and device arguments in the I/O provider to show us the total latency accumulation for physical I/O operations and the breakdown for each file that initiated the I/O. See Chapter 4 for further information on the I/O provider and Section 10.6.1 for information on its arguments.
# ./iowait.d 639
^C
Time breakdown (milliseconds):
  <on cpu>                  2478
  <I/O wait>                6326

I/O wait breakdown (milliseconds):
  file1                      236
  file2                      241
  file4                      244
  file3                      264
  file5                      277
  file7                      330
...
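To put the two totals in proportion, the arithmetic can be sketched as follows. This is a minimal illustration that only reuses the on-CPU and I/O wait figures from the output above; the helper name `share` is hypothetical and is not part of iowait.d.

```python
# Totals copied from the iowait.d output above (milliseconds).
on_cpu_ms = 2478
io_wait_ms = 6326


def share(part_ms, total_ms):
    """Return part_ms as a percentage of total_ms, rounded to one decimal."""
    return round(100.0 * part_ms / total_ms, 1)


# Time accounted for by the script: on-CPU time plus I/O wait time.
total_ms = on_cpu_ms + io_wait_ms
io_wait_pct = share(io_wait_ms, total_ms)
on_cpu_pct = share(on_cpu_ms, total_ms)

print(f"accounted time: {total_ms} ms")
print(f"I/O wait: {io_wait_pct}%  on CPU: {on_cpu_pct}%")
```

In this sample, the process spent well over half of the accounted time waiting for physical I/O, which is the signal that makes the per-file breakdown worth reading.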
5.4.2 Block or Metadata Cache Misses
Have you ever heard the saying "the best I/O is the one you avoid"? The file system tries to cache as much as possible in RAM to avoid going to disk for repeated accesses. As discussed in Section 5.6, there are multiple caches in the file system. The most obvious is the data block cache; others include the metadata, inode, and file name caches. A miss in any of these caches can turn a logical operation into one or more physical I/Os.
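The effect of cache misses on average latency can be sketched with a simple weighted average. This is an illustrative model, not a measurement: the function name `average_latency_ms` and the hit and miss costs used below are assumptions chosen only to show the shape of the relationship.

```python
def average_latency_ms(hit_ratio, hit_ms, miss_ms):
    """Expected latency of one logical read for a given cache hit ratio."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


# Hypothetical costs: 0.01 ms for a cache hit, 10 ms for a read from disk.
HIT_MS = 0.01
MISS_MS = 10.0

for ratio in (0.99, 0.90, 0.50):
    avg = average_latency_ms(ratio, HIT_MS, MISS_MS)
    print(f"hit ratio {ratio:.2f}: about {avg:.2f} ms per logical read")
```

Because a miss costs far more than a hit under these assumptions, even a small drop in hit ratio dominates the average.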
5.4.3 I/O Breakup
I/O breakup occurs when a single logical I/O is fractured into multiple physical I/Os. This is a common file-system-level issue: each additional physical I/O adds its own latency to the logical operation, thereby compounding latency.
Output from running the following DTrace script shows VOP-level and physical I/Os for a file system. In this example, we show the output from a single read(). Note the many physical I/Os for the single 1-Mbyte POSIX-level read(): it is broken into 4-Kbyte, 8-Kbyte, 48-Kbyte, and 56-Kbyte physical I/Os, most of them 56 Kbytes. This is likely due to the file system maximum cluster size (maxcontig).
# ./fsrw.d
Event      Device RW     Size Offset Path
sc-read         .  R  1048576      0 /var/sadm/install/contents
fop_read        .  R  1048576      0 /var/sadm/install/contents
disk_ra     cmdk0  R     4096     72 /var/sadm/install/contents
disk_ra     cmdk0  R     8192     96 <none>
disk_ra     cmdk0  R    57344     96 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    152 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    208 /var/sadm/install/contents
disk_ra     cmdk0  R    49152    264 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    312 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    368 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    424 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    480 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    536 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    592 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    648 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    704 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    760 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    816 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    872 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    928 /var/sadm/install/contents
disk_ra     cmdk0  R    57344    984 /var/sadm/install/contents
disk_ra     cmdk0  R    57344   1040 /var/sadm/install/contents
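Output like this can be tallied to see how one logical read fans out into physical I/Os. The following is a minimal sketch, assuming the whitespace-separated column order shown above (Event, Device, RW, Size, Offset, Path) and treating any event whose name starts with `disk_` as a physical I/O, which is an assumption based on the disk_ra rows. The `summarize` helper is hypothetical, and the sample holds only the first few rows of the listing.

```python
from collections import Counter

# The first rows of the fsrw.d output above; the full listing is longer.
SAMPLE = """\
sc-read . R 1048576 0 /var/sadm/install/contents
fop_read . R 1048576 0 /var/sadm/install/contents
disk_ra cmdk0 R 4096 72 /var/sadm/install/contents
disk_ra cmdk0 R 8192 96 <none>
disk_ra cmdk0 R 57344 96 /var/sadm/install/contents
disk_ra cmdk0 R 57344 152 /var/sadm/install/contents
"""


def summarize(text):
    """Count physical I/Os by size in Kbytes and total their bytes."""
    sizes_kb = Counter()
    total_bytes = 0
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, headers, and logical (sc-*/fop_*) events.
        if len(fields) < 6 or not fields[0].startswith("disk_"):
            continue
        size = int(fields[3])
        sizes_kb[size // 1024] += 1
        total_bytes += size
    return sizes_kb, total_bytes


sizes_kb, physical_bytes = summarize(SAMPLE)
for kb, count in sorted(sizes_kb.items()):
    print(f"{kb:3d}-Kbyte physical I/Os: {count}")
print(f"physical bytes in sample: {physical_bytes}")
```

Feeding the complete listing to the same function gives the full distribution of physical I/O sizes for the read.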
5.4.4 Locking in the File System
File systems use locks to serialize access within a file (we call these explicit locks) or within critical internal file system structures (implicit locks).
Explicit locks are often used to implement POSIX-level read/write ordering within a file. POSIX requires that writes be committed to a file in the order in which they are written and that reads be consistent with the data within the order of any writes. As a simple and cheap solution, many file systems implement a per-file reader-writer lock to provide this level of synchronization. Unfortunately, this solution has the unwanted side effect of serializing all accesses within a file, even if they are to non-overlapping regions. The reader-writer lock typically becomes a significant performance overhead when the writes are synchronous (issued with O_DSYNC or O_SYNC), since the writer lock is held for the entire duration of the physical I/O (typically on the order of 10 or more milliseconds), blocking all other reads and writes to the same file.
The POSIX lock is the most significant file system performance issue for databases because they typically use a few large files with hundreds of threads accessing them. If the POSIX lock is in effect, then I/O is serialized, effectively limiting the I/O throughput to that of a single disk. For example, assume a file system backed by 10 disks, each capable of 100 I/Os per second, and a database issuing writes that each hold the file lock for 10 ms. The maximum I/O rate is then around 100 I/Os per second, even though the 10 disks together are capable of 1000 I/Os per second.
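The arithmetic behind this example can be written out directly. This is a minimal sketch using the figures from the paragraph above (a 10 ms lock hold time, 10 disks, 100 I/Os per second per disk); the function names are hypothetical.

```python
def lock_limited_iops(lock_hold_ms):
    """Upper bound on I/Os per second when each I/O holds the file lock."""
    # Only one I/O can hold the per-file lock at a time, so at most
    # 1000 ms / lock_hold_ms I/Os can complete each second.
    return 1000.0 / lock_hold_ms


def aggregate_disk_iops(num_disks, iops_per_disk):
    """I/Os per second the disks could deliver if used in parallel."""
    return num_disks * iops_per_disk


serialized = lock_limited_iops(10)
available = aggregate_disk_iops(10, 100)
unused_fraction = 1.0 - serialized / available

print(f"serialized by the file lock: {serialized:.0f} I/Os per second")
print(f"available from the disks:    {available} I/Os per second")
print(f"capacity left unused:        {unused_fraction:.0%}")
```

The lock-limited rate does not depend on the number of disks, which is why adding spindles does not help while the per-file lock is in effect.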
Most file systems using the standard file system page cache (see Section 14.7 in Solaris™ Internals) have this limitation. UFS when used with Direct I/O (see Section 5.6.2) relaxes the per-file reader-writer lock and can be used as a high-performance, uncached file system, suitable for applications such as databases that do their own caching.
5.4.5 Metadata Updates
File system metadata updates are a significant source of latency because many implementations update the on-disk structures synchronously to maintain their integrity. There are logical metadata updates (file creates, deletes, and so on) and physical metadata updates (updating a block map, for example).
Many file systems perform several synchronous I/Os per metadata update, which limits metadata performance. As a result, operations such as creating, renaming, and deleting files often exhibit higher latency than reads or writes. Extending a file is also affected, because it can require a physical metadata update.