Virtual Memory in the IA-64 Linux Kernel
Linux processes execute in a virtual environment that makes it appear as if each process had the entire address space of the CPU available to itself. This virtual address space extends from address 0 all the way to the maximum address. On a 32-bit platform, such as IA-32, the maximum address is 232 - 1 or 0xffffffff. On a 64-bit platform, such as IA-64, this is 264 -1 or 0xffffffffffffffff.
While it is obviously convenient for a process to be able to access such a huge address space, there are really three distinct, but equally important, reasons for using virtual memory.
Resource virtualization. On a system with virtual memory, a process does not have to concern itself with the details of how much physical memory is available or which physical memory locations are already in use by some other process. In other words, virtual memory takes a limited physical resource (physical memory) and turns it into an infinite, or at least an abundant, resource (virtual memory).
Information isolation. Because each process runs in its own address space, it is not possible for one process to read data that belongs to another process. This improves security because it reduces the risk of one process being able to spy on another process and, e.g., steal a password.
Fault isolation. Processes with their own virtual address spaces cannot overwrite each other's memory. This greatly reduces the risk of a failure in one process triggering a failure in another process. That is, when a process crashes, the problem is generally limited to that process alone and does not cause the entire machine to go down.
In this chapter, we explore how the Linux kernel implements its virtual memory system and how it maps to the underlying hardware. This mapping is illustrated specifically for IA-64. The chapter is structured as follows. The first section provides an introduction to the virtual memory system of Linux and establishes the terminology used throughout the remainder of the chapter. The introduction is followed by a description of the software and hardware structures that form the virtual memory system. Specifically, the second section describes the Linux virtual address space and its representation in the kernel. The third section describes the Linux page tables, and the fourth section describes how Linux manages the translation lookaside buffer (TLB), which is a hardware structure used to accelerate virtual memory accesses. Once these fundamental structures are introduced, the chapter describes the operation of the virtual memory system. Section five explores the Linux page fault handler, which can be thought of as the engine driving the virtual memory system. Section six describes how memory coherency is maintained, that is, how Linux ensures that a process sees the correct values in the virtual memory locations it can access. Section seven discusses how Linux switches execution from one address space to another, which is a necessary step during a process context switch. The chapter concludes with section eight, which provides the rationale for some of the virtual memory choices that were made for the virtual memory system implemented on IA-64.
4.1 Introduction to the Virtual Memory System
The left half of Figure 4.1 illustrates the virtual address space as it might exist for a particular process on a 64-bit platform. As the figure shows, the virtual address space is divided into equal-sized pieces called virtual pages. Virtual pages have a fixed size that is an integer power of 2. For example, IA-32 uses a page size of 4 Kbytes. To maximize performance, IA-64 supports multiple page sizes and Linux can be configured to use a size of 4, 8, 16, or 64 Kbytes. In the figure, the 64-bit address space is divided into 16 pages, meaning that each virtual page would have a size of 264/16 = 260 bytes or 1024 Pbytes (1 Pbyte = 250 bytes). Such large pages are not realistic, but the alternative of drawing a figure with several billion pages of a more realistic size is, of course, not practical either. Thus, for this section, we continue to illustrate virtual memory with this huge page size. The figure also shows that virtual pages are numbered sequentially. We can calculate the virtual page number (VPN) from a virtual address by dividing it by the page size and taking the integer portion of the result. The remainder is called the page offset. For example, dividing virtual address 0x40000000000003f8 by the page size yields 4 and a remainder of 0x3f8. This address therefore maps to virtual page number 4 and page offset 0x3f8.
Let us now turn attention to the right half of Figure 4.1, which shows the physical address space. Just like the virtual space, it is divided into equal-sized pieces, but in physical memory, those pieces are called page frames. As with virtual pages, page frames also are numbered. We can calculate the page frame number (PFN) from a physical address by dividing it by the page frame size and taking the integer portion of the result. The remainder is the page frame offset. Normally, page frames have the same size as virtual pages. However, there are cases where it is beneficial to deviate from this rule. Sometimes it is useful to have virtual pages that are larger than a page frame. Such pages are known as superpages. Conversely, it is sometimes useful to divide a page frame into multiple, smaller virtual pages. Such pages are known as subpages. IA-64 is capable of supporting both, but Linux does not use them.
Figure 4.1. Virtual and physical address spaces.
While it is easiest to think of physical memory as occupying a single contiguous region in the physical address space, in reality it is not uncommon to encounter memory holes. Holes usually are caused by one of three entities: firmware, memory-mapped I/O devices, or unpopulated memory. All three cause portions of the physical address space to be unavailable for storing the content of virtual pages. As far as the kernel is concerned, these portions are holes in the physical memory. In the example in Figure 4.1, page frames 2 and 3 represent a hole. Note that even if just a single byte in a page frame is unusable, the entire frame must be marked as a hole.
4.1.1 Virtual-to-physical address translation
Processes are under the illusion of being able to store data to virtual memory and retrieve it later on as if it were stored in real memory. In reality, only physical memory can store data. Thus, each virtual page that is in use must be mapped to some page frame in physical memory. For example, in Figure 4.1, virtual pages 4, 5, 8, 9, and 11 are in use. The arrows indicate which page frame in physical memory they map to. The mapping between virtual pages and page frames is stored in a data structure called the page table. The page table for our example is shown on the left-hand side of Figure 4.2.
Figure 4.2. Virtual-to-physical address translation.
The Linux kernel is responsible for creating and maintaining page tables but employs the CPU's memory management unit (MMU) to translate the virtual memory accesses of a process into corresponding physical memory accesses. Specifically, when a process accesses a memory location at a particular virtual address, the MMU translates this address into the corresponding physical address, which it then uses to access the physical memory. This is illustrated in Figure 4.2 for the case in which the virtual address is 0x40000000000003f8. As the figure shows, the MMU extracts the VPN (4) from the virtual address and then searches the page table to find the matching PFN. In our case, the search stops at the first entry in the page table since it contains the desired VPN. The PFN associated with this entry is 7. The MMU then constructs the physical address by concatenating the PFN with the frame offset from the virtual address, which results in a physical address of 0x70000000000003f8.
4.1.2 Demand paging
The next question we need to address is how the page tables get created. Linux could create appropriate page-table entries whenever a range of virtual memory is allocated. However, this would be wasteful because most programs allocate much more virtual memory than they ever use at any given time. For example, the text segment of a program often includes large amounts of error handling code that is seldom executed. To avoid wasting memory on virtual pages that are never accessed, Linux uses a method called demand paging. With this method, the virtual address space starts out empty. This means that, logically, all virtual pages are marked in the page table as not present. When accessing a virtual page that is not present, the CPU generates a page fault. This fault is intercepted by the Linux kernel and causes the page fault handler to be activated. There, the kernel can allocate a new page frame, determine what the content of the accessed page should be (e.g., a new, zeroed page, a page loaded from the data section of a program, or a page loaded from the text segment of a program), load the page, and then update the page table to mark the page as present. Execution then resumes in the process with the instruction that caused the fault. Since the required page is now present, the instruction can now execute without causing a page fault.
4.1.3 Paging and swapping
So far, we assumed that physical memory is plentiful: Whenever we needed a page frame to back a virtual page, we assumed a free page frame was available. When a system has many processes or when some processes grow very large, the physical memory can easily fill up. So what is Linux supposed to do when a page frame is needed but physical memory is already full? The answer is that in this case, Linux picks a page frame that backs a virtual page that has not been accessed recently, writes it out to a special area on the disk called the swap space, and then reuses the page frame to back the new virtual page. The exact place to which the old page is written on the disk depends on what kind of swap space is in use. Linux can support multiple swap space areas, each of which can be either an entire disk partition or a specially formatted file on an existing filesystem (the former is generally more efficient and therefore preferable). Of course, the page table of the process from which Linux "stole" the page frame must be updated accordingly. Linux does this update by marking the page-table entry as not present. To keep track of where the old page has been saved, it also uses the entry to record the disk location of the page. In other words, a page-table entry that is present contains the page frame number of the physical page frame that backs the virtual page, whereas a page-table entry that is not present contains the disk location at which the content of the page can be found.
Because a page marked as not present cannot be accessed without first triggering a page fault, Linux can detect when the page is needed again. When this happens, Linux needs to again find an available page frame (which may cause another page to be paged out), read the page content back from swap space, and then update the page-table entry so that it is marked as present and maps to the newly allocated page frame. At this point, the process that attempted to access the paged-out page can be resumed, and, apart from a small delay, it will execute as if the page had been in memory at all along.
The technique of stealing a page from a process and writing it out to disk is called paging. A related technique is swapping. It is a more aggressive form of paging in the sense that it does not steal an individual page but steals all the pages of a process when memory is in short supply. Linux uses paging but not swapping. However, because both paging and swapping write pages to swap space, Linux kernel programmers often use the terms "swapping" and "paging" interchangeably. This is something to keep in mind when perusing the kernel source code.
From a correctness point of view, it does not matter which page is selected for page out, but from a performance perspective, the choice is critical. With a poor choice, Linux may end up paging out the page that is needed in the very next memory access. Given the large difference between disk access latency (on the order of several milliseconds) and memory access latency (on the order of tens of nanoseconds), making the right replacement choices can mean the difference between completing a task in a second or in almost three hours!
The algorithm that determines which page to evict from main memory is called the replacement policy. The provably optimal replacement policy (OPT) is to choose the page that will be accessed farthest in the future. Of course, in general it is impossible to know the future behavior of the processes, so OPT is of theoretical interest only. A replacement policy that often performs almost as well as OPT yet is realizable is the least recently used (LRU) policy. LRU looks into the past instead of the future and selects the page that has not been accessed for the longest period of time. Unfortunately, even though LRU could be implemented, it is still not practical because it would require updating a data structure (such as an LRU list) on every access to main memory. In practice, operating systems use approximations of the LRU policy instead, such as the clock replacement or not frequently used (NFU) policies [11, 69].
In Linux, the page replacement is complicated by the fact that the kernel can take up a variable amount of (nonpageable) memory. For example, file data is stored in the page cache, which can grow and shrink dynamically. When the kernel needs a new page frame, it often has two choices: It could take away a page from the kernel or it could steal a page from a process. In other words, the kernel needs not just a replacement policy but also a memory balancing policy that determines how much memory is used for kernel buffers and how much is used to back virtual pages. The combination of page replacement and memory balancing poses a difficult problem for which there is no perfect solution. Consequently, the Linux kernel uses a variety of heuristics that tend to work well in practice.
To implement these heuristics, the Linux kernel expects the platform-specific part of the kernel to maintain two extra bits in each page-table entry: the accessed bit and the dirty bit. The accessed bit is an indicator that tells the kernel whether the page was accessed (read, written, or executed) since the bit was last cleared. Similarly, the dirty bit is an indicator that tells whether the page has been modified since it was last paged in. Linux uses a kernel thread, called the kernel swap daemon kswapd, to periodically inspect these bits. After inspection, it clears the accessed bit. If kswapd detects that the kernel is starting to run low on memory, its starts to proactively page out memory that has not been used recently. If the dirty bit of a page is set, it needs to write the page to disk before the page frame can be freed. Because this is relatively costly, kswapd preferentially frees pages whose accessed and dirty bits are cleared to 0. By definition such pages were not accessed recently and do not have to be written back to disk before the page frame can be freed, so they can be reclaimed at very little cost.
In a multiuser and multitasking system such as Linux, multiple processes often execute the same program. For example, each user who is logged into the system at a minimum is running a command shell (e.g., the Bourne-Again shell, bash). Similarly, server processes such as the Apache web server often use multiple processes running the same program to better handle heavy loads. If we looked at the virtual space of each of those processes, we would find that they share many identical pages. Moreover, many of those pages are never modified during the lifetime of a process because they contain read-only data or the text segment of the program, which also does not change during the course of execution. Clearly, it would make a lot of sense to exploit this commonality and use only one page frame for each virtual page with identical content.
With N processes running the same program, sharing identical pages can reduce physical memory consumption by up to a factor of N. In reality, the savings are usually not quite so dramatic, because each process tends to require a few private pages. A more realistic example is illustrated in Figure 4.3: Page frames 0, 1, and 5 are used to back virtual pages 1, 2, and 3 in the two processes called bash 1 and bash 2. Note that a total of nine virtual pages are in use, but thanks to page sharing, only six page frames are needed.
Figure 4.3. Two processes sharing the text segment (virtual pages 1 to 3).
Of course, page sharing cannot be done safely unless we can guarantee that none of the shared pages are modified. Otherwise, the changes of one process would be visible in all the other processes and that could lead to unpredictable program behavior.
This is where the page permission bits come into play. The Linux kernel expects the platform-specific part of the kernel to maintain three such bits per page-table entry. They are called the R, W, and X permission bits and respectively control whether read, write, or execute accesses to the page are permitted. If an access that is not permitted is attempted, a page protection violation fault is raised. When this happens, the kernel responds by sending a segmentation violation signal (SIGSEGV) to the process.
The page permission bits enable the safe sharing of page frames. All the Linux kernel has to do is ensure that all virtual pages that refer to a shared page frame have the W permission bit turned off. That way, if a process attempted to modify a shared page, it would receive a segmentation violation signal before it could do any harm.
The most obvious place where page frame sharing can be used effectively is in the text segment of a program: By definition, this segment can be executed and read, but it is never written to. In other words, the text segment pages of all processes running the same program can be shared. The same applies to read-only data pages.
Linux takes page sharing one step further. When a process forks a copy of itself, the kernel disables write access to all virtual pages and sets up the page tables such that the parent and the child process share all page frames. In addition, it marks the pages that were writable before as copy-on-write(COW). If the parent or the child process attempts to write to a copy-on-write page, a protection violation fault occurs. When this happens, instead of sending a segmentation violation signal, the kernel first makes a private copy of the virtual page and then turns the write permission bit for that page back on. At this point, execution can return to the faulting process. Because the page is now writable, the faulting instruction can finish execution without causing a fault again. The copy-on-write scheme is particularly effective when a program does a fork() that is quickly followed by an execve(). In such a case, the scheme is able to avoid almost all page copying, save for a few pages in the stack- or data-segment [9, 64].
Note that the page sharing described here happens automatically and without the explicit knowledge of the process. There are times when two or more processes need to cooperate and want to explicitly share some virtual memory pages. Linux supports this through the mmap() system call and through System V shared memory segments . Because the processes are cooperating, it is their responsibility to map the shared memory segment with suitable permission bits to ensure that the processes can access the memory only in the intended fashion.