16.4 Alternative Architectures
So far, we’ve considered hardware additions for security that start with a computing environment based on a traditional CPU and then either put armor around it or put armored devices next to it. However, many active areas of current research—and also current industrial development—are exploring changing the traditional CPU instead. Some of this work is explicitly motivated by security; some has other motivations but still has relevance to security.
16.4.1 The Conventional Machine
In Section 4.1.1, we reviewed the basic architecture of a conventional system. As Figure 4.1 showed, memory is an array of indexed locations; let’s say that each location is 1 byte wide. The CPU interacts with a memory location by issuing the address of that location on the address bus and indicating the nature of the interaction (e.g., read or write) on a control line. The data in question is then transferred on the data bus; the direction of this transfer depends on whether the operation is a read or a write.
Programs are stored in memory like anything else. The CPU fetches an instruction from memory, internally decodes it, and carries out its operation, which may involve additional reads or writes. The CPU then proceeds to fetch the next instruction.
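This fetch-decode-execute cycle can be sketched in a few lines of Python. The instruction set below (LOAD/ADD/STORE/HALT opcodes and two-byte instructions) is invented for illustration; the point is only that instructions are fetched from the same memory that holds data.

```python
# Toy fetch-decode-execute loop for the conventional model: memory is a
# flat array, and the program lives in it like any other data. The
# opcodes are invented for illustration; no real instruction set is modeled.

LOAD, ADD, STORE, HALT = 0x01, 0x02, 0x03, 0xFF

def run(memory):
    """Execute two-byte instructions (opcode, operand address) from address 0."""
    acc, pc = 0, 0
    while True:
        opcode = memory[pc]            # instruction fetch (a memory read)
        if opcode == HALT:
            return acc
        operand = memory[pc + 1]
        if opcode == LOAD:             # an additional read, as data this time
            acc = memory[operand]
        elif opcode == ADD:
            acc += memory[operand]
        elif opcode == STORE:          # a data write
            memory[operand] = acc
        pc += 2                        # proceed to the next instruction

# Program: load mem[8], add mem[9], store the sum into mem[10], halt.
mem = [LOAD, 8, ADD, 9, STORE, 10, HALT, 0, 5, 7, 0]
run(mem)
print(mem[10])  # 12
```

Note that nothing distinguishes the bytes at addresses 0 through 6 (the program) from those at 8 through 10 (the data); that uniformity is exactly what the tagged architectures discussed later in this section revisit.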
Current conventional architectures differ from this simple one in three fundamental ways.
Memory management. In this naive model, the address that the CPU issues is the address that the memory sees. As a consequence, when multitasking, each separate program or process must be aware of the addresses the other ones are using. This clearly creates problems for security and fault tolerance, as well as general ease of programming.
To avoid these problems and to enable lots of other flexibility, modern systems introduce a level of indirection: A memory-management unit (MMU) translates the virtual or logical addresses the CPU issues into physical addresses the memory sees (see Figure 4.2). The MMU can also enforce restrictions, such as read-only, by refusing to translate the address for a write request. The MMU, in conjunction with OS trickery, can enforce more exotic models as well, such as copy on write, whereby memory is shared between two processes until one tries to write to it.
When changing the process or memory domain currently active, the CPU can also instruct the MMU to change address spaces: the memory image seen by the CPU.
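As a rough sketch of this translation-with-permissions idea, the toy MMU below maps virtual pages to physical pages and refuses to translate writes to read-only pages. The page size, table layout, and PageFault exception are all invented for illustration, not drawn from any real design.

```python
# Minimal sketch of MMU-style translation with per-page permissions.
PAGE_SIZE = 256

class PageFault(Exception):
    pass

class MMU:
    def __init__(self):
        # virtual page number -> (physical page number, writable?)
        self.page_table = {}

    def map(self, vpage, ppage, writable):
        self.page_table[vpage] = (ppage, writable)

    def translate(self, vaddr, is_write):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage not in self.page_table:
            raise PageFault(f"unmapped address {vaddr:#x}")
        ppage, writable = self.page_table[vpage]
        if is_write and not writable:
            # Enforce read-only by refusing to translate the write.
            raise PageFault(f"write to read-only address {vaddr:#x}")
        return ppage * PAGE_SIZE + offset

mmu = MMU()
mmu.map(vpage=0, ppage=7, writable=False)   # a read-only page
print(mmu.translate(0x10, is_write=False))  # 1808 (= 7 * 256 + 0x10)
```

Switching address spaces, in this toy model, amounts to swapping in a different page table; copy on write can be layered on top by mapping a shared page read-only and handling the resulting fault.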
Privileged instructions. This is a security book, so our natural inclination is to look at everything—including address translation—from a security perspective. From this perspective, a natural question is: What’s the point of using memory management to protect address spaces from rogue programs on the CPU, if the CPU itself is responsible for controlling and configuring the MMU?
This line of thinking led to the introduction of privilege levels. (a) The CPU has some notion of its current privilege level. (b) What an instruction or operation does, or whether it’s even permitted, depends on the current privilege level. (c) Transitions between privilege levels—in particular, transitions to greater privilege—must be carefully controlled.
In the standard textbook model today, a CPU has two privilege levels: user and kernel—or, sometimes, unprivileged and privileged, respectively. Typically, important protection-relevant tasks, such as changing MMU settings, can be done only in kernel mode. As discussed in Chapter 4, user-level code can transition to kernel mode only via a system call, or trap, that, via hardware, changes mode but also transfers control to specially designated (and, one hopes, trusted) code. The standard model uses the terms user and kernel for privileges because of the general intention that operating system code runs in kernel mode, that code from ordinary users runs in user mode, and that the operating system protects itself (and the users) from the users.
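The interplay of modes, privileged instructions, and traps might be sketched as follows. The instruction names and handler mechanism here are invented, but the shape matches the textbook model: privileged operations check the current mode, and the only way up is a trap that simultaneously raises privilege and transfers control to designated code.

```python
# Sketch of two privilege levels and a controlled transition between them.
USER, KERNEL = 0, 1

class ProtectionFault(Exception):
    pass

class CPU:
    def __init__(self, trap_handler):
        self.mode = USER
        self.trap_handler = trap_handler   # designated, trusted entry point

    def set_mmu(self, config):
        if self.mode != KERNEL:            # a privileged instruction
            raise ProtectionFault("set_mmu attempted in user mode")
        self.mmu_config = config

    def syscall(self, number):
        # Hardware raises privilege AND transfers control together; user
        # code cannot acquire kernel mode while continuing to run itself.
        self.mode = KERNEL
        try:
            return self.trap_handler(self, number)
        finally:
            self.mode = USER               # drop privilege on return

def handler(cpu, number):
    if number == 1:
        cpu.set_mmu("new address space")   # legal: we are in kernel mode
        return 0

cpu = CPU(handler)
print(cpu.syscall(1))  # 0
```

Calling `cpu.set_mmu(...)` directly from user mode raises a fault; the same operation succeeds inside the handler because the trap raised the privilege level first.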
Caching. In the naive model, a memory location, once translated, "lives" at some place in ROM or RAM; in this simple model, the CPU does one thing at a time and accesses memory as needed. Modern systems have achieved significant performance improvements, however, by throwing these constraints out the window. Memory no longer needs to be bound to exactly one physical device; rather, we can try to cache frequently used items in faster devices. Consequently, CPUs may have extensive internal caches of memory—and then play various update games to make sure that various devices, such as other CPUs, see consistent views of memory. Caching enables a CPU to execute sequences of instructions without touching external memory. Caching also motivates the development of fancy heuristics, such as prefetching, to attempt to make sure that the right items are in the cache when needed. Processors sometimes separate instruction caches from data caches.
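A toy direct-mapped cache illustrates why caching hides activity from the bus: repeated reads of the same line are serviced internally and generate no external traffic. The geometry here (line size, number of lines) is arbitrary, chosen only for illustration.

```python
# Toy direct-mapped cache in front of a backing memory.
LINE = 4        # bytes per cache line
NLINES = 8      # number of cache lines

class Cache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}          # index -> (tag, line data)
        self.bus_reads = 0       # externally visible bus traffic

    def read(self, addr):
        block = addr // LINE
        index, tag = block % NLINES, block // NLINES
        line = self.lines.get(index)
        if line is None or line[0] != tag:      # miss: go out on the bus
            self.bus_reads += 1
            base = block * LINE
            line = (tag, self.memory[base:base + LINE])
            self.lines[index] = line
        return line[1][addr % LINE]             # hit: serviced internally

mem = list(range(64))
cache = Cache(mem)
for _ in range(100):
    cache.read(5)            # 100 CPU reads...
print(cache.bus_reads)       # 1 -- only one ever appears on the bus
```

An external monitor watching `bus_reads` sees one access where the CPU performed a hundred; this is exactly the observability gap discussed below.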
Systems achieve additional performance improvement by doing away with the notion that the CPU executes instructions one at a time, in the order written in the program. One way this is done is via pipelining: decomposing the execution of an instruction into several stages and making sure that the hardware for each stage is always busy. Another innovation is superscalar processing—after decomposing the instruction execution into stages, we add extra modules for some of the stages (e.g., a second arithmetic unit). Since idle hardware is wasted hardware, processors also use aggressive heuristics for speculative execution (e.g., guessing the result of a future branch and filling the pipeline based on that assumption) and out-of-order execution (e.g., shuffling instructions and registers around at runtime to extract more parallelism).
(For more information on modern system architectures, consult one of the standard books in the area. Patterson and Hennessy [PH07] is considered the default textbook; Stokes [Sto07] provides a lighter introduction that focuses more directly on the machines you probably use.)
Privilege levels, syscall traps, and memory management all clearly assist in security. (Indeed, consider how susceptible a modern Internet-connected computer would be if it lacked kernel/user separation.) Sometimes, the lack of sufficient control can be frustrating. For example, if the MMU knew when a CPU’s memory read was really for an instruction fetch, we could cook up a system in which memory regions had "read but do not execute" permission—thus providing a line of defense against stack-code injection attacks (recall Section 6.1). Indeed, this relatively straightforward idea was touted as the revolutionary NX feature by the technical press in recent years.
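The read-but-do-not-execute idea amounts to a permission check that distinguishes instruction fetches from data reads. A minimal sketch, with an invented page-permission table and fault type:

```python
# Sketch of NX-style checking: translation distinguishes instruction
# fetches from data reads, so a page can be readable as data yet refuse
# to supply instructions. Table layout and fault type are invented.

class NXFault(Exception):
    pass

# virtual page -> permission string containing any of "r", "w", "x"
page_perms = {
    0: "rx",   # code page: readable and executable
    1: "rw",   # stack/data page: readable, writable, but NOT executable
}

def access(vpage, kind):
    """kind is 'read', 'write', or 'fetch' (an instruction fetch)."""
    perms = page_perms.get(vpage, "")
    needed = {"read": "r", "write": "w", "fetch": "x"}[kind]
    if needed not in perms:
        raise NXFault(f"{kind} denied on page {vpage}")
    return True

access(1, "write")           # injecting bytes onto the stack: allowed
try:
    access(1, "fetch")       # jumping to those bytes: blocked
except NXFault as e:
    print("fault:", e)
```

The stack-smashing attacker can still write shellcode into page 1; what the execute bit removes is the ability to ever fetch it as instructions.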
However, features such as internal caching and pipelining/out-of-order execution make things a bit harder. The relationship between what the internal system is doing and what an external device (such as a PCI card verifying integrity of kernel memory structures) can perceive is much less well defined. For example, suppose that we wanted to ensure that a certain FLASH device could be reprogrammed only when the system was executing a certain trusted module within ROM. Naively, we might add external hardware that sensed the address bus during instruction fetches and enabled FLASH changes only when those fetches were from ROM addresses.
However, if we can’t tell which cached instructions are actually being executed at the moment (let alone whether other code is simply "borrowing" parts of ROM as subroutines or even whether a memory read is looking for data or an instruction), then such techniques cannot work.
16.4.2 Virtualization
As Chapter 4 discussed, in the standard software architecture, an operating system provides services to user-level processes and enforces separation between these processes. As Section 16.4.1 discussed, hardware architecture usually reinforces these features. The aspiration here is that we don’t want processes interfering with or spying on each other or on the OS, unless it’s through a channel explicitly established and monitored by the OS. Figure 16.3 sketches this model.
Figure 16.3 In the conventional system model, the OS provides protection between separate processes.
In many scenarios, this approach to controlled separation may not be sufficient. Rather than providing separation between userland processes, one may prefer separation at the machine level. Rather than an OS protecting processes from each other, an OS and its processes are hoisted from a real machine up to a virtual machine, and another software layer—usually called a virtual machine monitor (VMM)—protects these virtual machines from each other. Figure 16.4 shows one approach—although, as Section 16.5.1 discusses, many approaches exist here.
Figure 16.4 In virtualization models, processes are partitioned among virtual machines. Here, we sketch the type I approach, whereby the VMM runs on the hardware and provides separation between separate OS instances. In this case, the VMM is often called a hypervisor. Other models exist—see Section 16.5.1.
This idea of creating the illusion of multiple virtual machines within one machine is called virtualization. Initially explored in the early days of mainframe computing, virtualization has become fashionable again. What’s the motivation for virtualization? Since this is a book about security, we tend to think about security first. And indeed, some reasons follow from security.
For one thing, the API between the OS and userland applications can be extraordinarily rich and complicated. This complexity can make it hard to reason about and trust the properties one would like for this separation.
Complexity of the API will also likely lead to complexity of implementation, increasing the size, and hence the likely untrustworthiness, of the TCB.
For another example, modern platforms have seen a continual bleeding of applications into the OS. As a consequence, untrustworthy code, such as device drivers and graphics routines, may execute with kernel privileges. (Indeed, some researchers blame this design choice for Windows’ endless parade of vulnerabilities.)
Other reasons follow from basic economics.
- The API an OS gives to the application is highly specialized and may thus be unsuitable—for example, one can’t easily run a mission-critical Win98 application as a userland process on OSX.
- According to rumor, giving each well-tested legacy application/OS combination its own machine leads to CPU utilization percentages in the single digits. Consolidating many such virtual machines onto one physical machine saves money.
Section 16.5.1 discusses old and new approaches to virtualization and security implications in more detail.
16.4.3 Multicore
Another trend in commercial CPUs is multicore: putting multiple processors (cores) on a single chip. The vendor motivation for this trend is a bit murky: increased performance is touted, better yield is rumored. However, multicore also raises the potential for security applications: If virtualization can help with a security idea, then wouldn’t giving each virtual machine its own processor be much simpler and more likely to work than mucking about with special modes?
One commercial multicore processor, CELL, touts security goals; it features an architecture in which userland processes get farmed off to their own cores, for increased protection from both other userland processes and the kernel. (We discuss this further in Section 16.5.3.)
16.4.4 Armored CPUs
Modern CPUs cache instructions and data internally and thus fetch code and data chunks at a time, instead of piecemeal. As we noted earlier, this behavior can make it hard for external security hardware to know exactly what the CPU is doing. However, this difficulty can be a feature as well as a bug, since it also can be hard for an external adversary to observe what’s happening.
Consequently, if we assume that the adversary cannot penetrate the CPU itself, we might be able to achieve such things as private computation, by ensuring that code lives encrypted externally; integrity of code and data, by doing cryptographic checks as the chunks move across the border; and binding data to code, by keeping data encrypted and decrypting it only internally, and only if the right code comes in.
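One way to picture the border crossing: cache lines are encrypted and MACed as they are evicted to external memory and verified as they are filled back in, so the outside world sees only ciphertext and cannot tamper undetected. The sketch below is a toy model, not a real design; the keystream construction is a placeholder (not a secure cipher), and a real armored CPU would use dedicated cipher hardware plus anti-replay state.

```python
# Toy model of an armored CPU's memory boundary: MAC-then-verify on
# every line that crosses it. KEY lives only "inside the CPU" here.
import hmac, hashlib, os

KEY = os.urandom(32)

def keystream(addr, length):
    # Placeholder address-tied keystream -- NOT a real cipher.
    return hashlib.sha256(KEY + addr.to_bytes(8, "big")).digest()[:length]

def evict(addr, line):
    """Leaving the CPU: encrypt and MAC; external RAM sees only (ct, tag)."""
    ct = bytes(a ^ b for a, b in zip(line, keystream(addr, len(line))))
    tag = hmac.new(KEY, addr.to_bytes(8, "big") + ct, hashlib.sha256).digest()
    return ct, tag

def fill(addr, ct, tag):
    """Entering the CPU: verify the MAC before decrypting and using the line."""
    expect = hmac.new(KEY, addr.to_bytes(8, "big") + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expect):
        raise ValueError("integrity check failed on cache fill")
    return bytes(a ^ b for a, b in zip(ct, keystream(addr, len(ct))))

line = b"secret code/data"
ct, tag = evict(0x1000, line)
assert fill(0x1000, ct, tag) == line      # round-trips when untampered
```

Binding the MAC to the address also stops an adversary from swapping two valid ciphertext lines with each other, though replaying an old line at its own address would need additional versioning state not modeled here.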
Several research projects have built on this idea. XOM (Stanford) explored this idea to implement execute-only memory via simulators. AEGIS (MIT) made it all the way to real FPGA-based prototypes, which also incorporate the SPUF idea, to provide some grounds for the physical-security assumption.
16.4.5 Tagging
When lamenting the sad state of security in our current cyberinfrastructure, some security old-timers wistfully talk about tagged architectures, which had been explored in early research but had been largely abandoned. Rather than having all data items look alike, this approach tags each data item with special metadata. Implemented in hardware, this metadata gets stored along with the data in memory, gets transmitted along with it on buses—and controls the ways in which the data can be used. Systems in which permissions are based on capabilities (recall Chapter 4) might implement these capabilities as data items with a special tag indicating so; this keeps malicious processes from simply copying and forging capabilities.
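A tagged capability store might be sketched as follows. The representation (one tag bit per word, cleared by every ordinary store, settable only by a privileged operation) is invented for illustration, but it captures why copying a capability's bits does not forge a capability.

```python
# Sketch of tagged memory for capabilities: the tag travels with the
# word, and ordinary stores always clear it.

class TaggedMemory:
    def __init__(self, size):
        self.words = [0] * size
        self.tags = [False] * size     # True = "this word is a capability"

    def store(self, addr, value):
        """An ordinary (userland) store: the hardware always clears the tag."""
        self.words[addr] = value
        self.tags[addr] = False

    def store_capability(self, addr, value, privileged):
        """Only privileged code can mint a word the hardware will trust."""
        if not privileged:
            raise PermissionError("only the kernel mints capabilities")
        self.words[addr] = value
        self.tags[addr] = True

    def use_as_capability(self, addr):
        if not self.tags[addr]:
            raise PermissionError("not a valid capability")
        return self.words[addr]

mem = TaggedMemory(16)
mem.store_capability(0, 0xCAFE, privileged=True)
print(hex(mem.use_as_capability(0)))   # 0xcafe
mem.store(1, 0xCAFE)                   # a forged copy of the same bits...
# ...but mem.use_as_capability(1) would raise: the tag is missing.
```

The bits at addresses 0 and 1 are identical; only the metadata differs, and the metadata is outside the malicious process's reach.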
Some of these ideas are finding expression again in modern research. For example, many buffer overflow attacks work because the adversary can enter data, as user input, that the program mistakenly uses as a pointer or address. Researchers at the University of Illinois have built, via an FPGA (field-programmable gate array) prototype, a CPU retrofitted with an additional metadata line to indicate that a data item is tainted [CXN+05]. The CPU automatically marks user input as tainted. Attempts to use a tainted data item as an address throw a hardware fault. However, certain comparison instructions, such as those a program issues when it sanity-checks user input, clear the taint tags. Stanford’s TaintBochs project uses software-based virtualization to explore further uses of taintedness and tagging in security contexts [CPG+04].
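The taint-tracking behavior just described (input born tainted, taint propagating through arithmetic, a fault when a tainted value is used as an address, and a comparison clearing the tag) can be sketched as follows; the interfaces here are invented for illustration and do not reflect the actual Illinois hardware.

```python
# Sketch of hardware-style taint tracking on data words.

class TaintFault(Exception):
    pass

class Word:
    def __init__(self, value, tainted=False):
        self.value, self.tainted = value, tainted

    def add(self, other):
        # Taint propagates through computation.
        return Word(self.value + other.value, self.tainted or other.tainted)

    def compare(self, bound):
        # A bounds check is taken as evidence of sanitization: clear taint.
        self.tainted = False
        return self.value < bound

def read_user_input(value):
    return Word(value, tainted=True)       # all user input starts tainted

def load(memory, addr_word):
    if addr_word.tainted:
        raise TaintFault("tainted value used as address")
    return memory[addr_word.value]

memory = list(range(10))
offset = read_user_input(2)                # tainted at the source
addr = offset.add(Word(1))                 # taint propagates: addr is tainted
try:
    load(memory, addr)                     # unchecked input used as a pointer
except TaintFault as e:
    print("fault:", e)
if addr.compare(len(memory)):              # the sanity check clears the tag
    print(load(memory, addr))              # 3
```

As the Illinois work notes, treating any comparison as sanitization is a heuristic: a comparison that does not actually validate the value still clears the tag.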