4.2 Process State
Every process in the system is assigned a unique identifier termed the process identifier (PID). PIDs are the common mechanism used by applications and by the kernel to reference processes. Applications use PIDs, for example, when sending a signal to a process and when collecting the exit status of a deceased process. Two PIDs are of special importance to each process: the PID of the process itself and the PID of its parent process.
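As a user-level illustration of how PIDs tie these mechanisms together, the following minimal sketch forks a child, signals it by PID, and collects its exit status by PID; the choice of signal and the one-second delay are arbitrary choices for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <err.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int status;
	pid_t child;

	child = fork();			/* parent receives the child's PID */
	if (child == -1)
		err(1, "fork");
	if (child == 0) {
		for (;;)		/* child: wait to be signaled */
			pause();
	}
	sleep(1);			/* crude: give the child time to start */
	kill(child, SIGTERM);		/* signal the child by PID */
	waitpid(child, &status, 0);	/* collect its exit status by PID */
	if (WIFSIGNALED(status))
		printf("child %ld terminated by signal %d\n",
		    (long)child, WTERMSIG(status));
	return (0);
}
```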
The layout of process state was completely reorganized in FreeBSD 5.2. The goal was to support multiple threads that share an address space and other resources. Threads have also been called lightweight processes in other systems. A thread is the unit of execution of a process; it requires an address space and other resources, but it can share many of those resources with other threads. Threads sharing an address space and other resources are scheduled independently and can all do system calls simultaneously. The reorganization of process state in FreeBSD 5.2 was designed to support threads that can select the set of resources to be shared, known as variable-weight processes [Aral et al., 1989].
The developers did the reorganization by moving many components of process state from the process and user structures into separate substructures for each type of state information, as shown in Figure 4.1. The process structure references all the substructures directly or indirectly. The user structure remains primarily as a historic artifact for the benefit of debuggers. The thread structure contains just the information needed to run in the kernel: information about scheduling, a stack to use when running in the kernel, a thread control block (TCB), and other machine-dependent state. The TCB is defined by the machine architecture; it includes the general-purpose registers, stack pointers, program counter, processor-status longword, and memory-management registers.
Figure 4.1 Process state.
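The exact layout of the TCB is defined by each machine architecture; the following is a hypothetical sketch, loosely in the style of a 32-bit PC, of the kinds of fields it contains. Every name here is invented for illustration.

```c
#include <stdint.h>

/* Hypothetical, simplified TCB for an IA-32-like machine. */
struct tcb {
	uint32_t tcb_gpr[8];	/* general-purpose registers */
	uint32_t tcb_usp;	/* user stack pointer */
	uint32_t tcb_ksp;	/* kernel stack pointer */
	uint32_t tcb_pc;	/* program counter */
	uint32_t tcb_psl;	/* processor-status longword */
	uint32_t tcb_ptroot;	/* memory-management (page-table) root */
};
```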
In their lightest-weight form, FreeBSD threads share all the process resources including the PID. When additional parallel computation is needed, a new thread is created using the kse_create system call. All the scheduling and management of the threads is handled by a user-level scheduler that is notified of thread transitions via callbacks from the kernel. The user-level thread manager must also keep track of the user-level stacks being used by each of the threads, since the entire address space is shared including the area normally used for the stack. Since the threads all share a single process structure, they have only a single PID and thus show up as a single entry in the ps listing.
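Applications normally reach this machinery through the POSIX thread library, which is layered on kse_create, rather than calling it directly. A minimal sketch: every thread created below reports the same PID, consistent with the single ps entry described above.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *
report(void *name)
{
	/* Every thread in the process observes the same PID. */
	printf("thread %s: pid %ld\n", (const char *)name, (long)getpid());
	return (NULL);
}

int
main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, report, "one");
	pthread_create(&t2, NULL, report, "two");
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	report("main");
	return (0);
}
```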
Many applications do not wish to share all of a process's resources. The rfork system call creates a new process entry that shares a selected set of resources from its parent. Typically the signal actions, statistics, and the stack and data parts of the address space are not shared. Unlike the lightweight thread created by kse_create, the rfork system call associates a PID with each thread that shows up in a ps listing and that can be manipulated in the same way as any other process in the system. Processes created by fork, vfork, or rfork have just a single thread structure associated with them.
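A minimal user-level sketch of selective sharing with rfork: RFPROC alone creates a child with its own PID whose descriptor table remains shared with its parent (passing RFFDG would request a copy instead, and RFMEM would additionally share the address space; sharing memory directly from C normally goes through the rfork_thread(3) wrapper, because parent and child would otherwise run on the same stack). The file opened here is an arbitrary choice for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	pid_t pid;

	/*
	 * RFPROC alone: create a new process with its own PID, but
	 * without RFFDG the descriptor table stays shared.
	 */
	pid = rfork(RFPROC);
	if (pid == -1)
		err(1, "rfork");
	if (pid == 0) {
		/* Child: this open is visible to the parent, too. */
		int fd = open("/etc/motd", O_RDONLY);
		printf("child %ld opened fd %d\n", (long)getpid(), fd);
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	return (0);
}
```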
The Process Structure
In addition to the references to the substructures, the process entry shown in Figure 4.1 contains the following categories of information:
- Process identification: the PID and the parent PID
- Signal state: signals pending delivery, signal mask, and summary of signal actions
- Tracing: process tracing information
- Timers: real-time timer and CPU-utilization counters
The process substructures shown in Figure 4.1 have the following categories of information (a simplified sketch of these references appears after the list):
- Process-group identification: the process group and the session to which the process belongs
- User credentials: the real, effective, and saved user and group identifiers
- Memory management: the structure that describes the allocation of virtual address space used by the process; the VM space and its related structures are described more fully in Chapter 5
- File descriptors: an array of pointers to file entries indexed by the process open file descriptors; also, the open file flags and current directory
- System call vector: the mapping of system call numbers to actions; in addition to current and deprecated native FreeBSD executable formats, the kernel can run binaries compiled for several other UNIX variants, such as Linux, OSF/1, and System V Release 4, by providing alternative system call vectors when such environments are requested
- Resource accounting: the rlimit structures that describe the utilization of the many resources provided by the system (see Section 3.8)
- Statistics: statistics collected while the process is running that are reported when it exits and are written to the accounting file; also includes process timers and profiling information if the latter is being collected
- Signal actions: the action to take when a signal is posted to a process
- Thread structure: the contents of the thread structure (described at the end of this section)
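The sketch below suggests how the process structure references these substructures. The names follow the FreeBSD sources, but nearly all fields, as well as all locking, are omitted.

```c
#include <sys/types.h>
#include <sys/queue.h>

/* Simplified sketch of the process entry and its substructures. */
struct proc {
	pid_t		  p_pid;	/* process identification */
	struct proc	 *p_pptr;	/* parent process */
	struct pgrp	 *p_pgrp;	/* process-group identification */
	struct ucred	 *p_ucred;	/* user credentials */
	struct vmspace	 *p_vmspace;	/* memory management (Chapter 5) */
	struct filedesc	 *p_fd;		/* file descriptors */
	struct sysentvec *p_sysent;	/* system call vector */
	struct plimit	 *p_limit;	/* resource accounting */
	struct pstats	 *p_stats;	/* statistics */
	struct sigacts	 *p_sigacts;	/* signal actions */
	TAILQ_HEAD(, thread) p_threads;	/* this process's thread(s) */
};
```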
The state element of the process structure holds the current value of the process state. The possible state values are shown in Table 4.1. When a process is first created with a fork system call, it is initially marked as NEW. The state is changed to NORMAL when enough resources are allocated to the process for it to begin execution. From that point onward, the process remains in the NORMAL state (where its threads fluctuate among RUNNABLE, that is, preparing to be or actually executing; SLEEPING, that is, waiting for an event; and STOPPED, that is, stopped by a signal or the parent process) until the process terminates. A deceased process is marked as ZOMBIE until it has freed its resources and communicated its termination status to its parent process.
Table 4.1. Process states.
| State | Description |
|---|---|
| NEW | undergoing process creation |
| NORMAL | thread(s) will be RUNNABLE, SLEEPING, or STOPPED |
| ZOMBIE | undergoing process termination |
The system organizes process structures into two lists. Process entries are on the zombproc list if the process is in the ZOMBIE state; otherwise, they are on the allproc list. The two queues share the same linkage pointers in the process structure, since the lists are mutually exclusive. Segregating the dead processes from the live ones reduces the time spent both by the wait system call, which must scan the zombies for potential candidates to return, and by the scheduler and other functions that must scan all the potentially runnable processes.
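A sketch of the shared linkage, using the list macros from queue(3): because a process is on exactly one of the two lists, a single pair of pointers serves both. The proc_zombify() helper is invented for illustration.

```c
#include <sys/queue.h>

struct proc;
LIST_HEAD(proclist, proc);

struct proclist allproc;	/* all live processes */
struct proclist zombproc;	/* processes in the ZOMBIE state */

struct proc {
	LIST_ENTRY(proc) p_list;	/* linkage for allproc OR zombproc */
	/* ... */
};

/* Hypothetical helper: move a dying process between the lists. */
static void
proc_zombify(struct proc *p)
{
	LIST_REMOVE(p, p_list);			/* off allproc */
	LIST_INSERT_HEAD(&zombproc, p, p_list);	/* onto zombproc */
}
```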
Most threads, except the currently executing thread (or threads if the system is running on a multiprocessor), are also in one of two queues: a run queue or a sleep queue. Threads that are in a runnable state are placed on a run queue, whereas threads that are blocked awaiting an event are located on a sleep queue. Stopped threads that are not also sleeping are on neither type of queue. The run queues are organized according to thread-scheduling priority and are described in Section 4.4. The sleep queues are organized in a data structure that is hashed by event identifier. This organization optimizes finding the sleeping threads that need to be awakened when a wakeup occurs for an event. The sleep queues are described in Section 4.3.
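A hypothetical sketch of the hashed organization: the event identifier (wait channel) is typically the address of a kernel data structure, and hashing it selects a bucket of sleeping threads. The table size and hash function here are invented for illustration; the real sleep-queue implementation is described in Section 4.3.

```c
#include <sys/queue.h>
#include <stdint.h>

#define	SLPQUE_SIZE	128	/* illustrative; a power of two */

struct thread;
TAILQ_HEAD(slpquehead, thread);
static struct slpquehead slpque[SLPQUE_SIZE];

/* Map a wait channel (an arbitrary kernel address) to its bucket. */
static struct slpquehead *
slpque_lookup(const void *wchan)
{
	uintptr_t h = (uintptr_t)wchan >> 8;	/* drop low-order bits */

	return (&slpque[h & (SLPQUE_SIZE - 1)]);
}
```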
The p_pptr pointer and related lists (p_children and p_sibling) are used in locating related processes, as shown in Figure 4.2 (a code sketch of this linkage follows the figure). When a process spawns a child process, the child process is added to its parent's p_children list. The child process also keeps a backward link to its parent in its p_pptr pointer. If a process has more than one child process active at a time, the children are linked together through their p_sibling list entries. In Figure 4.2, process B is a direct descendant of process A, whereas processes C, D, and E are descendants of process B and are siblings of one another. Process B typically would be a shell that started a pipeline (see Sections 2.4 and 2.6) including processes C, D, and E. Process A probably would be the system-initialization process init (see Sections 3.1 and 14.6).
Figure 4.2 Process-group hierarchy.
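A sketch of the linkage, with names following the FreeBSD sources; the proc_link() helper is invented to show how a fork threads the child onto its parent's list.

```c
#include <sys/queue.h>

struct proc {
	struct proc	 *p_pptr;	/* back pointer to parent */
	LIST_HEAD(, proc) p_children;	/* head of list of children */
	LIST_ENTRY(proc)  p_sibling;	/* linkage among the siblings */
	/* ... */
};

/* Hypothetical helper: thread a new child onto its parent's list. */
static void
proc_link(struct proc *parent, struct proc *child)
{
	child->p_pptr = parent;
	LIST_INSERT_HEAD(&parent->p_children, child, p_sibling);
}
```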
CPU time is made available to threads according to their scheduling class and scheduling priority. As shown in Table 4.2, the FreeBSD kernel has two kernel and three user scheduling classes. The kernel will always run the thread in the highest-priority class. Any runnable kernel-interrupt threads run in preference to anything else, followed by any runnable top-half-kernel threads. Any runnable real-time threads are run in preference to runnable threads in the time-share and idle classes, and runnable time-share threads are run in preference to runnable threads in the idle class. The priorities of threads in the real-time and idle classes are set by applications using the rtprio system call (a usage sketch follows the table) and are never adjusted by the kernel. The bottom-half interrupt priorities are set when the devices are configured and never change. The top-half priorities are set based on predefined priorities for each kernel subsystem and never change.
Table 4.2. Thread-scheduling classes.
| Range | Class | Thread type |
|---|---|---|
| 0-63 | ITHD | Bottom-half kernel (interrupt) |
| 64-127 | KERN | Top-half kernel |
| 128-159 | REALTIME | Real-time user |
| 160-223 | TIMESHARE | Time-sharing user |
| 224-255 | IDLE | Idle user |
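A minimal user-level sketch of rtprio: it moves the calling process into the idle class and reads the setting back. Changing real-time (and, historically, idle) priorities requires superuser privilege.

```c
#include <sys/types.h>
#include <sys/rtprio.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
	struct rtprio rtp;

	/* Move this process (pid 0 = self) into the idle class. */
	rtp.type = RTP_PRIO_IDLE;
	rtp.prio = RTP_PRIO_MAX;	/* lowest priority in the class */
	if (rtprio(RTP_SET, 0, &rtp) == -1)
		err(1, "rtprio set");

	/* Read the setting back. */
	if (rtprio(RTP_LOOKUP, 0, &rtp) == -1)
		err(1, "rtprio lookup");
	printf("class %u, priority %u\n", rtp.type, rtp.prio);
	return (0);
}
```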
The priorities of threads running in the time-share class are adjusted by the kernel based on resource usage and recent CPU utilization. A thread has two scheduling priorities: one for scheduling user-mode execution and one for scheduling kernel-mode execution. The kg_user_pri field associated with the thread structure contains the user-mode scheduling priority, whereas the td_priority field holds the current scheduling priority. The current priority may be different from the user-mode priority when the thread is executing in the top half of the kernel. Priorities range between 0 and 255, with a lower value interpreted as a higher priority (see Table 4.2). User-mode priorities range from 128 to 255; priorities less than 128 are used only when a thread is asleep, that is, awaiting an event in the kernel, and immediately after such a thread is awakened. Threads in the kernel are given a higher priority because they typically hold shared kernel resources when they awaken. The system wants to run them as quickly as possible once they hold a resource so that they can use the resource and return it before another thread requests it and gets blocked waiting for it.
When a thread goes to sleep in the kernel, it must specify whether it should be awakened and marked runnable if a signal is posted to it. In FreeBSD, a kernel thread will be awakened by a signal only if it sets the PCATCH flag when it sleeps. The msleep() interface also handles sleeps limited to a maximum time duration and the processing of restartable system calls. The msleep() interface includes a reference to a string describing the event that the thread awaits; this string is externally visible, for example, in ps. The decision of whether to use an interruptible sleep depends on how long the thread may be blocked. Because it is complex to be prepared to handle signals in the midst of doing some other operation, many sleep requests are not interruptible; that is, a thread will not be scheduled to run until the event for which it is waiting occurs. For example, a thread waiting for disk I/O will sleep with signals blocked.
For quickly occurring events, delaying signal handling until after they complete is imperceptible. However, requests that may cause a thread to sleep for a long period, such as waiting for terminal or network input, must be prepared to have their sleep interrupted so that the posting of signals is not delayed indefinitely. Threads that sleep interruptibly may abort their system call because of a signal arriving before the event for which they are waiting has occurred. To avoid holding a kernel resource permanently, these threads must check why they have been awakened. If they were awakened because of a signal, they must release any resources that they hold. They must then return the error passed back to them by msleep(), which will be EINTR if the system call is to be aborted after the signal or ERESTART if it is to be restarted. Occasionally, an event that is supposed to occur quickly, such as a disk I/O, will get held up because of a hardware failure. Because the thread is sleeping in the kernel with signals blocked, it will be impervious to any attempts to send it a signal, even a signal that should cause it to exit unconditionally. The only solution to this problem is to change sleep()s on hardware events that may hang to be interruptible.
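A hypothetical kernel fragment showing the conventions just described: the sleep is interruptible (PCATCH), bounded in time, and tagged with a string visible in ps. The mutex, flag, and wait channel are all invented for illustration.

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx xx_mtx;	/* invented lock protecting xx_ready */
static int xx_ready;		/* invented condition to wait for */

static int
xx_wait(void)
{
	int error = 0;

	mtx_lock(&xx_mtx);
	while (!xx_ready && error == 0) {
		/*
		 * PCATCH makes the sleep interruptible by signals;
		 * "xxwait" is the string that ps displays; 5 * hz
		 * bounds the sleep to roughly five seconds.
		 */
		error = msleep(&xx_ready, &xx_mtx, PCATCH | PZERO,
		    "xxwait", 5 * hz);
	}
	mtx_unlock(&xx_mtx);
	/*
	 * On a signal, msleep() returns EINTR (abort the system
	 * call) or ERESTART (restart it); on timeout, EWOULDBLOCK.
	 * Any held resources must be released before returning.
	 */
	return (error);
}
```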
In the remainder of this book, we shall always use sleep() when referring to the routine that puts a thread to sleep, even when the msleep() interface is the one that is being used.
The Thread Structure
The thread structure shown in Figure 4.1 contains the following categories of information (a simplified sketch appears after the list):
- Scheduling: the thread priority, user-mode scheduling priority, recent CPU utilization, and amount of time spent sleeping
- Thread state: the run state of a thread (runnable, sleeping); additional status flags; if the thread is sleeping, the wait channel, the identity of the event for which the thread is waiting (see Section 4.3), and a pointer to a string describing the event
- Machine state: the machine-dependent thread information
- TCB: the user- and kernel-mode execution states
- Kernel stack: the per-thread execution stack for the kernel
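A simplified sketch of the thread structure; the names follow the FreeBSD sources, but most fields and all locking are omitted.

```c
#include <sys/types.h>
#include <sys/queue.h>

/* Simplified sketch of the thread structure. */
struct thread {
	TAILQ_ENTRY(thread) td_runq;	/* run- or sleep-queue linkage */
	u_char		 td_priority;	/* current scheduling priority */
	int		 td_state;	/* RUNNABLE, SLEEPING, STOPPED */
	void		*td_wchan;	/* wait channel, if SLEEPING */
	const char	*td_wmesg;	/* string describing the event */
	struct pcb	*td_pcb;	/* machine-dependent state and TCB */
	vm_offset_t	 td_kstack;	/* base of the kernel stack */
};
```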
Historically, the kernel stack was mapped to a fixed location in the virtual address space. The reason for using a fixed mapping is that when a parent forks, its runtime stack is copied for its child. If the kernel stack is mapped to a fixed address, the child's kernel stack is mapped to the same addresses as its parent kernel stack. Thus, all its internal references, such as frame pointers and stack-variable references, work as expected.
On modern architectures with virtual address caches, mapping the user structure to a fixed address is slow and inconvenient. FreeBSD 5.2 removes this constraint by eliminating all but the top call frame from the child's stack after copying it from its parent so that it returns directly to user mode, thus avoiding stack copying and relocation problems.
Every thread that might potentially run must have its stack resident in memory because one task of its stack is to handle page faults. If it were not resident, it would page fault when the thread tried to run, and there would be no kernel stack available to service the page fault. Since a system may have many thousands of threads, kernel stacks must be kept small to avoid wasting too much physical memory. In FreeBSD 5.2 on the PC, the kernel stack is limited to two pages of memory. Implementors must be careful when writing code that executes in the kernel to avoid large local variables and deeply nested subroutine calls, lest they overflow the run-time stack. As a safety precaution, some architectures leave an invalid page between the area for the run-time stack and the data structures that follow it. Thus, overflowing the kernel stack will cause a kernel-access fault instead of disastrously overwriting other data structures. It would be possible to simply kill the process that caused the fault and continue running. However, the FreeBSD kernel panics on a kernel-access fault because such a fault shows a fundamental design error in the kernel. By panicking and creating a crash dump, the error can usually be pinpointed and corrected.