This book describes the design and implementation of the BSD operating system--previously known as the Berkeley version of UNIX. Today, BSD is found in nearly every variant of UNIX, and is widely used for Internet services and firewalls, timesharing, and multiprocessing systems. Readers involved in technical and sales support can learn the capabilities and limitations of the system; applications developers can learn effectively and efficiently how to interface to the system; systems programmers can learn how to maintain, tune, and extend the system. Written from the unique perspective of the system's architects, this book delivers the most comprehensive, up-to-date, and authoritative technical information on the internal structure of the latest BSD system.
As in the previous book on 4.3BSD (with Samuel Leffler), the authors first update the history and goals of the BSD system. Next they provide a coherent overview of its design and implementation. Then, while explaining key design decisions, they detail the concepts, data structures, and algorithms used in implementing the system's facilities. As an in-depth study of a contemporary, portable operating system, or as a practical reference, readers will appreciate the wealth of insight and guidance contained in this book.Highlights of the book:
At press time, the source code for the 4.4BSD-Lite Release 2 system, as well as that for the FreeBSD version of 4.4BSD, which is compiled and ready to run on PC-compatible hardware, are available from Walnut Creek CDROM. Contact Walnut Creek for more information at 1-800-786-9907, or use email@example.com, or http://www.cdrom.com/.
The NetBSD distribution is compiled and ready to run on most workstation architectures. For more information, contact the NetBSD Project at majordomo@NetBSD.ORG (send a message body of "lists"), or http://www.NetBSD.ORG/. A fully supported commercial release, BSD/OS, is available from Berkeley Software Design, Inc., at 1-800-800-4273, firstname.lastname@example.org, or http://www.bsdi.com/.
The OpenBSD distribution is compiled and ready to run on a wide variety of workstation architectures, has been extensively vetted for security and reliability, and comes with significant security software already installed. You can order it on CD-ROM (directly bootable on four popular architectures) or download it via the Internet. For more information, visit the OpenBSD project's Web Site at http://www.OpenBSD.org/.
1. OVERVIEW.1. History and Goals.
History of the UNIX System.
AT&T UNIX System III and System V.
Berkeley Software Distributions.
UNIX in the World.
BSD and Other Systems.
The Influence of the User Community.
Design Goals of 4BSD.
4.2BSD Design Goals.
4.3BSD Design Goals.
4.4BSD Design Goals.
References.2. Design Overview of 4.4BSD.
4.4BSD Facilities and the Kernel.
Process Groups and Sessions.
BSD Memory-Management Design Decisions.
Memory Management Inside the Kernel.
Descriptors and I/O.
Multiple Filesystem Support.
Entry to the Kernel.
Return from the Kernel.
Returning from a System Call.
Traps and Interrupts.
I/O Device Interrupts.
Statistics and Process Scheduling.
Adjustment of the Time.
User, Group, and Other Identifiers.
Process Groups and Sessions.
II. PROCESSES.4. Process Management.
Introduction to Process Management.
The Process Structure.
The User Structure.
Low-Level Context Switching.
Voluntary Context Switching.
Calculations of Process Priority.
Process Run Queues and Context Switching.
Comparison with POSIX Signals.
Posting of a Signal.
Delivering a Signal.
Process Groups and Sessions.
References.5. Memory Management.
Processes and Memory .
Advantages of Virtual Memory.
Hardware Requirements for Virtual Memory.
Overview of the 4.4BSD Virtual-Memory System.
Kernel Memory Management.
Kernel Maps and Submaps.
Kernel Address-Space Allocation.
4.4BSD Process Virtual-Address Space.
Mapping to Objects.
Objects to Pages.
Collapsing of Shadow Chains.
5.6 Creation of a New Process.
Reserving Kernel Resources.
Duplication of the User Address Space.
Creation of a New Process Without Copying.
Execution of a File.
Process Manipulation of Its Address Space.
Change of Process Size.
Change of Protection.
Termination of a Process.
The Pager Interface.
The Pageout Daemon.
The Swap-In Process.
The Role of the pmap Module.
Initialization and Startup.
Mapping Allocation and Deallocation.
Change of Access and Wiring Attributes for Mappings.
Management of Page-Usage Information.
Initialization of Physical Pages.
Management of Internal Data Structures.
III. I/O System.6. I/O System Overview.
I/O Mapping from User to Device.
Entry Points for Block-Device Drivers.
Sorting of Disk I/O Requests.
Raw Devices and Physical I/O.
Entry Points for Character-Device Drivers.
Descriptor Management and Services.
Open File Entries.
Management of Descriptors.
Multiplexing I/O on Descriptors.
Implementation of Select.
Movement of Data Inside the Kernel.
The Virtual-Filesystem Interface.
Contents of a Vnode.
Exported Filesystem Services.
The Name Cache.
Implementation of Buffer Management.
Simple Filesystem Layers.
The Union Mount Filesystem.
References.7. Local Filesystems.
Hierarchical Filesystem Management.
Structure of an Inode.
Finding of Names in Directories.
Other Filesystem Semantics.
Large File Sizes.
References.8. Local Filestores.
Overview of the Filestore.
The Berkeley Fast Filesystem.
Organization of the Berkeley Fast Filesystem.
Optimization of Storage Utilization.
Reading and Writing to a File.
The Log-Structured Filesystem.
Organization of the Log-Structured Filesystem.
Reading of the Log.
Writing to the Log.
The Buffer Cache.
Creation of a File.
Reading and Writing to a File.
The Memory-Based Filesystem.
Organization of the Memory-Based Filesystem.
References.9. The Network Filesystem.
History and Overview.
NFS Structure and Operation.
The NFS Protocol.
The 4.4BSD NFS Implementation.
RPC Transport Issues.
Techniques for Improving Performance.
References.10. Terminal Handling.
The tty Structure.
Process Groups, Sessions, and Terminal Control.
RS-232 and Modem Control.
Output Line Discipline.
Output Top Half.
Output Bottom Half.
Input Bottom Half.
Input Top Half.
The stop Routine.
The ioctl Routine.
Closing of Terminal Devices.
Other Line Disciplines.
Serial Line IP Discipline.
Graphics Tablet Discipline.
IV. INTERPROCESS COMMUNICATION.11. Interprocess Communication.
Use of Sockets.
Implementation Structure and Overview.
Mbuf Utility Routines.
Passing Access Rights.
Passing Access Rights in the Local Domain.
References.12. Network Communication.
Protocol User-Request Routine.
Protocol Control-Output Routine.
Interface between Protocol and Network Interface.
Kernel Routing Tables.
User-Level Routing Policies.
User-Level Routing Interface: Routing Socket.
Buffering and Congestion Control.
Protocol Buffering Policies.
Additional Network-Subsystem Topics.
Address Resolution Protocol.
References.13. Network Protocols.
Internet Network Protocols.
Internet Ports and Associations.
Protocol Control Blocks.
User Datagram Protocol (UDP).
Internet Protocol (IP).
Transmission Control Protocol (TCP).
TCP Connection States.
Estimation of Round-Trip Time.
TCP Input Processing.
TCP Output Processing.
Sending of Data.
Avoidance of the Silly-Window Syndrome.
Avoidance of Small Packets.
Delayed Acknowledgments and Window Updates.
Buffer and Window Sizing.
Avoidance of Congestion with Slow Start.
Internet Control Message Protocol (ICMP).
OSI Implementation Issues.
Summary of Networking and Interprocess Communication.
Creation of a Communication Channel.
Sending and Receiving of Data.
Termination of Data Transmission or Reception.
V. SYSTEM OPERATION.14. System Startup.
The boot Program.
System Data Structures.
New Autoconfiguration Data Structures.
New Autoconfiguration Functions.
System Shutdown and Autoreboot.
Passage of Information To and From the Kernel.
This book is an extensive revision of the first authoritative and full-length description of the design and implementation of the research versions of the UNIX system developed at the University of California at Berkeley. Most detail is given about 4.4BSD, which incorporates the improvements of the previous Berkeley versions. Although 4.4BSD includes nearly 500 utility programs in addition to the kernel, this book concentrates almost exclusively on the kernel.
The UNIX System
The UNIX system runs on computers ranging from personal home systems to the largest supercomputers. It is the operating system of choice for most multiprocessor, graphics, and vector-processing systems, and is widely used for its original purpose of timesharing. It is the most common platform for providing network services (from FTP to WWW) on the Internet. It is the most portable operating system ever developed. This portability is due partly to its implementation language, C Kernighan & Ritchie, 1978 (which is itself one of the most widely ported languages), and partly to the elegant design of the system. Many of the system's features are imitated in other systems O'Dell, 1987.
Since its inception in 1969 Ritchie & Thompson, 1978, the UNIX system has developed in a number of divergent and rejoining streams. The original developers continued to advance the state of the art with their Ninth and Tenth Edition UNIX inside AT&T Bell Laboratories, and then their Plan 9 successor to UNIX. Meanwhile, AT&T licensed UNIX System V as a product, before selling it to Novell. Novell passed the UNIX trademark to X/OPEN and sold the source code and distribution rights to Santa Cruz Operation (SCO). Both System V and Ninth Edition UNIX were strongly influenced by the Berkeley Software Distributions produced by the Computer Systems Research Group (CSRG) of the University of California at Berkeley.
Berkeley Software Distributions
These Berkeley systems have introduced several useful programs and facilities to the UNIX community:
4.2BSD, 4.3BSD, and 4.4BSD are the bases for the UNIX systems of many vendors, and are used internally by the development groups of many other vendors. Many of these developments have also been incorporated by System V, or hav e been added by vendors whose products are otherwise based on System V.
The implementation of the TCP/IP networking protocol suite in 4.2BSD and 4.3BSD, and the availability of those systems, explain why the TCP/IP networking protocol suite is implemented so widely throughout the world. Numerous vendors have adapted the Berkeley networking implementations, whether their base system is 4.2BSD, 4.3BSD, 4.4BSD, System V, or even Digital Equipment Corporation's VMS or Microsoft's Winsock interface in Windows '95 and Windows/NT.
4BSD has also been a strong influence on the POSIX (IEEE Std 1003.1) operating-system interface standard, and on related standards. Several features--such as reliable signals, job control, multiple access groups per process, and the routines for directory operations--have been adapted from 4.3BSD for POSIX.
Material Covered in this Book
This book is about the internal structure of 4.4BSD Quarterman et al, 1985, and about the concepts, data structures, and algorithms used in implementing 4.4BSD's system facilities. Its level of detail is similar to that of Bach's book about UNIX System V Bach, 1986; however, this text focuses on the facilities, data structures, and algorithms used in the Berkeley variant of the UNIX operating system. The book covers 4.4BSD from the system-call level down--from the interface to the kernel to the hardware itself. The kernel includes system facilities, such as process management, virtual memory, the I/O system, filesystems, the socket IPC mechanism, and network protocol implementations. Material above the system-call level--such as libraries, shells, commands, programming languages, and other user interfaces--is excluded, except for some material related to the terminal interface and to system startup. Like Organick's book about Multics Organick, 1975, this book is an in-depth study of a contemporary operating system.
Where particular hardware is relevant, the book refers to the Hewlett-Packard HP300 (Motorola 68000-based) architecture. Because 4.4BSD was developed on the HP300, that is the architecture with the most complete support, so it provides a convenient point of reference.
Readers who will benefit from this book include operating-system implementors, system programmers, UNIX application developers, administrators, and curious users. The book can be read as a companion to the source code of the system, falling as it does between the manual CSRG, 1994 and the code in detail of treatment. But this book is specifically neither a UNIX programming manual nor a user tutorial (for a tutorial, see Libes & Ressler, 1988). Familiarity with the use of some version of the UNIX system (see, for example, Kernighan & Pike, 1984), and with the C programming language (see, for example, Kernighan & Ritchie, 1988) would be extremely useful.
Use in Courses on Operating Systems
This book is suitable for use as a reference text to provide background for a primary textbook in a second-level course on operating systems. It is not intended for use as an introductory operating-system textbook; the reader should have already encountered terminology such as memory management, process scheduling, and I/O systems Silberschatz & Galvin, 1994. Familiarity with the concepts of network protocols Tanenbaum, 1988; Stallings, 1993; Schwartz, 1987 will be useful for understanding some of the later chapters.
Exercises are provided at the end of each chapter. The exercises are graded into three categories indicated by zero, one, or two asterisks. The answers to exercises that carry no asterisks can be found in the text. Exercises with a single asterisk require a step of reasoning or intuition beyond a concept presented in the text. Exercises with two asterisks present major design projects or open research questions.
This text discusses both philosophical and design issues, as well as details of the actual implementation. Often, the discussion starts at the system-call level and descends into the kernel. Tables and figures are used to clarify data structures and control flow. Pseudocode similar to the C language is used to display algorithms. Boldface font identifies program names and filesystem pathnames. Italics font introduces terms that appear in the glossary and identifies the names of system calls, variables, routines, and structure names. Routine names (other than system calls) are further identified by the name followed by a pair of parenthesis (e.g., malloc() is the name of a routine, whereas argv is the name of a variable).
The book is divided into five parts, organized as follows:
Part 1, Overview
Three introductory chapters provide the context for the complete operating system and for the rest of the book. Chapter 1, History and Goals, sketches the historical development of the system, emphasizing the system's research orientation. Chapter 2, Design Overview of 4.4BSD, describes the services offered by the system, and outlines the internal organization of the kernel. It also discusses the design decisions that were made as the system was developed. Sections 2.3 through 2.14 in Chapter 2 give an overview of their corresponding chapter. Chapter 3, Kernel Services, explains how system calls are done, and describes in detail several of the basic services of the kernel.
Part 2, Processes
The first chapter in this part--Chapter 4, Process Management--lays the foundation for later chapters by describing the structure of a process, the algorithms used for scheduling the execution of processes, and the synchronization mechanisms used by the system to ensure consistent access to kernel-resident data structures. In Chapter 5, Memory Management, the virtual-memory!=management system is discussed in detail.
Part 3, I/O System
First, Chapter 6, I/O System Overview, explains the system interface to I/O and describes the structure of the facilities that support this interface. Following this introduction are four chapters that give the details of the main parts of the I/O system. Chapter 7, Local Filesystems, details the data structures and algorithms that implement filesystems as seen by application programs. Chapter 8, Local Filestores, describes how local filesystems are interfaced with local media. Chapter 9, The Network Filesystem, explains the network filesystem from both the server and client perspectives. Chapter 10, Terminal Handling, discusses support for character terminals, and provides a description of a character-oriented device driver.
Part 4, Interprocess Communication
Chapter 11, Interprocess Communication, describes the mechanism for providing communication between related or unrelated processes. Chapters 12 and 13, Network Communication and Network Protocols, are closely related, as the facilities explained in the former are implemented by specific protocols, such as the TCP/IP protocol suite, explained in the latter.
Part 5, System Operation
Chapter 14, System Startup, discusses system startup, shutdown, and configuration, and explains system initialization at the process level, from kernel initialization to user login.
The book is intended to be read in the order that the chapters are presented, but the parts other than Part 1 are independent of one another and can be read separately. Chapter 14 should be read after all the others, but knowledgeable readers may find it useful independently.
At the end of the book are a Glossary with brief definitions of major terms and an Index. Each chapter contains a Reference section with citations of related material.
Current information about the availability of 4.4BSD source code can be found at Addison-Wesley's web site. See the catalog listing for this book. At press time, the source code for the 4.4BSD-Lite Release 2 system, as well as that for the FreeBSD version of 4.4BSD, which is compiled and ready to run on PC-compatible hardware, are available from Walnut Creek CDROM. Contact Walnut Creek for more information at 1-800-786-9907, or use email@example.com, or http://www.cdrom.com/. The NetBSD distribution is compiled and ready to run on most workstation architectures. For more information, contact the NetBSD Project at majordomo@NetBSD.ORG (send a message body of "lists"), or http://www.NetBSD.ORG/. A fully supported commercial release, BSD/OS, is available from Berkeley Software Design, Inc., at 1-800-800-4273, firstname.lastname@example.org, or http://www.bsdi.com/. The 4.4BSD manuals are jointly published by Usenix and O'Reilly. O'Reilly sells the five volumes individually or in a set (ISBN 1-56592-082-1): 1-800-889-8969, email@example.com, or http://www.ora.com.
For you diehards who actually read to the end of the preface, your reward is finding out that you can get T-shirts that are a reproduction of the the original artwork drawn by John Lasseter for the cover of this book (yes, he is the John Lasseter of Walt Disney/Pixar fame who masterminded the production of "Toy Story"). These shirts were made available to the people who helped with the creation, reviewing, and editing of the book and to those folks who first reported errors in the book. A variation on these shirts that is clearly different from the originals (so as not to diminish the rarity of the ones that people had to work to get) is now available. For further information on purchasing a shirt, send a self-addressed envelope (United States residents please include return postage) toM. K. McKusick
Alternatively, you can send mail to mckusick@McKusick.COM with subject line "T-shirt Information Request" or visit the "History of BSD T-shirts" web page at http://www.zilker.net/users/beastie/index.html.
We extend special thanks to Mike Hibler (University of Utah) who coauthored Chapter 5 on memory management, and to Rick Macklem (University of Guelph), whose NFS papers provided much of the material on NFS for Chapter 9.
We also thank the following people who read and commented on nearly the entire book: Paul Abrahams (Consultant), Susan LoVerso (Orca Systems), George Neville-Neil (Wind River Systems), and Steve Stepanek (California State University, Northbridge).
We thank the following people, all of whom read and commented on early drafts of the book: Eric Allman (Pangaea Reference Systems), Eric Anderson (University of California at Berkeley), Mark Andrews (Alias Research), Mike Beede (Secure Computing Corporation), Paul Borman (Berkeley Software Design), Peter Collinson (Hillside Systems), Ben Cottrell (NetBSD user), Patrick Cua (De La Salle University, Philippines), John Dyson (The FreeBSD Project), Sean Eric Fagan (BSD developer), Mike Fester (Medieus Systems Corporation), David Greenman (The FreeBSD Project), Wayne Hathaway (Auspex Systems), John Heidemann (University of California at Los Angeles), Jeff Honig (Berkeley Software Design), Gordon Irlam (Cygnus Support), Alan Langerman (Orca Systems), Sam Leffler (Silicon Graphics), Casimir Lesiak (NASA/Ames Research Center), Gavin Lim (De La Salle University, Philippines), Steve Lucco (Carnegie Mellon University), Jan-Simon Pendry (Sequent, UK), Arnold Robbins (Georgia Institute of Technology), Peter Salus (UNIX historian), Wayne Sawdon (Carnegie Mellon University), Margo Seltzer (Harvard University), Keith Sklower (University of California at Berkeley), Keith Smith (Harvard University), and Humprey C. Sy (De La Salle University, Philippines).
This book was produced using James Clark's implementations of pic, tbl, eqn, and groff. The index was generated by awk scripts derived from indexing programs written by Jon Bentley and Brian Kernighan Bentley & Kernighan, 1986. Most of the art was created with xfig. Figure placement and widow elimination were handled by the groff macros, but orphan elimination and production of even page bottoms had to be done by hand.
We encourage readers to send us suggested improvements or comments about typographical or other errors found in the book; please send electronic mail to bsdbook-bugs@McKusick.COM.
Errata for ''The Design and Implementation of the 4.4BSD Operating Syatem''
Changes made since the first printing, May 1996: Page Chapter/Section Change xi c00/6.t ISDN => ISBN xvi c00/8.t principle architect => principal architect 293 c08/3.t could not written. => could not be written. 402 c12/sockaddrdl.xfig adl_data => sld_data 438 c13/arpanetaddr.fig host and IMP fields reversed
Changes made since the second printing, August 1996:
No changes were made in this printing.
Changes made since the third printing, October 1997:
Page Chapter/Section Change xi c00/7.t Northbridge => Northridge xv c00/8.t update winecellar hostname 3 c01/1.t concentrate the BSD => concentrate on the BSD 19 c01/refs Ritchie 1987 and Ritchie 1984b are reversed 23 c02/2.t specific architecture => specific architectures 59 c03/4.t raises and lowers => raises or lowers 80 c04/procstate.xfig protocol control block => process control block 89 c04/3.t parents => parent's 94 c04/4.t do => of 102 c04/7.t SIGCONT => SIGCONT signal 130 c05/3.t request => block 164-165 c05/fault.code extraneous "first_page = NULL" in [H] 178 c05/D.t directory pmap.c => directory in the file pmap.c 213 c06/4.t show that the => show the 228 c06/6.t process => processes 229 c06/6.t function => functions 242 c07/1.t Stateless filesystem => Stateless filesystems 242 c07/1.t lookup, is started => lookup is started, 247 c07/fstree.fig ls => groff; 8 => 10 to correspond to Fig 7.6 262 c07/6.t allow a 64-bit files => allow 64-bit files 271 c08/2.t than is the => than in the 271 c08/2.t old filesystem had 1024-byte blocks, 8x transfer 300 c08/3.t likely be => likely to be 320 c09/2.t can improved => can improve 323-324 c09/2.t some other vendors support TCP in NFS V2 345 c10/6.t interchange getc and putc return value 352 c10/8.t terminal member of session and process group 363-364 c11/1.t pipes have 4 of 6 properties 383 c11/6.t and of => and for 407 c12/2.t source and destination reversed in PRU_CONNECT 415 c12/4.t IF_ENQUEUE(&ipintr, => IF_ENQUEUE(&ipintrq, 418-419 c12/rtentry.tbl add rt_use entry 422 c12/5.t search for 22.214.171.124 starting from this point 422 c12/5.t later in the search => earlier in the search 430 c12/8.t the number of the data => the number of data 430 c12/8.t addition combinations => additional combinations 432 c12/8.t from Ethernet device => from the Ethernet device 441 c13/1.t but allows the => but allow the 441 c13/1.t on networks. => on other networks. 445 c13/2.t when Ip forwards => when IP forwards 451-452 c13/4.t at about 250 Kbyte per second => at about 250KHz 460 c13/5.t add "or by receiving an ACK for a FIN" 460 c13/5.t add "another ACK should be sent" 460 c13/5.t missing => resent 462 c13/5.t local, => local and; delete spurious "and" 468 c13/7.t data returns => data return 471 c13/7.t provide new => provide a new 471 c13/7.t the than receiver's => than the receiver's 473 c13/7.t dropped packet => retransmitted packet 477 c13/7.t that it => that the sender 480 c13/9.t section Section => Section 485 c13/e.t the passive server tried => the server tried 486 c13/e.t xref to exercise 12.25 => 13.25