4.4 Thread Scheduling

Traditionally, the FreeBSD scheduler had an ill-defined set of hooks spread through the kernel. In FreeBSD 5.0, these hooks were regularized and a well-defined API was created so that different schedulers could be developed. Since FreeBSD 5.0, the kernel has had two schedulers available:

  • The historic 4.4BSD scheduler found in the file /sys/kern/sched_4bsd.c. This scheduler is described at the beginning of this section.

  • The new ULE scheduler first introduced in FreeBSD 5.0 and found in the file /sys/kern/sched_ule.c [Roberson, 2003]. The name is not an acronym. If the underscore in its file name is removed, the rationale for its name becomes apparent. This scheduler is described at the end of this section.

Because a busy system makes thousands of scheduling decisions per second, the speed with which scheduling decisions are made is critical to the performance of the system as a whole. Other UNIX systems have added a dynamic scheduler switch that must be traversed for every scheduling decision. To avoid this overhead, FreeBSD requires that the scheduler be selected at the time the kernel is built. Thus, all calls into the scheduling code are resolved at compile time rather than going through the overhead of an indirect function call for every scheduling decision. By default, kernels up through FreeBSD 5.1 use the 4.4BSD scheduler. Beginning with FreeBSD 5.2, the ULE scheduler is used by default.

The 4.4BSD Scheduler

All threads that are runnable are assigned a scheduling priority that determines in which run queue they are placed. In selecting a new thread to run, the system scans the run queues from highest to lowest priority and chooses the first thread on the first nonempty queue. If multiple threads reside on a queue, the system runs them round robin; that is, it runs them in the order that they are found on the queue, with equal amounts of time allowed. If a thread blocks, it is not put back onto any run queue. If a thread uses up the time quantum (or time slice) allowed it, it is placed at the end of the queue from which it came, and the thread at the front of the queue is selected to run.
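
To make the queue scan and round-robin requeueing concrete, here is a minimal user-space sketch in C. It is not the kernel's code: the helper names mirror the kernel's, but the bodies are illustrative, and the 64-queue, divide-by-4 indexing it borrows is described later in this section.

    #include <stdio.h>
    #include <stddef.h>

    #define NQUEUE 64

    struct thread {
        struct thread *next;
        int priority;
        const char *name;
    };

    static struct thread *runq[NQUEUE];     /* head of each run queue */

    /* Place a thread at the tail of the queue for its priority. */
    static void
    runq_add(struct thread *td)
    {
        struct thread **tp = &runq[td->priority / 4];

        while (*tp != NULL)
            tp = &(*tp)->next;
        td->next = NULL;
        *tp = td;
    }

    /* Scan from highest to lowest priority; take the head of the
     * first nonempty queue. */
    static struct thread *
    runq_choose(void)
    {
        int i;

        for (i = 0; i < NQUEUE; i++) {
            if (runq[i] != NULL) {
                struct thread *td = runq[i];

                runq[i] = td->next;
                return (td);
            }
        }
        return (NULL);                      /* nothing runnable: idle */
    }

    int
    main(void)
    {
        struct thread a = { NULL, 180, "a" };
        struct thread b = { NULL, 180, "b" };
        struct thread *td;

        runq_add(&a);
        runq_add(&b);
        td = runq_choose();
        printf("run %s\n", td->name);       /* "a" runs first */
        runq_add(td);                       /* quantum expired: tail */
        printf("run %s\n", runq_choose()->name); /* then "b" */
        return (0);
    }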

The shorter the time quantum, the better the interactive response. However, longer time quanta provide higher system throughput because the system will have less overhead from doing context switches and processor caches will be flushed less often. The time quantum used by FreeBSD is 0.1 second. This value was empirically found to be the longest quantum that could be used without loss of the desired response for interactive jobs such as editors. Perhaps surprisingly, the time quantum has remained unchanged over the past 20 years. Although the time quantum was originally selected on centralized timesharing systems with many users, it is still correct for decentralized workstations today. While workstation users expect a response time faster than that anticipated by the timesharing users of 20 years ago, the shorter run queues on the typical workstation make a shorter quantum unnecessary.

Time-Share Thread Scheduling

The FreeBSD time-share-scheduling algorithm is based on multilevel feedback queues. The system adjusts the priority of a thread dynamically to reflect resource requirements (e.g., being blocked awaiting an event) and the amount of resources consumed by the thread (e.g., CPU time). Threads are moved between run queues based on changes in their scheduling priority (hence the word feedback in the name multilevel feedback queue). When a thread other than the currently running thread attains a higher priority (by having that priority either assigned or given when it is awakened), the system switches to that thread immediately if the current thread is in user mode. Otherwise, the system switches to the higher-priority thread as soon as the current thread exits the kernel. The system tailors this short-term scheduling algorithm to favor interactive jobs by raising the scheduling priority of threads that are blocked waiting for I/O for 1 or more seconds and by lowering the priority of threads that accumulate significant amounts of CPU time.

Short-term thread scheduling is broken up into two parts. The next section describes when and how a thread's scheduling priority is altered; the section after that describes the management of the run queues and the interaction between thread scheduling and context switching.

Calculations of Thread Priority

A thread's scheduling priority is determined directly by two values associated with the thread structure: kg_estcpu and kg_nice. The value of kg_estcpu provides an estimate of the recent CPU utilization of the thread. The value of kg_nice is a user-settable weighting factor that ranges numerically between -20 and 20. The normal value for kg_nice is 0. Negative values increase a thread's priority, whereas positive values decrease its priority.

A thread's user-mode scheduling priority is recalculated by this equation after every four clock ticks (typically 40 milliseconds) that it has been found running:

Equation 4.1

$$priority = PRI\_MIN\_TIMESHARE + \left\lfloor \frac{kg\_estcpu}{4} \right\rfloor + 2 \times kg\_nice$$

Values less than PRI_MIN_TIMESHARE (160) are set to PRI_MIN_TIMESHARE (see Table 4.2); values greater than PRI_MAX_TIMESHARE (223) are set to PRI_MAX_TIMESHARE. This calculation causes the priority to decrease linearly based on recent CPU utilization. The user-controllable kg_nice parameter acts as a limited weighting factor. Negative values retard the effect of heavy CPU utilization by offsetting the additive term containing kg_estcpu. Otherwise, if we ignore the second term, kg_nice simply shifts the priority by a constant amount.
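
As a sketch, assuming Eq. 4.1 has the form reconstructed above and the PRI_MIN_TIMESHARE and PRI_MAX_TIMESHARE bounds from Table 4.2, the calculation and its clamping might look like this:

    #define PRI_MIN_TIMESHARE 160
    #define PRI_MAX_TIMESHARE 223

    /* Compute a thread's user-mode priority from its recent CPU
     * utilization and nice value, clamped to the timeshare range. */
    static int
    user_priority(int kg_estcpu, int kg_nice)
    {
        int pri;

        pri = PRI_MIN_TIMESHARE + kg_estcpu / 4 + 2 * kg_nice;
        if (pri < PRI_MIN_TIMESHARE)
            pri = PRI_MIN_TIMESHARE;
        else if (pri > PRI_MAX_TIMESHARE)
            pri = PRI_MAX_TIMESHARE;
        return (pri);
    }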

The CPU utilization, kg_estcpu, is incremented each time that the system clock ticks and the thread is found to be executing. In addition, kg_estcpu is adjusted once per second via a digital decay filter. The decay causes about 90 percent of the CPU usage accumulated in a 1-second interval to be forgotten over a period of time that is dependent on the system load average. To be exact, kg_estcpu is adjusted according to

Equation 4.2

$$kg\_estcpu = \frac{2 \times load}{2 \times load + 1} \times kg\_estcpu + kg\_nice$$

where the load is a sampled average of the sum of the lengths of the run queue and of the short-term sleep queue over the previous 1-minute interval of system operation.
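
A one-line sketch of the filter, assuming the reconstructed form of Eq. 4.2; the load argument would come from the kernel's load-average sampling:

    /* Apply the once-per-second decay of Eq. 4.2 to kg_estcpu. */
    static double
    decay_estcpu(double kg_estcpu, double load, int kg_nice)
    {
        return ((2.0 * load) / (2.0 * load + 1.0) * kg_estcpu + kg_nice);
    }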

To understand the effect of the decay filter, we can consider the case where a single compute-bound thread monopolizes the CPU. The thread's CPU utilization will accumulate clock ticks at a rate dependent on the clock frequency. The load average will be effectively 1, resulting in a decay of

$$kg\_estcpu = \frac{2}{3} \times kg\_estcpu$$

If we assume that the thread accumulates $T_i$ clock ticks over time interval $i$ and that kg_nice is zero, then the CPU utilization for each time interval will count into the current value of kg_estcpu according to

$$kg\_estcpu = \frac{2}{3}\,T_4 + \left(\frac{2}{3}\right)^2 T_3 + \left(\frac{2}{3}\right)^3 T_2 + \left(\frac{2}{3}\right)^4 T_1 + \left(\frac{2}{3}\right)^5 T_0$$

Thus, after five decay calculations, only 13 percent of $T_0$ remains present in the current CPU utilization value for the thread. Since the decay filter is applied once per second, we can also say that about 90 percent of the CPU utilization is forgotten after 5 seconds.
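
The 13 percent figure is easy to check: this small stand-alone program applies the load-of-1 decay factor of 2/3 five times.

    #include <stdio.h>

    int
    main(void)
    {
        double remaining = 1.0;
        int i;

        for (i = 1; i <= 5; i++) {
            remaining *= 2.0 / 3.0;
            printf("after %d decay(s): %.3f\n", i, remaining);
        }
        /* The final line prints 0.132: about 13 percent of T0. */
        return (0);
    }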

Threads that are runnable have their priority adjusted periodically as just described. However, the system ignores threads blocked awaiting an event: These threads cannot accumulate CPU usage, so an estimate of their filtered CPU usage can be calculated in one step. This optimization can significantly reduce a system's scheduling overhead when many blocked threads are present. The system recomputes a thread's priority when that thread is awakened and has been sleeping for longer than 1 second. The system maintains a value, kg_slptime, that is an estimate of the time a thread has spent blocked waiting for an event. The value of kg_slptime is set to 0 when a thread calls sleep() and is incremented once per second while the thread remains in a SLEEPING or STOPPED state. When the thread is awakened, the system computes the value of kg_estcpu according to

Equation 4.3

$$kg\_estcpu = \left(\frac{2 \times load}{2 \times load + 1}\right)^{kg\_slptime} \times kg\_estcpu$$

and then recalculates the scheduling priority using Eq. 4.1. This analysis ignores the influence of kg_nice; also, the load used is the current load average rather than the load average at the time that the thread blocked.
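
A sketch of the one-step wakeup adjustment, assuming the reconstructed Eq. 4.3: the per-second decay is simply raised to the power of the number of seconds slept.

    #include <math.h>

    /* Collapse kg_slptime seconds of decay into one calculation. */
    static double
    updatepri_estcpu(double kg_estcpu, double load, unsigned kg_slptime)
    {
        double decay = (2.0 * load) / (2.0 * load + 1.0);

        return (pow(decay, (double)kg_slptime) * kg_estcpu);
    }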

Thread-Priority Routines

The priority calculations used in the short-term scheduling algorithm are spread out in several areas of the system. Two routines, schedcpu() and roundrobin(), run periodically. Schedcpu() recomputes thread priorities once per second, using Eq. 4.2, and updates the value of kg_slptime for threads blocked by a call to sleep(). The roundrobin() routine runs 10 times per second and causes the system to reschedule the threads in the highest-priority (nonempty) queue in a round-robin fashion, which allows each thread a 100-millisecond time quantum.

The CPU usage estimates are updated in the system clock-processing module, hardclock(), which executes 100 times per second. Each time that a thread accumulates four ticks in its CPU usage estimate, kg_estcpu, the system recalculates the priority of the thread. This recalculation uses Eq. 4.1 and is done by the resetpriority() routine. The decision to recalculate after four ticks is related to the management of the run queues described in the next section. In addition to being issued from hardclock(), each time setrunnable() places a thread on a run queue, it also calls resetpriority() to recompute the thread's scheduling priority. A call from wakeup() to setrunnable() operates on a thread other than the currently running thread, so setrunnable() invokes updatepri() to recalculate the CPU usage estimate according to Eq. 4.3 before calling resetpriority(). The relationship of these functions is shown in Figure 4.5.
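
The tick accounting described above might be sketched as follows; the structure and helper are illustrative stand-ins for the kernel's thread state and its call to resetpriority():

    struct sched_td {
        int kg_estcpu;      /* recent CPU utilization, in ticks */
        int ticks;          /* ticks since the last recalculation */
    };

    /* Called for the running thread on every clock tick. */
    static void
    sched_clock_tick(struct sched_td *td)
    {
        td->kg_estcpu++;
        if (++td->ticks == 4) {
            td->ticks = 0;
            /* resetpriority() would apply Eq. 4.1 here. */
        }
    }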

Figure 4.5 Procedural interface to priority calculation.

Thread Run Queues and Context Switching

The kernel has a single set of run queues to manage all the thread-scheduling classes shown in Table 4.2. The scheduling-priority calculations described in the previous section are used to order the set of timesharing threads into the priority ranges between 160 and 223. The priorities of the real-time threads and the idle threads are set by the applications themselves but are constrained by the kernel to be within the ranges 128 to 159 and 224 to 255, respectively. The number of queues used to hold the collection of all runnable threads in the system affects the cost of managing the queues. If only a single (ordered) queue is maintained, then selecting the next runnable thread becomes simple, but other operations become expensive. Using 256 different queues can significantly increase the cost of identifying the next thread to run. The system uses 64 run queues, selecting a run queue for a thread by dividing the thread's priority by 4. To save time, the threads on each queue are not further sorted by their priorities.

The run queues contain all the runnable threads in main memory except the currently running thread. Figure 4.6 shows how each queue is organized as a doubly linked list of thread structures. The head of each run queue is kept in an array. Associated with this array is a bit vector, rq_status, that is used in identifying the nonempty run queues. Two routines, runq_add() and runq_remove(), are used to place a thread at the tail of a run queue and to take a thread off the head of a run queue. The heart of the scheduling algorithm is the runq_choose() routine, which selects a new thread to run. It operates as follows (a sketch in C appears after Figure 4.6):

  1. Ensure that our caller acquired the sched_lock.

  2. Locate a nonempty run queue by finding the location of the first nonzero bit in the rq_status bit vector. If rq_status is zero, there are no threads to run, so select the idle loop thread.

  3. Given a nonempty run queue, remove the first thread on the queue.

  4. If this run queue is now empty as a result of removing the thread, reset the appropriate bit in rq_status.

  5. Return the selected thread.

Figure 4.6 Queueing structure for runnable threads.
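
Here is the user-space sketch of runq_choose() promised above. A 64-bit integer stands in for the kernel's rq_status bit vector, and the sched_lock check of step 1 is omitted; lower queue indices hold higher-priority threads, so the first nonzero bit identifies the best nonempty queue.

    #include <stddef.h>
    #include <stdint.h>

    #define NQUEUE 64

    struct thread { struct thread *next; };

    static struct thread *queues[NQUEUE];
    static uint64_t rq_status;      /* bit i set => queues[i] nonempty */

    static struct thread *
    runq_choose(void)
    {
        struct thread *td;
        int i;

        if (rq_status == 0)
            return (NULL);          /* step 2: run the idle loop thread */
        /* Step 2: lowest set bit (GCC/Clang builtin; the kernel uses
         * an ffs()-style primitive). */
        i = __builtin_ctzll(rq_status);
        /* Step 3: remove the first thread on that queue. */
        td = queues[i];
        queues[i] = td->next;
        /* Step 4: if the queue went empty, clear its bit. */
        if (queues[i] == NULL)
            rq_status &= ~((uint64_t)1 << i);
        return (td);                /* step 5 */
    }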

The context-switch code is broken into two parts. The machine-independent code resides in mi_switch(); the machine-dependent part resides in cpu_switch(). On most architectures, cpu_switch() is coded in assembly language for efficiency.

Given the mi_switch() routine and the thread-priority calculations, the only missing piece in the scheduling facility is how the system forces an involuntary context switch. Remember that voluntary context switches occur when a thread calls the sleep() routine. Sleep() can be invoked only by a runnable thread, so sleep() needs only to place the thread on a sleep queue and to invoke mi_switch() to schedule the next thread to run. Often an interrupt thread will not want to sleep() itself but will be delivering data that will cause the kernel to want to run a different thread than the one that was running before the interrupt. Thus, the kernel needs a mechanism to request that an involuntary context switch be done at the conclusion of the interrupt.

This mechanism is handled by setting the currently running thread's TDF_NEEDRESCHED flag and then posting an asynchronous system trap (AST). An AST is a trap that is delivered to a thread the next time that that thread is preparing to return from an interrupt, a trap, or a system call. Some architectures support ASTs directly in hardware; other systems emulate ASTs by checking an AST flag at the end of every system call, trap, and interrupt. When the hardware AST trap occurs or the AST flag is set, the mi_switch() routine is called instead of the current thread resuming execution. Rescheduling requests are made by the swi_sched(), resetpriority(), setrunnable(), wakeup(), roundrobin(), and schedcpu() routines.
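
The check itself is simple, as in this sketch of the return path; TDF_NEEDRESCHED is the real flag name, but the flag value and surrounding structure here are illustrative:

    #define TDF_NEEDRESCHED 0x04    /* illustrative value */

    struct ast_td { int td_flags; };

    /* Run just before returning from a system call, trap, or
     * interrupt. */
    static void
    ast_check(struct ast_td *td)
    {
        if (td->td_flags & TDF_NEEDRESCHED) {
            td->td_flags &= ~TDF_NEEDRESCHED;
            /* mi_switch() would be called here rather than letting
             * the current thread resume. */
        }
    }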

With the advent of multiprocessor support, FreeBSD can preempt threads executing in kernel mode. However, such preemption is generally not done, so the worst-case real-time response to events is defined by the longest path through the top half of the kernel. Since the system guarantees no upper bound on the duration of a system call, FreeBSD is decidedly not a real-time system. Attempts to retrofit BSD with real-time thread scheduling have addressed this problem in different ways [Ferrin & Langridge, 1980; Sanderson et al., 1986].

The ULE Scheduler

The ULE scheduler was developed as part of the overhaul of FreeBSD to support SMP. A new scheduler was written for several reasons:

  • To address the need for processor affinity in SMP systems

  • To provide better support for symmetric multithreading (SMT): processors with multiple on-chip CPU cores

  • To improve the performance of the scheduling algorithm so that it is no longer dependent on the number of threads in the system

The goal of a multiprocessor system is to apply the power of multiple CPUs to a problem, or set of problems, to achieve a result in less time than it would take on a single-processor system. If a system has the same number of runnable threads as it does CPUs, then achieving this goal is easy: Each runnable thread gets a CPU to itself and runs to completion. Typically, though, there are many runnable threads competing for a few processors. One job of the scheduler is to ensure that the CPUs are always busy and are not wasting their cycles. When a thread completes its work, or is blocked waiting for resources, it is removed from the processor on which it was running. While a thread is running on a processor, it brings its working set (the instructions it is executing and the data on which it is operating) into the memory cache of the CPU. Migrating a thread has a cost: when a thread is moved from one processor to another, its in-cache working set is lost and must be rebuilt in the cache of the CPU to which it has been migrated. The performance of an SMP system with a naive scheduler that does not take this cost into account can fall beneath that of a single-processor system. The term processor affinity describes a scheduler that migrates threads only when necessary to give an idle processor something to do.

Many microprocessors now provide support for symmetric multithreading, where the processor is built out of multiple CPU cores, each of which can execute a thread. The CPU cores in an SMT processor share all the processor's resources, such as memory caches and access to main memory, so they are more tightly synchronized than the processors in an SMP system. From a thread's perspective, it does not know that there are other threads running on the same processor because the processor is handling them independently. The one piece of code in the system that needs to be aware of the multiple cores is the scheduling algorithm. The SMT case is a slightly different version of the processor-affinity problem presented by an SMP system. Each CPU core can be seen as a processor with its own set of threads. In an SMP system composed of CPUs that support SMT, the scheduler treats each core on a processor as a less powerful resource, but one to which it is cheaper to migrate threads.

The original FreeBSD scheduler maintains a global list of threads that it traverses once per second to recalculate their priorities. The use of a single list for all threads means that the performance of the scheduler is dependent on the number of tasks in the system, and as the number of tasks grows, more CPU time must be spent in the scheduler maintaining the list. A design goal of the ULE scheduler was to avoid the need to consider all the runnable threads in the system to make a scheduling decision.

The ULE scheduler creates a set of three queues for each CPU in the system. Having per-processor queues makes it possible to implement processor affinity in an SMP system.

One queue is the idle queue, where all idle threads are stored. The other two queues are designated current and next. Threads are picked to run, in priority order, from the current queue until it is empty, at which point the current and next queues are swapped and scheduling is started again. Threads in the idle queue are run only when the other two queues are empty. Real-time and interrupt threads are always inserted into the current queue so that they will have the least possible scheduling latency. Interactive threads are also inserted into the current queue to keep the interactive response of the system acceptable. A thread is considered to be interactive if the ratio of its voluntary sleep time to its run time is below a certain threshold. The interactivity threshold is defined in the ULE code and is not configurable. ULE uses two equations to compute the interactivity score of a thread. For threads whose sleep time exceeds their run time, Eq. 4.4 is used:

Equation 4.4

$$score = \frac{scaling\ factor}{sleep\ time \,/\, run\ time}$$

When a thread's run time exceeds its sleep time, Eq. 4.5 is used instead.

Equation 4.5

$$score = \frac{scaling\ factor}{run\ time \,/\, sleep\ time} + scaling\ factor$$

The scaling factor is the maximum interactivity score divided by two. Threads that score below the interactivity threshold are considered to be interactive; all others are noninteractive. The sched_interact_update() routine is called at several points in a thread's existence (for example, when the thread is awakened by a wakeup() call) to update the thread's run time and sleep time. The sleep- and run-time values are allowed to grow only to a certain limit. When the sum of the run time and sleep time passes the limit, they are reduced to bring them back into range. An interactive thread whose sleep history was not remembered at all would not remain interactive, resulting in a poor user experience. Remembering an interactive thread's sleep time for too long would allow the thread to get more than its fair share of the CPU. The amount of history that is kept and the interactivity threshold are the two values that most strongly influence a user's interactive experience on the system.
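
A sketch of the score computation, assuming the reconstructed Eqs. 4.4 and 4.5. The SCHED_INTERACT_* names follow the convention used in sched_ule.c, but the values here are illustrative:

    #define SCHED_INTERACT_MAX  100
    #define SCHED_INTERACT_HALF (SCHED_INTERACT_MAX / 2) /* scaling factor */

    /* Lower scores are more interactive; a score below the (fixed)
     * threshold marks the thread as interactive. */
    static int
    interact_score(double sleep_time, double run_time)
    {
        if (sleep_time > run_time)                        /* Eq. 4.4 */
            return ((int)(SCHED_INTERACT_HALF / (sleep_time / run_time)));
        if (run_time > sleep_time)                        /* Eq. 4.5 */
            return ((int)(SCHED_INTERACT_HALF / (run_time / sleep_time)) +
                SCHED_INTERACT_HALF);
        return (SCHED_INTERACT_HALF);   /* equal histories: boundary */
    }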

Noninteractive threads are put into the next queue and are scheduled to run when the queues are switched. Switching the queues guarantees that a thread gets to run at least once every two queue switches regardless of priority, which ensures fair sharing of the processor.
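
The switch itself is just a pointer swap, as in this sketch; kseq is the structure name ULE uses for its per-CPU queues, though the fields and helper shown here are illustrative:

    struct thread;                  /* per-priority lists omitted */

    struct runq { struct thread *head; };

    struct kseq {
        struct runq  q[2];
        struct runq *curr;          /* picked from until empty */
        struct runq *next;          /* noninteractive arrivals queue here */
    };

    /* Swap the queues once "current" has been drained, so every
     * queued thread runs within two switches. */
    static void
    kseq_queue_switch(struct kseq *kq)
    {
        struct runq *tmp = kq->curr;

        kq->curr = kq->next;
        kq->next = tmp;
    }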

Two mechanisms are used to migrate threads among multiple processors. When a CPU has no work to do in any of its queues, it sets a bit in a bitmask, shared by all processors, that indicates that it is idle. Whenever an active CPU is about to add work to its own run queue, it first checks whether it has excess work and whether another processor in the system is idle. If an idle processor is found, then the thread is migrated to the idle processor using an interprocessor interrupt (IPI). Making a migration decision by inspecting a shared bitmask is much faster than scanning the run queues of all the other processors. Seeking out idle processors when adding a new task works well because it spreads the load when it is presented to the system.
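
A sketch of the idle-processor check, with a shared 32-bit mask standing in for the kernel's bitmask. The real code uses atomic operations and then delivers an IPI to wake the chosen CPU:

    #include <stdint.h>

    static volatile uint32_t idle_cpus; /* bit n set => CPU n is idle */

    /* Return an idle CPU to which excess work can be pushed, or -1
     * if every CPU is busy and the thread should be queued locally. */
    static int
    find_idle_cpu(void)
    {
        uint32_t mask = idle_cpus;

        if (mask == 0)
            return (-1);
        return (__builtin_ctz(mask));   /* lowest-numbered idle CPU */
    }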

The second form of migration, called push migration, is done by the system on a periodic basis and more aggressively offloads work to other processors in the system. Twice per second the sched_balance() routine picks the most-loaded and least-loaded processors in the system and equalizes their run queues. The balancing is done only between two processors because it was thought that two-processor systems would be the most common and to prevent the balancing algorithm from becoming too complex and adversely affecting the performance of the scheduler. Push migration ensures fairness among the runnable threads. For example, with three runnable threads on a two-processor system, it would be unfair for one thread to get a processor to itself while the other two had to share the second processor. By pushing a thread from the processor with two threads to the processor with one thread, no single thread would get to run alone indefinitely.
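
A sketch of the balancing pass under the description above; the per-CPU load counts and the helper name are illustrative:

    /* Equalize the most- and least-loaded CPUs; nrunnable[i] counts
     * the runnable threads queued on CPU i. */
    static void
    sched_balance_sketch(int nrunnable[], int ncpu)
    {
        int hi = 0, lo = 0, i;

        for (i = 1; i < ncpu; i++) {
            if (nrunnable[i] > nrunnable[hi])
                hi = i;
            if (nrunnable[i] < nrunnable[lo])
                lo = i;
        }
        while (nrunnable[hi] - nrunnable[lo] > 1) {
            nrunnable[hi]--;    /* one thread is pushed via an IPI */
            nrunnable[lo]++;
        }
    }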

Handling the SMT case is a derivative form of load balancing among full-fledged CPUs and is handled by processor groups. Each CPU core in an SMT processor is given its own kseq structure, and these structures are grouped under a kseq_group structure. An example of a single processor with two cores is shown in Figure 4.7. In an SMP system with multiple SMT-capable processors, there would be one processor group per CPU. When the scheduler is deciding to which processor or core to migrate a thread, it will try to pick a core on the same processor before picking one on another processor because that is the lowest-cost migration path.

Figure 4.7 Processor with two cores.
