Operating Systems 101: Virtualisation — CPU Scheduling(iii)

E.Y.

6 min readJun 16, 2021

This is a course note of an online course I took relating to the Operating System on educative.io.

Chapter 3. CPU Scheduling

A historical review of different CPU scheduling strategies and their pros and cons.

Two important assessing metrics :

turnaround time: time from when the job arrives to the time it completes
response time: time from when the job arrives in a system to the first time it is scheduled

Time-based rules

FIFO: First In, First Out : Running with FIFO can run into convoy effect, where a number of relatively-short potential consumers of a resource get queued behind a heavyweight resource consumer.

SJF: Shortest Job First : it runs the shortest job first, then the next shortest, and so on.

Shortest Time-to-Completion First (STCF): Any time a new job enters the system, the STCF scheduler determines which of the remaining jobs has the least time left, and schedules that one.

Round-Robin (RR): Instead of running jobs to completion, RR runs a job for a time slice (a scheduling quantum) and then switches to the next job in the run queue. It rotates until all jobs are finished. Note that the length of a time slice must be a multiple of the timer-interrupt period; thus if the timer interrupts every 10 milliseconds, the time slice could be 10, or any other multiple of 10 ms.

Priority-based rules

T MLFQ has a number of distinct queues, each assigned a different priority level. MLFQ uses priorities to decide which job should run at a given time in each job queue.If more than one job in a same queue and have the same priority, then round-robin rule is applied among those jobs.

Rule A: If Priority(A) > Priority(B), AA runs (BB doesn’t).
Rule B: If Priority(A) = Priority(B), AA & BB run in RR.

Note that the queue is dynamic, meaning MLFQ change the priority of a job based on its observed behaviour. So if a job involves frequent I/O input, its priority becomes higher. If a job occupies CPU for long time, its priority becomes lower.

The priority of a job is decided as such:

When a job enters the system, it is placed at the highest priority.
If a job uses up an entire time slice while running, its priority is reduced.
If a job gives up the CPU before the time slice is up, it stays at the same priority level.

Problems with MLFQ

Starvation: if there are “too many” interactive jobs in the system, they will combine to consume all CPU time, and thus long-running jobs will never receive any CPU time (they starve).
Game the scheduler. E.g. some deceptive job might issue a random I/O operation before the time slice finishes and thus relinquish the CPU, to remain in the same queue, and thus gain a higher percentage of CPU time.

Preventing starvation

To prevent CPU-bound task starving, we can periodically boost the priority of all the jobs by moving all the jobs to the topmost queue.

As a result, a CPU-bound job will share the CPU with other high-priority jobs in a round-robin fashion, and thus eventually receive service. Also if a CPU-bound job has become interactive, the scheduler treats it properly once it has received the priority boost.

Preventing gaming of scheduler

The solution here is to keep track of the time slice any process uses at a give period and once a process has used its quota, it is downgrade to the next priority queue. Whether it uses the time slice in one long burst or many small ones does not matter.

Fairness-based rules

How can we design a scheduler to share the CPU in a proportional manner?

Lottery scheduling:

We can use one basic concept: tickets to represent the share of a resource that a process should receive. Lottery scheduling achieves this probabilistically by holding a lottery every so often. The winning ticket determines which process runs.

Ticket currency

Currency allows a user with a set of tickets to allocate tickets among their own jobs in whatever currency they would like; the system then automatically converts said currency into the correct global value.

User A 100 (global currency)
   -> 500 (A’s currency) to A1 ->  50 (global currency)
   -> 500 (A’s currency) to A2 ->  50 (global currency)
 User B (global currency)
   ->  10 (B’s currency) to B1 -> 100 (global currency)

Ticket transfer

With transfers, a process can temporarily hand off its tickets to another process. This ability is especially useful in a client/server setting, where a client process sends a message to a server asking it to do some work on the client’s behalf.

Ticket inflation

With inflation, a process can temporarily raise or lower the number of tickets it owns. In an environment where a group of processes trusts one another; if a process needs more CPU time, it can boost its ticket value as a way to reflect that need to the system.

Implementation

All you need is a random number generator to pick the winning ticket, a data structure to track the processes of the system (e.g., a list), and the total number of tickets.

The Linux Completely Fair Scheduler (CFS)

To achieve its efficiency goals, CFS aims to spend very little time making scheduling decisions. Its goal is to fairly divide a CPU evenly among all competing processes. It does so through a simple counting-based technique known as virtual runtime (vruntime).

As each process runs, it accumulates vruntime. In the most basic case, each process’s vruntime increases at the same rate, in proportion with physical time. When a scheduling decision occurs, CFS will pick the process with the lowest vruntime to run next.

Weighting

CFS also enables controls over process priority through the nice level of a process. A positive nice value imply lower priority and negative value imply higher priority; when you’re too nice, you just don’t get as much (scheduling) attention. CFS maps the nice value of each process to a weight:

static const int prio_to_weight[40] = {/* -20 */ 88761, 71755, 56483, 46273, 36291};

One major focus of CFS is efficiency, aka. find the next job to run as quickly as possible. It addresses this by keeping processes in a red-black tree.

Multi-CPU-based rules

Hardware caches

The problem with multi-CPU scheduling lies with hardware caches.

Hardware caches based on locality, temporal locality and spatial locality. For temporal locality: if a piece of data is accessed, it is likely to be accessed again in the near future. For spatial locality, if a program accesses a data item at address x, it is likely to access data items near x.

Cache coherence

The basic solution is provided by the hardware by monitoring memory accessed. One way to do this is bus snooping — each cache pays attention to memory updates by observing the bus that connects them to main memory. When a CPU then sees an update for a data item it holds in its cache, it will notice the change and either invalidate its copy or update it .

Multi-queue multiprocessor scheduling (MQMS)

In MQMS, our basic scheduling framework consists of multiple scheduling queues. Each queue will likely follow a particular scheduling discipline, such as round robin. When a job enters the system, it is placed on exactly one scheduling queue, according to some heuristic. Then it is scheduled essentially independently, thus avoiding the problems of information sharing and synchronisation found in the single-queue approach.

Affinity mechanism: To handle this problem, most MQMS provide affinity for some jobs by keeping them in one queue, but move others around to balance the load.
Load imbalance: MQMS achieves load balance through migration, e.g. work stealing — a (source) queue that is low on jobs will occasionally peek at another (target) queue, to see how full it is. If the target queue is more full than the source queue, the source will “steal” one or more jobs from the target to help balance load.

Linux Multiprocessor Schedulers

Interestingly, in the Linux community, three different schedulers arose: the O(1) scheduler, the Completely Fair Scheduler (CFS), and the BF Scheduler (BFS).

The O(1) scheduler is a priority-based multi-queue scheduler (similar to the MLFQ).
CFS, in contrast, is a deterministic proportional-share multi-queue scheduler (more like Stride scheduling).
BFS, the single-queue scheduler, also proportional-share, but based on a more complicated scheme known as Earliest Eligible Virtual Deadline First (EEVDF).

That’s so much of it!

Happy Reading!