[paper review] Fairness and Interactive Performance of O(1) and CFS Linux Kernel Schedulers
Abstract
- For both medium and high load, CFS has a higher average latency (5 ms ~ 10 ms) than O(1).
- The greater the load, the greater the average latency.
- The CFS parameters also have an impact on the average / tail latency.
- The nice value also affects CFS: the lower the nice value, the greater the time slice.
- The nice value also affects O(1): it determines each task's base time slice.
Review
paper: Fairness and interactive performance of O(1) and CFS Linux kernel schedulers
The test results are based on a Linux desktop workload, not a server workload.
The design goals of CFS are to provide fair CPU resource allocation among executing tasks without sacrificing interactive performance.
O(1)
… static priorities correspond to nice values … from -20 (highest priority) to 19 (lowest) …
Each task is allocated a base time slice based on its static priority/nice value using the equation:
$$\text{Base Time Slice (ms)} = \begin{cases} (140 - \text{static priority}) \times 20 & \text{if static priority} < 120 \\ (140 - \text{static priority}) \times 5 & \text{if static priority} \geqslant 120 \end{cases}$$
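For illustration, nice maps to static priority as static_priority = nice + 120, so the formula gives these base time slices (a small sketch, not the paper's code):
# base time slice of the O(1) scheduler, from the formula above
def base_time_slice_ms(nice):
    static_priority = 120 + nice          # nice -20..19 -> priority 100..139
    if static_priority < 120:
        return (140 - static_priority) * 20
    return (140 - static_priority) * 5

# base_time_slice_ms(-20) -> 800, base_time_slice_ms(0) -> 100, base_time_slice_ms(19) -> 5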
CFS
… was designed to provide good interactive performance while maximizing overall CPU utilization; interactivity should be maintained even during high load.
Fairness Test and Result
Under the O(1) scheduler, some child processes finish executing faster than others.
… the CFS scheduler distributes CPU time among tasks in a fairer manner compared to the O(1) scheduler.
Interactivity Test and Result
For both medium and high load, the interactive task has lower latencies when running on O(1). Although CFS has a higher average latency, the latency is well below human reaction time to visual stimuli, which is 180 to 200 ms.
The CFS scheduler records a lower maximum latency under all three types of background load.
In Real Work
Basic
1. CFS parameters
- cpu.cfs_period_us: the length of the enforcement period
- cpu.cfs_quota_us: the run-time available within each period
- cpu.cfs_burst_us: the maximum accumulated run-time carried over from unused quota
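A minimal sketch of setting these knobs, assuming a cgroup v1 cpu hierarchy mounted at /sys/fs/cgroup/cpu and a hypothetical group named demo:
from pathlib import Path

# cap the (hypothetical) "demo" group at 0.5 CPU: 50 ms quota per 100 ms period,
# with up to 25 ms of accumulated burst (cpu.cfs_burst_us needs kernel >= 5.14)
cg = Path("/sys/fs/cgroup/cpu/demo")
(cg / "cpu.cfs_period_us").write_text("100000")
(cg / "cpu.cfs_quota_us").write_text("50000")
(cg / "cpu.cfs_burst_us").write_text("25000")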
2. CFS Throttling Statistics
- cpu.stat.nr_periods: Number of enforcement intervals that have elapsed.
- cpu.stat.nr_throttled: Number of times the group has been throttled/limited.
- cpu.stat.throttled_time: The total time duration (in nanoseconds) for which entities of the group have been throttled.
- cpu.stat.nr_bursts: Number of periods in which a burst occurred.
- cpu.stat.burst_time: Cumulative wall-time (in nanoseconds) that any CPUs have used above quota in the respective periods.
3. How CFS Works
A runnable sketch (Python) of how CFS bandwidth control works (without cpu.cfs_burst_us); a simplified model, not the kernel's actual implementation:
# simulate a process executing under CFS bandwidth control;
# each loop iteration is one enforcement period of cpu.cfs_period_us
def simulate(need_us, period_us, quota_us):
    nr_periods = nr_throttled = throttled_time_ns = 0
    while need_us > 0:
        nr_periods += 1                    # one more enforcement period elapses
        used_us = min(need_us, quota_us)   # run until done or quota is exhausted
        need_us -= used_us
        if need_us > 0:
            # quota exhausted: the group is throttled (sleeps) until the next period
            nr_throttled += 1
            throttled_time_ns += (period_us - quota_us) * 1000
    return nr_periods, nr_throttled, throttled_time_ns
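For example, 250 ms of CPU demand under a 100 ms period and a 50 ms quota is throttled in four of five periods (illustrative numbers):
print(simulate(250_000, 100_000, 50_000))
# -> (5, 4, 200000000): 5 periods, throttled 4 times, 200 ms spent throttled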
A runnable sketch (Python) of how CFS bandwidth control works (with cpu.cfs_burst_us); here demands_us lists the CPU the task wants in each period, since a burst can only happen after a period that left some quota unused:
# simulate a process executing under CFS bandwidth control with burst:
# unused quota carries over into later periods, capped at cpu.cfs_burst_us,
# so the runtime available in one period is at most quota_us + burst_us
def simulate_burst(demands_us, period_us, quota_us, burst_us):
    nr_periods = nr_throttled = throttled_time_ns = 0
    nr_bursts = burst_time_ns = 0
    carried_us = 0                             # unused quota carried over
    for need_us in demands_us:
        nr_periods += 1
        runtime_us = quota_us + carried_us     # runtime available this period
        used_us = min(need_us, runtime_us)
        if used_us > quota_us:
            # ran above the base quota: a burst occurred in this period
            nr_bursts += 1
            burst_time_ns += (used_us - quota_us) * 1000
        if need_us > used_us:
            # runtime exhausted: throttled for the rest of the period
            # (the unmet demand is dropped in this simplified model)
            nr_throttled += 1
            throttled_time_ns += (period_us - used_us) * 1000
            carried_us = 0
        else:
            # carry the leftover quota forward, capped at cpu.cfs_burst_us
            carried_us = min(burst_us, runtime_us - used_us)
    return (nr_periods, nr_throttled, throttled_time_ns,
            nr_bursts, burst_time_ns)
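With a bursty demand pattern, the carried-over quota absorbs the spike; with burst disabled, the same spike is throttled (illustrative numbers):
# two light periods (30 ms each), then a 70 ms spike; period 100 ms, quota 50 ms
print(simulate_burst([30_000, 30_000, 70_000], 100_000, 50_000, 25_000))
# -> (3, 0, 0, 1, 20000000): the spike runs 20 ms above quota, never throttled
print(simulate_burst([30_000, 30_000, 70_000], 100_000, 50_000, 0))
# -> (3, 1, 50000000, 0, 0): without burst, the spike is throttled for 50 ms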
Workload
1. Decrease P95/P99 Latency for Bursty Workloads
- increase both values while keeping the ratio cpu.cfs_quota_us / cpu.cfs_period_us unchanged, so more work can complete within a single cpu.cfs_period_us before throttling kicks in
- use cpu.cfs_burst_us (Linux kernel >= 5.14); it helps only when some quota from previous periods went unused (unused quota accumulates up to cpu.cfs_burst_us)
- give up CFS bandwidth control and use cpuset (CPU affinity) to limit CPU usage instead [uber blog - avoiding-cpu-throttling-in-a-containerized-environment]
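For example, using the simulate() sketch above, scaling period and quota together (same 50% ratio) lets a 150 ms burst finish without throttling:
print(simulate(150_000, 100_000, 50_000))    # -> (3, 2, 100000000): throttled twice
print(simulate(150_000, 400_000, 200_000))   # -> (1, 0, 0): completes in one period
The tradeoff is that, when throttling does happen, a task can now stall for up to the longer period.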
2. Load Balancing in the Cloud (Containers)
- least connections should be the first-choice algorithm for a load-balancing gateway
- when least connections is used for load balancing, the more heavily a server is oversold, the lower the CPU usage it reports (because of CPU throttling)
Dashboard && Monitor && Alert Metrics
1. CFS Metrics
- cpu.stat.nr_periods
- cpu.stat.nr_throttled
- cpu.stat.throttled_time
- cpu.stat.nr_bursts
- cpu.stat.burst_time
Get CPU Throttled Stat:
root@-:~# cat /sys/fs/cgroup/cpu/cpu.stat
nr_periods 0
nr_throttled 0
throttled_time 0
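To feed these counters into a dashboard or exporter, a minimal parser sketch (the cgroup v1 path shown above is assumed):
# parse cpu.stat into a dict of counters
def read_cpu_stat(path="/sys/fs/cgroup/cpu/cpu.stat"):
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

# e.g. alert when stats["nr_throttled"] / max(stats["nr_periods"], 1) gets high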
2. Business Indicators
- P95/P99 latency of a given API
- latency across all deployed servers
Warning:
- the greater the cpu.stat.nr_throttled, the greater the tail latency
3. ELK Dashboard Metrics
- use a histogram to display the number of requests handled by each server over a specified time period
- use a histogram to display the average handling time of those requests on each server over a specified time period (average latency across servers)
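A sketch of the Elasticsearch aggregation such panels could be built on; the fields @timestamp, server, and latency_ms are hypothetical:
# per-server request counts (doc_count) and average latency, bucketed by minute
query = {
    "size": 0,
    "aggs": {
        "over_time": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"},
            "aggs": {
                "per_server": {
                    "terms": {"field": "server"},
                    "aggs": {"avg_latency": {"avg": {"field": "latency_ms"}}},
                },
            },
        },
    },
}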
References
- Fairness and interactive performance of O(1) and CFS Linux kernel schedulers
- linux kernel - cfs cpu throttling even in low cpu usage < 4.18
- linux kernel - cfs cpu.cfs_burst_us >= 5.14
- linux kernel - cpu.cfs_burst_us benchmark compare
- linux kernel - scheduler cfs
- uber blog - vertical-cpu-scaling
- uber blog - avoiding-cpu-throttling-in-a-containerized-environment
- kubernetes issue - CFS unnecessary throttling
- kubernetes - kubernetes-cpu-requests-limits
- man - cgroups - Set Upper and Lower Limit
- linux kernel - cgroup-v2
- linux kernel - CFS Bandwidth Control
- ELK - kibana Analyze time series data