[paper review] Fairness and Interactive Performance of O(1) and CFS Linux Kernel Schedulers
Abstract
- For both medium and high load, CFS has a higher average latency (5 ms ~ 10 ms) than O(1).
- The greater the load, the greater the average latency.
- The CFS parameters also have an impact on the average / tail latency.
- The nice value also affects CFS: the lower the nice value, the greater the time slice.
- The nice value also affects O(1): it determines each task's base time slice.
Review
paper: Fairness and interactive performance of O(1) and CFS Linux kernel schedulers
The test results are based on a Linux desktop workload, not a server workload.
The design goals of CFS are to provide fair CPU resource allocation among executing tasks without sacrificing interactive performance.
O(1)
… static priorities correspond to nice values … from -20 (highest priority) to 19 (lowest) …
Each task is allocated a base time slice based on its static priority/nice value using the equation:
$$\text{Base Time Slice (ms)} = \begin{cases} (140 - \text{static priority}) \times 20 & \text{if static priority} < 120 \\ (140 - \text{static priority}) \times 5 & \text{if static priority} \geqslant 120 \end{cases}$$
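For illustration, nice maps to static priority as static_priority = nice + 120, so the formula gives these base time slices (a small sketch, not the paper's code):
# base time slice of the O(1) scheduler, from the formula above
def base_time_slice_ms(nice):
    static_priority = 120 + nice          # nice -20..19 -> priority 100..139
    if static_priority < 120:
        return (140 - static_priority) * 20
    return (140 - static_priority) * 5

# base_time_slice_ms(-20) -> 800, base_time_slice_ms(0) -> 100, base_time_slice_ms(19) -> 5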
CFS
… was designed to provide good interactive performance while maximizing overall CPU utilization; interactivity should be maintained even during high load.
Fairness Test and Result
Under the O(1) scheduler, some child processes finish executing faster than others.
… the CFS scheduler distributes CPU time among tasks in a fairer manner compared to the O(1) scheduler.
Interactivity Test and Result
For both medium and high load, the interactive task has lower latencies when running on O(1). Although CFS has a higher average latency, the latency is well below human reaction time to visual stimuli, which is 180 to 200 ms.
The CFS scheduler records a lower maximum latency under all three types of background load.
In Real Work
Basic
1. CFS parameters
- cpu.cfs_period_us: the length of the enforcement period
- cpu.cfs_quota_us: the run-time available within each period
- cpu.cfs_burst_us: the maximum accumulated run-time carried over from unused quota
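A minimal sketch of setting these knobs, assuming a cgroup v1 cpu hierarchy mounted at /sys/fs/cgroup/cpu and a hypothetical group named demo:
from pathlib import Path

# cap the (hypothetical) "demo" group at 0.5 CPU: 50 ms quota per 100 ms period,
# with up to 25 ms of accumulated burst (cpu.cfs_burst_us needs kernel >= 5.14)
cg = Path("/sys/fs/cgroup/cpu/demo")
(cg / "cpu.cfs_period_us").write_text("100000")
(cg / "cpu.cfs_quota_us").write_text("50000")
(cg / "cpu.cfs_burst_us").write_text("25000")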
2. CFS Throttling Statistics
- cpu.stat.nr_periods: Number of enforcement intervals that have elapsed.
- cpu.stat.nr_throttled: Number of times the group has been throttled/limited.
- cpu.stat.throttled_time: The total time duration (in nanoseconds) for which entities of the group have been throttled.
- cpu.stat.nr_bursts: Number of periods in which a burst occurred.
- cpu.stat.burst_time: Cumulative wall-time (in nanoseconds) that any CPUs have used above quota in the respective periods.
3. How CFS Works
A runnable sketch (Python) of how CFS bandwidth control works (without cpu.cfs_burst_us); a simplified model, not the kernel's actual implementation:
# simulate a process executing under CFS bandwidth control;
# each loop iteration is one enforcement period of cpu.cfs_period_us
def simulate(need_us, period_us, quota_us):
    nr_periods = nr_throttled = throttled_time_ns = 0
    while need_us > 0:
        nr_periods += 1                    # one more enforcement period elapses
        used_us = min(need_us, quota_us)   # run until done or quota is exhausted
        need_us -= used_us
        if need_us > 0:
            # quota exhausted: the group is throttled (sleeps) until the next period
            nr_throttled += 1
            throttled_time_ns += (period_us - quota_us) * 1000
    return nr_periods, nr_throttled, throttled_time_ns
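For example, 250 ms of CPU demand under a 100 ms period and a 50 ms quota is throttled in four of five periods (illustrative numbers):
print(simulate(250_000, 100_000, 50_000))
# -> (5, 4, 200000000): 5 periods, throttled 4 times, 200 ms spent throttled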
A runnable sketch (Python) of how CFS bandwidth control works (with cpu.cfs_burst_us); here demands_us lists the CPU the task wants in each period, since a burst can only happen after a period that left some quota unused:
# simulate a process executing under CFS bandwidth control with burst:
# unused quota carries over into later periods, capped at cpu.cfs_burst_us,
# so the runtime available in one period is at most quota_us + burst_us
def simulate_burst(demands_us, period_us, quota_us, burst_us):
    nr_periods = nr_throttled = throttled_time_ns = 0
    nr_bursts = burst_time_ns = 0
    carried_us = 0                             # unused quota carried over
    for need_us in demands_us:
        nr_periods += 1
        runtime_us = quota_us + carried_us     # runtime available this period
        used_us = min(need_us, runtime_us)
        if used_us > quota_us:
            # ran above the base quota: a burst occurred in this period
            nr_bursts += 1
            burst_time_ns += (used_us - quota_us) * 1000
        if need_us > used_us:
            # runtime exhausted: throttled for the rest of the period
            # (the unmet demand is dropped in this simplified model)
            nr_throttled += 1
            throttled_time_ns += (period_us - used_us) * 1000
            carried_us = 0
        else:
            # carry the leftover quota forward, capped at cpu.cfs_burst_us
            carried_us = min(burst_us, runtime_us - used_us)
    return (nr_periods, nr_throttled, throttled_time_ns,
            nr_bursts, burst_time_ns)
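With a bursty demand pattern, the carried-over quota absorbs the spike; with burst disabled, the same spike is throttled (illustrative numbers):
# two light periods (30 ms each), then a 70 ms spike; period 100 ms, quota 50 ms
print(simulate_burst([30_000, 30_000, 70_000], 100_000, 50_000, 25_000))
# -> (3, 0, 0, 1, 20000000): the spike runs 20 ms above quota, never throttled
print(simulate_burst([30_000, 30_000, 70_000], 100_000, 50_000, 0))
# -> (3, 1, 50000000, 0, 0): without burst, the spike is throttled for 50 ms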
Workload
1. Decrease P95/P99 Latency for Bursty Workloads
- increase both values while keeping the ratio cpu.cfs_quota_us / cpu.cfs_period_us unchanged, so more work can complete within a single cpu.cfs_period_us before throttling kicks in
- use cpu.cfs_burst_us (Linux kernel >= 5.14); it helps only when some quota from previous periods went unused (unused quota accumulates up to cpu.cfs_burst_us)
- give up CFS bandwidth control and use cpuset (CPU affinity) to limit CPU usage instead [uber blog - avoiding-cpu-throttling-in-a-containerized-environment]
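For example, using the simulate() sketch above, scaling period and quota together (same 50% ratio) lets a 150 ms burst finish without throttling:
print(simulate(150_000, 100_000, 50_000))    # -> (3, 2, 100000000): throttled twice
print(simulate(150_000, 400_000, 200_000))   # -> (1, 0, 0): completes in one period
The tradeoff is that, when throttling does happen, a task can now stall for up to the longer period.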
2. Load Balancing in the Cloud (Containers)
- least connections should be the first-choice algorithm for a load-balancing gateway
- when least connections is used for load balancing, the more heavily a server is oversold, the lower the CPU usage it reports (because of CPU throttling)
Dashboard && Monitor && Alert Metrics
1. CFS Metrics
- cpu.stat.nr_periods
- cpu.stat.nr_throttled
- cpu.stat.throttled_time
- cpu.stat.nr_bursts
- cpu.stat.burst_time
Get CPU Throttled Stat:
root@-:~# cat /sys/fs/cgroup/cpu/cpu.stat
nr_periods 0
nr_throttled 0
throttled_time 0
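To feed these counters into a dashboard or exporter, a minimal parser sketch (the cgroup v1 path shown above is assumed):
# parse cpu.stat into a dict of counters
def read_cpu_stat(path="/sys/fs/cgroup/cpu/cpu.stat"):
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

# e.g. alert when stats["nr_throttled"] / max(stats["nr_periods"], 1) gets high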
2. Business Indicators
- P95/P99 latency of a given API
- latency across all deployed servers
Warning:
- the greater the cpu.stat.nr_throttled, the greater the tail latency
3. ELK Dashboard Metrics
- use a histogram to display the number of requests handled by each server over a specified time period
- use a histogram to display the average handling time of those requests on each server over a specified time period (average latency across servers)
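A sketch of the Elasticsearch aggregation such panels could be built on; the fields @timestamp, server, and latency_ms are hypothetical:
# per-server request counts (doc_count) and average latency, bucketed by minute
query = {
    "size": 0,
    "aggs": {
        "over_time": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"},
            "aggs": {
                "per_server": {
                    "terms": {"field": "server"},
                    "aggs": {"avg_latency": {"avg": {"field": "latency_ms"}}},
                },
            },
        },
    },
}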
References
- Fairness and interactive performance of O(1) and CFS Linux kernel schedulers
- linux kernel - cfs cpu throttling even in low cpu usage < 4.18
- linux kernel - cfs cpu.cfs_burst_us >= 5.14
- linux kernel - cpu.cfs_burst_us benchmark compare
- linux kernel - scheduler cfs
- uber blog - vertical-cpu-scaling
- uber blog - avoiding-cpu-throttling-in-a-containerized-environment
- kubernetes issue - CFS unnecessary throttling
- kubernetes - kubernetes-cpu-requests-limits
- man - cgroups - Set Upper and Lower Limit
- linux kernel - cgroup-v2
- linux kernel - CFS Bandwidth Control
- ELK - kibana Analyze time series data