Nutshell

Thoughts come and go, words stay eternal

29 Jul 2022

Abstract

  1. least_conn is a load-balancing algorithm that sends each incoming request to the server with the fewest active connections
  2. least_conn tries to quantify load by the number of connections, but that number depends on how long each request takes to process
  3. Some basic rules:
    1. if a real server’s CPU usage is lower than it should be, that server must have a performance problem: either higher latency in other components or lower CPU performance
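The selection rule in item 1 can be sketched in a few lines. This is a minimal illustration, not nginx's implementation: `pick_least_conn`, `active`, and `weights` are hypothetical names, and ties simply go to the lowest index, whereas a real balancer typically rotates among tied servers.

```python
from typing import Sequence


def pick_least_conn(active: Sequence[int], weights: Sequence[int]) -> int:
    """Return the index of the server with the fewest active connections per unit of weight."""
    if not active or len(active) != len(weights):
        raise ValueError("active and weights must be non-empty and the same length")
    # Assumption: ties go to the lowest index; this keeps the sketch deterministic.
    return min(range(len(active)), key=lambda i: active[i] / weights[i])


# Equal weights: the server with 7 connections wins.
print(pick_least_conn([12, 7, 9], [1, 1, 1]))  # -> 1
# Weight 4 on server 0: 12 / 4 = 3 connections per unit of weight, so it wins.
print(pick_least_conn([12, 7, 9], [4, 1, 1]))  # -> 0
```

With equal weights the rule reduces to "fewest connections", which is the case the rest of this post assumes.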

Introduction

In Math

When distributing requests across n servers, least_conn tries to achieve one goal: keep every server as busy as the others.

Total time t that the n servers take to finish their requests:

$$ \begin{align} &t = \sum_{i=1}^n t_i = \sum_{i=1}^n (t_{cpu\_i} + t_{other\_i}) \\ &t_{i}\text{ : the time the }i^{th}\text{ server takes to finish its distributed requests} \\ &t_{cpu\_i}\text{ : the CPU time the }i^{th}\text{ server spends on its distributed requests} \\ &t_{other\_i}\text{ : the remaining (non-CPU) time the }i^{th}\text{ server spends on its distributed requests} \end{align} $$

What the least_conn algorithm (with equal weights) wants to achieve is:

$$ \begin{align} \\ t_i &\approx t_{i+1} \\ t_{cpu\_i} + t_{other\_i} &\approx t_{cpu\_i+1} + t_{other\_i+1} \\ \\ \end{align} $$

Proof: a server with lower CPU usage must have a performance issue

Conditions:

  1. all servers have the same configuration, including software and hardware
  2. all servers have the same weight

Proof in math:

$$ \begin{align} &\text{Proof:} \\ &\because \ \exists \ t_{cpu\_i}\downarrow \ \wedge \ t_{cpu\_i} + t_{other\_i} \approx t_{cpu\_i+1} + t_{other\_i+1} \\ &\therefore \ t_{other\_i}\uparrow \\ &\text{Q.E.D.} \end{align} $$
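The relationship can be checked with a small simulation. This is a sketch under stated assumptions, not a model of any real balancer: the function name `simulate_least_conn` and all numbers are hypothetical, every request costs the same CPU time on every server, and a request always takes `cpu_time + other_time` of wall time regardless of how many requests run concurrently (no CPU contention). The load is closed-loop: a fixed number of requests stays in flight, and each completion triggers one new dispatch.

```python
import heapq


def simulate_least_conn(cpu_time, other_time, total_requests, concurrency):
    """Closed-loop simulation of equal-weight least_conn.

    cpu_time: CPU seconds per request (same on every server).
    other_time: per-server non-CPU seconds per request (network, io, ...).
    Returns (served, t_cpu, t_other) as per-server lists.
    """
    n = len(other_time)
    active = [0] * n   # in-flight connections per server
    served = [0] * n   # completed requests per server
    in_flight = []     # min-heap of (finish_time, server)
    dispatched = 0

    def dispatch(now):
        # least_conn: fewest active connections; ties go to the lowest index.
        server = min(range(n), key=lambda i: active[i])
        active[server] += 1
        heapq.heappush(in_flight, (now + cpu_time + other_time[server], server))

    # Fill the pipeline, then send one new request for every completion.
    for _ in range(min(concurrency, total_requests)):
        dispatch(0.0)
        dispatched += 1
    while in_flight:
        now, server = heapq.heappop(in_flight)
        active[server] -= 1
        served[server] += 1
        if dispatched < total_requests:
            dispatch(now)
            dispatched += 1

    t_cpu = [s * cpu_time for s in served]
    t_other = [s * o for s, o in zip(served, other_time)]
    return served, t_cpu, t_other


# Hypothetical numbers: server 2 has 4x the non-CPU latency of the others.
served, t_cpu, t_other = simulate_least_conn(
    cpu_time=1.0, other_time=[1.0, 1.0, 4.0], total_requests=600, concurrency=6
)
for i in range(3):
    print(i, served[i], t_cpu[i], t_other[i], t_cpu[i] + t_other[i])
```

In this setup each server holds the same number of connections at any moment, so the per-server totals t_cpu + t_other come out roughly equal, while the slow server shows a lower t_cpu and a higher t_other than its peers.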

Conclusion - Issues:

  1. CPU performance
    1. higher CPU steal: oversold hosts in cloud or container environments
    2. lower CPU frequency: caused by a bad power policy or other factors
    3. CPU throttling: CFS/cgroup
  2. latency of other components
    1. network: bandwidth / pps / others
    2. io: read/write performance issues
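CPU steal can be quantified from two samples of the aggregate `cpu` line in `/proc/stat`. The sketch below assumes the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice); `steal_percent`, `read_cpu_line`, and the sample counters are hypothetical and only illustrate the arithmetic.

```python
def steal_percent(before: str, after: str) -> float:
    """Percentage of CPU time stolen between two aggregate "cpu" lines of /proc/stat."""

    def fields(line: str) -> list:
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(x) for x in parts[1:]]

    deltas = [a - b for b, a in zip(fields(before), fields(after))]
    if len(deltas) < 8:
        raise ValueError("this /proc/stat line has no steal field")
    # guest and guest_nice are already counted in user and nice, so only the
    # first eight fields are summed to get the total elapsed CPU time.
    total = sum(deltas[:8])
    return 100.0 * deltas[7] / total if total else 0.0


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Return the first line of /proc/stat (Linux only); not called in this sketch."""
    with open(path) as f:
        return f.readline()


# Hypothetical samples taken some seconds apart.
before = "cpu  1000 0 500 8000 100 0 50 50 0 0"
after = "cpu  1600 0 700 9000 150 0 100 150 0 0"
print(steal_percent(before, after))  # -> 5.0
```

On a live host, two calls to `read_cpu_line()` separated by a sleep would supply the `before` and `after` arguments.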

Conclusion - Troubleshooting:

  1. CPU performance:
    1. VM: monitor CPU steal metrics
      1. metrics:
        • CPU steal
        • CPU frequency or power policy
      2. command:
        • top or htop
        • sar or vmstat
      3. tools:
    2. Container: monitor the cgroup’s cpu.stat statistics
      1. metrics:
        • nr_periods: Number of enforcement intervals that have elapsed.
        • nr_throttled: Number of times the group has been throttled/limited.
        • throttled_time: The total time duration (in nanoseconds) for which entities of the group have been throttled.
        • nr_bursts: Number of periods burst occurs.
        • burst_time: Cumulative wall-time (in nanoseconds) that any CPUs has used above quota in respective periods.
      2. command:
        • cat /sys/fs/cgroup/cpu,cpuacct/*/cpu.stat
      3. tools:
  2. latency of other components:
    1. network:
      1. metrics:
        1. bandwidth
        2. pps
    2. io:
      1. metrics:
        1. iops
        2. avg queue size
      2. command:
        1. iostat
        2. sar
      3. tools:
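The container metrics above can be turned into two quick indicators: the share of enforcement periods that were throttled, and the average throttled time per throttled period. This is a minimal sketch that assumes the cgroup v1 field names listed earlier (cgroup v2 reports `throttled_usec` instead of `throttled_time`); `parse_cpu_stat`, `throttle_summary`, and the sample values are hypothetical.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the "key value" lines of a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_summary(stats: dict) -> tuple:
    """Return (throttled ratio, average throttled milliseconds per throttled period)."""
    periods = stats.get("nr_periods", 0)
    throttled = stats.get("nr_throttled", 0)
    ratio = throttled / periods if periods else 0.0
    # throttled_time is in nanoseconds in cgroup v1; convert to milliseconds.
    avg_ms = stats.get("throttled_time", 0) / throttled / 1e6 if throttled else 0.0
    return ratio, avg_ms


# Hypothetical cpu.stat content, not taken from a real host.
SAMPLE = """nr_periods 1000
nr_throttled 250
throttled_time 5000000000
"""

ratio, avg_ms = throttle_summary(parse_cpu_stat(SAMPLE))
print(ratio, avg_ms)  # -> 0.25 20.0
```

A ratio that stays well above zero on one server but not on its peers points to the CFS quota as the reason that server is getting less CPU time.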

Reference

  1. CFS - statistics
  2. cadvisor - container monitor
  3. prometheus
  4. kubernetes - CFS quotas can lead to unnecessary throttling
  5. linux - CPU_frequency_scaling