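Notes on Linux Load Balancing and System Load Analysis

1 Load balancing

1.1 Computing load

1.1.1 Introduction to the PELT algorithm

Since Linux 3.8, the load of a task is no longer computed from its weight alone; the kernel also tracks the historical load of every scheduling entity. This algorithm is called PELT (Per-Entity Load Tracking).

《奔跑吧Linux内核》, Vol. 1: Infrastructure, p. 505

1.1.2 The load-tracking data structure: struct sched_avg

1.1.2.1 Definition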

/*
 * The load_avg/util_avg accumulates an infinite geometric series
 * (see __update_load_avg() in kernel/sched/fair.c).
 *
 * [load_avg definition]
 *
 *   load_avg = runnable% * scale_load_down(load)
 *
 * where runnable% is the time ratio that a sched_entity is runnable.
 * For cfs_rq, it is the aggregated load_avg of all runnable and
 * blocked sched_entities.
 *
 * [util_avg definition]
 *
 *   util_avg = running% * SCHED_CAPACITY_SCALE
 *
 * where running% is the time ratio that a sched_entity is running on
 * a CPU. For cfs_rq, it is the aggregated util_avg of all runnable
 * and blocked sched_entities.
 *
 * load_avg and util_avg don't direcly factor frequency scaling and CPU
 * capacity scaling. The scaling is done through the rq_clock_pelt that
 * is used for computing those signals (see update_rq_clock_pelt())
 *
 * N.B., the above ratios (runnable% and running%) themselves are in the
 * range of [0, 1]. To do fixed point arithmetics, we therefore scale them
 * to as large a range as necessary. This is for example reflected by
 * util_avg's SCHED_CAPACITY_SCALE.
 *
 * [Overflow issue]
 *
 * The 64-bit load_sum can have 4353082796 (=2^64/47742/88761) entities
 * with the highest load (=88761), always runnable on a single cfs_rq,
 * and should not overflow as the number already hits PID_MAX_LIMIT.
 *
 * For all other cases (including 32-bit kernels), struct load_weight's
 * weight will overflow first before we do, because:
 *
 *    Max(load_avg) <= Max(load.weight)
 *
 * Then it is the load_weight's responsibility to consider overflow
 * issues.
 */
struct sched_avg {
    u64             last_update_time;
    u64             load_sum;
    u64             runnable_load_sum;
    u32             util_sum;
    u32             period_contrib;
    unsigned long           load_avg;
    unsigned long           runnable_load_avg;
    unsigned long           util_avg;
    struct util_est         util_est;
} ____cacheline_aligned;
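
The bound quoted in the [Overflow issue] part of the comment can be checked with a few lines of arithmetic. This is only a sanity check, not kernel code: the constant and function names are chosen here for illustration, 88761 is the load weight of a nice -20 task, and the PID_MAX_LIMIT value is an assumption that holds for 64-bit kernels.

```python
# Sanity check of the [Overflow issue] arithmetic in the comment above.
LOAD_AVG_MAX = 47742               # maximum of the PELT geometric series
MAX_TASK_WEIGHT = 88761            # load weight of a nice -20 task
PID_MAX_LIMIT = 4 * 1024 * 1024    # assumption: value on 64-bit kernels


def max_entities_before_overflow(bits=64):
    """How many always-runnable, maximum-weight entities fit in load_sum."""
    return (2 ** bits) // LOAD_AVG_MAX // MAX_TASK_WEIGHT


entities = max_entities_before_overflow()
print(entities, entities > PID_MAX_LIMIT)
```

The result is roughly 4.35 billion entities, far more than the number of PIDs a system can have, which is the argument the comment makes.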

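1.1.2.2 Meaning of the struct sched_avg members

1.1.3 How the data structures are organized

Run queue

Scheduling entity of a task

1.1.4 ___update_load_avg() and ___update_load_sum()

___update_load_avg(): computes the quantized load (load_avg) and the compute capacity actually used (util_avg).
___update_load_sum(): computes the accumulated workload (the *_sum values).

《奔跑吧Linux内核》, Vol. 1: Infrastructure, p. 515

《Linux内核深度解析》, p. 104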
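
The decayed sums and averages that these two functions maintain can be illustrated with a small floating-point model. This is a minimal sketch, not the kernel's fixed-point implementation: it assumes the commonly documented PELT parameters (periods of 1024 us and a decay factor y with y^32 = 0.5), and all function names are made up for illustration.

```python
# Floating-point model of PELT-style decay; the kernel uses fixed-point tables.
PERIOD_US = 1024                   # assumed length of one PELT period
Y = 0.5 ** (1 / 32)                # assumed decay factor: Y ** 32 == 0.5


def decayed_sum(history_us):
    """Sum per-period runnable time; history_us[0] is the newest period."""
    return sum(t * Y ** age for age, t in enumerate(history_us))


def max_decayed_sum():
    """Limit of the series for an entity that is runnable all the time."""
    return PERIOD_US / (1 - Y)


def load_avg(history_us, weight):
    """load_avg = runnable% * weight, as in the struct sched_avg comment."""
    return weight * decayed_sum(history_us) / max_decayed_sum()


# Runnable for 32 periods, then asleep for the 32 most recent periods.
history = [0] * 32 + [PERIOD_US] * 32
print(round(load_avg(history, weight=1024)))
```

In this model a contribution halves every 32 periods, so the example task ends up with about a quarter of its weight. The floating-point limit of the series is close to, but not exactly, the 47742 that appears in the kernel comment, because the kernel rounds in fixed point.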

1.1.5 Viewing the load of a single task

For example, to view the load information of the process with PID 7202:

# cat /proc/7202/sched | grep se.avg
se.avg.load_sum                              :                   15
se.avg.runnable_sum                          :                15521
se.avg.util_sum                              :                15521
se.avg.load_avg                              :                    0
se.avg.runnable_avg                          :                    0
se.avg.util_avg                              :                    0
se.avg.last_update_time                      :       42221565865984
se.avg.util_est.ewma                         :                    8
se.avg.util_est.enqueued                     :                    8
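
Output in this format is easy to post-process. The sketch below turns the se.avg.* lines into a dictionary; parse_sched_avg is a hypothetical helper name, and the code assumes every se.avg.* value is an integer, as in the sample above.

```python
def parse_sched_avg(text):
    """Collect the se.avg.* lines of /proc/<pid>/sched into a dict of ints."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name.startswith("se.avg."):
            values[name[len("se.avg."):]] = int(value)
    return values


sample = """\
se.avg.load_sum                              :                   15
se.avg.util_avg                              :                    0
se.avg.util_est.enqueued                     :                    8
"""
print(parse_sched_avg(sample))
```

For a live process the text would come from reading /proc/<pid>/sched; whether the se.avg.* fields appear there depends on the kernel version and configuration.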

1.1.6 Viewing the load of a CFS run queue (cfs_rq)

# cat /sys/kernel/debug/sched/debug
......
cfs_rq[0]:/
  ......
  .load_avg                      : 0  
  .runnable_avg                  : 1  
  .util_avg
  ......
cfs_rq[1]:/
  ......
  .load_avg                      : 2
  .runnable_avg                  : 6
  .util_avg
  ......
cfs_rq[2]:/
  ......
  .load_avg                      : 0
  .runnable_avg                  : 0
  .util_avg                      : 0
  ......

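1.1.7 Load consumed by interrupt handlers

This requires the kernel configuration option CONFIG_HAVE_SCHED_AVG_IRQ to be enabled.

1.2 Load balancing in the completely fair scheduling class

1.2.1 Scheduling domains and scheduling groups

1.2.1.1 Overview

A scheduling domain is essentially a set of CPUs whose workloads the kernel should keep balanced. 《深入理解LINUX内核》, p. 285

The kernel builds a hierarchy of scheduling domains that follows the processor topology; each scheduling domain contains multiple scheduling groups. 《Linux内核深度解析》, p. 100

A scheduling group is the smallest unit of load balancing. In the lowest-level scheduling domain, one scheduling group usually describes one CPU.

Relationship between scheduling domains and scheduling groups: 《奔跑吧Linux内核》, Vol. 1: Infrastructure, p. 521

A task is migrated from one CPU to another only when the total workload of one group in a scheduling domain is far lower than that of another group in the same domain.

《深入理解LINUX内核》, p. 285

1.2.1.2 Scheduling domain data structure: struct sched_domain

1.2.1.3 Scheduling domain configuration: /sys/kernel/debug/sched/domains/

The files under /sys/kernel/debug/sched/domains/cpuX/domainX/ correspond to members of struct sched_domain.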

# tree /sys/kernel/debug/sched/domains/

1.2.1.4 Viewing scheduling-domain statistics: /proc/schedstat

# cat /proc/schedstat
version 15
timestamp 4295985456
cpu0 0 0 0 0 0 0 32014461618 3027065056 256137
domain0 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 36107022392 3413653518 238915
domain0 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu2 0 0 0 0 0 0 35249116446 3157909064 252470
domain0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu3 0 0 0 0 0 0 32014334332 2839418262 228644
domain0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu4 0 0 0 0 0 0 32069354312 3238003779 243171
domain0 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu5 0 0 0 0 0 0 30239063906 3177296363 292539
domain0 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu6 0 0 0 0 0 0 37679206082 2521244461 255856
domain0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu7 0 0 0 0 0 0 30929844200 2433414883 252047
domain0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Domain statistics
-----------------
One of these is produced per domain for each cpu described. (Note that if
CONFIG_SMP is not defined, *no* domains are utilized and these lines
will not appear in the output.)

domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

The first field is a bit mask indicating what cpus this domain operates over.

The next 24 are a variety of load_balance() statistics grouped into types
of idleness (idle, busy, and newly idle):

1) # of times in this domain load_balance() was called when the
cpu was idle
2) # of times in this domain load_balance() checked but found
the load did not require balancing when the cpu was idle
3) # of times in this domain load_balance() tried to move one or
more tasks and failed, when the cpu was idle
4) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was idle
5) # of times in this domain pull_task() was called when the cpu
was idle
6) # of times in this domain pull_task() was called even though
the target task was cache-hot when idle
7) # of times in this domain load_balance() was called but did
not find a busier queue while the cpu was idle
8) # of times in this domain a busier queue was found while the
cpu was idle but no busier group was found
9) # of times in this domain load_balance() was called when the
cpu was busy
10) # of times in this domain load_balance() checked but found the
load did not require balancing when busy
11) # of times in this domain load_balance() tried to move one or
more tasks and failed, when the cpu was busy
12) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was busy
13) # of times in this domain pull_task() was called when busy
14) # of times in this domain pull_task() was called even though the
target task was cache-hot when busy
15) # of times in this domain load_balance() was called but did not
find a busier queue while the cpu was busy
16) # of times in this domain a busier queue was found while the cpu
was busy but no busier group was found
17) # of times in this domain load_balance() was called when the
cpu was just becoming idle
18) # of times in this domain load_balance() checked but found the
load did not require balancing when the cpu was just becoming idle
19) # of times in this domain load_balance() tried to move one or more
tasks and failed, when the cpu was just becoming idle
20) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was just becoming idle
21) # of times in this domain pull_task() was called when newly idle
22) # of times in this domain pull_task() was called even though the
target task was cache-hot when just becoming idle
23) # of times in this domain load_balance() was called but did not
find a busier queue while the cpu was just becoming idle
24) # of times in this domain a busier queue was found while the cpu
was just becoming idle but no busier group was found

Next three are active_load_balance() statistics:

25) # of times active_load_balance() was called
26) # of times active_load_balance() tried to move a task and failed
27) # of times active_load_balance() successfully moved a task

Next three are sched_balance_exec() statistics:

28) sbe_cnt is not used
29) sbe_balanced is not used
30) sbe_pushed is not used

Next three are sched_balance_fork() statistics:

31) sbf_cnt is not used
32) sbf_balanced is not used
33) sbf_pushed is not used

Next three are try_to_wake_up() statistics:

34) # of times in this domain try_to_wake_up() awoke a task that
last ran on a different cpu in this domain
35) # of times in this domain try_to_wake_up() moved a task to the
waking cpu because it was cache-cold on its own cpu anyway
36) # of times in this domain try_to_wake_up() started passive balancing

Documentation/scheduler/sched-stats.rst
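
The domain lines can be decoded mechanically. The sketch below splits one line into the domain index, the CPUs named by the hexadecimal cpumask, and the positional counters; the helper names are hypothetical, and the counters are deliberately left as a plain list because their number and meaning depend on the schedstat version.

```python
def cpus_in_mask(mask_hex):
    """Decode a hexadecimal cpumask such as '11' into a list of CPU numbers."""
    mask = int(mask_hex.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def parse_domain_line(line):
    """Split a 'domain<N> <cpumask> <counters...>' line of /proc/schedstat."""
    name, mask, *counters = line.split()
    return {
        "domain": int(name[len("domain"):]),
        "cpus": cpus_in_mask(mask),
        "counters": [int(c) for c in counters],
    }


info = parse_domain_line("domain0 11 0 0 0 0")
print(info["domain"], info["cpus"])
```

Applied to the output above, mask 11 covers CPUs 0 and 4 and mask ff covers CPUs 0 to 7, which suggests that domain0 pairs two CPUs while domain1 spans the whole machine.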

1.2.2 Load-balancing flow

1.2.2.1 Flowchart

《Linux内核深度解析》, p. 107

《奔跑吧Linux内核》, Vol. 1: Infrastructure, p. 530

1.2.2.2 Finding the busiest scheduling group: find_busiest_group()

Related functions: update_sd_lb_stats(), calculate_imbalance(), and update_sg_lb_stats().

《Linux内核深度解析》, p. 107

《深入理解LINUX内核》, p. 288

《深入Linux内核架构》, p. 99

《奔跑吧Linux内核》, Vol. 1: Infrastructure, p. 531
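
The idea behind this step can be sketched as a toy model: collect per-group statistics, pick the group with the highest average load, and compute how much load would have to move. This is a heavily simplified illustration and not the kernel's algorithm. The Group class is hypothetical, the function only borrows the kernel's name, and the threshold of 125 is an assumed value for a parameter named after the imbalance_pct field of struct sched_domain.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Hypothetical stand-in for the statistics gathered per scheduling group."""
    cpus: list
    load: int        # summed load of the group's CPUs
    capacity: int    # summed capacity of the group's CPUs

    @property
    def avg_load(self):
        return self.load * 1024 / self.capacity


def find_busiest_group(groups, local, imbalance_pct=125):
    """Return (busiest, imbalance), or (None, 0) if the domain looks balanced."""
    others = [g for g in groups if g is not local]
    busiest = max(others, key=lambda g: g.avg_load, default=None)
    if busiest is None or busiest.avg_load * 100 <= local.avg_load * imbalance_pct:
        return None, 0
    domain_avg = 1024 * sum(g.load for g in groups) / sum(g.capacity for g in groups)
    # Move only enough load to bring both groups towards the domain average.
    imbalance = min(busiest.avg_load - domain_avg, domain_avg - local.avg_load)
    return busiest, imbalance


local = Group(cpus=[0], load=100, capacity=1024)
busy = Group(cpus=[1], load=900, capacity=1024)
print(find_busiest_group([local, busy], local))
```

The percentage threshold models the rule that tasks move only when one group is much less loaded than another, so small differences do not trigger migrations.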

1.2.2.3 detach_tasks() / attach_tasks()

detach_tasks()

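Walks all tasks on the busiest run queue, picks the ones that are suitable for migration, and removes them from that run queue.

attach_tasks()

Adds the tasks that were just detached from the busiest run queue to the run queue of the current CPU.

《奔跑吧Linux内核》, Vol. 1: Infrastructure, p. 530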
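
The two steps can be modelled with plain lists. The sketch below is a simplified illustration under stated assumptions, not the kernel code: tasks are dictionaries with hypothetical keys, and the rule that skips a task whose load exceeds the remaining imbalance is a simplification chosen for this example.

```python
def detach_tasks(busiest_rq, dst_cpu, imbalance):
    """Remove migratable tasks from busiest_rq until the imbalance is covered."""
    detached = []
    for task in list(busiest_rq):
        if imbalance <= 0:
            break
        if task["running"] or dst_cpu not in task["allowed_cpus"]:
            continue    # cannot move the running task or violate CPU affinity
        if task["load"] > imbalance:
            continue    # simplified rule: moving it would overshoot
        busiest_rq.remove(task)
        detached.append(task)
        imbalance -= task["load"]
    return detached


def attach_tasks(local_rq, tasks):
    """Enqueue every detached task on the destination run queue."""
    local_rq.extend(tasks)
    return len(tasks)


busiest = [
    {"pid": 1, "load": 300, "running": True, "allowed_cpus": {0, 1}},
    {"pid": 2, "load": 200, "running": False, "allowed_cpus": {1}},
    {"pid": 3, "load": 250, "running": False, "allowed_cpus": {0, 1}},
    {"pid": 4, "load": 100, "running": False, "allowed_cpus": {0, 1}},
]
local = []
moved = attach_tasks(local, detach_tasks(busiest, dst_cpu=0, imbalance=400))
print(moved, [t["pid"] for t in local])
```

Separating the detach and attach phases means the source and destination run queues never have to be manipulated at the same time in this model.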

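1.2.2.4 Migration thread: migration/<cpu_id>

If load balancing fails, that is, not a single task was migrated, the kernel sets the active-load-balance flag on the busiest CPU, records the current CPU as the migration target, adds a work item whose function is active_load_balance_cpu_stop to the busiest CPU's stop-machine work queue, and wakes up the busiest CPU's migration thread. The migration thread then takes the work item off that queue and performs active load balancing.

《Linux内核深度解析》, p. 107

《深入Linux内核架构》, p. 100

1.2.3 Cost of task migration

1.3 Load balancing in the deadline scheduling class

When the scheduler picks the next deadline task, if the currently running task is a deadline task, it tries to pull deadline tasks over from CPUs that are overloaded with deadline tasks.

A CPU is overloaded with deadline tasks when:

The deadline run queue holds at least 2 deadline tasks.
At least one deadline task is bound to multiple CPUs.

《Linux内核深度解析》, p. 96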

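1.4 Load balancing in the real-time scheduling class

When the scheduler picks the next real-time task, if the highest priority among the tasks on the current CPU's real-time run queue is lower than the priority of the currently running task, it tries to pull pushable real-time tasks over from CPUs that are overloaded with real-time tasks.

A CPU is overloaded with real-time tasks when:

The real-time run queue holds at least 2 real-time tasks.
At least one real-time task is pushable. A pushable real-time task is one that is bound to multiple CPUs and can therefore migrate between them.

《Linux内核深度解析》, p. 98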
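
The pull condition and the overload definition can be written down as two small predicates. This is a sketch of the rules as stated above, not kernel code: the dictionary keys and function names are hypothetical, priorities follow the kernel convention that a smaller number means a higher priority, and the value 100 is used in the example to mean that no real-time task is queued.

```python
def rt_overloaded(rq):
    """Overload test from the definition above (keys are hypothetical)."""
    return rq["rt_nr_running"] >= 2 and rq["rt_nr_pushable"] >= 1


def cpus_to_pull_from(this_cpu, runqueues, running_prio):
    """Return the overloaded CPUs to pull from, or [] if no pull is needed.

    Priorities are numeric; a smaller number is a higher priority.
    """
    this_rq = runqueues[this_cpu]
    if this_rq["highest_queued_prio"] <= running_prio:
        return []    # a queued RT task is at least as urgent as the running one
    return [cpu for cpu, rq in runqueues.items()
            if cpu != this_cpu and rt_overloaded(rq)]


runqueues = {
    0: {"rt_nr_running": 0, "rt_nr_pushable": 0, "highest_queued_prio": 100},
    1: {"rt_nr_running": 3, "rt_nr_pushable": 2, "highest_queued_prio": 10},
    2: {"rt_nr_running": 2, "rt_nr_pushable": 0, "highest_queued_prio": 20},
}
print(cpus_to_pull_from(0, runqueues, running_prio=50))
```

CPU 2 is skipped in the example because its two real-time tasks are not pushable, so nothing could be pulled from it.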

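1.5 Debugging

Task migrations can be traced with the sched_migrate_task tracepoint:

/sys/kernel/debug/tracing/events/sched/sched_migrate_task/

2 Load (utilization) of a single CPU core

Per-core utilization can be viewed with the command "sar -P ALL 1". Utilization charts can also be generated; see the CSDN blog post "Linux下性能分析的可视化图表工具".

3 System load

3.1 Load averages over 1, 5, and 15 minutes

3.1.1 Overview

Load averages show the load demand on the system: the number of tasks in the runnable state plus the number of tasks in the uninterruptible wait state.
《BPF之巅：洞悉Linux系统和应用性能》, p. 198

For what the 1-, 5-, and 15-minute load averages mean, see the CSDN blog post "一篇读懂｜Linux系统平均负载".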
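
The three numbers are exponentially damped moving averages of the task count rather than plain averages. The floating-point sketch below illustrates the update rule under stated assumptions: a 5-second sampling interval and an 11-bit fixed-point scale (2048), with function names made up for illustration. The fixed-point constants it derives should correspond to the EXP_1, EXP_5 and EXP_15 values used by the kernel's load-average code.

```python
import math

FIXED_1 = 2048            # assumed fixed-point representation of 1.0
SAMPLE_INTERVAL_S = 5     # assumed sampling interval in seconds


def decay_constant(window_s):
    """Fixed-point exp(-interval / window) for a given averaging window."""
    return round(FIXED_1 * math.exp(-SAMPLE_INTERVAL_S / window_s))


def next_load(load, active_tasks, window_s):
    """One update step: the old load decays and the current count is blended in."""
    e = math.exp(-SAMPLE_INTERVAL_S / window_s)
    return load * e + active_tasks * (1 - e)


load = 0.0
for _ in range(12):       # one minute with 2 runnable or uninterruptible tasks
    load = next_load(load, active_tasks=2, window_s=60)
print(round(load, 2), [decay_constant(w) for w in (60, 300, 900)])
```

After one minute of constant demand the 1-minute average in this model has covered only about 63% of the distance to the true task count, which is why the reported figures lag behind sudden changes.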

3.1.2 How to view them

Run any of the following commands:

uptime
top / htop
w
cat /proc/loadavg
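
The content of /proc/loadavg is a single line of five fields: the three load averages, the number of currently runnable scheduling entities over the total number, and the most recently assigned PID. A minimal parsing sketch, with a hypothetical helper name and made-up sample values:

```python
def parse_loadavg(text):
    """Parse the single line of /proc/loadavg into a dictionary."""
    one, five, fifteen, tasks, last_pid = text.split()
    runnable, total = tasks.split("/")
    return {
        "load1": float(one),
        "load5": float(five),
        "load15": float(fifteen),
        "runnable": int(runnable),
        "total": int(total),
        "last_pid": int(last_pid),
    }


sample = "0.20 0.18 0.12 1/80 11206\n"   # made-up sample values
print(parse_loadavg(sample))
```

When only the three averages are needed, Python's os.getloadavg() returns them directly.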

3.2 Pressure Stall Information (PSI)

3.2.1 Overview

An interface has now been added in Linux 4.20 that provides such a breakdown: pressure stall information (PSI), which gives averages for CPU, memory, and I/O.
Systems Performance: Enterprise and the Cloud (Pearson, 2020), p. 257

3.2.2 /proc/pressure/cpu

# cat /proc/pressure/cpu 
some avg10=0.00 avg60=0.00 avg300=0.00 total=6305749
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" line indicates the share of time in which at least some tasks are stalled on a given resource.
The "full" line indicates the share of time in which all non-idle tasks are stalled on a given resource simultaneously.
Documentation/accounting/psi.rst
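
The pressure files share one line format, so a single helper can parse the "some" and "full" lines. This is a minimal sketch with a hypothetical function name; it treats the avg* fields as percentages and keeps total as an integer.

```python
def parse_pressure(text):
    """Parse /proc/pressure/<resource> into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        fields = dict(pair.split("=") for pair in pairs)
        result[kind] = {
            key: int(value) if key == "total" else float(value)
            for key, value in fields.items()
        }
    return result


sample = """\
some avg10=0.00 avg60=0.00 avg300=0.00 total=6305749
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
"""
print(parse_pressure(sample)["some"]["total"])
```

Because total is a cumulative stall time, the difference between two successive readings gives the stall time within that interval, whereas the avg* fields are already averaged by the kernel.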