Experiment Overview

In the previous article, I installed Ubuntu 22.04 on an Alder Lake machine (12th Gen Core i5-1240P) and finally got the PMU working:

$ dmesg | grep PMU
[    0.127326] Performance Events: XSAVE Architectural LBR, PEBS fmt4+-baseline,  AnyThread deprecated, Alderlake Hybrid events, 32-deep LBR, full-width counters, Intel PMU driver.
[    0.127326] core: cpu_core PMU driver: 
[    0.127326] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[    0.019855] core: cpu_atom PMU driver: PEBS-via-PT 
[    4.096409] RAPL PMU: API unit is 2^-32 Joules, 4 fixed counters, 655360 ms ovfl timer
[    4.096413] RAPL PMU: hw unit of domain pp0-core 2^-14 Joules
[    4.096414] RAPL PMU: hw unit of domain package 2^-14 Joules
[    4.096415] RAPL PMU: hw unit of domain pp1-gpu 2^-14 Joules
[    4.096416] RAPL PMU: hw unit of domain psys 2^-14 Joules

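Let's start with a simple experiment that checks the CPU's branch prediction.

The experiment needs a program whose behavior is known in advance, so I wrote an infinite loop: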

int main()
{
	while (1);
	return 0;
}

Save it as dead_loop.c, then build it, run it, and attach perf to it:

$ gcc -g -O0 dead_loop.c -o dead_loop
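$ ./dead_loop                       # leave it running; use a second terminal for the commands below
$ ps -ef | grep dead_loop           # find the process ID
$ perf stat -p [PID of dead_loop]   # press Ctrl-C to stop and print the counters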

Performance counter stats for process id '9177':

         12,727.43 msec task-clock                #    1.000 CPUs utilized          
               168      context-switches          #   13.200 /sec                   
                24      cpu-migrations            #    1.886 /sec                   
                 0      page-faults               #    0.000 /sec                   
    55,753,235,526      cpu_core/cycles/          #    4.381 G/sec                  
     <not counted>      cpu_atom/cycles/                                              (0.00%)
   130,101,578,945      cpu_core/instructions/    #   10.222 G/sec                  
     <not counted>      cpu_atom/instructions/                                        (0.00%)
   130,082,139,921      cpu_core/branches/        #   10.221 G/sec                  
     <not counted>      cpu_atom/branches/                                            (0.00%)
            17,389      cpu_core/branch-misses/   #    1.366 K/sec                  
     <not counted>      cpu_atom/branch-misses/                                       (0.00%)

      12.730348071 seconds time elapsed

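Two numbers here deserve a closer look:
(1) Why is cpu_core/instructions about 10 G/sec?
(2) Why is cpu_core/branch-misses not 0?

Two Questions

Why are there 10 G instructions per second when the running frequency is only 4302.811 MHz?

This is where IPC (Instructions Per Cycle) comes in. Let's look at the instructions the program actually executes: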

$ objdump -S dead_loop

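[Image: objdump disassembly of dead_loop, where the loop is a single jmp instruction]
The CPU (more precisely, one core of the CPU) does nothing but execute this jmp instruction over and over. Dividing cpu_core/instructions by cpu_core/cycles in the perf output above shows that, while executing jmp, the IPC is roughly 2.3.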
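
The IPC figure can be checked directly against the counters in the perf stat output above. The arithmetic below is only a sanity check on those measured values; it takes task-clock (12,727.43 msec) as the running time and rounds the results:

```latex
\[
\mathrm{IPC} = \frac{\text{instructions}}{\text{cycles}}
             = \frac{130{,}101{,}578{,}945}{55{,}753{,}235{,}526} \approx 2.33
\]
\[
f \approx \frac{55{,}753{,}235{,}526\ \text{cycles}}{12.72743\ \text{s}} \approx 4.38\ \text{GHz},
\qquad
\mathrm{IPC} \times f \approx 2.33 \times 4.38\ \text{G/s} \approx 10.2\ \text{G instructions/s}
\]
```

The product agrees with the 10.222 G/sec that perf reports for cpu_core/instructions, so the instruction rate is simply the clock frequency multiplied by the IPC.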

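Why can IPC be greater than 1?

To answer this, I found a reference [1] that describes superscalar processing, illustrated by the figure below.
[Image: superscalar processor diagram from reference [1]]
The design shown has two ALUs which, together with a register file built to feed both of them, can execute 2 instructions in a single clock cycle. Superscalar execution of this kind is why my processor reaches about 10 G instructions per second.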

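Why are there branch mispredictions?

The only branch in this program is an unconditional jmp back to itself, which should be trivially predictable, so where do the mispredictions come from? It occurred to me that the branch misses reported above might come from kernel mode, so I tested that: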

$ perf stat -e branch-misses:u -p [PID of dead_loop]   # the PID is 20910 this time
Performance counter stats for process id '20910':

                 0      cpu_core/branch-misses:u/                                     (100.00%)
                 0      cpu_atom/branch-misses:u/                                     (0.00%)

      26.299047205 seconds time elapsed

As expected, branch-misses drops to 0 in user mode. Now let's look at the number of branch mispredictions in kernel mode:

$ perf stat -e branch-misses:k -p 20910
Performance counter stats for process id '20910':

             5,670      cpu_core/branch-misses:k/                                   
     <not counted>      cpu_atom/branch-misses:k/                                     (0.00%)

       6.563240004 seconds time elapsed
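
To put these counts in perspective, the misprediction ratio of the first run and the kernel-mode miss rate of the last run can be estimated from the numbers above. This is a rough check only: the two figures come from different runs, and the second one uses elapsed time as an approximation of the running time.

```latex
\[
\frac{\text{branch-misses}}{\text{branches}}
  = \frac{17{,}389}{130{,}082{,}139{,}921} \approx 1.34 \times 10^{-7}
\]
\[
\frac{5{,}670\ \text{kernel-mode misses}}{6.563\ \text{s}} \approx 864\ \text{misses/s}
\]
```

The kernel-mode rate is of the same order as the 1.366 K/sec reported for the first run, which is consistent with the mispredictions coming from kernel code rather than from the loop itself.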