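Looking at 64-bit Program Execution from the Assembly Level: A Case Study of How likely Guides Compiler Optimization, with an Analysis of the Underlying Implementation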

news2025/4/16 12:00:14

Outline

Code
Analysis
- with_attributes::pow
- no_attributes::pow
- Analysis

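In 《Modern C++——使用分支预测优化代码性能》 ("Modern C++: Using Branch Prediction to Optimize Code Performance") we showed how the likely hint lets the compiler optimize the generated code, but we also noted that the resulting optimization is not a simple reordering of the branches. So what does the compiler actually do to improve overall performance by 20%?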

Code

Let's first review the code:

#include <chrono>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <random>
#include <functional>
 
namespace with_attributes {
    constexpr double pow(double x, long long n) noexcept {
        if (n <= 0) [[unlikely]]
            return 1;
        else [[likely]]
            return x * pow(x, n - 1);
    }
} // namespace with_attributes
 
namespace no_attributes {
    constexpr double pow(double x, long long n) noexcept {
        if (n <= 0)
            return 1;
        else
            return x * pow(x, n - 1);
    }
} // namespace no_attributes

double calc(double x, std::function<double(double, long long)> f) noexcept {
    constexpr long long precision{16LL};
    double y{};
    for (auto n{0LL}; n < precision; n += 2LL)
        y += f(x, n);
    return y;
}

double gen_random() noexcept {
    static std::random_device rd;
    static std::mt19937 gen(rd());
    static std::uniform_real_distribution<double> dis(-1.0, 1.0);
    return dis(gen);
}
 
volatile double sink{}; // ensures a side effect
 
int main() {
    auto benchmark = [](auto fun, auto rem)
    {
        const auto start = std::chrono::high_resolution_clock::now();
        for (auto y{1ULL}; y != 500'000'000ULL; ++y)
            sink = calc(gen_random(), fun);
        const std::chrono::duration<double> diff =
            std::chrono::high_resolution_clock::now() - start;
        std::cout << "Time: " << std::fixed << std::setprecision(6) << diff.count()
                  << " sec " << rem << std::endl; 
    };
 
    benchmark(with_attributes::pow, "(with attributes)");
    benchmark(no_attributes::pow, "(without attributes)");
    benchmark(with_attributes::pow, "(with attributes)");
    benchmark(no_attributes::pow, "(without attributes)");
    benchmark(with_attributes::pow, "(with attributes)");
    benchmark(no_attributes::pow, "(without attributes)");
}

And here is the output of a run:
[Figure: benchmark timing output]

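Analysis

Now let's dig into the real reason for the performance gain.

As the saying goes, before the source code there are no secrets. For a C++ program, though, the "code" in question is the assembly code, so we need to look at how with_attributes::pow and no_attributes::pow are implemented at that level.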

with_attributes::pow

[Figure: disassembly of with_attributes::pow]

no_attributes::pow

[Figure: disassembly of no_attributes::pow]

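Analysis

We can see that with_attributes::pow and no_attributes::pow have a similar structure at the assembly level, but neither follows the structure of the C++ source. If the compiler had translated the C++ code literally, the result would look something like this: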

test   %rdi,%rdi
jle XXXXX_RET_ADDRESS
sub    $0x1,%rdi
movsd  %xmm0,0x8(%rsp)
call pow
……
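To make the cost of that literal structure concrete, the sketch below mirrors it in C++: one comparison and one call per recursion level. The names pow_literal and calls_for, and the call counter itself, are hypothetical helpers added for illustration; they are not part of the original benchmark.

```cpp
// Hypothetical illustration, not from the original article.
// Mirrors the source-level recursion one-to-one and counts how many
// times the function body is entered.
constexpr double pow_literal(double x, long long n, long long& calls) noexcept {
    ++calls;                                  // one entry per recursion level
    if (n <= 0)                               // test %rdi,%rdi ; jle ...
        return 1;
    return x * pow_literal(x, n - 1, calls);  // sub $0x1,%rdi ; call pow
}

// Returns the number of entries needed to compute pow(x, n) literally.
constexpr long long calls_for(long long n) noexcept {
    long long calls = 0;
    pow_literal(2.0, n, calls);
    return calls;
}
```

For n = 14, the largest exponent that calc passes in, calls_for(14) evaluates to 15: a literal translation pays for a compare, a branch, and a call at every one of those levels.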

Because we compiled with the -O3 optimization level, the compiler optimized both with_attributes::pow and no_attributes::pow.

Let's walk through how no_attributes::pow was optimized:

test %rdi,%rdi: tests n.
jle 0x555555555c20 <_ZN13no_attributes3powEdx+112>: if n is less than or equal to 0, jump to offset +112, which returns 1.
movapd %xmm0,%xmm1: n is greater than 0 here, so x is copied into the xmm1 register.
cmp $0x1,%rdi: compares n with 1.
je 0x555555555c1c <_ZN13no_attributes3powEdx+108>: if n is 1, jump to offset +108, which returns x, because pow(x, 1) = x * pow(x, 0) = x.
cmp $0x2,%rdi: compares n with 2.
je 0x555555555c18 <_ZN13no_attributes3powEdx+104>: if n is 2, jump to offset +104, which performs a single x * x (mulsd %xmm1,%xmm0), because pow(x, 2) = x * pow(x, 1) = x * x * pow(x, 0) = x * x * 1 = x * x.

The rest of the logic follows the same pattern.
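The steps above can be approximated in C++ as a chain of early exits for small n. This is only a hand-written model of the shape of the optimized code, not the compiler's actual output: the name pow_unrolled, the modelled depth of four levels, and the recursive fallback for larger n are all assumptions made for illustration. The real code also jumps into a fall-through chain of mulsd instructions rather than returning from separate expressions.

```cpp
// Reference: the recursive definition used in the article.
constexpr double pow_ref(double x, long long n) noexcept {
    return n <= 0 ? 1.0 : x * pow_ref(x, n - 1);
}

// Hypothetical model of the unrolled structure described above.
// The unroll depth (4) and the fallback call are assumptions.
constexpr double pow_unrolled(double x, long long n) noexcept {
    if (n <= 0) return 1.0;        // test %rdi,%rdi ; jle -> return 1
    if (n == 1) return x;          // cmp $0x1,%rdi ; je  -> return x
    if (n == 2) return x * x;      // cmp $0x2,%rdi ; je  -> one mulsd
    if (n == 3) return x * x * x;  // assumed next step: two multiplications
    // Assumed fallback: recurse only after the small cases are ruled out,
    // then apply the four multiplications peeled off above.
    return x * (x * (x * (x * pow_unrolled(x, n - 4))));
}
```

For example, pow_unrolled(2.0, 10) yields 1024.0, the same value as pow_ref(2.0, 10). The point of the sketch is that small exponents are resolved by comparisons and multiplications alone, without a call per level.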