为什么编译器在这里错过矢量化？

Question

OopsUser

Asked: 2025-04-27 21:24:35 +0800 CST2025-04-27 21:24:35 +0800 CST 2025-04-27 21:24:35 +0800 CST

为什么分支（未命中）预测不会影响性能（C++）？

772

在尝试衡量分支未命中预测的影响时，我注意到分支未命中预测根本没有任何惩罚。

根据著名的堆栈溢出问题：为什么处理排序数组比处理未排序数组更快？

我写了一段简单的代码来测量分支预测的惩罚。

用随机数填充数组
计算 5 以上的数字（应该有很多预测错误） - 测量它
对数组进行排序
计算 5 以上的数字（应该很少有预测失误）——测量它

运行代码后，我得到了两次测量几乎相同的结果。

已测试：

Visual Studio 2017，发布（最大优化（优先速度）（/O2）），Windows。
Linux，g++-Ofast

然后，我取出了上面链接的问题中的原始代码，排序后的数组仍然没有任何改进。这是为什么呢？分支预测的好处又在哪里呢？

#include <iostream>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>

int main()
{
    // Step 1: Allocate a vector of size 1 million
    std::vector<int> vec(100'000'000);

    // Step 2: Fill it with random numbers between 0 and 10
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, 10);

    for (auto& val : vec)
    {
        val = dis(gen);
    }

    // Step 3: Count numbers above 5 (and measure time)
    auto start = std::chrono::high_resolution_clock::now();
    int count_above_5 = 0;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            ++count_above_5;
        }
    }

    auto end = std::chrono::high_resolution_clock::now();

    auto duration_before_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

    std::cout << "Count of numbers above 5 (before sorting): " << count_above_5 << std::endl;
    std::cout << "Time taken (before sorting): " << duration_before_sort << " ns" << std::endl;

    // Step 4: Sort the array
    std::sort(vec.begin(), vec.end());

    // Step 5: Count numbers above 5 in the sorted array (and measure time)

    start = std::chrono::high_resolution_clock::now();
    count_above_5 = 0;
    for (size_t i = 0; i < vec.size(); i++)
    {
        if (vec[i] < 5)
        {
            ++count_above_5;
        }
    }
    end = std::chrono::high_resolution_clock::now();


    auto duration_after_sort = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

    std::cout << "Count of numbers above 5 (after sorting): " << count_above_5 << std::endl;
    std::cout << "Time taken (after sorting):  " << duration_after_sort << " ns" << std::endl;

    return 0;
}

1 个回答

Voted

Jérôme Richard · Answer 1 · 2025-04-27T22:20:22+08:00

TL;DR： GCC、MSVC 和 Clang 都生成无分支汇编代码，因此没有实际分支，也不会受到分支（未命中）预测的影响。

在我的机器上，使用 GCC 14.2.0 的 Linux 系统上，分支预测不会带来任何影响，因为实际上没有分支。事实上， GCC使用 SIMD 指令生成了无分支的汇编代码：

180:   
    movdqu  xmm2,XMMWORD PTR [rax]   ; Load a block of items from the array `vec`
    movdqa  xmm1,XMMWORD PTR [rsp]   ; Reload xmm1 from memory
    add     rax,0x10                 ; Move the `rax` pointer to the next block
    pcmpgtd xmm1,xmm2                ; Compare items of the block with 5 and move the mask in `xmm1`
    psubd   xmm0,xmm1                ; Increment the number of item found in `xmm0`
    cmp     rax,rbp                  ; Loop until we reach the last block
    jne     180

在Godbolt上，我们可以看到 MSVC 和 Clang 都生成了无分支代码。以下是 MSVC 生成的代码（没有使用 SIMD 指令，而是cmovge使用了效率较低的指令）：

$LL7@main:
    lea     eax, DWORD PTR [rdi+1]
    cmp     DWORD PTR [rsi+rcx*4], 5
    cmovge  eax, edi
    mov     edi, eax
    inc     rcx
    cmp     rcx, r14
    jb      SHORT $LL7@main

这里是 Clang 生成的代码（使用 SIMD 指令并展开循环 4 次）：

.LBB0_10:
    movdqu  xmm0, xmmword ptr [rbx + 4*r12 - 48]
    movdqu  xmm1, xmmword ptr [rbx + 4*r12 - 32]
    movdqu  xmm2, xmmword ptr [rbx + 4*r12 - 16]
    movdqu  xmm3, xmmword ptr [rbx + 4*r12]
    movdqa  xmm4, xmm5
    pcmpgtd xmm4, xmm0
    psubd   xmm6, xmm4
    movdqa  xmm0, xmm5
    pcmpgtd xmm0, xmm1
    psubd   xmm7, xmm0
    movdqa  xmm0, xmm5
    pcmpgtd xmm0, xmm2
    psubd   xmm6, xmm0
    movdqa  xmm0, xmm5
    pcmpgtd xmm0, xmm3
    psubd   xmm7, xmm0
    add     r12, 16
    cmp     r12, 100012
    jne     .LBB0_10

为什么分支（未命中）预测不会影响性能（C++）？

重新格式化数字，在固定位置插入分隔符

为什么 C++20 概念会导致循环约束错误，而老式的 SFINAE 不会？

VScode 自动卸载扩展的问题（Material 主题）

Vue 3：创建时出错“预期标识符但发现‘导入’”[重复]

具有指定基础类型但没有枚举器的“枚举类”的用途是什么？

如何修复未手动导入的模块的 MODULE_NOT_FOUND 错误？

`(表达式，左值) = 右值` 在 C 或 C++ 中是有效的赋值吗？为什么有些编译器会接受/拒绝它？

在 C++ 中，一个不执行任何操作的空程序需要 204KB 的堆，但在 C 中则不需要

PowerBI 目前与 BigQuery 不兼容：Simba 驱动程序与 Windows 更新有关

AdMob：MobileAds.initialize() - 对于某些设备，“java.lang.Integer 无法转换为 java.lang.String”

为什么分支（未命中）预测不会影响性能（C++）？

1 个回答

相关问题