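I. Cache/TLB
Following the principles of spatial and temporal locality, CPUs include caches and TLBs to speed up data access. In large applications, however, several key hardware structures (caches, TLBs, and branch predictors) come under heavy pressure, so large binaries often show poor CPU performance.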
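To make the locality argument concrete, here is a minimal sketch that walks the same matrix in two orders. The names (`ROWS`, `COLS`, `fill_matrix`, `sum_row_major`, `sum_col_major`) are hypothetical and chosen only for illustration. Both functions return the same sum; the row-major walk touches adjacent addresses, while the column-major walk strides across rows and tends to use only one element per fetched cache line.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Fill the matrix with a simple, predictable pattern. */
void fill_matrix(int m[ROWS][COLS])
{
    assert(m != NULL);
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            m[r][c] = (int)(r + c);
}

/* Row-major walk: consecutive iterations read adjacent addresses,
   so every element of a fetched cache line is used (spatial locality). */
long sum_row_major(int m[ROWS][COLS])
{
    assert(m != NULL);
    long total = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += m[r][c];
    return total;
}

/* Column-major walk: same result, but each step jumps
   COLS * sizeof(int) bytes, so cache lines are reused poorly. */
long sum_col_major(int m[ROWS][COLS])
{
    assert(m != NULL);
    long total = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += m[r][c];
    return total;
}
```

Timing the two functions on a matrix larger than the last-level cache (or counting cache misses with perf) is one way to observe the difference; the actual gap depends on the hardware.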
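II. Binary compilation optimization
1. Active cache optimization
Proactively identify the likely causes of cache misses, and compile with an explicitly specified layout for the binary's bss, data, and text sections.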
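One way to influence section layout from source code is through GCC attributes. The sketch below is an assumption about how this could be done, not the specific method the text refers to: `lookup_table`, `report_bad_index`, and `lookup` are hypothetical names, and the 64-byte alignment assumes 64-byte cache lines. According to the GCC documentation, on many targets functions marked `hot` or `cold` are placed in dedicated subsections of the text section, so hot code ends up close together.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical read-mostly table, aligned to an assumed 64-byte cache line
   so that it starts on a line boundary in the data section. */
static int lookup_table[16] __attribute__((aligned(64))) = {
    0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225
};

/* Rarely executed error path: marked cold so GCC can move it away
   from the hot code and optimize it for size. */
__attribute__((cold, noinline))
int report_bad_index(int idx)
{
    fprintf(stderr, "index %d out of range\n", idx);
    return -1;
}

/* Frequently executed path: marked hot so GCC can group it with
   other hot functions. */
__attribute__((hot))
int lookup(int idx)
{
    /* __builtin_expect tells GCC the range check almost never fails. */
    if (__builtin_expect(idx < 0 || idx >= 16, 0))
        return report_bad_index(idx);
    return lookup_table[idx];
}
```

The resulting placement can be inspected with a tool such as objdump; finer control over section order generally needs a linker script, which is outside this sketch.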
2. Feedback-directed optimization
Compiler packages for openEuler 20.03 that support the optimization options below can be downloaded from:
https://repo.huaweicloud.com/openeuler/openEuler-20.03-LTS/update/aarch64/Packages/
https://repo.openeuler.org/openEuler-20.03-LTS-SP1/EPOL/update/aarch64/Packages/
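PGO (profile-guided optimization), also commonly called FDO (feedback-directed optimization), is a compiler optimization technique in which the compiler uses runtime profiling information from the program to generate higher-quality code and thereby improve performance.
LTO (Link Time Optimization) and PGO (Profile Guided Optimization)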
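References:
ByteDance's exploration and practice with PGO feedback optimization - Juejin (juejin.cn)
[Xinchuang] JED on Kunpeng (ARM): tuning steps and results - Zhihu (zhihu.com)
Open-source options - GCC for openEuler option support - User Guide (GCC for openEuler) - … - Documentation home - Kunpeng Community (hikunpeng.com)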
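1. FDO
Feedback-directed optimization (FDO) is also known as profile-guided optimization (PGO).
Classic FDO relies on instrumentation-based profiling, which requires a special instrumented build of the application (-fprofile-generate) to collect profile data; the collected profile is then consumed with -fprofile-use.
The sampling-based variant (AutoFDO) does not need an instrumented build; its compile option is -fauto-profile=
References:
Optimize Options (Using the GNU Compiler Collection (GCC))
How GCC makes efficient use of the cache: principles and options - Zhihu (zhihu.com)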
-fprofile-use
-fprofile-use=path
Enable profile feedback-directed optimizations, and the following optimizations, many of which are generally profitable only with profile feedback available:
-fbranch-probabilities -fprofile-values
-funroll-loops -fpeel-loops -ftracer -fvpt
-finline-functions -fipa-cp -fipa-cp-clone -fipa-bit-cp
-fpredictive-commoning -fsplit-loops -funswitch-loops
-fgcse-after-reload -ftree-loop-vectorize -ftree-slp-vectorize
-fvect-cost-model=dynamic -ftree-loop-distribute-patterns
-fprofile-reorder-functions
Before you can use this option, you must first generate profiling information. See Program Instrumentation Options, for information about the -fprofile-generate option.
By default, GCC emits an error message if the feedback profiles do not match the source code. This error can be turned into a warning by using -Wno-error=coverage-mismatch. Note this may result in poorly optimized code. Additionally, by default, GCC also emits a warning message if the feedback profiles do not exist (see -Wmissing-profile).
If path is specified, GCC looks at the path to find the profile feedback data files. See -fprofile-dir.
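The instrumented workflow described above can be sketched as follows. The function `count_negative`, the file name app.c, and the assumption that negative inputs are rare in the real workload are all hypothetical; only the -fprofile-generate and -fprofile-use options come from the GCC documentation quoted here.

```c
#include <assert.h>
#include <stddef.h>

/* Counts negative values. In the assumed workload nearly all inputs are
   non-negative, so the branch below is rarely taken. A profile lets GCC
   learn that from real runs instead of guessing. */
size_t count_negative(const int *v, size_t n)
{
    assert(v != NULL || n == 0);
    size_t negatives = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < 0)
            negatives++;
    }
    return negatives;
}

/*
 * Build steps, assuming this code lives in a program built from app.c:
 *
 *   gcc -O2 -fprofile-generate app.c -o app   instrumented build
 *   ./app                                     run a representative workload;
 *                                             profile data (.gcda) is written
 *   gcc -O2 -fprofile-use app.c -o app        rebuild using the profile
 *
 * The source must not change between the two builds, otherwise GCC
 * reports a coverage mismatch as described above.
 */
```

The quality of the result depends on how representative the training run is; a workload that exercises unusual paths can steer the optimizer in the wrong direction.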
-fauto-profile
-fauto-profile=path
Enable sampling-based feedback-directed optimizations, and the following optimizations, many of which are generally profitable only with profile feedback available:
-fbranch-probabilities -fprofile-values
-funroll-loops -fpeel-loops -ftracer -fvpt
-finline-functions -fipa-cp -fipa-cp-clone -fipa-bit-cp
-fpredictive-commoning -fsplit-loops -funswitch-loops
-fgcse-after-reload -ftree-loop-vectorize -ftree-slp-vectorize
-fvect-cost-model=dynamic -ftree-loop-distribute-patterns
-fprofile-correction
path is the name of a file containing AutoFDO profile information. If omitted, it defaults to fbdata.afdo in the current directory.
Producing an AutoFDO profile data file requires running your program with the perf utility on a supported GNU/Linux target system. For more information, see https://perf.wiki.kernel.org/.
E.g.
perf record -e br_inst_retired:near_taken -b -o perf.data \
-- your_program
Then use the create_gcov tool to convert the raw profile data to a format that can be used by GCC. You must also supply the unstripped binary for your program to this tool. See https://github.com/google/autofdo.
E.g.
create_gcov --binary=your_program.unstripped --profile=perf.data \
--gcov=profile.afdo
2. LTO
Link-time optimization (LTO)
References:
LTO (GNU Compiler Collection (GCC) Internals)
Optimize Options (Using the GNU Compiler Collection (GCC))
-flto[=n]
This option runs the standard link-time optimizer. When invoked with source code, it generates GIMPLE (one of GCC’s internal representations) and writes it to special ELF sections in the object file. When the object files are linked together, all the function bodies are read from these ELF sections and instantiated as if they had been part of the same translation unit.
To use the link-time optimizer, -flto and optimization options should be specified at compile time and during the final link. It is recommended that you compile all the files participating in the same link with the same options and also specify those options at link time. For example:
gcc -c -O2 -flto foo.c
gcc -c -O2 -flto bar.c
gcc -o myprog -flto -O2 foo.o bar.o
The first two invocations to GCC save a bytecode representation of GIMPLE into special ELF sections inside foo.o and bar.o. The final invocation reads the GIMPLE bytecode from foo.o and bar.o, merges the two files into a single internal image, and compiles the result as usual. Since both foo.o and bar.o are merged into a single image, this causes all the interprocedural analyses and optimizations in GCC to work across the two files as if they were a single one. This means, for example, that the inliner is able to inline functions in bar.o into functions in foo.o and vice-versa.
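3. BOLT
Binary Optimization and Layout Tool: a post-link optimizer built on top of the LLVM framework.
This optimization reuses the profile that the compiler obtained from instrumentation-based or automatic feedback optimization, converts it into a BOLT-format profile, and invokes BOLT, so that post-link optimization is completed automatically.
BOLT's results indicate that compile-time, link-time, and post-link-time FDO do not replace one another; they are complementary.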
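Optimizing code layout is an important source of performance gains, and workable techniques already exist at both compile time and link time. BOLT (Binary Optimization and Layout Tool) is instead a post-link optimizer: using sampling-based profile information, it can further improve the runtime performance of a binary even after FDO (feedback-driven optimization) and LTO (link-time optimization) have been applied, which makes it a complementary technique.
References:
BOLT: an introduction to post-link optimization - Zhihu (zhihu.com)
Install llvm-bolt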
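AutoBOLT mode:
This mode must be used together with -fauto-profile or -fprofile-use, and -Wl,-q must be added to keep relocation information. Using a program named test as an example:
gcc -g -O2 -o test test.c -fauto-profile=test.gcda -fauto-bolt -Wl,-q
or
gcc -g -O2 -o test test.c -fprofile-use -fauto-bolt -Wl,-q
BOLT use mode:
This mode requires the profile needed for BOLT optimization to be prepared in advance. The profile can be obtained through AutoBOLT mode or with the perf2bolt tool.
gcc -g -O2 -o test test.c -fbolt-use=data.fdata -Wl,-q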
4. -fprefetch-loop-arrays
Prefetch optimization plus feedback-directed prefetch optimization
References:
Kunpeng Community official website (hikunpeng.com)
Optimize Options (Using the GNU Compiler Collection (GCC))
Upstream GCC:
-fprefetch-loop-arrays
If supported by the target machine, generate instructions to prefetch memory to improve the performance of loops that access large arrays.
This option may generate better or worse code; results are highly dependent on the structure of loops within the source code.
Disabled at level -Os.
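To show what kind of code this option aims to generate, the sketch below inserts the prefetch by hand with GCC's `__builtin_prefetch`. It is only an illustration: `sum_with_prefetch` is a hypothetical name, and `PREFETCH_DISTANCE` of 16 elements is an assumed value, whereas the compiler computes its own prefetch distance from target parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed prefetch distance in elements; a real value depends on
   memory latency and on how much work each iteration does. */
#define PREFETCH_DISTANCE 16

/* Sums a large array while asking the CPU to start loading the element
   PREFETCH_DISTANCE iterations ahead of the one being read. */
long sum_with_prefetch(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 0 = prefetch for read;
               third argument 3 = keep in all cache levels. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += a[i];
    }
    return total;
}
```

For a simple sequential walk like this one, hardware prefetchers often already do well, which is one reason the option may produce better or worse code depending on the loop.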
openEuler GCC enhancements:
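-fprefetch-loop-arrays=0: the original prefetch algorithm (0 is the default when no value is given);
-fprefetch-loop-arrays=1: prefetch-distance algorithm with simplified branches;
-fprefetch-loop-arrays=2: branch-weighted prefetch-distance algorithm;
-fprefetch-loop-arrays=[value] -fauto-profile=xxx.gcov -fcache-misses-profile=xxx.gcov: feedback-directed software prefetching;
Configurable parameters:
--param param-prefetch-func-topn=n: select the top n hot functions; default: 3
--param param-prefetch-ref-topn=n: select the top n hot memory-access objects; default: 5
--param param-high-loop-execution-rate=n: select loops whose execution rate is above n%; default: 95%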
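5. Others
Reference: Kunpeng Community official website (hikunpeng.com)
-fipa-reorder-fields: memory layout optimization. Sorts struct members from largest to smallest by size to reduce the padding introduced by alignment, which shrinks the struct's overall memory footprint and improves the cache hit rate.
-fipa-struct-reorg: memory layout optimization. Rearranges how struct members are laid out in memory to improve the cache hit rate.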
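The effect of ordering members from largest to smallest can be reproduced by hand. The sketch below is an illustration of the idea, not the compiler's implementation; the struct names and members are hypothetical. On typical LP64 ABIs such as AArch64 and x86-64, the original layout occupies 24 bytes and the reordered one 16 bytes; other ABIs may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Declaration order mixes small and large members, so the compiler
   must insert padding before value and before count. */
struct record_original {
    char   tag;
    double value;
    char   flag;
    int    count;
};

/* Same members sorted from largest to smallest: the small members
   pack together and less padding is needed. */
struct record_reordered {
    double value;
    int    count;
    char   tag;
    char   flag;
};

/* Reordering should never make the struct larger. */
static_assert(sizeof(struct record_reordered) <= sizeof(struct record_original),
              "reordered layout must not grow");

/* Number of bytes saved per object by the reordered layout. */
size_t padding_saved(void)
{
    return sizeof(struct record_original) - sizeof(struct record_reordered);
}
```

A smaller struct means more objects fit in each cache line, which is where the improved hit rate comes from when large arrays of such objects are traversed.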
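III. Performance impact
1. NUMA scenarios
Avoid letting threads migrate across cores, which causes (1) frequent L1 cache invalidation and (2) possible cross-NUMA-node access, both of which degrade performance.
Common techniques:
Core binding and NUMA node binding
Code segment replicas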
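Core binding can be done from inside the program on Linux with `sched_setaffinity`. The sketch below is a minimal example under that assumption; `first_allowed_cpu` and `pin_to_cpu` are hypothetical helper names. It first looks up a CPU the thread is actually allowed to use, since containers and cpusets may exclude CPU 0.

```c
#define _GNU_SOURCE   /* required by glibc for cpu_set_t and the CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Returns the lowest-numbered CPU the calling thread may run on,
   or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set))
            return cpu;
    }
    return -1;
}

/* Pins the calling thread to a single CPU.
   Returns 0 on success, -1 on failure with errno set. */
int pin_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means the calling thread. */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

The same effect is often achieved without code changes by launching the process under taskset or numactl; binding memory to a NUMA node is a separate step that this sketch does not cover.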
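2. Reducing TLB misses
Common techniques:
TLB miss
3. Reducing scheduling
Common techniques:
Core isolation
4. Memory allocator
Common techniques:
tcmalloc
5. Application optimization
Common techniques:
Find hot functions with perf and reduce the hotspots
6. Compiler optimization
Common techniques:
PGO feedback-directed compilation optimization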