文章目录
- 前言
- 一、Dynamic Tracing
- 二、kprobes
- 2.1 perf kprobe 的使用
- 2.2 kprobe Arguments
- 3.3 tcp_sendmsg()
- 3.3.1 Kernel: tcp_sendmsg()
- 3.3.2 Kernel: tcp_sendmsg() with size
- 3.3.2 Kernel: tcp_sendmsg() line number and local variable
- 三、uprobes的使用
- 3.1 perf uprobe 的使用
- 3.2 uprobe Arguments
- 参考资料
前言
perf使用术语探测事件来指kprobes、uprobes和USDT探测。这些是“动态”的,必须先初始化,然后才能跟踪它们:默认情况下,它们不在perf列表的输出中(某些USDT探针可能存在,因为它们已自动初始化)。一旦初始化,它们将被列为“Tracepoint event”。
一、Dynamic Tracing
跟踪点和动态跟踪之间的区别如下图所示,该图说明了常见跟踪点库的覆盖范围:
图片来自于:https://www.brendangregg.com/perf.html
tracepoint和kprobe之间的比较如下表所示:
虽然动态跟踪可以看到所有内容,但它也是一个不稳定的接口,因为它检测原始代码。这意味着您开发的任何动态跟踪工具都可能在内核补丁或更新后崩溃。首先尝试使用静态跟踪点,因为它们的接口应该更稳定。它们也可以更容易使用和理解,因为它们的设计考虑到了跟踪最终用户。
动态跟踪的一个好处是,它可以在实时系统上启用,而无需重新启动任何东西。您可以使用已经运行的内核或应用程序,然后开始动态检测,它(安全地)修补内存中的指令以添加检测。这意味着在您开始使用该功能之前,该功能的开销为零。前一刻,您的二进制文件未经修改且全速运行,下一刻,它将运行您动态添加的一些额外的检测指令。一旦您使用完动态跟踪会话,这些指令最终应该被删除。
使用动态跟踪和执行额外指令时的开销与检测事件的频率乘以每个检测所做的工作有关。
对于内核分析,使用CONFIG_KPROBES=y和CONFIG_KPROBE_EVENTS=y来启用内核动态跟踪,而CONFIG_FRAME_POINTER=y用于基于帧指针的内核堆栈。对于用户级分析,CONFIG_UPROBES=y和CONFIG_UROBE_EVENTS=y用于用户级动态跟踪。
使用 perf-probe 命令动态跟踪函数:
NAME
perf-probe - Define new dynamic tracepoints
DESCRIPTION
This command defines dynamic tracepoint events, by symbol and registers without debuginfo, or by C expressions (C line numbers, C function names, and C local variables) with debuginfo.
常用的一些参数选项:
-a, --add=
Define a probe event (see PROBE SYNTAX for detail).
-d, --del=
Delete probe events. This accepts glob wildcards(*, ?) and character classes(e.g. [a-z], [!A-Z]).
-l, --list[=[GROUP:]EVENT]
List up current probe events. This can also accept filtering patterns of event names. When this is used with --cache, perf shows all cached probes instead of the live probes.
二、kprobes
2.1 perf kprobe 的使用
以下是创建和使用kprobe的典型工作流,在本例中,用于检测do_nanosleep()内核函数。
动态跟踪的函数要在内核符号文件/proc/kallsyms中:
[root@localhost ~]# cat /proc/kallsyms | grep do_nanosleep
ffffffffb0f66b40 t do_nanosleep
在没有动态跟踪do_nanosleep之前:
[root@localhost ~]# perf list probe:do_nanosleep
List of pre-defined events (to be used in -e):
Metric Groups:
使用perf list查询不到do_nanosleep事件。
接下来动态跟踪do_nanosleep函数:
perf probe --add do_nanosleep
perf record -e probe:do_nanosleep -a sleep 5
perf script
perf probe --del do_nanosleep
kprobe是使用probe子命令和–add(–add是可选的)创建的,当不再需要它时,将使用probe和–del删除它。以下是该序列的输出,包括列出探测事件:
(1)添加动态事件:
[root@localhost ~]# perf probe --add do_nanosleep
Added new event:
probe:do_nanosleep (on do_nanosleep)
You can now use it in all perf tools, such as:
perf record -e probe:do_nanosleep -aR sleep 1
(2)查看动态事件(do_nanosleep初始化后,将被列为“Tracepoint event”):
[root@localhost ~]# perf list probe:do_nanosleep
List of pre-defined events (to be used in -e):
probe:do_nanosleep [Tracepoint event]
Metric Groups:
(3)record/script 动态事件
[root@localhost ~]# perf record -e probe:do_nanosleep -aR sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.618 MB perf.data (3 samples) ]
[root@localhost ~]# perf script
sleep 30562 [001] 63910.281197: probe:do_nanosleep: (ffffffffb0f66b40)
sleep 30568 [000] 63910.340391: probe:do_nanosleep: (ffffffffb0f66b40)
pool 6556 [002] 63910.844546: probe:do_nanosleep: (ffffffffb0f66b40)
[root@localhost ~]# cat /proc/kallsyms | grep do_nanosleep
ffffffffb0f66b40 t do_nanosleep
(4)删除动态事件
[root@localhost ~]# perf probe --del probe:do_nanosleep
Removed event: probe:do_nanosleep
(5)kretprobe
可以通过添加%return来检测函数的返回:
[root@localhost ~]# perf probe --add do_nanosleep%return
Added new event:
probe:do_nanosleep (on do_nanosleep%return)
You can now use it in all perf tools, such as:
perf record -e probe:do_nanosleep -aR sleep 1
这使用kretprobe:
[root@localhost ~]# perf probe --add do_nanosleep%return
Added new event:
probe:do_nanosleep (on do_nanosleep%return)
You can now use it in all perf tools, such as:
perf record -e probe:do_nanosleep -aR sleep 1
[root@localhost ~]#
[root@localhost ~]# perf record -e probe:do_nanosleep -aR sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.611 MB perf.data (2 samples) ]
[root@localhost ~]# perf script
pool 6556 [000] 64259.662728: probe:do_nanosleep: (ffffffffb0f66b40 <- ffffffffb08c6a6b)
sleep 518 [001] 64260.171569: probe:do_nanosleep: (ffffffffb0f66b40 <- ffffffffb08c6a6b)
[root@localhost ~]# perf probe --del probe:do_nanosleep
Removed event: probe:do_nanosleep
2.2 kprobe Arguments
至少有四种不同的方式将参数插入内核函数。
(1)
首先,如果内核debuginfo可用,那么perf可以获得函数变量(包括参数)的信息。使用–vars选项列出do_nanosleep()kprobe的变量:
-V, --vars=
Show available local variables at given probe point. The argument syntax is same as PROBE SYNTAX, but NO ARGs.
[root@localhost ~]# perf probe --vars do_nanosleep
Available variables at do_nanosleep
@<do_nanosleep+0>
enum hrtimer_mode mode
struct hrtimer_sleeper* t
此输出显示名为mode和t的变量,它们是do_nanosleep()的入口参数。可以在创建探头时添加这些内容,以便在记录时包含这些内容。例如,添加模式:
[root@localhost ~]# perf probe 'do_nanosleep mode'
Added new event:
probe:do_nanosleep (on do_nanosleep with mode)
You can now use it in all perf tools, such as:
perf record -e probe:do_nanosleep -aR sleep 1
[root@localhost ~]# perf record -e probe:do_nanosleep -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.605 MB perf.data (3 samples) ]
[root@localhost ~]# perf script
pool 6556 [000] 64979.664437: probe:do_nanosleep: (ffffffffb0f66b40) mode=0x1
sleep 5962 [002] 64980.279301: probe:do_nanosleep: (ffffffffb0f66b40) mode=0x1
pool 6556 [000] 64980.316230: probe:do_nanosleep: (ffffffffb0f66b40) mode=0x1
输出显示 mode 变量等于1。
(2)
第二,如果内核debuginfo不可用(正如我在生产中经常发现的那样),那么可以通过它们的寄存器位置读取参数。一个技巧是使用相同的系统(相同的硬件和内核)并在其上安装内核调试信息以供参考。然后,可以使用-n(dry run)和-v(verbose)选项查询此参考系统以查找寄存器位置,以执行探测:
[root@localhost ~]# perf probe -nv 'do_nanosleep mode'
......
Found 1 probe_trace_events.
Opening /sys/kernel/debug/tracing//kprobe_events write=1
Writing event: p:probe/do_nanosleep _text+7760704 mode=%si:u32
Added new event:
probe:do_nanosleep (on do_nanosleep with mode)
......
由于它是一个 dry run,因此不会创建事件。但是输出显示了模式变量的位置:它位于寄存器%si中,并打印为32位十六进制数(u32)。现在可以通过复制和粘贴模式声明字符串(mode=%si:x32)在无debuginfo系统上使用此语法:
[root@localhost ~]# perf probe 'do_nanosleep mode=%si:u32'
Added new event:
probe:do_nanosleep (on do_nanosleep with mode=%si:u32)
You can now use it in all perf tools, such as:
perf record -e probe:do_nanosleep -aR sleep 1
[root@localhost ~]# perf record -e probe:do_nanosleep -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.606 MB perf.data (3 samples) ]
[root@localhost ~]# perf script
pool 6556 [000] 65807.996795: probe:do_nanosleep: (ffffffffb0f66b40) mode=0x1
sleep 11789 [002] 65808.370905: probe:do_nanosleep: (ffffffffb0f66b40) mode=0x1
pool 6556 [000] 65808.648566: probe:do_nanosleep: (ffffffffb0f66b40) mode=0x1
只有当系统具有相同的处理器ABI和内核版本时,这才有效,否则可能会检测到错误的寄存器位置。
(3)
第三,如果您知道处理器ABI,可以自己确定寄存器位置。下一节中给出了一个 uprobes 的示例。
(4)
第四,有一个新的内核调试信息源:BPF类型格式(BTF)。默认情况下,这更可能是可用的,将来的perf版本应该支持它作为备用的debuginfo源。
(5)
对于使用kretprobe检测的do_nanosleep的返回,可以使用特殊的$retval变量读取返回值:
perf probe 'do_nanosleep%return $retval'
查看内核源代码以确定返回值包含的内容。
3.3 tcp_sendmsg()
3.3.1 Kernel: tcp_sendmsg()
此示例显示在Linux 3.10.0内核上检测内核tcp_sendmsg()函数:
[root@localhost perf]# cat /proc/kallsyms | grep tcp_sendmsg
ffffffffb0e93fc0 T tcp_sendmsg
[root@localhost perf]# perf probe --add tcp_sendmsg
Added new event:
probe:tcp_sendmsg (on tcp_sendmsg)
You can now use it in all perf tools, such as:
perf record -e probe:tcp_sendmsg -aR sleep 1
[root@localhost perf]# perf record -e probe:tcp_sendmsg -a -g -- sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.632 MB perf.data (42 samples) ]
[root@localhost perf]# perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 42 of event 'probe:tcp_sendmsg'
# Event count (approx.): 42
#
# Children Self Trace output
# ........ ........ ..................
#
100.00% 100.00% (ffffffffb0e93fc0)
|
---__GI___libc_write
system_call
sys_write
vfs_write
do_sync_write
sock_aio_write
tcp_sendmsg
#
# (Tip: For memory address profiling, try: perf mem record / perf mem report)
这显示了write()系统调用到tcp_sendmsg()的路径。
[root@localhost perf]# perf probe --del tcp_sendmsg
Removed event: probe:tcp_sendmsg
3.3.2 Kernel: tcp_sendmsg() with size
如果内核具有debuginfo(CONFIG_DEBUG_INFO=y),则可以从函数中找出内核变量。这是检查size_ t(整数)的简单示例:
列出tcp_sendmsg()可用的变量:
-V, --vars=
Show available local variables at given probe point. The argument syntax is same as PROBE SYNTAX, but NO ARGs.
[root@localhost perf]# perf probe -V tcp_sendmsg
Available variables at tcp_sendmsg
@<tcp_sendmsg+0>
int size_goal
long int timeo
size_t size
struct kiocb* iocb
struct msghdr* msg
struct sock* sk
使用“size”变量为tcp_sendmsg()创建探测:
[root@localhost perf]# perf probe --add 'tcp_sendmsg size'
Added new event:
probe:tcp_sendmsg (on tcp_sendmsg with size)
You can now use it in all perf tools, such as:
perf record -e probe:tcp_sendmsg -aR sleep 1
Tracing this probe:
[root@localhost perf]# perf record -e probe:tcp_sendmsg -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.628 MB perf.data (49 samples) ]
[root@localhost perf]# perf script
sshd 29204 [001] 82790.664230: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x50
sshd 29204 [001] 82790.664939: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x60
sshd 29204 [001] 82790.665627: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x50
sshd 29204 [001] 82790.665712: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x30
sshd 29204 [001] 82790.667466: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x40
sshd 29204 [001] 82790.667566: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x30
sshd 29204 [001] 82790.668484: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x30
sshd 29204 [001] 82790.668582: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x30
sshd 29204 [001] 82791.672211: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x40
sshd 29204 [001] 82791.674832: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x60
sshd 29204 [001] 82791.677249: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x50
sshd 29204 [001] 82791.677519: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x30
sshd 29204 [001] 82791.681480: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x40
sshd 29204 [001] 82791.681772: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x30
sshd 29204 [001] 82791.684988: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x30
sshd 29204 [001] 82791.685259: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x30
sshd 29204 [001] 82792.689914: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x40
sshd 29204 [001] 82792.692504: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x60
sshd 29204 [001] 82792.694701: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x50
sshd 29204 [001] 82792.694949: probe:tcp_sendmsg: (ffffffffb0e93fc0) size=0x30
......
3.3.2 Kernel: tcp_sendmsg() line number and local variable
使用debuginfo,perf_events可以为内核函数内的行创建跟踪点。列出tcp_sendmsg()的可用行探测:
-L, --line=
Show source code lines which can be probed. This needs an argument which specifies a range of the source code. (see LINE SYNTAX for detail)
[root@localhost perf]# perf probe -L tcp_sendmsg
<tcp_sendmsg@/usr/src/debug/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/net/ipv4/tcp.c:0>
0 int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
size_t size)
2 {
struct iovec *iov;
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
6 int iovlen, flags, err, copied = 0;
7 int mss_now = 0, size_goal, copied_syn = 0, offset = 0;
bool sg;
long timeo;
11 lock_sock(sk);
13 flags = msg->msg_flags;
14 if (flags & MSG_FASTOPEN) {
15 err = tcp_sendmsg_fastopen(sk, msg, &copied_syn, size);
16 if (err == -EINPROGRESS && copied_syn > 0)
goto out;
18 else if (err)
goto out_err;
offset = copied_syn;
}
......
}
[root@localhost perf]# perf probe -V tcp_sendmsg:81
Available variables at tcp_sendmsg:81
@<tcp_sendmsg+826>
bool sg
int copied
int flags
int iovlen
int mss_now
int offset
int size_goal
long int timeo
size_t seglen
struct iovec* iov
struct sock* sk
unsigned char* from
现在让我们跟踪第81行,并在循环中检查seglen变量:
[root@localhost perf]# perf probe --del tcp_sendmsg
Removed event: probe:tcp_sendmsg
[root@localhost perf]# perf probe --add 'tcp_sendmsg:81 seglen'
Added new event:
probe:tcp_sendmsg (on tcp_sendmsg:81 with seglen)
You can now use it in all perf tools, such as:
perf record -e probe:tcp_sendmsg -aR sleep 1
[root@localhost perf]# perf record -e probe:tcp_sendmsg -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.619 MB perf.data (33 samples) ]
[root@localhost perf]# perf script
sshd 29204 [001] 83979.203223: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x50
sshd 29204 [001] 83979.205938: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x60
sshd 29204 [001] 83979.208299: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x50
sshd 29204 [001] 83979.208539: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [001] 83979.214403: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x40
sshd 29204 [001] 83979.214656: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [001] 83979.218265: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [001] 83979.218564: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [001] 83980.223443: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x50
sshd 29204 [001] 83980.226039: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [001] 83980.226248: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x70
sshd 29204 [002] 83980.229020: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x50
sshd 29204 [002] 83980.229094: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [002] 83980.232982: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x40
sshd 29204 [002] 83980.233220: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [002] 83980.236385: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [002] 83980.236677: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [002] 83981.241734: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x50
sshd 29204 [002] 83981.244477: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x60
sshd 29204 [002] 83981.246985: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x50
sshd 29204 [002] 83981.247241: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [002] 83981.251093: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x40
sshd 29204 [002] 83981.251360: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [002] 83981.254500: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [002] 83981.254828: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [002] 83982.259723: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x40
sshd 29204 [002] 83982.262445: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x60
sshd 29204 [002] 83982.264933: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x50
sshd 29204 [002] 83982.265127: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [002] 83982.268984: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x40
sshd 29204 [002] 83982.269206: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [002] 83982.272402: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
sshd 29204 [002] 83982.272704: probe:tcp_sendmsg: (ffffffffb0e942fa) seglen=0x30
还可以在内核过滤中使用–filter,以仅匹配所需的数据。
三、uprobes的使用
uprobes(用户空间探测器)与kprobes类似,但适用于用户空间。它们可以动态地插入应用程序和库中的函数,并提供一个不稳定的API,用于深入其他工具范围之外的软件内部。uprobes在2012年发布的Linux3.5中提供。
3.1 perf uprobe 的使用
在使用 perf 时,uprobes的创建与kprobes类似。例如,要为libc文件打开函数创建uprobe,fopen:
# perf probe -x /lib/x86_64-linux-gnu/libc.so.6 --add fopen
Added new event:
probe_libc:fopen (on fopen in /lib/x86_64-linux-gnu/libc-2.27.so)
二进制路径使用-x指定。uprobe名为probe_libc:fopen,现在可以与perf record一起使用来记录事件。
完成uprobe后,可以使用–del:
# perf probe --del probe_libc:fopen
Removed event: probe_libc:fopen
可以通过添加%return来检测函数的返回:
# perf probe -x /lib/x86_64-linux-gnu/libc.so.6 --add fopen%return
使用uretpobe。
3.2 uprobe Arguments
如果您的系统具有目标二进制文件的debuginfo,则变量信息(包括参数)可能可用。这可以使用–vars列出:
# perf probe -x /lib/x86_64-linux-gnu/libc.so.6 --vars fopen
Available variables at fopen
@<_IO_vfscanf+15344>
char* filename
char* mode
输出显示fopen具有 filename 和 mode 变量。创建探针时可以添加以下内容:
perf probe -x /lib/x86_64-linux-gnu/libc.so.6 --add 'fopen filename mode'
Debuginfo可以通过-dbg或-dbgsym包提供。如果这在目标系统上不可用,但在另一个系统上,则其他系统可以用作参考系统,如前一节kprobes中所示。
参考资料
https://www.brendangregg.com/perf.html