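Analyzing a GPU illegal-address access with crash
1. Problem description
During monkey soak testing of our product, the GPU driver very occasionally accessed an illegal address and caused a kernel panic. After the panic, the ramdump mechanism was triggered and the resulting ramdump files were captured, then analyzed offline with the crash tool.
2. Analyzing the ramdump with crash
2.1 Getting the log
In crash, the dmesg or log command retrieves the contents of the kernel log buffer at the time of the panic. The key information is extracted below: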
[24145.263446] aliperm##error[sess_perm:1692]can't find session with id 651
[24145.291018] Unhandled fault: level 3 address size fault (0x96000043) at 0xffff00001fbaf000
[24145.291021] Mem abort info:
[24145.291024] Exception class = DABT (current EL), IL = 32 bits
[24145.291025] SET = 0, FnV = 0
[24145.291026] EA = 0, S1PTW = 0
[24145.291027] Data abort info:
[24145.291028] ISV = 0, ISS = 0x00000043
[24145.291029] CM = 0, WnR = 1
[24145.291033] Internal error: : 96000043 [#1] PREEMPT SMP
[24145.291039] Modules linked in: 8021q garp mrp bridge stp llc fts_ts veth himax_mmi xrp_hw_semidrive aes_neon_bs crc32_ce aes_neon_blk x9_ref_mach_ak7738 dwmac_dwc_qos_eth alitks_mod(PO) bcmdhd snd_sdrv_i2s_sc aliperm_mod(P) alisec_mod(P) alipatch_mod(P) alintgr_mod(P) alimbedtls_mod(P)
[24145.291075] CPU: 5 PID: 1803 Comm: pvr_defer_free Tainted: P W O 4.14.61+ #1
[24145.291076] Hardware name: Semidrive kunlun x9 REF Board (DT)
[24145.291079] task: ffff8001794fab80 task.stack: ffff00000e7d8000
[24145.291088] PC is at DeviceMemSet+0x8c/0xc8
[24145.291094] LR is at _ZeroPageArray+0x78/0x118
[24145.291096] pc : [<ffff00000876d444>] lr : [<ffff0000087609c0>] pstate: 20c00145
[24145.291097] sp : ffff00000e7dbd00
[24145.291098] x29: ffff00000e7dbd00 x28: ffff80015bc9fb00
[24145.291102] x27: 00000000000186a0 x26: 0000000000000000
[24145.291105] x25: 00e8000000000f0f x24: 0000000000000080
[24145.291108] x23: ffff80012d0db000 x22: ffff00001fb7c000
[24145.291111] x21: 0000000000000000 x20: 0000000000080000
[24145.291114] x19: ffff00001fbaf000 x18: 0000ffffac000bcc
[24145.291117] x17: 0000ffff97c22ec8 x16: ffff00000818ded0
[24145.291121] x15: 0000ffffac000bc8 x14: 0140000000000000
[24145.291124] x13: ffff00001fbfc000 x12: 0000000000000000
[24145.291127] x11: 0000000000000000 x10: 0000000000000040
[24145.291130] x9 : 0040000000000041 x8 : 0040000000000001
[24145.291132] x7 : 0000000000000001 x6 : 000000017fffd7e8
[24145.291135] x5 : ffff8001398feb98 x4 : ffff8001398feb98
[24145.291138] x3 : ffff00001fbfbfff x2 : 0000000000000000
[24145.291140] x1 : 0000000000000000 x0 : ffff00001fbfc000
[24145.291146]
X4: 0xffff8001398feb18:
[24145.291147] eb18 398feb10 ffff8001 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291155] eb38 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291163] eb58 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291170] eb78 00000000 00007e28 1fac5000 ffff0000 1fb7c000 ffff0000 00000000 00000000
[24145.291178] eb98 33317c19 ffff8000 1e1c9018 ffff8000 1e1e2618 ffff8000 1e1c9030 ffff8000
[24145.291186] ebb8 1e1e2630 ffff8000 398fe5c0 ffff8001 00000000 00000000 00000000 00000000
[24145.291193] ebd8 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291201] ebf8 00000000 00000000 39a12880 ffff8001 00000000 00000000 00000020 00000000
[24145.291210]
X5: 0xffff8001398feb18:
[24145.291210] eb18 398feb10 ffff8001 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291218] eb38 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291225] eb58 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291233] eb78 00000000 00007e28 1fac5000 ffff0000 1fb7c000 ffff0000 00000000 00000000
[24145.291240] eb98 33317c19 ffff8000 1e1c9018 ffff8000 1e1e2618 ffff8000 1e1c9030 ffff8000
[24145.291248] ebb8 1e1e2630 ffff8000 398fe5c0 ffff8001 00000000 00000000 00000000 00000000
[24145.291256] ebd8 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291263] ebf8 00000000 00000000 39a12880 ffff8001 00000000 00000000 00000020 00000000
[24145.291276]
X23: 0xffff80012d0daf80:
[24145.291277] af80 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291284] afa0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291292] afc0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291300] afe0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291307] b000 04fcd9c0 ffff7e00 04fcd980 ffff7e00 04fce8c0 ffff7e00 04fce880 ffff7e00
[24145.291315] b020 04fcebc0 ffff7e00 04fceb80 ffff7e00 04fcf0c0 ffff7e00 04fcf080 ffff7e00
[24145.291323] b040 04fcf2c0 ffff7e00 04fcf280 ffff7e00 04fcfac0 ffff7e00 04fcfa80 ffff7e00
[24145.291330] b060 04fcffc0 ffff7e00 04fcff80 ffff7e00 0433aec0 ffff7e00 0433ae80 ffff7e00
[24145.291340]
X28: 0xffff80015bc9fa80:
[24145.291341] fa80 786f7270 5f632e79 31393533 ffff8000 5bc9fa88 ffff8001 718dbc00 ffff8001
[24145.291349] faa0 6596aa00 ffff8001 00000000 00000000 2cf5a295 00000003 00000002 00000000
[24145.291356] fac0 0000001d 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291364] fae0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291371] fb00 00000000 00000000 00000000 00000000 08760a60 ffff0000 5bc9fb00 ffff8001
[24145.291379] fb20 00000000 00000000 000000c8 00000000 00000000 00000000 eb2d59e0 ffff8000
[24145.291386] fb40 0000003c 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[24145.291394] fb60 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00005a47
[24145.291402]
[24145.291404] Process pvr_defer_free (pid: 1803, stack limit = 0xffff00000e7d8000)
[24145.291405] Call trace:
[24145.291408] Exception stack(0xffff00000e7dbbc0 to 0xffff00000e7dbd00)
[24145.291411] bbc0: ffff00001fbfc000 0000000000000000 0000000000000000 ffff00001fbfbfff
[24145.291414] bbe0: ffff8001398feb98 ffff8001398feb98 000000017fffd7e8 0000000000000001
[24145.291417] bc00: 0040000000000001 0040000000000041 0000000000000040 0000000000000000
[24145.291420] bc20: 0000000000000000 ffff00001fbfc000 0140000000000000 0000ffffac000bc8
[24145.291422] bc40: ffff00000818ded0 0000ffff97c22ec8 0000ffffac000bcc ffff00001fbaf000
[24145.291425] bc60: 0000000000080000 0000000000000000 ffff00001fb7c000 ffff80012d0db000
[24145.291428] bc80: 0000000000000080 00e8000000000f0f 0000000000000000 00000000000186a0
[24145.291431] bca0: ffff80015bc9fb00 ffff00000e7dbd00 ffff0000087609c0 ffff00000e7dbd00
[24145.291434] bcc0: ffff00000876d444 0000000020c00145 ffff00000e7dbd30 ffff0000087609ac
[24145.291436] bce0: ffffffffffffffff ffff00000877e72c ffff00000e7dbd00 ffff00000876d444
[24145.291440] [<ffff00000876d444>] DeviceMemSet+0x8c/0xc8
[24145.291444] [<ffff0000087609c0>] _ZeroPageArray+0x78/0x118
[24145.291447] [<ffff000008760b5c>] _CleanupThread_CleanPages+0xfc/0x2b8
[24145.291453] [<ffff000008781be0>] CleanupThread+0x148/0x3a0
[24145.291455] [<ffff00000875c258>] OSThreadRun+0x28/0x60
[24145.291461] [<ffff000008108c70>] kthread+0x138/0x140
[24145.291465] [<ffff00000808518c>] ret_from_fork+0x10/0x1c
[24145.291469] Code: 927cec00 91004000 8b000260 b3607c42 (a9000a62)
[24145.291483] SMP: stopping secondary CPUs
[24145.291503] ---[ end trace 2abaad5900994446 ]---
[24145.335190] Kernel panic - not syncing: Fatal exception
[24145.335308] Kernel Offset: disabled
[24145.335312] CPU features: 0x0802210
[24145.335313] Memory Limit: none
[24146.006485] Rebooting in 1 seconds..
[24147.010058] flush all cache
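The syndrome value in the "Unhandled fault" line can be decoded by hand to cross-check the "Mem abort info" lines the kernel printed. The following is a minimal sketch, assuming the standard ARMv8-A ESR_ELx layout for data aborts; the function name decode_esr is made up for illustration and is not part of crash or the kernel.

```python
ESR = 0x96000043  # value printed in the "Unhandled fault" line


def decode_esr(esr):
    """Split an ESR_ELx value into the fields shown in the panic log."""
    iss = esr & 0x1FFFFFF            # instruction specific syndrome, bits [24:0]
    dfsc = iss & 0x3F                # data fault status code, bits [5:0]
    return {
        "ec": (esr >> 26) & 0x3F,    # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,     # 1 = 32-bit instruction
        "isv": (iss >> 24) & 0x1,    # 1 = remaining syndrome fields are valid
        "wnr": (iss >> 6) & 0x1,     # 1 = the abort was caused by a write
        "dfsc": dfsc,
        # For DFSC values 0b0000LL..0b0011LL the low two bits are the lookup
        # level and the next two bits select the fault type:
        # 0 = address size, 1 = translation, 2 = access flag, 3 = permission.
        "fault_type": dfsc >> 2,
        "level": dfsc & 0x3,
    }


fields = decode_esr(ESR)
print({name: hex(value) for name, value in fields.items()})
```

For this log the decoded fields agree with what the kernel printed: exception class 0x25 (data abort taken from the current EL), a write access (WnR = 1), and a level 3 address size fault.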
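2.2 Analysis based on the log
The log shows that the GPU driver accessed an illegal address while performing a memset. The initial suspicion is an out-of-bounds access to the buffer. Start from the PC to recover the context at the fault: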
crash> dis DeviceMemSet+0x8c -l
/drivers/gpu/rogue_km/services/shared/common/mem_utils.c: 335
0xffff00000876d444 <DeviceMemSet+140>: stp x2, x2, [x19]
The corresponding source code is as follows:
The register context at the time of the exception, obtained with crash, is as follows:
crash> bt -e
PID: 1803 TASK: ffff8001794fab80 CPU: 5 COMMAND: "pvr_defer_free"
KERNEL-MODE EXCEPTION FRAME AT: ffff00000e7dbbc0
PC: ffff00000876d444 [DeviceMemSet+140]
LR: ffff0000087609c0 [_ZeroPageArray+120]
SP: ffff00000e7dbd00 PSTATE: 20c00145
X29: ffff00000e7dbd00 X28: ffff80015bc9fb00 X27: 00000000000186a0
X26: 0000000000000000 X25: 00e8000000000f0f X24: 0000000000000080
X23: ffff80012d0db000 X22: ffff00001fb7c000 X21: 0000000000000000
X20: 0000000000080000 X19: ffff00001fbaf000 X18: 0000ffffac000bcc
X17: 0000ffff97c22ec8 X16: ffff00000818ded0 X15: 0000ffffac000bc8
X14: 0140000000000000 X13: ffff00001fbfc000 X12: 0000000000000000
X11: 0000000000000000 X10: 0000000000000040 X9: 0040000000000041
X8: 0040000000000001 X7: 0000000000000001 X6: 000000017fffd7e8
X5: ffff8001398feb98 X4: ffff8001398feb98 X3: ffff00001fbfbfff
X2: 0000000000000000 X1: 0000000000000000 X0: ffff00001fbfc000
In this register context, X19 = ffff00001fbaf000, which matches the fault address reported in the panic log.
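The word in parentheses on the "Code:" line of the log, a9000a62, is the instruction at the faulting PC, so it can be decoded to confirm that the dump and the disassembly agree. This is a minimal sketch that handles only the 64-bit signed-offset form of STP; the helper name decode_stp64_offset is hypothetical.

```python
INSN = 0xA9000A62  # word in parentheses on the "Code:" line


def decode_stp64_offset(insn):
    """Decode STP Xt1, Xt2, [Xn, #imm]; return None for any other encoding."""
    # Fixed bits [31:22] for this form: opc=10, 101, V=0, 010, L=0.
    if (insn & 0xFFC00000) != 0xA9000000:
        return None
    imm7 = (insn >> 15) & 0x7F
    if imm7 & 0x40:                  # sign-extend the 7-bit immediate
        imm7 -= 0x80
    return {
        "rt": insn & 0x1F,           # first source register
        "rn": (insn >> 5) & 0x1F,    # base register
        "rt2": (insn >> 10) & 0x1F,  # second source register
        "offset": imm7 * 8,          # immediate is scaled by 8 bytes
    }


decoded = decode_stp64_offset(INSN)
print(decoded)
```

The decoded fields are rt = 2, rt2 = 2, rn = 19 with a zero offset, that is, stp x2, x2, [x19], so the store address is exactly the value of X19.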
Disassemble the whole DeviceMemSet function:
crash> dis DeviceMemSet
0xffff00000876d3b8 <DeviceMemSet>: stp x29, x30, [sp, #-48]!
0xffff00000876d3bc <DeviceMemSet+4>: mov x29, sp
0xffff00000876d3c0 <DeviceMemSet+8>: stp x19, x20, [sp, #16]
0xffff00000876d3c4 <DeviceMemSet+12>: str x21, [sp, #32]
0xffff00000876d3c8 <DeviceMemSet+16>: mov x19, x0
0xffff00000876d3cc <DeviceMemSet+20>: and w21, w1, #0xff
0xffff00000876d3d0 <DeviceMemSet+24>: mov x20, x2
0xffff00000876d3d4 <DeviceMemSet+28>: mov x0, x30
0xffff00000876d3d8 <DeviceMemSet+32>: nop
0xffff00000876d3dc <DeviceMemSet+36>: ands x1, x19, #0xf
0xffff00000876d3e0 <DeviceMemSet+40>: b.eq 0xffff00000876d418 <DeviceMemSet+96> // b.none
0xffff00000876d3e4 <DeviceMemSet+44>: mov x0, #0x10 // #16
0xffff00000876d3e8 <DeviceMemSet+48>: sub x0, x0, x1
0xffff00000876d3ec <DeviceMemSet+52>: cmp x0, x20
0xffff00000876d3f0 <DeviceMemSet+56>: csel x0, x0, x20, ls // ls = plast
0xffff00000876d3f4 <DeviceMemSet+60>: sub x20, x20, x0
0xffff00000876d3f8 <DeviceMemSet+64>: cbz x0, 0xffff00000876d418 <DeviceMemSet+96>
0xffff00000876d3fc <DeviceMemSet+68>: add x0, x19, x0
0xffff00000876d400 <DeviceMemSet+72>: mov x1, x19
0xffff00000876d404 <DeviceMemSet+76>: strb w21, [x1]
0xffff00000876d408 <DeviceMemSet+80>: add x19, x19, #0x1
0xffff00000876d40c <DeviceMemSet+84>: mov x1, x19
0xffff00000876d410 <DeviceMemSet+88>: cmp x0, x19
0xffff00000876d414 <DeviceMemSet+92>: b.ne 0xffff00000876d404 <DeviceMemSet+76> // b.any
0xffff00000876d418 <DeviceMemSet+96>: cmp x20, #0xf
0xffff00000876d41c <DeviceMemSet+100>: b.ls 0xffff00000876d458 <DeviceMemSet+160> // b.plast
0xffff00000876d420 <DeviceMemSet+104>: lsl w2, w21, #16
0xffff00000876d424 <DeviceMemSet+108>: orr w1, w21, w21, lsl #8
0xffff00000876d428 <DeviceMemSet+112>: orr w2, w2, w21, lsl #24
0xffff00000876d42c <DeviceMemSet+116>: sub x0, x20, #0x10
0xffff00000876d430 <DeviceMemSet+120>: orr w2, w2, w1
0xffff00000876d434 <DeviceMemSet+124>: and x0, x0, #0xfffffffffffffff0
0xffff00000876d438 <DeviceMemSet+128>: add x0, x0, #0x10
0xffff00000876d43c <DeviceMemSet+132>: add x0, x19, x0
0xffff00000876d440 <DeviceMemSet+136>: bfi x2, x2, #32, #32
0xffff00000876d444 <DeviceMemSet+140>: stp x2, x2, [x19] //panic
...
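The instructions from DeviceMemSet+104 to +136 build the 64-bit fill pattern in X2 and the end pointer of the 16-byte store loop in X0. The sketch below models only that visible part of the disassembly, assuming the destination is already 16-byte aligned so the byte-wise head loop is skipped; it is a reading of the listing, not the driver's source, and the function names are made up for illustration.

```python
MASK64 = (1 << 64) - 1


def fill_pattern(byte):
    """Replicate the fill byte into a 64-bit word, as X2 is built."""
    b = byte & 0xFF                            # and w21, w1, #0xff
    w = b | (b << 8) | (b << 16) | (b << 24)   # lsl / orr sequence on w1, w2
    return w | (w << 32)                       # bfi x2, x2, #32, #32


def stp_loop_end(dst, size):
    """End pointer (X0) of the 16-byte store loop for an aligned dst."""
    assert dst & 0xF == 0 and size > 0xF
    span = ((size - 0x10) & 0xFFFFFFFFFFFFFFF0) + 0x10   # sub / and / add on x0
    return (dst + span) & MASK64                          # add x0, x19, x0


# Values taken from the register dump: X22 (buffer start) and X20 (size).
print(hex(fill_pattern(0)))
print(hex(stp_loop_end(0xFFFF00001FB7C000, 0x80000)))
```

With the start address held in X22 and the size held in X20, the computed end pointer is 0xffff00001fbfc000 and the pattern is 0, which match X0 and X2 in the register dump.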
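Combining the source with the disassembly, the pDst pointer has most likely gone out of bounds. Analyzing the surrounding context gives the buffer that pDst points into and its size, as follows:
The source shows that the buffer and size that pDst accesses are determined by the function's parameters, so pDst must stay within the range those parameters define. Because a called function saves in its own stack frame the callee-saved registers it is about to modify, the stack frame of DeviceMemSet() yields the following information: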
crash> bt -f
PID: 1803 TASK: ffff8001794fab80 CPU: 5 COMMAND: "pvr_defer_free"
#0 [ffff00000e7dbd00] DeviceMemSet at ffff00000876d440
ffff00000e7dbd00: ffff00000e7dbd30 ffff0000087609c0
ffff00000e7dbd10: 0000000000000080 0000000000000080
ffff00000e7dbd20: 0000000000000080 ffff00000877fcb4
#1 [ffff00000e7dbd30] _ZeroPageArray at ffff0000087609bc
ffff00000e7dbd30: ffff00000e7dbd80 ffff000008760b5c
ffff00000e7dbd40: ffff80015bc9fb00 ffff8000eb2d59e0
ffff00000e7dbd50: ffff000008760a60 ffff80015bc9f800
ffff00000e7dbd60: 0000000000000000 0000000000000000
ffff00000e7dbd70: ffff000009334000 ffff00000877283c
#2 [ffff00000e7dbd80] _CleanupThread_CleanPages at ffff000008760b58
ffff00000e7dbd80: ffff00000e7dbdc0 ffff000008781be0
ffff00000e7dbd90: ffff80017910b200 ffff80017910b290
ffff00000e7dbda0: ffff000008760a60 ffff000008781bd4
ffff00000e7dbdb0: ffff80017910b200 ffff80017910b290
#3 [ffff00000e7dbdc0] CleanupThread at ffff000008781bdc
ffff00000e7dbdc0: ffff00000e7dbe50 ffff00000875c258
ffff00000e7dbdd0: ffff8001783be180 ffff80017918d080
ffff00000e7dbde0: ffff8001794fab80 ffff000009b8cf70
ffff00000e7dbdf0: ffff00000a563be8 ffff8001783be180
ffff00000e7dbe00: ffff00000875c230 ffff000009313368
ffff00000e7dbe10: 0000000000000000 0000000000000000
ffff00000e7dbe20: ffff000009391ad8 ffff000009334968
ffff00000e7dbe30: ffff8001783be180 ffff80017918d900
ffff00000e7dbe40: ffff80017918d380 168cfdb4a63a5800
#4 [ffff00000e7dbe50] OSThreadRun at ffff00000875c254
ffff00000e7dbe50: ffff00000e7dbe70 ffff000008108c70
ffff00000e7dbe60: ffff8001783be980 168cfdb4a63a5800
#5 [ffff00000e7dbe70] kthread at ffff000008108c6c
The called function DeviceMemSet() saves the following registers in its stack frame:
crash> dis DeviceMemSet
0xffff00000876d3b8 <DeviceMemSet>: stp x29, x30, [sp, #-48]!
0xffff00000876d3bc <DeviceMemSet+4>: mov x29, sp
0xffff00000876d3c0 <DeviceMemSet+8>: stp x19, x20, [sp, #16] // save X19 and X20 on the stack
0xffff00000876d3c4 <DeviceMemSet+12>: str x21, [sp, #32] // save X21 on the stack
The SP of DeviceMemSet() is 0xffff00000e7dbd00.
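Given the prologue and the SP value, the saved registers can be read directly out of frame #0 of the bt -f output. A minimal sketch, with the three stack lines copied from that output; parse_frame is a hypothetical helper, not a crash command.

```python
# Frame #0 of the "bt -f" output, copied verbatim.
FRAME0 = """\
ffff00000e7dbd00: ffff00000e7dbd30 ffff0000087609c0
ffff00000e7dbd10: 0000000000000080 0000000000000080
ffff00000e7dbd20: 0000000000000080 ffff00000877fcb4
"""


def parse_frame(text):
    """Map each 8-byte stack slot address to the value stored there."""
    mem = {}
    for line in text.splitlines():
        addr, words = line.split(":")
        base = int(addr, 16)
        for i, word in enumerate(words.split()):
            mem[base + 8 * i] = int(word, 16)
    return mem


SP = 0xFFFF00000E7DBD00
mem = parse_frame(FRAME0)
saved = {
    "x29": mem[SP],         # stp x29, x30, [sp, #-48]!
    "x30": mem[SP + 8],
    "x19": mem[SP + 16],    # stp x19, x20, [sp, #16]
    "x20": mem[SP + 24],
    "x21": mem[SP + 32],    # str x21, [sp, #32]
}
for name, value in saved.items():
    print(f"{name} = {value:#x}")
```

The saved x30 is 0xffff0000087609c0, the return address inside _ZeroPageArray, and the saved x19 is 0x80. These are the caller's register values at the moment of the call, not the values DeviceMemSet() itself was using.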
Now look at the parent function that calls DeviceMemSet():
crash> dis _ZeroPageArray
0xffff000008760948 <_ZeroPageArray>: stp x29, x30, [sp, #-80]!
0xffff00000876094c <_ZeroPageArray+4>: mov x29, sp
0xffff000008760950 <_ZeroPageArray+8>: stp x20, x21, [sp, #24]
0xffff000008760954 <_ZeroPageArray+12>: str x23, [sp, #48]
0xffff000008760958 <_ZeroPageArray+16>: str x25, [sp, #64]
0xffff00000876095c <_ZeroPageArray+20>: mov w20, w0
0xffff000008760960 <_ZeroPageArray+24>: mov x23, x1
0xffff000008760964 <_ZeroPageArray+28>: mov x25, x2
0xffff000008760968 <_ZeroPageArray+32>: mov x0, x30
0xffff00000876096c <_ZeroPageArray+36>: nop
0xffff000008760970 <_ZeroPageArray+40>: mov w21, #0x400 // #1024
0xffff000008760974 <_ZeroPageArray+44>: cmp w20, w21
0xffff000008760978 <_ZeroPageArray+48>: csel w21, w20, w21, ls // ls = plast
0xffff00000876097c <_ZeroPageArray+52>: cbz w20, 0xffff0000087609e4 <_ZeroPageArray+156>
0xffff000008760980 <_ZeroPageArray+56>: str x19, [x29, #16]
0xffff000008760984 <_ZeroPageArray+60>: str x22, [x29, #40]
0xffff000008760988 <_ZeroPageArray+64>: str x24, [x29, #56]
0xffff00000876098c <_ZeroPageArray+68>: cmp w20, w21
0xffff000008760990 <_ZeroPageArray+72>: mov x3, x25
0xffff000008760994 <_ZeroPageArray+76>: csel w19, w20, w21, ls // ls = plast
0xffff000008760998 <_ZeroPageArray+80>: mov w2, #0xffffffff // #-1
0xffff00000876099c <_ZeroPageArray+84>: mov w1, w19
0xffff0000087609a0 <_ZeroPageArray+88>: mov x0, x23
0xffff0000087609a4 <_ZeroPageArray+92>: mov w24, w19
0xffff0000087609a8 <_ZeroPageArray+96>: bl 0xffff000008267958 <vm_map_ram>
0xffff0000087609ac <_ZeroPageArray+100>: mov x22, x0
0xffff0000087609b0 <_ZeroPageArray+104>: cbz x0, 0xffff000008760a00 <_ZeroPageArray+184>
0xffff0000087609b4 <_ZeroPageArray+108>: lsl x2, x24, #12
0xffff0000087609b8 <_ZeroPageArray+112>: mov w1, #0x0 // #0
0xffff0000087609bc <_ZeroPageArray+116>: bl 0xffff00000876d3b8 <DeviceMemSet>
...
DeviceMemSet() is called as follows:
0xffff0000087609a4 <_ZeroPageArray+92>: mov w24, w19
0xffff0000087609a8 <_ZeroPageArray+96>: bl 0xffff000008267958 <vm_map_ram>
0xffff0000087609ac <_ZeroPageArray+100>: mov x22, x0
0xffff0000087609b0 <_ZeroPageArray+104>: cbz x0, 0xffff000008760a00 <_ZeroPageArray+184>
0xffff0000087609b4 <_ZeroPageArray+108>: lsl x2, x24, #12
0xffff0000087609b8 <_ZeroPageArray+112>: mov w1, #0x0 // #0
0xffff0000087609bc <_ZeroPageArray+116>: bl 0xffff00000876d3b8 <DeviceMemSet>
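The arguments passed to DeviceMemSet() are therefore X1 = 0, X0 = X22, and X2 = X24 << 12, where X24 = X19. DeviceMemSet() saves the caller's X19 at SP + 16, and SP = 0xffff00000e7dbd00, so the value of X19 can be read back from the stack: 0x80. The buffer size is therefore 0x80 * 4096 bytes = 0x80000. Because DeviceMemSet() never uses X22, at the time of the exception X22 still holds the value it had in the parent function _ZeroPageArray(), which is 0xffff00001fb7c000.
Both the start address and the size of the buffer to be memset are now known, so the range to be cleared is 0xffff00001fb7c000 ~ 0xffff00001fb7c000 + 0x80000 - 1, that is, 0xffff00001fb7c000 ~ 0xffff00001fbfc000 - 1.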
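The faulting instruction is a stp:
stp x2, x2, [x19]
Each such store clears 16 bytes, so the store with X19 = 0xffff00001fbfbff0 is the one that finishes zeroing the region. The next address, 0xffff00001fbfbff0 + 0x10 = 0xffff00001fbfc000, is the first out-of-bounds address; it is not mapped, so a store there cannot be translated through the page tables. Note, however, that the X19 recorded at the panic, 0xffff00001fbaf000, is only 0x33000 bytes past the start of the buffer and still below the end pointer held in X0, and the log reports a level 3 address size fault, so the out-of-bounds hypothesis does not by itself explain the recorded fault address.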
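The address arithmetic above is easy to recheck. The sketch below recomputes the buffer range from the values recovered so far and also reports where the logged fault address sits relative to that range; all constants are copied from the register dump and stack frame, and the variable names are only illustrative.

```python
PAGE_SHIFT = 12                  # lsl x2, x24, #12 in _ZeroPageArray
start = 0xFFFF00001FB7C000       # X22: value returned by vm_map_ram
pages = 0x80                     # caller's X19, read back from the stack
fault = 0xFFFF00001FBAF000       # X19 at the fault, same as the logged address

size = pages << PAGE_SHIFT
end = start + size               # first address past the buffer
last_store = end - 0x10          # last 16-byte slot a stp may legally write
offset = fault - start           # distance of the fault from the buffer start
in_range = start <= fault < end

print(f"size={size:#x} end={end:#x} last_store={last_store:#x}")
print(f"fault offset={offset:#x} page={offset >> PAGE_SHIFT} in_range={in_range}")
```

The same few lines can be reused for other dumps by substituting the start address, page count, and fault address taken from that dump.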
With that, the blame has been pinned down; all that is left is to find whoever has to take it...