1 问题引入
最近在分析一个panic问题时,发现panic现场无法与log对应起来。
先贴log:
<1>[ 180.089084] Unable to handle kernel NULL pointer dereference at virtual address 00000001
<1>[ 180.099551] pgd = 8bbde651
<1>[ 180.107775] Unable to handle kernel NULL pointer dereference at virtual address 00000001
<1>[ 180.108527] [00000001] *pgd=45636835, *pte=00000000, *ppte=00000000
<1>[ 180.121525] pgd = 8bbde651
<1>[ 180.135901] [00000001] *pgd=45636835, *pte=00000000, *ppte=00000000
<0>[ 180.136405] Internal error: Oops: 17 [#1] PREEMPT SMP THUMB2
<0>[ 180.147836] Modules linked in: cw221x(O) dwc3 sdhci_axera dwc3_axera spi_axera_module [last unloaded: kspi]
<0>[ 180.157597] CPU: 1 PID: 124 Comm: qs_sleep Tainted: G O 4.19.125 #2
<0>[ 180.165167] Hardware name: Axera_chip (Device Tree)
<0>[ 180.170059] PC is at kmem_cache_alloc+0x80/0xd4
<0>[ 180.174592] LR is at kmem_cache_alloc+0x15/0xd4
<0>[ 180.179124] pc : [<c0084f50>] lr : [<c0084ee5>] psr: 60000033
<0>[ 180.185393] sp : c56cfde0 ip : c5bda108 fp : 00000000
<0>[ 180.190619] r10: 00000000 r9 : c5c6b420 r8 : 0001ed1c
<0>[ 180.195845] r7 : ffffe000 r6 : c00c29fd r5 : 006000c0 r4 : c6001e40
<0>[ 180.202374] r3 : 00000001 r2 : 00000000 r1 : 0001ed1c r0 : 00000000
<0>[ 180.208906] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA Thumb Segment user
<0>[ 180.216217] Control: 50c5383d Table: 45bc806a DAC: 00000055
<0>[ 180.221967] Process qs_sleep (pid: 124, stack limit = 0x51827165)
<0>[ 180.228063] Stack: (0xc56cfde0 to 0xc56d0000)
<0>[ 180.232425] fde0: c049f8fc c53bc980 00000000 c612bd80 c0372ac8 c55fbe40 c5c6b420 c00c29fd
<0>[ 180.240608] fe00: c5c6b420 c049f8fc 00000000 c55fbe40 00000000 c5c6b500 c5c6b420 c00c285f
<0>[ 180.248790] fe20: c55fbe48 00000000 00000000 c0087393 c5c79a18 c56cfed0 00000000 c55fbe40
<0>[ 180.256973] fe40: c5c6b420 00000000 00000000 c0092759 c5b81978 00000001 c56cff74 00000002
<0>[ 180.265156] fe60: 00000000 00000041 c5c79908 000041ed 00000000 00000006 00000000 c56cfed8
<0>[ 180.273338] fe80: 00000000 ffffe000 c6094400 00000000 c5c6b420 c5b96610 c5c79a18 00010001
<0>[ 180.281520] fea0: c56cfea0 c56cfea0 0000001d 00000033 c56cff74 00000001 c6247000 c00091e4
<0>[ 180.289703] fec0: c56ce000 00000005 0000000a c009285b c5b96610 c5c79a18 7d11879d 00000009
<0>[ 180.297886] fee0: c624701b c025690d c6006e10 c5c02aa0 c5c6b420 00000101 00000000 00000048
<0>[ 180.306069] ff00: 00000000 00000000 00000000 c56cff10 c043feb3 c048d5de c05882c0 c61d89c8
<0>[ 180.314251] ff20: c043feb3 00000033 c5123a00 c5b96f00 c5b96f40 c036611f 00000000 c009ac05
<0>[ 180.322434] ff40: c6247000 00000000 00000002 00000002 ffffff9c c00091e4 00000033 00000002
<0>[ 180.330616] ff60: ffffff9c c0087f57 00000000 00000000 c61d89c1 00000002 00000000 00000006
<0>[ 180.338797] ff80: 00000100 00000001 00000000 00000000 00000002 b67bbda8 00000005 c00091e4
<0>[ 180.346979] ffa0: c56ce000 c0009001 00000000 00000002 007c4ff8 00000002 00000000 00000001
<0>[ 180.355161] ffc0: 00000000 00000002 b67bbda8 00000005 00000005 005228ac 00000032 0000000a
<0>[ 180.363343] ffe0: 00000000 b67bbda8 b6dcfe33 b6dcfe46 80000030 007c4ff8 00000000 00000000
<0>[ 180.371542] [<c0084f50>] (kmem_cache_alloc) from [<c00c29fd>] (kernfs_fop_open+0x19f/0x23a)
<0>[ 180.379902] [<c00c29fd>] (kernfs_fop_open) from [<c0087393>] (do_dentry_open+0x1f3/0x260)
<0>[ 180.388088] [<c0087393>] (do_dentry_open) from [<c0092759>] (path_openat+0x7d7/0x8ba)
<0>[ 180.395927] [<c0092759>] (path_openat) from [<c009285b>] (do_filp_open+0x1f/0x50)
<0>[ 180.403416] [<c009285b>] (do_filp_open) from [<c0087f57>] (do_sys_open+0xd3/0x140)
<0>[ 180.410993] [<c0087f57>] (do_sys_open) from [<c0009001>] (ret_fast_syscall+0x1/0x5a)
<0>[ 180.418737] Exception stack(0xc56cffa8 to 0xc56cfff0)
<0>[ 180.423792] ffa0: 00000000 00000002 007c4ff8 00000002 00000000 00000001
<0>[ 180.431973] ffc0: 00000000 00000002 b67bbda8 00000005 00000005 005228ac 00000032 0000000a
<0>[ 180.440153] ffe0: 00000000 b67bbda8 b6dcfe33 b6dcfe46
<0>[ 180.445211] Code: 4618 e8bd 83f8 6962 (f853) e002
<0>[ 180.450008] Internal error: Oops: 17 [#2] PREEMPT SMP THUMB2
<0>[ 180.455673] Modules linked in: cw221x(O) dwc3 sdhci_axera dwc3_axera spi_axera_module [last unloaded: kspi]
<4>[ 180.456034] ---[ end trace 4fe5cd369560c84c ]---
<0>[ 180.465435] CPU: 0 PID: 165 Comm: encoder Tainted: G D O 4.19.125 #2
<0>[ 180.465438] Hardware name: Axera_chip (Device Tree)
<0>[ 180.465455] PC is at kmem_cache_alloc+0x80/0xd4
<0>[ 180.465461] LR is at kmem_cache_alloc+0x15/0xd4
<0>[ 180.465465] pc : [<c0084f50>] lr : [<c0084ee5>] psr: 60000033
<0>[ 180.465470] sp : c53bfe68 ip : 60000013 fp : 00000000
<0>[ 180.465474] r10: c55fb840 r9 : 00001000 r8 : 0001ed24
<0>[ 180.465479] r7 : ffffe000 r6 : c00fd299 r5 : 006000c0 r4 : c6001e40
<0>[ 180.465484] r3 : 00000001 r2 : 00000000 r1 : 0001ed24 r0 : 00000000
<0>[ 180.465489] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA Thumb Segment user
<0>[ 180.465494] Control: 50c5383d Table: 45bc806a DAC: 00000055
<0>[ 180.465501] Process encoder (pid: 165, stack limit = 0xb65c0d9e)
<0>[ 180.465506] Stack: (0xc53bfe68 to 0xc53c0000)
<0>[ 180.465514] fe60: c049f8fc 00000000 00000000 c5bda120 c5bda108 00fa24a0
<0>[ 180.465522] fe80: 00001000 c00fd299 c00fd44d c5bda108 c5bda108 c5bda120 00000000 c00fd457
<0>[ 180.470413] Kernel panic - not syncing: Fatal exception
<0>[ 180.477620] fea0: c5bda108 c53bff88 00000000 c009e7c7 c5bda120 c5bda138 c6246000 c00091e4
<0>[ 180.574453] fec0: c53be000 c62b8080 c009e70f 00000001 00000000 00001000 c53bff88 00000003
<0>[ 180.582636] fee0: 00000000 c00b8a55 c53bff88 c55fb840 c036ff00 c00b8a17 c53bff88 c0089839
<0>[ 180.590818] ff00: 00000000 00000000 00000001 c00ae3f1 c00b8901 b26e0d7c c5c6e028 c55fb840
<0>[ 180.599000] ff20: 00000034 c0093ac9 0000541b c009402f 00000000 00000000 00000000 c55fb840
<0>[ 180.607183] ff40: c55fb843 00001000 c55fb840 00000000 00001000 00000001 00fa24a0 c0089939
<0>[ 180.615366] ff60: c55fb840 00fa24a0 00001000 c55fb840 c55fb843 00fa24a0 00001000 c00091e4
<0>[ 180.623549] ff80: c53be000 c0089c71 00000000 00000000 00001000 00000034 00000000 b26e0d88
<0>[ 180.631732] ffa0: 00000003 c0009001 00000034 00000000 00000034 00fa24a0 00001000 00000003
<0>[ 180.639915] ffc0: 00000034 00000000 b26e0d88 00000003 b26e0ea4 004034ac be96d69c 00000000
<0>[ 180.648097] ffe0: 0080761c b26e0d88 b6dd155f b6dd1516 00000030 00000034 00000000 00000000
<0>[ 180.656293] [<c0084f50>] (kmem_cache_alloc) from [<c00fd299>] (disk_seqf_start+0x13/0x5a)
<0>[ 180.664480] [<c00fd299>] (disk_seqf_start) from [<c00fd457>] (show_partition_start+0xb/0x2c)
<0>[ 180.672928] [<c00fd457>] (show_partition_start) from [<c009e7c7>] (seq_read+0xb9/0x27a)
<0>[ 180.680942] [<c009e7c7>] (seq_read) from [<c00b8a55>] (proc_reg_read+0x3f/0x5e)
<0>[ 180.688261] [<c00b8a55>] (proc_reg_read) from [<c0089839>] (__vfs_read+0x15/0xc2)
<0>[ 180.695751] [<c0089839>] (__vfs_read) from [<c0089939>] (vfs_read+0x53/0xb6)
<0>[ 180.702806] [<c0089939>] (vfs_read) from [<c0089c71>] (ksys_read+0x2d/0x64)
<0>[ 180.709775] [<c0089c71>] (ksys_read) from [<c0009001>] (ret_fast_syscall+0x1/0x5a)
<0>[ 180.717346] Exception stack(0xc53bffa8 to 0xc53bfff0)
<0>[ 180.722401] ffa0: 00000034 00000000 00000034 00fa24a0 00001000 00000003
<0>[ 180.730583] ffc0: 00000034 00000000 b26e0d88 00000003 b26e0ea4 004034ac be96d69c 00000000
<0>[ 180.738764] ffe0: 0080761c b26e0d88 b6dd155f b6dd1516
<0>[ 180.743822] Code: 4618 e8bd 83f8 6962 (f853) e002
<2>[ 180.748618] CPU0: stopping
Log中CPU 0和CPU 1均发生了,出错误的位置均是:kmem_cache_alloc+0x80,对应的汇编:
ZST:0076::C0084F50|F853E002 ldr r14,[r3,r2]
CPU 1寄存器值:
<0>[ 180.202374] r3 : 00000001 r2 : 00000000 r1 : 0001ed1c r0 : 00000000
CPU 0寄存器值:
<0>[ 180.465484] r3 : 00000001 r2 : 00000000 r1 : 0001ed24 r0 : 00000000
r2、r3寄存器值一致,从r3+r2地址加载数据到r14, r3+r2=0x1,访问了0x1地址,这与log:
Unable to handle kernel NULL pointer dereference at virtual address 00000001
能对应上。
接下来分析出错时的函数调用栈:
-000|freelist_dereference(inline)
-000|get_freepointer(inline)
-000|get_freepointer_safe(inline)
-000|slab_alloc_node(inline)
| object = 0x1
-000|slab_alloc(inline)
-000|kmem_cache_alloc(s = 0xC6001E40, gfpflags = 6291648)
| s = 0xC6001E40
| gfpflags = 6291648
-001|kmalloc(inline)
-001|kernfs_get_open_node(inline)
| on = 0x0
| new_on = 0x0
-001|kernfs_fop_open(inode = 0xC5C6B420, file = 0xC55FBE40)
| inode = 0xC5C6B420
| file = 0xC55FBE40
| kn = 0xC612BD80
| __key = ()
| __key = ()
| __key = ()
-002|do_dentry_open(f = 0xC55FBE40, inode = 0xC5C6B420, open = 0xC00C285F)
| f = 0xC55FBE40
| inode = 0xC5C6B420
| open = 0xC00C285F
| empty_fops = (owner = 0x0, llseek = 0x0, read = 0x0, write = 0x0, read_iter = 0x0, write_iter = 0x0, iterate = 0x0, iterate_shared = 0x0, poll = 0x0, unlocked_ioctl = 0x0, compat_ioctl = 0x0, mmap = 0x0, mmap_supported_flags = 0, open = 0x0, flush = 0x0, release = 0x0, fsync = 0x0, fasync = 0x0, lock = 0x0, sendpage = 0x0, get_unmapped_area = 0x0, check_flags = 0x0, flock = 0x0, splice_write = 0x0, splice_read = 0x0, setlease = 0x0, fallocate = 0x0, show_fdinfo = 0x0, copy_file_range = 0x0, clone_file_range = 0x0, dedupe_file_range = 0x0, fadvise = 0x0)
| error = 0
| __warned = FALSE
-002|vfs_open(tailcall)
-003|do_last(inline)
| will_truncate = FALSE
| got_write = FALSE
| seq = 0
| inode = 0xC5C6B420
| path = (mnt = 0xC5B96610, dentry = 0xC5C79A18)
-003|path_openat(nd = 0xC56CFED0, op = ?, flags = ?)
| nd = 0xC56CFED0
| file = 0xC55FBE40
| error = 0
-004|do_filp_open(dfd = ?, pathname = ?, op = 0xC56CFF74)
| op = 0xC56CFF74
| nd = (path = (mnt = 0xC5B96610, dentry = 0xC5C79A18), last = (hash = 2098300829, len = 9, hash_len = 40753006493, name = 0xC624701B), root = (mnt = 0xC6006E10, dentry = 0xC5C02AA0), inode = 0xC5C6B420, flags = 257, seq = 0, m_seq = 72, last_type = 0, depth = 0, total_link_count = 0, stack = 0xC56CFF10, internal = ((link = (mnt = 0xC043FEB3, dentry = 0xC048D5DE), done = (fn = 0xC05882C0, arg = 0xC61D89C8), name = 0xC043FEB3, seq = 51), (link = (mnt = 0xC5123A00, dentry = 0xC5B96F00), done = (fn = 0xC5B96F40, arg = 0xC036611F), name = 0x0, seq = 3221859333)), name = 0xC6247000, saved = 0x0, link_inode = 0x2, root_seq = 2, dfd = -100)
| flags = 1
| filp = 0x33
-005|do_sys_open(dfd = -100, filename = ?, flags = 2, mode = ?)
| dfd = -100
| flags = 2
| op = (open_flag = 2, mode = 0, acc_mode = 6, intent = 256, lookup_flags = 1)
| fd = 51
| tmp = 0xC6247000
| f = ???
-006|__entry_text_start(asm)
---|end of frame
通过分析发现,kmalloc函数中调用kmem_cache_alloc_trace是一个关键突破口:
static __always_inline void *kmalloc(size_t size, gfp_t flags)
{
if (__builtin_constant_p(size)) {
if (size > KMALLOC_MAX_CACHE_SIZE)
return kmalloc_large(size, flags);
#ifndef CONFIG_SLOB
if (!(flags & GFP_DMA)) {
unsigned int index = kmalloc_index(size);
if (!index)
return ZERO_SIZE_PTR;
return kmem_cache_alloc_trace(kmalloc_caches[index],
flags, size);
}
#endif
}
return __kmalloc(size, flags);
}
在第13行:kmem_cache_alloc_trace(kmalloc_caches[index],可知传递给kmem_cache_alloc_trace函数的第一个参数是kmalloc_caches[index],根据调用栈的第8行:
s = 0xC6001E40,查看kmalloc_caches数组值:
kmalloc_caches = (
0x0,
0x0,
0xC6001CC0,
0x0,
0x0,
0x0,
0xC6001E40,
0xC6001D80,
0xC6001C00,
0xC6001B40,
0xC6001A80,
0xC60019C0,
0xC6001900,
0xC6001840)
可知,index=6,对应kmalloc申请的size为2^6=64字节。
再回到PC指针位置,对应出错的代码应该是第2683行:
这里object的取到的object值为0x1, 而object的值是在第2677行获取到的,而c值是通过第2656行:
c = raw_cpu_ptr(s->cpu_slab);
获取到的,那么object实际就是kmalloc_caches[6].cpu_slab->freelist,但查看kmalloc_caches[6].cpu_slab->freelist的值为0,并不是1:
kmalloc_caches = (
0x0,
0x0,
0xC6001CC0,
0x0,
0x0,
0x0,
0xC6001E40 -> (
cpu_slab = 0xC0530BF0 -> (
freelist_=_0x0,
tid = 0,
page = 0x0),
flags = 0,
2 问题出在哪里
注意到kmalloc_caches数组的类型为:struct kmem_cache,其定义如下:
struct kmem_cache {
struct kmem_cache_cpu __percpu *cpu_slab;
/* Used for retriving partial slabs etc */
slab_flags_t flags;
unsigned long min_partial;
unsigned int size; /* The size of an object including meta data */
unsigned int object_size;/* The size of an object without meta data */
unsigned int offset; /* Free pointer offset. */
#ifdef CONFIG_SLUB_CPU_PARTIAL
/* Number of per cpu partial objects to keep around */
unsigned int cpu_partial;
#endif
struct kmem_cache_order_objects oo;
/* Allocation and freeing of slabs */
struct kmem_cache_order_objects max;
struct kmem_cache_order_objects min;
gfp_t allocflags; /* gfp flags to use on each alloc */
int refcount; /* Refcount for slab cache destroy */
void (*ctor)(void *);
unsigned int inuse; /* Offset to metadata */
unsigned int align; /* Alignment */
unsigned int red_left_pad; /* Left redzone padding size */
const char *name; /* Name (only for display!) */
struct list_head list; /* List of slab caches */
#ifdef CONFIG_SYSFS
struct kobject kobj; /* For sysfs */
struct work_struct kobj_remove_work;
#endif
#ifdef CONFIG_MEMCG
struct memcg_cache_params memcg_params;
/* for propagation, maximum size of a stored attr */
unsigned int max_attr_size;
#ifdef CONFIG_SYSFS
struct kset *memcg_kset;
#endif
#endif
#ifdef CONFIG_SLAB_FREELIST_HARDENED
unsigned long random;
#endif
#ifdef CONFIG_NUMA
/*
* Defragmentation by allocating from a remote node.
*/
unsigned int remote_node_defrag_ratio;
#endif
#ifdef CONFIG_SLAB_FREELIST_RANDOM
unsigned int *random_seq;
#endif
#ifdef CONFIG_KASAN
struct kasan_cache kasan_info;
#endif
unsigned int useroffset; /* Usercopy region offset */
unsigned int usersize; /* Usercopy region size */
struct kmem_cache_node *node[MAX_NUMNODES];
};
其中cpu_slab是__percpu类型的,这正是前面分析出错的原因所在。
3 percpu变量的实现
percpu变量是一项内核特性,创建一个percpu变量后每个CPU上都会有一个此变量的副本,CPU在访问其副本时无需加锁,可以放入自己的cache中,极大地提高了访问与更新效率。
3.1 percpu变量的定义
内核提供了DEFINE_PER_CPU宏来定义percpu变量,其定义如下:
#define DEFINE_PER_CPU(type, name) \
DEFINE_PER_CPU_SECTION(type, name, "")
DEFINE_PER_CPU_SECTION、__PCPU_ATTRS和PER_CPU_BASE_SECTION的定义如下:
#define PER_CPU_BASE_SECTION ".data..percpu"
#define __PCPU_ATTRS(sec) \
__percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \
PER_CPU_ATTRIBUTES
#define DEFINE_PER_CPU_SECTION(type, name, sec) \
__PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES \
__typeof__(type) name
#endif
#ifndef PER_CPU_ATTRIBUTES
#define PER_CPU_ATTRIBUTES
#endif
#ifndef PER_CPU_DEF_ATTRIBUTES
#define PER_CPU_DEF_ATTRIBUTES
#endif
DEFINE_PER_CPU宏展开后就是这样的:
__percpu __attribute__((section(".data..percpu" sec))) \
__typeof__(type) name
举例来说,通过DEFINE_PER_CPU(int, per_cpu_n)定义变量per_cpu_n,展开后是这样的:
__percpu __attribute__((section(".data..percpu" sec))) int per_cpu_n
这与 struct kmem_cache结构体的cpu_slab成员的定义:
struct kmem_cache_cpu __percpu *cpu_slab;
类似。
percpu变量的实现较为复杂,建议参考这篇文章做详细了解。
3.2 percpu变量的访问
1个percpu变量NR_CPUS+1个实体(比较浪费空间),其中NR_CPUS个是每个CPU的副本,CPU在访问时都直接访问自己的副本。仍然以:
DEFINE_PER_CPU(int, per_cpu_n);
为例,加入NR_CPUS为2,那么定义了一个:
int per_cpu_n; //为方便,我们称其为正本
变量,以及两个CPU副本:
int per_cpu_n; //副本0
int per_cpu_n; //副本1
CPU0和CPU1在访问per_cpu_n时分别访问自己的副本,而正本却没有谁去访问!
现在回到我们前面的问题,我们访问kmalloc_caches[6].cpu_slab时,实际是访问了正本,而非副本,但恰恰正本是没有谁在使用的,所以我们看到其成员值都为0。
现在问题来了,CPU副本是如何访问的呢?其实很简单,linux内核提供了一个__per_cpu_offset[NR_CPUS]数组,数组中保存了percpu变量对应的offset值,访问时通过该offset值+正本地址,得到的就是其副本的地址。比如per_cpu_n 对应CPU0的副本地址是:
__per_cpu_offset[0] + per_cpu_n
回到我们前面的panic问题,查看__per_cpu_offset值:
__per_cpu_offset = (
0x0624B000,
0x0625D000)
CPU 0对应的cpu_slab地址为:
0xC0530BF0 + 0x0624B000 = 0xC678BBF0
转换成kmem_cache_cpu结构:
(struct kmem_cache_cpu*)0xc677bbf0 = 0xC677BBF0 -> (
freelist = 0x1,
tid = 0x0001ED30,
page = 0xC685E5E0)
CPU 1对应的cpu_slab地址为:
0xC0530BF0 + 0x0625D000 = 0xC678DBF0
转换成kmem_cache_cpu结构:
(struct kmem_cache_cpu*)0xC678DBF0 = 0xC678DBF0 -> (
freelist = 0xC4421740,
tid = 29805,
page = 0xC683F420)
这里CPU 0的 cpu_slab->freelist为1,与panic现场一致。但CPU 1的cpu_slab->freelist为0xC4421740,并不为1。这是为何?