邮件列表类似问题的讨论: nfsd: radix tree warning in nfs4_put_stid and kernel panic
vmcore
4.19.90-24.4.v2101.ky10.aarch64
环境信息:
crash> sys
KERNEL: vmlinux
DUMPFILE: 0037/127.0.0.1-2024-01-05-09:05:11/vmcore [PARTIAL DUMP]
CPUS: 128
DATE: Fri Jan 5 09:03:20 CST 2024
UPTIME: 14:23:28
LOAD AVERAGE: 0.13, 0.11, 0.04
TASKS: 1395
NODENAME: xzh-act-obbaknas-t
RELEASE: 4.19.90-24.4.v2101.ky10.aarch64
VERSION: #1 SMP Mon May 24 14:45:37 CST 2021
MACHINE: aarch64 (unknown Mhz)
MEMORY: 512 GB
PANIC: "Internal error: Oops: 8600000e [#1] SMP" (check log for details)
日志:
51807.593752] WARNING: CPU: 56 PID: 53252 at lib/radix-tree.c:784 delete_node+0x98/0x250
[...
51807.593800] CPU: 56 PID: 53252 Comm: kworker/u256:5 Kdump: loaded Not tainted 4.19.90-24.4.v2101.ky10.aarch64 #1
[51807.593801] Hardware name: WUZHOU S627K4/BC82AMDYA, BIOS 6.55 03/23/2023
[51807.593824] Workqueue: nfsd4_callbacks nfsd4_run_cb_work [nfsd]
[51807.593826] pstate: 20c00009 (nzCv daif +PAN +UAO)
[51807.593828] pc : delete_node+0x98/0x250
[51807.593829] lr : __radix_tree_delete+0x94/0xc0
[51807.593830] sp : ffffa62ac5b27c90
[51807.593830] x29: ffffa62ac5b27c90 x28: 0000000000000000
[51807.593832] x27: 0000000000000000 x26: ffff505f0afc6b00
[51807.593833] x25: ffffa629f5b648f0 x24: 0000000000000003
[51807.593834] x23: 0000000000000000 x22: ffff505f0a38d588
[51807.593835] x21: 0000000000000000 x20: ffffc64a6c17c7c0
[51807.593836] x19: ffffa64af54fe8b0 x18: ffffc64a6c17c7c0
[51807.593837] x17: 0000000000011000 x16: ffff505f0a388930
[51807.593838] x15: 0000000000000000 x14: ffffa628c7d2bd20
[51807.593839] x13: ffffa628c7d2bb10 x12: 0000000000000000
[51807.593840] x11: ffffa628c7d2bb38 x10: 0000000000000000
[51807.593841] x9 : 000000000000003e x8 : 0000000000000043
[51807.593842] x7 : ffffa628c7d2bd21 x6 : 000000000000003d
[51807.593843] x5 : 0000000000000000 x4 : 0000000000000000
[51807.593844] x3 : ffffa628c7d2bb28 x2 : 000000000000003f
[51807.593845] x1 : ffff505f0a38d588 x0 : ffffa64af5b06230
[51807.593847] Call trace:
[51807.593848] delete_node+0x98/0x250
[51807.593850] __radix_tree_delete+0x94/0xc0
[51807.593851] radix_tree_delete_item+0x50/0xc8
[51807.593852] idr_remove+0x18/0x20
[51807.593862] nfs4_put_stid+0x40/0xa0 [nfsd]
[51807.593870] nfsd4_cb_recall_release+0x20/0x30 [nfsd]
[51807.593878] nfsd4_run_cb_work+0xcc/0x110 [nfsd]
[51807.593880] process_one_work+0x1f8/0x490
[51807.593881] worker_thread+0x50/0x4b8
[51807.593882] kthread+0x134/0x138
[51807.593885] ret_from_fork+0x10/0x18
[51807.593886] ---[ end trace e77f38e6ef8f5232 ]---
[51807.645972] Unable to handle kernel read from unreadable memory at virtual address ffffa628c7d2bb28
[51807.655614] Mem abort info:
[51807.658887] ESR = 0x8600000e
[51807.662420] Exception class = IABT (current EL), IL = 32 bits
[51807.668870] SET = 0, FnV = 0
[51807.672402] EA = 0, S1PTW = 0
[51807.676029] swapper pgtable: 64k pages, 48-bit VAs, pgdp = 000000000ecf001e
[51807.683534] [ffffa628c7d2bb28] pgd=0000205fbffc0803, pud=0000205fbffc0803, pmd=0068003d80000f11
[51807.692808] Internal error: Oops: 8600000e [#1] SMP
[...
51807.773886] Process swapper/96 (pid: 0, stack limit = 0x00000000aa7d66c2)
[51807.788356] CPU: 96 PID: 0 Comm: swapper/96 Kdump: loaded Tainted: G W 4.19.90-24.4.v2101.ky10.aarch64 #1
[51807.807135] Hardware name: WUZHOU S627K4/BC82AMDYA, BIOS 6.55 03/23/2023
[51807.818018] pstate: a0400009 (NzCv daif +PAN -UAO)
[51807.826949] pc : 0xffffa628c7d2bb28
[51807.834500] lr : rcu_process_callbacks+0x224/0x590
[51807.843382] sp : ffffc64afebafec0
[51807.850791] x29: ffffc64afebafec0 x28: 000000000000000a
[51807.860198] x27: 000075ebf3f90000 x26: ffff505f0afa5000
[51807.869582] x25: ffff505f0ab953c8 x24: ffff505f0ab90000
[51807.878968] x23: ffff505f0afce000 x22: ffff505f0afe0e80
[51807.888274] x21: ffffc64afebaff40 x20: ffffc64afebb1538
[51807.897503] x19: ffffc64afebb1500 x18: ffff505f0afa1000
[51807.906639] x17: 0000000000000000 x16: ffff505f0a38d6b8
[51807.915682] x15: 0000000000000000 x14: 0000000000000000
[51807.924667] x13: ffffa64af7224d90 x12: 0000000000000000
[51807.933574] x11: 0000000000000001 x10: ffffc64afebaff28
[51807.942413] x9 : 0000000000000000 x8 : 0000000000000060
[51807.951156] x7 : ffffa64af517c430 x6 : ffff7fe992bd45c0
[51807.959831] x5 : 0000000000210d00 x4 : ffff7fe992bd45e0
[51807.968417] x3 : 000000000070006d x2 : ffffa628c7d2bb28
[51807.976901] x1 : ffffa628c7d2bb28 x0 : ffffa628c7d2bb28
[51807.985289] Call trace:
[51807.990701] 0xffffa628c7d2bb28
[51807.996719] __do_softirq+0x114/0x32c
[51808.003175] irq_exit+0x108/0x120
[51808.009195] __handle_domain_irq+0x6c/0xc0
[51808.016016] gic_handle_irq+0x6c/0x170
[51808.022432] el1_irq+0xb8/0x140
[51808.028154] arch_cpu_idle+0x38/0x1c0
[51808.034334] default_idle_call+0x24/0x58
[51808.040696] do_idle+0x1a4/0x268
[51808.046264] cpu_startup_entry+0x2c/0x78
[51808.052481] secondary_start_kernel+0x17c/0x1c8
[51808.059303] Code: f54fe8b0 ffffa64a 6c17c7c0 ffffc64a (c7d2bb28)
[51808.067801] SMP: stopping secondary CPUs
[51808.079832] Starting crashdump kernel...
[51808.086525] Bye! [
用faddr2line
脚本解析:
./scripts/faddr2line vmlinux delete_node+0x98/0x250
delete_node+0x98/0x250:
delete_node at lib/radix-tree.c:784
WARNING
发生在delete_node
函数的WARN_ON_ONCE(!list_empty(&node->private_list))
。从时间戳可以看出Unable to handle kernel read from unreadable memory
是在WARNING
后立刻发生的。
报WARNING
是在delete_node
函数的0x98
偏移量,delete_node
函数前面一部分的汇编:
crash> dis -lx delete_node
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/lib/radix-tree.c: 762
0xffff505f0a38d778 <delete_node+0x60>: ldrb w2, [x0,#2]
# x0为node的地址,x0中的值加24为node->private_list的值,存储在x3中(既x3 == &node->private_list)
0xffff505f0a38d77c <delete_node+0x64>: add x3, x0, #0x18 # 表示将寄存器 x0 的值与十六进制数 0x18 相加,然后将结果存储到寄存器 x3 中
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/./include/linux/compiler.h: 223
0xffff505f0a38d7a4 <delete_node+0x8c>: ldr x0, [x0,#24] # 将存储在内存地址 [x0 + 24] 处的值加载到寄存器 x0 中
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/lib/radix-tree.c: 784
0xffff505f0a38d7a8 <delete_node+0x90>: cmp x3, x0
0xffff505f0a38d7ac <delete_node+0x94>: b.eq 0xffff505f0a38d8e8
0xffff505f0a38d7b0 <delete_node+0x98>: brk #0x800
我们看到panic时无法读的地址ffffa628c7d2bb28
是报WARNING
时的x3
寄存器,解析vmcore crash> struct radix_tree_node.private_list -xo
可知private_list
的偏移量为24
,根据<delete_node+0x64>: add x3, x0, #0x18
这一行的汇编可知x3
是&node->private_list
,为什么在lib/radix-tree.c: 762 if (node->count) {
这一行取出&node->private_list
的值呢,是gcc优化所致。
报WARNING
时执行到list_empty()
函数的return READ_ONCE(head->next) == head
,include/linux/compiler.h: 223
是执行到__read_once_size
函数,所以ldr x0, [x0,#24]
代表执行head->next
,所以x0
是head->next
的值,x3
是head
的值,也就是&node->private_list
的值(数据类型为struct list_head *
),指向下一个radix_tree_node
的链表指针。再看struct radix_tree_node
结构体中rcu_head
的偏移量和private_list
是一样的(在同一个union
中)。
再看panic时的栈,在软中断的执行__do_softirq
函数中,我们看lr
寄存器知道这时执行到rcu_process_callbacks
函数,是从radix_tree_node_free() -> call_rcu
触发的。反汇编rcu_process_callbacks
函数:
crash> dis -lx rcu_process_callbacks
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/kernel/rcu/rcu.h: 226
0xffff505f099276fc <rcu_process_callbacks+0x214>: ldr x2, [x0,#8]
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/kernel/rcu/rcu.h: 229
0xffff505f09927700 <rcu_process_callbacks+0x218>: cmp x2, #0xfff
0xffff505f09927704 <rcu_process_callbacks+0x21c>: b.ls 0xffff505f09927970
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/kernel/rcu/rcu.h: 236
0xffff505f09927708 <rcu_process_callbacks+0x220>: blr x2
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/kernel/rcu/tree.c: 2589
0xffff505f0992770c <rcu_process_callbacks+0x224>: ldr x1, [x29,#144]
用faddr2line
脚本解析:
scripts/faddr2line vmlinux rcu_process_callbacks+0x224/0x590
rcu_process_callbacks+0x224/0x590:
rcu_do_batch at kernel/rcu/tree.c:2589 # 这一行是 __rcu_reclaim里的head->func(head),应该是执行到2583行
(inlined by) invoke_rcu_callbacks at kernel/rcu/tree.c:2896
(inlined by) __rcu_process_callbacks at kernel/rcu/tree.c:2863
(inlined by) rcu_process_callbacks at kernel/rcu/tree.c:2880
<rcu_process_callbacks+0x214>: ldr x2, [x0,#8]
处的汇编将存储在内存地址 [x0 + 8] = ffffa628c7d2bb30
处的值加载到寄存器 x2
中:
crash> rd ffffa628c7d2bb30
ffffa628c7d2bb30: ffffa628c7d2bb28
<rcu_process_callbacks+0x220>: blr x2
处的汇编将寄存器 x2
中保存的地址加载到程序计数器(pc
)中,实现了一个函数调用,并将返回地址保存在链接寄存器(lr
)中。所以panic是发生在__rcu_reclaim
函数的head->func(head);
一行,参数struct rcu_head *head
的值就是报panic的地址ffffa628c7d2bb28
。
查看内存中的值:
crash> rd ffffa628c7d2bb28 20
ffffa628c7d2bb28: ffffa628c7d2bb28 ffffa628c7d2bb28 (...(...(...(...
ffffa628c7d2bb38: 0000000000000000 0000000000000000 ................
crash> struct list_head ffffa628c7d2bb28
struct list_head {
next = 0xffffa628c7d2bb28,
prev = 0xffffa628c7d2bb28
}
crash> struct callback_head ffffa628c7d2bb28
struct callback_head {
next = 0xffffa628c7d2bb28,
func = 0xffffa628c7d2bb28
}
我们看到,struct rcu_head *head
中的值已经被重新初始化了,可能是被INIT_LIST_HEAD()
初始化了。
4.19.90-23.45.v2101.ky10.aarch64
解压提取vmlinux
和ko
文件:
rpm2cpio kernel-debuginfo-4.19.90-23.45.v2101.ky10.aarch64.rpm | cpio -div
加载ko
文件:
crash> mod -s nfsd usr/lib/debug/lib/modules/4.19.90-23.45.v2101.ky10.aarch64/kernel/fs/nfsd/nfsd.ko.debug
在x86环境上用faddr2line
脚本解析:
./scripts/faddr2line vmlinux radix_tree_free_nodes+0x8c/0xe0
radix_tree_free_nodes+0x8c/0xe0:
radix_tree_free_nodes at lib/radix-tree.c:889 (discriminator 1)
WARNING
也是发生在radix_tree_free_nodes
函数的WARN_ON_ONCE(!list_empty(&old->private_list))
。
WARNING
发生在时刻34052.859997
,Unable to handle kernel read from unreadable memory
发生在时刻34052.941212
,两个时间点是一前一后紧挨着。
整理几次panic前报warning的调用栈。
# 2023-05-18-09:46:52
# 2024-01-04-18:37:07
# 2024-01-05-09:05:11
idr_remove+0x18/0x20
nfs4_put_stid+0x40/0xa0 [nfsd]
nfsd4_cb_recall_release+0x20/0x30 [nfsd]
nfsd4_run_cb_work+0xcc/0x110 [nfsd]
process_one_work+0x1f8/0x490
worker_thread+0x50/0x4b8
kthread+0x134/0x138
# 20240710
# 20240712-104419
# 20240712-105545
# 20240713
# 20240715
idr_remove+0x18/0x20
nfs4_put_stid+0x40/0xa0 [nfsd]
nfsd4_cb_recall_release+0x20/0x30 [nfsd]
nfsd4_cb_release+0x30/0x58 [nfsd]
rpc_free_task+0x2c/0x70 [sunrpc]
__rpc_execute+0x438/0x4a8 [sunrpc]
rpc_async_schedule+0x20/0x30 [sunrpc]
process_one_work+0x1f8/0x490
worker_thread+0x50/0x4b8
kthread+0x134/0x138
# 20240714-005120
radix_tree_free_nodes+0x8c/0xe0
idr_destroy+0x44/0x60
free_client+0xdc/0x110 [nfsd]
__destroy_client+0x1d8/0x218 [nfsd]
expire_client+0x30/0x40 [nfsd]
nfsd4_create_session+0x5c0/0x710 [nfsd]
nfsd4_proc_compound+0x4b8/0x700 [nfsd]
nfsd_dispatch+0x104/0x248 [nfsd]
svc_process_common+0x314/0x7b8 [sunrpc]
svc_process+0xb0/0xc8 [sunrpc]
nfsd+0xf0/0x160 [nfsd]
kthread+0x134/0x138
# 20240711,无warning栈,只有rcu panic栈
Unable to handle kernel read from unreadable memory at virtual address ffff8003a1a13000
WARNING: CPU: 7 PID: 0 at kernel/rcu/tree.c:2669 rcu_process_callbacks+0x540/0x590
# 20240714-115213,无warning栈,只有rcu panic栈
Unable to handle kernel read from unreadable memory at virtual address ffff8003a0b23ff8
# 20240714-145204,无warning栈,只有rcu panic栈
Unable to handle kernel read from unreadable memory at virtual address ffff8003a51c1930
修复补丁为: 2bbfed98a4d8 nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()
。
邮件列表类似问题的讨论: nfsd: radix tree warning in nfs4_put_stid and kernel panic
邮件中提到ubuntu遇到的类似问题换成d1abaeb3be7b (tag: v5.3-rc5) Linux 5.3-rc5
内核,从2019年8月22日测试到8月27日5天无重启。邮件中认为2bbfed98a4d8 nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()
更像是修复补丁,但这个补丁在v5.5-rc1
提出来的(用git name-rev 2bbfed98a4d8
查看),时间对不上。邮件中提到的另一个补丁e6abc8caa6de nfsd: Don't release the callback slot unless it was actually held
是v5.1-rc7
提出来的,已经合入了。
59f8e91b75ec nfsd4: use reference count to free client
是规避补丁,能够极大降低复现概率,是在tags/v5.3-rc1
引入的,所以社区邮件讨论的d1abaeb3be7b (tag: v5.3-rc5) Linux 5.3-rc5
内核不复现就能解释了。
struct radix_tree_node
的private_list
没有被nfs相关的代码使用,所以执行到delete_node()
时node->private_list
链表正常情况下肯定为空,但出问题时这里的链表却不为空,应该是有其他进程并发操作node->private_list
。
nfsd4_run_cb_workif (!clnt) { // 回调通道中断,或客户端被终止;放弃。
// cb->cb_ops->release
nfsd4_cb_recall_release
nfs4_put_stid
idr_remove
radix_tree_delete_item
__radix_tree_delete
delete_node// &node->private_list链表不为空
WARN_ON_ONCE(!list_empty(&node->private_list))
radix_tree_node_free
call_rcu(..., func=radix_tree_node_rcu_free)
__call_rcu// 正常情况下这里被赋值成了函数
head->func = func
head->next = NULL
// 有可能是从 nfsd4_run_cb_work 调用了 rpc_call_async
// nfsd4_cb_ops->rpc_release
nfsd4_cb_release nfsd4_cb_recall_release
通过上面vmcore
分析过程可以看到,执行到__rcu_reclaim()
时struct rcu_head *head
正常情况下应该是函数radix_tree_node_rcu_free()
的值,但出问题时却变成了空链表,可能是被INIT_LIST_HEAD()
之类的宏定义初始化成空链表了,所以这里也应该是有其他进程并发操作struct rcu_head *head
。
__do_softirq
rcu_process_callbacks
__rcu_process_callbacks
invoke_rcu_callbacks
rcu_do_batch
__rcu_reclaim// 这里的 head 就是 delete_node() 中 node->private_list 的值
// 也就是 node->rcu_head 的值
// 这里panic
head->func(head) // 通过 call_rcu 调用,panic时还未执行到
radix_tree_node_rcu_free // panic时还未执行到
INIT_LIST_HEAD(&node->private_list);
其中idr_remove()
移除的id是在nfs4_alloc_stid() -> idr_alloc_cyclic()
分配的。