4.19 nfs4_put_stid()报warning紧接着panic的问题

1 问题描述

邮件列表类似问题的讨论: nfsd: radix tree warning in nfs4_put_stid and kernel panic

2 解析vmcore

2.1 4.19.90-24.4.v2101.ky10.aarch64

环境信息:

crash> sys
      KERNEL: vmlinux
    DUMPFILE: 0037/127.0.0.1-2024-01-05-09:05:11/vmcore  [PARTIAL DUMP]
        CPUS: 128
        DATE: Fri Jan  5 09:03:20 CST 2024
      UPTIME: 14:23:28
LOAD AVERAGE: 0.13, 0.11, 0.04
       TASKS: 1395
    NODENAME: xzh-act-obbaknas-t
     RELEASE: 4.19.90-24.4.v2101.ky10.aarch64
     VERSION: #1 SMP Mon May 24 14:45:37 CST 2021
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 512 GB
       PANIC: "Internal error: Oops: 8600000e [#1] SMP" (check log for details)

日志:

[51807.593752] WARNING: CPU: 56 PID: 53252 at lib/radix-tree.c:784 delete_node+0x98/0x250
...
[51807.593800] CPU: 56 PID: 53252 Comm: kworker/u256:5 Kdump: loaded Not tainted 4.19.90-24.4.v2101.ky10.aarch64 #1
[51807.593801] Hardware name: WUZHOU S627K4/BC82AMDYA, BIOS 6.55 03/23/2023
[51807.593824] Workqueue: nfsd4_callbacks nfsd4_run_cb_work [nfsd]
[51807.593826] pstate: 20c00009 (nzCv daif +PAN +UAO)
[51807.593828] pc : delete_node+0x98/0x250
[51807.593829] lr : __radix_tree_delete+0x94/0xc0
[51807.593830] sp : ffffa62ac5b27c90
[51807.593830] x29: ffffa62ac5b27c90 x28: 0000000000000000 
[51807.593832] x27: 0000000000000000 x26: ffff505f0afc6b00 
[51807.593833] x25: ffffa629f5b648f0 x24: 0000000000000003 
[51807.593834] x23: 0000000000000000 x22: ffff505f0a38d588 
[51807.593835] x21: 0000000000000000 x20: ffffc64a6c17c7c0 
[51807.593836] x19: ffffa64af54fe8b0 x18: ffffc64a6c17c7c0 
[51807.593837] x17: 0000000000011000 x16: ffff505f0a388930 
[51807.593838] x15: 0000000000000000 x14: ffffa628c7d2bd20 
[51807.593839] x13: ffffa628c7d2bb10 x12: 0000000000000000 
[51807.593840] x11: ffffa628c7d2bb38 x10: 0000000000000000 
[51807.593841] x9 : 000000000000003e x8 : 0000000000000043 
[51807.593842] x7 : ffffa628c7d2bd21 x6 : 000000000000003d 
[51807.593843] x5 : 0000000000000000 x4 : 0000000000000000 
[51807.593844] x3 : ffffa628c7d2bb28 x2 : 000000000000003f 
[51807.593845] x1 : ffff505f0a38d588 x0 : ffffa64af5b06230 
[51807.593847] Call trace:
[51807.593848]  delete_node+0x98/0x250
[51807.593850]  __radix_tree_delete+0x94/0xc0
[51807.593851]  radix_tree_delete_item+0x50/0xc8
[51807.593852]  idr_remove+0x18/0x20
[51807.593862]  nfs4_put_stid+0x40/0xa0 [nfsd]
[51807.593870]  nfsd4_cb_recall_release+0x20/0x30 [nfsd]
[51807.593878]  nfsd4_run_cb_work+0xcc/0x110 [nfsd]
[51807.593880]  process_one_work+0x1f8/0x490
[51807.593881]  worker_thread+0x50/0x4b8
[51807.593882]  kthread+0x134/0x138
[51807.593885]  ret_from_fork+0x10/0x18
[51807.593886] ---[ end trace e77f38e6ef8f5232 ]---
[51807.645972] Unable to handle kernel read from unreadable memory at virtual address ffffa628c7d2bb28
[51807.655614] Mem abort info:
[51807.658887]   ESR = 0x8600000e
[51807.662420]   Exception class = IABT (current EL), IL = 32 bits
[51807.668870]   SET = 0, FnV = 0
[51807.672402]   EA = 0, S1PTW = 0
[51807.676029] swapper pgtable: 64k pages, 48-bit VAs, pgdp = 000000000ecf001e
[51807.683534] [ffffa628c7d2bb28] pgd=0000205fbffc0803, pud=0000205fbffc0803, pmd=0068003d80000f11
[51807.692808] Internal error: Oops: 8600000e [#1] SMP
...
[51807.773886] Process swapper/96 (pid: 0, stack limit = 0x00000000aa7d66c2)
[51807.788356] CPU: 96 PID: 0 Comm: swapper/96 Kdump: loaded Tainted: G        W         4.19.90-24.4.v2101.ky10.aarch64 #1
[51807.807135] Hardware name: WUZHOU S627K4/BC82AMDYA, BIOS 6.55 03/23/2023
[51807.818018] pstate: a0400009 (NzCv daif +PAN -UAO)
[51807.826949] pc : 0xffffa628c7d2bb28
[51807.834500] lr : rcu_process_callbacks+0x224/0x590
[51807.843382] sp : ffffc64afebafec0
[51807.850791] x29: ffffc64afebafec0 x28: 000000000000000a 
[51807.860198] x27: 000075ebf3f90000 x26: ffff505f0afa5000 
[51807.869582] x25: ffff505f0ab953c8 x24: ffff505f0ab90000 
[51807.878968] x23: ffff505f0afce000 x22: ffff505f0afe0e80 
[51807.888274] x21: ffffc64afebaff40 x20: ffffc64afebb1538 
[51807.897503] x19: ffffc64afebb1500 x18: ffff505f0afa1000 
[51807.906639] x17: 0000000000000000 x16: ffff505f0a38d6b8 
[51807.915682] x15: 0000000000000000 x14: 0000000000000000 
[51807.924667] x13: ffffa64af7224d90 x12: 0000000000000000 
[51807.933574] x11: 0000000000000001 x10: ffffc64afebaff28 
[51807.942413] x9 : 0000000000000000 x8 : 0000000000000060 
[51807.951156] x7 : ffffa64af517c430 x6 : ffff7fe992bd45c0 
[51807.959831] x5 : 0000000000210d00 x4 : ffff7fe992bd45e0 
[51807.968417] x3 : 000000000070006d x2 : ffffa628c7d2bb28 
[51807.976901] x1 : ffffa628c7d2bb28 x0 : ffffa628c7d2bb28 
[51807.985289] Call trace:
[51807.990701]  0xffffa628c7d2bb28
[51807.996719]  __do_softirq+0x114/0x32c
[51808.003175]  irq_exit+0x108/0x120
[51808.009195]  __handle_domain_irq+0x6c/0xc0
[51808.016016]  gic_handle_irq+0x6c/0x170
[51808.022432]  el1_irq+0xb8/0x140
[51808.028154]  arch_cpu_idle+0x38/0x1c0
[51808.034334]  default_idle_call+0x24/0x58
[51808.040696]  do_idle+0x1a4/0x268
[51808.046264]  cpu_startup_entry+0x2c/0x78
[51808.052481]  secondary_start_kernel+0x17c/0x1c8
[51808.059303] Code: f54fe8b0 ffffa64a 6c17c7c0 ffffc64a (c7d2bb28) 
[51808.067801] SMP: stopping secondary CPUs
[51808.079832] Starting crashdump kernel...
[51808.086525] Bye!

faddr2line脚本解析:

./scripts/faddr2line vmlinux delete_node+0x98/0x250
delete_node+0x98/0x250:
delete_node at lib/radix-tree.c:784

WARNING发生在delete_node函数的WARN_ON_ONCE(!list_empty(&node->private_list))。从时间戳可以看出Unable to handle kernel read from unreadable memory是在WARNING后立刻发生的。

WARNING是在delete_node函数的0x98偏移量,delete_node函数前面一部分的汇编:

crash> dis -lx delete_node
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/lib/radix-tree.c: 762
0xffff505f0a38d778 <delete_node+0x60>:  ldrb    w2, [x0,#2]
# x0为node的地址,x0中的值加24为node->private_list的值,存储在x3中(既x3 == &node->private_list)
0xffff505f0a38d77c <delete_node+0x64>:  add     x3, x0, #0x18 # 表示将寄存器 x0 的值与十六进制数 0x18 相加,然后将结果存储到寄存器 x3 中
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/./include/linux/compiler.h: 223
0xffff505f0a38d7a4 <delete_node+0x8c>:  ldr     x0, [x0,#24] # 将存储在内存地址 [x0 + 24] 处的值加载到寄存器 x0 中
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/lib/radix-tree.c: 784
0xffff505f0a38d7a8 <delete_node+0x90>:  cmp     x3, x0
0xffff505f0a38d7ac <delete_node+0x94>:  b.eq    0xffff505f0a38d8e8
0xffff505f0a38d7b0 <delete_node+0x98>:  brk     #0x800

我们看到panic时无法读的地址ffffa628c7d2bb28是报WARNING时的x3寄存器,解析vmcore crash> struct radix_tree_node.private_list -xo 可知private_list的偏移量为24,根据<delete_node+0x64>: add x3, x0, #0x18这一行的汇编可知x3&node->private_list,为什么在lib/radix-tree.c: 762 if (node->count) {这一行取出&node->private_list的值呢,是gcc优化所致。

WARNING时执行到list_empty()函数的return READ_ONCE(head->next) == headinclude/linux/compiler.h: 223是执行到__read_once_size函数,所以ldr x0, [x0,#24]代表执行head->next,所以x0head->next的值,x3head的值,也就是&node->private_list的值(数据类型为struct list_head *),指向下一个radix_tree_node的链表指针。再看struct radix_tree_node结构体中rcu_head的偏移量和private_list是一样的(在同一个union中)。

再看panic时的栈,在软中断的执行__do_softirq函数中,我们看lr寄存器知道这时执行到rcu_process_callbacks函数,是从radix_tree_node_free() -> call_rcu触发的。反汇编rcu_process_callbacks函数:

crash> dis -lx rcu_process_callbacks
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/kernel/rcu/rcu.h: 226
0xffff505f099276fc <rcu_process_callbacks+0x214>:       ldr     x2, [x0,#8]
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/kernel/rcu/rcu.h: 229
0xffff505f09927700 <rcu_process_callbacks+0x218>:       cmp     x2, #0xfff
0xffff505f09927704 <rcu_process_callbacks+0x21c>:       b.ls    0xffff505f09927970
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/kernel/rcu/rcu.h: 236
0xffff505f09927708 <rcu_process_callbacks+0x220>:       blr     x2
/usr/src/debug/kernel-4.19.90/linux-4.19.90-24.4.v2101.ky10.aarch64/kernel/rcu/tree.c: 2589
0xffff505f0992770c <rcu_process_callbacks+0x224>:       ldr     x1, [x29,#144]

faddr2line脚本解析:

scripts/faddr2line vmlinux rcu_process_callbacks+0x224/0x590
rcu_process_callbacks+0x224/0x590:
rcu_do_batch at kernel/rcu/tree.c:2589 # 这一行是 __rcu_reclaim里的head->func(head),应该是执行到2583
(inlined by) invoke_rcu_callbacks at kernel/rcu/tree.c:2896
(inlined by) __rcu_process_callbacks at kernel/rcu/tree.c:2863
(inlined by) rcu_process_callbacks at kernel/rcu/tree.c:2880

<rcu_process_callbacks+0x214>: ldr x2, [x0,#8]处的汇编将存储在内存地址 [x0 + 8] = ffffa628c7d2bb30 处的值加载到寄存器 x2 中:

crash> rd ffffa628c7d2bb30
ffffa628c7d2bb30:  ffffa628c7d2bb28

<rcu_process_callbacks+0x220>: blr x2处的汇编将寄存器 x2 中保存的地址加载到程序计数器(pc)中,实现了一个函数调用,并将返回地址保存在链接寄存器(lr)中。所以panic是发生在__rcu_reclaim函数的head->func(head);一行,参数struct rcu_head *head的值就是报panic的地址ffffa628c7d2bb28

查看内存中的值:

crash> rd ffffa628c7d2bb28 20
ffffa628c7d2bb28:  ffffa628c7d2bb28 ffffa628c7d2bb28   (...(...(...(...
ffffa628c7d2bb38:  0000000000000000 0000000000000000   ................

crash> struct list_head ffffa628c7d2bb28 
struct list_head {
  next = 0xffffa628c7d2bb28, 
  prev = 0xffffa628c7d2bb28
}

crash> struct callback_head ffffa628c7d2bb28
struct callback_head {
  next = 0xffffa628c7d2bb28, 
  func = 0xffffa628c7d2bb28
}

我们看到,struct rcu_head *head中的值已经被重新初始化了,可能是被INIT_LIST_HEAD()初始化了。

2.2 4.19.90-23.45.v2101.ky10.aarch64

解压提取vmlinuxko文件:

rpm2cpio kernel-debuginfo-4.19.90-23.45.v2101.ky10.aarch64.rpm | cpio -div

加载ko文件:

crash> mod -s nfsd usr/lib/debug/lib/modules/4.19.90-23.45.v2101.ky10.aarch64/kernel/fs/nfsd/nfsd.ko.debug

在x86环境上用faddr2line脚本解析:

./scripts/faddr2line vmlinux radix_tree_free_nodes+0x8c/0xe0
radix_tree_free_nodes+0x8c/0xe0:
radix_tree_free_nodes at lib/radix-tree.c:889 (discriminator 1)

WARNING也是发生在radix_tree_free_nodes函数的WARN_ON_ONCE(!list_empty(&old->private_list))

WARNING发生在时刻34052.859997Unable to handle kernel read from unreadable memory发生在时刻34052.941212,两个时间点是一前一后紧挨着。

2.3 报warning的函数调用栈

整理几次panic前报warning的调用栈。

# 2023-05-18-09:46:52
# 2024-01-04-18:37:07
# 2024-01-05-09:05:11
idr_remove+0x18/0x20
nfs4_put_stid+0x40/0xa0 [nfsd]
nfsd4_cb_recall_release+0x20/0x30 [nfsd]
nfsd4_run_cb_work+0xcc/0x110 [nfsd]
process_one_work+0x1f8/0x490
worker_thread+0x50/0x4b8
kthread+0x134/0x138

# 20240710
# 20240712-104419
# 20240712-105545
# 20240713
# 20240715
idr_remove+0x18/0x20
nfs4_put_stid+0x40/0xa0 [nfsd]
nfsd4_cb_recall_release+0x20/0x30 [nfsd]
nfsd4_cb_release+0x30/0x58 [nfsd]
rpc_free_task+0x2c/0x70 [sunrpc]
__rpc_execute+0x438/0x4a8 [sunrpc]
rpc_async_schedule+0x20/0x30 [sunrpc]
process_one_work+0x1f8/0x490
worker_thread+0x50/0x4b8
kthread+0x134/0x138

# 20240714-005120
radix_tree_free_nodes+0x8c/0xe0
idr_destroy+0x44/0x60
free_client+0xdc/0x110 [nfsd]
__destroy_client+0x1d8/0x218 [nfsd]
expire_client+0x30/0x40 [nfsd]
nfsd4_create_session+0x5c0/0x710 [nfsd]
nfsd4_proc_compound+0x4b8/0x700 [nfsd]
nfsd_dispatch+0x104/0x248 [nfsd]
svc_process_common+0x314/0x7b8 [sunrpc]
svc_process+0xb0/0xc8 [sunrpc]
nfsd+0xf0/0x160 [nfsd]
kthread+0x134/0x138

# 20240711,无warning栈,只有rcu panic栈
Unable to handle kernel read from unreadable memory at virtual address ffff8003a1a13000
WARNING: CPU: 7 PID: 0 at kernel/rcu/tree.c:2669 rcu_process_callbacks+0x540/0x590

# 20240714-115213,无warning栈,只有rcu panic栈
Unable to handle kernel read from unreadable memory at virtual address ffff8003a0b23ff8

# 20240714-145204,无warning栈,只有rcu panic栈
Unable to handle kernel read from unreadable memory at virtual address ffff8003a51c1930

3 修复补丁

修复补丁为: 2bbfed98a4d8 nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()

4 社区邮件

邮件列表类似问题的讨论: nfsd: radix tree warning in nfs4_put_stid and kernel panic

邮件中提到ubuntu遇到的类似问题换成d1abaeb3be7b (tag: v5.3-rc5) Linux 5.3-rc5内核,从2019年8月22日测试到8月27日5天无重启。邮件中认为2bbfed98a4d8 nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()更像是修复补丁,但这个补丁在v5.5-rc1提出来的(用git name-rev 2bbfed98a4d8查看),时间对不上。邮件中提到的另一个补丁e6abc8caa6de nfsd: Don't release the callback slot unless it was actually heldv5.1-rc7提出来的,已经合入了。

59f8e91b75ec nfsd4: use reference count to free client是规避补丁,能够极大降低复现概率,是在tags/v5.3-rc1引入的,所以社区邮件讨论的d1abaeb3be7b (tag: v5.3-rc5) Linux 5.3-rc5内核不复现就能解释了。

5 代码分析

struct radix_tree_nodeprivate_list没有被nfs相关的代码使用,所以执行到delete_node()node->private_list链表正常情况下肯定为空,但出问题时这里的链表却不为空,应该是有其他进程并发操作node->private_list

nfsd4_run_cb_work
  if (!clnt) { // 回调通道中断,或客户端被终止;放弃。
  nfsd4_cb_recall_release // cb->cb_ops->release
    nfs4_put_stid
      idr_remove
        radix_tree_delete_item
          __radix_tree_delete
            delete_node
              WARN_ON_ONCE(!list_empty(&node->private_list)) // &node->private_list链表不为空
              radix_tree_node_free
                call_rcu(..., func=radix_tree_node_rcu_free)
                  __call_rcu
                    head->func = func // 正常情况下这里被赋值成了函数
                    head->next = NULL

// 有可能是从 nfsd4_run_cb_work 调用了 rpc_call_async
nfsd4_cb_release // nfsd4_cb_ops->rpc_release
  nfsd4_cb_recall_release

通过上面vmcore分析过程可以看到,执行到__rcu_reclaim()struct rcu_head *head正常情况下应该是函数radix_tree_node_rcu_free()的值,但出问题时却变成了空链表,可能是被INIT_LIST_HEAD()之类的宏定义初始化成空链表了,所以这里也应该是有其他进程并发操作struct rcu_head *head

__do_softirq
  rcu_process_callbacks
    __rcu_process_callbacks
      invoke_rcu_callbacks
        rcu_do_batch
          __rcu_reclaim
            // 这里的 head 就是 delete_node() 中 node->private_list 的值
            // 也就是 node->rcu_head 的值
            head->func(head) // 这里panic
              radix_tree_node_rcu_free // 通过 call_rcu 调用,panic时还未执行到
                INIT_LIST_HEAD(&node->private_list); // panic时还未执行到
              

其中idr_remove()移除的id是在nfs4_alloc_stid() -> idr_alloc_cyclic()分配的。