4.19 The __rpc_execute() soft lockup problem

1 Problem description

The nfs client mount options are as follows:

rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=xx.xx.xx.xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=xx.xx.xx.xx,_netdev
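
Note the hard,timeo=600,retrans=2 combination: timeo is in tenths of a second, so each major timeout is 60s, and a hard mount retries RPCs indefinitely instead of failing the I/O. A stalled server therefore shows up as endless client-side retries rather than an error returned to the application.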

A flood of soft lockup errors was reported on CPU#110:

...
[4158131.484488] watchdog: BUG: soft lockup - CPU#110 stuck for 22s! [kworker/u257:2:236027]
[4158171.485473] watchdog: BUG: soft lockup - CPU#110 stuck for 22s! [kworker/u257:2:236027]
[4158199.486164] watchdog: BUG: soft lockup - CPU#110 stuck for 23s! [kworker/u257:2:236027]
...

One of the stack traces in the log:

[4158171.485473] watchdog: BUG: soft lockup - CPU#110 stuck for 22s! [kworker/u257:2:236027]
...
[4158171.485532] Workqueue: xprtiod rpc_async_schedule [sunrpc]
[4158171.485540] RIP: 0010:kfree+0xb1/0x160
...
[4158171.485553] Call Trace:
[4158171.485559]  __kfree_skb+0xe/0x20
[4158171.485562]  sk_stream_alloc_skb+0x106/0x1e0
[4158171.485565]  tcp_sendmsg_locked+0x515/0xd30
[4158171.485573]  tcp_sendmsg+0x27/0x40
[4158171.485576]  sock_sendmsg+0x36/0x40
[4158171.485586]  xs_send_kvec+0xb7/0xc0 [sunrpc]
[4158171.485597]  xs_sendpages+0x5d/0x200 [sunrpc]
[4158171.485615]  xs_tcp_send_request+0xa7/0x240 [sunrpc]
[4158171.485646]  xprt_transmit+0x68/0x360 [sunrpc]
[4158171.485673]  call_transmit+0x1cb/0x2a0 [sunrpc]
[4158171.485683]  __rpc_execute+0x7f/0x3e0 [sunrpc]
[4158171.485687]  process_one_work+0x195/0x3d0
[4158171.485689]  worker_thread+0x30/0x390
[4158171.485693]  kthread+0x113/0x130
[4158171.485697]  ret_from_fork+0x22/0x40

Piecing together the code flow behind the stacks in the log shows that the task keeps looping inside __rpc_execute(); an abridged version of the loop follows the call tree below.

__rpc_execute
  call_transmit
    xprt_transmit
      xs_tcp_send_request
        xs_sendpages
          xs_send_kvec
            sock_sendmsg
              tcp_sendmsg
                tcp_sendmsg_locked
                  sk_stream_alloc_skb
                    __kfree_skb
                      kfree
                release_sock
                  tcp_release_cb
        xs_nospace
          spin_unlock_bh
            raw_spin_unlock_bh
              _raw_spin_unlock_bh
                __raw_spin_unlock_bh
                  __local_bh_enable_ip
                    do_softirq
          xs_tcp_write_space
    xprt_prepare_transmit
      spin_unlock_bh
        raw_spin_unlock_bh
          _raw_spin_unlock_bh
            __raw_spin_unlock_bh
              __local_bh_enable_ip
                do_softirq
  __x86_indirect_thunk_r15+0x3/0x11
  __x86_indirect_thunk_rax+0x3/0x20
  __x86_indirect_thunk_rcx+0x3/0x20
  __x86_indirect_thunk_r15+0x0/0x11
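
For reference, here is a sketch of the scheduling loop in __rpc_execute(), abridged from net/sunrpc/sched.c in the 4.19 series (tracing, status and wait-queue handling trimmed; details approximate). The comments map each step back to the disassembly analyzed below:

static void __rpc_execute(struct rpc_task *task)
{
        ...
        for (;;) {
                void (*do_action)(struct rpc_task *);

                /* Run the pending callback if there is one,
                 * otherwise the next state-machine action. */
                do_action = task->tk_action;
                if (task->tk_callback) {          /* mov 0x18(%rbx),%r15 */
                        do_action = task->tk_callback;
                        task->tk_callback = NULL; /* movq $0x0,0x18(%rbx) */
                }
                if (!do_action)
                        break;
                do_action(task); /* callq __x86_indirect_thunk_r15, sched.c:783 */

                /* Lockless RPC_IS_QUEUED() test on tk_runstate: the
                 * mov 0x30(%rbx),%rax from bitops.h:318. If the action
                 * left the task runnable, loop again at once; nothing
                 * on this path yields the CPU, so a state machine that
                 * stays runnable spins here and trips the watchdog. */
                if (!RPC_IS_QUEUED(task))
                        continue;
                ...
        }
}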

2 vmcore analysis

2.1 Preparation

crash> mod | grep nfs # shows that nfsv3 is in use
ffffffffc0397000  nfs_acl                   16384  (not loaded)  [CONFIG_KALLSYMS]
ffffffffc03ef100  nfsv3                     49152  fs/nfs/nfsv3.ko.debug 
ffffffffc066dc80  nfs                      311296  fs/nfs/nfs.ko.debug
# load the debug-info modules
crash> mod -s sunrpc net/sunrpc/sunrpc.ko.debug
crash> mod -s nfs fs/nfs/nfs.ko.debug
crash> mod -s nfsv3 fs/nfs/nfsv3.ko.debug
# crash> mod -s nfsv4 fs/nfs/nfsv4.ko.debug

Note that this vmcore was captured after running echo 1 > /proc/sys/kernel/softlockup_panic.
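
By default a soft lockup only prints a warning and the system keeps running; with softlockup_panic=1 the watchdog panics on the next lockup, which lets kdump save a vmcore for the offline analysis below.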

2.2 Finding the process that triggered the rpc task

The stack at the time of the panic:

crash> bt
PID: 236027  TASK: ffff99b651d44680  CPU: 110  COMMAND: "kworker/u257:2"
 #0 [ffff9ab03eb83d50] machine_kexec at ffffffff8725a70e
 #1 [ffff9ab03eb83da8] __crash_kexec at ffffffff8735b001
 #2 [ffff9ab03eb83e68] panic at ffffffff872b310e
 #3 [ffff9ab03eb83ef0] watchdog_timer_fn at ffffffff8738f21b
 #4 [ffff9ab03eb83f20] __hrtimer_run_queues at ffffffff8733ae98
 #5 [ffff9ab03eb83f80] hrtimer_interrupt at ffffffff8733b615
 #6 [ffff9ab03eb83fd8] smp_apic_timer_interrupt at ffffffff87c025ba
 #7 [ffff9ab03eb83ff0] apic_timer_interrupt at ffffffff87c01b1f
--- <IRQ stack> ---
 #8 [ffffaa3eedaebd98] apic_timer_interrupt at ffffffff87c01b1f
    [exception RIP: __x86_indirect_thunk_r15+3]
    RIP: ffffffff87e031c3  RSP: ffffaa3eedaebe40  RFLAGS: 00000286
    RAX: 0000000000000005  RBX: ffff9a3319bd4a18  RCX: 0000000000000006
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: ffff9a3319bd4a18
    RBP: ffff9ab0343e0400   R8: ffff9a7c2c187a40   R9: ffff9ab035efba60
    R10: ffffe1aa188d9600  R11: ffff9a89fa993440  R12: 0000000000000000
    R13: 0000000000000001  R14: ffffffffc1063860  R15: ffffffffc10597a0
    ORIG_RAX: ffffffffffffff13  CS: 0010  SS: 0018
 #9 [ffffaa3eedaebe40] __rpc_execute at ffffffffc1063f6f [sunrpc]
#10 [ffffaa3eedaebe98] process_one_work at ffffffff872d0915
#11 [ffffaa3eedaebed8] worker_thread at ffffffff872d0b80
#12 [ffffaa3eedaebf10] kthread at ffffffff872d72f3
#13 [ffffaa3eedaebf50] ret_from_fork at ffffffff87c00202
crash> dis -rl ffffffffc1063f6f
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 783
0xffffffffc1063f67 <__rpc_execute+119>: mov    %rbx,%rdi
0xffffffffc1063f6a <__rpc_execute+122>: callq  0xffffffff87e031c0 <__x86_indirect_thunk_r15>
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/./arch/x86/include/asm/bitops.h: 318
0xffffffffc1063f6f <__rpc_execute+127>: mov    0x30(%rbx),%rax

This shows that __x86_indirect_thunk_r15+3 is the indirect call do_action(task) at net/sunrpc/sched.c:783 inside __rpc_execute().

Disassemble __rpc_execute() and focus on what happens to the %rdi register:

crash> dis -l __rpc_execute
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 753
...
0xffffffffc1063efe <__rpc_execute+14>:  push   %rbx
0xffffffffc1063eff <__rpc_execute+15>:  mov    %rdi,%rbx # copy %rdi (the rpc_task pointer) into %rbx
0xffffffffc1063f02 <__rpc_execute+18>:  sub    $0x20,%rsp
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 755
0xffffffffc1063f06 <__rpc_execute+22>:  movzwl 0xdc(%rdi),%r13d
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/./arch/x86/include/asm/bitops.h: 318
0xffffffffc1063f2b <__rpc_execute+59>:  mov    0x30(%rbx),%rax
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 761
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/./arch/x86/include/asm/bitops.h: 318
0xffffffffc1063f37 <__rpc_execute+71>:  mov    0x30(%rbx),%rax
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 756
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 776
0xffffffffc1063f4d <__rpc_execute+93>:  mov    0x18(%rbx),%r15
0xffffffffc1063f51 <__rpc_execute+97>:  test   %r15,%r15
0xffffffffc1063f54 <__rpc_execute+100>: je     0xffffffffc106400b <__rpc_execute+283>
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 778
0xffffffffc1063f5a <__rpc_execute+106>: movq   $0x0,0x18(%rbx) # clears tk_callback; does not modify %rbx itself
0xffffffffc1063f62 <__rpc_execute+114>: nopl   0x0(%rax,%rax,1)
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 783
0xffffffffc1063f67 <__rpc_execute+119>: mov    %rbx,%rdi
0xffffffffc1063f6a <__rpc_execute+122>: callq  0xffffffff87e031c0 <__x86_indirect_thunk_r15>
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/./arch/x86/include/asm/bitops.h: 318
0xffffffffc1063f6f <__rpc_execute+127>: mov    0x30(%rbx),%rax
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 788

On x86_64, integer arguments are passed in RDI, RSI, RDX, RCX, R8, R9, in that order, so %rdi (also copied into %rbx) holds the first argument of __rpc_execute(struct rpc_task *task), i.e. the rpc_task pointer.

Now try to find the process that triggered the rpc task:

crash> struct rpc_task ffff9a3319bd4a18 # RDI: ffff9a3319bd4a18
struct rpc_task {
  ...
  tk_owner = 227403, # pid of the thread-group leader
  ...
}

List all threads in this thread group:

crash> ps -g 227403
crash> bt -g 227403 # print the stacks of all threads

Click here to view the full output of bt -g 227403

2.3 Identifying the rpc request type

Parse the rpc task structure:

crash> struct rpc_task ffff9a3319bd4a18 # RDI: ffff9a3319bd4a18
struct rpc_task {
  tk_msg = {
    rpc_proc = 0xffffffffc03eb8a0 <nfs3_procedures+288>, 
    rpc_argp = 0xffff9a3319bd4bd8, 
    rpc_resp = 0xffff9a3319bd4c40, 
    rpc_cred = 0xffff9ab0283d2e00
  }, 

nfs3_procedures[] is an array of struct rpc_procinfo; the size of struct rpc_procinfo is:

crash> struct rpc_procinfo
struct rpc_procinfo {
  ...
}
SIZE: 48

The offset 288 in <nfs3_procedures+288> gives 288/48 = 6, so the rpc request type is NFS3PROC_READ.
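
The division is ordinary array arithmetic over the procedure table; a minimal sketch of the same calculation (NFS3PROC_READ is defined as 6 in include/uapi/linux/nfs3.h):

#include <stdio.h>

int main(void)
{
        unsigned long offset = 288; /* from the symbol <nfs3_procedures+288> */
        unsigned long size = 48;    /* SIZE of struct rpc_procinfo per crash */

        /* 288 / 48 == 6 == NFS3PROC_READ */
        printf("procedure index = %lu\n", offset / size);
        return 0;
}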

3 Debugging

3.1 Steps

Modify the nfs server kernel code to delay READ replies:

--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -14,6 +14,7 @@
 #include "xdr3.h"
 #include "vfs.h"
 #include "filecache.h"
+#include <linux/delay.h>
 
 #define NFSDDBG_FACILITY               NFSDDBG_PROC
 
@@ -182,6 +183,10 @@ nfsd3_proc_read(struct svc_rqst *rqstp)
        struct nfsd3_readargs *argp = rqstp->rq_argp;
        struct nfsd3_readres *resp = rqstp->rq_resp;
 
+       printk("%s:%d, begin delay\n", __func__, __LINE__);
+       msleep(36000 * 1000);
        dprintk("nfsd: READ(3) %s %lu bytes at %Lu\n",
                                SVCFH_fmt(&argp->fh),
                                (unsigned long) argp->count,

Commands on the nfs client:

tcpdump --interface=any --buffer-size=20480 -w out.cap
mount -t nfs -o rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountvers=3,mountproto=udp,local_lock=none,_netdev 192.168.53.209:/tmp/s_test /mnt # mountport=2050
echo something > /mnt/file
echo 3 > /proc/sys/vm/drop_caches # drop the page cache so the next read must go to the server
cat /mnt/file

3.2 Packet capture analysis

The nfs read request packet:

94  24.719361   192.168.53.215  192.168.53.209  NFS 184 V3 READ Call, FH: 0x633aa4f4 Offset: 0 Len: 10
Transmission Control Protocol, Src Port: 958, Dst Port: 2049, Seq: 1313, Ack: 1717, Len: 112
Remote Procedure Call, Type:Call XID:0x6be6919a
Network File System, READ Call FH: 0x633aa4f4 Offset: 0 Len: 10

Only a TCP ACK comes back; the NFS reply is held up by the injected msleep():

95  24.760076   192.168.53.209  192.168.53.215  TCP 72  2049 → 958 [ACK] Seq=1717 Ack=1425 Win=64896 Len=0 TSval=167216215 TSecr=1557071350
Transmission Control Protocol, Src Port: 2049, Dst Port: 958, Seq: 1717, Ack: 1425, Len: 0

3.3 Packet capture of a normal nfs server reply

The nfs read request packet:

80  15.970150   127.0.0.1   127.0.0.1   NFS 178 V3 READ Call (Reply In 81), FH: 0x61f66583 Offset: 0 Len: 8
Transmission Control Protocol, Src Port: 820, Dst Port: 2049, Seq: 1321, Ack: 1697, Len: 112
Remote Procedure Call, Type:Call XID:0x3542f9f0
Network File System, READ Call FH: 0x61f66583 Offset: 0 Len: 8

The nfs read reply packet:

81  15.970387   127.0.0.1   127.0.0.1   NFS 206 V3 READ Reply (Call In 80) Len: 8
Transmission Control Protocol, Src Port: 2049, Dst Port: 820, Seq: 1697, Ack: 1433, Len: 140
Remote Procedure Call, Type:Reply XID:0x3542f9f0
Network File System, READ Reply Len: 8

4 Code analysis

The vmcore analysis showed the rpc request type to be NFS3PROC_READ, and on the client this constant is referenced only in nfs3_proc_read_setup() (excerpted after the call path below).

read
  ksys_read
    vfs_read
      __vfs_read
        new_sync_read
          nfs_file_read
            generic_file_read_iter
              generic_file_buffered_read
                page_cache_sync_readahead
                  ondemand_readahead
                    __do_page_cache_readahead
                      read_pages
                        nfs_readpages
                          nfs_pageio_complete
                            nfs_pageio_doio
                              nfs_generic_pg_pgios
                                nfs_initiate_pgio
                                  nfs_initiate_read
                                    nfs3_proc_read_setup
                                      // select the request type
                                      &nfs3_procedures[NFS3PROC_READ]
                wait_on_page_locked_killable
                  wait_on_page_bit_killable // woken via unlock_page() when the rpc read task completes
                    wait_on_page_bit_common
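
At the bottom of this path, nfs3_proc_read_setup() pins the procedure; a short sketch of the function as it appears around 4.19 in fs/nfs/nfs3proc.c:

static void nfs3_proc_read_setup(struct nfs_pgio_header *hdr,
                                 struct rpc_message *msg)
{
        /* The only client-side reference to NFS3PROC_READ: this is what
         * made tk_msg.rpc_proc point at nfs3_procedures[6], i.e.
         * <nfs3_procedures+288>, in the vmcore. */
        msg->rpc_proc = &nfs3_procedures[NFS3PROC_READ];
}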

From the bt -g 227403 output parsed out of the vmcore, no process in the thread group is waiting for the reply to the asynchronous rpc read.

5 Packet capture analysis

The nfs read request packet:

17786   0.494431937 22.14.23.3  22.14.64.15 NFS 238 V3 READ Call, FH: 0x0b6bfcf8 Offset: 1376075776 Len: 1048576
Transmission Control Protocol, Src Port: 766, Dst Port: 2049, Seq: 13417, Ack: 20317249, Len: 160
Remote Procedure Call, Type:Call XID:0xccbc20c7
Network File System, READ Call FH: 0x0b6bfcf8 Offset: 1376075776 Len: 1048576

The tcp reply:

17787   0.494499755 22.14.64.15 22.14.23.3  TCP 78  2049 → 766 [ACK] Seq=20323137 Ack=13577 Win=5623 Len=0 TSval=4228514640 TSecr=255410915
Transmission Control Protocol, Src Port: 2049, Dst Port: 766, Seq: 20323137, Ack: 13577, Len: 0

# tcp.seq == 20323041 finds a packet with this sequence number, but tcp.ack == 20323041 finds no request packet from the client acknowledging it
17788   0.503423525 22.14.64.15 22.14.23.3  TCP 174 [TCP Retransmission] 2049 → 766 [ACK] Seq=20323041 Ack=13577 Win=5623 Len=96 TSval=4228514649 TSecr=255410915
Transmission Control Protocol, Src Port: 2049, Dst Port: 766, Seq: 20323041, Ack: 13577, Len: 96

17789   0.711405976 22.14.64.15 22.14.23.3  TCP 1526    [TCP Retransmission] 2049 → 766 [ACK] Seq=20317249 Ack=13577 Win=5623 Len=1448 TSval=4228514857 TSecr=255410915
Transmission Control Protocol, Src Port: 2049, Dst Port: 766, Seq: 20317249, Ack: 13577, Len: 1448

17790   1.139363629 22.14.64.15 22.14.23.3  TCP 1526    [TCP Retransmission] 2049 → 766 [ACK] Seq=20317249 Ack=13577 Win=5623 Len=1448 TSval=4228515285 TSecr=255410915
Transmission Control Protocol, Src Port: 2049, Dst Port: 766, Seq: 20317249, Ack: 13577, Len: 1448

17791   1.971276412 22.14.64.15 22.14.23.3  TCP 1526    [TCP Retransmission] 2049 → 766 [ACK] Seq=20317249 Ack=13577 Win=5623 Len=1448 TSval=4228516117 TSecr=255410915
Transmission Control Protocol, Src Port: 2049, Dst Port: 766, Seq: 20317249, Ack: 13577, Len: 1448
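
In frames 17789-17791 the server keeps retransmitting the same 1448-byte segment at Seq=20317249 with roughly doubling intervals, and the client's acknowledgement never advances past that sequence number: the read reply keeps reaching the wire, but the client side is no longer draining it.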

6 Patch

7 Next steps

The next time the soft lockup reproduces, use the following script to turn on the debug switches and collect more nfs and rpc logs:

#!/bin/bash

WAIT_TIME=60

execute_command() {
    mkdir -p dmesg-log/
    echo "enabling the nfs and rpc debug switches"
    echo 0xFFFF > /proc/sys/sunrpc/nfs_debug # NFSDBG_ALL
    echo 0x7fff > /proc/sys/sunrpc/rpc_debug # RPCDBG_ALL
    echo "waiting a while to accumulate enough logs"
    sleep ${WAIT_TIME}s
    cp /var/log/messages* dmesg-log/
    echo "logs saved under $(pwd)/dmesg-log/"
    exit
}

dmesg -w | while read -r line; do
    if echo "$line" | grep -q "watchdog: BUG: soft lockup"; then
        echo "检测到软锁"
        execute_command
    fi
done
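
The same switches can also be flipped with the rpcdebug tool from nfs-utils (rpcdebug -m nfs -s all, rpcdebug -m rpc -s all, and -c to clear), which writes to the same /proc/sys/sunrpc entries.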