The NFS client mount options are as follows:
rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=xx.xx.xx.xx,mountvers=3,mountport=2050,mountproto=udp,local_lock=none,addr=xx.xx.xx.xx,_netdev
A large number of soft lockup errors appeared on CPU#110:
...
[4158131.484488] watchdog: BUG: soft lockup - CPU#110 stuck for 22s! [kworker/u257:2:236027]
[4158171.485473] watchdog: BUG: soft lockup - CPU#110 stuck for 22s! [kworker/u257:2:236027]
[4158199.486164] watchdog: BUG: soft lockup - CPU#110 stuck for 23s! [kworker/u257:2:236027]
...
One of the stack traces in the log:
[4158171.485473] watchdog: BUG: soft lockup - CPU#110 stuck for 22s! [kworker/u257:2:236027]
...
[4158171.485532] Workqueue: xprtiod rpc_async_schedule [sunrpc]
[4158171.485540] RIP: 0010:kfree+0xb1/0x160
...
[4158171.485553] Call Trace:
[4158171.485559] __kfree_skb+0xe/0x20
[4158171.485562] sk_stream_alloc_skb+0x106/0x1e0
[4158171.485565] tcp_sendmsg_locked+0x515/0xd30
[4158171.485573] tcp_sendmsg+0x27/0x40
[4158171.485576] sock_sendmsg+0x36/0x40
[4158171.485586] xs_send_kvec+0xb7/0xc0 [sunrpc]
[4158171.485597] xs_sendpages+0x5d/0x200 [sunrpc]
[4158171.485615] xs_tcp_send_request+0xa7/0x240 [sunrpc]
[4158171.485646] xprt_transmit+0x68/0x360 [sunrpc]
[4158171.485673] call_transmit+0x1cb/0x2a0 [sunrpc]
[4158171.485683] __rpc_execute+0x7f/0x3e0 [sunrpc]
[4158171.485687] process_one_work+0x195/0x3d0
[4158171.485689] worker_thread+0x30/0x390
[4158171.485693] kthread+0x113/0x130
[4158171.485697] ret_from_fork+0x22/0x40
Tracing the code flow behind the stack traces in the log shows that the task keeps looping inside __rpc_execute() (a simplified sketch of this loop follows the call chain below):
__rpc_execute
call_transmit
xprt_transmit
xs_tcp_send_request
xs_sendpages
xs_send_kvec
sock_sendmsg
tcp_sendmsg
tcp_sendmsg_locked
sk_stream_alloc_skb
__kfree_skb
kfree
release_sock
tcp_release_cb
xs_nospace
spin_unlock_bh
raw_spin_unlock_bh
_raw_spin_unlock_bh
__raw_spin_unlock_bh
__local_bh_enable_ip
do_softirq
xs_tcp_write_space
xprt_prepare_transmit
spin_unlock_bh
raw_spin_unlock_bh
_raw_spin_unlock_bh
__raw_spin_unlock_bh
__local_bh_enable_ip
do_softirq+0x3/0x11
__x86_indirect_thunk_r15+0x3/0x20
__x86_indirect_thunk_rax+0x3/0x20
__x86_indirect_thunk_rcx+0x0/0x11
__x86_indirect_thunk_r15
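For reference, here is a minimal sketch of the loop in __rpc_execute(), based on 4.19-era net/sunrpc/sched.c (simplified, not a verbatim copy). Each iteration picks the next state-machine action through a function pointer and calls it indirectly (the __x86_indirect_thunk_r15 call seen above); as long as the task is never put back on a wait queue, the loop never sleeps, which is how a single rpc_task can pin a kworker long enough to trip the soft lockup watchdog:

/*
 * Simplified sketch of __rpc_execute() (4.19-era net/sunrpc/sched.c).
 * The tk_callback/tk_runstate accesses appear to correspond to the
 * 0x18(%rbx) and 0x30(%rbx) memory operands in the disassembly shown
 * later in this analysis.
 */
static void __rpc_execute(struct rpc_task *task)
{
    for (;;) {
        void (*do_action)(struct rpc_task *);

        /* Run a pending callback if there is one, otherwise the next
         * FSM action; stop when there is nothing left to do. */
        do_action = task->tk_action;
        if (task->tk_callback) {
            do_action = task->tk_callback;
            task->tk_callback = NULL;
        }
        if (!do_action)
            break;
        do_action(task);    /* indirect call -> __x86_indirect_thunk_r15 */

        /* If the action did not queue the task on a wait queue, loop
         * again immediately -- there is no scheduling point here, so a
         * kworker stuck in this loop triggers the soft lockup watchdog. */
        if (!RPC_IS_QUEUED(task))
            continue;

        /* ... otherwise dequeue the task or, for async tasks, return and
         * let rpc_async_schedule re-run it later ... */
    }
}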
crash> mod | grep nfs # shows that NFSv3 is in use
ffffffffc0397000 nfs_acl 16384 (not loaded) [CONFIG_KALLSYMS]
ffffffffc03ef100 nfsv3 49152 fs/nfs/nfsv3.ko.debug
ffffffffc066dc80 nfs 311296 fs/nfs/nfs.ko.debug
# load the debug-info .ko files
crash> mod -s sunrpc net/sunrpc/sunrpc.ko.debug
crash> mod -s nfs fs/nfs/nfs.ko.debug
crash> mod -s nfsv3 fs/nfs/nfsv3.ko.debug
# crash> mod -s nfsv4 fs/nfs/nfsv4.ko.debug
Note that this vmcore was captured after running echo 1 > /proc/sys/kernel/softlockup_panic.
Look at the stack at the time of the crash:
crash> bt
PID: 236027 TASK: ffff99b651d44680 CPU: 110 COMMAND: "kworker/u257:2"
#0 [ffff9ab03eb83d50] machine_kexec at ffffffff8725a70e
#1 [ffff9ab03eb83da8] __crash_kexec at ffffffff8735b001
#2 [ffff9ab03eb83e68] panic at ffffffff872b310e
#3 [ffff9ab03eb83ef0] watchdog_timer_fn at ffffffff8738f21b
#4 [ffff9ab03eb83f20] __hrtimer_run_queues at ffffffff8733ae98
#5 [ffff9ab03eb83f80] hrtimer_interrupt at ffffffff8733b615
#6 [ffff9ab03eb83fd8] smp_apic_timer_interrupt at ffffffff87c025ba
#7 [ffff9ab03eb83ff0] apic_timer_interrupt at ffffffff87c01b1f
--- <IRQ stack> ---
#8 [ffffaa3eedaebd98] apic_timer_interrupt at ffffffff87c01b1f
[exception RIP: __x86_indirect_thunk_r15+3]
RIP: ffffffff87e031c3 RSP: ffffaa3eedaebe40 RFLAGS: 00000286
RAX: 0000000000000005 RBX: ffff9a3319bd4a18 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9a3319bd4a18
RBP: ffff9ab0343e0400 R8: ffff9a7c2c187a40 R9: ffff9ab035efba60
R10: ffffe1aa188d9600 R11: ffff9a89fa993440 R12: 0000000000000000
R13: 0000000000000001 R14: ffffffffc1063860 R15: ffffffffc10597a0
ORIG_RAX: ffffffffffffff13 CS: 0010 SS: 0018
#9 [ffffaa3eedaebe40] __rpc_execute at ffffffffc1063f6f [sunrpc]
#10 [ffffaa3eedaebe98] process_one_work at ffffffff872d0915
#11 [ffffaa3eedaebed8] worker_thread at ffffffff872d0b80
#12 [ffffaa3eedaebf10] kthread at ffffffff872d72f3
#13 [ffffaa3eedaebf50] ret_from_fork at ffffffff87c00202
crash> dis -rl ffffffffc1063f6f
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 783
0xffffffffc1063f67 <__rpc_execute+119>: mov %rbx,%rdi
0xffffffffc1063f6a <__rpc_execute+122>: callq 0xffffffff87e031c0 <__x86_indirect_thunk_r15>
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/./arch/x86/include/asm/bitops.h: 318
0xffffffffc1063f6f <__rpc_execute+127>: mov 0x30(%rbx),%rax
This shows that __x86_indirect_thunk_r15+3 corresponds to the do_action(task) call inside __rpc_execute(). Now disassemble __rpc_execute() and focus on what happens to the %rdi register:
crash> dis -l __rpc_execute
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 753
...
0xffffffffc1063efe <__rpc_execute+14>: push %rbx
0xffffffffc1063eff <__rpc_execute+15>: mov %rdi,%rbx # copy %rdi into %rbx
0xffffffffc1063f02 <__rpc_execute+18>: sub $0x20,%rsp
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 755
0xffffffffc1063f06 <__rpc_execute+22>: movzwl 0xdc(%rdi),%r13d
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/./arch/x86/include/asm/bitops.h: 318
0xffffffffc1063f2b <__rpc_execute+59>: mov 0x30(%rbx),%rax
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 761
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/./arch/x86/include/asm/bitops.h: 318
0xffffffffc1063f37 <__rpc_execute+71>: mov 0x30(%rbx),%rax
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 756
...
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 776
0xffffffffc1063f4d <__rpc_execute+93>: mov 0x18(%rbx),%r15
0xffffffffc1063f51 <__rpc_execute+97>: test %r15,%r15
0xffffffffc1063f54 <__rpc_execute+100>: je 0xffffffffc106400b <__rpc_execute+283>
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 778
0xffffffffc1063f5a <__rpc_execute+106>: movq $0x0,0x18(%rbx) # does not change %rbx
0xffffffffc1063f62 <__rpc_execute+114>: nopl 0x0(%rax,%rax,1)
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 783
0xffffffffc1063f67 <__rpc_execute+119>: mov %rbx,%rdi
0xffffffffc1063f6a <__rpc_execute+122>: callq 0xffffffff87e031c0 <__x86_indirect_thunk_r15>
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/./arch/x86/include/asm/bitops.h: 318
0xffffffffc1063f6f <__rpc_execute+127>: mov 0x30(%rbx),%rax
/usr/src/debug/kernel-4.19.90/linux-4.19.90-23.15.v2101.ky10.x86_64/net/sunrpc/sched.c: 788
On x86_64, integer arguments are passed in the registers RDI, RSI, RDX, RCX, R8 and R9, in that order, so the value in %rdi (also copied into %rbx) is the first argument of __rpc_execute(struct rpc_task *task), i.e. the rpc_task pointer.
Next, try to find the process that triggered this RPC task:
crash> struct rpc_task ffff9a3319bd4a18 # RDI: ffff9a3319bd4a18
struct rpc_task {
...
tk_owner = 227403, # PID of the thread-group leader
...
}
List all threads in this thread group:
crash> ps -g 227403
crash> bt -g 227403 # print the stacks of all threads in the group
Parse the rpc_task structure:
crash> struct rpc_task ffff9a3319bd4a18 # RDI: ffff9a3319bd4a18
struct rpc_task {
tk_msg = {
rpc_proc = 0xffffffffc03eb8a0 <nfs3_procedures+288>,
rpc_argp = 0xffff9a3319bd4bd8,
rpc_resp = 0xffff9a3319bd4c40,
rpc_cred = 0xffff9ab0283d2e00
},
nfs3_procedures[] is an array of struct rpc_procinfo; the size of struct rpc_procinfo is:
crash> struct rpc_procinfo
struct rpc_procinfo {
...
}
SIZE: 48
From the offset 288 in <nfs3_procedures+288> we get 288 / 48 = 6, so the RPC request type is NFS3PROC_READ.
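As a quick sanity check, the index arithmetic can be reproduced with a few lines of user-space C (the procedure number below is the standard NFSv3 READ procedure number from RFC 1813, shown here purely for illustration):

#include <stdio.h>

/* NFSv3 READ is procedure number 6 (RFC 1813). */
#define NFS3PROC_READ 6

int main(void)
{
    unsigned long offset = 288; /* from the symbol <nfs3_procedures+288> */
    unsigned long size = 48;    /* SIZE of struct rpc_procinfo reported by crash */

    /* 288 / 48 = 6, i.e. the 7th entry of nfs3_procedures[] -> NFS3PROC_READ */
    printf("index = %lu, NFS3PROC_READ = %d\n", offset / size, NFS3PROC_READ);
    return 0;
}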
To reproduce the problem, modify the NFS server kernel code to delay READ processing:
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -14,6 +14,7 @@
#include "xdr3.h"
#include "vfs.h"
#include "filecache.h"
+#include <linux/delay.h>
#define NFSDDBG_FACILITY NFSDDBG_PROC
@@ -182,6 +183,10 @@ nfsd3_proc_read(struct svc_rqst *rqstp)
struct nfsd3_readargs *argp = rqstp->rq_argp;
struct nfsd3_readres *resp = rqstp->rq_resp;
+ printk("%s:%d, begin delay\n", __func__, __LINE__);
+ msleep(36000 * 1000); /* 36000 s (10 hours): hold back the READ reply */
dprintk("nfsd: READ(3) %s %lu bytes at %Lu\n",
SVCFH_fmt(&argp->fh),
(unsigned long) argp->count,
Commands on the NFS client:
tcpdump --interface=any --buffer-size=20480 -w out.cap
mount -t nfs -o rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountvers=3,mountproto=udp,local_lock=none,_netdev 192.168.53.209:/tmp/s_test /mnt # mountport=2050
echo something > /mnt/file
echo 3 > /proc/sys/vm/drop_caches
cat /mnt/file
NFS READ request packet:
94 24.719361 192.168.53.215 192.168.53.209 NFS 184 V3 READ Call, FH: 0x633aa4f4 Offset: 0 Len: 10
Transmission Control Protocol, Src Port: 958, Dst Port: 2049, Seq: 1313, Ack: 1717, Len: 112
Remote Procedure Call, Type:Call XID:0x6be6919a
Network File System, READ Call FH: 0x633aa4f4 Offset: 0 Len: 10
TCP reply (a bare ACK; no NFS READ reply follows because of the server-side delay):
95 24.760076 192.168.53.209 192.168.53.215 TCP 72 2049 → 958 [ACK] Seq=1717 Ack=1425 Win=64896 Len=0 TSval=167216215 TSecr=1557071350
Transmission Control Protocol, Src Port: 2049, Dst Port: 958, Seq: 1717, Ack: 1425, Len: 0
NFS READ request packet:
80 15.970150 127.0.0.1 127.0.0.1 NFS 178 V3 READ Call (Reply In 81), FH: 0x61f66583 Offset: 0 Len: 8
Transmission Control Protocol, Src Port: 820, Dst Port: 2049, Seq: 1321, Ack: 1697, Len: 112
Remote Procedure Call, Type:Call XID:0x3542f9f0
Network File System, READ Call FH: 0x61f66583 Offset: 0 Len: 8
NFS READ reply packet:
81 15.970387 127.0.0.1 127.0.0.1 NFS 206 V3 READ Reply (Call In 80) Len: 8
Transmission Control Protocol, Src Port: 2049, Dst Port: 820, Seq: 1697, Ack: 1433, Len: 140
Remote Procedure Call, Type:Reply XID:0x3542f9f0
Network File System, READ Reply Len: 8
The vmcore analysis shows that the RPC request type is NFS3PROC_READ, and nfs3_proc_read_setup() is the only function that references this macro. The read path that reaches it:
read
ksys_read
vfs_read
__vfs_read
new_sync_read
nfs_file_read
generic_file_read_iter
generic_file_buffered_read
page_cache_sync_readahead
ondemand_readahead
__do_page_cache_readahead
read_pages
nfs_readpages
nfs_pageio_complete
nfs_pageio_doio
nfs_generic_pg_pgios
nfs_initiate_pgio
nfs_initiate_read
nfs3_proc_read_setup // sets the request type (see the sketch after this call chain)
&nfs3_procedures[NFS3PROC_READ]
wait_on_page_locked_killable // woken via unlock_page() when the async RPC read completes
wait_on_page_bit_killable
wait_on_page_bit_common
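For completeness, a minimal sketch of nfs3_proc_read_setup() as found in 4.19-era fs/nfs/nfs3proc.c (simplified; the exact signature may differ between kernel versions):

/* Sketch (4.19-era fs/nfs/nfs3proc.c, simplified): an NFS read is bound to
 * the NFS3PROC_READ procedure here, which is why the rpc_proc pointer in the
 * vmcore (<nfs3_procedures+288>) maps back to a READ call. */
static void nfs3_proc_read_setup(struct nfs_pgio_header *hdr,
                                 struct rpc_message *msg)
{
    msg->rpc_proc = &nfs3_procedures[NFS3PROC_READ];
}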
From the output of bt -g 227403 in the vmcore, none of the processes in the thread group is waiting for the reply to the asynchronous RPC read.
NFS READ request packet:
17786 0.494431937 22.14.23.3 22.14.64.15 NFS 238 V3 READ Call, FH: 0x0b6bfcf8 Offset: 1376075776 Len: 1048576
Transmission Control Protocol, Src Port: 766, Dst Port: 2049, Seq: 13417, Ack: 20317249, Len: 160
Remote Procedure Call, Type:Call XID:0xccbc20c7
Network File System, READ Call FH: 0x0b6bfcf8 Offset: 1376075776 Len: 1048576
TCP reply:
17787 0.494499755 22.14.64.15 22.14.23.3 TCP 78 2049 → 766 [ACK] Seq=20323137 Ack=13577 Win=5623 Len=0 TSval=4228514640 TSecr=255410915
Transmission Control Protocol, Src Port: 2049, Dst Port: 766, Seq: 20323137, Ack: 13577, Len: 0
# tcp.seq == 20323041 finds a packet with the same sequence number, but tcp.ack == 20323041 finds no request packet sent by the client
17788 0.503423525 22.14.64.15 22.14.23.3 TCP 174 [TCP Retransmission] 2049 → 766 [ACK] Seq=20323041 Ack=13577 Win=5623 Len=96 TSval=4228514649 TSecr=255410915
Transmission Control Protocol, Src Port: 2049, Dst Port: 766, Seq: 20323041, Ack: 13577, Len: 96
17789 0.711405976 22.14.64.15 22.14.23.3 TCP 1526 [TCP Retransmission] 2049 → 766 [ACK] Seq=20317249 Ack=13577 Win=5623 Len=1448 TSval=4228514857 TSecr=255410915
Transmission Control Protocol, Src Port: 2049, Dst Port: 766, Seq: 20317249, Ack: 13577, Len: 1448
17790 1.139363629 22.14.64.15 22.14.23.3 TCP 1526 [TCP Retransmission] 2049 → 766 [ACK] Seq=20317249 Ack=13577 Win=5623 Len=1448 TSval=4228515285 TSecr=255410915
Transmission Control Protocol, Src Port: 2049, Dst Port: 766, Seq: 20317249, Ack: 13577, Len: 1448
17791 1.971276412 22.14.64.15 22.14.23.3 TCP 1526 [TCP Retransmission] 2049 → 766 [ACK] Seq=20317249 Ack=13577 Win=5623 Len=1448 TSval=4228516117 TSecr=255410915
Transmission Control Protocol, Src Port: 2049, Dst Port: 766, Seq: 20317249, Ack: 13577, Len: 1448
ed0172af5d6f SUNRPC: Fix a race to wake a sync task
(this commit fixes a race for synchronous RPC tasks and is unrelated to this issue)
6258cf25d5e3 SUNRPC: avoid soft lockup when transmitting UDP to reachable server.
Use the following script so that the next time the soft lockup reproduces, the NFS and RPC debug switches are turned on to collect more NFS and RPC logs:
#!/bin/bash
WAIT_TIME=60
execute_command() {
mkdir -p dmesg-log/
echo "Enabling NFS/RPC debug flags"
echo 0xFFFF > /proc/sys/sunrpc/nfs_debug # NFSDBG_ALL
echo 0x7fff > /proc/sys/sunrpc/rpc_debug # RPCDBG_ALL
echo "等一段时间以产生足够多的日志"
sleep ${WAIT_TIME}s
cp /var/log/messages* dmesg-log/
echo "日志已保存到 $(pwd)/dmesg-log/ 目录下"
exit
}
dmesg -w | while read -r line; do
if echo "$line" | grep -q "watchdog: BUG: soft lockup"; then
echo "检测到软锁"
execute_command
fi
done