Kernel version tag: 4.19.90-23.23.v2101.
The production environment is as follows:
+-------------+ +-------------+
| huawei | | huawei |
| nfsserver1 | | nfsserver2 |
| (A) | | (D) |
+-------------+ +-------------+
^ ^
| |
nfs nfs
| |
v v
+-------------+ +-------------+
| | | |
| rsyncserver |<--rsync-->| rsyncclient |
| (B) | | (C) |
+-------------+ +-------------+
rsyncclient (C) mounts nfsserver2 (D) over NFS, and rsyncserver (B) mounts nfsserver1 (A) over NFS; rsyncclient (C) backs up rsyncserver (B)'s data via rsync.
The mount options on rsyncserver (B) are:
172.23.15.11:/WGPCDIAPP_FS01 on /home/cdis/data type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.23.8.123,local_lock=none,addr=172.23.15.11)
The rsync command used with rsyncserver (B) is:
rsync -avz rsuser@172.23.8.123::CDISDATA_FS01 /home/cdis/data/ --port 10000 --password-file=/etc/rsync_cdis.password
The problem is that hung tasks occur when B accesses files on A, yet listing A's files from B with ls returns output normally, which shows that the NFS connection between A and B is healthy.
The kernel log on rsyncserver (B):
[316100.086897] INFO: task rsync:3621271 blocked for more than 120 seconds.
[316100.086950] Not tainted 4.19.90-23.23.v2101.ky10.x86_64 #1
[316100.086987] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[316100.087036] rsync D 0 3621271 152322 0x00000080
[316100.087039] Call Trace:
[316100.087052] schedule+0x28/0x80
[316100.087055] rwsem_down_read_failed+0x11c/0x190
[316100.087059] call_rwsem_down_read_failed+0x14/0x30
[316100.087060] down_read+0x13/0x30
[316100.087063] lookup_slow+0x27/0x50
[316100.087066] walk_component+0x1c4/0x350
[316100.087071] path_lookupat+0x6e/0x270
[316100.087074] filename_lookup+0xb6/0x190
[316100.087085] vfs_statx+0x73/0xe0
[316100.087087] __do_sys_newlstat+0x39/0x70
[316100.087093] do_syscall_64+0x5b/0x1d0
[316100.087095] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[316100.087097] RIP: 0033:0x7f612a9e5575
[316100.087102] Code: Bad RIP value.
[316100.087103] RSP: 002b:00007fff67b25fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
[316100.087105] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f612a9e5575
[316100.087105] RDX: 00007fff67b26100 RSI: 00007fff67b26100 RDI: 00007fff67b26190
[316100.087106] RBP: 0000000000000000 R08: 00007fff67b261bf R09: 0000000000000000
[316100.087107] R10: 0000000000000001 R11: 0000000000000246 R12: 00007fff67b26190
[316100.087107] R13: 00007fff67b26100 R14: 0000000000000002 R15: 0000000000010004
[316341.753069] INFO: task rsync:3621271 blocked for more than 120 seconds.
Extract the vmlinux and .ko files from the kernel-debuginfo rpm:
rpm2cpio kernel-debuginfo-4.19.90-23.23.v2101.ky10.x86_64.rpm | cpio -div
Load the .ko debug files:
crash> mod -s nfs kernel-debuginfo-4.19.90-23.23.v2101.ky10.x86_64/usr/lib/debug/lib/modules/4.19.90-23.23.v2101.ky10.x86_64/kernel/fs/nfs/nfs.ko.debug
crash> mod -s nfsv3 kernel-debuginfo-4.19.90-23.23.v2101.ky10.x86_64/usr/lib/debug/lib/modules/4.19.90-23.23.v2101.ky10.x86_64/kernel/fs/nfs/nfsv3.ko.debug # no need to load this one (the mount uses vers=4.0)
crash> mod -s nfsv4 kernel-debuginfo-4.19.90-23.23.v2101.ky10.x86_64/usr/lib/debug/lib/modules/4.19.90-23.23.v2101.ky10.x86_64/kernel/fs/nfs/nfsv4.ko.debug
List all the rsync processes; in the ST column, IN stands for TASK_INTERRUPTIBLE and UN stands for TASK_UNINTERRUPTIBLE (see include/linux/sched.h for details).
crash> ps rsync # or use ps | grep UN to find all TASK_UNINTERRUPTIBLE processes
PID PPID CPU TASK ST %MEM VSZ RSS COMM
152322 1 5 ffff91fd87239780 IN 0.0 213108 1592 rsync
3621271 152322 7 ffff91fd6d044680 UN 0.7 409284 128232 rsync
3632440 152322 5 ffff91fde98a1780 UN 1.0 410648 181880 rsync
3644149 152322 0 ffff91fe2c35de00 UN 1.0 410052 183764 rsync
3655486 152322 5 ffff91fd8f1ac680 UN 1.0 410948 184780 rsync
3667062 152322 5 ffff91fe2d47c680 UN 1.0 409860 183572 rsync
3678549 152322 3 ffff91fe2f851780 UN 1.0 410072 182824 rsync
3690262 152322 4 ffff91fddd830000 UN 1.0 409832 182816 rsync
3701961 152322 4 ffff91fd8723c680 UN 1.0 410392 182852 rsync
3713720 152322 7 ffff91fe12449780 UN 1.0 410080 183720 rsync
3725136 152322 3 ffff91fb2af01780 UN 1.0 410104 183060 rsync
3736677 152322 1 ffff91fb1d6e9780 UN 1.0 409856 182816 rsync
3748012 152322 5 ffff91fe27069780 UN 1.0 410108 183348 rsync
3759132 152322 0 ffff91fab0c0de00 UN 1.0 407016 180696 rsync
3770908 152322 7 ffff91fde4e95e00 UN 1.0 411200 184848 rsync
3782024 152322 7 ffff91fde9f82f00 UN 1.0 411028 184848 rsync
3794329 152322 5 ffff91fb26acaf00 UN 1.0 409796 182864 rsync
3805549 152322 7 ffff91fab7445e00 UN 1.0 410084 182880 rsync
3817313 152322 2 ffff91fab0c0af00 UN 1.0 410084 183448 rsync
3828467 152322 7 ffff91fd8f1aaf00 UN 1.0 410600 183164 rsync
3840283 152322 6 ffff91fe2d5a5e00 UN 1.0 410384 183152 rsync
18 of the processes in the UN state share the following stack:
crash> bt 3632440
PID: 3632440 TASK: ffff91fde98a1780 CPU: 5 COMMAND: "rsync"
#0 [ffffa932c498bca0] __schedule at ffffffffa749c4a6
#1 [ffffa932c498bd40] schedule at ffffffffa749cb48
#2 [ffffa932c498bd48] rwsem_down_write_failed_killable at ffffffffa74a074d
#3 [ffffa932c498bdf0] call_rwsem_down_write_failed_killable at ffffffffa7493a53
#4 [ffffa932c498be30] down_write_killable at ffffffffa749f7b0
#5 [ffffa932c498be38] iterate_dir at ffffffffa6edccc9
#6 [ffffa932c498be78] ksys_getdents64 at ffffffffa6eddc30
#7 [ffffa932c498bf30] __x64_sys_getdents64 at ffffffffa6edde46
#8 [ffffa932c498bf38] do_syscall_64 at ffffffffa6c0430b
#9 [ffffa932c498bf50] entry_SYSCALL_64_after_hwframe at ffffffffa7600088
RIP: 00007f612a9bd757 RSP: 00007fff67b2a338 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000562293813e70 RCX: 00007f612a9bd757
RDX: 0000000000008000 RSI: 0000562293813ea0 RDI: 0000000000000009
RBP: 0000562293813ea0 R8: 0000000000000000 R9: 0000000000000001
R10: 0000000000000001 R11: 0000000000000246 R12: ffffffffffffff80
R13: 0000000000000000 R14: 000056229381be73 R15: 000056229381be60
ORIG_RAX: 00000000000000d9 CS: 0033 SS: 002b
The other 2 processes have the following stacks:
crash> bt 3621271
PID: 3621271 TASK: ffff91fd6d044680 CPU: 7 COMMAND: "rsync"
#0 [ffffa932d099fa90] __schedule at ffffffffa749c4a6
#1 [ffffa932d099fb30] schedule at ffffffffa749cb48
#2 [ffffa932d099fb38] rwsem_down_read_failed at ffffffffa74a02fc
#3 [ffffa932d099fbe0] call_rwsem_down_read_failed at ffffffffa74939d4
#4 [ffffa932d099fc28] down_read at ffffffffa749f703
#5 [ffffa932d099fc30] lookup_slow at ffffffffa6ed42f7
#6 [ffffa932d099fc58] walk_component at ffffffffa6ed47d4
#7 [ffffa932d099fcb8] path_lookupat at ffffffffa6ed4f5e
#8 [ffffa932d099fd18] filename_lookup at ffffffffa6ed8866
#9 [ffffa932d099fe40] vfs_statx at ffffffffa6ecc593
#10 [ffffa932d099fe98] __do_sys_newlstat at ffffffffa6eccbd9
#11 [ffffa932d099ff38] do_syscall_64 at ffffffffa6c0430b
#12 [ffffa932d099ff50] entry_SYSCALL_64_after_hwframe at ffffffffa7600088
RIP: 00007f612a9e5575 RSP: 00007fff67b25fd8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f612a9e5575
RDX: 00007fff67b26100 RSI: 00007fff67b26100 RDI: 00007fff67b26190
RBP: 0000000000000000 R8: 00007fff67b261bf R9: 0000000000000000
R10: 0000000000000001 R11: 0000000000000246 R12: 00007fff67b26190
R13: 00007fff67b26100 R14: 0000000000000002 R15: 0000000000010004
ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b
crash> bt 3748012
PID: 3748012 TASK: ffff91fe27069780 CPU: 5 COMMAND: "rsync"
#0 [ffffa932c4ea77f8] __schedule at ffffffffa749c4a6
#1 [ffffa932c4ea7898] schedule at ffffffffa749cb48
#2 [ffffa932c4ea78a0] rpc_wait_bit_killable at ffffffffc072230e [sunrpc]
#3 [ffffa932c4ea78b8] __wait_on_bit at ffffffffa749cf94
#4 [ffffa932c4ea78f0] out_of_line_wait_on_bit at ffffffffa749d071
#5 [ffffa932c4ea7940] __rpc_execute at ffffffffc0722fd0 [sunrpc]
#6 [ffffa932c4ea7998] rpc_run_task at ffffffffc0716e69 [sunrpc]
#7 [ffffa932c4ea79d8] nfs4_call_sync_sequence at ffffffffc0828854 [nfsv4]
#8 [ffffa932c4ea7a48] _nfs4_proc_readdir at ffffffffc082ab85 [nfsv4]
#9 [ffffa932c4ea7b58] nfs4_proc_readdir at ffffffffc0834ff6 [nfsv4]
#10 [ffffa932c4ea7bd0] nfs_readdir_xdr_to_array at ffffffffc07e053c [nfs]
#11 [ffffa932c4ea7cd0] nfs_readdir_filler at ffffffffc07e07dd [nfs]
#12 [ffffa932c4ea7cf0] do_read_cache_page at ffffffffa6e16e6a
#13 [ffffa932c4ea7da0] nfs_readdir at ffffffffc07e096c [nfs]
#14 [ffffa932c4ea7e38] iterate_dir at ffffffffa6edcce6
#15 [ffffa932c4ea7e78] ksys_getdents64 at ffffffffa6eddc30
#16 [ffffa932c4ea7f30] __x64_sys_getdents64 at ffffffffa6edde46
#17 [ffffa932c4ea7f38] do_syscall_64 at ffffffffa6c0430b
#18 [ffffa932c4ea7f50] entry_SYSCALL_64_after_hwframe at ffffffffa7600088
RIP: 00007f612a9bd757 RSP: 00007fff67b2a338 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 000056228f6a5a90 RCX: 00007f612a9bd757
RDX: 0000000000008000 RSI: 000056228f6a5ac0 RDI: 0000000000000009
RBP: 000056228f6a5ac0 R8: 0000000000000000 R9: 0000000000000001
R10: 0000000000000001 R11: 0000000000000246 R12: ffffffffffffff80
R13: 0000000000000000 R14: 000056228f6ada93 R15: 000056228f6ada80
ORIG_RAX: 00000000000000d9 CS: 0033 SS: 002b
Next, analyze the inode-related information. The stack data is not disassembled for now (TODO), because from the code these paths should operate on only one inode.
The inode used by the processes whose stacks match PID 3632440:
crash> bt 3632440 -FF
#5 [ffffa932c498be38] iterate_dir at ffffffffa6edccc9
ffffa932c498be60: 00000000fffffff2 [ffff91fe2bb65b00:filp]
crash> kmem ffff91fe2bb65b00
FREE / [ALLOCATED]
[ffff91fe2bb65b00]
crash> struct file.f_inode ffff91fe2bb65b00
f_inode = 0xffff91fca5985500,
crash> struct inode 0xffff91fca5985500 -x
i_mode = 0x41e0, # S_IFDIR is 0x4000, so the file type is a directory
crash> bt 3644149 -FF
#5 [ffffa932cf64fe38] iterate_dir at ffffffffa6edccc9
ffffa932cf64fe60: 00000000fffffff2 [ffff91fb608c1a00:filp]
crash> kmem ffff91fb608c1a00
FREE / [ALLOCATED]
[ffff91fb608c1a00]
crash> struct file.f_inode ffff91fb608c1a00
f_inode = 0xffff91fca5985500,
crash> bt 3655486 -FF
#5 [ffffa932c4d37e38] iterate_dir at ffffffffa6edccc9
ffffa932c4d37e60: 00000000fffffff2 [ffff91fe2b81b300:filp]
crash> kmem ffff91fe2b81b300
FREE / [ALLOCATED]
[ffff91fe2b81b300]
crash> struct file.f_inode ffff91fe2b81b300
f_inode = 0xffff91fca5985500,
crash> bt 3667062 -FF
#5 [ffffa932d0857e38] iterate_dir at ffffffffa6edccc9
ffffa932d0857e60: 00000000fffffff2 [ffff91fe2bd4a900:filp]
crash> kmem ffff91fe2bd4a900
FREE / [ALLOCATED]
[ffff91fe2bd4a900]
crash> struct file ffff91fe2bd4a900
f_inode = 0xffff91fca5985500,
# The commands are omitted below; only the results are listed
# 3678549
[ffff91fcbfe03f00:filp]
FREE / [ALLOCATED]
[ffff91fcbfe03f00]
f_inode = 0xffff91fca5985500,
# 3690262
[ffff91fe2ce37c00:filp]
FREE / [ALLOCATED]
[ffff91fe2ce37c00]
f_inode = 0xffff91fca5985500,
# 3701961
[ffff91fddda2bb00:filp]
FREE / [ALLOCATED]
[ffff91fddda2bb00]
f_inode = 0xffff91fca5985500,
# 3713720
[ffff91fe12783b00:filp]
FREE / [ALLOCATED]
[ffff91fe12783b00]
f_inode = 0xffff91fca5985500,
# 3725136
[ffff91fde961b900:filp]
FREE / [ALLOCATED]
[ffff91fde961b900]
f_inode = 0xffff91fca5985500,
# 3736677
[ffff91fe29e98300:filp]
FREE / [ALLOCATED]
[ffff91fe29e98300]
f_inode = 0xffff91fca5985500,
# 3759132
[ffff91fe10973900:filp]
FREE / [ALLOCATED]
[ffff91fe10973900]
f_inode = 0xffff91fca5985500,
# 3770908
[ffff91fe2f936800:filp]
FREE / [ALLOCATED]
[ffff91fe2f936800]
f_inode = 0xffff91fca5985500,
# 3782024
[ffff91fde4eb7c00:filp]
FREE / [ALLOCATED]
[ffff91fde4eb7c00]
f_inode = 0xffff91fca5985500,
# 3794329
[ffff91fe2a111c00:filp]
FREE / [ALLOCATED]
[ffff91fe2a111c00]
f_inode = 0xffff91fca5985500,
# 3805549
[ffff91fe2bd4a000:filp]
FREE / [ALLOCATED]
[ffff91fe2bd4a000]
f_inode = 0xffff91fca5985500,
# 3817313
[ffff91fd8704b800:filp]
FREE / [ALLOCATED]
[ffff91fd8704b800]
f_inode = 0xffff91fca5985500,
# 3828467
[ffff91fe063d3500:filp]
FREE / [ALLOCATED]
[ffff91fe063d3500]
f_inode = 0xffff91fca5985500,
# 3840283
[ffff91fe25417100:filp]
FREE / [ALLOCATED]
[ffff91fe25417100]
f_inode = 0xffff91fca5985500,
The inode for process 3748012:
crash> bt 3748012 -FF
[ffff91fdfdcbf000:filp]
crash> kmem ffff91fdfdcbf000
FREE / [ALLOCATED]
[ffff91fdfdcbf000]
crash> struct file.f_inode ffff91fdfdcbf000
f_inode = 0xffff91fca5985500,
The inode for process 3621271:
crash> bt 3621271 -FF
PID: 3621271 TASK: ffff91fd6d044680 CPU: 7 COMMAND: "rsync"
#0 [ffffa932d099fa90] __schedule at ffffffffa749c4a6
#1 [ffffa932d099fb30] schedule at ffffffffa749cb48
#2 [ffffa932d099fb38] rwsem_down_read_failed at ffffffffa74a02fc
#3 [ffffa932d099fbe0] call_rwsem_down_read_failed at ffffffffa74939d4
ffffa932d099fbe8: 0000000000000000 61c8864680b583eb
ffffa932d099fbf8: [ffff91fe29cee027:names_cache] 0000000000000028
ffffa932d099fc08: 0000000000000000 00072a8500000000
ffffa932d099fc18: [ffff91fcb02b7440:dentry] [ffff91fca59855a0:nfs_inode_cache]
ffffa932d099fc28: down_read+19
#4 [ffffa932d099fc28] down_read at ffffffffa749f703
#5 [ffffa932d099fc30] lookup_slow at ffffffffa6ed42f7
#6 [ffffa932d099fc58] walk_component at ffffffffa6ed47d4
#7 [ffffa932d099fcb8] path_lookupat at ffffffffa6ed4f5e
#8 [ffffa932d099fd18] filename_lookup at ffffffffa6ed8866
#9 [ffffa932d099fe40] vfs_statx at ffffffffa6ecc593
#10 [ffffa932d099fe98] __do_sys_newlstat at ffffffffa6eccbd9
#11 [ffffa932d099ff38] do_syscall_64 at ffffffffa6c0430b
#12 [ffffa932d099ff50] entry_SYSCALL_64_after_hwframe at ffffffffa7600088
crash> kmem ffff91fcb02b7440 # [ffff91fcb02b7440:dentry]
FREE / [ALLOCATED]
[ffff91fcb02b7440]
crash> struct dentry ffff91fcb02b7440
d_inode = 0xffff91fca5985500,
crash> kmem ffff91fca59855a0 # [ffff91fca59855a0:nfs_inode_cache]
FREE / [ALLOCATED]
[ffff91fca5985320]
crash> struct nfs_inode ffff91fca5985320 -ox
[ffff91fca5985500] struct inode vfs_inode;
As can be seen, all 20 processes operate on the same inode, and that inode's file type is a directory.
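To see which directory this inode corresponds to, the dentry captured in PID 3621271's stack can be resolved to a full pathname. This step was not run in the session above; crash's files -d takes a dentry address and displays its inode and full path:
crash> files -d ffff91fcb02b7440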
Next, reproduce and analyze the problem in a local QEMU VM (the reproduction setup is described below). Press ctrl + a c to switch to the QEMU monitor and export the vmcore with the following command:
(qemu) dump-guest-memory /your_path/vmcore
List all the ls processes:
crash> ps ls
PID PPID CPU TASK ST %MEM VSZ RSS COMM
571 570 5 ffff88812f9b2680 UN 3.3 189588 175328 ls
574 573 12 ffff888117444d00 UN 3.4 190248 180860 ls
578 577 13 ffff888117440000 UN 2.3 182328 123108 ls
581 580 10 ffff88810217cd00 UN 2.9 186420 153032 ls
584 583 2 ffff88812e5da680 UN 2.1 180876 112452 ls
587 586 4 ffff88812e5dcd00 UN 3.2 188796 170200 ls
590 589 3 ffff88812f988000 UN 2.7 184836 141652 ls
593 592 4 ffff88810c8dcd00 UN 2.8 185628 147248 ls
596 595 14 ffff88810c8d8000 UN 2.9 186288 152092 ls
599 598 2 ffff8881176b8000 UN 3.2 188796 169756 ls
602 601 4 ffff88812fb32680 UN 3.8 352624 197372 ls
605 604 2 ffff88812fb34d00 UN 3.2 188796 169760 ls
608 607 0 ffff88810769a680 UN 2.5 183384 131228 ls
611 610 4 ffff88810761cd00 UN 2.8 185892 149336 ls
614 613 1 ffff8881076e4d00 UN 2.9 186024 150032 ls
617 616 14 ffff888134ab4d00 UN 3.1 188004 164404 ls
620 619 6 ffff888102244d00 UN 2.3 182196 121744 ls
623 622 13 ffff88812fa88000 UN 2.8 185628 147200 ls
626 625 13 ffff888107618000 UN 3.0 187344 159756 ls
629 628 3 ffff888117762680 UN 2.6 184440 137968 ls
19 of the processes have the following stack:
PID: 571 TASK: ffff88812f9b2680 CPU: 5 COMMAND: "ls"
#0 [ffffc900011dbce0] __schedule at ffffffff82186cdc
#1 [ffffc900011dbd78] schedule at ffffffff82187222
#2 [ffffc900011dbd88] rwsem_down_write_failed_killable at ffffffff8218a202
#3 [ffffc900011dbe18] call_rwsem_down_write_failed_killable at ffffffff8217bab3
#4 [ffffc900011dbe58] down_write_killable at ffffffff821893c9
#5 [ffffc900011dbe68] iterate_dir at ffffffff813f52fc
#6 [ffffc900011dbea8] ksys_getdents64 at ffffffff813f629c
#7 [ffffc900011dbf18] __x64_sys_getdents64 at ffffffff813f636d
#8 [ffffc900011dbf28] do_syscall_64 at ffffffff81004ad2
#9 [ffffc900011dbf50] entry_SYSCALL_64_after_hwframe at ffffffff82200088
1 process has the following stack (after a while, the process that is inside the NFS request changes to another one):
PID: 574 TASK: ffff888117444d00 CPU: 12 COMMAND: "ls"
#0 [ffffc900011eb860] __schedule at ffffffff82186cdc
#1 [ffffc900011eb8f8] schedule at ffffffff82187222
#2 [ffffc900011eb908] rpc_wait_bit_killable at ffffffff820c0d55
#3 [ffffc900011eb928] __wait_on_bit at ffffffff82187652
#4 [ffffc900011eb960] out_of_line_wait_on_bit at ffffffff82187722
#5 [ffffc900011eb9a8] __rpc_execute at ffffffff820c22fa
#6 [ffffc900011eb9e8] rpc_execute at ffffffff820c297e
#7 [ffffc900011eba10] rpc_run_task at ffffffff820adc89
#8 [ffffc900011eba50] nfs4_call_sync_sequence at ffffffff8156b97f
#9 [ffffc900011ebad8] _nfs4_proc_readdir at ffffffff81570173
#10 [ffffc900011ebbd8] nfs4_proc_readdir at ffffffff81579dd6
#11 [ffffc900011ebc50] nfs_readdir_xdr_to_array at ffffffff8153f3b0
#12 [ffffc900011ebd50] nfs_readdir_filler at ffffffff8153f5cb
#13 [ffffc900011ebd70] do_read_cache_page at ffffffff812ec110
#14 [ffffc900011ebdc0] nfs_readdir at ffffffff8153f7eb
#15 [ffffc900011ebe68] iterate_dir at ffffffff813f522e
#16 [ffffc900011ebea8] ksys_getdents64 at ffffffff813f629c
#17 [ffffc900011ebf18] __x64_sys_getdents64 at ffffffff813f636d
#18 [ffffc900011ebf28] do_syscall_64 at ffffffff81004ad2
#19 [ffffc900011ebf50] entry_SYSCALL_64_after_hwframe at ffffffff82200088
One clue: the problematic directory contains 800,000 to 900,000 files of a few KB each. Construct that scenario.
Generate a large number of files on the NFS server:
i=0
while true
do
    echo 1234567890 > file${i} # don't create the files with touch, it is slower
    ((i++))
    echo ${i}
    if [ ${i} -eq 1000000 ]
    then
        break
    fi
done
Mount it on the client:
mount -t nfs -o rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none 192.168.122.209:/s_test /mnt
With 1,000,000 files, an ls traversal is split into multiple readdir requests, each returning 167 directory entries.
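The per-request batching can be sanity-checked from the client side. A sketch, assuming /mnt/dir holds the million files: compare the READDIR operation count in /proc/self/mountstats before and after one traversal.
grep READDIR: /proc/self/mountstats # note the first number, the op count
ls /mnt/dir > /dev/null
grep READDIR: /proc/self/mountstats # count again; 1,000,000 / delta ≈ 167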
Unless otherwise noted, the client and the server run the same kernel version.
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir | wc
1000002 9000011 50888949
real 0m37.319s
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir | wc & # launched 20 times back to back
An OOM occurred.
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir | wc
1000002 9000011 50888949
real 0m42.712s
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir | wc & # launched 10 times back to back
real 5m30.489s
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir | wc
1000002 9000011 50888949
real 1m21.104s
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir | wc & # launched 10 times back to back
real 1m26.528s
The patch applied here is 0001-nfs-add-old-iterate-directory-operation-for-debug.patch.
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-100w | wc
1000002 9000011 50888949
real 1m34.346s
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-100w | wc & # launched 10 times back to back
real 5m26.562s
An OOM occurred soon after.
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-100w | wc
1000002 9000011 50888949
real 1m34.346s
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-100w | wc & # launched 15 times back to back
real 7m50.229s
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-10w | wc
100002 900011 4988949
real 0m7.268s
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-10w | wc & # launched 15 times back to back
real 0m21.190s
time ls -lh /mnt/dir-1w | wc
10002 90011 488948
real 0m0.830s
real 0m0.838s
real 0m0.848s
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-1w | wc & # launched 15 times back to back
real 0m0.981s
real 0m0.984s
real 0m1.038s
real 0m0.974s
#                 100w        10w        1w
# 1 process       94.346s     7.268s     0.848s
# 15 processes    470.229s    21.190s    1s
With 100w, 10w, and 1w files and the same 15 concurrent processes, the time ratio is roughly 470:21:1; every 10x increase in file count multiplies the time by roughly 20x or more (470/21 ≈ 22, 21/1 ≈ 21).
The following turned out to be of no use to the result, so feel free to skip it; but since the work was done at the time, it is recorded here anyway.
Because the production environment carries a very large volume of data, long-running captures are impossible; instead, capture for a stretch at a time, overwriting the file each round, and stop once the problem is judged to have reproduced:
while true
do
    # run tcpdump in the background so the loop can time-box it
    sudo tcpdump --interface=<network interface> --buffer-size=20480 -w out.cap &
    pid=$(sudo pgrep tcpdump)
    sleep 60
    sudo kill -SIGINT ${pid}
    sudo ps aux | grep rsync | grep D
    if [ $? -eq 0 ]
    then
        break
    fi
    sudo rm out.cap
done
+-------------+
| test |
| nfsclient |
| (E) |
+-------------+
^
|
nfs
|
v
+-------------+ +-------------+
| huawei | | huawei |
| nfsserver1 | | nfsserver2 |
| (A) | | (D) |
+-------------+ +-------------+
^ ^
| |
nfs nfs
| |
v v
+-------------+ +-------------+
| | | |
| rsyncserver |<--rsync-->| rsyncclient |
| (B) | | (C) |
+-------------+ +-------------+
Packet-capture steps (proven infeasible: capturing for a long time produces too much data and files that are too large):
1. After the problem reproduces (the rsync processes hang), E, which mounts the same NAS volume, is still healthy; note that it must be the same NAS volume.
2. sudo tcpdump --interface=<network interface> --buffer-size=20480 -w out.cap
3. Run ls | wc, and stop the capture after 30 seconds.
Steps to check the process state:
1. Find the rsync PIDs: ps aux | grep rsync
2. Dump an rsync process's kernel stack: cat /proc/<PID>/stack
3. Check an rsync process's voluntary context switches: cat /proc/<PID of the rsync process currently in the nfs request>/status | grep voluntary_ctxt_switches
4. Trace the system calls with strace and analyze them.
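A loop form of steps 1 and 2 (a sketch; the PIDs vary per run) that dumps every rsync process's kernel stack, making it easy to spot the one inside the RPC wait versus the ones blocked on the rwsem:
for pid in $(pgrep rsync); do
    echo "== ${pid}"
    sudo cat /proc/${pid}/stack
done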
+-------------+
| test |
| nfsclient |
| (E) |
+-------------+
^
|
nfs
|
v
+-------------+ +-------------+
| huawei | | huawei |
| nfsserver1 | | nfsserver2 |
| (A) | | (D) |
+-------------+ +-------------+
^ ^
| |
nfs nfs
| |
v v
+-------------+ +-------------+
| | | |
| rsyncserver |<--rsync-->| rsyncclient |
| (B) | | (C) |
+-------------+ +-------------+
Use strace to record the timestamps of the readdir requests:
strace -ttt -f -v -s 40960 -o strace.out ls <the directory with the million files> | wc
After the strace command finishes, stop the strace command on B as well (just press ctrl + c), then copy both strace.out files out for comparison.
nU2BlPA3xoSfYaPLVs1DqiZU4ANOtV # file name
1712802314.374988 # timestamp on machine B
1712802323.791969 # timestamp on machine E
Y348qeD3dxaeJOsyiTNON3qCtEE8yn
1712802401.142976-1712802314.374988=86.767988
1712802325.169445-1712802323.791969=1.377476
xxwdIL00iCzXviXcoCaNCvrvIyAkeP
1712802509.642010-1712802314.374988=195.267022
1712802326.528755-1712802323.791969=2.736786
L8mSFdfuAtzgWAp646sVy73uTMgbaH
1712802607.192381-1712802314.374988=292.817393
1712802328.083538-1712802323.791969=4.291569
9wHPHGNObYKMRERIhHuRaKiZW9Jer4
1712802707.587591-1712802314.374988=393.212603
1712802329.226065-1712802323.791969=5.434096
Z9Jj6YvkzMCt0lymymEtZhmIAFEDMH
1712802803.228566-1712802314.374988=488.853578
1712802330.254807-1712802323.791969=6.462838
VLww0v31UJplVOvh1bYlQ2Wwz4OfiI
1712802824.954462-1712802314.374988=510.579474
1712802330.369828-1712802323.791969=6.577859
Sampling a few of these timestamps shows that the execution speed of the ls command differs by a factor of nearly 100 between the two machines.
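For reference, a sketch of how such per-file timestamps can be pulled out of each strace.out (the file name is one of those listed above; with strace -ttt -f, the timestamp is the second field of every line):
grep -m1 Y348qeD3dxaeJOsyiTNON3qCtEE8yn strace.out | awk '{print $2}'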
Because NFS does not implement the iterate_shared method, iterate_dir takes the write lock with down_write_killable, which is mutually exclusive with both the write and the read locks of every other process. Under heavy concurrency, each acquisition of the write lock in iterate_dir therefore has to wait a very long time. And when the processes are inspected with ps, most of their time is spent sleeping after down_write_killable or inside __rpc_execute, so they are always seen in the D state. The annotated call chain:
getdents64
  ksys_getdents64
    iterate_dir
      // nfs implements iterate rather than iterate_shared, so this check fails
      if (file->f_op->iterate_shared)
      shared == false
      // a write lock is taken here, so there is no concurrency with other processes;
      // the long lock wait is one reason ps always shows these processes in D state
      down_write_killable(&inode->i_rwsem)
        call_rwsem_down_write_failed_killable
          rwsem_down_write_failed_killable
            schedule
              __schedule
      nfs_readdir
        do_read_cache_page
          nfs_readdir_filler
            nfs_readdir_xdr_to_array
              nfs4_proc_readdir
                _nfs4_proc_readdir
                  nfs4_call_sync_sequence
                    rpc_run_task
                      __rpc_execute
                        out_of_line_wait_on_bit
                          __wait_on_bit
                            rpc_wait_bit_killable
                              // while looping, most of the time is spent waiting for
                              // the nfs server's reply, which is why ps also shows
                              // this process in the D state
                              schedule
                                __schedule
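By contrast, on a local filesystem that does implement iterate_shared, such as ext4, iterate_dir takes i_rwsem as a read lock, so concurrent traversals of one directory can proceed in parallel. A sketch of the comparison run (the ext4 directory path is an assumption):
echo 3 > /proc/sys/vm/drop_caches
time ls -lh /data/ext4-dir-100w | wc & # launched 15 times back to back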
The fix is commit 93a6ab7b691f ("NFS: Switch readdir to using iterate_shared()"); the old ->iterate directory operation was later removed entirely by commit 3e3271549670 ("vfs: get rid of old '->iterate' directory operation").
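A quick way to check whether a given kernel source tree already contains the fix (a sketch; assumes a git checkout of the kernel in ./linux):
git -C linux merge-base --is-ancestor 93a6ab7b691f HEAD && echo "fix present"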
+-------------+ +-------------+
| huawei | | huawei |
| nfsserver1 | | nfsserver2 |
| (A) | | (D) |
+-------------+ +-------------+
^ ^
| |
nfs nfs
| |
v v
+-------------+ +-------------+
| | | |
| rsyncserver |<--rsync-->| rsyncclient |
| (B) | | (C) |
+-------------+ +-------------+
Many processes on the rsyncserver operate on a directory holding on the order of a million files. Because directory traversal cannot proceed concurrently across processes, each process runs for a very long time, and many of them show up in the D state under ps.
Recommendation: increase the interval between rsync synchronization runs.
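Beyond lengthening the interval, one way to keep scheduled runs from piling up on the same huge directory is to serialize them with flock (a sketch reusing the rsync command from above; the lock-file path is an example). A run started while the previous one still holds the lock exits instead of adding another D-state reader:
flock -n /var/lock/rsync_cdis.lock \
    rsync -avz rsuser@172.23.8.123::CDISDATA_FS01 /home/cdis/data/ \
        --port 10000 --password-file=/etc/rsync_cdis.password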