NFS directory traversal cannot run concurrently because nfs does not implement iterate_shared()

1 Problem description

Kernel version tag: 4.19.90-23.23.v2101

The on-site environment is as follows:

+-------------+           +-------------+
|   huawei    |           |   huawei    |
| nfsserver1  |           | nfsserver2  |
|    (A)      |           |    (D)      |
+-------------+           +-------------+
      ^                          ^
      |                          |
     nfs                        nfs
      |                          |
      v                          v
+-------------+           +-------------+
|             |           |             |
| rsyncserver |<--rsync-->| rsyncclient |
|    (B)      |           |    (C)      |
+-------------+           +-------------+

rsyncclient (C) mounts nfsserver2 (D) over NFS, rsyncserver (B) mounts nfsserver1 (A) over NFS, and rsyncclient (C) backs up rsyncserver (B)'s data via rsync.

The mount options on rsyncserver (B):

172.23.15.11:/WGPCDIAPP_FS01 on /home/cdis/data type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.23.8.123,local_lock=none,addr=172.23.15.11)

The rsync command used on rsyncserver (B):

rsync -avz rsuser@172.23.8.123::CDISDATA_FS01  /home/cdis/data/  --port 10000 --password-file=/etc/rsync_cdis.password

The problem is that tasks on B accessing files on A hit hung-task warnings, yet running ls on B against A's files returns normally, which shows the NFS connection between A and B is healthy.

The log on rsyncserver (B):

[316100.086897] INFO: task rsync:3621271 blocked for more than 120 seconds.
[316100.086950]       Not tainted 4.19.90-23.23.v2101.ky10.x86_64 #1
[316100.086987] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[316100.087036] rsync           D    0 3621271 152322 0x00000080
[316100.087039] Call Trace:
[316100.087052]  schedule+0x28/0x80
[316100.087055]  rwsem_down_read_failed+0x11c/0x190
[316100.087059]  call_rwsem_down_read_failed+0x14/0x30
[316100.087060]  down_read+0x13/0x30
[316100.087063]  lookup_slow+0x27/0x50
[316100.087066]  walk_component+0x1c4/0x350
[316100.087071]  path_lookupat+0x6e/0x270
[316100.087074]  filename_lookup+0xb6/0x190
[316100.087085]  vfs_statx+0x73/0xe0
[316100.087087]  __do_sys_newlstat+0x39/0x70
[316100.087093]  do_syscall_64+0x5b/0x1d0
[316100.087095]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[316100.087097] RIP: 0033:0x7f612a9e5575
[316100.087102] Code: Bad RIP value.
[316100.087103] RSP: 002b:00007fff67b25fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
[316100.087105] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f612a9e5575
[316100.087105] RDX: 00007fff67b26100 RSI: 00007fff67b26100 RDI: 00007fff67b26190
[316100.087106] RBP: 0000000000000000 R08: 00007fff67b261bf R09: 0000000000000000
[316100.087107] R10: 0000000000000001 R11: 0000000000000246 R12: 00007fff67b26190
[316100.087107] R13: 00007fff67b26100 R14: 0000000000000002 R15: 0000000000010004
[316341.753069] INFO: task rsync:3621271 blocked for more than 120 seconds.

2 vmcore analysis

Extract the vmlinux and .ko debuginfo files from the debuginfo rpm:

rpm2cpio kernel-debuginfo-4.19.90-23.23.v2101.ky10.x86_64.rpm | cpio -div

Load the module debuginfo:

crash> mod -s nfs kernel-debuginfo-4.19.90-23.23.v2101.ky10.x86_64/usr/lib/debug/lib/modules/4.19.90-23.23.v2101.ky10.x86_64/kernel/fs/nfs/nfs.ko.debug
crash> mod -s nfsv3 kernel-debuginfo-4.19.90-23.23.v2101.ky10.x86_64/usr/lib/debug/lib/modules/4.19.90-23.23.v2101.ky10.x86_64/kernel/fs/nfs/nfsv3.ko.debug # not actually needed
crash> mod -s nfsv4 kernel-debuginfo-4.19.90-23.23.v2101.ky10.x86_64/usr/lib/debug/lib/modules/4.19.90-23.23.v2101.ky10.x86_64/kernel/fs/nfs/nfsv4.ko.debug

List all rsync processes. In the ST column, IN means TASK_INTERRUPTIBLE and UN means TASK_UNINTERRUPTIBLE; see include/linux/sched.h for the full list.
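
For reference, the state bits behind those abbreviations in include/linux/sched.h (4.19):

/* include/linux/sched.h: used in task_struct::state */
#define TASK_RUNNING            0x0000
#define TASK_INTERRUPTIBLE      0x0001  /* "IN" in crash ps output; S in ps */
#define TASK_UNINTERRUPTIBLE    0x0002  /* "UN" in crash ps output; D in ps, counted by hung_task */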

crash> ps rsync # or use ps | grep UN to find every TASK_UNINTERRUPTIBLE process
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
  152322      1   5  ffff91fd87239780  IN   0.0  213108   1592  rsync
  3621271  152322   7  ffff91fd6d044680  UN   0.7  409284 128232  rsync
  3632440  152322   5  ffff91fde98a1780  UN   1.0  410648 181880  rsync
  3644149  152322   0  ffff91fe2c35de00  UN   1.0  410052 183764  rsync
  3655486  152322   5  ffff91fd8f1ac680  UN   1.0  410948 184780  rsync
  3667062  152322   5  ffff91fe2d47c680  UN   1.0  409860 183572  rsync
  3678549  152322   3  ffff91fe2f851780  UN   1.0  410072 182824  rsync
  3690262  152322   4  ffff91fddd830000  UN   1.0  409832 182816  rsync
  3701961  152322   4  ffff91fd8723c680  UN   1.0  410392 182852  rsync
  3713720  152322   7  ffff91fe12449780  UN   1.0  410080 183720  rsync
  3725136  152322   3  ffff91fb2af01780  UN   1.0  410104 183060  rsync
  3736677  152322   1  ffff91fb1d6e9780  UN   1.0  409856 182816  rsync
  3748012  152322   5  ffff91fe27069780  UN   1.0  410108 183348  rsync
  3759132  152322   0  ffff91fab0c0de00  UN   1.0  407016 180696  rsync
  3770908  152322   7  ffff91fde4e95e00  UN   1.0  411200 184848  rsync
  3782024  152322   7  ffff91fde9f82f00  UN   1.0  411028 184848  rsync
  3794329  152322   5  ffff91fb26acaf00  UN   1.0  409796 182864  rsync
  3805549  152322   7  ffff91fab7445e00  UN   1.0  410084 182880  rsync
  3817313  152322   2  ffff91fab0c0af00  UN   1.0  410084 183448  rsync
  3828467  152322   7  ffff91fd8f1aaf00  UN   1.0  410600 183164  rsync
  3840283  152322   6  ffff91fe2d5a5e00  UN   1.0  410384 183152  rsync

Eighteen of the UN processes share the following stack:

crash> bt 3632440
PID: 3632440  TASK: ffff91fde98a1780  CPU: 5   COMMAND: "rsync"
 #0 [ffffa932c498bca0] __schedule at ffffffffa749c4a6
 #1 [ffffa932c498bd40] schedule at ffffffffa749cb48
 #2 [ffffa932c498bd48] rwsem_down_write_failed_killable at ffffffffa74a074d
 #3 [ffffa932c498bdf0] call_rwsem_down_write_failed_killable at ffffffffa7493a53
 #4 [ffffa932c498be30] down_write_killable at ffffffffa749f7b0
 #5 [ffffa932c498be38] iterate_dir at ffffffffa6edccc9
 #6 [ffffa932c498be78] ksys_getdents64 at ffffffffa6eddc30
 #7 [ffffa932c498bf30] __x64_sys_getdents64 at ffffffffa6edde46
 #8 [ffffa932c498bf38] do_syscall_64 at ffffffffa6c0430b
 #9 [ffffa932c498bf50] entry_SYSCALL_64_after_hwframe at ffffffffa7600088
    RIP: 00007f612a9bd757  RSP: 00007fff67b2a338  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000562293813e70  RCX: 00007f612a9bd757
    RDX: 0000000000008000  RSI: 0000562293813ea0  RDI: 0000000000000009
    RBP: 0000562293813ea0   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000001  R11: 0000000000000246  R12: ffffffffffffff80
    R13: 0000000000000000  R14: 000056229381be73  R15: 000056229381be60
    ORIG_RAX: 00000000000000d9  CS: 0033  SS: 002b

The remaining two processes have these stacks:

crash> bt 3621271
PID: 3621271  TASK: ffff91fd6d044680  CPU: 7   COMMAND: "rsync"
 #0 [ffffa932d099fa90] __schedule at ffffffffa749c4a6
 #1 [ffffa932d099fb30] schedule at ffffffffa749cb48
 #2 [ffffa932d099fb38] rwsem_down_read_failed at ffffffffa74a02fc
 #3 [ffffa932d099fbe0] call_rwsem_down_read_failed at ffffffffa74939d4
 #4 [ffffa932d099fc28] down_read at ffffffffa749f703
 #5 [ffffa932d099fc30] lookup_slow at ffffffffa6ed42f7
 #6 [ffffa932d099fc58] walk_component at ffffffffa6ed47d4
 #7 [ffffa932d099fcb8] path_lookupat at ffffffffa6ed4f5e
 #8 [ffffa932d099fd18] filename_lookup at ffffffffa6ed8866
 #9 [ffffa932d099fe40] vfs_statx at ffffffffa6ecc593
#10 [ffffa932d099fe98] __do_sys_newlstat at ffffffffa6eccbd9
#11 [ffffa932d099ff38] do_syscall_64 at ffffffffa6c0430b
#12 [ffffa932d099ff50] entry_SYSCALL_64_after_hwframe at ffffffffa7600088
    RIP: 00007f612a9e5575  RSP: 00007fff67b25fd8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f612a9e5575
    RDX: 00007fff67b26100  RSI: 00007fff67b26100  RDI: 00007fff67b26190
    RBP: 0000000000000000   R8: 00007fff67b261bf   R9: 0000000000000000
    R10: 0000000000000001  R11: 0000000000000246  R12: 00007fff67b26190
    R13: 00007fff67b26100  R14: 0000000000000002  R15: 0000000000010004
    ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b

crash> bt 3748012
PID: 3748012  TASK: ffff91fe27069780  CPU: 5   COMMAND: "rsync"
 #0 [ffffa932c4ea77f8] __schedule at ffffffffa749c4a6
 #1 [ffffa932c4ea7898] schedule at ffffffffa749cb48
 #2 [ffffa932c4ea78a0] rpc_wait_bit_killable at ffffffffc072230e [sunrpc]
 #3 [ffffa932c4ea78b8] __wait_on_bit at ffffffffa749cf94
 #4 [ffffa932c4ea78f0] out_of_line_wait_on_bit at ffffffffa749d071
 #5 [ffffa932c4ea7940] __rpc_execute at ffffffffc0722fd0 [sunrpc]
 #6 [ffffa932c4ea7998] rpc_run_task at ffffffffc0716e69 [sunrpc]
 #7 [ffffa932c4ea79d8] nfs4_call_sync_sequence at ffffffffc0828854 [nfsv4]
 #8 [ffffa932c4ea7a48] _nfs4_proc_readdir at ffffffffc082ab85 [nfsv4]
 #9 [ffffa932c4ea7b58] nfs4_proc_readdir at ffffffffc0834ff6 [nfsv4]
#10 [ffffa932c4ea7bd0] nfs_readdir_xdr_to_array at ffffffffc07e053c [nfs]
#11 [ffffa932c4ea7cd0] nfs_readdir_filler at ffffffffc07e07dd [nfs]
#12 [ffffa932c4ea7cf0] do_read_cache_page at ffffffffa6e16e6a
#13 [ffffa932c4ea7da0] nfs_readdir at ffffffffc07e096c [nfs]
#14 [ffffa932c4ea7e38] iterate_dir at ffffffffa6edcce6
#15 [ffffa932c4ea7e78] ksys_getdents64 at ffffffffa6eddc30
#16 [ffffa932c4ea7f30] __x64_sys_getdents64 at ffffffffa6edde46
#17 [ffffa932c4ea7f38] do_syscall_64 at ffffffffa6c0430b
#18 [ffffa932c4ea7f50] entry_SYSCALL_64_after_hwframe at ffffffffa7600088
    RIP: 00007f612a9bd757  RSP: 00007fff67b2a338  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 000056228f6a5a90  RCX: 00007f612a9bd757
    RDX: 0000000000008000  RSI: 000056228f6a5ac0  RDI: 0000000000000009
    RBP: 000056228f6a5ac0   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000001  R11: 0000000000000246  R12: ffffffffffffff80
    R13: 0000000000000000  R14: 000056228f6ada93  R15: 000056228f6ada80
    ORIG_RAX: 00000000000000d9  CS: 0033  SS: 002b

Next, analyze the inode involved. We skip disassembling the stack data for now (TODO), because from the code each of these paths only touches a single inode.

The processes sharing 3632440's stack operate on the following inode:

crash> bt 3632440 -FF
 #5 [ffffa932c498be38] iterate_dir at ffffffffa6edccc9
    ffffa932c498be60: 00000000fffffff2 [ffff91fe2bb65b00:filp]
crash> kmem ffff91fe2bb65b00
  FREE / [ALLOCATED]
  [ffff91fe2bb65b00]
crash> struct file.f_inode ffff91fe2bb65b00
  f_inode = 0xffff91fca5985500,
crash> struct inode.i_mode 0xffff91fca5985500 -x
  i_mode = 0x41e0, # S_IFDIR is 0x4000, so this inode is a directory (0x1e0 = permissions 0740)

crash> bt 3644149 -FF
 #5 [ffffa932cf64fe38] iterate_dir at ffffffffa6edccc9
    ffffa932cf64fe60: 00000000fffffff2 [ffff91fb608c1a00:filp] 
crash> kmem ffff91fb608c1a00
  FREE / [ALLOCATED]
  [ffff91fb608c1a00]
crash> struct file.f_inode ffff91fb608c1a00
  f_inode = 0xffff91fca5985500,

crash> bt 3655486 -FF
 #5 [ffffa932c4d37e38] iterate_dir at ffffffffa6edccc9
    ffffa932c4d37e60: 00000000fffffff2 [ffff91fe2b81b300:filp] 
crash> kmem ffff91fe2b81b300
  FREE / [ALLOCATED]
  [ffff91fe2b81b300]
crash> struct file.f_inode ffff91fe2b81b300
  f_inode = 0xffff91fca5985500,

crash> bt 3667062 -FF
 #5 [ffffa932d0857e38] iterate_dir at ffffffffa6edccc9
    ffffa932d0857e60: 00000000fffffff2 [ffff91fe2bd4a900:filp] 
crash> kmem ffff91fe2bd4a900
  FREE / [ALLOCATED]
  [ffff91fe2bd4a900]
crash> struct file ffff91fe2bd4a900
  f_inode = 0xffff91fca5985500,

# Commands omitted below for brevity; results only
# 3678549
[ffff91fcbfe03f00:filp]
  FREE / [ALLOCATED]
  [ffff91fcbfe03f00]
  f_inode = 0xffff91fca5985500,

# 3690262
[ffff91fe2ce37c00:filp]
  FREE / [ALLOCATED]
  [ffff91fe2ce37c00]
  f_inode = 0xffff91fca5985500,

# 3701961
[ffff91fddda2bb00:filp]
  FREE / [ALLOCATED]
  [ffff91fddda2bb00]
  f_inode = 0xffff91fca5985500,

# 3713720
[ffff91fe12783b00:filp]
  FREE / [ALLOCATED]
  [ffff91fe12783b00]
  f_inode = 0xffff91fca5985500,

# 3725136
[ffff91fde961b900:filp]
  FREE / [ALLOCATED]
  [ffff91fde961b900]
  f_inode = 0xffff91fca5985500,

# 3736677
[ffff91fe29e98300:filp]
  FREE / [ALLOCATED]
  [ffff91fe29e98300]
  f_inode = 0xffff91fca5985500,

# 3759132
[ffff91fe10973900:filp]
  FREE / [ALLOCATED]
  [ffff91fe10973900]
  f_inode = 0xffff91fca5985500,

# 3770908
[ffff91fe2f936800:filp]
  FREE / [ALLOCATED]
  [ffff91fe2f936800]
  f_inode = 0xffff91fca5985500,

# 3782024
[ffff91fde4eb7c00:filp]
  FREE / [ALLOCATED]
  [ffff91fde4eb7c00]
  f_inode = 0xffff91fca5985500,

# 3794329
[ffff91fe2a111c00:filp]
  FREE / [ALLOCATED]
  [ffff91fe2a111c00]
  f_inode = 0xffff91fca5985500,

# 3805549
[ffff91fe2bd4a000:filp]
  FREE / [ALLOCATED]
  [ffff91fe2bd4a000]
  f_inode = 0xffff91fca5985500,

# 3817313
[ffff91fd8704b800:filp]
  FREE / [ALLOCATED]
  [ffff91fd8704b800]
  f_inode = 0xffff91fca5985500,

# 3828467
[ffff91fe063d3500:filp]
  FREE / [ALLOCATED]
  [ffff91fe063d3500]
  f_inode = 0xffff91fca5985500,

# 3840283
[ffff91fe25417100:filp]
  FREE / [ALLOCATED]
  [ffff91fe25417100]
  f_inode = 0xffff91fca5985500,

Inode for process 3748012:

crash> bt 3748012 -FF
[ffff91fdfdcbf000:filp]
crash> kmem ffff91fdfdcbf000
  FREE / [ALLOCATED]
  [ffff91fdfdcbf000]
crash> struct file.f_inode ffff91fdfdcbf000
  f_inode = 0xffff91fca5985500,

Inode for process 3621271:

crash> bt 3621271 -FF
PID: 3621271  TASK: ffff91fd6d044680  CPU: 7   COMMAND: "rsync"
 #0 [ffffa932d099fa90] __schedule at ffffffffa749c4a6
 #1 [ffffa932d099fb30] schedule at ffffffffa749cb48
 #2 [ffffa932d099fb38] rwsem_down_read_failed at ffffffffa74a02fc
 #3 [ffffa932d099fbe0] call_rwsem_down_read_failed at ffffffffa74939d4
     ffffa932d099fbe8: 0000000000000000 61c8864680b583eb 
    ffffa932d099fbf8: [ffff91fe29cee027:names_cache] 0000000000000028 
    ffffa932d099fc08: 0000000000000000 00072a8500000000 
    ffffa932d099fc18: [ffff91fcb02b7440:dentry] [ffff91fca59855a0:nfs_inode_cache] 
    ffffa932d099fc28: down_read+19     
 #4 [ffffa932d099fc28] down_read at ffffffffa749f703
 #5 [ffffa932d099fc30] lookup_slow at ffffffffa6ed42f7
 #6 [ffffa932d099fc58] walk_component at ffffffffa6ed47d4
 #7 [ffffa932d099fcb8] path_lookupat at ffffffffa6ed4f5e
 #8 [ffffa932d099fd18] filename_lookup at ffffffffa6ed8866
 #9 [ffffa932d099fe40] vfs_statx at ffffffffa6ecc593
#10 [ffffa932d099fe98] __do_sys_newlstat at ffffffffa6eccbd9
#11 [ffffa932d099ff38] do_syscall_64 at ffffffffa6c0430b
#12 [ffffa932d099ff50] entry_SYSCALL_64_after_hwframe at ffffffffa7600088

crash> kmem ffff91fcb02b7440 # [ffff91fcb02b7440:dentry]
  FREE / [ALLOCATED]
  [ffff91fcb02b7440]
crash> struct dentry ffff91fcb02b7440
  d_inode = 0xffff91fca5985500,

crash> kmem ffff91fca59855a0 # [ffff91fca59855a0:nfs_inode_cache]
  FREE / [ALLOCATED]
  [ffff91fca5985320]
crash> struct nfs_inode ffff91fca5985320 -ox
  [ffff91fca5985500] struct inode vfs_inode;

As can be seen, all 20 processes operate on the same inode, and that inode is a directory.
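
Incidentally, instead of dumping each task by hand, crash's foreach command can batch this, e.g. something like: foreach rsync bt -FF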

2.1 Analyzing a vmcore constructed in QEMU

Press Ctrl-a c to open the QEMU monitor and dump a vmcore with:

(qemu) dump-guest-memory /your_path/vmcore

List all ls processes:

crash> ps ls
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
    571    570   5  ffff88812f9b2680  UN   3.3  189588 175328  ls
    574    573  12  ffff888117444d00  UN   3.4  190248 180860  ls
    578    577  13  ffff888117440000  UN   2.3  182328 123108  ls
    581    580  10  ffff88810217cd00  UN   2.9  186420 153032  ls
    584    583   2  ffff88812e5da680  UN   2.1  180876 112452  ls
    587    586   4  ffff88812e5dcd00  UN   3.2  188796 170200  ls
    590    589   3  ffff88812f988000  UN   2.7  184836 141652  ls
    593    592   4  ffff88810c8dcd00  UN   2.8  185628 147248  ls
    596    595  14  ffff88810c8d8000  UN   2.9  186288 152092  ls
    599    598   2  ffff8881176b8000  UN   3.2  188796 169756  ls
    602    601   4  ffff88812fb32680  UN   3.8  352624 197372  ls
    605    604   2  ffff88812fb34d00  UN   3.2  188796 169760  ls
    608    607   0  ffff88810769a680  UN   2.5  183384 131228  ls
    611    610   4  ffff88810761cd00  UN   2.8  185892 149336  ls
    614    613   1  ffff8881076e4d00  UN   2.9  186024 150032  ls
    617    616  14  ffff888134ab4d00  UN   3.1  188004 164404  ls
    620    619   6  ffff888102244d00  UN   2.3  182196 121744  ls
    623    622  13  ffff88812fa88000  UN   2.8  185628 147200  ls
    626    625  13  ffff888107618000  UN   3.0  187344 159756  ls
    629    628   3  ffff888117762680  UN   2.6  184440 137968  ls

Nineteen of the processes have this stack:

PID: 571    TASK: ffff88812f9b2680  CPU: 5   COMMAND: "ls"
 #0 [ffffc900011dbce0] __schedule at ffffffff82186cdc
 #1 [ffffc900011dbd78] schedule at ffffffff82187222
 #2 [ffffc900011dbd88] rwsem_down_write_failed_killable at ffffffff8218a202
 #3 [ffffc900011dbe18] call_rwsem_down_write_failed_killable at ffffffff8217bab3
 #4 [ffffc900011dbe58] down_write_killable at ffffffff821893c9
 #5 [ffffc900011dbe68] iterate_dir at ffffffff813f52fc
 #6 [ffffc900011dbea8] ksys_getdents64 at ffffffff813f629c
 #7 [ffffc900011dbf18] __x64_sys_getdents64 at ffffffff813f636d
 #8 [ffffc900011dbf28] do_syscall_64 at ffffffff81004ad2
 #9 [ffffc900011dbf50] entry_SYSCALL_64_after_hwframe at ffffffff82200088

One process has the stack below (over time, a different process takes its turn inside the NFS request):

 PID: 574    TASK: ffff888117444d00  CPU: 12  COMMAND: "ls"
 #0 [ffffc900011eb860] __schedule at ffffffff82186cdc
 #1 [ffffc900011eb8f8] schedule at ffffffff82187222
 #2 [ffffc900011eb908] rpc_wait_bit_killable at ffffffff820c0d55
 #3 [ffffc900011eb928] __wait_on_bit at ffffffff82187652
 #4 [ffffc900011eb960] out_of_line_wait_on_bit at ffffffff82187722
 #5 [ffffc900011eb9a8] __rpc_execute at ffffffff820c22fa
 #6 [ffffc900011eb9e8] rpc_execute at ffffffff820c297e
 #7 [ffffc900011eba10] rpc_run_task at ffffffff820adc89
 #8 [ffffc900011eba50] nfs4_call_sync_sequence at ffffffff8156b97f
 #9 [ffffc900011ebad8] _nfs4_proc_readdir at ffffffff81570173
#10 [ffffc900011ebbd8] nfs4_proc_readdir at ffffffff81579dd6
#11 [ffffc900011ebc50] nfs_readdir_xdr_to_array at ffffffff8153f3b0
#12 [ffffc900011ebd50] nfs_readdir_filler at ffffffff8153f5cb
#13 [ffffc900011ebd70] do_read_cache_page at ffffffff812ec110
#14 [ffffc900011ebdc0] nfs_readdir at ffffffff8153f7eb
#15 [ffffc900011ebe68] iterate_dir at ffffffff813f522e
#16 [ffffc900011ebea8] ksys_getdents64 at ffffffff813f629c
#17 [ffffc900011ebf18] __x64_sys_getdents64 at ffffffff813f636d
#18 [ffffc900011ebf28] do_syscall_64 at ffffffff81004ad2
#19 [ffffc900011ebf50] entry_SYSCALL_64_after_hwframe at ffffffff82200088

3 Reproduction

One clue: the problematic directory holds 800,000-900,000 files of a few KB each, so reproduce that scenario.

Generate a large number of files on the NFS server:

i=0
while true
do
    echo 1234567890 > file${i} # don't create the files with touch; it is slower
    ((i++))
    echo ${i}
    if [ ${i} -eq 1000000 ]
    then
        break
    fi
done

Mount on the client:

mount -t nfs -o rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none 192.168.122.209:/s_test /mnt

Traversing the 1,000,000 files with ls is split across many READDIR requests, each returning 167 directory entries.
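
The same batching happens at the syscall boundary: ls fills a user buffer with getdents64 one chunk at a time (the 0x8000 in RDX in rsync's register dumps above is this buffer size). A minimal C sketch that counts batches and entries, assuming the hypothetical path /mnt/dir:

#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical path; any large NFS directory works */
    int fd = open("/mnt/dir", O_RDONLY | O_DIRECTORY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[0x8000]; /* same buffer size seen in RDX above */
    long calls = 0, entries = 0, n;

    /* each successful call returns one batch of directory entries */
    while ((n = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
        calls++;
        for (long off = 0; off < n; ) {
            struct dirent64 *d = (struct dirent64 *)(buf + off);
            entries++;
            off += d->d_reclen; /* advance to the next variable-length entry */
        }
    }

    printf("%ld entries in %ld getdents64 calls\n", entries, calls);
    close(fd);
    return 0;
}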

Unless otherwise noted, client and server run the same kernel version.

3.1 Results on the 4.19 kernel

3.1.1 100w (1,000,000) files, 20 processes

echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir | wc
1000002 9000011 50888949
real    0m37.319s

echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir | wc & # launch 20 of these back to back

An OOM occurred (likely because each ls of the million-entry directory holds on the order of 180 MB of RSS, per the ps outputs above, so 20 concurrent instances exhaust the VM's memory).

3.1.2 100w files, 10 processes

echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir | wc
1000002 9000011 50888949
real    0m42.712s

echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir | wc & # launch 10 back to back
real    5m30.489s

3.2 Results on the latest kernel

3.2.1 100w files, 10 processes

echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir | wc
1000002 9000011 50888949
real    1m21.104s

echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir | wc & # launch 10 back to back
real    1m26.528s

3.3 Results on the latest kernel with the lock switched back to a write lock

The patch used is 0001-nfs-add-old-iterate-directory-operation-for-debug.patch.

3.3.1 100w files, 10 processes

echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-100w | wc
1000002 9000011 50888949
real    1m34.346s

echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-100w | wc & # launch 10 back to back
real    5m26.562s

3.3.2 100w files, 20 processes

OOM occurred shortly after starting.

3.3.3 100w files, 15 processes

echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-100w | wc
1000002 9000011 50888949
real    1m34.346s

echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-100w | wc & # launch 15 back to back
real    7m50.229s

3.3.4 10w (100,000) files, 15 processes

echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-10w | wc
100002  900011 4988949
real    0m7.268s

echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-10w | wc & # launch 15 back to back
real    0m21.190s

3.3.5 1w (10,000) files, 15 processes

time ls -lh /mnt/dir-1w | wc
  10002   90011  488948
real    0m0.830s
real    0m0.838s
real    0m0.848s

echo 3 > /proc/sys/vm/drop_caches
time ls -lh /mnt/dir-1w | wc & # launch 15 back to back
real    0m0.981s
real    0m0.984s
real    0m1.038s
real    0m0.974s

3.3.6 Data analysis

#                100w        10w       1w
# 1 process:     94.346s     7.268s    0.848s
# 15 processes:  470.229s    21.190s   1s

With 15 processes, the elapsed times for 100w, 10w, and 1w files are in the ratio 470:21:1 (470/21 ≈ 22, 21/1 ≈ 21): each 10x increase in file count costs roughly 20x or more in time. Contrast section 3.2, where the shared lock keeps 10 concurrent processes (1m26.528s) close to the single-process time (1m21.104s).

4 Other commands (skip this; it led nowhere)

These steps contributed nothing to the result, so feel free to skip them, but they were part of the investigation and are recorded here.

Because the data volume on site is very large, long captures are impossible; instead, capture for a short window at a time, overwriting the file, and stop once the problem has reproduced:

while true
do
    # run the capture in the background, otherwise the loop blocks here
    sudo tcpdump --interface=<network interface> --buffer-size=20480 -w out.cap &
    pid=$(sudo pgrep tcpdump)
    sleep 60
    sudo kill -SIGINT ${pid}
    # [r]sync keeps grep from matching its own process line; ' D ' matches the state column
    if sudo ps aux | grep '[r]sync' | grep -q ' D '
    then
        break       # problem reproduced: keep out.cap and stop
    fi
    sudo rm out.cap # not reproduced: discard and capture again
done

+-------------+
|     test    |
|  nfsclient  |
|    (E)      |
+-------------+
      ^
      | 
     nfs
      |
      v        
+-------------+           +-------------+
|   huawei    |           |   huawei    |
| nfsserver1  |           | nfsserver2  |
|    (A)      |           |    (D)      |
+-------------+           +-------------+
      ^                          ^
      |                          |
     nfs                        nfs
      |                          |
      v                          v
+-------------+           +-------------+
|             |           |             |
| rsyncserver |<--rsync-->| rsyncclient |
|    (B)      |           |    (C)      |
+-------------+           +-------------+

Packet-capture procedure (proved infeasible: a long capture produces too much data and too large a file):

Procedure for checking process state:

5 Analysis via strace system-call tracing

+-------------+
|     test    |
|  nfsclient  |
|    (E)      |
+-------------+
      ^
      | 
     nfs
      |
      v        
+-------------+           +-------------+
|   huawei    |           |   huawei    |
| nfsserver1  |           | nfsserver2  |
|    (A)      |           |    (D)      |
+-------------+           +-------------+
      ^                          ^
      |                          |
     nfs                        nfs
      |                          |
      v                          v
+-------------+           +-------------+
|             |           |             |
| rsyncserver |<--rsync-->| rsyncclient |
|    (B)      |           |    (C)      |
+-------------+           +-------------+

Use strace to record the timestamp of each readdir (getdents64) request:
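
The exact invocation used on site was not recorded; a plausible form (-ttt prints epoch timestamps like the samples below, and -v prints the returned directory entries so a given file name can be located):

strace -ttt -v -e trace=getdents64 ls /mnt/dir > /dev/null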

nU2BlPA3xoSfYaPLVs1DqiZU4ANOtV # file name
1712802314.374988 # timestamp on machine B
1712802323.791969 # timestamp on machine E

Y348qeD3dxaeJOsyiTNON3qCtEE8yn
1712802401.142976-1712802314.374988=86.767988
1712802325.169445-1712802323.791969=1.377476

xxwdIL00iCzXviXcoCaNCvrvIyAkeP
1712802509.642010-1712802314.374988=195.267022
1712802326.528755-1712802323.791969=2.736786

L8mSFdfuAtzgWAp646sVy73uTMgbaH
1712802607.192381-1712802314.374988=292.817393
1712802328.083538-1712802323.791969=4.291569

9wHPHGNObYKMRERIhHuRaKiZW9Jer4
1712802707.587591-1712802314.374988=393.212603
1712802329.226065-1712802323.791969=5.434096

Z9Jj6YvkzMCt0lymymEtZhmIAFEDMH
1712802803.228566-1712802314.374988=488.853578
1712802330.254807-1712802323.791969=6.462838

VLww0v31UJplVOvh1bYlQ2Wwz4OfiI
1712802824.954462-1712802314.374988=510.579474
1712802330.369828-1712802323.791969=6.577859

Sampling a few of these points shows that ls runs nearly 100x slower on machine B than on machine E.

6 Code analysis

Because nfs does not implement the iterate_shared method, iterate_dir falls back to taking the write lock via down_write_killable, which is mutually exclusive with both the write and the read lock of every other process. Under heavy concurrency, each pass through iterate_dir therefore waits a very long time for the write lock. When inspected with ps, these processes are almost always asleep inside down_write_killable or __rpc_execute, so all of them show up in the D state.

getdents64
  ksys_getdents64
    iterate_dir
      // nfs implements iterate, not iterate_shared, so this test fails
      if (file->f_op->iterate_shared)
      shared == false
      // the write lock is taken here, so no concurrency with other processes;
      // the long wait for this lock is one reason ps always shows the D state
      down_write_killable(&inode->i_rwsem)
        call_rwsem_down_write_failed_killable
          rwsem_down_write_failed_killable
            schedule
              __schedule
      nfs_readdir
        do_read_cache_page
          nfs_readdir_filler
            nfs_readdir_xdr_to_array
              nfs4_proc_readdir
                _nfs4_proc_readdir
                  nfs4_call_sync_sequence
                    rpc_run_task
                      __rpc_execute
                        out_of_line_wait_on_bit
                          __wait_on_bit
                            rpc_wait_bit_killable
                              schedule
                                // most of each iteration is spent waiting for the
                                // nfs server's reply, which is why ps also shows
                                // this process in the D state
                                __schedule
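
The branch in question, abridged from fs/readdir.c in 4.19 (permission checks and fsnotify calls trimmed):

int iterate_dir(struct file *file, struct dir_context *ctx)
{
        struct inode *inode = file_inode(file);
        bool shared = false;
        int res = -ENOTDIR;

        if (file->f_op->iterate_shared)
                shared = true;          /* nfs in 4.19 only provides ->iterate */
        else if (!file->f_op->iterate)
                goto out;

        if (shared)
                res = down_read_killable(&inode->i_rwsem);  /* readers run in parallel */
        else
                res = down_write_killable(&inode->i_rwsem); /* fully serialized */
        if (res)
                goto out;

        res = -ENOENT;
        if (!IS_DEADDIR(inode)) {
                ctx->pos = file->f_pos;
                if (shared)
                        res = file->f_op->iterate_shared(file, ctx);
                else
                        res = file->f_op->iterate(file, ctx);
                file->f_pos = ctx->pos;
        }
        if (shared)
                inode_unlock_shared(inode);
        else
                inode_unlock(inode);
out:
        return res;
}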

Fix: commit 93a6ab7b691f ("NFS: Switch readdir to using iterate_shared()"); the legacy interface was later removed entirely by commit 3e3271549670 ("vfs: get rid of old '->iterate' directory operation").
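
The essence of the fix is to publish nfs_readdir through .iterate_shared instead of .iterate, so that iterate_dir takes i_rwsem for read (a sketch of the direction only; the real patch also has to make nfs_readdir itself safe to run concurrently under a shared lock):

/* fs/nfs/dir.c */
const struct file_operations nfs_dir_operations = {
        .llseek         = nfs_llseek_dir,
        .read           = generic_read_dir,
-       .iterate        = nfs_readdir,
+       .iterate_shared = nfs_readdir,  /* readers of one directory now run in parallel */
        .open           = nfs_opendir,
        .release        = nfs_closedir,
        .fsync          = nfs_fsync_dir,
};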

7 Analysis conclusion

+-------------+           +-------------+
|   huawei    |           |   huawei    |
| nfsserver1  |           | nfsserver2  |
|    (A)      |           |    (D)      |
+-------------+           +-------------+
      ^                          ^
      |                          |
     nfs                        nfs
      |                          |
      v                          v
+-------------+           +-------------+
|             |           |             |
| rsyncserver |<--rsync-->| rsyncclient |
|    (B)      |           |    (C)      |
+-------------+           +-------------+

Many processes on rsyncserver operate on a directory containing on the order of a million files. Because directory traversal cannot run concurrently, each process takes a very long time, and ps shows many of them stuck in the D state.

8 Solution

Recommendation: increase the interval between rsync synchronization runs. (The root fix is the iterate_shared conversion referenced in section 6.)
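
For example, a hypothetical crontab entry that runs the sync every six hours instead of back to back, reusing the command from section 1:

0 */6 * * * rsync -avz rsuser@172.23.8.123::CDISDATA_FS01 /home/cdis/data/ --port 10000 --password-file=/etc/rsync_cdis.password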