4.19: NFS cannot be remounted after a lazy umount

1 Environment

uname -a
Linux localhost.localdomain 4.19.90-24.4.v2101.ky10.x86_64 #1 SMP Mon May 24 12:14:55 CST 2021 x86_64 x86_64 x86_64 GNU/Linux

mount | grep nfs
200.22.252.66:/data0/media on /data0/media type nfs4 (rw,relatime,sync,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=200.22.252.67,local_lock=none,addr=200.22.252.66)

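2 Problem description

On the client, df -h hangs and the mount can be neither read nor written. After umount -l <mount point>, the share cannot be mounted again; it only recovers after the client operating system is rebooted.

Client-side error: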

Nov 14 13:21:32 localhost kernel: [2762097.294397] nfs: server 200.22.252.66 not responding, still trying

Server-side errors (not yet confirmed to be related):

Nov 14 13:02:17 localhost kernel: [2761217.103877] nfsd4_validate_stateid: 26 callbacks suppressed
...
Nov 14 13:02:17 localhost kernel: [2761217.104230] NFSD: client 200.22.252.69 testing state ID with incorrect client ID

3 Code analysis

In the linux-4.19.y branch of the community stable repository, running git log -L:nfsd4_validate_stateid:fs/nfsd/nfs4state.c shows that nfsd4_validate_stateid has since been changed by the following patches:

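Without these two patches, when the server handles a test stateid operation and finds that the stateid does not match, it returns NFS4ERR_BAD_STATEID to the client. The client then does not issue free stateid, and nfs41_test_and_free_expired_stateid returns -NFS4ERR_BAD_STATEID. The client's subsequent handling, however, is essentially the same as in the case where free stateid is issued and -NFS4ERR_EXPIRED is returned, so it is still unclear why the client keeps sending test stateid.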

Server-side code:

nfsd4_test_stateid
  nfsd4_validate_stateid
    if (!same_clid(&stateid->si_opaque.so_clid, &cl->cl_clientid))
      pr_warn_ratelimited("NFSD: client %s testing state ID with incorrect client ID\n", addr_str);
        printk_ratelimited
          __ratelimit
            ___ratelimit
              printk_deferred(KERN_WARNING "%s: %d callbacks suppressed\n"
      return nfserr_bad_stateid
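
The rate-limiting path above is what produces the "callbacks suppressed" line in the server log. As a rough userspace sketch (not the kernel code itself), the window-and-burst logic of ___ratelimit can be modeled as below. The struct and function names are hypothetical, time is an abstract tick counter supplied by the caller, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced model of the kernel's rate-limit state. */
struct ratelimit_model {
    long interval; /* window length, in abstract ticks */
    int burst;     /* messages allowed per window */
    int printed;   /* messages let through in the current window */
    int missed;    /* messages suppressed since the last report */
    long begin;    /* start tick of the current window, 0 = not started */
};

/*
 * Returns 1 if the caller may print its message, 0 if it is suppressed.
 * When a window has expired and messages were suppressed in it, *reported
 * receives that count (the "%d callbacks suppressed" value); otherwise 0.
 * Callers must pass now >= 1, because 0 is used as "not started".
 */
static int ratelimit_model_check(struct ratelimit_model *rs, long now,
                                 int *reported)
{
    assert(rs != NULL && reported != NULL);

    *reported = 0;
    if (!rs->begin)
        rs->begin = now;

    /* Window expired: report what was suppressed and start a new one. */
    if (now > rs->begin + rs->interval) {
        if (rs->missed) {
            *reported = rs->missed;
            rs->missed = 0;
        }
        rs->begin = now;
        rs->printed = 0;
    }

    /* Inside the window: allow up to burst messages, count the rest. */
    if (rs->burst > rs->printed) {
        rs->printed++;
        return 1;
    }
    rs->missed++;
    return 0;
}
```

Assuming a window of 5000 ticks and a burst of 10 (chosen here to mirror the usual kernel defaults of 5 seconds and 10 messages), 36 calls inside one window let 10 messages through and count 26 as missed. The first call after the window expires reports those 26. Read that way, "26 callbacks suppressed" suggests the warning was hit at least 36 times before the report was flushed.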

Client-side code:

nfs41_test_and_free_expired_stateid
  nfs41_test_stateid // return -NFS4ERR_BAD_STATEID
    _nfs41_test_stateid
      .rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_TEST_STATEID]
      status = nfs4_call_sync_sequence != NFS_OK
      return -NFS4ERR_BAD_STATEID // -res.status, 10025
  // free stateid is called only for -NFS4ERR_EXPIRED, -NFS4ERR_ADMIN_REVOKED, and -NFS4ERR_DELEG_REVOKED
  // in the failing scenario, free stateid is therefore never called
  nfs41_free_stateid
    .rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_FREE_STATEID]
  return -NFS4ERR_EXPIRED
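
The decision described in the comments above can be modeled in a few lines. This is a userspace sketch, not the kernel function: model_test_and_free and struct test_and_free_result are hypothetical names, and the status argument stands in for whatever nfs41_test_stateid returned. The numeric codes are the NFSv4.1 values from RFC 5661 (10025 for NFS4ERR_BAD_STATEID matches the trace).

```c
#include <assert.h>

/* NFSv4 status codes, numeric values as defined in RFC 5661. */
enum {
    NFS4_OK               = 0,
    NFS4ERR_EXPIRED       = 10011,
    NFS4ERR_BAD_STATEID   = 10025,
    NFS4ERR_ADMIN_REVOKED = 10047,
    NFS4ERR_DELEG_REVOKED = 10087,
};

/* Hypothetical record of what the modeled function did. */
struct test_and_free_result {
    int status;      /* value handed back to the caller */
    int free_called; /* 1 if FREE_STATEID would have been sent */
};

/*
 * Model of the branch taken after TEST_STATEID returns.
 * test_status follows the kernel convention: 0 or a negated NFS4ERR code.
 */
static struct test_and_free_result model_test_and_free(int test_status)
{
    struct test_and_free_result r = { test_status, 0 };

    assert(test_status <= 0);

    switch (test_status) {
    case -NFS4ERR_EXPIRED:
    case -NFS4ERR_ADMIN_REVOKED:
    case -NFS4ERR_DELEG_REVOKED:
        /* Revoked or expired state: acknowledge it with FREE_STATEID. */
        r.free_called = 1;
        r.status = -NFS4ERR_EXPIRED;
        break;
    default:
        /* Anything else, including -NFS4ERR_BAD_STATEID, passes through. */
        break;
    }
    return r;
}
```

In this model, -NFS4ERR_BAD_STATEID comes back unchanged with free_called left at 0, while any of the three revocation errors sets free_called to 1 and is returned as -NFS4ERR_EXPIRED. That is the difference between the failing scenario and the -NFS4ERR_EXPIRED path discussed in the analysis.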

Client-side code paths for the open stateid:

nfs4_state_manager
  nfs4_do_reclaim
    nfs4_reclaim_open_state
      .recover_open   = nfs41_open_expired,
        nfs41_open_expired
          nfs41_check_open_stateid
            nfs41_test_and_free_expired_stateid

update_open_stateid
  nfs4_test_and_free_stateid
    .test_and_free_expired = nfs41_test_and_free_expired_stateid, // nfs_v4_1_minor_ops
      nfs41_test_and_free_expired_stateid