grokking hard

code smarter, not harder

MongoDB failed to lock file on OpenShift using NFS file system

Posted at — 2021-Sep-27

Summary

This page documents the investigation of an issue: a MongoDB Pod failed to lock mongod.lock on OpenShift/OKDv4.

The root cause was that NFSv3 was used for the mounts between the OKDv4 Worker Nodes and the NFS server. MongoDB internally uses flock/fcntl to lock the mongod.lock file descriptor. On a network file system like NFSv3, locking support is required from both server and client; otherwise, a “No locks available” error is the outcome. NFSv4, on the other hand, comes with built-in locking support specified in the protocol. Switching the mounts to NFSv4 solved the problem, but there was more to it than just that.

Timeline

On 2021-09-22, a colleague reported that deploying MongoDB into our OKDv4 failed because of the error:

...exception in initAndListen: DBPathInUse: Unable to lock the lock file /var/lib/mongodb/data/mongod.lock (No locks available)...

At first sight, we thought that the mongod process did not have permission to access the file. This was quickly dismissed because the process had created the file itself.

To verify this, I had to read the source code of MongoDB (I had investigated a similar issue in the past) to see what happened at the source.

Searching the code base of MongoDB at tag r4.0.5, I could pinpoint exactly where the error was raised, in the module storage_engine_lock_file_posix.cpp:

165 #if !defined(__sun)
166     int ret = ::flock(lockFile, LOCK_EX | LOCK_NB);
167 #else
168     struct flock fileLockInfo = {0};
169     fileLockInfo.l_type = F_WRLCK;
170     fileLockInfo.l_whence = SEEK_SET;
171     int ret = ::fcntl(lockFile, F_SETLK, &fileLockInfo);
172 #endif  // !defined(__sun)
173     if (ret != 0) {
174         int errorcode = errno;
175         ::close(lockFile);
176         return Status(ErrorCodes::DBPathInUse,
177                       str::stream() << "Unable to lock the lock file: " << _filespec << " ("
178                                     << errnoWithDescription(errorcode)
179                                     << ")."
180                                     << " Another mongod instance is already running on the "
181                                     << _dbpath
182                                     << " directory")

On Linux (the non-Solaris branch), line 166 attempted to lock the mongod.lock file descriptor using flock(2); line 171 is the fcntl(2) equivalent for Solaris. This system call to the kernel failed. Lines 176-182 returned the error. The errnoWithDescription(errorcode) at line 178 translated the global variable errno (set by the failed call) into a message. Depending on the implementation, the text might vary; in our case, it was No locks available.
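The locking behaviour can be observed in isolation with the flock(1) utility, which wraps the same flock(2) call. A minimal sketch (file names are illustrative): the second non-blocking attempt is denied while the first holder keeps the lock.

```shell
# Sketch of the lock mongod takes on Linux: an exclusive,
# non-blocking lock on the lock file, here via flock(1).
lockfile=$(mktemp)

exec 9>"$lockfile"                 # keep the lock file open on fd 9
flock -xn 9 && echo "first lock: acquired"

# A second open file description cannot take the lock while fd 9
# holds it: flock(2) fails with EWOULDBLOCK locally. On an NFSv3
# mount without server/client lock support, the same call instead
# fails with ENOLCK, i.e. "No locks available".
flock -xn "$lockfile" -c true || echo "second lock: denied"

exec 9>&-                          # close fd 9, releasing the lock
rm -f "$lockfile"
```

On a local file system this prints that the second attempt was denied; the NFSv3 failure mode is different only in the errno reported.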

Googling “No locks available” together with flock/fcntl showed that the issue was related to the file system, especially NFSv3. This was a great hint, as we were using NFS for our PersistentVolumes. However, the problem did not exist when we deployed MongoDB on our OKDv3, which also used NFS. The only difference between OKDv3 and OKDv4 was that OKDv4 used Fedora CoreOS, where I didn’t know how the NFS client was set up.

NOTE: Reading further, I noticed that for file locking to work over NFSv3, the side-band NLM (Network Lock Manager) protocol MUST be supported by both server and client.

I first checked how it worked on OKDv3 by logging into the node:

node01$> nfsstat -m | grep -oE 'vers=[^,]*' | uniq
vers=4.1

OK, so for OKDv3, only NFSv4 was used.

Then, I had to log into the OKDv4 Worker Node to find out how the NFS mounts were done.

[core@ip-172-31-141-195 ~]$ nfsstat -m | grep -oE 'vers=[^,]*' | uniq
vers=3

OK, so only NFSv3 was used on OKDv4. WHY?

Then I realized the second big difference between OKDv3 and OKDv4: the NFS PersistentVolumes were provisioned automatically by an Operator called nfs-subdir-external-provisioner. This Operator registers a StorageClass named managed-nfs-storage (configurable). Any PersistentVolumeClaim that uses managed-nfs-storage is automatically bound to a newly created PersistentVolume.
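For context, a claim against that StorageClass needs nothing NFS-specific; a minimal sketch (the claim name is illustrative):

```yaml
# Hypothetical PVC: binding happens automatically because
# storageClassName matches the class the provisioner registered.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongodb-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-nfs-storage
  resources:
    requests:
      storage: 5Gi
```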

Could it be that this Operator used only NFSv3 by default? Could we force OKDv4 to use a particular NFS version for a PersistentVolume? It turned out to be possible by using mountOptions like this:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0003
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: slow
  mountOptions:
    - hard
    - nfsvers=4.1
  nfs:
    path: /tmp
    server: 172.17.0.2

But since all PersistentVolumes were created by the Operator, I needed to find out how to inject this mountOptions field. Luckily, the Helm chart allowed specifying it.

helm install -n default `# It has to be the namespace default` \
  nfs4-subdir-external-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
    --set nfs.server=nfs.example.org \
    --set nfs.path=/export/okd4_volume \
    --set 'nfs.mountOptions[0]'='nfsvers=4.1' \
    --set storageClass.name=managed-nfs4-storage \
    --set storageClass.pathPattern='${.PVC.namespace}-${.PVC.name}' \
    --set storageClass.reclaimPolicy=Retain

NOTE: It took a little investigation to figure out how to pass values for a Helm value that was an Array instead of a Dict.

Once this was deployed, I discovered that the Operator failed to bootstrap with a mount failure. Reproducing the mount manually showed the same error:

$> sudo mount -t nfs -o hard,vers=4.1 nfs.example.org:/export/okd4_volume /tmp/test_mount_nfs/
mount.nfs: mounting nfs.example.org:/export/okd4_volume failed, reason given by server: No such file or directory

That was strange, because I knew for sure that the directory existed and was exported. So I checked the NFS server to see how the directories were exported; it looked something like this:

# file:
# /etc/exports

/export/shared_volume/ \
       *(fsid=0,subtree_check,async)

/export/shared_volume/public \
       *(fsid=1,rw,nohide,insecure,no_subtree_check,async)

/export/shared_volume/something_else \
       *(fsid=2,rw,nohide,insecure,no_subtree_check,async)

# lots of other directories omitted

/export/okd4_volume 172.31.0.0/16(fsid=71,rw,nohide,insecure,no_subtree_check,async)

Looked normal to me though! However, digging through Google turned up a valuable piece of information: NFSv4 treats fsid specially, as explained in the answer to the StackOverflow question “Cannot mount nfs4 share: no such file or directory”.

So in the /etc/exports above, the path /export/shared_volume was defined with fsid=0, which NFSv4 treats as the root of the server’s pseudo filesystem: every NFSv4 mount path is resolved relative to it. So when the path /export/okd4_volume was mounted using NFSv4, the server rejected it because it sat outside that pseudo root. Another great hint was that the mount path MUST NOT include the fsid=0 prefix, in this case /export.

So I made the following change:

# fixed
# /etc/exports

/export \
       *(fsid=0,subtree_check,async)

/export/shared_volume/ \
       *(fsid=10,subtree_check,async)

/export/shared_volume/public \
       *(fsid=1,rw,nohide,insecure,no_subtree_check,async)

/export/shared_volume/something_else \
       *(fsid=2,rw,nohide,insecure,no_subtree_check,async)

# lots of other directories omitted

/export/okd4_volume 172.31.0.0/16(fsid=71,rw,nohide,insecure,no_subtree_check,async)

There was one main change: adding /export and making it fsid=0 (the existing fsid values were adjusted accordingly). After that, the following command worked:

NOTE: Re-apply the configuration using sudo exportfs -a.

# Note that the `/export` was removed
$> sudo mount -t nfs -o hard,vers=4.1 nfs.example.org:/okd4_volume /tmp/test_mount_nfs/
# success

But hey, how did NFSv3 get chosen in the first place? It turned out that the NFS client tried several combinations of options until one succeeded.

# No NFS version specified; the original /export/okd4_volume path is used
$> sudo mount -t nfs nfs.example.org:/export/okd4_volume /tmp/test_mount_nfs
mount.nfs: timeout set for Mon Sep 27 13:08:04 2021
mount.nfs: trying text-based options 'vers=4.2,addr=172.31.137.216,clientaddr=172.31.141.195'
mount.nfs: mount(2): No such file or directory
mount.nfs: trying text-based options 'addr=172.31.137.216'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 172.31.137.216 prog 100003 vers 3 prot TCP port 2049
mount.nfs: prog 100005, trying vers=3, prot=17
mount.nfs: portmap query retrying: RPC: Timed out
mount.nfs: prog 100005, trying vers=3, prot=6
mount.nfs: trying 172.31.137.216 prog 100005 vers 3 prot TCP port 4002

Without any options, the NFS client first tried vers=4.2, then fell back to vers=3, cycling through protocols and ports until it found a working combination.

With this final finding about the path, I just needed to adjust the Helm command deploying the nfs-subdir-external-provisioner Operator to use /okd4_volume instead of /export/okd4_volume.

Deploying a new MongoDB Pod showed that mongod could successfully lock the file again. Phew!