I am developing an all-flash storage application, and I found that a bind mount of an NVMe device node behaves strangely when the device is powered off and back on.
First, bind-mount the partition /dev/nvme10n1p1 to /mnt/10n1p1.
Code:
# mount --bind /dev/nvme10n1p1 /mnt/10n1p1
# lsblk /mnt/10n1p1
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme10n1p1 259:11 0 3.6T 0 part
# stat /dev/nvme10n1p1
File: /dev/nvme10n1p1
Size: 0 Blocks: 0 IO Block: 4096 block special file
Device: 5h/5d Inode: 23620 Links: 1 Device type: 103,b
# stat /mnt/10n1
File: /mnt/10n1
Size: 0 Blocks: 0 IO Block: 4096 block special file
Device: 5h/5d Inode: 23620 Links: 1 Device type: 103,b
Next, power the NVMe device off and back on within a short time to simulate a hot-plug event or a power surge.
Code:
# ls -lat /sys/block | grep "nvme10n1"
.../0000:be:00.0/nvme/nvme2/nvme10n1
# lspci -vmms 0000:be:00.0
...PhySlot: 168
# date && echo 0|sudo tee /sys/bus/pci/slots/${PHYSLOT}/power
# sleep 5
# date && echo 1|sudo tee /sys/bus/pci/slots/${PHYSLOT}/power
After the power cycle, the bind mount points to the new device that was just powered on.
Code:
# lsblk /mnt/10n1
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme30n2 259:11 0 3.6T 0 disk
└─nvme30n2p1 259:51 0 3.6T 0 part
# stat /dev/nvme30n2
File: /dev/nvme30n2
Size: 0 Blocks: 0 IO Block: 4096 block special file
Device: 5h/5d Inode: 24836 Links: 1 Device type: 103,b
With further testing, I found that the bind mount can even point to a different drive, if that drive is powered on shortly after the original drive is powered off.
Code:
# mount --bind /dev/nvme0n1p1 /mnt/0n1
# lsblk 0n1
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme0n1p1 259:44 0 3.6T 0 part
# nvme id-ctrl /dev/nvme0n1 | grep sn
sn : PHLJ043200234P0DGN
# nvme id-ctrl /dev/nvme1n1 | grep sn
sn : PHLJ043105AU4P0DGN
# PHYSLOT_0=195
# PHYSLOT_1=194
# date && echo 0|sudo tee /sys/bus/pci/slots/${PHYSLOT_1}/power
# sleep 5
# date && echo 0|sudo tee /sys/bus/pci/slots/${PHYSLOT_0}/power
# sleep 5
# date && echo 1|sudo tee /sys/bus/pci/slots/${PHYSLOT_1}/power
# sleep 5
# date && echo 1|sudo tee /sys/bus/pci/slots/${PHYSLOT_0}/power
# lsblk /mnt/0n1
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme31n2p1 259:44 0 3.6T 0 part
# nvme id-ctrl /dev/nvme31n2 | grep sn
sn : PHLJ043105AU4P0DGN
# nvme id-ctrl /dev/nvme32n2 | grep sn
sn : PHLJ043200234P0DGN
I noticed that after the power cycle, the mount point's inode number differs from the inode number of the new drive's device node, but the two share the same minor number. I think this is why the mount point can access the new drive, which can cause data corruption. My guess is that the mount point holds a reference to the inode of the powered-off drive, so that inode is never destroyed. The new drive is then powered on and takes the reclaimed minor number, which happens to be the original drive's minor number.
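To make that comparison repeatable, here is a minimal sketch of two helper functions that check whether two paths refer to the same inode and whether they carry the same device number. The function names `same_inode` and `same_devnum` are my own, not part of any tool, and the sketch assumes a `stat` that supports `-c` with the `%d`, `%i`, `%t` and `%T` format specifiers (GNU coreutils does). `%t:%T` prints the major and minor numbers in hex, like the `Device type: 103,b` field in the `stat` output above.

```shell
#!/bin/sh
# Hypothetical helpers for comparing a bind-mount point with a device node.
# Assumes stat(1) supports -c with %d, %i, %t and %T (GNU coreutils does).

# Succeeds if both paths resolve to the same inode on the same filesystem.
same_inode() {
    [ "$(stat -c '%d:%i' "$1")" = "$(stat -c '%d:%i' "$2")" ]
}

# Succeeds if both device special files carry the same major:minor pair.
# %t and %T are the major and minor device numbers in hex.
same_devnum() {
    [ "$(stat -c '%t:%T' "$1")" = "$(stat -c '%t:%T' "$2")" ]
}

# Prints "inode major:minor" for one path, for logging before and after
# a power cycle.
node_id() {
    stat -c '%i %t:%T' "$1"
}
```

If my guess is right, then after the power cycle `same_devnum /mnt/10n1 /dev/nvme30n2` should succeed while `same_inode /mnt/10n1 /dev/nvme30n2` should fail, which would match what the `stat` output shows.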
I'm not sure whether this behavior is expected, or whether it is a limitation or a bug of bind mounts.