LinuxQuestions.org (/questions/)
-   Linux - Software (https://www.linuxquestions.org/questions/linux-software-2/)
-   -   Mount bind point to incorrect NVMe device after power off/on device. (https://www.linuxquestions.org/questions/linux-software-2/mount-bind-point-to-incorrect-nvme-device-after-power-off-on-device-4175727319/)

fanfanfan 07-25-2023 01:26 AM

Mount bind point to incorrect NVMe device after power off/on device.
 
I am developing an all-flash storage application. I found that a bind mount behaves strangely when an NVMe device is powered off and on.

Bind-mount the partition /dev/nvme10n1p1 to /mnt/10n1p1:

Code:

# mount --bind /dev/nvme10n1p1 /mnt/10n1p1
# lsblk /mnt/10n1p1
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme10n1p1 259:11  0  3.6T  0 part

# stat /dev/nvme10n1p1
  File: /dev/nvme10n1p1
  Size: 0              Blocks: 0          IO Block: 4096  block special file
Device: 5h/5d  Inode: 23620      Links: 1    Device type: 103,b

# stat /mnt/10n1p1
  File: /mnt/10n1p1
  Size: 0              Blocks: 0          IO Block: 4096  block special file
Device: 5h/5d  Inode: 23620      Links: 1    Device type: 103,b
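
As a side note on reading these outputs: `stat` prints "Device type" as a hexadecimal major,minor pair, so `103,b` above is 259:11 in decimal, which is exactly the MAJ:MIN pair that `lsblk` shows. A small sketch to do the conversion (the `devnum` helper name is mine, not from the thread):

```shell
#!/bin/sh
# Sketch: print a device node's major:minor in decimal.
# stat's %t/%T print the major/minor in hex, so "103,b" in the
# "Device type" field above is 259:11 decimal, matching lsblk.
devnum() {
    stat -c '%t %T' "$1" | {
        read -r maj min
        printf '%d:%d\n' "0x$maj" "0x$min"
    }
}

# On the setup above, devnum /dev/nvme10n1p1 would print 259:11.
```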


Power the NVMe device off and on within a short time to simulate a hot-plug or a power surge.

Code:

# ls -lat /sys/block | grep "nvme10n1"
.../0000:be:00.0/nvme/nvme2/nvme10n1
# lspci -vmms 0000:be:00.0
...PhySlot:        168
# PHYSLOT=168
# date && echo 0|sudo tee /sys/bus/pci/slots/${PHYSLOT}/power
# sleep 5
# date && echo 1|sudo tee /sys/bus/pci/slots/${PHYSLOT}/power
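
The two `tee` writes above can be wrapped in a small helper. This is a sketch, not from the thread: `SLOTS_DIR` is a made-up override so the function can be dry-run against a fake directory tree; on a real system it is `/sys/bus/pci/slots`, the slot number is the `PhySlot` value reported by `lspci` (168 here), and the writes require root.

```shell
#!/bin/sh
# Sketch: power-cycle a PCI slot via sysfs, as done manually above.
# SLOTS_DIR is a hypothetical knob for dry runs; the real path is
# /sys/bus/pci/slots.
SLOTS_DIR=${SLOTS_DIR:-/sys/bus/pci/slots}

power_cycle_slot() {
    slot=$1
    delay=${2:-5}   # seconds to keep the slot powered off
    echo 0 > "$SLOTS_DIR/$slot/power"
    sleep "$delay"
    echo 1 > "$SLOTS_DIR/$slot/power"
}

# Usage on the setup above:
#   power_cycle_slot 168
```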

After the power off/on, the bind mount points to the new drive that has just been powered on.
Code:

# lsblk /mnt/10n1p1
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme30n2    259:11  0  3.6T  0 disk
└─nvme30n2p1 259:51  0  3.6T  0 part

# stat /dev/nvme30n2
  File: /dev/nvme30n2
  Size: 0              Blocks: 0          IO Block: 4096  block special file
Device: 5h/5d  Inode: 24836      Links: 1    Device type: 103,b

Testing further, I found that the bind mount can even end up pointing to a different drive, if that drive is powered on shortly after the original drive is powered off.

Code:

# mount --bind  /dev/nvme0n1p1 /mnt/0n1
# lsblk /mnt/0n1
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme0n1p1 259:44  0  3.6T  0 part

# nvme id-ctrl /dev/nvme0n1 | grep sn
sn        : PHLJ043200234P0DGN

# nvme id-ctrl /dev/nvme1n1 | grep sn
sn        : PHLJ043105AU4P0DGN

# PHYSLOT_0=195
# PHYSLOT_1=194
# date && echo 0|sudo tee /sys/bus/pci/slots/${PHYSLOT_1}/power
# sleep 5
# date && echo 0|sudo tee /sys/bus/pci/slots/${PHYSLOT_0}/power
# sleep 5
# date && echo 1|sudo tee /sys/bus/pci/slots/${PHYSLOT_1}/power
# sleep 5
# date && echo 1|sudo tee /sys/bus/pci/slots/${PHYSLOT_0}/power

# lsblk /mnt/0n1
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme31n2p1 259:44  0  3.6T  0 part

# nvme id-ctrl /dev/nvme31n2 | grep sn
sn        : PHLJ043105AU4P0DGN

# nvme id-ctrl /dev/nvme32n2 | grep sn
sn        : PHLJ043200234P0DGN

I noticed that after the power off/on, the mount point's inode number differs from the new device node's inode number; however, the two inodes share the same minor number. I think this is why the mount point can still access the new drive, which causes data corruption. My guess is that the mount point holds a reference to the inode of the powered-off drive, so that inode is not destroyed; when the new drive is powered on, it takes the reclaimed minor number, which happens to be the original drive's minor number.
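
The check implied by this observation can be scripted: flag the case where the mount point and the current `/dev` node share a minor number but have different inodes. A hedged sketch; the `stale_bind` function name is my own invention, not from the thread:

```shell
#!/bin/sh
# Sketch: flag the exact symptom described above -- the bind-mount
# point and the current /dev node share a minor number (%T) but are
# different inodes (%i), i.e. the minor was reclaimed by a new device.
stale_bind() {
    m1=$(stat -c '%T' "$1"); i1=$(stat -c '%i' "$1")
    m2=$(stat -c '%T' "$2"); i2=$(stat -c '%i' "$2")
    if [ "$m1" = "$m2" ] && [ "$i1" != "$i2" ]; then
        echo "STALE: minor 0x$m1 reused by inode $i2 (was $i1)"
        return 1
    fi
    echo "ok"
}

# Usage on the setup above:
#   stale_bind /mnt/0n1 /dev/nvme0n1p1
```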

I'm not sure whether this behavior is expected, or whether it's a limitation or a bug of bind mounts.
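
Until the root cause is settled, one defensive workaround is to record the controller serial at mount time and re-check it before trusting the bind mount after any hot-plug event. A sketch: the `check_serial` name and the `NVME_SYS` override are my own, but `/sys/class/nvme/<ctrl>/serial` is the kernel's sysfs attribute carrying the same serial that `nvme id-ctrl` reports above.

```shell
#!/bin/sh
# Sketch: compare a controller's current serial against the one
# recorded when the bind mount was created. NVME_SYS is a made-up
# override for dry runs; the real path is /sys/class/nvme.
NVME_SYS=${NVME_SYS:-/sys/class/nvme}

check_serial() {
    ctrl=$1       # e.g. nvme0
    expected=$2   # serial recorded at mount time
    actual=$(tr -d ' ' < "$NVME_SYS/$ctrl/serial") || return 2
    if [ "$actual" = "$expected" ]; then
        echo "serial ok"
    else
        echo "serial changed: $actual (expected $expected)"
        return 1
    fi
}

# Usage on the setup above:
#   check_serial nvme0 PHLJ043200234P0DGN
```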

fanfanfan 04-24-2024 04:08 AM

It turns out this is a Linux kernel bdev lifecycle bug; it was fixed in kernel 5.15.
Related patch: https://lore.kernel.org/all/20210816...-2-hch@lst.de/

syg00 04-24-2024 04:58 AM

Nice find - but that's a pretty old kernel ... do you expect your user(s) to be that far behind the current release levels?
