Linux - Server: This forum is for the discussion of Linux software used in a server-related context.
I have a multi-function server, an old HP Z600 engineering workstation. 12 years old but still pretty fast with dual Xeons, 12 cores 24 threads, and 48GB ECC memory. It has 4 large (mechanical) hard drives plus an NVMe drive where Linux Mint 20.2 (desktop version) resides & runs. Mostly it is a media server(1) with 1 large drive for pictures and graphics, 1 for music, 1 for videos, and 1 for data files, applications, source code, and other misc. files. Linux itself is rock solid on this machine except for one problem… The videos drive keeps getting corrupted! This has been an ongoing problem for at least a year. I have pretty well eliminated the possibility of a hardware issue, as this is the 4th drive overall, and the 3rd brand new drive, including 2 different WD models and a Seagate. I’ve also moved it to different SATA ports and to different bays. It has plenty of airflow, and I’ve checked the troublesome drive as well as the others and other system components with both an infrared temperature gun and a candy thermometer. It was originally formatted btrfs but is now formatted ext4. The other drives all work perfectly with no corruption. A drive that was previously the video drive (which kept getting corrupted) is now the misc. files drive (which never gets corrupted).
The only 2 things I can think of that make this drive any different from the others is that 1) it has comparatively fewer files which are comparatively larger(2), and 2) Plex media server is running on that drive. Plex seemed like the obvious culprit so I stopped the service on several occasions for up to 10 days and even uninstalled it but that made no difference. Other than update scans by Plex, the video drive sees very little use. The music drive is in near constant use without problems.
The corruption problem is very inconsistent. It may work for several days, it may work for only a few hours. About 2/3 of the time the drive is still accessible but all the folders except $RECYCLE.BIN have disappeared. Sometimes the drive unmounts itself. Usually (but not always) re-mounting it will cause the other folders to reappear, and likewise manually unmounting and re-mounting usually makes the other folders reappear. Sometimes though that does not work but restarting the server does. A couple of times I had to use fsck, and a couple of times even that didn’t work and I had to restore a backup. Backups are a pair of external USB 12TB NTFS drives that I alternate between, and which have everything from the server’s 5 drives including the videos. The backup drives have never gotten corrupted but they are only plugged in long enough to back up (or restore) stuff (which takes 2 days over USB 2.0).
What are my next steps to troubleshoot (and fix!) this problem?
{Sorry for the long-winded post but I’m trying to include all details which might even remotely be relevant}
(1) Also running Mosquitto and Node-RED services, but they keep their data on the NVMe drive, and I haven’t (yet) configured them to do much of anything.
(2) ~7K video files averaging 440MB, using 3.1TB out of 4TB vs 115K music files averaging 13MB, using 1.5TB out of 3TB. I buy CDs, DVDs, and BluRays, rip them, and then put the originals in climate controlled storage. A huge collection amassed over 30+ years.
You neglected to mention what partitioning looks like and what file system format and configuration is in use.
I might note that NTFS does not support Linux time stamps, permissions/ACLs, or other metadata well, at least directly.
One of the data drives, the misc. one has 2 partitions. A tiny ext2 boot partition(1) + the main ext4 partition for all the misc. files.
The other 3 drives including the video drive only have 1 partition each, formatted ext4 (I did mention that). They were initially formatted btrfs but after all the problems I went to ext4 since it is much better understood and documented and there are more tools. The NVMe drive has 3 partitions for root, home, and swap. They too were initially on btrfs but then switched to ext4. Only the external USB backup drives use NTFS, they are only plugged in when running a backup (or restore) otherwise stored in a safe.
(1) Since it's an old computer with legacy BIOS it cannot boot directly from the NVMe. The small boot partition on the SATA drive does the initial boot then hands over to the NVMe.
I assume these are DOS partition tables and not GPT then. That does help clarify things a little.
No RAID I presume, or you would have mentioned it.
I also assume you would have mentioned LVM if it were being used.
How solid is your power? I assume there are no instances of the host going down uncleanly?
Are you mounting this drive RO when it does not need to be written, or is it written on-demand all the time it is in use?
Have you examined the system log files for references to this device, the modules, or IO buffers related to this device?
What does the DMESG report for this device look like on reboot during an event?
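For anyone following along, here is one way the log check asked about above might be done from a terminal. It is only a sketch: the function name disk_log_filter is made up for this example, and sdd is just the device name that appears later in this thread.

```shell
#!/bin/sh
# Reduce kernel log text (read on stdin) to lines that mention either an
# ATA port or the given disk AND that look like an error or link event.
# $1 is the kernel device name, e.g. sdd (an example, adjust as needed).
disk_log_filter() {
    grep -E "ata[0-9]+|$1" | grep -Ei 'error|reset|link|status changed'
}

# Typical use on the server (root is needed to read the kernel log):
#   sudo dmesg -T | disk_log_filter sdd
#   sudo journalctl -k -b -1 | disk_log_filter sdd   # previous boot
```

The journalctl form only returns something if the journal is kept across reboots, which may not be the case on every install.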
Aw crap, just noticed I left out probably the most important detail!
The trigger, at least some of the time and quite possibly every time, is writing some large files to the drive. I can usually copy one new video (typically about 1.5-2.5GB) from a Windows client to the server's video share/drive without issue. But if I copy 2 or 3, even with a significant pause between them, I often get an I/O error on the client. It seems to be related to the large size of each individual file, as I routinely copy large directories (many GB, but many files) to the other drives without issue. So that sounds to me like some sort of buffer overflow(??)
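One way to narrow down a trigger like this is to repeat the large write locally on the server, with no Windows client or network share in the path. The sketch below assumes GNU dd; the function name local_write_test and the hidden test file name are placeholders invented for this example.

```shell
#!/bin/sh
# Write a throwaway file of N MiB into a directory, force it out to the
# disk, then delete it. Returns non-zero if the write or the flush fails.
local_write_test() {
    dir="$1"; mib="$2"
    f="$dir/.write_test.$$"          # placeholder name, removed afterwards
    dd if=/dev/zero of="$f" bs=1M count="$mib" conv=fsync 2>/dev/null
    rc=$?
    rm -f "$f"
    return "$rc"
}

# On the server itself:
#   local_write_test /mnt/Videos 2048 && echo "write OK" || echo "write FAILED"
#   sudo dmesg | tail -n 30
```

If a 2 GiB local write fails the same way, the network side is probably not the cause; if it succeeds repeatedly, the copy path from the client deserves a closer look.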
Quote:
I assume these are DOS partition tables and not GPT then.
The first drive, sda, which has the little boot partition is a 1TB and is partitioned msdos. The rest of the drives which are all 3+TB are partitioned gpt.
Quote:
How solid is your power? I assume there are no instances of the host going down uncleanly?
The server, being a high-powered (for its day) machine, dual processors, lots of RAM, draws quite a few watts. However, it is connected to a ridiculously huge UPS I got for free. It will keep the server running for at least 10 hours, and if the battery gets low it sends a signal over USB to trigger an orderly shut down. Other than the drive problem, the machine is rock solid.
Quote:
Are you mounting this drive RO when it does not need to be written, or is it written on-demand all the time it is in use?
The drive is mounted RW although I don't often write to the drive. Only when I buy and rip a new Blu-Ray or DVD. And for that matter, I don't even read the drive that often. Plex Media Server re-scans the drive once a day, but it saves its indexes to the NVMe. And as I mentioned, I have turned Plex off for up to 10 days and still had problems.
Quote:
Have you examined the system log files for references to this device, the modules, or IO buffers related to this device?
What does the DMESG report for this device look like on reboot during an event?
Yes, I have looked at the logs several times in the past, but being fairly new to Linux I did not know what to look for but I was looking at the logs well after the crash and nothing jumped out at me as immediately obvious. I just went and checked them again immediately after a crash and found this:
(and it goes on...)
I have no clue how to interpret that. Seems like bad sectors or a failing drive but as I elaborated above I have gone to great extents to eliminate the possibility of a hardware issue. This is the 4th drive and the 3rd brand new one and they have been connected to multiple SATA ports in multiple bays all of which are fine with the other drives.
Also noticed the first line mentions drive sdd which is the 'misc' drive but then all the other lines mention sde which is the troublesome video drive. WTF.
Maybe I need to increase a buffer size? Which/where/how?
Just one more note, I'm a retired IT guy. The companies I was working for were all in bed with MS so I'm pretty knowledgeable about Windows, hardware, and computers in general. I feel MS and Windoze has really gone downhill and now that I'm retired and not forced to work with them any more I am trying to transition myself to Linux. Alas, being older, I do not pick up on new stuff as easily so it's turning into a slow migration.
Changing the buffer sizes would be something to use to tune for performance, but it would not cause or solve this problem. The error mentions buffers because it was loading the buffers (read-ahead is normal) when the issue occurred. Had you not already swapped devices I would look REALLY hard at hardware here, but since you have ruled out the most likely half of the hardware I am considering the other candidates.
I presume you made sure that the problem did not follow a SATA cable?
What are your mount parameters for those drives? (A copy of /etc/fstab would show all I need to know)
I have found some interesting hardware issues that can cause the problem, but they do not persist across hardware changes, which this does. It sounds like a software issue, but I am having a problem identifying any software that could cause the problem.
Not giving up, but let me see these fstab entries before I go crazy. (OK, crazier! ;-) )
Just for giggles, what KERNEL version are you running? (There are a few with known issues.)
Finally, what are you using to copy TO that drive, and what do you copy FROM?
What do the SMART diagnostics (smartctl) tell you? Another thing to check would be the cables and see if changing them and plugging to a different point helps.
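A rough sketch of how the SMART check suggested above could be run with smartctl (from the smartmontools package). The helper smart_raw is a made-up name; it relies on the attribute name being column 2 and the raw value being column 10 in the table that `smartctl -A` prints for ATA drives.

```shell
#!/bin/sh
# Print the RAW_VALUE column for one named attribute from the attribute
# table printed by `smartctl -A` (read on stdin).
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# On the server (/dev/sdd is the device name used in this thread and may
# be different on another boot):
#   sudo smartctl -H /dev/sdd                  # overall health verdict
#   sudo smartctl -A /dev/sdd | smart_raw UDMA_CRC_Error_Count
#   sudo smartctl -A /dev/sdd | smart_raw Reallocated_Sector_Ct
#   sudo smartctl -t long /dev/sdd             # start an extended self-test
#   sudo smartctl -l selftest /dev/sdd         # read the result later
```

A rising UDMA_CRC_Error_Count is commonly associated with the link between drive and controller (cable, connector, backplane) rather than with the platters, which is why it is singled out here. Not every drive reports every attribute.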
Drive that was in there when I first built the machine a year ago: used Seagate BarraCuda 4TB which was previously in the home theater machine, formatted NTFS, and had all the videos (and working fine). Backed it up and reformatted to btrfs when assembling the server and then restored the videos.
Replaced with new WD Red v3.0 4TB when the problems started (pretty much immediately). It is now the 'misc.' drive.
Replaced with new WD Black 4TB, now being used as a backup to my "main brain" machine.
Replaced with new WD Red v4.0 4TB about 6-8 weeks ago.
Each of the 4 drive bays has its own SATA cable. The problem drives have at some point or another been in each of the 4 bays.
Usually, the newly ripped video files are copied from the home theater machine, an old Intel i5 machine, 8GB RAM, Windows 10 Pro x64 21H2 (slow, but I let it encode overnight).
Sometimes the video files are copied from my "main brain" machine, AMD Ryzen 5, Asus Rog motherboard, 32GB RAM, dual boot Windows 10 Pro x64 21H2 and Linux Mint 20.3.
Videos are usually watched on the home theater machine, sometimes my sister's and niece's laptops, and our roommate's gaming machines by opening the file across the network (don't copy it down) or by connecting to Plex Media Server in a browser. They are also watched on several smart TVs via Plex. None of this happens often as sister/niece/roommates typically watch regular live TV and I watch very little. About half of my BluRays I have not gotten around to watching yet. Music I play near continuously.
The video files, music files, misc files, etc. are copied to WD 12TB USB external backup drive "A" on or about the 1st of each month, and to another WD 12TB drive "B" near the middle of the month.
fstab:
Code:
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/nvme0n1p1 during installation
UUID=c060e847-d7a2-4221-85fb-600a328c6f72 / ext4 errors=remount-ro 0 1
# /boot was on /dev/sda1 during installation
UUID=6610d282-725b-4bc1-b946-93d863ad6a61 /boot ext2 defaults 0 2
# swap was on /dev/nvme0n1p5 during installation
UUID=e0e6779d-3ce2-46fb-ac83-d9f4102fa8c2 none swap sw 0 0
#
# * * * My Custom * * *
#
/dev/disk/by-uuid/b29cea1d-68fa-450d-8ca1-ce9515b0c711 /mnt/Music auto nosuid,nodev,nofail,noexec,x-gvfs-show 0 0
/dev/disk/by-uuid/7086e37a-c99a-4f04-a83d-75bc57f94283 /mnt/WizNaz auto nosuid,nodev,nofail,noexec,acl,x-gvfs-show 0 0
/dev/disk/by-uuid/7f590335-217f-4a91-8752-13c61d53d746 /mnt/DocShare auto nosuid,nodev,nofail,x-gvfs-show 0 0
/dev/disk/by-uuid/8f5fc217-2f3f-45c1-a17a-10c1a4a5ea6c /mnt/Videos auto noexec,nosuid,nodev,nofail,x-gvfs-show 0 0
#/dev/disk/by-id/usb-WD_My_Passport_25E2_57583531444136445859454B-0:0-part1 /mnt/OrangeBrick auto nosuid,nodev,nofail,x-gvfs-show 0 0
What are you using to copy TO that drive?
You have them on the Windows machine, then they get to the Linux machine running kernel 5.4.0. When that happens is when things go south on you, so what is actually doing that and how becomes an important question! Can you describe the detail of that process?
PS. I am using 5.18.7, but know of no specific issues with 5.4 related to this issue. There were some 3.10.x issues if I remember correctly, but you are certainly on the good side of that.
The x-gvfs-show option should not really be needed as long as you are mounting under /mnt, but all it does is tell the desktop to show the volume in the file manager. Should be harmless.
All of my OLD MACHINE data mounts have either the noatime or relatime option. With recent kernels relatime is the default. I also prefix the option list with defaults to ensure the defaults take effect: later options on the line will override them. I recommend using defaults, as updates may provide improved defaults and I want to take advantage of those without having to change fstab every time.
Relatime or noatime only avoids useless access time stamp writes, which serve no purpose and add to the I/O load on the drive. Nothing about that would cause or avoid your problem.
In the DMESG report are there any errors or messages when these drives mount? The nofail option forces the mount during some fault conditions, but should not stop them from being reported.
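To see which options are actually in effect on a mounted drive (as opposed to what fstab asks for), something along these lines could be used. The helper has_opt is a name invented for this sketch; findmnt comes with util-linux.

```shell
#!/bin/sh
# Return success if a comma-separated mount option string ($1) contains
# exactly the option named in $2.
has_opt() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# On the server:
#   opts=$(findmnt -no OPTIONS /mnt/Videos)
#   echo "$opts"
#   has_opt "$opts" relatime && echo "relatime is active"
#   has_opt "$opts" ro && echo "WARNING: currently mounted read-only"
```

Checking for ro is worth doing right after a failed copy, since a filesystem that has been switched to read-only after an error would look "write protected" to a client.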
So, I just (re) attempted to copy some Christmas movies I had picked up at a yard sale and ripped. Opened a copy of Windows File Explorer on the theater machine and navigated to my ripping folder C:/WizStuff/NewRips, and another copy of Windows File Explorer to the Videos Drive mapped to drive V: and navigated to the V:\Movies\Adult\Christmas folder. I then drag-dropped the first file National.Lampoon.Christmas.Vacation.Remastered_1080p(1989).mp4 which is 1.4GB from the C: window to the V: window. It copied about 25% and I got "Interrupted Action: An unexpected error is keeping you from copying the file. If you continue to receive this error, you can use the error code to search for help with this problem. Error 0x8007003B: An unexpected network error occurred."
(Note If you google that error, you will see there was a known issue several years ago with the Windows Firewall that caused that error but it has long since been patched. However, I have tried it with the firewall turned off and still have the problem.)
That error gets followed by "Interrupted Action: The disk is write protected. Remove the write-protection or use another disk."
I then went to the server and opened its file explorer (nemo) and the videos drive has been unmounted. I remount it and nothing shows up, the root directory is completely blank. But the little status bar shows the drive to be about 3/4 full and the status bar sez "0 items, Free space: 868.8 GB". Now I cannot unmount it (by normal means) and if I run lsof in the terminal I can see that smbd (Samba I assume) is using it. I check the logs and find:
Code:
7:15:38 PM kernel: EXT4-fs error (device sdd1): __ext4_find_entry:1538: inode #2: comm nemo: reading directory lblock 0
7:08:50 PM ata_id: unable to open '/dev/sde'
7:08:50 PM kernel: Buffer I/O error on dev sde, logical block 976754645, async page read
7:08:50 PM kernel: blk_update_request: I/O error, dev sde, sector 7814037160 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
7:08:40 PM kernel: ata5.00: status: { DRDY }
7:08:40 PM kernel: ata5: SError: { PHYRdyChg DevExch }
7:08:40 PM kernel: ata5.00: irq_stat 0x00400040, connection status changed
7:08:34 PM kernel: unable to open '/dev/sde'
7:08:34 PM kernel: Buffer I/O error on dev sde, logical block 3, async page read
7:08:34 PM kernel: blk_update_request: I/O error, dev sde, sector 24 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
7:08:34 PM kernel: Dev sde: unable to read RDB block 8
7:08:34 PM kernel: Buffer I/O error on dev sde, logical block 1, async page read
7:08:34 PM kernel: blk_update_request: I/O error, dev sde, sector 8 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
7:08:34 PM kernel: Buffer I/O error on dev sde, logical block 1, async page read
7:08:34 PM kernel: blk_update_request: I/O error, dev sde, sector 8 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
7:08:24 PM kernel: ata5.00: status: { DRDY }
7:08:24 PM kernel: ata5: SError: { PHYRdyChg DevExch }
7:08:24 PM kernel: ata5.00: irq_stat 0x00400040, connection status changed
7:08:01 PM kernel: EXT4-fs error (device sdd1): __ext4_get_inode_loc:4723: inode #162268448: block 649068657: comm PMS LibUpdater: unable to read itable block
7:07:56 PM systemd: Failed unmounting /mnt/Videos.
7:07:56 PM kernel: EXT4-fs error (device sdd1): __ext4_find_entry:1538: inode #2: comm PMS GTP: reading directory lblock 0
7:07:56 PM systemd: Failed unmounting /mnt/Videos.
7:07:56 PM kernel: EXT4-fs error (device sdd1): __ext4_get_inode_loc:4723: inode #162267186: block 649068579: comm PMS ScannerPipe: unable to read itable block
7:07:56 PM systemd: Failed unmounting /mnt/Videos.
7:07:56 PM kernel: JBD2: Error while async write back metadata bh 737673220.
7:07:56 PM kernel: Buffer I/O error on dev sdd1, logical block 737673220, lost async page write
7:07:56 PM kernel: EXT4-fs (sdd1): ext4_writepages: jbd2_start: 9223372036854773759 pages, ino 162268447; err -30
7:07:56 PM kernel: Buffer I/O error on dev sdd1, logical block 0, lost sync page write
7:07:56 PM kernel: EXT4-fs error (device sdd1) in ext4_writepages:2918: IO failure
7:07:56 PM kernel: Buffer I/O error on dev sdd1, logical block 0, lost sync page write
7:07:56 PM kernel: JBD2: Error -5 detected when updating journal superblock for sdd1-8.
7:07:56 PM kernel: Buffer I/O error on dev sdd1, logical block 488144896, lost sync page write
7:07:56 PM kernel: Aborting journal on device sdd1-8.
7:07:56 PM kernel: blk_update_request: I/O error, dev sdd, sector 5903670272 op 0x1:(WRITE) flags 0x4000 phys_seg 21 prio class 0
7:07:56 PM kernel: Buffer I/O error on device sdd1, logical block 737955849
7:07:56 PM kernel: blk_update_request: I/O error, dev sdd, sector 5903664128 op 0x1:(WRITE) flags 0x0 phys_seg 8 prio class 0
7:07:46 PM kernel: ata5.00: status: { DRDY }
7:07:46 PM kernel: ata5: SError: { PHYRdyChg LinkSeq DevExch }
7:07:46 PM kernel: ata5.00: irq_stat 0x08400040, interface fatal error, connection status changed
Also note that /dev/sde is not normally a valid drive. Its valid only when one of the USB backup drives is plugged in, which they aren't at the moment. /dev/sdd1 is the videos drive (duh).
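Two small checks may help in reading the log above. This is a sketch with made-up helper names (ata_port_of_path, sector_to_bytes). The first pulls the ataN port out of a disk's sysfs path, so the ata5 lines can be tied to a specific drive; it assumes the resolved path of a SATA disk contains an ataN component, which is not the case for USB disks. The second turns a sector number from the log into a byte offset, using the 512-byte units the kernel block layer reports.

```shell
#!/bin/sh
# Extract the ataN component from a resolved sysfs path (empty if none).
ata_port_of_path() {
    printf '%s\n' "$1" | sed -n 's|.*/\(ata[0-9][0-9]*\)/.*|\1|p'
}

# Convert a 512-byte sector number into a byte offset.
sector_to_bytes() {
    echo $(( $1 * 512 ))
}

# On the server:
#   ata_port_of_path "$(readlink -f /sys/block/sdd)"
#   sector_to_bytes 7814037160    # sector from the sde read error above
```

The sde sector works out to roughly 4.0 TB into the device, i.e. near the end of a 4TB disk. Together with the "connection status changed" and DevExch lines on ata5, that would be consistent with the video drive dropping off the SATA link and being detected again under a new name (sde) while the mounted filesystem still pointed at the old sdd1. That is an interpretation, not something the log proves on its own.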
So next I restarted the server to get the drive working again. (I have googled trying to find a faster/better/safer/easier way to do it from the terminal but got way too much conflicting information and advice) Once the server was back up, I tried a movie from my "main brain" machine. Used just one copy of File Explorer, navigated to D:\StuphFromRichard\Vidz, right clicked on "Wyrd Sisters.avi" and selected copy. Then navigated to V:\Movies\Adult, right clicked, and did paste. It never even started to copy, I immediately got "1 Interrupted Action: An unexpected error is keeping you from copying the file. If you continue to receive this error, you can use the error code to search for help with this problem. Error 0x8007045D: The request could not be performed because of an I/O device error." Went to the server and the drive had dismounted. I clicked it in nemo and it remounted itself without issue this time.
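On getting the drive back without a full restart, a sequence along the following lines is one possibility. It is a sketch, not a tested recipe for this machine: the function names run and recover_mount are invented here, smbd is the Samba service name on Ubuntu/Mint-style systems, and Plex (if installed) would need stopping the same way. The device is given by its UUID path from the fstab above because the sdX name may change. Setting DRY_RUN=1 only prints the commands.

```shell
#!/bin/sh
# Print commands instead of running them when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = mount point, $2 = block device (a by-uuid path survives renaming)
recover_mount() {
    mnt="$1"; dev="$2"
    run sudo systemctl stop smbd          # release Samba's open handles
    run sudo fuser -vm "$mnt"             # list anything still using it
    run sudo umount "$mnt" || return 1    # stop here if it is still busy
    run sudo e2fsck -f "$dev"             # check ext4 while unmounted
    run sudo mount "$mnt"                 # remount using /etc/fstab
    run sudo systemctl start smbd
}

# Preview first, then run for real:
#   DRY_RUN=1 recover_mount /mnt/Videos /dev/disk/by-uuid/8f5fc217-2f3f-45c1-a17a-10c1a4a5ea6c
```

If the drive has actually dropped off the SATA link, the by-uuid path may not exist until it is detected again, in which case none of this will help and a reboot (or fixing the link) is still needed.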
Next, I navigated to W:\FixThis on the main brain (W: is actually a local drive on that machine) where I had been renaming and tagging a bunch of music files that were incorrect. I selected 7 folders containing roughly 100+ .flac and .mp3 files totalling 5.5GB, right clicked and did copy. Navigated to the mapped network drive M:\Songs\Classic Rock on the server, right clicked and did paste. That worked fine, they all copied without errors.
In case you are wondering, here is my samba.conf
Code:
[global]
workgroup = WORKGROUP
store dos attributes = no
unix charset = UTF-8
server min protocol = NT1
ntlm auth = yes
load printers = no
printing = bsd
printcap name = /dev/null
disable spoolss = yes
server string = Wizard's file and media server
bind interfaces only = yes
log file = /var/log/samba/log.%m
max log size = 1000
logging = file
panic action = /usr/share/samba/panic-action %d
server role = standalone server
obey pam restrictions = yes
unix password sync = yes
passwd program = /usr/bin/passwd %u
passwd chat = *Enter\snew\s*\spassword:* %n\n *Retype\snew\s*\spassword:* %n\n *password\supdated\ssuccessfully* .
pam password change = yes
map to guest = bad user
usershare allow guests = no
[Videos]
path = /mnt/Videos
comment = Movies, TV Shows, and more!
valid users = wizard, betty, sheri, jack, sandy, pikzil
write list = wizard,sandy,pikzil
admin users = wizard
create mode = 0777
read only = no
available = yes
browseable = yes
writable = yes
guest ok = no
public = no
printable = no
locking = yes
strict locking = no
[WizNaz]
path = /mnt/WizNaz
comment = Little bit of everything
valid users = wizard, sandy, pikzil
write list = wizard,sandy,pikzil
admin users = wizard
create mode = 0777
read only = no
available = yes
browseable = yes
writable = yes
guest ok = no
public = no
printable = no
locking = yes
strict locking = no
[Music]
path = /mnt/Music
comment = Music and Music Videos
valid users = wizard, betty, sheri, jack, sandy, pikzil
write list = wizard,sandy,pikzil
admin users = wizard
create mode = 0777
read only = no
available = yes
browseable = yes
writable = yes
guest ok = no
public = no
printable = no
locking = yes
strict locking = no
[DocShare]
path = /mnt/DocShare
comment = Documents shared by all computers
valid users = wizard, sandy, pikzil
write list = wizard,sandy,pikzil
admin users = wizard
create mode = 0777
read only = no
available = yes
browseable = yes
writable = yes
guest ok = no
public = no
printable = no
locking = yes
strict locking = no
So next I'm gonna try copying FROM LINUX on the main brain to the server. Will post the results of that shortly (have to reboot).
#1 never use Windows software and expect it to deal properly with Linux drives or transfers.
#2 if it can be done in a gui does not mean it SHOULD be done in a gui. In this case avoiding the gui might well avoid the problem because of the tools used in the background.
How is that video drive mounted on the Windows machine? Samba tools? NFS? SSHFS? Or something else? I suspect the transfer of large files over that channel is part of the problem. While tools on each end may manage it just fine, that communication channel may not. (I have run into that issue often, with different symptoms and specific errors but the result of the same misbehavior.)
Linux abandoned POSIX limits LONG ago, but some standard tools that are completely POSIX standard still have old limits hard-coded. You want to ensure that you are using modern tools with the newer limits and standards. Working from the command line gives you control so you can ensure you have not engaged the older tools.
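As an illustration of the command-line route, a large file could be pushed from a Linux client with rsync over SSH and then compared by checksum. This is a sketch under assumptions: the host name "server" and the file name are placeholders, it requires an SSH server on the receiving machine, and same_checksum is a helper name invented for this example.

```shell
#!/bin/sh
# Return success only if both files exist and have the same SHA-256 hash.
same_checksum() {
    a=$(sha256sum "$1" 2>/dev/null | awk '{ print $1 }')
    b=$(sha256sum "$2" 2>/dev/null | awk '{ print $1 }')
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# From a Linux client (placeholders: host "server", file movie.mp4):
#   rsync -av --progress movie.mp4 wizard@server:/mnt/Videos/Movies/
#   sha256sum movie.mp4
#   ssh wizard@server sha256sum /mnt/Videos/Movies/movie.mp4
#
# With both copies reachable as local paths:
#   same_checksum movie.mp4 /mnt/Videos/Movies/movie.mp4 && echo "copies match"
```

This takes Samba and Windows Explorer out of the picture, so a failure here would point back toward the server side.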
So first of all, Thank you very much wpeckham for your assistance and your time! I am now having problems writing small files to that drive and even reading it so I'm starting to question if I do indeed have a hardware problem with the drive, despite all my tests, diagnostics, and swapping of components. Fortunately I have yet another brand new WD Red 3TB drive still sealed in the box straight from Western Digital themselves (warranty replacement for failed drive in my NVR). But first I will need to run a new full set of backups and copy some files that I use often to an alternate location. The server is not "mission-critical" to our household, but it's certainly inconvenient when it is offline. Problems with the HP NC264T 4-port network card in it have resurfaced so I will be replacing that as well and I won't rule out that it is causing the read/write problems to the drive across the network (unlikely that would affect just one drive).
Finally, I have other projects I am juggling right now as well. I need to retire from being retired-- go back to work so I can rest! I will post an update in a few days. Once again, Thank you!
While there may be a hardware component to this, you took the right steps to isolate that and it hid. I suspect the kicker, the thing that triggered the problem, is software and it MAY be from mixing Windows and Linux mounting of the device with writes of large files. As I mentioned, that one has bit me before.
Please do get back after you are rested and reconfigured: this will nag at me until I know for sure what went south. ;-)