LinuxQuestions.org
Linux - Server: This forum is for the discussion of Linux software used in a server-related context.
Old 05-16-2008, 10:21 AM   #1
Curlyau
LQ Newbie
 
Registered: Jan 2008
Posts: 3

Rep: Reputation: 0
Broken software RAID5 set - help!


I have a Debian box running kernel 2.6.18, and I appear to have broken my RAID5 set.

Previously I had 4 500GB drives, and it was working perfectly. I added another disk and did:

mdadm --add /dev/md1 /dev/sde1
mdadm --grow /dev/md1 --raid-devices=5
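In hindsight, mdadm can checkpoint a reshape's critical section with --backup-file, which would have made an interrupted grow resumable instead of fatal. A sketch of the same operation with a backup file (the backup path is just an example):

```shell
# Hedged sketch: grow a RAID5 from 4 to 5 devices with a checkpoint file.
# If the reshape is interrupted, the same --backup-file can be passed to
# mdadm --assemble to resume it. The path is an example, not what I ran.
mdadm --add /dev/md1 /dev/sde1
mdadm --grow /dev/md1 --raid-devices=5 \
      --backup-file=/root/md1-grow.backup
```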

It started to reshape the array, but then the new drive started giving errors and the machine hung. I think it was due to a problem with a PCI SATA card.

I resolved that, but now I've somehow broken the array.


cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md1 : inactive sdd1[5] sdb1[3] sdc1[1]
1465151616 blocks super 1.0

unused devices: <none>




mdadm -D /dev/md1
/dev/md1:
Version : 01.00.03
Creation Time : Sun Dec 23 01:28:08 2007
Raid Level : raid5
Device Size : 488383744 (465.76 GiB 500.10 GB)
Raid Devices : 5
Total Devices : 3
Preferred Minor : 1
Persistence : Superblock is persistent

Update Time : Fri May 16 22:05:16 2008
State : clean, degraded, Not Started
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 128K

Delta Devices : 1, (4->5)

Name : 'Fuckyfucky3':1
UUID : 43eff327:8d1aa506:c0df2849:005c003f
Events : 1420750

Number Major Minor RaidDevice State
5 8 49 0 active sync /dev/sdd1
1 8 33 1 active sync /dev/sdc1
3 8 17 2 active sync /dev/sdb1
3 0 0 3 removed
4 0 0 4 removed



mdadm -E /dev/sda1
/dev/sda1:
Magic : a92b4efc
Version : 01
Feature Map : 0x4
Array UUID : 43eff327:8d1aa506:c0df2849:005c003f
Name : 'Fuckyfucky3':1
Creation Time : Sun Dec 23 01:28:08 2007
Raid Level : raid5
Raid Devices : 5

Device Size : 976767856 (465.76 GiB 500.11 GB)
Array Size : 3907069952 (1863.04 GiB 2000.42 GB)
Used Size : 976767488 (465.76 GiB 500.10 GB)
Super Offset : 976767984 sectors
State : clean
Device UUID : a15eee10:6cd6b795:d18cb3b2:770139c2

Reshape pos'n : 143872 (140.52 MiB 147.32 MB)
Delta Devices : 1 (4->5)

Update Time : Fri May 16 21:40:35 2008
Checksum : c6697c39 - correct
Events : 1420746

Layout : left-symmetric
Chunk Size : 128K

Array Slot : 4 (failed, 1, failed, 2, 3, 0)
Array State : uuuU_ 2 failed


mdadm -E /dev/sdb1
/dev/sdb1:
Magic : a92b4efc
Version : 01
Feature Map : 0x4
Array UUID : 43eff327:8d1aa506:c0df2849:005c003f
Name : 'Fuckyfucky3':1
Creation Time : Sun Dec 23 01:28:08 2007
Raid Level : raid5
Raid Devices : 5

Device Size : 976767856 (465.76 GiB 500.11 GB)
Array Size : 3907069952 (1863.04 GiB 2000.42 GB)
Used Size : 976767488 (465.76 GiB 500.10 GB)
Super Offset : 976767984 sectors
State : clean
Device UUID : 5b38c5a2:798c6793:91ad6d1e:9cfee153

Reshape pos'n : 143872 (140.52 MiB 147.32 MB)
Delta Devices : 1 (4->5)

Update Time : Fri May 16 22:05:16 2008
Checksum : 53542fac - correct
Events : 1420750

Layout : left-symmetric
Chunk Size : 128K

Array Slot : 3 (failed, 1, failed, 2, failed, 0)
Array State : uuU__ 3 failed


mdadm -E /dev/sdc1
/dev/sdc1:
Magic : a92b4efc
Version : 01
Feature Map : 0x4
Array UUID : 43eff327:8d1aa506:c0df2849:005c003f
Name : 'Fuckyfucky3':1
Creation Time : Sun Dec 23 01:28:08 2007
Raid Level : raid5
Raid Devices : 5

Device Size : 976767856 (465.76 GiB 500.11 GB)
Array Size : 3907069952 (1863.04 GiB 2000.42 GB)
Used Size : 976767488 (465.76 GiB 500.10 GB)
Super Offset : 976767984 sectors
State : clean
Device UUID : 673ba6d4:6c46fd55:745c9c93:3fa8bf21

Reshape pos'n : 143872 (140.52 MiB 147.32 MB)
Delta Devices : 1 (4->5)

Update Time : Fri May 16 22:05:16 2008
Checksum : 8ad7452f - correct
Events : 1420750

Layout : left-symmetric
Chunk Size : 128K

Array Slot : 1 (failed, 1, failed, 2, failed, 0)
Array State : uUu__ 3 failed



mdadm -E /dev/sdd1
/dev/sdd1:
Magic : a92b4efc
Version : 01
Feature Map : 0x4
Array UUID : 43eff327:8d1aa506:c0df2849:005c003f
Name : 'Fuckyfucky3':1
Creation Time : Sun Dec 23 01:28:08 2007
Raid Level : raid5
Raid Devices : 5

Device Size : 976767856 (465.76 GiB 500.11 GB)
Array Size : 3907069952 (1863.04 GiB 2000.42 GB)
Used Size : 976767488 (465.76 GiB 500.10 GB)
Super Offset : 976767984 sectors
State : clean
Device UUID : 99b87c50:a919bd63:599a135f:9af385ba

Reshape pos'n : 143872 (140.52 MiB 147.32 MB)
Delta Devices : 1 (4->5)

Update Time : Fri May 16 22:05:16 2008
Checksum : 78ab1ee2 - correct
Events : 1420750

Layout : left-symmetric
Chunk Size : 128K

Array Slot : 5 (failed, 1, failed, 2, failed, 0)
Array State : Uuu__ 3 failed


mdadm -E /dev/sde1
/dev/sde1:
Magic : a92b4efc
Version : 01
Feature Map : 0x4
Array UUID : 43eff327:8d1aa506:c0df2849:005c003f
Name : 'Fuckyfucky3':1
Creation Time : Sun Dec 23 01:28:08 2007
Raid Level : raid5
Raid Devices : 5

Device Size : 976767856 (465.76 GiB 500.11 GB)
Array Size : 3907069952 (1863.04 GiB 2000.42 GB)
Used Size : 976767488 (465.76 GiB 500.10 GB)
Super Offset : 976767984 sectors
State : clean
Device UUID : 89b53542:d1d820bc:f2ece884:4785869a

Reshape pos'n : 143872 (140.52 MiB 147.32 MB)
Delta Devices : 1 (4->5)

Update Time : Fri May 16 22:05:16 2008
Checksum : c89db84b - correct
Events : 1418968

Layout : left-symmetric
Chunk Size : 128K

Array Slot : 6 (failed, 1, failed, 2, failed, 0)
Array State : uuu__ 3 failed






What should I do next? Should I zero the superblock on one of the drives?


If I try to force the array to start:

mdadm --assemble --force /dev/md1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
mdadm: forcing event count in /dev/sda1(3) from 1420746 upto 1420750
mdadm: clearing FAULTY flag for device 0 in /dev/md1 for /dev/sda1
mdadm: /dev/md1 has been started with 4 drives (out of 5).


Then in dmesg:

raid5:md1: read error not correctable (sector 96720 on sda1).
raid5:md1: read error not correctable (sector 96728 on sda1).
raid5:md1: read error not correctable (sector 96736 on sda1).
raid5:md1: read error not correctable (sector 96744 on sda1).
raid5:md1: read error not correctable (sector 96752 on sda1).
raid5:md1: read error not correctable (sector 96760 on sda1).
ata1: EH complete
md: md1: sync done.
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: (BMDMA stat 0x20)
ata1.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
ata1: EH complete
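Before experimenting any further, one way to make forced assembly attempts like the one above reversible is to run them against device-mapper copy-on-write overlays, so nothing is written to the real disks. A sketch of that technique (all paths, sizes, and device names here are assumptions, not something I have run on this box):

```shell
# Hedged sketch: assemble against COW overlays so any writes from a
# forced assemble/resync land in a scratch file, not on the real disks.
for d in sda1 sdb1 sdc1 sdd1 sde1; do
    truncate -s 4G "/tmp/overlay-$d"              # sparse COW store
    loop=$(losetup -f --show "/tmp/overlay-$d")   # back it with a loop dev
    sectors=$(blockdev --getsz "/dev/$d")         # size in 512-byte sectors
    # dm snapshot: reads fall through to /dev/$d, writes go to the overlay
    echo "0 $sectors snapshot /dev/$d $loop P 8" | dmsetup create "cow-$d"
done
# Assemble using the overlay devices instead of the raw partitions:
mdadm --assemble --force /dev/md1 /dev/mapper/cow-sd?1
```

If an attempt makes things worse, tearing down the snapshots (dmsetup remove) discards the writes and the originals are untouched.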



Any ideas or suggestions are much appreciated.
 
Old 05-16-2008, 11:52 PM   #2
JimBass
Senior Member
 
Registered: Oct 2003
Location: New York City
Distribution: Debian Sid 2.6.32
Posts: 2,100

Rep: Reputation: 49
With 3 failed drives, I suspect you're learning the hard way that software RAID5 is not the best of choices for data. Whatever crashed the system probably also screwed up the drives, which is the unfortunate nature of software RAID: if the OS goes down for any reason, and the OS is also controlling the array, then the array is left with no controller, and data will probably be lost.

It is also possible you have a hardware error with some of the drives, but I think that is a remote possibility at best.

I would expect the data could be recovered, but it would probably need recovery experts. Their cost would be way above the cost of a hardware controller.

Peace,
JimBass
 
Old 05-18-2008, 07:33 AM   #3
Curlyau
LQ Newbie
 
Registered: Jan 2008
Posts: 3

Original Poster
Rep: Reputation: 0
Hmm, yeah, that's probably not the answer I was looking for.

Thanks for your honesty though.

Now, the data is just a bunch of movies and music I'd ripped at home, so it's not overly critical that I get it all back. However, I'd like to think I didn't do anything rash or irreversible to lose it.

Can anyone point me towards some documentation I can read to better understand software RAID, so I can try to recover the array myself? Bear in mind that I have (within reason) all the time in the world to fiddle with the nuts and bolts, rather than looking for one magic command to rebuild the array.
 
Old 05-20-2008, 10:21 PM   #4
Curlyau
LQ Newbie
 
Registered: Jan 2008
Posts: 3

Original Poster
Rep: Reputation: 0
Well, just in case it helps anyone else out.

I pulled each drive from the box separately and ran a full SMART test on it. The new Seagate failed, as did one of the Samsungs.

I then did a full surface scan on the Samsung, which showed 5 faulty blocks.
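A rough sketch of those per-drive checks, for anyone in the same spot (the device name is an example; smartctl comes from the smartmontools package):

```shell
# Hedged sketch of per-drive health checks on a pulled disk.
smartctl -t long /dev/sdX   # start the extended (full-surface) self-test
smartctl -a /dev/sdX        # afterwards: test result, reallocated sectors
badblocks -sv /dev/sdX      # read-only surface scan (non-destructive)
```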

Both should be replaced under warranty, but that doesn't help me. Since the 'grow' operation only got a very small way through, almost all of my data should still be on the old array members, right?

So what I'm thinking is: get a couple of new drives, dd the faulty Samsung onto a new one, then do the same for the faulty Seagate, and assemble the array. That /should/ continue the grow operation, or at least bring the array up in a degraded state, right?
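A sketch of that cloning plan; GNU ddrescue (if available) copes with read errors far better than plain dd, which is the fallback. Device names and the map-file path are examples:

```shell
# Hedged sketch: clone a failing drive onto a replacement before
# reassembling. /dev/sdX = failing drive, /dev/sdY = new drive (examples).
ddrescue -f /dev/sdX /dev/sdY /root/sdX.map   # retries/skips bad areas,
                                              # logs progress in the map file
# Fallback with plain dd: conv=noerror keeps going past read errors,
# conv=sync pads failed reads with zeros to preserve offsets.
dd if=/dev/sdX of=/dev/sdY bs=64k conv=noerror,sync

# With the clones in place of the faulty drives, try assembling:
mdadm --assemble --force /dev/md1 /dev/sd[abcde]1
```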
 
  

