Raid failure

ravand · 03-07-2013, 02:20 AM

Yesterday, after restarting the server, we made the horrifying discovery that our home folder had been rolled back by almost half a year!
We immediately contacted our server provider (Hetzner), and they told us that the RAID md127 didn't start up correctly and that we should reload the RAID manually.

I am asking for help on this forum because I am new to this whole subject, and there is a high risk of losing all our data if we approach this the wrong way.

To our problem:

When typing in "cat /proc/mdstat":
http://puu.sh/2dtXE (Screenshot)
We can see that md127 shows _U. As far as I understand, that means sda4 can't be loaded but sdb4 is loaded.

When having a closer look into md127 with "mdadm -D /dev/md127":
http://puu.sh/2du19 (Screenshot)
We see that device number 0 has been removed and number 1 is running.

I would also have given you /etc/raidtab, but for some reason it's missing on our root!

As mentioned above, I don't really know how to approach a case like this. Do I just reactivate the RAID with commands, copy it over to another one, or do we even have to get the disk swapped?

I would be very thankful for any kind of help and advice you can give me.
I am kind of scared of losing our important data, that's why I am asking here :S I hope you understand.

Thanks in advance
ravand

chrism01 · 03-07-2013, 05:03 AM

1. How about /etc/mdadm.conf ?

2. what distro+version

Code:

uname -a

cat /etc/*release*

3. It looks like a strange setup; you appear to have 4 RAID1 (mirror) sets, but only 2 physical disks: sda, sdb.
This is not a good idea: if one disk goes bad, all RAID sets are affected.

md0 = sda1, sdb1
md1 = sda2, sdb2
md2 = sda3, sdb3

& I suspect md3 should be = sda4, sdb4.

What you appear to have is md3 has split into 2 single disk RAID1 sets; md3 & md127.
Can you check the conf file, or find out some other way how the RAID sets were built, e.g. by asking your provider?

ravand · 03-07-2013, 05:26 AM

1. I didn't find mdadm.conf in /etc, but there is one in /etc/mdadm/. It doesn't say much, though :/

Quote:

DEVICES /dev/[hs]d*
MAILADDR xxxxx@xxxxxx
MAILFROM xxxxx@xxxxx

I x'ed out the email addresses for privacy.

2. The kernel is:

Quote:

Linux localhost 2.6.32-5-amd64 #1 SMP Sun May 6 04:00:17 UTC 2012 x86_64 GNU/Linux

The command you provided for the distro didn't work, and neither did "cat /etc/*-release".

But I got it working with "lsb_release -a":

Quote:

No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 6.0.6 (squeeze)
Release: 6.0.6
Codename: squeeze

3. Do you think that would explain the odd number 127?

chrism01 · 03-07-2013, 05:41 AM

2. Actually, there's no '-' in my 'cat' cmd; it's deliberate, so that it usually works on most distros.

3. It certainly looks like it, given the other RAID arrays and their numbering.
That's why it's important to find out if they are 2 halves of the same RAID set, or if you've broken 2 sets. My money is on the former.
Whoever built the sets should know... and you NEED to know before you try fixing anything.
Incidentally, if you can avoid using those 2 and ideally unmount them, that should stop any further drift in content.

ravand · 03-07-2013, 05:49 AM

Quote:

Originally Posted by chrism01

2. Actually, there's no '-' in my 'cat' cmd; it's deliberate, so that it usually works on most distros.

3. It certainly looks like it, given the other RAID arrays and their numbering.
That's why it's important to find out if they are 2 halves of the same RAID set, or if you've broken 2 sets. My money is on the former.
Whoever built the sets should know... and you NEED to know before you try fixing anything.
Incidentally, if you can avoid using those 2 and ideally unmount them, that should stop any further drift in content.

We haven't really touched anything on the RAIDs; it must have happened automatically.
Do you know any other way of finding out how everything looked before the incident, since the conf files don't tell us anything for some reason? :/
Would the provider know?

EDIT: I might have found something that could support your assumption

When typing in "mdadm --detail --scan >> /etc/mdadm/mdadm.conf" I get the following, plus one error message:

mdadm.conf:

Quote:

ARRAY /dev/md/0 metadata=1.2 name=rescue:0 UUID=457ffb60:47f0ba44:0aa1b92a:647d5935
ARRAY /dev/md/1 metadata=1.2 name=rescue:1 UUID=b36da940:7c5b51e8:78805318:4bf6110a
ARRAY /dev/md/2 metadata=1.2 name=rescue:2 UUID=182a8f0d:8f295d0a:bf4e2ebf:7113e813
ARRAY /dev/md/3 metadata=1.2 name=rescue:3 UUID=45958b4b:1024b8cb:30a98470:705d7110

error:

Quote:

mdadm: cannot open /dev/md/rescue:3: No such file or directory

EDIT2:

Also, here is a screenshot of the md3 details. Both md127 and md3 refer to the name "rescue:3". Do you think that hints at a split?
http://puu.sh/2dxdd

ravand · 03-07-2013, 11:17 AM

EDIT: Sorry, I didn't mean to spam. Either my connection or the page was lagging, so I may have accidentally clicked Post several times.

ravand · 03-07-2013, 11:19 AM

I unmounted md127 to see what would happen and restarted the server. The md127 entry was gone, and so was the md127 file in /dev/. The /home directory is empty now.

Is that normal? Or did we screw up here?

Also, we get this error:

Quote:

mount: wrong fs type, bad option, bad superblock on /dev/md3,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so

chrism01 · 03-07-2013, 05:08 PM

1. check the partitions again

Code:

cat /proc/mdstat

mdadm --detail /dev/md3
mdadm --detail /dev/md127

2. Might be worth listing the disks/partitions as well

Code:

fdisk -l

that's a lowercase L

3. Do ask your provider how they set it up

4. hope you have a backup
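
A note on reading the /proc/mdstat output from step 1: the bracketed map at the end of each "blocks" line has one character per member slot, U for a member that is up and _ for a missing one, so [_U] or [U_] marks a degraded mirror. The following is only a minimal sketch for listing degraded arrays. The function name degraded_arrays and the sample text are illustrative assumptions; on a live system the function would read /proc/mdstat instead.

```shell
# Print the name of every md array whose member map contains "_",
# i.e. at least one slot is missing. Reads mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }             # header line: remember the array name
        /blocks/ && $NF ~ /_/ { print name }    # status line: last field is the [U_] map
    '
}

# Illustrative sample in the format of /proc/mdstat (an assumption, not live data).
mdstat_sample='md127 : active raid1 sdb4[1]
      1822442815 blocks super 1.2 [2/1] [_U]

md3 : active raid1 sda4[0]
      1822442815 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0] sdb3[1]
      1073740664 blocks super 1.2 [2/2] [UU]'

printf '%s\n' "$mdstat_sample" | degraded_arrays

# On a real system:
#   degraded_arrays < /proc/mdstat
```

The sketch only reports which arrays are degraded; it changes nothing on disk, so it should be safe to run before deciding on a repair.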

ravand · 03-08-2013, 05:10 AM

It seems that md127 has reappeared after another restart, but the home directory is still empty.

1. I noticed that after running the command, md3 and md127 both say "(auto-read-only)". What does that mean?

Quote:

Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active (auto-read-only) raid1 sdb4[1]
1822442815 blocks super 1.2 [2/1] [_U]

md3 : active (auto-read-only) raid1 sda4[0]
1822442815 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0] sdb3[1]
1073740664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
524276 blocks super 1.2 [2/2] [UU]

md0 : active (auto-read-only) raid1 sda1[0] sdb1[1]
33553336 blocks super 1.2 [2/2] [UU]

unused devices: <none>

mdadm --detail /dev/md3:

Quote:

/dev/md3:
Version : 1.2
Creation Time : Sat Jun 23 13:47:29 2012
Raid Level : raid1
Array Size : 1822442815 (1738.02 GiB 1866.18 GB)
Used Dev Size : 1822442815 (1738.02 GiB 1866.18 GB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent

Update Time : Thu Mar 7 17:51:13 2013
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0

Name : rescue:3
UUID : 45958b4b:1024b8cb:30a98470:705d7110
Events : 3101836

Number Major Minor RaidDevice State
0 8 4 0 active sync /dev/sda4
1 0 0 1 removed

mdadm --detail /dev/md127:

Quote:

/dev/md127:
Version : 1.2
Creation Time : Sat Jun 23 13:47:29 2012
Raid Level : raid1
Array Size : 1822442815 (1738.02 GiB 1866.18 GB)
Used Dev Size : 1822442815 (1738.02 GiB 1866.18 GB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent

Update Time : Thu Mar 7 17:51:13 2013
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0

Name : rescue:3
UUID : 45958b4b:1024b8cb:30a98470:705d7110
Events : 67654

Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 20 1 active sync /dev/sdb4

2. fdisk -l gives the following:

Quote:

WARNING: GPT (GUID Partition Table) detected on '/dev/sdb'! The util fdisk doesn't support GPT. Use GNU Parted.

Disk /dev/sdb: 3000.6 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sdb1 1 267350 2147483647+ ee GPT
Partition 1 does not start on physical sector boundary.

WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util fdisk doesn't support GPT. Use GNU Parted.

Disk /dev/sda: 3000.6 GB, 3000592982016 bytes
256 heads, 63 sectors/track, 363376 cylinders
Units = cylinders of 16128 * 512 = 8257536 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sda1 1 266306 2147483647+ ee GPT
Partition 1 does not start on physical sector boundary.

Disk /dev/md0: 34.4 GB, 34358616064 bytes
2 heads, 4 sectors/track, 8388334 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/md0 doesn't contain a valid partition table

Disk /dev/md1: 536 MB, 536858624 bytes
2 heads, 4 sectors/track, 131069 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/md1 doesn't contain a valid partition table

Disk /dev/md2: 1099.5 GB, 1099510439936 bytes
2 heads, 4 sectors/track, 268435166 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/md2 doesn't contain a valid partition table

Disk /dev/md3: 1866.2 GB, 1866181442560 bytes
2 heads, 4 sectors/track, 455610703 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/md3 doesn't contain a valid partition table

Disk /dev/md127: 1866.2 GB, 1866181442560 bytes
2 heads, 4 sectors/track, 455610703 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/md127 doesn't contain a valid partition table

Also, here is the fstab:

Quote:

proc /proc proc defaults 0 0
none /dev/pts devpts gid=5,mode=620 0 0
/dev/md/0 none swap sw 0 0
/dev/md/1 /boot ext3 defaults 0 0
/dev/md/2 / ext4 defaults 0 0
#Old entry
#/dev/md/3 /home ext4 defaults 0 0
/dev/md127 /home ext4 defaults 0 0
/dev/md3 /home ext4 defaults 0 0
tmpfs /ramdisk tmpfs defaults,size=6000M 0 0

3. I have read somewhere that if a RAID can't be loaded, the system tries to create a new RAID set as a mirror of the broken one, and that all of these automatically created RAIDs are numbered from 127 downwards. So in this case I would assume that the setup was md0, md1, md2, md3. The Hetzner wiki page about repairing a broken RAID also describes this setup. However, if you think this is not yet enough information, we can contact the provider again to be 100% sure.

4. Hmm... more or less. We had most of our backups in the home directory, which has been rolled back by 5 months (I honestly don't understand why 5 months), and we only have backups that are 2-3 months old on external devices. We are in a rather problematic situation.

EDIT: By the way, if we can't manage to get the RAIDs mounted again, do you know any way of extracting or dumping the content of a RAID1 set to a directory, or isn't this possible? We are planning on formatting the whole system IF we can get the files back.
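
The two mdadm --detail outputs above show the same UUID and name for md3 and md127 but very different Events counters. As far as I know, members with the same UUID belong to one array, and the member with the higher Events count was written to more recently. A minimal sketch of that comparison follows; the helper name detail_field is hypothetical, and the sample values are copied from the outputs above rather than read from the devices.

```shell
# Extract one "Key : value" field from mdadm --detail style text on stdin.
detail_field() {   # usage: detail_field KEY < text
    awk -F' : ' -v key="$1" '
        { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }   # trim padding around the key
        k == key { print $2; exit }
    '
}

# Values copied from the md3 and md127 details posted above.
md3_detail='UUID : 45958b4b:1024b8cb:30a98470:705d7110
Events : 3101836'
md127_detail='UUID : 45958b4b:1024b8cb:30a98470:705d7110
Events : 67654'

uuid_a=$(printf '%s\n' "$md3_detail"   | detail_field UUID)
uuid_b=$(printf '%s\n' "$md127_detail" | detail_field UUID)
ev_a=$(printf '%s\n' "$md3_detail"   | detail_field Events)
ev_b=$(printf '%s\n' "$md127_detail" | detail_field Events)

# Same UUID: both are halves of one array.
if [ "$uuid_a" = "$uuid_b" ]; then same_array=yes; else same_array=no; fi

# Higher event count: that half saw writes more recently.
if [ "$ev_a" -gt "$ev_b" ]; then
    newer=md3
elif [ "$ev_a" -lt "$ev_b" ]; then
    newer=md127
else
    newer=equal
fi

echo "same array: $same_array, more recent half: $newer"
```

With these numbers the md3 half (sda4) looks like the current one and the md127 half (sdb4) like the stale one, which would fit /home appearing rolled back while md127 was mounted. This matters because a member that is re-added to a mirror gets overwritten during the rebuild, so it is worth confirming which half holds the wanted data before re-adding anything.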

whizje · 03-08-2013, 05:53 AM

Code:

mdadm -S /dev/md127
mdadm --add /dev/md3 /dev/sdb4

Stop the md127 array and add the disk back to md3. Remove md127 from fstab and add /dev/md/3 back to fstab.
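
One detail about the fstab posted above: it has two active entries for /home, one on /dev/md127 and one on /dev/md3. A small sketch for spotting such duplicates before running mount is shown below. The function name duplicate_mount_points is hypothetical, and the sample is the fstab quoted above; on the server the function would read /etc/fstab.

```shell
# List mount points that occur more than once among active fstab lines.
# Reads fstab-formatted text on stdin.
duplicate_mount_points() {
    awk '
        /^[ \t]*#/ || NF < 2 { next }   # skip comments and blank lines
        $2 == "none" { next }           # swap entries have no mount point
        seen[$2]++ == 1 { print $2 }    # report each duplicate once, on its second occurrence
    '
}

# fstab as quoted earlier in the thread.
fstab_sample='proc /proc proc defaults 0 0
none /dev/pts devpts gid=5,mode=620 0 0
/dev/md/0 none swap sw 0 0
/dev/md/1 /boot ext3 defaults 0 0
/dev/md/2 / ext4 defaults 0 0
#Old entry
#/dev/md/3 /home ext4 defaults 0 0
/dev/md127 /home ext4 defaults 0 0
/dev/md3 /home ext4 defaults 0 0
tmpfs /ramdisk tmpfs defaults,size=6000M 0 0'

printf '%s\n' "$fstab_sample" | duplicate_mount_points

# On a real system:
#   duplicate_mount_points < /etc/fstab
```

After the change described above, only a single /home line should remain, and the check should print nothing.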

ravand · 03-08-2013, 06:01 AM

OK, I have done that. Now I get the following:

Quote:

Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md3 : active (auto-read-only) raid1 sda4[0] sdb4[1]
1822442815 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0] sdb3[1]
1073740664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
524276 blocks super 1.2 [2/2] [UU]

md0 : active (auto-read-only) raid1 sda1[0] sdb1[1]
33553336 blocks super 1.2 [2/2] [UU]

unused devices: <none>

The home folder is still empty, though. I still get the "auto-read-only", which I didn't have before. Any explanation?
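
On the "(auto-read-only)" question: as far as I know, md puts an array in that state when it has been assembled but nothing has written to it yet. It switches to read-write on the first write, and a pending rebuild waits until then. That would also fit md0, which is the swap array here. A minimal sketch for listing arrays in that state follows; the function name auto_read_only_arrays is hypothetical, and the sample is the mdstat output quoted above.

```shell
# Print the name of every md array currently flagged (auto-read-only).
# Reads mdstat-formatted text on stdin.
auto_read_only_arrays() {
    awk '/^md[0-9]+ :/ && /\(auto-read-only\)/ { print $1 }'
}

# mdstat output as quoted in the post above.
mdstat_after_add='md3 : active (auto-read-only) raid1 sda4[0] sdb4[1]
      1822442815 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda3[0] sdb3[1]
      1073740664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      524276 blocks super 1.2 [2/2] [UU]

md0 : active (auto-read-only) raid1 sda1[0] sdb1[1]
      33553336 blocks super 1.2 [2/2] [UU]'

printf '%s\n' "$mdstat_after_add" | auto_read_only_arrays

# On a real system:
#   auto_read_only_arrays < /proc/mdstat
# Switching an array to read-write by hand (this lets a pending rebuild start,
# so only do it once you are sure which half holds the data you want):
#   mdadm --readwrite /dev/md3
```

The listing itself is read-only; the commented mdadm line is the only part that would change the array state, and it is left commented out on purpose.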

whizje · 03-08-2013, 06:06 AM

after editing fstab do a

Code:

mount -a

ravand · 03-08-2013, 06:08 AM

This is what I get for mount -a:

Quote:

mount: none already mounted or /dev/pts busy
mount: according to mtab, devpts is already mounted on /dev/pts
mount: wrong fs type, bad option, bad superblock on /dev/md3,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so

mount: wrong fs type, bad option, bad superblock on /dev/md3,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so

EDIT:
Here is the dmesg tail:

Quote:

[ 19.551459] INFO-xpp: FEATURE: with sync_tick() from DAHDI
[ 19.643844] INFO-xpp_usb: revision Unknown
[ 19.644024] usbcore: registered new interface driver xpp_usb
[ 20.307779] dahdi: Registered tone zone 0 (United States / North America)
[ 21.759758] eth0: no IPv6 routers present
[ 34.032136] [drm] Initialized drm 1.1.0 20060810
[ 34.635433] lp: driver loaded but no devices found
[ 34.771899] ppdev: user-space parallel port driver
[ 142.734088] EXT4-fs (md3): VFS: Can't find ext4 filesystem
[ 142.742375] EXT4-fs (md3): VFS: Can't find ext4 filesystem

EDIT2:
md3 details give this:

Quote:

/dev/md3:
Version : 1.2
Creation Time : Sat Jun 23 13:47:29 2012
Raid Level : raid1
Array Size : 1822442815 (1738.02 GiB 1866.18 GB)
Used Dev Size : 1822442815 (1738.02 GiB 1866.18 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent

Update Time : Fri Mar 8 12:58:03 2013
State : clean, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1

Name : rescue:3
UUID : 45958b4b:1024b8cb:30a98470:705d7110
Events : 3101840

Number Major Minor RaidDevice State
0 8 4 0 active sync /dev/sda4
1 8 20 1 spare rebuilding /dev/sdb4

It now says "spare rebuilding"

whizje · 03-08-2013, 06:23 AM

try fsck /dev/md3