[SOLVED] Raid Repair Now wont boot - Other mounting problems
OK, so it has been a long three days of frustration and waiting.
Here is my situation.
1) The server went down and was booted into rescue mode, where I was able to use PuTTY to find out that my md3 (software RAID) was degraded. That is odd, because md3 is just /var, not /, so in theory the server should still boot; I just wouldn't have my www files.
2) I managed to repair the RAID using mdadm, and the status is all good.
3) I rebooted, and the server never came back online in normal mode (boot from HD), so I put it back into rescue mode.
4) I used to be able to go into rescue mode and type
Code:
mount /dev/md3 /mnt/
This would mount the RAID 1 array and let me view the files.
5) Now, a couple of reboots later, the server still does not boot, and I can no longer run
Code:
mount /dev/md3 /mnt/
It fails with this error:
Code:
root@rescue:/var/log# mount /dev/md3 /mnt
/dev/md3 looks like swapspace - not mounted
mount: you must specify the filesystem type
6) I am upset and frustrated that the server no longer boots after repairing the RAID. I haven't changed a setting or config in months; it's just a file server.
Any help would be great, and I'd even pay for some help. I have AIM/MSN/Skype/GTalk if anyone knows this stuff well and can lend a quick hand.
The arrays are md1 and md3, with the corresponding members sda1/sdb1 and sda3/sdb3.
Here is my fdisk output:
Code:
root@rescue:~# fdisk -l
Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000d0305
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        5100    40958976+  fd  Linux raid autodetect
/dev/sda2            5100        8924    30718976   82  Linux swap / Solaris
/dev/sda3            8924      243201  1881830400   fd  Linux raid autodetect
Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000e5562
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1        5100    40958976+  fd  Linux raid autodetect
/dev/sdb2            5100        8924    30718976   82  Linux swap / Solaris
/dev/sdb3            8924      243201  1881830400   fd  Linux raid autodetect
Disk /dev/md3: 1927.0 GB, 1926994264064 bytes
2 heads, 4 sectors/track, 470457584 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/md3 doesn't contain a valid partition table
Disk /dev/md1: 41.9 GB, 41941925888 bytes
2 heads, 4 sectors/track, 10239728 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/md1 doesn't contain a valid partition table
When I try to run fsck manually, I get:
Code:
root@rescue:~# fsck -fc /dev/sda1
fsck from util-linux-ng 2.17.2
fsck: fsck.linux_raid_member: not found
fsck: Error 2 while executing fsck.linux_raid_member for /dev/sda1
Or, if I try it on md1 or md3, I get:
Code:
root@rescue:~# fsck -fc /dev/md1
fsck from util-linux-ng 2.17.2
fsck: fsck.swap: not found
fsck: Error 2 while executing fsck.swap for /dev/md1
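Both fsck errors above come from the same mechanism: fsck is a front end that detects the signature on the device and then runs a helper named fsck.<type>. A raw partition such as /dev/sda1 carries a linux_raid_member signature, and /dev/md1 is being detected as swap, and neither type has a helper. The following is a minimal sketch of that mapping; `explain_signature` is a hypothetical helper name, not part of util-linux, and the commented blkid call is shown only as one way the type could be read on the rescue system.

```shell
#!/bin/sh
# Sketch: describe what a detected signature type means for fsck.
# explain_signature is a hypothetical helper, not a util-linux tool.
explain_signature() {
    case "$1" in
        linux_raid_member)
            echo "RAID member: check the assembled md device, not the raw partition" ;;
        swap)
            echo "swap signature: there is no fsck.swap, so fsck cannot check it" ;;
        "")
            echo "no signature detected" ;;
        *)
            echo "filesystem: fsck would dispatch to fsck.$1" ;;
    esac
}

# On the rescue system the type could be read with blkid (needs root):
#   explain_signature "$(blkid -o value -s TYPE /dev/md1)"
explain_signature linux_raid_member
explain_signature swap
```

This only explains the error messages; it does not say why /dev/md1 and /dev/md3 now carry a swap signature.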
You still haven't told us what you did in step #2.
What does mdadm --misc --detail /dev/md3 say?
Since sdb3 was not part of the RAID, I added it back and then ran the repair. I followed this, modifying it as needed, because no new drive was installed; that partition just had a glitch.
So I ended up doing this:
Code:
mdadm /dev/md3 --manage --add /dev/sdb3
Result of mdadm --misc --detail /dev/md3: both md1 (boot) and md3 are clean.
Code:
root@rescue:~# mdadm --misc --detail /dev/md3
/dev/md3:
Version : 0.90
Creation Time : Fri Jan 27 17:55:17 2012
Raid Level : raid1
Array Size : 1881830336 (1794.65 GiB 1926.99 GB)
Used Dev Size : 1881830336 (1794.65 GiB 1926.99 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 3
Persistence : Superblock is persistent
Update Time : Sun May 20 20:48:03 2012
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 4f96ca65:0859f8bf:a4d2adc2:26fd5302 (local to host rescue.ovh.net)
Events : 0.1089706
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 8 19 1 active sync /dev/sdb3
So you're saying that your RAID device wasn't working at all (in the sense that it didn't contain a valid file system), and further inspection revealed that only /dev/sda3 was part of the array?
That means the RAID 1 array was degraded, not broken. You should still have been able to mount /dev/md3. The fact that you couldn't indicates that the data on /dev/sda3 is corrupt.
You then added /dev/sdb3 to the degraded array with mdadm /dev/md3 --manage --add /dev/sdb3. Wouldn't that initiate a synchronization, causing the entire /dev/sdb3 to be overwritten with the (known corrupt) data from /dev/sda3?
Could you post the output from smartctl -a /dev/sda?
Here is that output. As for the other statement: sdb3 was considered degraded and was not showing up as part of the RAID, so I simply re-added sdb3 to the RAID. The "A" disk did not have the problem. I know this because after the array repaired itself, I could mount it and see the files.
Output for A:
Code:
root@rescue:~# smartctl -a /dev/sda
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: Hitachi HDS723020BLA642
Serial Number: MN1220F32U0U5D
Firmware Version: MN6OA5C0
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sun May 20 21:08:40 2012 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (18950) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 133 133 054 Pre-fail Offline - 93
3 Spin_Up_Time 0x0007 147 147 024 Pre-fail Always - 390 (Average 390)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 44
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 135 135 020 Pre-fail Offline - 26
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 2737
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 44
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 44
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 44
194 Temperature_Celsius 0x0002 162 162 000 Old_age Always - 37 (Lifetime Min/Max 20/46)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Output for B:
Code:
root@rescue:~# smartctl -a /dev/sdb
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: WDC WD2002FAEX-007BA0
Serial Number: WD-WMAY02495554
Firmware Version: 05.01D05
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun May 20 21:13:12 2012 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (29580) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3037) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 157 129 051 Pre-fail Always - 171761
3 Spin_Up_Time 0x0027 253 253 021 Pre-fail Always - 8166
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 112
5 Reallocated_Sector_Ct 0x0033 171 171 140 Pre-fail Always - 230
7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 4275
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 110
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 109
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2
194 Temperature_Celsius 0x0022 112 107 000 Old_age Always - 40
196 Reallocated_Event_Count 0x0032 021 021 000 Old_age Always - 179
197 Current_Pending_Sector 0x0032 200 198 000 Old_age Always - 13
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 187 187 000 Old_age Offline - 2787
SMART Error Log Version: 1
Warning: ATA error count 5774 inconsistent with error log pointer 3
ATA Error Count: 5774 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 5774 occurred at disk power-on lifetime: 3825 hours (159 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 58 eb 8a 77 e2 Error: UNC 88 sectors at LBA = 0x02778aeb = 41388779
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 58 b8 8a 77 e2 08 21d+11:23:24.946 READ DMA
c8 00 08 b0 8a 77 e2 08 21d+11:23:24.929 READ DMA
c8 00 38 78 8a 77 e2 08 21d+11:23:24.587 READ DMA
c8 00 20 50 8a 77 e2 08 21d+11:23:24.121 READ DMA
c8 00 08 10 8a 77 e2 08 21d+11:23:24.121 READ DMA
Error 5773 occurred at disk power-on lifetime: 3825 hours (159 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 11 90 c5 e2 Error: UNC 8 sectors at LBA = 0x02c59011 = 46501905
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 10 90 c5 e2 08 21d+11:07:40.587 READ DMA
c8 00 08 10 b5 a2 e2 08 21d+11:07:40.575 READ DMA
c8 00 08 00 90 c5 e2 08 21d+11:07:40.514 READ DMA
c8 00 08 e8 a0 73 e2 08 21d+11:07:40.512 READ DMA
c8 00 08 e0 a0 73 e2 08 21d+11:07:40.499 READ DMA
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 1534 -
# 2 Short offline Completed without error 00% 1523 -
# 3 Short offline Completed without error 00% 1523 -
# 4 Short offline Completed without error 00% 24 -
# 5 Short offline Completed without error 00% 13 -
# 6 Short offline Completed without error 00% 13 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
If I read this correctly, the B drive appears to be failing. Could this cause the server not to boot, even though the A drive is just fine? Even if so, I should be able to mount just A and look at it, but I can't do that.
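One way to narrow this down is to check which partition the two logged read errors on /dev/sdb fall into. The sketch below is only arithmetic on numbers from the outputs above: the LBAs come from the SMART error log and the partition size from the fdisk "Blocks" column. The start sector of 2048 is an assumption, because the cylinder-mode fdisk listing does not show start sectors, and `lba_in_range` is a hypothetical helper name.

```shell
#!/bin/sh
# lba_in_range LBA START COUNT: succeed when START <= LBA < START + COUNT.
lba_in_range() {
    [ "$1" -ge "$2" ] && [ "$1" -lt $(( $2 + $3 )) ]
}

# ASSUMPTION: sdb1 starts at sector 2048 (not shown in the fdisk output).
SDB1_START=2048
# 40958976 one-KiB blocks (fdisk "Blocks" column) = 81917952 512-byte sectors.
SDB1_SECTORS=$(( 40958976 * 2 ))

# LBAs of the two UNC errors shown in the SMART error log for /dev/sdb.
for lba in 41388779 46501905; do
    if lba_in_range "$lba" "$SDB1_START" "$SDB1_SECTORS"; then
        echo "LBA $lba: inside sdb1"
    else
        echo "LBA $lba: outside sdb1"
    fi
done
```

If the assumed start sector is roughly right, both logged errors sit in sdb1, the md1 member, rather than in sdb3. The drive keeps only the five most recent of its 5774 errors, so this says nothing about where the earlier ones occurred.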
Maybe the mdadm terminology is a bit off, because "degraded" is one possible state of a RAID array, not the state of any single member of an array.
After you added /dev/sdb3 to /dev/md3, what (if any) other mdadm commands did you run?
Anyway, /dev/sda seems good from the S.M.A.R.T. data. Specifically, "Reallocated_Sector_Ct" and "Current_Pending_Sector" are both 0. Could you post the same data for /dev/sdb?
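The same two attributes can be pulled out of a long smartctl listing with a small filter. This is a rough sketch, assuming the ten-column attribute layout shown in the outputs above, where the raw value is the tenth field; `flag_smart_attrs` is a hypothetical helper name, and the sample rows are copied from the /dev/sdb output in this thread.

```shell
#!/bin/sh
# flag_smart_attrs: read smartctl attribute rows on stdin and print the
# watched attributes whose RAW_VALUE (column 10) is greater than zero.
flag_smart_attrs() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" {
             if ($10 + 0 > 0) print $2 "=" $10
         }'
}

# Sample rows copied from the /dev/sdb output posted in this thread.
flag_smart_attrs <<'EOF'
  5 Reallocated_Sector_Ct 0x0033 171 171 140 Pre-fail Always - 230
  9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 4275
197 Current_Pending_Sector 0x0032 200 198 000 Old_age Always - 13
EOF
```

On a live system the input would come from something like `smartctl -A /dev/sdb | flag_smart_attrs`. An empty result means both counters are zero, as they are for /dev/sda here.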
I just updated my post with the B data below the A data. That drive appears to be bad.
As for what else I ran, I just used this:
Code:
cat /proc/mdstat
It showed something like this sample. When that finished, it showed 2/2 UU, and it showed green again in the web GUI for rescue mode.
If you look at http://help.ovh.co.uk/RaidSoft, I might have messed up the very bottom part of that guide, because the swap commands did nothing and errored. I might not have used the right device letters; I don't know, I'm stuck. I know my data is good on drive A; it is just not accessible for some reason.
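The "2/2 UU" check on /proc/mdstat can be scripted so that a missing member is hard to overlook. Below is a minimal sketch that reads mdstat-style text on stdin and reports each array as ok or degraded based on the [UU] / [U_] flags. `mdstat_health` is a hypothetical helper name, and the sample input is made up for illustration (the block counts match the array sizes in this thread, but the degraded state of md1 is not taken from it).

```shell
#!/bin/sh
# mdstat_health: read /proc/mdstat-style text on stdin and print one line
# per array: name, member flags, and "ok" or "degraded".
mdstat_health() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        name != "" && match($0, /\[[U_]+\]/) {
            flags = substr($0, RSTART, RLENGTH)
            print name, flags, (index(flags, "_") ? "degraded" : "ok")
            name = ""
        }'
}

# Hypothetical sample input, not copied from the server in this thread.
mdstat_health <<'EOF'
Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[1]
      1881830336 blocks [2/2] [UU]

md1 : active raid1 sda1[0]
      40958912 blocks [2/1] [U_]

unused devices: <none>
EOF
```

On the rescue system the input would be `mdstat_health < /proc/mdstat`. An underscore in the flags marks a missing or failed member.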
That is one seriously broken drive. You should unplug /dev/sdb immediately, or at the very least use mdadm /dev/md3 --manage --fail /dev/sdb3 (and repeat the command for md1 and /dev/sdb1).
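For reference, the two commands implied above can be generated in one pass. This is a cautious sketch that only prints the commands by default and runs them only when APPLY=1 is set (as root). The array and member pairs are the ones described in this thread; `run` and `fail_sdb_members` are hypothetical helper names.

```shell
#!/bin/sh
# run: print the command by default; execute it only when APPLY=1.
run() {
    if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# Mark every /dev/sdb member faulty in its array.
fail_sdb_members() {
    # Array:member pairs as described in this thread.
    for pair in md1:sdb1 md3:sdb3; do
        array=${pair%%:*}
        member=${pair##*:}
        run mdadm "/dev/$array" --manage --fail "/dev/$member"
    done
}

fail_sdb_members
```

Printing first makes it easy to confirm the device names before anything is changed on the arrays.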
I am not able to access the server physically, as it is in another country, lol. Will these commands do the same thing as unplugging the drive, leaving it as a single-drive server until the datacenter is able to put a new drive in?
Alright, I ran both of those commands and got this output:
Code:
root@rescue:~# mdadm /dev/md1 --manage --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md1
root@rescue:~# mdadm /dev/md3 --manage --fail /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md3
I'm guessing that is good?
Should I tell the server to boot from the hard drive now, or do I need to change other things to get the server to boot? Shouldn't the RAID just say, "Hey, there is a good drive here, we can use this"?
Well, the system never comes online after a reboot, so something must be wrong. Should I be able to mount anything? I would just like to rsync the data somewhere and start over.