SATA Issues

claymen · 07-12-2007, 07:22 PM

Hi all,

I've just built a new file server using 5x 500GB WD drives and a Silicon Image 3124 PCI 4-port SATA card. Using software RAID (mdadm), the card, and one of the onboard SATA ports, I have my RAID5 array. Things were fine until I really hammered the array; now I'm getting what appear to be sector errors like the following.

Quote:

[ 885.460000] ata5.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x2 frozen
[ 885.460000] ata5.00: cmd 61/00:00:00:ee:e1/01:00:01:00:00/40 tag 0 cdb 0x0 data 131072 out
[ 885.460000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 885.460000] ata5.00: cmd 61/08:08:00:ef:e1/00:00:01:00:00/40 tag 1 cdb 0x0 data 4096 out
[ 885.460000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 885.460000] ata5.00: cmd 61/00:10:00:f3:e1/01:00:01:00:00/40 tag 2 cdb 0x0 data 131072 out
[ 885.460000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 885.460000] ata5.00: cmd 61/f8:18:08:ef:e1/00:00:01:00:00/40 tag 3 cdb 0x0 data 126976 out
[ 885.460000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 885.460000] ata5.00: cmd 61/08:20:00:f0:e1/00:00:01:00:00/40 tag 4 cdb 0x0 data 4096 out
[ 885.460000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 885.460000] ata5.00: cmd 61/f8:28:08:f0:e1/00:00:01:00:00/40 tag 5 cdb 0x0 data 126976 out
[ 885.460000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 885.460000] ata5.00: cmd 61/00:30:00:ed:e1/01:00:01:00:00/40 tag 6 cdb 0x0 data 131072 out
[ 885.460000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 885.460000] ata5.00: cmd 61/00:38:00:f1:e1/01:00:01:00:00/40 tag 7 cdb 0x0 data 131072 out
[ 885.460000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 885.460000] ata5.00: cmd 61/00:40:00:f2:e1/01:00:01:00:00/40 tag 8 cdb 0x0 data 131072 out
[ 885.460000] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 885.772000] ata5: soft resetting port
[ 895.860000] ata5: softreset failed (timeout)
[ 895.860000] ata5: hard resetting port
[ 898.252000] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 898.252000] ata5.00: ata_hpa_resize 1: hpa sectors (0) is smaller than sectors (976773168)
[ 898.252000] ata5.00: failed to set xfermode (err_mask=0x1)
[ 898.252000] ata5: failed to recover some devices, retrying in 5 secs
[ 903.256000] ata5: hard resetting port
[ 905.648000] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 905.684000] ata5.00: ata_hpa_resize 1: hpa sectors (0) is smaller than sectors (976773168)
[ 905.684000] ata5.00: failed to set xfermode (err_mask=0x1)
[ 905.684000] ata5: limiting SATA link speed to 1.5 Gbps
[ 905.684000] ata5.00: limiting speed to UDMA/100:PIO3
[ 905.684000] ata5: failed to recover some devices, retrying in 5 secs
[ 910.688000] ata5: hard resetting port
[ 913.080000] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 913.116000] ata5.00: ata_hpa_resize 1: hpa sectors (0) is smaller than sectors (976773168)
[ 913.116000] ata5.00: failed to set xfermode (err_mask=0x1)
[ 913.120000] ata5.00: disabled
[ 913.624000] ata5: EH complete
[ 913.624000] sd 4:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[ 913.624000] end_request: I/O error, dev sdd, sector 31584768
[ 913.624000] raid5: Disk failure on sdd, disabling device. Operation continuing on 4 devices
[ 913.624000] sd 4:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[ 913.624000] end_request: I/O error, dev sdd, sector 31584512
[ 913.624000] sd 4:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[ 913.624000] end_request: I/O error, dev sdd, sector 31583488
[ 913.624000] sd 4:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[ 913.624000] end_request: I/O error, dev sdd, sector 31584264
[ 913.624000] sd 4:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[ 913.624000] end_request: I/O error, dev sdd, sector 31584256
[ 913.624000] sd 4:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[ 913.624000] end_request: I/O error, dev sdd, sector 31584008
[ 913.624000] sd 4:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[ 913.624000] end_request: I/O error, dev sdd, sector 31585024
[ 913.624000] sd 4:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[ 913.624000] end_request: I/O error, dev sdd, sector 31584000
[ 913.624000] sd 4:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[ 913.624000] end_request: I/O error, dev sdd, sector 31583744
[ 913.624000] sd 4:0:0:0: [sdd] READ CAPACITY failed
[ 913.624000] sd 4:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
[ 913.624000] sd 4:0:0:0: [sdd] Sense not available.
[ 913.624000] sd 4:0:0:0: [sdd] Write Protect is off
[ 913.624000] sd 4:0:0:0: [sdd] Mode Sense: 00 00 00 00
[ 913.624000] sd 4:0:0:0: [sdd] Asking for cache data failed
[ 913.624000] sd 4:0:0:0: [sdd] Assuming drive cache: write through
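
A note on reading the cmd lines in the quote above. As far as I understand libata's error report, the twelve hex bytes are command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device. The Python sketch below decodes them under that assumption. The opcode table only covers the three opcodes that show up in this thread, and decode_cmd is a made-up helper name, not part of any existing tool.

```python
import re

# Only the opcodes that appear in this thread; not a complete ATA table.
ATA_OPCODES = {
    0x35: "WRITE DMA EXT",
    0x60: "READ FPDMA QUEUED",
    0x61: "WRITE FPDMA QUEUED",
}
NCQ_OPCODES = {0x60, 0x61}

# Twelve hex bytes separated by "/" or ":" after the word "cmd".
CMD_RE = re.compile(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_cmd(line):
    """Decode one libata 'cmd' line; returns None if the line has no cmd field."""
    m = CMD_RE.search(line)
    if m is None:
        return None
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah,
     device) = [int(x, 16) for x in re.split(r"[/:]", m.group(1))]

    # 48-bit LBA: the "hob" (high order byte) fields hold the upper 24 bits.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)

    if command in NCQ_OPCODES:
        # NCQ commands carry the sector count in the feature fields
        # and the queue tag in the upper five bits of nsect.
        sectors = hob_feature << 8 | feature
        tag = nsect >> 3
    else:
        sectors = hob_nsect << 8 | nsect
        tag = None
    if sectors == 0:
        sectors = 65536  # a count of zero means the maximum for 48-bit commands

    return {
        "name": ATA_OPCODES.get(command, "opcode 0x%02x" % command),
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * 512,
        "tag": tag,
    }


if __name__ == "__main__":
    sample = ("ata5.00: cmd 61/08:08:00:ef:e1/00:00:01:00:00/40 "
              "tag 1 cdb 0x0 data 4096 out")
    print(decode_cmd(sample))
```

For the tag 1 line this gives a queued write of 8 sectors at LBA 31584000, which agrees with the data 4096 field and with one of the end_request sector numbers further down the quote. So the I/O errors reported against sdd look like the same queued writes that timed out, not a separate set of bad sectors.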

It seems to happen when I am really hammering the array, and the drive won't come back until I reboot. A few times, when I kept running the array in degraded mode, it dropped another drive. It's totally random, as sometimes it doesn't happen at all. At first I thought it was simply a bad drive, but the failing drive changes, and I have also done a complete read/write bad-block scan of each of the drives with no errors. I also zeroed the drives prior to use.

I'm running Ubuntu and I've tried a few kernels; 2.6.20 and 2.6.22 both have the same issue, and it appears in both 64-bit and 32-bit modes. I don't believe this is purely a Linux issue. I'm almost leaning towards the SATA card itself overheating, as I have noticed the back of the card gets pretty warm, which *could* possibly be the issue. I just haven't had a chance to test that yet.

Basically I'm chasing any ideas about what could be causing it. Am I right in thinking it's overheating? It makes sense in my mind: the card overheats and drops a channel, then you keep hammering it and it drops another. But I don't want to thermal-epoxy a heatsink onto this sucker and then have to send it back for warranty if it's just a dead card. That said, I got a good run of approx 2 days without any issues just copying data to it, so I'm still not sure what's causing the problem.

For further information, the motherboard is an Asus A8S-X with an Athlon64 X2 4200+ and 2x512MB of DDR433. Aside from the drives messing around it's a really fast system, but I just want to get my array back up and running. I have also spent about a week on this issue browsing the web looking for similar problems. The only thing I have found is a possible problem in an older kernel where the 3124 locked up under high load, which again could lend itself to the heat theory.

Any help would be greatly appreciated.

claymen · 08-08-2007, 08:01 PM

Since my initial post the array was working fine for some time; however, it has now failed again with the same issue.

Things that I have done so far
* Heatsink onto the SATA chipset with a fan blowing air through
* Forced all drives to SATA1/SATA150 via jumper on the drive
* Multiple kernels tested
* BIOS update for SATA card

And still no luck. It randomly spews those errors and fails out claiming bad sectors, but when I scan the drives they come out fine. I'm starting to think it's the controller itself, but then again that should be happening on all channels, not just two. All the cabling is in securely, but I still can't figure this out.
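
One way to check whether the errors really stick to particular channels is to tally them per ata port from saved kernel logs. The sketch below is only an illustration: tally_ports is a hypothetical helper, and the two patterns are taken from the log lines quoted in this thread rather than from any complete list of libata messages.

```python
import re
from collections import Counter

# Patterns modelled on the lines quoted in this thread.
EXCEPTION_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: exception Emask")
DISABLED_RE = re.compile(r"\b(ata\d+)\.\d+: disabled")


def tally_ports(log_text):
    """Count libata exceptions and 'disabled' events per ata port."""
    exceptions = Counter()
    disabled = Counter()
    for line in log_text.splitlines():
        m = EXCEPTION_RE.search(line)
        if m:
            exceptions[m.group(1)] += 1
        m = DISABLED_RE.search(line)
        if m:
            disabled[m.group(1)] += 1
    return exceptions, disabled


if __name__ == "__main__":
    import sys
    # Usage (assumed): dmesg > dmesg.txt, then run this script on dmesg.txt.
    with open(sys.argv[1]) as f:
        exceptions, disabled = tally_ports(f.read())
    for port, count in exceptions.most_common():
        print("%s: %d exceptions, %d disabled" % (port, count, disabled[port]))
```

If the counts pile up on the same one or two ports across several failures, that points at whatever hangs off those ports (drive, cable, or that channel) more than at the controller as a whole.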

Any ideas from anyone? Is anyone running the Silicon Image 3124 card and NOT having problems?

antibios · 08-14-2007, 06:33 PM

Hey,
I'm using the Silicon Image, Inc. SiI 3112 and I'm getting much the same problems. Mine is set up as a RAID 1, and while I'm trying to copy data to it I get the following:

[52238.066384] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[52238.066395] ata3.00: cmd 35/00:00:57:e8:bf/00:04:2c:00:00/e0 tag 0 cdb 0x0 data 524288 out
[52238.066398] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[52238.378125] ata3: soft resetting port
[52238.533975] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[52238.543241] ata3.00: ata_hpa_resize 1: sectors = 976773168, hpa_sectors = 976773168
[52238.555222] ata3.00: ata_hpa_resize 1: sectors = 976773168, hpa_sectors = 976773168
[52238.555232] ata3.00: configured for UDMA/33
[52238.555253] ata3: EH complete
[52238.559857] SCSI device sdc: 976773168 512-byte hdwr sectors (500108 MB)
[52238.559955] sdc: Write Protect is off
[52238.559958] sdc: Mode Sense: 00 3a 00 00
[52238.590755] SCSI device sdc: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

---------------
00:00.0 Host bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE DRAM Controller/Host-Hub Interface (rev 02)
00:01.0 PCI bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE Host-to-AGP Bridge (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 82)
00:1f.0 ISA bridge: Intel Corporation 82801DB/DBL (ICH4/ICH4-L) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801DB (ICH4) IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) SMBus Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation NV17 [GeForce4 MX 440] (rev a3)
02:02.0 Multimedia controller: Philips Semiconductors SAA7133/SAA7135 Video Broadcast Decoder (rev d0)
02:03.0 RAID bus controller: Silicon Image, Inc. SiI 3112 [SATALink/SATARaid] Serial ATA Controller (rev 01)
02:08.0 Ethernet controller: Intel Corporation 82801DB PRO/100 VE (CNR) Ethernet Controller (rev 82)
--------------------------

claymen · 08-14-2007, 07:26 PM

I followed up on the LKML, and it looks like my drive was actually throwing errors and the Silicon Image card was simply reporting them. Upon further investigation, SMART revealed that each time the error occurred the reallocated sector count was going up. So basically, when the drive hit a bad spot and was reallocating on the fly, it'd throw that error, which of course would trip the array up. That said, the drives, even though only a month old, have not hit the critical threshold in the SMART system, and so I've replaced them under warranty. So far so good, but only time will tell.

Even the WD diagnostic tool failed within 10 seconds of running on the drive; it came back screaming with the error. So I'd check your drives with SMART or, alternatively, the equivalent vendor diagnostic tool.
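
For anyone wanting to watch the reallocated sector count the same way, here is a rough Python sketch that pulls attribute 5 out of the output of smartctl -A (from smartmontools, usually needs root). The helper names are made up for illustration, and the parsing assumes the usual ten-column attribute table that smartctl prints for ATA drives; other output layouts would need different handling.

```python
import subprocess


def parse_reallocated(smartctl_output):
    """Return the Reallocated_Sector_Ct row as a dict, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Expected columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            return {
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": int(fields[9]),
            }
    return None


def read_reallocated(device):
    """Run 'smartctl -A <device>' and parse the result (device path is an example)."""
    # smartctl uses non-zero exit codes as status flags, so no check=True here.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return parse_reallocated(result.stdout)


if __name__ == "__main__":
    print(read_reallocated("/dev/sdd"))
```

Logging the raw value before and after each burst of errors shows whether the count climbs with them, which is the pattern described above, even while the normalized value is still above the threshold.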

antibios · 08-16-2007, 01:10 AM

Hey Claymen,
Thanks for your response. So I wandered off and grabbed the Maxtor/Seagate HD diagnostics ISO from their website (nice to see they are using FreeDOS), and ran both the Long and Short tests on my drives. Both drives reported fine (no errors). Just to check that there aren't any problems with the drives, I copied a fair amount of data onto one of them and then checked the md5sums of each of the files. They looked good. (Not the most comprehensive test, I know.)
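
The copy-then-checksum test can be scripted along these lines. This is a minimal sketch using Python's hashlib; compare_trees and md5_of are illustrative names, and the two directory paths are whatever source and destination you copied between.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large files don't fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(src_root, dst_root):
    """Return relative paths that are missing or differ under dst_root."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    mismatches = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        if not dst.is_file() or md5_of(src) != md5_of(dst):
            mismatches.append(str(rel))
    return mismatches


if __name__ == "__main__":
    import sys
    bad = compare_trees(sys.argv[1], sys.argv[2])
    print("all files match" if not bad else "mismatched: %s" % bad)
```

One caveat: files read back straight after copying may come from the page cache in RAM rather than from the disk, so it is safer to copy more data than the machine has memory, or to remount or reboot before verifying.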

Anyway, I'm more convinced now that the kernel driver is probably either pushing more data than it should, or I haven't got the right settings. Could you please point me to the thread that you were emailing on the lkml? If you found some helpful people there, I'd like to see if anyone has seen this problem before too.

Cheers,

Matt K

enderox · 08-31-2007, 01:03 AM

Hello Claymen/Antibios,

I can't really offer anything to this thread other than suggesting you check (say through a Google search) for A8S-X IRQ problems, which could cause trouble if there was an IRQ sharing conflict.

I do however want to commend you both. Looks like you have tried various scenarios and checks, detailed the errors and system specs and, more importantly, kept this thread up to date. I have an A8S-X, which is what sparked my interest and how I found this thread. Keep us posted on how you get on.

Mike...

claymen · 08-31-2007, 01:26 AM

I am aware of the IRQ sharing issues on the A8S-X board, but if the cards you are using are reasonably recent it tends not to be such an issue. Granted, performance would suffer slightly, but for the most part it's great. At the time I was having problems I tested the SATA card in all the slots, so it shared IRQs with different devices, and found no issue there. Unlike my A7V-Deluxe, which needed irqpoll to even recognize PCI devices...

So far I've found this board to be pretty damn stable, and as it turns out my problems were simply the drives themselves. After replacing them and monitoring with SMART tools I've not had any problems since *touch wood*. It's turned out to be a very affordable file server.

antibios · 09-02-2007, 07:46 PM

After testing the hard drives and being sure that they were both fine, I thought that I had better try installing a different controller. I got a second card, exactly the same model, and it works perfectly. No more errors, and I'm now getting between 40-60 MB/s on writes, when I was previously getting 6-10 MB/s.
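
For comparing write speed before and after a hardware swap, something like the following Python sketch gives a rough sequential-write figure. It is not how the numbers above were measured; write_throughput_mb_s is a hypothetical helper, and the sizes are arbitrary defaults. The fsync call is there so the timing includes flushing to the device instead of only filling the page cache.

```python
import os
import time


def write_throughput_mb_s(path, total_mb=256, block_kb=1024):
    """Write total_mb of zeros to path and return MB/s including the final fsync."""
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # force data out of the page cache
    elapsed = time.perf_counter() - start
    written_mb = blocks * block_kb / 1024
    return written_mb / elapsed


if __name__ == "__main__":
    import sys
    # Usage (assumed): pass a file path on the array under test.
    target = sys.argv[1]
    print("%.1f MB/s" % write_throughput_mb_s(target))
    os.remove(target)
```

Watching the kernel log while this runs is useful too, since in the reports in this thread the timeouts showed up during sustained writes.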

I'm thinking that perhaps the problem could be related to the fact that when I first installed the original card, I set it to work as a RAID1, but could never get it to work in Linux, so I reverted the settings on that card.

Either that, or it is just simply a broken card.

At any rate, it looks like both claymen and I have found the errors to be hardware-related.

oli · 06-05-2008, 11:34 AM

Hi all.

RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller
RAID bus controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller

I have both of the above controllers and both seem to have similar problems to what some of you have described. I'm using CentOS 5, but had the same problems with Fedora 8 and 9 beforehand...

My messages log would spit out the following kind of entries whenever the drives were accessed (not all the time, but on a daily basis, since the drives all get used):

Code:

kernel: ata11.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
kernel: ata11.00: (irq_stat 0x00060002, device error via SDB FIS)
kernel: ata11.00: cmd 60/08:00:ef:50:00/00:00:00:00:00/40 tag 0 cdb 0x0 data 4096 in
kernel:          res 41/01:00:ef:50:00/00:00:00:00:00/40 Emask 0x1 (device error)

Apart from this thread I saw this issue mentioned a few other times after doing many Google searches. I hope adding to this thread helps others with the same problems down the track.

I suggest replacing the controller with something better... It's worth spending extra on these types of components because problems here can cause real disasters with storage systems, particularly if you're running RAID5 arrays and controller errors cause multiple drives to drop out at once.

I've just ordered an 8000 series 3Ware card from eBay.

Cheers,
Oli

claymen · 06-07-2008, 02:37 AM

Depends on your application. Remember, there are advantages to not using hardware RAID.

With that said, my setup has been rock solid, and the problem was related to the drives themselves, not the controller as I had first thought. I wouldn't be so quick to recommend a hardware RAID card over any other card before checking out driver support. I've played with some excellent RAID controllers in the past, but they have had horrible Linux support.

The Silicon Image cards seem to have excellent support in the kernel and IMO most of these issues are related to other components or faulty disks or cabling.

oli · 06-07-2008, 03:40 AM

I don't actually intend to use the 3ware as a hardware RAID controller, simply because I like the flexibility of mdadm in the long run. My main reason for choosing it was that I know the Linux support is excellent, and the 12 ports should keep me going for a long time.

I am fairly sure my issue and the one discussed here is a bug in either the driver or the controller. I have been able to reproduce it with disks in software RAID arrays and as standalone disks now.

I've also tried at least 4 different cables and other connectors, as well as running the disks on other controllers (nForce onboard and a USB to SATA adaptor), which show no problem.

longbow-core · 06-28-2008, 01:01 AM

A very similar issue here with an MSI K9N Ultra-2F motherboard (onboard SATA-II controller) and a Seagate ST3750330AS SATA-II 750GB [SD15 firmware] disk drive:

During the initialization of hard drives in the BIOS (before GRUB starts) I receive a message that an error occurred with my SATA drive. Unfortunately I was not able to get any additional info from the BIOS regarding this issue, but my primary drive is an older 80GB IDE device, so I was able to boot and retrieve some error messages from dmesg:

Code:

ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: failed to read native max address (err_mask=0x1)
ata1.00: HPA support seems broken, will skip HPA handling
ata1.00: ATA-8: ST3750330AS, SD15, max UDMA/133
ata1.00: 0 sectors, multi 0: LBA NCQ (depth 0/32)
ata1.00: failed to set xfermode (err_mask=0x1)
ata1: failed to recover some devices, retrying in 5 secs
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: failed to set xfermode (err_mask=0x1)
ata1: limiting SATA link speed to 1.5 Gbps
ata1.00: limiting speed to UDMA/133:PIO3
ata1: failed to recover some devices, retrying in 5 secs
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: failed to set xfermode (err_mask=0x1)
ata1.00: disabled

It looks like it tries to downgrade to SATA-I automatically, but that does not seem to be a proper cure. The drive is cooled by additional fans, and its maximum temperature only reached 31 degrees Celsius.
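
The downgrade can be read straight off the SStatus and SControl values in the link-up lines. Both are hex register values; in the SATA register layout bits 3:0 are DET, bits 7:4 are SPD and bits 11:8 are IPM. The Python sketch below splits them out. describe_link is just an illustrative helper name, and only the speed codes relevant here are named.

```python
# SPD field codes: for SStatus this is the negotiated speed,
# for SControl it is the highest speed the host allows (0 = no limit).
LINK_SPEED = {1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def split_fields(hex_text):
    """Split an SStatus/SControl hex string into its DET, SPD and IPM nibbles."""
    value = int(hex_text, 16)
    return {
        "det": value & 0xF,
        "spd": (value >> 4) & 0xF,
        "ipm": (value >> 8) & 0xF,
    }


def describe_link(sstatus_hex, scontrol_hex):
    """Summarize a 'SATA link up ... (SStatus X SControl Y)' pair."""
    status = split_fields(sstatus_hex)
    control = split_fields(scontrol_hex)
    if control["spd"] == 0:
        limit = "no limit"
    else:
        limit = LINK_SPEED.get(control["spd"], "unknown")
    return {
        # DET == 3 means a device is present and the PHY link is established.
        "device_present": status["det"] == 3,
        "negotiated": LINK_SPEED.get(status["spd"], "none"),
        "speed_limit": limit,
    }


if __name__ == "__main__":
    print(describe_link("123", "300"))  # first link-up line
    print(describe_link("113", "310"))  # after "limiting SATA link speed"
```

So SStatus 123 / SControl 300 is a 3.0 Gbps link with no host-side limit, and SStatus 113 / SControl 310 is the same drive renegotiated at 1.5 Gbps with the host capping the speed. The drive is detected in both cases, and the set xfermode failure happens at both speeds, so the link rate itself does not look like the cause.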

If anyone is interested, here is the specification of the onboard controller:

Code:

00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)

If anyone is aware of possible bugs in the SATA drivers for this chipset, please let us know.