Help. Replaced unavail disk with new disk but still unavail

everyday · 04-28-2021, 05:36 PM

Hi. We have a Solaris 11.3 system that was purchased many years ago. The company is no longer dealing with Solaris installs.
I am the sysadmin but know next to nothing about Solaris, though I do know a bit of Linux.
Our Solaris unit has 36 SAS hard drives. One of them in tank1 has a red light (c15t1d30). I have replaced it with another brand new drive of exactly the same model.
Another slot (c15t1d8) went through the same issue about a year ago and is in the same situation, so I would like to fix that one as well.

My question is, can someone please help me bring it (c15t1d30) back online? I have followed the Oracle instructions on replacing a drive with a new one in the same slot, and nothing I do seems to bring it back. The drive still remains in unavail state with the red light on.

Below is my 'zpool status -v' output and a list of the things I have tried:

Code:

# zpool status -v
 pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
  scan: resilvered 34.8G in 6m59s with 0 errors on Mon Nov  4 14:05:41 2013

config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0

errors: No known data errors

 pool: tank1
 state: DEGRADED
status: One or more devices are unavailable in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or 'fmadm repaired', or replace the device
        with 'zpool replace'.
  scan: scrub canceled on Tue Nov 12 17:18:14 2013

config:

        NAME                      STATE     READ WRITE CKSUM
        tank1                     DEGRADED     0     0     0
          raidz2-0                ONLINE       0     0     0
            c15t1d0               ONLINE       0     0     0
            c15t1d1               ONLINE       0     0     0
            c15t1d2               ONLINE       0     0     0
            c15t1d3               ONLINE       0     0     0
            c15t1d4               ONLINE       0     0     0
            c15t1d5               ONLINE       0     0     0
          raidz2-1                DEGRADED     0     0     0
            c15t1d6               ONLINE       0     0     0
            c15t1d7               ONLINE       0     0     0
            c15t1d8               UNAVAIL      0     0     0
            c15t1d9               ONLINE       0     0     0
            c15t1d10              ONLINE       0     0     0
            c15t1d11              ONLINE       0     0     0
          raidz2-2                ONLINE       0     0     0
            c15t1d12              ONLINE       0     0     0
            c15t1d13              ONLINE       0     0     0
            c15t1d14              ONLINE       0     0     0
            c15t1d15              ONLINE       0     0     0
            c15t1d16              ONLINE       0     0     0
            c15t1d17              ONLINE       0     0     0
          raidz2-3                ONLINE       0     0     0
            c15t1d18              ONLINE       0     0     0
            c15t1d19              ONLINE       0     0     0
            c15t1d20              ONLINE       0     0     0
            c15t1d21              ONLINE       0     0     0
            c15t1d22              ONLINE       0     0     0
            c15t1d23              ONLINE       0     0     0
          raidz2-4                ONLINE       0     0     0
            c15t1d24              ONLINE       0     0     0
            c15t1d25              ONLINE       0     0     0
            c15t1d26              ONLINE       0     0     0
            c15t1d27              ONLINE       0     0     0
            c15t1d28              ONLINE       8     0     0
            c15t1d29              ONLINE       0     0     0
          raidz2-5                DEGRADED     0     0     0
            c15t1d30              UNAVAIL      0    23     0
            c15t1d31              ONLINE       0     0     0
            c15t1d32              ONLINE       0     0     0
            c15t1d33              ONLINE       0     0     0
            c15t1d34              ONLINE       0     0     0
            c15t1d35              ONLINE       0     0     0
        logs
          c9t5000A72B30077A50d0   ONLINE       0     0     0
          c11t5000A72B30077A4Cd0  ONLINE       0     0     0
        cache
          c8t2d0                  ONLINE       0     0     0
          c8t3d0                  ONLINE       0     0     0
          c8t4d0                  ONLINE       0     0     0
          c8t5d0                  ONLINE       0     0     0

device details:

        c15t1d8                 UNAVAIL           too many errors
        status: FMA has faulted this device.
        action: Run 'fmadm faulty' for more information. Clear the errors
                using 'fmadm repaired'.
           see: http://support.oracle.com/msg/FMD-8000-4M for recovery

        c15t1d30                UNAVAIL           too many errors
        status: FMA has faulted this device.
        action: Run 'fmadm faulty' for more information. Clear the errors
                using 'fmadm repaired'.
           see: http://support.oracle.com/msg/ZFS-8000-FD for recovery


errors: No known data errors

What I have tried:

Code:

# zpool offline tank1 c15t1d30

- physically replaced the hard drive with a brand new, exactly the same one, in the same slot.

Code:

# zpool replace tank1 c15t1d30
cannot label 'c15t1d30': try using fdisk(1M) and then provide a specific slice
Unable to build pool from specified devices: invalid vdev configuration

- this could be the issue, but I have ZERO idea what it means. Google revealed nothing very helpful.
- zpool status -v showed no change at all, i.e. still unavail

Code:

# zpool clear tank1 c15t1d30

- zpool status -v showed no change at all except now error count is at 0, but still unavail

Code:

# zpool online tank1 c15t1d30

- zpool status -v showed no change
- put old drive back in then:

Code:

# devfsadm -Cv
devfsadm[21586]: verbose: removing file: /dev/dsk/c12t5000A72A30077A50d0s9
devfsadm[21586]: verbose: removing file: /dev/dsk/c15t1d8
...
...
...

- it printed lots more of those 'removing file' lines
- swapped in the new hard drive again

Code:

# devfsadm -Cv

- zpool status -v showed no change
- tried the replace again:

Code:

# zpool replace tank1 c15t1d30
cannot label 'c15t1d30': try using fdisk(1M) and then provide a specific slice
Unable to build pool from specified devices: invalid vdev configuration

- zpool status -v showed no change

Code:

# fmadm faulty

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Apr 29 08:49:13 64ebbc9f-ecca-4a54-bf04-ca3eefb783a6  ZFS-8000-LR    Major

Problem Status    : open
Diag Engine       : zfs-diagnosis / 1.0
System
    Manufacturer  : unknown
    Name          : unknown
    Part_Number   : unknown
    Serial_Number : unknown

System Component
    Manufacturer  : Cisco Systems Inc
    Name          : UCSC-C24-M3S
    Part_Number   :
    Serial_Number : WZP1709000E
    Host_ID       : 008457b8

----------------------------------------
Suspect 1 of 1 :
   Problem class : fault.fs.zfs.open_failed
   Certainty   : 100%
   Affects     : zfs://pool=64513e8f0e484ee2/vdev=ea0091c4caa611ec/pool_name=tank1/vdev_name=id1,sd@n600100404f361ca0a2900c8d00000000/a
   Status      : faulted and taken out of service

   FRU
     Status           : faulty
     FMRI             : "zfs://pool=64513e8f0e484ee2/vdev=ea0091c4caa611ec/pool_name=tank1/vdev_name=id1,sd@n600100404f361ca0a2900c8d00000000/a"

Description : ZFS device 'id1,sd@n600100404f361ca0a2900c8d00000000/a' in pool
              'tank1' failed to open.

Response    : An attempt will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Run 'zpool status -lx' for more information. Please refer to the
              associated reference document at
              http://support.oracle.com/msg/ZFS-8000-LR for the latest service
              procedures and policies regarding this diagnosis.

Code:

# fmadm repaired zfs://pool=64513e8f0e484ee2/vdev=ea0091c4caa611ec/pool_name=tank1/vdev_name=id1,sd@n600100404f361ca0a2900c8d00000000/a

- zpool status -v showed no change

Current state is the same as the original condition, except write errors are now 0:

Code:

# zpool status -v
  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
  scan: resilvered 34.8G in 6m59s with 0 errors on Mon Nov  4 14:05:41 2013

config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0

errors: No known data errors

  pool: tank1
 state: DEGRADED
status: One or more devices are unavailable in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or 'fmadm repaired', or replace the device
        with 'zpool replace'.
  scan: scrub canceled on Tue Nov 12 17:18:14 2013

config:

        NAME                      STATE     READ WRITE CKSUM
        tank1                     DEGRADED     0     0     0
          raidz2-0                ONLINE       0     0     0
            c15t1d0               ONLINE       0     0     0
            c15t1d1               ONLINE       0     0     0
            c15t1d2               ONLINE       0     0     0
            c15t1d3               ONLINE       0     0     0
            c15t1d4               ONLINE       0     0     0
            c15t1d5               ONLINE       0     0     0
          raidz2-1                DEGRADED     0     0     0
            c15t1d6               ONLINE       0     0     0
            c15t1d7               ONLINE       0     0     0
            c15t1d8               UNAVAIL      0     0     0
            c15t1d9               ONLINE       0     0     0
            c15t1d10              ONLINE       0     0     0
            c15t1d11              ONLINE       0     0     0
          raidz2-2                ONLINE       0     0     0
            c15t1d12              ONLINE       0     0     0
            c15t1d13              ONLINE       0     0     0
            c15t1d14              ONLINE       0     0     0
            c15t1d15              ONLINE       0     0     0
            c15t1d16              ONLINE       0     0     0
            c15t1d17              ONLINE       0     0     0
          raidz2-3                ONLINE       0     0     0
            c15t1d18              ONLINE       0     0     0
            c15t1d19              ONLINE       0     0     0
            c15t1d20              ONLINE       0     0     0
            c15t1d21              ONLINE       0     0     0
            c15t1d22              ONLINE       0     0     0
            c15t1d23              ONLINE       0     0     0
          raidz2-4                ONLINE       0     0     0
            c15t1d24              ONLINE       0     0     0
            c15t1d25              ONLINE       0     0     0
            c15t1d26              ONLINE       0     0     0
            c15t1d27              ONLINE       0     0     0
            c15t1d28              ONLINE       8     0     0
            c15t1d29              ONLINE       0     0     0
          raidz2-5                DEGRADED     0     0     0
            c15t1d30              UNAVAIL      0     0     0
            c15t1d31              ONLINE       0     0     0
            c15t1d32              ONLINE       0     0     0
            c15t1d33              ONLINE       0     0     0
            c15t1d34              ONLINE       0     0     0
            c15t1d35              ONLINE       0     0     0
        logs
          c9t5000A72B30077A50d0   ONLINE       0     0     0
          c11t5000A72B30077A4Cd0  ONLINE       0     0     0
        cache
          c8t2d0                  ONLINE       0     0     0
          c8t3d0                  ONLINE       0     0     0
          c8t4d0                  ONLINE       0     0     0
          c8t5d0                  ONLINE       0     0     0

device details:

        c15t1d8                 UNAVAIL           too many errors
        status: FMA has faulted this device.
        action: Run 'fmadm faulty' for more information. Clear the errors
                using 'fmadm repaired'.
           see: http://support.oracle.com/msg/FMD-8000-4M for recovery

        c15t1d30                UNAVAIL           too many errors
        status: FMA has faulted this device.
        action: Run 'fmadm faulty' for more information. Clear the errors
                using 'fmadm repaired'.
           see: http://support.oracle.com/msg/ZFS-8000-LR for recovery


errors: No known data errors
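
With 36 disks the listing is long, so a small filter can help pick out the failed ones. The following is only a sketch and was not run as part of the steps above. The function name unavail_devices is made up for illustration, and the sample text is a few rows copied from the output above. It relies on the config rows having numeric READ/WRITE/CKSUM columns, which the "device details" rows do not.

```shell
#!/bin/sh
# Print the names of vdevs whose STATE column is UNAVAIL.
# Config-table rows have a numeric third column (READ count); the
# "device details" rows have text there ("too many errors"), so they
# are skipped and each device is listed once.
unavail_devices() {
    awk '$2 == "UNAVAIL" && $3 ~ /^[0-9]/ { print $1 }'
}

# Sample rows copied from the status output; on the real host one
# would pipe 'zpool status tank1' into the function instead.
sample='            c15t1d7               ONLINE       0     0     0
            c15t1d8               UNAVAIL      0     0     0
            c15t1d30              UNAVAIL      0     0     0
        c15t1d30                UNAVAIL           too many errors'

printf '%s\n' "$sample" | unavail_devices
```

On the live system the equivalent would be something like 'zpool status tank1 | unavail_devices'.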

Phew, a lot of info!

So basically, can anyone help get this drive back online or shed any insight?

Thanks!
Jono

wpeckham · 04-29-2021, 10:28 AM

IS THIS SERVER PERFORMING A CRITICAL FUNCTION?
Before taking ANY steps I would evaluate what investment it is worth to keep this going, and whether to recover it as it was or replace it entirely.
Once past that:

Step one would be to make absolutely certain that you have everything about the system well documented and multiple verified full backups of all critical data. One would hope you do this regularly anyway, but when hardware starts failing it becomes immediate and critical.

Step two: If the recovery and replacement steps are not working, there is almost certainly a good reason. If it is in hardware, there may not be a great replacement plan. I would verify the hardware (this requires a hardware engineer familiar with that platform). A field engineer may also have recovery advice and pointers to documentation that will help you.

Step three: while awaiting the engineer, get working on a full platform replacement plan. I do not think anyone can purchase that system new today, and I would bet it is long out of support. That means someone should have planned the replacement long ago. Since they did not, it now falls to you. I would investigate HP servers and consider RHEL or SUSE as solid, supported operating systems that should handle anything that generation of Solaris server could do, although not quite in the same ways. Also, while local storage has advantages, there are very fast SAN options that can be faster and more reliable than any local storage that is not SSD based. (Also, if you go with local SSD storage, know that it only takes seven SSD drives to max out the channel bandwidth of a fast/wide SCSI controller and turn it into a choke point. If you need maximum performance and choose SSD, you will need to limit the active drives per controller to six. With that in mind, a SAN with rotational drives but LOTS of cache may be the better deal.) This plan will not be wasted. You should use the plan no matter what happens with the server, but if the engineer can provide you a path to recovery on the old hardware, then you can take longer to plan, budget, and sell the migration plan.

Step four: react to what you learn from the engineer. The state of the hardware and available parts and options will dictate the direction of your next steps.

everyday · 05-01-2021, 05:39 AM

Thanks so much for your reply.

The server is more of a storage server with the 'not so critical' data on it. I should be clear that it has not failed. All the data is safe. I have backups as well. I just need to repair the 2 disks that have died in it. I have physically replaced the failed drives with brand new, exactly the same drives, and thought I followed the instructions from Oracle correctly, but the 2 drives remain in an 'unavail' state.

That is the bit I cannot figure out.

It may just be related to this message, which I do not understand:

Code:

cannot label 'c15t1d30': try using fdisk(1M) and then provide a specific slice
Unable to build pool from specified devices: invalid vdev configuration

Any thoughts?

Best
Jono

Pigi_102 · 05-04-2021, 04:17 PM

Is this an x86 machine?
If yes, then you have to invoke format and fdisk before Solaris can use the disk.
Take a look at this link: https://docs.oracle.com/cd/E19683-01...qva/index.html
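
As a rough sketch of what that could look like on x86 (nothing here was confirmed in this thread), the script below only prints the commands instead of running them, so they can be reviewed first. The assumptions: the new disk has no fdisk partition table, its whole-disk raw device is /dev/rdsk/<disk>p0, and a single default Solaris partition made with 'fdisk -B' is acceptable. The function name relabel_plan is hypothetical.

```shell
#!/bin/sh
# Dry run: print the steps for one disk, do not execute them.
# Assumes an x86 host and a blank replacement disk in the same slot.
relabel_plan() {
    pool=$1
    disk=$2
    # Write a default fdisk table: one Solaris partition spanning the disk.
    echo "fdisk -B /dev/rdsk/${disk}p0"
    # Ask ZFS to replace the faulted vdev with the disk now in that slot.
    echo "zpool replace ${pool} ${disk}"
    # Check that the resilver has started.
    echo "zpool status -v ${pool}"
}

relabel_plan tank1 c15t1d30
```

If the printed commands look right for your setup, they could then be run by hand as root, one at a time, checking the output of each before moving on. The same would apply to c15t1d8.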