First Equallogic Disk Failure in Two Years

By admin, December 12, 2012 6:24 pm

Just received three email alerts simultaneously from SAN HQ, Dell OME and EQL Group Manager, all saying the disk in slot 5 of one of the Equallogic members has failed. The last disk (slot 15) kicked in and the RAID set is reconstructing. Called local Dell ProSupport; a replacement part is being arranged and will be delivered to the data center within 2-3 hours. (Update: RAID reconstruction took about 3 1/2 hours to complete.)

It was quite a black day today, as this is the second incident to hit my equipment. The Equallogic SAN was under very light load and the disk just failed without any warning. Where is the predictive disk failure feature in Equallogic, after all?

Alert from Dell OME:
Device: , Service Tag:, Asset Tag:, Date:12/12/12, Time:17:46:17:000, Severity:Critical, Message:Sent when eqlDiskStatus changes from one state to another state. Variables: eqlDiskStatus=Failed,eqlDiskSlot=5
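
The OME alert above packs its useful details into a single comma-separated line, with the SNMP variables at the end. As a rough sketch (the function names `parse_variables` and `parse_severity` are made up for illustration and are not part of OME), the failed slot and status could be pulled out of such a line like this:

```python
import re

# The alert text exactly as quoted above.
ALERT = (
    "Device: , Service Tag:, Asset Tag:, Date:12/12/12, Time:17:46:17:000, "
    "Severity:Critical, Message:Sent when eqlDiskStatus changes from one state "
    "to another state. Variables: eqlDiskStatus=Failed,eqlDiskSlot=5"
)

def parse_variables(alert: str) -> dict:
    """Return the name=value pairs that follow 'Variables:' (empty if absent)."""
    _, _, tail = alert.partition("Variables:")
    return dict(re.findall(r"(\w+)=(\w+)", tail))

def parse_severity(alert: str):
    """Return the severity word, or None if the field is missing."""
    match = re.search(r"Severity:\s*(\w+)", alert)
    return match.group(1) if match else None

if __name__ == "__main__":
    print(parse_severity(ALERT))   # Critical
    print(parse_variables(ALERT))  # {'eqlDiskStatus': 'Failed', 'eqlDiskSlot': '5'}
```

This only assumes the field layout seen in this one alert; other OME alert types may be formatted differently.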

Alert from EQL Group Manager:
Warning health conditions currently exist.
Correct these conditions before they affect array operation.
Non-fatal RAIDset failure. While the RAID set is degraded, performance and availability might be decreased. There are 1 outstanding health conditions. Correct these conditions before they affect array operation.
Failure: HDD Drive: 5, Model: ST3600057SS , Serial Number: 3SL14VVR
Reconstruction of RAID LUN 0 initiated.

Alert from SAN HQ:

  • 12/12/2012 5:45:46 PM to 12/12/2012 5:47:46 PM
    • Warning: Member eql RAID Set Is Degraded
      • Warning: Member eql RAID set is degraded because a disk drive failed or was removed.
    • Warning: Member eql RAID More Spares Expected
      • Warning: Member eql The current RAID configuration requires more spare drives then are currently available.
    • Warning: Member eql has a failed drive in slot 5

15 Responses to “First Equallogic Disk Failure in Two Years”

  1. Danny says:

    My PS6000XV also uses the same HD model and firmware…

  2. Danny says:

    Totally agree, today is black Friday!

    Just found that my old PE2850, on which I installed SAN HQ and the Backup Exec server, cannot be accessed. Most likely a hard disk on this server (I set up RAID 1 for the OS disks and RAID 0 for maximum backup space) is damaged.

  3. admin says:

    One of the backup USB disks failed multiple times, and it also prevented my R610 from rebooting.

    Btw, if I knew for sure that next Friday is the end, then none of this would matter; I would just let it be and continue my life as it should be, haha…

    The member with the failed disk completed RAID reconstruction in under 3 1/2 hours, not bad at all considering the RAID set size is about 3.7TB.

    0:Reconstruction of RAID LUN 0 completed in 12190 seconds.

    Dell also replaced the hard disk and it immediately became a spare disk (although with a newer firmware, EN03). Voilà, case solved. :)
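
As a quick sanity check on the "under 3 1/2 hours" figure, the 12190 seconds from the log line can be converted to hours and minutes. The sketch below also estimates an average rebuild rate, but that part rests on an assumption not stated in the thread: that the failed ST3600057SS is a 600 GB drive and that the whole drive was rebuilt.

```python
REBUILD_SECONDS = 12190   # from "Reconstruction of RAID LUN 0 completed in 12190 seconds."
ASSUMED_DRIVE_GB = 600    # assumption: ST3600057SS capacity, not stated in the thread

def format_duration(seconds: int) -> str:
    """Format a number of seconds as 'Hh MMm SSs'."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"

def rebuild_rate_mb_s(capacity_gb: float, seconds: int) -> float:
    """Average rate in MB/s (decimal units) if the whole drive was rewritten."""
    return capacity_gb * 1000 / seconds

if __name__ == "__main__":
    print(format_duration(REBUILD_SECONDS))  # 3h 23m 10s
    print(round(rebuild_rate_mb_s(ASSUMED_DRIVE_GB, REBUILD_SECONDS), 1))  # 49.2
```

The rate is only a ballpark; the array was serving a light production load during the rebuild.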

  4. Danny says:

    Yeah, 21-Dec. It won’t matter if it is the end for everyone!

    About EN03: can we upgrade the firmware on the existing hard disks?

    BTW, do you think it is time to upgrade ESXi from 4.1 to 5.1?

  5. admin says:

    I would prefer not to upgrade the Equallogic disk firmware unless you are having problems.

    Regarding vSphere 5.1, many said it was a rushed release from VMware to counter Microsoft’s Hyper-V 3.0, so it contains quite a lot of bugs; that’s why 5.1a and 5.1b were released within the next 3 months.

    As for me, I will stay on 4.1 for a while, as there aren’t many obvious major improvements in 5.0 or 5.1 unless you need Storage DRS, bigger VMs (i.e., 64 vCPUs and 1TB of memory) or W2K12, most of which I don’t need. 4.1 is very stable and I am happy with it; I will probably upgrade when 6.1 is out.

  6. Danny says:

    I see. Thanks for your suggestion.

  7. enderwa says:

    We had a 3-disk failure on our PS6510E about 6 months ago – all 3 disks went offline within about 3 minutes. The unit was on v5.0.4 firmware at the time.

    Unimpressed with Dell AU Pro/Ent Support – they didn’t know enough about the product, basically wrote off all data on the disks, and were ready to do a factory reset / total RAID50 array rebuild.

    However, actual EQL Support in the US managed to reconstruct the array back into a usable but degraded state, and then recover to a normal fully functional operating state – only one disk was actually bad (and needed replacement).

    Anyways, my point related to this thread is that the 48-disk RAID50 rebuild/reconstruction/integrity check was all good within about 4 hours. I’ve seen 5-disk array rebuilds on a PERC 4 controller take 12 hours, so that was very impressive.

  8. admin says:

    PS6510E uses 7,200RPM SAS disks and they are known for failures; read the EQL firmware release notes and you will see many issues with large, slow-spindle disks.

    12/12/2012 Firmware Release Announcement – Hard Disk Drive Firmware KD0A

    In November, Dell announced the release of Hard Disk Drive Firmware KD0A for selected 7200RPM based 500GB, 1TB, and 2TB drives shipped on the PS4000E, PS6000E, PS6010E, PS6500E and the PS6510E arrays. This Drive Firmware removes a false failure condition that caused drives to fail at an accelerated rate.

    Note that Dell has made improvements in drive error handling routines of EqualLogic array firmware over the course of the last few years and has worked closely with drive manufacturers to improve error handling routines of the hard drives.

    If you are using arrays with these drives, Dell strongly recommends that you update the drive firmware. You can confirm if you are using these drives by reading the instructions in release notes.

    Btw, RAID6 is the latest recommendation from Equallogic for large disks; they seem to have moved away from RAID5 and RAID50.

    Finally, I am really surprised the US EQL team could bring back the data after your local Dell ProSupport wiped out everything. How is this possible?

    Nevertheless, US EQL support has much more in-depth experience than the local support team. I always ask them to help me via WebEx, and it has always been a pleasant experience.

  9. enderwa says:

    hi,

    just to clarify,
    Dell AU Pro/Ent Support wanted to reset and totally rebuild the array, at which point we stalled, and escalated the issue to US Pro/Ent Support [== EQL techies] (that know what they are doing).

    However, your post about drive firmware availability highlights my constant bugbear with Dell support – not all support groups are on the same page [info repeatedly doesn't filter through to all relevant support groups] and inexperienced techies are assigned to do major upgrades they have little experience with – we just went through an EQL SAN firmware upgrade process not less than 2 weeks ago, and not without issues I might add. When we asked, we were told that no drive firmware upgrades were needed.

    But our ST31000524NS drives are covered by the link you’ve supplied.

    We had our whole VM infrastructure backed up and shutdown for that firmware upgrade [switches, MEM on the servers, SANs - the whole box and dice] (and that was the time to be doing hard drive f/w upgrades). grrr…..

  10. Brian says:

    My PS4000E, equipped with 16 x 1TB disks, has had 4 failed disks in 2 years of use.

    I totally agree that the US support guys have much more expertise with the EQL product. They are professional and very confident on the subject matter.

  11. enderwa says:

    hmmm…. I tempted fate by posting in here.

    We had a single disk fail late last week – bad sectors apparently – and one of the two hot spares kicked in as expected.
    RAID reconstruction completed successfully in ~14hrs – still not bad for a 46-disk array.

    I believe our previous 4 hour rebuild was more likely just an integrity check (bringing 2 good [mistakenly marked as bad] disks back into the array).

    Can anyone clarify the internal RAID setup underlying a RAID50 EQL array? In the process of getting back online in a degraded state during our original 3-disk failure, I remember seeing references to 3 separate RAID50 LUNs + a smaller RAID5 LUN – I didn’t quite know what to make of it, or how to translate that into our 46-disk RAID50 array (+ 2 hot spares).

  12. admin says:

    I believe it’s done automatically.

    I’ve also seen it somewhere just last week: for a PS6510E with 45 disks + 3 hot spares in RAID50, it’s actually broken into 3 RAID50 sets of 14 disks + 1 hot spare each, so 15 disks x 3 + 3 hot spares = 48 disks. (I suspect I may be wrong with the numbers, but it’s something like this.)

    Equallogic RAID is not a simple RAID5/RAID50/RAID6, which I think explains why you can have 3 disks fail at the same time without the volume going down.

    Finally, I wonder why you didn’t purchase the PS6510ES with SSD caching? The extremely fast SSD disks inside would help a lot.

  13. Don Williams says:

    Re: Drive FW. You should always install the drive FW upgrades. They are always important and could prevent an unexpected outage. Keeping up to date on array FW is also very important.

    Re: ESX. v5.x offers many benefits over 4.1: performance and greater storage integration, just to name a few.

    Re: RAIDsets. Several EQL models use multiple RAIDsets that are joined together to create a single storage instance.

  14. Brian Law says:

    I got an Excel sheet from the last Dell EQL training; the RAID set relationship for the 48-disk RAID50 is as below:

    (6+1 6+1)(6+1 6+1)(6+1 6+1)(3+1) + 2 hot spares
    Total usable space: 39 disks
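
The layout quoted above can be checked with a little arithmetic. The sketch below reads each "N+1" as a RAID 5 group of N data disks plus 1 parity disk (an interpretation of the notation, not something the spreadsheet spells out) and confirms that the groups plus hot spares add up to 48 disks with 39 usable. It also matches the earlier description of three RAID50 sets plus a smaller RAID5 set.

```python
# Each tuple is (data_disks, parity_disks) for one RAID 5 group.
# Three RAID50 pairs of 6+1 groups, followed by one 3+1 RAID5 group.
LAYOUT = [(6, 1)] * 6 + [(3, 1)]
HOT_SPARES = 2

def summarize(layout, hot_spares):
    """Count data, parity, spare and total disks for a list of RAID 5 groups."""
    data = sum(d for d, _ in layout)
    parity = sum(p for _, p in layout)
    return {
        "data": data,
        "parity": parity,
        "spares": hot_spares,
        "total": data + parity + hot_spares,
    }

if __name__ == "__main__":
    print(summarize(LAYOUT, HOT_SPARES))
    # {'data': 39, 'parity': 7, 'spares': 2, 'total': 48}
```

So 7 of the 48 disks go to parity and 2 to hot spares, leaving 39 disks' worth of usable space.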

  15. enderwa says:

    Another single disk (ST31000524NS KD03) failure recently, 13+hrs for the auto rebuild using a hot-spare.

    Really need to schedule downtime to facilitate a drive firmware upgrade to KD0A.

    The PS6510ES was not available at the time of our purchase.
