Category: Equallogic & VMWare (Virtualization Technology)

NO STOCK of this Equallogic 600GB 15K SAS Disk!!!

By admin, December 28, 2015 7:07 pm

This is the 4th time an Equallogic 15K SAS disk has failed (the previous failure was on April 30, 2015), but this time it comes with another surprise: the local Dell logistics and inventory department called me to say they don’t have any in stock!!! ARE YOU KIDDING ME? There are at least 50-60 Equallogic boxes running 600GB 15K SAS in Hong Kong and they don’t stock the part that fails most often? And we paid for the 4-hour Pro-Support replacement service…to be updated!

The RAID reconstruction took 3 hours and 40 mins to complete this time.


Update: Jan-15, 2016

5th time Equallogic 15K SAS failed.


Update: Jan-18, 2016

6th time Equallogic 15K SAS failed.

Whenever you see predictive failure kick in (i.e., Copying to Spare), you know it’s time to call Dell service: in all 3 previous predictive failures the copy never worked and always failed halfway.


Update: Jan-30, 2016

7th time Equallogic 15K SAS failed

4 disks failed one after another in 30 days is a Wow!!!

It seems these 5+ year old SAS disks do have a life span of around 5 years, roughly the warranty period. That’s why you need to extend your warranty for another 2 years, as Dell gives a maximum of 7 years of warranty on its enterprise products.

Update: Jul-15, 2016

8th time Equallogic 15K SAS failed

I don’t think predictive failure will help, so I called Pro-Support right away to replace the failed disk.

Update: Oct-1, 2018

9th time Equallogic 15K SAS failed

After more than 2 years, another disk failure has finally arrived, but this time the copy to spare was successful!

—————————————–
Level Received Member Message
Info 10/01/2018 04:43:05 AM eql01 Completed copying-to-spare operation from drive 4 to drive 15.
—————————————–

One interesting fact I found: when I re-inserted the failed disk, EQL produced the following error, which I had never seen before.

—————————————–
History-of-Failures

ERROR event from storage array eql01
subsystem: SP
event: 14.4.18
time: Mon Oct 1 11:00:43 2018

0:Orphan disk drive: Disk drive LUN 4 (26bc078a0900-779214cc-8) no longer belongs to RAID LUN 0 (3be078a0900-b0e5c22b-92d39732). This disk has a history of failure and will not be used.

Update: Dec-2, 2018

10th time Equallogic 15K SAS failed. More than half of the disks have now been replaced, and the replacements carry firmware EN03 instead of EN01.

Update: Dec-31, 2018

11th time Equallogic 15K SAS failed, just before the end of 2018. This is the first failed disk with firmware EN03.

Update: May-23, 2019

12th time Equallogic 15K SAS failed, firmware EN03 (TB EN03, Disk 0:0)

Update: Nov-15, 2020

13th time Equallogic 15K SAS failed, firmware EN03 (TB EN03, Disk 0:3)

Update: Jul-19, 2023

14th time Equallogic 15K SAS failed, firmware EN03 (TB EN01, Disk 0:5)

Update: Mar-1, 2024

15th time Equallogic 15K SAS failed, firmware EN00 (TB EE06, Disk 0:13)
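
To put the eql01 failure history above into numbers, here is a minimal Python sketch that computes the gap in days between consecutive failures. The dates are the ones recorded in these posts (the first two come from the "Third Equallogic Disk Failure" entry); the names in the code are mine and purely illustrative.

```python
from datetime import date

# Failure dates for eql01 as recorded in these blog posts.
# EQL01_FAILURES and gaps_in_days are illustrative names, not from any Dell tool.
EQL01_FAILURES = [
    date(2012, 12, 12), date(2014, 1, 11), date(2015, 4, 30),
    date(2015, 12, 28), date(2016, 1, 15), date(2016, 1, 18),
    date(2016, 1, 30), date(2016, 7, 15), date(2018, 10, 1),
    date(2018, 12, 2), date(2018, 12, 31), date(2019, 5, 23),
    date(2020, 11, 15), date(2023, 7, 19), date(2024, 3, 1),
]


def gaps_in_days(dates):
    """Return the number of days between each consecutive pair of dates."""
    ordered = sorted(dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


gaps = gaps_in_days(EQL01_FAILURES)

if __name__ == "__main__":
    for when, gap in zip(sorted(EQL01_FAILURES)[1:], gaps):
        print(f"{when:%b %d, %Y}: {gap} days since the previous failure")
    print(f"shortest gap: {min(gaps)} days, longest gap: {max(gaps)} days")
```

The output makes the clustering visible: long quiet stretches followed by bursts where several disks go within weeks of each other.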

=========================
Update: Jan-23, 2023

1st time Equallogic02 10K SAS failed (TB FE10, Disk 0:11)

Update: Feb-28, 2023

2nd time Equallogic02 10K SAS failed (TB FE10, Disk 0:12)

Update: Mar-13, 2023

3rd time Equallogic02 10K SAS failed (TB FE10, Disk 0:2)

Update: Mar-26, 2023

4th time Equallogic02 10K SAS failed (TB FE10, Disk 0:3)

Update: Mar-27, 2023

5th time Equallogic02 10K SAS failed (TB FE10, Disk 0:2)

Update: Mar-28, 2023

6th time Equallogic02 10K SAS failed (TB FE10, Disk 0:4)

Update: Mar-30, 2023

7th time Equallogic02 10K SAS failed (TB FE10, Disk 0:6)

Update: Apr-08, 2023

8th time Equallogic02 10K SAS failed (TB FE10, Disk 0:11)

Update: Apr-16, 2023

9th time Equallogic02 10K SAS failed (TB FE10, Disk 0:3)
10th time Equallogic02 10K SAS failed (TB FE10, Disk 0:10)

Update: Apr-17, 2023

11th time Equallogic02 10K SAS failed (TB FE10, Disk 0:7)

Update: Apr-24, 2023

12th time Equallogic02 10K SAS failed (TB FE10, Disk 0:15)

Update: May-6, 2023

13th time Equallogic02 10K SAS failed (TB FE10, Disk 0:3)

Update: May-20, 2023

14th time Equallogic02 10K SAS failed (TB FE10, Disk 0:0)

Update: Jun-8, 2023

15th time Equallogic02 10K SAS failed (TB FE10, Disk 0:21)

Update: Jul-14, 2023

16th time Equallogic02 10K SAS failed (TB FE10, Disk 0:11)

Equallogic vs Nimble and Tegile

By admin, December 23, 2015 10:45 pm

The following is very interesting as they are currently the new kids on the block in storage world!

Recently I talked to Nimble’s senior technical consultant during the VMWorld Expo in Hong Kong, and indeed they are very similar to Equallogic…i.e., the downsides are that Nimble too can’t survive the failure of any array in the same cluster (Lefthand can), has no built-in thin reclaim feature (3Par does), and is limited to 4 nodes compared with Equallogic’s 16 nodes.

nimble04

However, Nimble has had SSD built into its array design and firmware from day one, so the performance is much better even with the same set of hardware. So if your site is mainly used for VDI, Nimble should be on the shortlist for your next purchase.


“Dell has become a top-four player in disk arrays, as IBM and NetApp are left behind in Gartner’s 2015 Magic Quadrant for general purpose disk arrays…Nimble Storage enters the leaders’ quadrant for the first time” (Read More)

Also please refer to the latest Gartner Magic Quadrant for General-Purpose Disk Arrays (Published: October 2015)

Of course, there is one main thing I don’t really like…Nimble and Tegile have a rather simple admin GUI (here is Tegile); it’s not as graphically oriented as Equallogic’s Group Manager…for the record, the GUI is also one of the most important things for me when deciding on a SAN storage purchase. Not to mention Equallogic has a deep inspection tool…SANHQ…and I couldn’t find the same level of deep analysis in either the Nimble (InfoSight, cloud based?) or the Tegile GUI.

nimble05

Besides, image is everything…the actual design of the array itself (box and tray) does not look as good as Equallogic’s…though I have to admit the face plates for both Nimble and Tegile look cool…but somehow they don’t match the array design behind them, strange…probably the face plate and the array had different designers.

img_1648

Yes, Nimble’s snapshot compression and de-duplication are good, but there is always a trade-off in performance, as compression requires CPU time. FYI, Equallogic firmware 8.0 already has a similar built-in snapshot compression. After all, it seems to me both Nimble and Tegile are putting space utilization ahead of SSD performance as their 1st priority selling point, so Equallogic has no choice but to catch up. It seems Nimble and Tegile are really pushing Dell hard on this!!! Finally Equallogic has to change under pressure, which is really good for its customers.


The other thing is that Nimble and Tegile are x86-based hardware (Ah…SuperMicro!!!) while Equallogic uses a proprietary RISC-based CPU, and Nimble’s firmware (i.e., software) is based on an open source Linux cluster file system (GFS? To be confirmed).


One more thing I don’t like about Nimble and Tegile is their disk reliability: 7,200RPM SATA, come on, it’s confirmed those SATA disks are really unreliable and risky, and I wouldn’t trust putting our mission critical data on them for sure. Even for our Equallogic 15K SAS, the failure rate per array is almost 1/16 every year…I’ve read too many horror stories about those SATA arrays (i.e., PS4100 series), just plain scary!

Last but not least, probably the most important one…price!!! Nimble and Tegile really aren’t cheap…as newcomers, I think they should at least have an attractive selling price in order for others to switch from the big brothers.

Still…IMHO, I wouldn’t classify either Nimble or Tegile as the next generation of SAN vendors, and it’s interesting to see that Nimble’s founders were actually from NetApp…this reminds me that Fortinet’s founder was actually from NetScreen (btw, both founders are Chinese)

Finally, there is an interesting report on Hybrid-Storage-Array-Buyers-Guide by DCIG

Nimble and Tegile: The new Lefthand and EqualLogic by John Sloan


Which hybrid storage solution should we go with, Nimble or Tegile? You can tell when a technology has crossed the Rubicon of acceptance when the question shifts from validating the technology to comparing the leading evangelists of the new approach.

In this case the new technology or approach is hybrid storage which includes solid state with traditional disk. A couple of years ago organizations wanted information on solid state storage and whether solid state would help mitigate IO bottlenecks. Were the IOPS worth the higher cost per gigabyte of solid state?

Today hybrids are hot and just about everybody has a scheme for adding solid state either as a cache layer or a solid state drive (SSD) tier. Nimble and Tegile stand out because they are built from the ground up to be optimized for a mix of solid state and spinning disk. They both argue that traditional storage is built for disk even if it has solid state added to the mix.

Seven or eight years ago the new storage technology was iSCSI. Fibre Channel was the dominant architecture for storage networks and arrays. But some feisty start-ups started championing this idea of wrapping SCSI storage blocks in internet protocol and pushing it across Ethernet networks.
Two of the bright stars in that iSCSI firmament were LeftHand Networks and EqualLogic. There were other players to be sure but these were the two I heard about most from storage buying prospects, especially in small to midsize organizations. Somewhere during 2007 the predominant question from these prospects went from “Should we consider iSCSI solutions?” to “Which iSCSI storage solution should we go with, Lefthand or EqualLogic?”.

Then, as now, it was server virtualization that pushed the start-ups into the limelight. Once VMware supported iSCSI for shared storage, midsized organizations buying their first SAN array for server consolidation/virtualization got very interested in Lefthand and EqualLogic.

The trigger now is the need to maintain consistent IOPS for virtual desktop initiatives. Failure of traditional shared storage to scale to VDI boot storms derailed many an implementation. Nimble and Tegile want to be your core storage platform but VDI has been their foot in the door of the data center.

So which of the two should you go with? Nimble or Tegile? Well, of course, it depends. Of the contests I’ve seen over this past summer, among organizations that have shortlisted both vendors, I can say that the two have been competing neck and neck. Those that have made their decisions say it was an extremely close run thing. From what I’ve seen:

Nimble seems to win over Tegile on cost, both upfront acquisition costs and expected future expansion/scaling costs.

esx-tegile-banner

Where Tegile wins over Nimble is in technology. More than one potential customer has told me that they really like how Tegile works. Tegile also supports multiple connection protocols where Nimble is iSCSI-only (that will likely change before the end of 2015).

We ranked both Nimble and Tegile as innovators in our most recent Vendor Landscape: Small to Mid-Range Storage Arrays. Tegile got kudos and ranked higher for their technology.
But as noted it is a very close run thing. Where one leads over another it is by degrees, not a slam-dunk differentiation. There is also the usual caution over risk of going with the small (albeit rapidly growing) start-ups over established players.

Lefthand and EqualLogic were both eventually acquired by HP and Dell respectively. A similar pattern was repeated with two other upstart storage innovators 3par and Compellent.

If you can’t beat them buy them. The established players are struggling right now to beat Nimble and Tegile. But the usual mainline storage suspects have pretty crowded portfolios right now. They may not have room for a Tegile or a Nimble.


Nimble vs Equallogic by Austin Henderson

The debate between Equallogic and Nimble essentially came down to the fact that we really liked the offering and the people of Nimble but questioned their staying capacity in the market. We were fearful that we would either purchase the product and it would not perform as advertised or that we would purchase the product and in two years they would be gone as a company.

The Equallogic brand name is a solid name and has a lot of experience to stand strong upon, but our research kept turning up some questions about how they had elected to employ snapshots in their platform and we reasoned that our ability to store snaps of our data on that device would be reduced in comparison to the Nimble product. This fear combined with the lack of follow through and inadequate justification of questions by EQ references and sales engineers confirmed our suspicion of the product as inferior in a technical sense.

To address the Nimble performance numbers we discussed with existing customers how the array performed and convinced them to give us 60 day terms on the purchase order – thus if it didn’t perform we would box it up and send it back. After review with other customers and their engineers we were fairly confident this would not be the case and as such decided we could take that risk.

Our Nimble array arrived and was installed with the help of two Nimble resources in very short order. Testing began shortly after install and within approximately a month all production data was moved to the device. Our RPO stretches backwards in excess of several weeks in some cases, thanks to the innovative approach to how the product takes data snapshots combined with thin provisioning and inline compression.

Our performance numbers are better than we could have imagined thanks to the use of SSD drives and while not measured yet the energy requirements have dropped considerably given that we went from 60 – 15K spinning drives to 4 SSD drives and 12 spinning drives.  Currently we have seen very few spikes above 3K IOPS and the device can scale on random read/write layout to approximately 15K – thus we have headroom to spare.

In the end the lasting power of the company remains to be seen – and we have our doubts – but conversations with the CEO and other leaders in the company lead us to believe the intent of the company is to leave their mark on the storage industry, not to exit stage right. Given our experiences thus far I would say they are headed for such a legacy and we are proud of and stand by our selection.

Equallogic MasterClass Workshop

By admin, December 11, 2015 10:32 pm

It turned out to be quite an introductory-level one…but it was nice to see Dell’s demo server room, and I noticed the PS6100 there had two failed disks as well. :)


High-End Junk in the Second-Hand Computer Market

By admin, December 10, 2015 10:25 pm

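I actually spotted some Equallogic gear there, though of course it was a very, very old model. Without the firmware, however, it is as useless as a lump of scrap metal.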


This controller is even older; I have no idea what model it is.


Windows Server 2012 R2 and Windows 10 on ESX 4.1!

By admin, October 31, 2015 10:22 pm

Well, you didn’t misread the above! Yes, I finally managed to get both (W2K12R2 & Win10) running under the old daddy VMware ESX 4.1 with the latest VMTools installed. Not supported? Huh? You must be kidding me.

w2k12r2

1. Basically, in order to run your OEM copy of Windows Server 2012 R2, you will need to select Windows Server 2008 R2 as the OS template. Also you need the latest SLIC 2.3 ROM, the correct Dell Cert and of course a valid OEM Key.

2. De-select the default video driver (i.e., WDDM) when installing VMware Tools or you will get a black screen immediately. At least this is true on Windows 10, but somehow there is no problem with W2K12R2. For details, please refer to this blog article. (Update: actually you don’t need to install VMware Tools here)

3. Still, you will find the mouse isn’t working smoothly in W2K12R2 and Win10. This happens with both Windows Server 2012 and Windows 8 as well, and the solution is simple: download and install the latest VMware Tools, which comes with updated drivers for video (VMware SVGA 3D), disk controller (PVSCSI) and network cards (VMXNET3).

win10

4. Again, after you have successfully upgraded VMware Tools, create a 1MB hard disk, set its SCSI node to 1:0 and change the secondary controller type to Paravirtual. Restart the VM, shut it down, then change the primary hard disk controller to Paravirtual as well, and finally remove the 2nd disk and controller. Voilà, there is your lightning fast PVSCSI disk!

5. The other thing I discovered is that W2K12R2 and Win10 kept crashing every 3-5 hours for no apparent reason. Then I found this KB, which solved the problem; also make sure you have applied all the Windows updates.

6. Another tip is to turn off Sleep Mode under Power Options in Windows 10, or the VM will constantly go to sleep and the VMTools status will always show “Not Running”.

7. Finally, I discovered you can install the latest CentOS 7.x and Ubuntu 15.x on ESX 4.1 without any problem, although they are not listed as supported OSes…strange, but they work perfectly fine, which leads me to think we can in fact install any up-to-date OS on the much older ESX 4.1. One hint is to download the latest VMTools and install it! (Actually you don’t need to, as CentOS 7-1611 somehow comes with VMTools automatically)

8. I also noticed the VMware Tools status for those Linux VMs shows as “Unmanaged”! That’s actually OK, as the tools are installed from the latest package instead of the default attached CD-ROM (whose version doesn’t work anyway), so you can safely ignore it.

So who said these two latest OS from Microsoft aren’t supported on ESX4.1? Think again. :)

Update: May 15, 2017

Tested the latest CentOS 7-1611 also worked perfectly on ESX4.1.

Update: Oct 10, 2017

Problem found: W2K12 constantly hangs. I finally identified the old anti-virus program, Symantec Endpoint Protection, as the cause of the conflict; you will need at least v14.0.2415 for the OS to be stable.

Update: Dec 25, 2017

It turns out Windows Server 2016 follows exactly the above steps, and I can confirm W2K16 is running happily on ESX 4.1 without any problem! Except that you need a SLIC 2.4 BIOS!

Finally A New Breed of Entry SAN from Dell !

By admin, May 8, 2015 1:50 pm

DELLSCv2020

After reading several SCv2000 reviews, I found it’s simply an upgraded version of PowerVault, so I seriously doubt the performance is even near Equallogic PS4100 series.

Quote from StorageReview:

The Dell Storage SCv2000 Series is a customer-inspired series of entry-level storage arrays that offer the same common management and several of the core features as higher-end Dell SC Series arrays, such as the SC4020. According to Dell, the SCv2000 Series offers the “best performance and protection in its class with integrated data protection to support specialized projects, database and test environments, or simple storage consolidation.” The new arrays enable customers to standardize on a common platform, saving time and money. The SCv2000 Series comes in three models with expansion options: the SCv2000, SCv2020, and SCv2080.

Benefits include:

  • High performance at an affordable price – The arrays deliver best-in-class performance in a single, affordable 2U enclosure. Feature-rich options include proven data protection features, RAID tiering to optimize capacity, thin provisioning, flash support, data-migration services, and multi-protocol connectivity. Granular data protection allows up to 2,000 snapshots and 500 replications.
  • Integrated data protection- Remote Instant Replay, Local Instant Replays and Replay Manager along with RAID tiering that optimizes SAN capacity. Tight integration with common application environments, such as Microsoft and VMware, help to simplify virtualized data centers, allowing local and remote data protection features to take consistent snapshots of virtual machines without sacrificing performance.
  • Future flexibility with Storage Center – Customers can gain more value from existing investments with data migration to higher-tier Dell Storage SC4000 and SC8000 Series arrays. The SCv2000 Series supports today’s needs and future growth as the arrays can be managed with the same, single user interface as other SC Series arrays. Customers can scale capacity when needed with the benefit of no capacity licensing and choose from a broad range of hard drives and SSDs, configured for a variety of applications.

SCv2000 Series specifications:

  • Models: SCv2000 | SCv2020 | SCv2080
  • Internal Storage: 12 x 3.5” drive bays | 24 x 2.5” drive bays | 84 x 2.5” or 3.5” drive bays
  • Supported expansion enclosures:
    • SC100: 12 x 3.5” or 2.5” drive bays | SC120: 24 x 2.5” drive bays | SC180: 84 x 2.5” or 3.5” drive bays
    • SC100: 12 x 3.5” or 2.5” drive bays | SC120: 24 x 2.5” drive bays
  • Maximum drive count:
    • 168 (12 internal, plus 156 external) | 168 (24 internal, plus 144 external) | 168 (84 internal, plus 84 external)
  • Total storage capacity: 504TB
  • Supported drive types:
    • HDD: 7.2K RPM | 15K, 10K, 7.2K RPM | 10K, 7.2K RPM
    • SSD: write-intensive, read-intensive (different drive types, transfer rates and rotational speeds can be mixed in the same system)
  • Controllers: Single or dual controllers | Single or dual controllers | Dual controllers
  • Processor: Intel Xeon 4-core
  • Memory: 8GB per controller
  • Network/server connectivity (front-end):
    • 2 x 16Gb FC ports per controller
    • 4 x 8Gb FC ports per controller
    • 4 x 1Gb iSCSI (BASE-T) ports per controller
    • 2 x 10Gb iSCSI (BASE-T) ports per controller
    • 4 x 12Gb SAS ports per controller
  • Internal drive connectivity (back-end): 6Gb SAS ports (2 per controller)
  • Product OS: Storage Center 6.6.2
  • Server OS:
    • Microsoft Windows Server
    • SLES
    • VMware
    • Citrix XenServer
    • Red Hat
  • RAID: Supports RAID 5, 6, RAID 10 and RAID 10 DM (dual mirror)

Third Equallogic Disk Failure

By admin, April 30, 2015 11:38 am

This is the 3rd time a 15K SAS disk has failed, which is actually not bad given the long period of 4 years and 6 months (the first was on December 12, 2012 and the 2nd on Jan 11, 2014). This time the reconstruction took approximately 4 hours to complete.


First Encounter of HA Failover

By admin, April 29, 2015 5:12 pm

Found that one of the hosts performed an HA failover on its own at midnight today, and all its associated VMs were restarted on the other hosts automatically. The exact cause of the host restart is still under investigation, as no hardware failure was found at all.

12:17 The host stopped responding to vCenter pings, which triggered the HA failover. In fact, vCenter took about 50 seconds to detect the actual host failure.

12:24 It took 7 minutes for the host to rejoin the vCenter cluster. In fact, the host and ESX restart only took about 5 minutes, counting from the boot screen.

12:28 Everything went back to green; the whole process took about 10 minutes. Unfortunately, each individual VM had 5-6 minutes of downtime.

Cluster Level Event Log

cluster

Host Level Event Log

host

Equallogic Firmware Upgrade Checkup List

By admin, July 5, 2014 4:53 pm

I’ve received an alert this morning saying “Time-of-day clock battery charge is low” from my Equallogic array.

clock

Googled a bit and it turned out to be a false alarm (this was also double-checked with Dell Pro-Support using the EQL Diag log). It only happens with older firmware prior to v5.2.10 (and I’m currently still using v5.2.2). The only way to clear the warning is to upgrade to the latest firmware release (i.e., v5.2.10 or later).
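
One pitfall when checking whether a firmware is "prior to v5.2.10": compared as plain strings, "5.2.2" sorts after "5.2.10". The following is a minimal Python sketch, assuming purely numeric dotted version strings like the ones quoted in this post, that compares them as integer tuples instead. The helper names are hypothetical and have nothing to do with any Dell tool.

```python
def parse_version(text):
    """Turn 'v5.2.10' or '5.2.10' into a tuple of ints such as (5, 2, 10)."""
    return tuple(int(part) for part in text.lstrip("vV").split("."))


def is_older(current, threshold):
    """Return True if `current` is an older release than `threshold`."""
    return parse_version(current) < parse_version(threshold)


if __name__ == "__main__":
    # Plain string comparison gets the order wrong for 5.2.2 vs 5.2.10.
    print("string compare says older:", "5.2.2" < "5.2.10")
    # Tuple comparison treats each component as a number.
    print("v5.2.2 older than v5.2.10:", is_older("v5.2.2", "v5.2.10"))
    print("v5.2.11 older than v5.2.10:", is_older("v5.2.11", "v5.2.10"))
```

Version strings with letters or build suffixes would need extra handling, which this sketch deliberately leaves out.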

However, I do remember that certain EQL components depend on specific firmware versions. Luckily, I found this chart; it’s really a time-saver, and it’s advisable to check this list every time you plan a major firmware upgrade.

For me, MEM is v1.1.0, SANHQ is 2.2.0 and EQL Virtual Storage Manager for VMware is v3.1.1, so my best and safest play is to upgrade to EQL firmware v5.2.11, which is the latest. If I upgrade to v6 or v7, many related components may not work.

eql_chart
Of course this chart also shows the possible upgrade path:

upgrade_path

Extra Notes:

* One special note: the latest MEM v1.2 only supports ESXi v5.0 or above, so those of you still running ESX/ESXi 4.x, please stay with v1.1.2.

* I also found that all Release Notes contain tips regarding upgrade paths and compatibility between versions, so better check those documents first.

Finally, I saw this Unfortunate Reminder in the latest EQL v7.x firmware Release Notes…so the idea that older Equallogic arrays can be loaded with the latest firmware IS NO LONGER TRUE, sigh…

ops

In addition, one good thing is that we can finally choose between 4K and 512-byte sectors, which should save a lot of wasted space in snapshots.

4k

Update: July 6, 2014

The problem still exists after upgrading to v5.2.11, what the hell? Local Pro-Support in XiaMen remains unprofessional, probably even worse than 4 years ago, as he asked me “Can you accept downtime?” when restarting the controller. $#%R$#@!@!!!!!

The other thing I found out is that the Restart button in the EQL Group Manager GUI is actually a way to make the active controller fail over, so your volumes will remain online, just like the controller restart during a firmware upgrade.

Update: July 7, 2014

The head of the Dell L2 support team called me this morning. He mentioned that US EQL support has also noticed this specific error remains even with the latest firmware v6.x applied, and hopefully it will be fixed in a later firmware.


Various Trouble Shooting Notes

By admin, January 11, 2014 7:38 pm

Yesterday was like a long fight: different parts started to fall apart within a few hours. First it was one of the ESX hosts, then Equallogic, and finally Group Manager and iDRAC problems. People say Shxt happens, and that fits my case exactly!

One thing I’m glad about is that Dell fulfilled its promise this time and fixed everything within the 4-hour Pro-Support contract (hardware-wise, of course). The poor guy had to go to the NOC twice and worked till almost midnight, with me working remotely.

So please let me list them accordingly:

1. ESX Host:
I suddenly received a host failure alert; vCenter showed the problem host as disconnected, and all the VMs on it were greyed out. The funny thing is that all the VMs were still pingable and functioning perfectly normally, as if nothing was wrong.

Telnet/SSH and even the console hung completely, there was no way to log in as root, and OpenManage wouldn’t load. Later I found out from the iDRAC system log that a 15K 146GB disk in a RAID1 configuration had failed.

Worse still, the replacement disk did not start to rebuild. Dell’s technician later went into the MegaRAID BIOS utility and found he had to add the disk back manually. I suspect the problem is that the replacement disk is a Fujitsu whereas the faulty disk is a Hitachi, which is why they didn’t work together initially. (They should in theory, but in reality NO)

At this stage, since there was no way to remove the live VMs or do a vMotion, I had no choice but to power down the host manually. Even more strangely, HA didn’t kick in: the VMs did not restart on other hosts in the cluster even after 5 minutes.

The whole rebuild took about 15 minutes, thanks to RAID1. The rebuild status in OpenManage was stuck at 33% while the disk light had stopped blinking (meaning completed), funny! After another reboot, the optimal status could be verified in the MegaRAID BIOS and was later reflected in OpenManage as well, so this means OpenManage takes time to fetch the status from the different hardware parts.

So I still have no clue why the faulty disk in a RAID1 caused the ESX host to be non-responsive.

2. Equallogic:

I received the following notice multiple times via email. Group Manager showed it as an Information event, in green, but I sensed something must be wrong, so I called Dell EQL support. As expected, the local support knew nothing about it.

—————————————–
INFO event from storage array eql01
subsystem: SP
event: 14.2.22
time: Fri Jan 10 12:10:30 2014

I/Os containing bad blocks were read from drive 10 and successfully reconstructed in the last 8 minutes.
—————————————–

After approximately 6 hours, the following faulty alert confirmed my previous worry.

—————————————–
ERROR event from storage array eql01
subsystem: SP
event: 14.4.22
time: Fri Jan 10 20:11:15 2014

Disk drive 10 failed in RAID LUN 0.
—————————————–

So the previous notice is actually EQL’s Predictive Failure in Action!!!
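
Since the informational 14.2.22 notice turned out to be the early warning, it may be worth flagging it automatically instead of trusting the green status. Below is a minimal Python sketch that parses a notification in the plain-text layout shown above and marks bad-block notices for attention. The field layout is inferred only from the two samples in this post, and the function names and the matching rule are my own assumptions, not a Dell API.

```python
import re


def parse_event(text):
    """Parse one notification in the layout of the samples quoted above."""
    event = {}
    # First line looks like: "INFO event from storage array eql01"
    head = re.search(r"^(\w+) event from storage array (\S+)", text, re.M)
    if head:
        event["severity"], event["array"] = head.group(1), head.group(2)
    # "key: value" lines: subsystem, event and time
    for key, value in re.findall(r"^(subsystem|event|time):\s*(.+)$", text, re.M):
        event[key] = value.strip()
    # The free-text message is taken to be the last non-empty line
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    event["message"] = lines[-1] if lines else ""
    return event


def looks_like_early_warning(event):
    """Assumed rule: an INFO event that mentions bad blocks deserves attention."""
    return event.get("severity") == "INFO" and "bad blocks" in event.get("message", "")


SAMPLE = """INFO event from storage array eql01
subsystem: SP
event: 14.2.22
time: Fri Jan 10 12:10:30 2014

I/Os containing bad blocks were read from drive 10 and successfully reconstructed in the last 8 minutes.
"""

if __name__ == "__main__":
    parsed = parse_event(SAMPLE)
    print(parsed["array"], parsed["event"], parsed["time"])
    print("early warning:", looks_like_early_warning(parsed))
```

In practice the same check could run over the alert mailbox, so that a repeated bad-block notice prompts a call to support before the drive actually fails.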

SANHQ also generated the similar alert.

Warning conditions:

  • 1/10/2014 8:10:50 PM to 1/10/2014 8:12:50 PM
    • Warning: Member eql01 RAID Set Is Degraded
      • Warning: Member eql01 RAID set is degraded because a disk drive failed or was removed.
    • Warning: Member eql01 RAID More Spares Expected
      • Warning: Member eql01 The current RAID configuration requires more spare drives then are currently available.
    • Warning: Member eql01 has a failed drive in slot 10

With the replacement disk, reconstruction immediately took place, and the process took about 1 hour to complete, again, thanks to RAID1.

3. EQL Group Manager

As I needed to verify that the replaced EQL disk had successfully become a hot spare, I found out I could no longer log in to EQL Group Manager due to some strange Java error, whether with IE or Firefox. The Java version was v7 u45; I tried different versions until I figured out that only v7 u17 worked. My conclusion is that the EQL firmware plays a big role here: as I am still using v5.2.2, EQL probably hard-coded the requirement into their application. Anyway, the Java JRE version always causes nasty problems in my environment one way or another, so I’ve decided not to upgrade it for sure.

4. iDrac

Back to the disconnected host with the faulty disk: I found I could no longer log in to the iDRAC Web UI. IE worked but produced all sorts of problems, not to mention the console didn’t show up at all with its ActiveX stuff. I even tried removing the iDRAC cert from the advanced options and rebooting the managed machine, which didn’t help at all. It turned out a simple content cache clear in Firefox solved the problem completely! Ridiculous, really!

If it still doesn’t work, do a soft reset with “racadm racreset soft”

5. Veeam

Yes, it’s not finished yet. I also found Veeam’s scheduled job had stopped working, as I am still using v5.0.1. There is a Veeam KB and an update (v5.0.2) for this issue, but I can’t explain why it worked for 3+ years and then suddenly stopped for no reason. So I removed all the old backups and created a new full backup; time will tell by tomorrow morning, when I shall verify the scheduled job again.

Update: I had to install the update in order to fix the scheduled job not running. Also remember to close all the extra TCP/UDP ports that were re-enabled by the upgrade of the Veeam B&R program. (Potential risk: Veeam Agent, NFS and Windows shares in particular)

Updated:

Restarting the management agents on ESX may help:

  1. Log in to your ESX Server as root (by su -) from either an SSH session or directly from the console of the server.
  2. Type “service mgmt-vmware restart”.
    Caution: Ensure Automatic Startup/Shutdown of virtual machines is disabled before running this command or you risk rebooting the virtual machines.
  3. Press Enter.
  4. Type “service vmware-vpxa restart”.
  5. Press Enter.
  6. Type “logout” and press Enter to disconnect from the ESX host.