How to Get Consumer Grade SSD to Work in Dell Poweredge R710!

By admin, April 11, 2012 10:48 pm

Around this time last December, I was searching for a reliable, cost-effective SSD for our vSphere ESX environment, as the ones from Dell are prohibitively expensive to implement.

Besides, Dell SSDs may have performance problems, as someone described on VMTN: those 1st-generation SSDs are either Samsung OEM (100GB and 200GB) or Pliant-based (149GB), the latter being much faster than the Samsung ones and of course much more expensive (overpriced, that is) as well. Both are eMLC NAND.

Anyway, I finally purchased a Crucial M4 128GB 2.5″ SATA 3 6Gb/s SSD (around USD 200 with a 3-year warranty), for a number of reasons.


My goal was then to put this SSD into a Dell PowerEdge R710 2.5″ tray and see how amazing it would be. I know it's not a supported solution, but there's no harm in trying.

The PERC H700/H800 Technical Guidebook specifically states that a SATA SSD is only supported at 3Gb/s, but I found out this is not true; read on.

1. First things first: I upgraded the Crucial firmware from 0009 to 0309, because in early January 2012 users found that the Crucial M4 has a 5,200-hour BSOD problem. It's still better than the Intel SSD's huge 8MB bug.

Correct a condition where an incorrect response to a SMART counter will cause the m4 drive to become unresponsive after 5184 hours of Power-on time. The drive will recover after a power cycle, however, this failure will repeat once per hour after reaching this point. The condition will allow the end user to successfully update firmware, and poses no risk to user or system data stored on the drive.

Something more to note: SandForce-based SSDs have a weakness in that as more and more data is stored on the drive, performance gradually decreases. The Crucial M4 is based on the Marvell 88SS9174 controller and doesn't have this kind of problem; it is more stable and its speed stays consistent even when 100% full of data.

In addition, the Crucial M4's garbage collection runs automatically at the drive level when it is idle, working in the background much like TRIM but independently of the running OS. Since TRIM is an OS-level command, it won't be used if the OS has no support for it (i.e., VMware ESX).

2. The most difficult part was actually finding the 2.5″ trays for the PowerEdge R710, as Dell does not sell them separately. Luckily I was able to get two of them off a local auction site quite cheaply; I later found out they might be counterfeit parts, but they worked 100% fine, only the color is a bit lighter than the original ones.


3. The next obvious thing was to insert the M4 SSD into the R710 and hope the PERC H700 would recognize the drive immediately. Unfortunately, the first run failed miserably with both drive indicator lights OFF, as if there were no drive in the 2.5″ tray.

Checking the OpenManage log, I found the drive was flagged as not certified by Dell (huh?) and was blocked by the PERC H700 right away.

Status: Non-Critical        2359        Mon Dec 5 18:37:24 2011        Storage Service        A non-Dell supplied disk drive has been detected: Physical Disk 1:7 Controller 0, Connector 1

Status: Non-Critical        2049        Mon Dec 5 18:38:00 2011        Storage Service        Physical disk  removed: Physical Disk 1:7 Controller 0, Connector 1

Status: Non-Critical        2131        Mon Dec 5 19:44:18 2011        Storage Service        The current firmware version 12.3.0-0032 is older than the required firmware version 12.10.1-0001 for a controller of model 0x1F17: Controller 0 (PERC H700 Integrated)

4. Then I found out the reason: older H700 firmware blocks access to non-Dell drives, so I had to update the PERC H700 firmware to the latest (v12.10.x) using USC again. Before the upgrade, I booted into the H700's ROM and confirmed the SSD was indeed not present in the drive pool. Anyway, the whole process took about 15 minutes to complete, not bad.
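
If you have OMSA installed in the ESX service console, you can also check the controller firmware level and whether the drive is visible from the command line; a minimal sketch, assuming the H700 is controller 0 (adjust to whatever your omreport output shows):

omreport storage controller controller=0   # shows the controller firmware version (should report 12.10.x or later)
omreport storage pdisk controller=0        # lists the physical disks the H700 can see
omreport system alertlog                   # shows the "non-Dell supplied disk drive" events quoted above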


5. After the server returned to normal, the Crucial M4 128GB SSD finally had its tray indicator lit, but it was only partly working correctly: the top indicator kept blinking amber (i.e., orange), "Not Certified by Dell" showed up in the OpenManage log, and this also caused the R710 front panel LCD to blink amber.

Besides, under Host Hardware Health in vCenter, there is one error message showing “Storage Drive 7: Drive Slot sensor for Storage, drive fault was asserted”

From the PERC H700 log file:

Status: OK        2334        Mon Dec 5 19:44:38 2011        Storage Service        Controller event log: Inserted: PD 07(e0xff/s7): Controller 0 (PERC H700 Integrated)

Status: Non-Critical        2335        Mon Dec 5 19:44:38 2011        Storage Service        Controller event log: PD 07(e0xff/s7) is not a certified drive: Controller 0 (PERC H700 Integrated)


Clearing the log in OpenManage turns the front panel LCD back to blue, but the SSD's top indicator still blinks amber. Don't worry, it's just the indicator showing it's a non-Dell drive.

Later, this was confirmed by a message in VMTN as well.

The issue is that these drives do not have the Dell firmware on them to properly communicate with the Perc Controllers. The controllers are not getting the messages they are expecting from these drives and thus throws the error.

You really won’t get around this issue until Dell releases support for these drives and at this time there does not appear to be any move towards doing this.

I was able to clear all the logs under Server Administrator.  The individual lights on the drives still blink amber but the main bezel panel goes back to blue.  The bezel panel will go back to amber again after a reboot, but clearing the logs will put it back to blue again.  Minor annoyance for great performance.

Update: If you have OM version 8.5.0 or above, you can now disable the "not a certified drive" warning completely! Strange that Dell finally listened to their customers after years of complaints.

In C:\Program Files\Dell\SysMgt\sm\stsvc.ini update the parameter to NonDellCertifiedFlag=no

6. The next most important thing was to do a VMFS rescan. ESX 4.1 found this SSD immediately (yeah!) and I added it to the Storage section for testing.
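
From the ESX 4.1 service console, the rescan can also be done with the usual esxcfg tools; a quick sketch, assuming the H700 shows up as vmhba1 (the adapter number will vary on your host):

esxcfg-rescan vmhba1      # rescan the adapter behind the H700
esxcfg-scsidevs -l        # the new Crucial M4 device should now be listed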


Then I tested this SSD with IOMeter. Wow, man! This SINGLE little drive blows our PS6000XV (14 x 15K RPM in RAID10) away: 7,140 IOPS on the real-life 100% random, 65% read test, almost TWICE the PS6000XV!!! ABSOLUTELY SHOCKING!!!

What this means is that a single M4 roughly equals 28 x 15K RPM drives in RAID10, an absolutely crazy number!

##################################################################################
TEST NAME                        Av. Resp. Time (ms)    Av. IOs/sec    Av. MB/sec
##################################################################################
Max Throughput-100%Read                 1.4239            39832.88       1244.78
Max Throughput-100%Write                1.4772            37766.44       1180.20
RealLife-60%Rand-65%Read                8.1674             7140.76         55.79
##################################################################################
EXCEPTIONS: CPU Util. 93.96%, 94.08%, 30.26%

[IOMeter screenshots: Max Throughput 100% Read, Max Throughput 100% Write, RealLife 60% Rand 65% Read]

So why would I spend 1,000 times more when I can get this result from a single SSD for under USD 200? (Later I was proven wrong: if you sustain the I/O, the EqualLogic stays at 3,500 IOPS while the SSD drops to a tenth of its starting value.)

Oh, one final good thing: the Crucial M4 SATA SSD is recognized as a 6Gbps device by the H700. As mentioned at the very beginning, the PERC H700 technical guidebook says the SATA SSD interface only supports up to 3Gbps; I don't know whether it's the latest PERC H700 firmware or the M4 SSD itself that somehow breaks that limit.


Let's talk a bit more about the PERC H700 itself. Most people know Dell's RAID controller cards have been LSI MegaRAID OEM since the PowerEdge 2550 (the fifth generation), and the PERC H700 shares many advanced features with its LSI MegaRAID counterparts.

Such as CacheCade, FastPath and SSD Guard, but ONLY on the PERC H700 1GB NV Cache version.

Optimum Controller Settings for CacheCade – SSD Caching
Write Policy: Write Back
IO Policy: Cached IO
Read Policy: No Read Ahead
Stripe Size: 64 KB

Cut-Through IO (CTIO), i.e. FastPath, is an IO accelerator for SSD arrays that boosts the throughput of devices connected to the PERC controller. It is enabled by disabling the write-back cache (i.e., enabling write-through) and disabling Read Ahead.
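
For those who do have the 1GB NV Cache model, the same Cut-Through IO style settings (write-through, no read-ahead, direct IO) can also be applied per virtual disk with LSI's MegaCli, which talks to PERC cards as well; this is only a hedged sketch, assuming adapter 0 and virtual disk 0, and flag spellings vary slightly between MegaCli builds:

MegaCli -LDSetProp WT -L0 -a0        # switch the virtual disk to write-through
MegaCli -LDSetProp NORA -L0 -a0      # disable read-ahead
MegaCli -LDSetProp Direct -L0 -a0    # direct IO, bypass the controller cache
MegaCli -LDInfo -L0 -a0              # verify the resulting cache policies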

So this means you can use LSI MegaRAID Storage Manager to control your PERC H700 or H800. In my case, I found my H700 does not support any of the above, as it's only the 512MB cache version. However, "SSD Caching = Enabled" shows in the controller properties under LSI MegaRAID Storage Manager and cannot be turned off, as there is no such option. I am not sure what this is (it's definitely not CacheCade); if you know, please let me know.

Now let's go a bit deeper into the PERC H700's bandwidth, as I found the card itself can reach almost 2GB/s, which again seems too good to be true!

The PERC H700 Integrated card with two x4 internal mini-SAS ports supports the PCIe 2.0 x8 PCIe host interface on the riser.

The PERC H700 is a PCIe 2.0 x8 card (500MB/s per x1 lane) with TWO x4 SAS 2.0 (6Gbps) ports, so each x4 port works out to roughly 4 x ~500-600MB/s, i.e., about 2,000MB/s (2GB/s) of usable bandwidth.

Each SATA III or SAS 2.0 link runs at 6Gbps, which is 750MB/s raw (about 600MB/s after 8b/10b encoding overhead), far more than any spinning drive can actually deliver; in reality it would take about SIXTEEN (16) 6Gbps 15K RPM disks (each around 120MB/s) to saturate one PERC H700 port's ~2GB/s of theoretical bandwidth.
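
Laid out as a quick back-of-the-envelope (the usable per-lane figures assume 8b/10b encoding overhead on the 6Gbps links):

PCIe 2.0 x8 host link : 8 lanes x 500MB/s        = 4,000MB/s
One x4 SAS 2.0 port   : 4 lanes x ~500-600MB/s   = roughly 2,000-2,400MB/s
16 x 15K SAS disks    : 16 x ~120MB/s            = roughly 1,920MB/s (about one port's worth)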

That a single Crucial M4 SSD is able to go over 1GB/s in both read and write really shocked me!

This means two consumer-grade Crucial M4 SSDs in RAID0 should easily be enough to saturate the PERC H700's total 2GB/s bandwidth.

The ESX storage performance chart also shows IOPS consistent with IOMeter's results (i.e., over 35,000 in sequential read/write).


Veeam Monitor shows 1.28GB/s read and 1.23GB/s write.


In fact, it's not just me: I found that many people have been able to reach this maximum of around 1,600MB/s, i.e., 1.6GB/s (yes, the theoretical figure is 2GB/s), with two or more SSDs behind a PERC H700.

Of course the newer PCIe 3.0 standard is about 1GB/s per x1 lane, so an x4 link gives you 4GB/s, double the PCIe 2.0 figure; hopefully someone will benchmark a PowerEdge R720 with its PERC H710 shortly.

Some will say using a single SSD is not safe. OK, then make it a RAID1, or even a RAID10 or RAID50 with 8 SSDs; with the newer PowerEdge R720 you can fit a maximum of 14 SSDs to create a RAID10/RAID50/RAID60 with 2 hot spares in 2U, more than enough, right?
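
If you want to script that instead of clicking through the Ctrl+R BIOS utility or the OpenManage GUI, OMSA's omconfig can create the virtual disk. This is only a rough sketch, assuming controller 0 and two SSDs reported as pdisks 1:6 and 1:7 (IDs follow the 1:7 style shown in the logs above; check your own omreport storage pdisk output and the OMSA CLI reference for the exact syntax on your version):

omconfig storage controller action=createvdisk controller=0 raid=r1 size=max pdisk=1:6,1:7 stripesize=64kb readpolicy=nra writepolicy=wb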

Most importantly, the COST IS MUCH, MUCH LOWER when using consumer-grade SSDs, and it's not hard to imagine 14 SSDs in RAID10 producing some incredible IOPS; I guess something in the range of 50,000 to 100,000 should be achievable without much problem. So why use those sky-high-$$$ Fusion-io PCIe cards, folks?

Finally, I also ran a few desktop benchmarks from within a VM.

Atto: [benchmark screenshot]

HD Tune: [random access and extra test screenshots]

Conclusion: for a low-cost consumer SSD, 7,000+ IOPS on the 100% random RealLife-60%Rand-65%Read test with a 32K transfer request size is simply amazing!!!

An Unconventional Rolls Royce Ghost

By admin, April 11, 2012 2:34 pm

Today in Central's business district I spotted this very different Ghost.

The owner's taste is definitely not the kind ordinary people can appreciate; some might even think it looks hideous and completely ruins the noble, elegant air of a Rolls Royce.

Of course, John Lennon's Phantom V in the 60s was just as unorthodox, but he was a founding father of British rock 'n' roll and a trendsetter, so this car probably belongs to some celebrity in Hong Kong's advertising or fashion and entertainment circles.

Taste aside, there is one thing the owner has certainly accomplished: catching the eye of everyone on the street.


SmarTone CEO’s Apology Letter reminds me…

By admin, April 11, 2012 10:16 am

The NWT data center incident in 2009. There was a sudden power surge that tripped the electrical circuit, and a diesel generator was supposed to provide backup power, but guess what? There wasn't enough diesel fuel in the tank. Wow! This is so similar to SmarTone's data center downtime, and I am surprised they don't have an automatic real-time failover site; that's strange for a listed company with a few million subscribers in Hong Kong, and a big question mark over their so-called Level 3 ISO-approved data center.

Now the government has suggested that other carriers should temporarily provide 2G/3G service for any carrier facing downtime. I think that's pure dreaming; why would they, anyway?

I bet One2Free has the happiest face this time watching SmarTone sink, as SmarTone wasn't so nice to them previously either. :)


Dear SmarTone Customer,

As a valued customer, I would like to apologise to you for Monday’s service disruption and inform you of what actually happened, and what we have done to resume services and minimise disruption.

On 9 April at around 8:00 am, the building in which one of our three switching centres is located, suffered a total power failure disrupting the power supply to the entire building. Our switching centre’s standby battery system immediately substituted the failed power source and there was no disruption to service. Our backup generator then commenced operation, successfully taking over from the standby battery system.

At approximately 10:35am a component in the starter circuit board of the backup generator broke down unexpectedly, causing the generator to stop operating. This also caused a huge surge of electrical current that triggered circuit breakers to disconnect the cellular switching system from the standby battery system. The resulting power outage disrupted cellular service in several areas in Hong Kong and selected MTR stations in Kowloon.

Emergency restoration and recovery procedures started immediately and service was restored progressively from 12:15pm onwards. Aside from a small number of selected services, voice and mobile internet services were largely back to normal at 1:00pm. From 2:30pm onwards, all SMS services had returned to normal. The Company’s two remaining switching centres operated as normal during this incident.

We carry out inspections and tests of all power generation facilities regularly. The backup generator which caused the outage successfully passed recent inspections and tests on 22 February. We will submit an incident report to the Office of the Communications Authority and will re-evaluate our procedures for informing the public and its customers.

We regret the inconvenience caused to you and we will thoroughly investigate both the reasons for the building’s power outage and the failure of our power backup systems in order to ascertain the root cause. We are determined to learn from these findings and make improvements to prevent similar incidents in future.

Yours sincerely,

Douglas Li
CEO

Equallogic PS6100 New Feature: Vertical Port Failover

By admin, April 10, 2012 4:20 pm

Finally figured out what "Vertical Port Failover" is; it's a cool feature to maintain the original full 4Gbps bandwidth! Is it really useful? To be honest, I don't think so, but it's fun to have.

According to March 2012 Dell EqualLogic Configuration Guide v13.1 12:

In PS Series controllers prior to PS4100/6100 families, a link failure or a switch failure was not recognized as a failure mode by the controller. Thus a failure of a link or an entire switch would reduce bandwidth available from the array. Referring to Figure 5, assume that CM0 is the active controller. In vertical port failover, if CM0 senses a link drop on the local ETH0 port connection path, it will automatically begin using the ETH0 port on the backup controller (CM1). Vertical port failover is bi-directional. If CM1 is the active controller then vertical port failover will occur from CM1 ports to CM0 ports if necessary.


With PS4100/PS6100 family controllers, vertical port failover can ensure continuous full bandwidth is available from the array even if you have a link or switch failure. This is accomplished by combining corresponding physical ports in each controller (vertical pairs) into a single logical port from the point of view of the active controller. In a fully redundant SAN configuration, you must configure the connections as shown in Figure 20.


In a redundant switch SAN configuration, to optimize the system response in the event you have a vertical port failover you must split the vertical port pair connections between both SAN switches. The connection paths illustrated in Figure 6 and Figure 7 show how to alternate the port connection paths between the two controllers. Also note how IP addresses are assigned to vertical port pairs.

The Lonely Porsche (reposted article)

By admin, April 10, 2012 9:45 am

Many people instinctively assume that a Porsche is a "girl-hunting chariot", its greatest use being to show off and to seduce: lure a girl into the car, roar off, drive straight to an hourly hotel and party wildly in a room.

They could not be more wrong.
If you need a Porsche to pick up girls, the win is a hollow one, just like needing Viagra to get someone into bed: even if you succeed, there is no pleasure or sense of achievement in it. Contrary to the common misconception, the Porsche is really a "car of solitude"; the greatest joy of driving one lies in shutting the world out, confining yourself inside the car, fully concentrated, savouring the sense of control that comes through your two palms. Precise, ferocious, explosive, like two currents of electricity pouring endlessly into your body, jump-starting your passion the way a phone is charged, giving you back the urge to collide head-on with life.
A Porsche is like a wild horse that knows how to talk to you; once you have tamed it, it is devoted to you unto death, charging ahead and guarding your back according to your commands and hints, keeping you company as you carve out one exhilarating path after another through the mortal world. Once in the saddle, you can no longer bear to dismount; the rumble of the engine makes you forget the existence of time, even of yourself. All you see are shadows flickering and streaming past on either side; you are one beam of strong light within the current, shooting forward, and when you reach your destination a twist of the foot moves the sole from throttle to brake, everything stops abruptly, and you and the car return to silence together.
Which is why having a passenger in the car is rather a killjoy. You have to divide your attention to talk to her, you have to avoid frightening her, you cannot turn the music up as loud as you would like, and you cannot experience every inch of the car's power and tenderness at will. With food, sharing doubles the pleasure, but driving a Porsche is the exact opposite: only in solitude can you indulge fully. You would not want a "third party" in the cramped cabin disturbing your concentration, because this is an intimate date between you and your Porsche.
A car of solitude: always leave that one seat in the Porsche empty. Only then do you do justice to Professor Porsche, and to yourself.

- 馬家輝

Firmware Updates with Dell OpenManage Essentials (OME)

By admin, April 9, 2012 7:42 pm

The device discovery part turned out to be much easier than expected, probably because the firmware on all my hardware was updated recently and SNMP was already set up correctly.

After you set the IP ranges and SNMP community strings under Discovery and Inventory, OME quickly collects all the PowerEdge servers, EqualLogic arrays, PowerConnect switches and PowerVault arrays, and finds the corresponding iDRAC and OpenManage addresses automatically.

In fact, Dell OpenManage Essentials (OME) is simply a central portal that collects every device on your network; you can view and manage them from one window and, best of all, receive server fault alerts via email. It serves the same purpose as the long-lived Dell IT Assistant, but MUCH BETTER!

One of the best features, IMO, is System Update (BIOS/firmware) under Manage for your ESX servers: it automatically pulls the latest firmware catalog from Dell's FTP site, compares it with your servers' firmware levels, and lists the newer versions for you to choose from; it even pulls hard disk firmware, to be precise!

I haven't tested and performed the update yet, but I guess it works on the same principle: System Update uses iDRAC and USC to patch your system BIOS and various firmware. I suspect it won't automatically put your ESX host into maintenance mode first, so you need to do that manually yourself.

So there is no need to use Repository Manager any more, saving lots of time building your own catalog, and no need to use the Dell Management Plug-in for VMware vCenter (DMPVV) either; that's why I've been emphasizing that Dell should make it free of charge.

Some other interesting features: OME can automatically pull warranty information from Dell based on the Service Tags it gathers, and it reports very detailed hardware information for the servers in its inventory.

Finally, some words about the License Manager that comes with OME: it collects all your iDRAC7 license information and manages it from one console. However, it works only for 12th-generation PowerEdge servers and iDRAC7, none of which I currently have, so I uninstalled it.

By the way, since when did Dell start copying HP and charging a remote console KVM license just like iLO? :)

Anyway, I do think Dell OpenManage Essentials (OME) will replace ITA shortly (in my case, I think I will get rid of ITA tomorrow), and it's a solid management tool for every Dell server administrator.


Update April 14, 2012

I've completely re-configured everything in OME; it's much easier and, most importantly, it discovers the inventory and receives SNMP traps much faster than IT Assistant, so my ITA can finally retire. Thank you very much, Dell, for making a great product free!

Update April 18, 2012

How to Configure SNMP Trap Destinations on Linux Server

For OpenManage Essentials to display alerts for a device, you must configure the device to send traps to the OpenManage Essentials server.

To configure a trap destination:

1. Open the file for editing:
/etc/snmp/snmpd.conf

2. Add the following line to the file:
trapsink <OME IP Address> <community name>

where <OME IP Address> is the IP address of the OpenManage Essentials server (the management station) and <community name> is the SNMP community name; a concrete example follows below.

3. To enable the changes, restart the SNMP agent:
/etc/init.d/snmpd restart

Note – this step is only needed once after all configuration changes are completed.
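
Putting steps 2 and 3 together, with made-up values (OME server at 192.168.1.50, community name public):

# /etc/snmp/snmpd.conf
trapsink 192.168.1.50 public

/etc/init.d/snmpd restart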

Update June 23, 2021

After almost 10 years, the latest OME Catalog.cab update can no longer be applied because the catalog format has changed; OME version 1.01 always falsely shows the latest signature date as 03/06/2012.

Originally I thought the only way to update it was to manually download the latest catalog from https://downloads.dell.com/catalog/catalog.cab, then use Select a Catalog Source > Use repository manager file and browse to the saved catalog.cab.

But it turned out the catalog format has changed since then, so the only solution is to get the latest version, 2.5 (a 1GB download).

In fact, OME (OpenManage Essentials) 2.5 is going to be the last version; Dell is moving to OpenManage Enterprise for Dell EMC devices.

HIT/VE, ASM/VE and VMware Thin Provision Stun (Equallogic FW v5.2.2) on ESX 4.1

By admin, April 9, 2012 1:11 pm

In case you don't know, VMware Thin Provision Stun is actually the 4th VAAI primitive, but it was hidden when VAAI was released at the end of 2010. In case you also don't know (this is not directly related, though), I just learned that VMware's HA technology actually came from Legato, which EMC (VMware's parent company) acquired back then.

Since then, storage vendors have started to integrate this great feature into their firmware upgrades throughout 2011.

EqualLogic incorporated VMware Thin Provision Stun starting with firmware v5.1, though I found it was present (hidden) even as early as v5.0.2; it worked out of the box even with ESX 4.1.

I didn't notice much difference in the volume properties under Group Manager after I upgraded the PS6000XV to firmware v5.2.2. It was only later, when I added a new volume using HIT/VE (which requires FW v5.1 or above as well as MEM v1.1.0), that I discovered there is an extra option (Enable VMware Thin Provision Stun) to create the volume with VMware Thin Provision Stun.


Double-checking: log in to the EQL Group Manager, and the volume properties now have a new line:

TP warning mode: Leave online, generate initiator write error


You can also verify this by creating a test volume with the EQL Group Manager; the thin-provisioning mode options are now selectable in the new EqualLogic firmware.


Oh, many may ask what this Thin Provision Stun does exactly. Basically, when your EqualLogic thin volume runs out of space (i.e., the maximum in-use space reaches 100%), instead of taking the whole volume offline and letting every VM on it crash, it now ONLY suspends (stuns) the VMs that keep requesting more space. The VMs on the same volume that don't need more space for the time being keep working without any problem.


Back to the HIT/VE installation: the whole setup process is pretty simple. The only catch is that you need to create a new port group on the iSCSI vSwitch so that HIT/VE can access the EqualLogic array; many system administrators may consider this insecure.
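
If you prefer the ESX 4.1 service console over the vSphere client for that step, the port group is a one-liner; a sketch assuming the iSCSI vSwitch is vSwitch1 and the port group is named HITVE (both names are placeholders):

esxcfg-vswitch -A "HITVE" vSwitch1         # create the port group on the iSCSI vSwitch
esxcfg-vswitch -v 100 -p "HITVE" vSwitch1  # optionally tag it with your iSCSI VLAN (100 is a placeholder)
esxcfg-vswitch -l                          # verify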

One thing I really like, besides the VMware Thin Provision Stun feature, is ASM/VE. It OFFLOADS the backup process from the ESX or vCenter Windows server to the EqualLogic array itself; think of it as a kind of VAAI offload for backup. Veeam's approach is to offload the backup to its backup proxy, but EqualLogic's internal backup (snapshots) is still way faster, if you have the luxury of space to spare, of course.

The result (a smart copy, i.e., a snapshot) is still stored on your EqualLogic box; to many this is not wise, as storing cold backup data on an expensive SAN doesn't make sense (you can't store it outside the volume being backed up, so if your volume is on RAID10 or SSD, the snapshots live there as well; is that still true? Maybe I'm wrong).

There seems to be no way to store the snapshot off-host either (maybe there is; I saw one white paper showing how to use Symantec Backup Exec to offload the snapshots from the EQL box with ASM/VE).

Oh, there is another great feature of ASM/VE: Smart Clone! It has real, unique value and is extremely useful for application testing or trialling a major patch on your VM.

Finally, I really like the HIT/VE GUI, which is very simple and intuitive! Configuring a new EqualLogic volume via HIT/VE is a piece of cake now: it creates everything for you automatically, with no more manual iSCSI initiator access configuration, VMFS rescans or attaching volumes to ESX hosts. A newly created VMFS is ready within 5 clicks; that's the beauty of EqualLogic, and guess what? It's free of charge, as usual!

Running Mac OS X Lion 10.7.3 on VMware Workstation v8.0.2

By admin, April 9, 2012 1:50 am

The installation is pretty easy once you find the correct link (also here). In fact, there is no installation involved, as the VM has already been fully configured. All you need to do is search for VMware Tools for Mac OS X, put it on a USB stick and install it from within the VM.

I gave Mac OS X the maximum configuration: 2 vCPUs with 2 cores each, so 4 vCPUs in total, plus 8GB of RAM. It runs almost as smoothly as native, and lightning fast on the SSD: it took only 5 seconds to boot to the following screen. The good thing is I can run it full screen, watch HD movies, play games and download apps. I noticed the CPU load on my OptiPlex 990 SFF i5-550 hits almost 90% across all 4 cores when the VM is fully loaded.
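
For reference, the corresponding .vmx entries would look roughly like this (a sketch; these are the standard VMware keys, with values matching the configuration above):

numvcpus = "4"               # 4 vCPUs in total
cpuid.coresPerSocket = "2"   # presented as 2 sockets x 2 cores
memsize = "8192"             # 8GB of RAM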

I haven't touched Apple's OS for almost 12 years; it's exciting to see the familiar face again in a virtual world.

I'm really starting to love VMware Workstation for the unlimited possibilities it offers. The next targets will be nested hypervisors, an ESX 5 cluster, View 5, and plugging my iPhone into this Mac VM to sync things with iCloud. Cool!


Thoughts about Dell Management Plug-In for VMware vCenter (DMPVV)

By admin, April 9, 2012 12:12 am

Honestly, I have deployed DMPVV multiple times in order to get it right.


1. You really need to make sure you have read the Dell OpenManage Software Compatibility Matrix before installing any OM software, because you need to upgrade the BIOS/iDRAC/Lifecycle Controller firmware to the respective minimum versions stated in the guide.

2. DMPVV DOES NOT NEED a DHCP server.

I even created a W2K8 R2 DHCP server, but then found there is a place in the menu to configure a fixed IP for the appliance.

3. DMPVV cannot start in vCenter, complaining about some weird permission problem, something like Access Denied!

The problem was that I registered vCenter in DMPVV by IP but used the hostname when logging in with the vSphere client; after I changed the hostname to the IP, everything worked. Probably my DNS isn't working properly; anyway, IP is fine in my case.

4. The Connection Profile didn't work; then I found out you have to turn on Remote Enablement during the installation of OMSA on ESX.

The reason you see that error message about "OMSA is not installed" could be that when you installed OMSA, you didn't install it with the -c option, which installs the "Remote Enablement" component of OMSA. Our appliance talks to OMSA through its Remote Enablement layer. Without a successful connection to OMSA, the iDRAC connection will fail too, as we correlate the correct iDRAC IP with the server by getting the iDRAC IP from OMSA first. Please reinstall OMSA with the -c option and that should solve your issue. Once you pass the connection test in the connection profile, please make sure to run inventory from the Job Queue by clicking "Run Now".

That was indeed the fix, and the -c option is listed in the user guide for the command line; however, the user guide gives no explanation of why it needs to be there, which is why I had not re-run the OMSA 6.4 installer. Perhaps the OMSA team could fold the -c switch into the -x (express) switch so that it is automatically included? Also, according to the OMSA 6.4 install manual, running the -x switch performs an express setup with all options included and ignores any other switches; apparently this is not true.

For OMSA 7.0

Run the following command to perform an express install with Remote Enablement parameters:
sh linux/supportscripts/srvadmin-install.sh -c -x

-c is for Remote Enablement
-x is for Express

Then start the applicable services by running the following command:
sh linux/supportscripts/srvadmin-services.sh start

5. License Disappeared

Sometimes after a reboot or a DMPVV reset to factory defaults, the license disappears and I have to re-deploy the whole thing again. Well, the last successful re-install only took me 5 minutes, as I have done it over 6 times. :)

6. CANNOT contact iDrac (SOLVED, but UNSOLVED for the time being)

My iDRAC subnet is on a separate switch, so the COS service console obviously won't reach it. This is by design, as I want to physically separate all network segments, and it's not routable. I know that for DMPVV to work by default, the COS and DRAC networks have to be on the same network segment, which is NOT SECURE as far as I'm concerned. Why doesn't DMPVV give us an option to specify the subnet for iDRAC and add another network adapter for this purpose?

During my research, I also found that by pressing Alt+F2 and logging in as readonly with the default admin password, you can perform some network troubleshooting such as ping and tracepath.

Anyway, I still can't figure out a way to route the traffic from DMPVV to the iDRACs via the vCenter server without using an L3 router or firewall device. Is it possible to use route add on the vCenter Windows server to redirect the DMPVV traffic to the iDRACs? If you know, please let me know.
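
For what it's worth, the Windows syntax would be something like the line below (the subnet and gateway are placeholders). Note, though, that a static route added on the vCenter server only affects traffic originating from that Windows host itself; the DMPVV appliance still needs an L3 hop of its own to reach the iDRAC subnet, which is exactly the problem described above.

route -p add 10.10.20.0 mask 255.255.255.0 10.10.10.1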

So I was not able to test the firmware upgrade feature, but I am 100% sure it utilizes iDRAC's USC firmware update feature to fetch firmware from ftp.dell.com and then perform the upgrade in the background, the same as if you rebooted the server and pressed F10 for USC.

7. Service Temporarily Unavailable

The DMPVV web server kept crashing, probably because I gave it only 1GB of RAM (reduced from 3GB); after I changed it to 2GB, it stopped crashing, but loading the host page is still extremely slow, around 2 minutes. Oh, DMPVV is a resource eater, fully taking up 2GB of RAM and 100% of a CPU when it's connecting to the host.

The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.

Apache/2.2.3 (CentOS) Server at xxx.xxx.xxx.xxx Port 443

8. Proxy to get firmware

This is related to the same issue as point 6 above (see, it was going to bite back anyway): there is no direct Internet connection from within DMPVV, so you have to use a proxy server to download firmware. A better way would be adding a third network adapter connected to your external port group; I hope this can be changed in the next release.

Finally, some good reading (3 parts in total) can be found on the Virtual Life Style website.

One of the coolest features I want to highlight is the PXE-less provisioning of the hypervisor to a physical server. This uses a combination of the Lifecycle Controller and iDRAC to deploy an installation ISO to the server. And since it is really tightly integrated with the VMware stack, the host is added to vCenter and configured using Host Profiles automatically, resulting in a true zero-touch deployment of a server. How cool!

In fact, there is a how-to video on "Auto Discovery & Hypervisor Deployment – Dell Management Plug-In for VMware vCenter".

One last thing to add: I do think the Dell Management Plug-In for VMware vCenter (DMPVV) is simply a proxy between the ESX hosts and vCenter, and it should still be made free, as all of its features can be achieved by using Dell's other server management products together. DMPVV is just a fancy toy that bundles them into a single product.

FYI, 12th-generation servers such as the R720/R620 don't have to use OMSA, as the new integration is completely agentless and no longer depends on OMSA agents within the ESX hosts for DMPVV to work.

Poweredge R710 Firmware Update Causes PCI Interrupt Conflict, H700 Raid Disappeared!

By admin, April 8, 2012 11:33 pm

Dell Management Plugin for vCenter (DMPVC) v1.5 was released on April 5th with 1 free host license, so I couldn't wait to test this management software. I read through the prerequisites and found that many components such as the BIOS, iDRAC, RAID controller and Lifecycle Controller must be updated to the latest versions before DMPVC can work properly.

So the next obvious thing was to go ahead and update my PowerEdge R710 server. Of course there are many ways to update PowerEdge firmware when running ESX; Repository Manager is one of them, probably the best one, but today I found the connection to ftp.dell.com extremely fast (about 30-50M/s), so I took the shortcut and used F10 USC (Unified Server Configurator) to perform the update via FTP.

I pulled the firmware catalog (again, it's 2 months out of date, but new enough for DMPVC) and selected everything (BIOS, iDRAC, H700, Broadcom, LC) except iDRAC (v1.8), as I wanted to do that last while I was using the iDRAC remote console to perform the updates.

The whole thing took about 1 hour to complete the 12 steps. I noticed something funny when the last reboot started: the R710 complained about a Plug and Play Configuration Error and the H700 RAID menu had completely gone! I booted into the BIOS and found the H700 was no longer listed in the available storage devices. What?

OK, was it something to do with the old iDRAC version again? I then used USC to update iDRAC to v1.8 and lost the console when the iDRAC restarted. After the iDRAC came back, I couldn't log in to the iDRAC web GUI. Damn! I called up the data center and asked the staff to change the iDRAC password (Ctrl+E), and guess what? I still couldn't log in!

This left me no choice but to physically travel to the data center. After 30 minutes I was sitting in front of the LCD beside my rack, scratching my head, unable to figure out why my H700 had disappeared and why I couldn't log in to iDRAC.

I called local Pro-Support (HK) and, as usual (this really makes me think I should NOT PURCHASE ANY MORE PRO-SUPPORT LEVEL SERVICE IN THE FUTURE), got useless advice: pull out the DRAC card and insert it again. In the end they ordered a replacement H700 (to arrive in 1-2 hours) and a support engineer to come within 3-4 hours.

While chatting with the Pro-Support staff, I googled a bit and found someone with EXACTLY THE SAME SYMPTOM as mine! The problem is that updating the two extra quad-port Broadcom 5709C cards resets them to defaults with their option ROM enabled (the on-board Broadcom LOMs remain disabled though, strangely). This ROM CONFLICTS with the H700's ROM, leading to a PCI interrupt conflict that blocks the H700 RAID configuration ROM from showing up. After disabling the Broadcom ROM port by port (it has to be done 8 times), the H700 was back online, and the local Pro-Support engineer told me they had never encountered such a thing before. OK, I must say that over the last 10 years I have solved my own hardware problems more often than they have; I can do a MUCH BETTER JOB and am more knowledgeable than they are, so why am I paying a premium for the Pro-Support service level? @#$@#!!!@!!! No more! A loss for Dell!

For the strange iDRAC problem, the solution was to reset it to factory defaults and reconfigure the IP and password again.
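
For the record, if you can still reach the iDRAC locally (e.g., from an OS with the racadm tools installed), the same reset and re-configuration can be scripted; a sketch against an iDRAC6, with placeholder addresses:

racadm racresetcfg                                                # reset iDRAC to factory defaults (credentials revert to root/calvin)
racadm config -g cfgLanNetworking -o cfgNicIpAddress 10.0.0.120   # re-apply the static IP
racadm config -g cfgLanNetworking -o cfgNicNetmask 255.255.255.0
racadm config -g cfgLanNetworking -o cfgNicGateway 10.0.0.1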

So this is my third nightmare experience performing firmware upgrades on 11th-generation PowerEdge servers. I never thought a NIC ROM would conflict with a RAID card's ROM, and never thought iDRAC would block access after a firmware upgrade.

Later, I was able to reproduce the same symptom on another PowerEdge R710 with the same hardware configuration, i.e., the H700 disappeared and there was no more Ctrl+R menu, because the Broadcom 5709 ROM conflicts with the H700 ROM!

The good thing after all the trouble is that I was finally able to use the Dell Management Plugin for vCenter to discover the host and do things as it should.

One thing I noticed: R710 BIOS v6.0.7 has a new virtualization feature called SR-IOV, good for 10Gb/s cards with DCB, but I don't have those; it will be useful if I upgrade to 10Gb/s later.

Last but not least, I found that vMotion no longer worked after the host exited maintenance mode: it complained that the CPUs are not compatible with the source host when I tried to vMotion the VMs back so I could upgrade the next ESX host. It turned out that the latest BIOS v6.0.7 also updated the CPU microcode to step B, so it's not the same as v2.0.9. So how was I going to migrate my VMs to this new host and perform the upgrade on my existing ESX host? Luckily, I had the luxury of powering down those few VMs for a short period and cold-migrating them to the next host, so problem solved, haha. Ultimately it would be great if my client had a 3-node cluster, of course.

Finally, I forgot that iDRAC is tightly integrated with the Lifecycle Controller, so I should upgrade iDRAC first, then the BIOS and all the other components such as the H700 RAID controller and Broadcom NICs, assuming your iDRAC still lets you log in to the web GUI afterwards, that is, huh?!

In fact, I updated all the iDRAC firmware again to v1.85 today by uploading the .bin file directly via iDRAC's web GUI from a Windows host; it's a much simpler and cleaner way to do it.

One more thing to take care of: you can't upgrade the iDRAC while you are still connected to the remote console using the USC method; it will perform the upgrade but report some kind of conflict, so use the alternative method above to avoid this problem.

On a Windows system it's a much easier job: just apply the firmware updates one by one in the same session, and after everything has been updated, reboot the server.

Anyhow, I would suggest using Dell Repository Manager for your next BIOS upgrade on 11th-generation PowerEdge servers.
