VMware and IBM xSeries 3850 M2 Support Tickets

I have had two recent support tickets that illustrate the state of coding in the networking world. The first one involves VMware. We had been getting an error message ever since upgrading from ESX 3.0.2 to ESX 3.5 Update 2. The error message was “Unable to apply DRS resource settings on host (Reason: A general system error occurred: Invalid Fault).” We tried removing the host from the cluster, shutting off DRS on the particular cluster and then re-enabling it, and verifying the DRS settings for the pool. The error was effectively stopping many DRS recommendations from taking effect in this cluster and blocking hosts from entering maintenance mode.

We searched the VMware Knowledge Base and online communities, and this post was as close as we came to a solution. The problem continued over many days with many return calls to VMware. Finally, we reached the right person, who knew that the message related to in-progress VMware Tools installations. The solution was to restart the mgmt-vmware service and then cancel all pending VMware Tools installations. If that didn’t work, we were left with reinstalling the vpxa service on the host. The first fix solved the problem, but my frustration is that it took nearly a week and the error message was nowhere to be found online. I wish the VMware KB simply worked better and was more complete.

The other issue, in which vendor support let us down and wasted our time, was also related to our recent upgrade to ESX 3.5u2. Update 2 brought with it our freedom from IBM Director agents on our ESX servers, which never worked well in our environment. We were therefore excited to get upgraded. Well, after the install we were exposed to the “wonderful world” of CIM alerts. On all three of our big VMware ESX servers, IBM xSeries 3850 M2s, we saw alerts on the ServeRAID batteries. Two of them were in the learning phase, and the other one alerted that it was fully charged – a really helpful health alert.

I called IBM Tech Support that night and was told that learning mode was a normal thing, and that even though I did not know when it had started, I had to wait 24 hours from when I noticed the problem before IBM would work on it. 24 hours for the battery to “learn” how fully charged it is seems excessive, but I waited and was alerted, over and over, by VMware CIM alerts. 24 hours later the alerts were still present, so I thought I might as well update the firmware (a couple of months old) on the card and on the Baseboard Management Controller (BMC), which is where I believe the CIM gets its information.

Neither update resolved the problem, and 72 hours after initially noticing it, I was finally granted access to the inner sanctum of IBM support. That was short-lived, as the tech admitted there was a bug in the code of the ServeRAID card that prevents the battery from ever learning its state. The only fix was to effectively disable the battery and cache memory and use only “write through” cache to prevent data loss in a power outage. This would destroy performance, as disk writes would suddenly be back to disk speed as opposed to DRAM speed. The other option was to accept the risk of data loss.
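To put a rough number on that performance hit, here is a back-of-the-envelope sketch in Python. The two latency figures are illustrative assumptions of mine, not measurements from a ServeRAID card or the 3850 M2, so treat the result as an order-of-magnitude picture only.

```python
# Rough comparison of write acknowledgement time with the controller cache
# enabled (write-back) versus disabled (write-through).
# Both latencies are ASSUMED, illustrative values, not vendor figures.
DISK_WRITE_US = 5000   # assumed: one write committed to a spinning disk
CACHE_WRITE_US = 50    # assumed: one write acknowledged from controller DRAM


def total_ack_time_us(num_writes: int, per_write_us: int) -> int:
    """Time to acknowledge num_writes sequential writes, in microseconds."""
    return num_writes * per_write_us


writes = 10_000
write_back_us = total_ack_time_us(writes, CACHE_WRITE_US)
write_through_us = total_ack_time_us(writes, DISK_WRITE_US)
slowdown = write_through_us / write_back_us

print(f"write-back:    {write_back_us / 1000:.0f} ms")
print(f"write-through: {write_through_us / 1000:.0f} ms")
print(f"slowdown:      {slowdown:.0f}x")
```

With these assumed figures the slowdown is simply the ratio of the two latencies. Real numbers depend on the disks, the RAID level, and the workload.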

If this weren’t an ESX host, which is really an appliance with all the real data on the SAN, I might be upset that a $30,000-40,000 server couldn’t outperform my desktop while keeping data safe. The real issue, however, is again why I couldn’t be told about the problem on Tuesday as opposed to Saturday. Is it any wonder I have been working almost every night this week until 1 AM or later?



7 comments to VMware and IBM xSeries 3850 M2 Support Tickets

  • This is a very beautiful website, I have enjoyed my visit here very much. I’m very honoured to sign in your guestbook. Thanking you for the great work that you are doing here.

  • We have purchased two of these servers and noticed this problem as soon as I installed VMware ESX Server 3.5.0 Builds 110181 and 110268, but could not get the health state correct. Did you find a real solution (besides disabling the battery and memory and taking the performance hit)?

  • Ron Pettit

    hi, we also have this problem, i don’t want to perform the fix above, are there any other fixes? I will log a call with VMWare. Surely they can perform a fix or delete the alarm?

  • Donald

    I too have the exact same issue (and a few more using the 3850s).

    Were any of you guys able to get any support out of IBM/VMware, as I certainly have not.

  • At this point there is no fix to the above issues. I am still having the issues despite fully patching the CIMServer, ServeRAID, BMC, and BIOS. I am rather disgusted with the lack of IBM support for the aesthetic issue. From my discussions with VMware back level support, the solution will come from IBM when it comes, and it simply has not been made a priority yet (IMO).

  • Fuervo

    New FW for RAID boards have been out since Mar 20th. Battery issues are addressed in it.

    However, I’m not sure if it fixes the problem described here, as the exact Model/Type of the 3850 M2 was not specified, neither was the description of the RAID card mentioned.

    Anyway, http://www.ibm.com/support/fixcentral should get you going down the right road.
