Troubleshooting - VMware infrastructure

Blog Post created by steve.senecal on Jun 27, 2014

This morning around 10 AM, I was asked to join a "HURT" bridge because an application product was experiencing mainframe CPU response time issues. This application product is virtualized in our Cisco UCS infrastructure running VMware vSphere 5.0. When I joined the call, I asked for a brief overview. The duty manager provided the following: "Yesterday around 1700, a drop in CPU (meaning mainframe comms) occurred." The duty manager wanted to know whether there were any impacts, problems, or errors within the UCS and VMware infrastructure.


Now that you have a reference point: as IT managers we get these calls "a lot", but mine seem to land more in the 2 AM to 4 AM time slot. I guess I got lucky this morning.


So, I asked the 15 or so folks on the trouble bridge which product was impacted and whether someone could provide me the names of the VMs that were experiencing these timeouts. The team quickly sent a list of VM names via IM. While they were sending me those names, I was connecting to our VMTurbo Operations Manager web console. I logged into VMT and immediately went to the supply chain tab. I wanted to start with a high-level, bird's-eye view of whether there was a problem in my area of responsibility (which is all Cisco UCS and VMware for the company). I copied one of the VM names in question and searched for it by selecting "virtual machine" within the supply chain navigation section. I quickly identified which VMware ESXi host the VM was running on and what its "current" utilization was. This helped me determine which Cisco UCS POD the VM was running in (we have 80 UCS domains running within our data centers).


Meanwhile, the duty manager was commenting that Tivoli and Indicative were seeing errors coming from other Cisco UCS domains: "thermal errors in POD35, error accessing shared storage in POD48". I quickly reported back on the bridge that these errors did not contribute to the issue at hand, because POD25 was the UCS domain where all of the affected VMs were hosted. I went back to June 26 at 1700 within VMT, looked at the power and cooling utilization, and reported that it was normal. Northbound Ethernet/FC and southbound DCE network bandwidth were all below 2% utilization. I reported back that all of the vital UCS and VMware infrastructure was normal during the time period in question. The team then started looking at other areas (F5 load balancers and mainframe comms). The "HURT" bridge closed down within the hour (which was a first, as these things tend to drag on and on...).
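

For readers who like to see the reasoning written down, the triage step above (map the affected VMs to their UCS POD, then set aside alerts from other PODs) can be sketched in a few lines of Python. This is only an illustration of the logic, not VMTurbo's API or how the console works internally. The VM names, host names, and the `INVENTORY` structure are hypothetical; only the POD numbers and alert messages come from the call described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    host: str  # ESXi host the VM runs on (hypothetical names below)
    pod: str   # UCS domain ("POD") that contains the host


@dataclass(frozen=True)
class Alert:
    pod: str
    message: str


def pods_for_vms(vm_names, inventory):
    """Return the set of PODs hosting the given VMs; unknown names are skipped."""
    return {inventory[name].pod for name in vm_names if name in inventory}


def split_alerts(alerts, pods):
    """Partition alerts into (relevant, unrelated) based on their POD."""
    relevant = [a for a in alerts if a.pod in pods]
    unrelated = [a for a in alerts if a.pod not in pods]
    return relevant, unrelated


# Hypothetical inventory: VM and host names are made up for illustration.
INVENTORY = {
    "app-vm-01": Placement(host="esx-pod25-a", pod="POD25"),
    "app-vm-02": Placement(host="esx-pod25-b", pod="POD25"),
    "other-vm-01": Placement(host="esx-pod35-a", pod="POD35"),
}

# Alerts as they were read out on the bridge.
ALERTS = [
    Alert(pod="POD35", message="thermal errors"),
    Alert(pod="POD48", message="error accessing shared storage"),
]

if __name__ == "__main__":
    pods = pods_for_vms(["app-vm-01", "app-vm-02"], INVENTORY)
    relevant, unrelated = split_alerts(ALERTS, pods)
    print("Affected PODs:", sorted(pods))
    print("Relevant alerts:", [f"{a.pod}: {a.message}" for a in relevant])
    print("Unrelated alerts:", [f"{a.pod}: {a.message}" for a in unrelated])
```

With this made-up data, both affected VMs resolve to POD25, so the POD35 and POD48 alerts fall into the unrelated group, which matches what I reported on the bridge.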


In summary, I have used VMTurbo for several of these trouble calls, and it lets me quickly cut through all of the confusion "on the bridge" and drill down into the "data", or the facts. VMTurbo proved again this morning that it's not just a VM capacity management tool, but the virtualization infrastructure tool you must have in your IT virtualization toolbox!