Performance and other monitoring is an important operational issue. Many customers monitor the health of their hardware and servers by running hardware and performance agents. Although hardware agents should monitor the health of the ESX Server, they should not monitor the health of a VM, because the virtual hardware presented to a VM is not the physical hardware. In addition, most agents talk to specific chips, and these do not exist inside a VM. Running hardware agents inside a VM will therefore often just slow the VM down.
Because these agents can adversely affect performance, measuring performance is an important tool for the Virtual Infrastructure: it tells you when to invest in a new ESX Server and how to balance load among the ESX Servers. Although there are automated ways to balance the load among ESX Servers (covered in Chapter 11, “Dynamic Resource Load Balancing”), most if not all balancing of VM load across hosts is performed by hand, because there are more than just a few markers to review when moving VMs from host to host.
The first item to understand is that adding a VM to a host will impact the performance of the ESX Server, sometimes in small ways and sometimes in ways that are far more noticeable. The second item to understand is that the performance tools that run within a VM depend on real clock cycles to determine the performance of the VM, and a VM is not always given full clock cycles. Because there are often more VMs than CPUs or cores, a VM shares a CPU with others, and as more VMs are added, the slice of time each VM gets on a CPU shrinks further. There is therefore a greater lag between each scheduled run on the CPU, and in effect a longer CPU cycle as seen from inside the VM. Because performance tools use the CPU cycle to measure performance and to keep time, the data they report is relatively inaccurate. The experimental VMware descheduler tool can be used to counteract this effect. When the system is loaded to the desired level, a set of baseline data should be gathered.
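As a simplified illustration of this measurement skew (a toy model, not VMware's actual CPU accounting; the function names and numbers are invented for the example), consider a tool inside the guest that divides busy time by the time it believes has elapsed, while the hypervisor has only scheduled the VM for a fraction of real wall time:

```python
def guest_reported_utilization(busy_time, scheduled_time):
    """What an in-VM tool reports: busy time divided by the time
    the guest actually ran (the only time its clock advanced)."""
    return busy_time / scheduled_time

def host_level_utilization(busy_time, wall_time):
    """What a tool outside the VM sees: busy time over real
    wall-clock time on the ESX Server."""
    return busy_time / wall_time

# A VM kept busy for 5s of CPU, but scheduled for only 5s out of
# 10s of wall time, looks 100% busy from the inside...
inside = guest_reported_utilization(5.0, 5.0)   # 1.0
# ...while the host knows it consumed only half a CPU.
outside = host_level_utilization(5.0, 10.0)     # 0.5
```

The gap between `inside` and `outside` is exactly why the in-VM numbers are only useful as a baseline, never as absolute values.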
Once a set of baseline data is available, the performance tools internal to the VM can determine whether a change in performance has occurred, but they cannot give you raw numbers, just a ratio of change from the baseline. For example, if the baseline CPU utilization measured from within the VM is roughly 20% and suddenly shows 40%, we know there was a 2x change from the original value; the original value is not really 20%, but some other number. Even so, a 2x change in the VM's reported utilization does not imply a 2x change in the actual server utilization. Therefore, tools that do not run from within the VM are needed to gain real performance data; the in-VM tools, although useful for baselines, are not useful overall. In this case, VC, a third-party VM manager, ESXCharter, and esxtop from the command line are the tools to use. These all measure VM and ESX Server performance from outside the VM, giving a clearer picture of the entire server. The key item to realize is that sustained utilization above 80% on an ESX Server, as measured by VC or one of these tools, means a new ESX Server is warranted and the load on the ESX Server needs to be rebalanced.
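The baseline arithmetic and the 80% rule of thumb can be sketched as follows (a toy calculation to make the reasoning concrete, not the logic of any VMware tool):

```python
def change_from_baseline(baseline_pct, observed_pct):
    """Ratio of the current in-VM reading to the baseline reading.
    Only the ratio is meaningful; neither number is a true value."""
    return observed_pct / baseline_pct

def host_needs_rebalancing(host_samples, threshold=80.0):
    """True when utilization measured from OUTSIDE the VMs (VC,
    esxtop, etc.) stays above the threshold for every sample,
    i.e., the overload is sustained rather than a spike."""
    return all(sample > threshold for sample in host_samples)

print(change_from_baseline(20, 40))              # 2.0
print(host_needs_rebalancing([85, 88, 91, 84]))  # True  (sustained)
print(host_needs_rebalancing([85, 60, 91, 84]))  # False (transient)
```

Note that `change_from_baseline` works on the unreliable in-VM numbers (only the ratio matters), while `host_needs_rebalancing` should only ever be fed measurements taken from outside the VM.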
Balancing of ESX Servers can happen daily, or even periodically during the day, by using VMotion technology to migrate running VMs from host to host with zero downtime. Although this can be dynamic (covered in Chapter 11, “Dynamic Resource Load Balancing”), using VMotion by hand gives a better view of the system and the ability to rebalance as necessary. For example, if an ESX Server’s CPU utilization goes to 95%, the culprit VM needs to be found using one of the tools; once found, it can be moved to an unused or lightly used ESX Server using VMotion. If this movement becomes normal behavior, it might be best to place the VM on a less-used machine permanently. This is often the major reason an N+1 host configuration is recommended.
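The by-hand rebalancing decision just described amounts to something like the following sketch (host and VM names are made up, and real decisions weigh memory, network, and disk alongside CPU):

```python
def pick_vmotion_move(hosts, threshold=80.0):
    """hosts maps host name -> {vm name: CPU %}.  If the busiest
    host is over the threshold, return (vm, source, target): the
    hungriest VM on that host and the lightest host to receive it
    via a manual VMotion.  Returns None if no move is needed."""
    load = {h: sum(vms.values()) for h, vms in hosts.items()}
    busiest = max(load, key=load.get)
    if load[busiest] <= threshold:
        return None
    culprit = max(hosts[busiest], key=hosts[busiest].get)
    target = min(load, key=load.get)
    return culprit, busiest, target

hosts = {
    "esx01": {"web1": 40.0, "db1": 55.0},   # 95% -- overloaded
    "esx02": {"web2": 20.0},                # lightly used
    "esx03": {"app1": 30.0, "app2": 25.0},
}
print(pick_vmotion_move(hosts))  # ('db1', 'esx01', 'esx02')
```

The lightly used target host is exactly the spare capacity that an N+1 configuration keeps available.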
Another item that can increase CPU utilization is the deployment of VMs. Deployment is discussed in detail in a later chapter, but the recommendation is to create a deployment server that can see all LUNs. This server would be responsible for deploying any new VM, which allows the VM to be tested on the deployment server until it is ready to be migrated to a true production server using VMotion.
For example, a customer wanted to measure the performance of all VMs to determine how loaded the ESX Server could become with their current networking configuration. To do so, we explained the CPU cycle issues and developed a plan of action. We employed three tools: VC, vmkusage (now rolled into VC 2.0), and esxtop running from the service console. We found that the granularity of VC was coarser than that of vmkusage, which in turn was coarser than that of esxtop. For performance-problem resolution, esxtop is the best tool to use, but it spits out reams of data for later graphing. VC averages data over 30-minute or larger increments unless the graph is exported, in which case the raw data can be manipulated in Excel; vmkusage averages all data over 5-minute intervals; esxtop uses real, not averaged, data. The plan was to measure performance with each tool as each VM ran its application.

Performance of ESX truly depends on the application within each VM. It is extremely important to realize this and, when discussing performance issues, not to localize to a single VM, but to look at the host as a whole. This is why VMware generally does not allow performance numbers to be published: without a common set of applications running within each VM, you are comparing apples to oranges. It is best to do your own analysis using your own applications, because one company’s virtualized application suite has nothing to do with another company’s, and the differences have major implications for performance.
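The granularity difference among the three tools matters because averaging hides spikes. A toy demonstration with made-up per-minute samples (the 30-sample window stands in for VC's 30-minute rollup, the 5-sample window for vmkusage's 5-minute intervals, and the raw list for esxtop):

```python
def window_average(samples, window):
    """Average consecutive windows of raw samples, the way a
    coarser-grained tool rolls up its data."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples), window)]

# Raw per-minute CPU readings: mostly idle, then a 5-minute spike.
raw = [10.0] * 25 + [95.0] * 5

print(max(raw))                 # 95.0 -- esxtop-style raw data shows it
print(window_average(raw, 30))  # ~24 average -- the spike vanishes
print(window_average(raw, 5))   # 5-minute rollup still ends at 95.0
```

The same spike that would trigger troubleshooting in the raw data is invisible after a 30-minute average, which is why esxtop is the tool for problem resolution and VC the tool for long-term trending.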