VMware ESX Server's Effects on Operations
ESX creates myriad problems for administrators, most of them related to scheduling operations around normal tools and everyday activities such as deployments, VMotion to balance nodes, and backups. Most, if not all, of the limitations revolve around performance gathering and the data stores upon which VMs are placed, whether SCSI, including iSCSI, or non-VMDK files accessed from NFS shared off a NAS or some other system.
The performance-gathering issues dictate which tools to use to gather performance data and how to use them. A certain level of understanding is required to interpret the results, and this knowledge will assist in balancing the VMs across multiple ESX Servers.
The data store limitations consist of bandwidth issues (each data store has a limited pipe between the ESX Server and the remote storage) and reservation or locking issues. These two issues largely dictate how ESX should be managed. As discussed in Chapter 5, “Storage with ESX,” a SCSI reservation occurs whenever the metadata of the VMFS is changed, and the reservation is taken on the whole LUN, not on an extent of the LUN. This also dictates the layout of VMFS on each LUN; specifically, a VMFS should take up a whole LUN and not a part of the LUN.
This chapter covers data store performance or bandwidth issues, SCSI-2 reservation issues, and performance-gathering agents, and then finishes with some other issues and a discussion of the impact of Sarbanes-Oxley. Note that some of the solutions discussed within this chapter are utopian and not easy to implement within large-scale ESX environments. These are documented for completeness and to give information that will aid in debugging these common problems.
Data Store Performance or Bandwidth Issues
Because bandwidth is an issue, it is important to make sure that all your data stores have as much bandwidth as possible and to use this bandwidth sparingly for each data store. Normal operational behavior of a VM often includes full disk virus scans, backups, spyware scans, and other extremely disk-intensive activities. Although none of these activities requires any form of locking of the data store on which the VMDK resides, they all consume a serious amount of bandwidth. The bandwidth requirements of a single VM are not very large, but they add up on an ESX Server that hosts many VMs.
Staggering the activities in time will greatly reduce the strain on the storage environment. Staggering across ESX Servers is also a good idea, but only as long as different data stores are in use on each ESX Server. For example, backing up VMs that reside on the same LUN but on different ESX Servers at the same time would cause locking issues and should be avoided. Virus scans, however, will not cause many issues when run from multiple VMs on the same LUN from multiple ESX Servers, because operations on the VMDK do not cause locks at the LUN level.
Running disk-intensive tools within a VM can still produce results similar to those that occur with SCSI reservations, even though no reservations are involved. These are load issues: the SAN or NAS is overworked and therefore presents failures similar to those caused by SCSI-2 reservations.
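The staggering rule described above can be sketched as a small scheduling routine. This is only an illustration, not a VMware tool: the function name stagger_backups, the (VM, ESX Server, LUN) tuples, and the 60-minute window are all hypothetical. It also assumes, more strictly than the text requires, that each ESX Server runs only one backup per window. The one rule taken directly from the discussion is that two VMs on the same LUN never share a backup window.

```python
def stagger_backups(vms, slot_minutes=60):
    """Assign each VM a backup window using a greedy first-fit pass.

    vms: list of (vm_name, esx_host, lun) tuples (hypothetical inventory).
    Returns a dict mapping vm_name -> start offset in minutes.

    Rule from the text: no two VMs on the same LUN share a window.
    Extra assumption: at most one backup per ESX Server per window.
    """
    slots = []      # one (luns_in_use, hosts_in_use) pair per window
    schedule = {}
    for name, host, lun in vms:
        # Look for the first window where both the LUN and the host are idle.
        for index, (luns, hosts) in enumerate(slots):
            if lun not in luns and host not in hosts:
                break
        else:
            # No existing window fits, so open a new one at the end.
            slots.append((set(), set()))
            index = len(slots) - 1
            luns, hosts = slots[index]
        luns.add(lun)
        hosts.add(host)
        schedule[name] = index * slot_minutes
    return schedule


# Hypothetical inventory: two ESX Servers sharing two LUNs.
inventory = [
    ("vm1", "esx1", "lun1"),
    ("vm2", "esx2", "lun1"),
    ("vm3", "esx2", "lun2"),
    ("vm4", "esx1", "lun2"),
]
for vm, offset in stagger_backups(inventory).items():
    print(f"{vm}: start backup at +{offset} minutes")
```

With this inventory, vm1 and vm3 land in the first window and vm2 and vm4 in the second, so each window touches every LUN and every ESX Server at most once. Virus scans are deliberately left out of the sketch because, as noted above, they do not cause LUN-level locks and need staggering only to limit load.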