VMware ESX and ESXi in the Enterprise: Effects on Operations
The introduction of virtualization using VMware ESX and ESXi creates a myriad of operational problems for administrators, specifically problems having to do with the scheduling of various operations around the use of normal tools and other everyday activities, such as deployments, antivirus and other agent and agentless operational tasks (performance gathering, and so forth), virtual machine agility (vMotion and Storage vMotion), and backups. In the past, prior to quad-core CPUs, many of these limitations were based on CPU utilization, but now the limitations are in the areas of disk and network throughput.
The performance-gathering issues dictate which tools to use to gather performance data and how to use the tools that gather this data. A certain level of understanding is required to interpret the results, and this knowledge will assist in balancing the VMs across multiple ESX or ESXi hosts.
The disk throughput issues are based on the limited pipe between the virtualization host and the remote storage, as well as reservation or locking issues. Locking issues dictate quite a bit how ESX should be managed. As discussed in Chapter 5, "Storage with ESX," SCSI reservations occur whenever the metadata of the VMFS is changed and the reservation happens for the whole LUN and not just an extent of the VMFS. This also dictates the layout of VMFS on each LUN; specifically, a VMFS should take up a whole LUN and not a part of the LUN. Disk throughput is becoming much more of an issue and will continue to be. Which is why with vSphere 4.1, Storage IO Control (SIOC) was introduced to traffic shape egress from the ESX host to Fibre Channel arrays. SIOC comes into play if the LUN latency is greater than 20ms. SIOC should improve overall throughput for those VMs marked as needing more of the limited pipe between the host and remote storage.
The network throughput issues are based on the limited pipes between the virtual machines and the outside physical network. Because these pipes are shared among many VMs, and most likely networks, via the use of VLANs, network I/O issues come to the forefront. This is especially true when discussing operational issues such as when to run network intensive tasks: VM backups, antivirus scans, and queries against other agents within VMs.
Virtual machine agility has its own operational and security concerns. Basically, the question is, "Can you ever be sure where your data is at any time?" Outside of the traditional operational concerns, virtual machine agility adds complexity to your environment.
Note that some of the solutions discussed within this chapter are utopian and not easy to implement within large-scale ESX environments. These are documented for completeness and to provide information that will aid in debugging these common problems. In addition, in this chapter unless otherwise mentioned we use the term ESX to also imply ESXi.
SCSI-2 Reservation Issues
With the possibility of drastic failures during crucial operations, we need to understand how we can alleviate the possibility of SCSI Reservation conflicts. We can eliminate SCSI Reservations by changing our operational behaviors to cover the possibility of failure. But what is a SCSI Reservation?
SCSI Reservations occur when an ESX host attempts to write to a LUN on a remote storage array. Because the VMFS is a clustered file system, there needs to be a way to ensure that when a write is made, that all previous writes have finished. In the simplest sense, SCSI Reservation is a lock that allows one write to finish before the next. We discuss this in detail in Chapter 5.
Although the changes to operational practices are generally simple, they are nonetheless fairly difficult to implement unless all the operators and administrators know how to tell whether an operation is occurring and whether the new operation would cause a SCSI Reservation conflict if it were implemented. This is where monitoring tools make the biggest impact.
VMware has made two major changes within the VMFS v3.31 to alleviate SCSI-2 Reservation issues. The first change was to raise the number of SCSI-2 Reservation retries that occur before a failure is reported. The second change was to allocate to each ESX host within a cluster a section of a VMFS so that simple updates do not always require a SCSI-2 Reservation. Even with these changes, SCSI-2 Reservations still occur, and we need to consider how to alleviate them.
The easiest way to alleviate SCSI-2 Reservations is to manage your ESX hosts using a common interface such as VMware vCenter Server, because vCenter has the capability to limit some actions that impact the number of simultaneous LUN actions. However, with the proliferation of PowerShell scripts, other vCenter management entities, and direct to host actions, this becomes much more difficult. Therefore, as we discussed in Chapter 4, "Auditing and Monitoring," it behooves you to perform adequate logging so that you can determine what caused the SCSI-2 Reservation, and then work to alleviate this from an operational perspective.
The primary way to avoid SCSI-2 Reservations is to verify in your management tool that all operations upon a given LUN or set of LUNs have been completed before proceeding with the next operation. In other words, serialize your actions per LUN or set of LUNs. In addition to checking your management tools, check the state of your backups and whether any current open service console operations have also completed. If a VMDK backup is running, let that take precedence and proceed with the next operation after the backup has completed. The easiest way to determine if a backup is running is to look on your backup tool's management console. However, you can also check for a snapshot that is created by your backup software using the snapshot manager that is part of the vSphere client or one of the snapshot hunter tools available. Most snapshots created by backup tools will have a very specific snapshot name. For example, if you use VCB, the snapshot will be named "_VCB-BACKUP_."
Multiple concurrent vMotions or Storage vMotions are a common cause for SCSI-2 Reservations, and this is why VMware has limited the number of simultaneous vMotions and Storage vMotions that can take place to six (increased to 8 in vSphere 4.1). Note that although vSphere will allow this number of migrations take place concurrently, it is not recommended for all arrays. For high-end arrays, the maximum can be performed simultaneously.
To check to see whether service console operations that could affect a LUN or set of LUNs have completed, judicious use of sudo is recommended. sudo can log all your operations to a file called /var/log/secure that you can peruse for file manipulation commands (cp, rm, tar, mv, and so on). Hopefully, this is being redirected to your log server, which has a script written to tell you if any LUN operations are taking place. Additionally, as the administrator, you can check the process lists for all servers for similar operations. No VMware user interface combines backups, vMotion, and service console actions. However, the HyTrust appliance is one such device that does provide a central place to audit for LUN requests (but not the completion of such requests).
When you work with ESXi, filesystem actions can still take place via Tech Support Mode, VMware Management Appliance (vMA) and the use of the vifs command. Even for ESXi, logging will be required.
For example, let's look at a system of three ESX hosts with five identical LUNs presented to the servers via Hitachi storage. Because each of the servers shares LUNs we need, we should limit our LUN activity to one operation per LUN at any given time. In this case, we could perform five operations simultaneously as long as those operations were LUN specific. After LUN boundaries are crossed, the number of simultaneous operations drops. To illustrate the second case, consider a VM with two disk files, one for the C: drive and one for the D: drive. Normally in ESX, we would place the C: and D: drives on separate LUNs to improve performance, among other things. In this case, because the C: and D: drives live on separate LUNs, manipulation of this VM, say with vMotion, counts as four simultaneous VM operations. This count is due to one operation affecting two LUNs, and the locks need to be set up on both the source and target of the vMotion. Therefore, five LUN operations could equate to fewer VM operations.
This leads to a set of operational behaviors with respect to SCSI Reservations.
Using the preceding examples as a basis, the suggested operational behaviors are as follows:
- Simplify deployments so that a VM does not span more than one LUN. In this way, operations on a VM are operations on a single LUN. This may not be possible because of performance requirements of the LUNs.
- Determine whether any operation is happening on the LUN you want to operate on. If your VM spans multiple LUNs, check the full set of LUNs by visiting the management tools in use and making sure that no other operation is happening on the LUN in question.
- Choose one ESX host as your deployment server. In this way, it is easy to limit deployment operations, imports, or template creations to only one host and LUN at a time.
- Use a naming convention for VMs that also tells what LUN or LUNs are in use for the VM. This way it is easy to tell what LUN could be affected by VM operation. This is an idealistic solution to the problem, given the possible use of Storage vMotion, but at least label VMs as spanning LUNs.
- Inside vCenter or any other management tool, limit access to the administrative operations so that only those who know the process can enact an operation. In the case of vCenter, only the administrative users should have any form of administrative privileges. All others should have only VM user or read-only privileges.
- Only administrators should be allowed to power on or off a VM. A power-off and power-on are considered separate operations unrelated to a reboot or reset from within the Guest OS. Power on and off operations open and close files on the LUN. However, more than just SCSI Reservation concerns exist with this case—there are performance concerns. For example, if you have 80 VMs across 4 hosts, rebooting all 80 at the same time would create a performance issue called a boot storm, and some of the VMs could fail to boot. The standard boot process for an ESX host is to boot the next VM only after VMware Tools is started, guaranteeing that there is no initial performance issue. However, this does not happen if VMware Tools is not installed or does not start. The necessary time of the lock for a power-on or -off operation is less than 7 microseconds, so many can be done in the span of a minute. However, this is not recommended, because the increase in load on ESX could adversely affect your other VMs. Limiting this is a wise move from a performance viewpoint.
- Use care when scheduling VMDK-level backups. It is best to have one host schedule all backups and to have one script to start backups on all other hosts. In this way, backups can be serialized per LUN. The serialization problem is solved by using the VMware Consolidated Backup, VMware Data Recovery, and many third-party tools such as Veeam Backup, Vizioncore vRangerPro, and Symantec BackupExpress. It is better for performance reasons to have each ESX host doing backups on a different LUN at any given time. For example, our three machines can each do a backup using a separate LUN. Even so, the activity is still controlled by only one host or tool so that there is no mix up or issue with timing so that each per LUN operation is serialized for a given LUN. Let the backup process limit and tell you what it is doing. Find tools that will
- Never start a backup on a LUN while another is still running.
- Signal the administrators that backups have finished either via email, message board, or pager(s). This way there is less to check per operation.
- Limit vMotion (hot migrations), fast migrates, cold migrations, and Storage vMotions to one per LUN. If you must do a huge number of vMotion migrations at the same time, limit this to one per LUN. With our example, there are five LUNs, so there is the possibility of five simultaneous vMotions, each on its own LUN, at any time. This assumes the VMs do not cross LUN boundaries.
- vMotion needs to be fast, and the more you attempt to do vMotions at the same time, the slower all will become. The slower the vMotion process, the higher the chance of the Guest OS having issues such as a blue screen of death for Windows. Using vMotion on 10 VMs at the same time could be a serious issue for the performance and health of the VM regardless of SCSI Reservations. Make sure the VM has no active backup snapshots before invoking vMotion.
- Use only the default VM disk modes. The nondefault persistent disk modes lead to not being able to perform snapshots and use the consolidated backup tools. Nonpersistent modes such as read-only create snapshot files on LUNs during runtime and remove them on VM power-off so as to not affect the master disk file.
- Do not suspend VMs, because this also creates a file and therefore requires a SCSI Reservation.
- Do not run vm-support requests unless all other operations have completed.
- Do not use the vdf service console tool when any other modification operation is being performed. Although vdf does not normally force a reservation, it could experience one if another host, because of a metadata modification, locked the LUN.
- Do not rescan storage subsystems unless all other operations have completed.
- Limit use of vmkmultipath, vmkfstools, and other VMware-specific service console and remote CLI commands until all other operations have completed.
- Create, modify, or delete a VMFS only when all other operations have completed.
- Be sure no third-party agents are accessing your storage subsystem via vdf, or direct access to the /vmfs directory.
- Do not run scripts that modify VMFS ownership, permissions, access times, or modification times from more than one host. Localize such scripts to a single host. It is suggested that you use the deployment server as the host for such scripts.
- Run all scripts that affect LUNs from a management node that can control when actions can occur.
- Stagger the running of disk-intensive tools within a VM, such as virus scan. The extra load on your SAN could cause results similar to those that occur with SCSI Reservations but which are instead queue-full or unavailable-target errors.
- Use only one file system per LUN.
- Do not mix file systems on the same LUN.
What this all boils down to is ensuring that any possible operation that could somehow affect a LUN is limited to only one operation per LUN at any given time. The biggest hitters of this are automated power operations, backups, vMotion, Storage vMotion, and deployments. A little careful monitoring and changes to operational procedures can limit the possibility of SCSI Reservation conflicts and failures to various operations.
A case in point follows: One company under review because constant, debilitating SCSI Reservation conflicts reviewed the list of 23 items and fixed one or two possible items but missed the most critical item. This customer had an automated tool that ran simultaneously on all hosts at the same time to modify the owner and group of every file on every VMFS attached to the host. The resultant metadata updates caused hundreds of SCSI-2 Reservations to occur. The solution was to run this script from a single ESX host for all LUNs. By limiting the run of the script to a single host, all the reservations disappeared, because no two hosts were attempting to manipulate the file systems at the same time, and the single host, in effect, serialized the actions.
Hot and cold migrations of VMs can change the behavior of automatic boot methodologies, which can affect LUN locking. Setting a dependency on one VM or a time for a boot to occur deals with a single ESX host where you can start VMs at boot of ESX, after VMware Tools starts in the previous VM, after a certain amount of time, or not at all. This gets much more difficult with more than one ESX host, so a new method has to be used. Although starting a VM after a certain amount of time is extremely useful, what happens when three VMs start almost simultaneously on the same LUN? Remember, we want to limit operations to just one per LUN at any time. We have a few options:
- Stagger the boot or reboot of your ESX host and ensure that your VMs start only after the previous VMs' VMware Tools start, to ensure that all the disk activity associated with the boot sequence finishes before the next VM boots, thereby helping with boot performance and eliminating conflicts. VM boots are naturally staggered by ESX when it reboots anyway if the VM is auto-started.
- Similar to doing backups, have one ESX host that controls the boot of all VMs, guaranteeing that you can boot multiple VMs but only one VM per LUN at any time. If you have multiple ESX hosts, more than one VM can start at any time on each LUN, one per LUN. In essence, we use the VMware vSphere SDK to gather information about each VM from each ESX host and correlate the VMs to a LUN and create a list of VMs that can start simultaneously; that is, each VM is to start on a separate LUN. Then we wait a set length of time before starting the next batch of VMs. This method is not needed when VMware Fault Tolerance fires because the shadow VM is already running. Also, VMware HA uses its own rules for starting VMs in specific orders.
All the listed operational changes will limit the number of SCSI subsystem errors that will be experienced. Although it is possible to implement more than one operation per LUN at any given time, we cannot guarantee success with more than one operation. This depends on the type of operation, the SAN, settings, and most of all, timings for operations.
Yet you may ask yourself, "Wouldn't using ESXi solve many of these issues because there is no service console?" The answer is, "Partially." Many of the "scripting" issues that occur within a service console are no longer a concern. Scripting issues can come up using the new VMware Virtual Management Appliance (vMA) or by using the remote CLI directly if there is not a single control mechanism for when these scripts run against all LUNs in question. So the problems can still occur even with ESXi. On top of this, it is still possible to run scripts directly within the ESXi Posix environment that comprises the ESXi management console. Granted, it is much harder, but not impossible.
There are several other considerations, too. Most people want to perform multiple operations simultaneously, and this is possible as long as the operations are on separate LUNs or the storage array supports the number of simultaneous operations. Because many simultaneous operations are storage array specific, it behooves you to run a simple test with the array in question to determine how many simultaneous operations can happen per LUN. As ESX improves, arrays improve, vStorage API for Array Integration is used within arrays, and transports improve in performance the number of simultaneous operations per LUN will increase.
With vSphere, the number of SCSI Reservations have dropped drastically but they still occur; when they do, this section will help you to track down the reasons and provide you the necessary information to test your arrays. You should also test to determine how many hosts can be added to a given cluster before SCSI Reservations start occurring. On low-end switches, this value may just be 2, whereas on others it could be 4.