ESXi Slow Boot if having RDMs mapped

This is an issue that has come up time and time again, and I have run into it many times myself: an ESXi host with RDM-backed virtual machines takes forever to boot. On one occasion I waited around an hour and a half for an ESXi host to come back up! 🙂

The basic gist of the problem is this: when Microsoft Cluster Service (MSCS) virtual machines are deployed across ESXi hosts (commonly referred to as Cluster Across Boxes, or CAB), the virtual machines share access to disks, which are typically Raw Device Mappings (RDMs). RDMs are LUNs presented directly to virtual machines. Because the ESXi host being rebooted presumably holds the passive virtual machines/cluster nodes, the other ESXi host or hosts hold the active nodes. Since the active nodes hold SCSI reservations on the shared disks/RDMs, the ESXi boot process slows down as the host tries to interrogate each of these disks during storage discovery. So what can you do to alleviate it? Read on and find out.

As mentioned, this issue has been around for a while. The resolution is basically to get the ESXi host storage discovery process to skip over the LUNs on which a SCSI Reservation has been detected. On ESXi/ESX 4.0, we recommended changing the advanced option Scsi.UWConflictRetries to 80. In ESX/ESXi 4.1, a new advanced option called Scsi.CRTimeoutDuringBoot was introduced (CR is short for Conflict Retries), and the recommendation was to set this value to 1 to speed up the boot process. What these settings did was effectively get the discovery process to move on as quickly as possible once a SCSI reservation was detected.
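For reference, on those older releases the advanced options were typically set from the host shell with the esxcfg-advcfg utility. A sketch, assuming the option paths match the option names given above (verify them on your host before applying):

```shell
# ESX/ESXi 4.0: raise the conflict-retry option so discovery
# backs off from reserved LUNs (recommended value: 80)
esxcfg-advcfg -s 80 /Scsi/UWConflictRetries

# ESX/ESXi 4.1: cap conflict retries during boot so discovery
# moves on quickly once a reservation is detected
esxcfg-advcfg -s 1 /Scsi/CRTimeoutDuringBoot
```

These are per-host settings and only apply to 4.x; on 5.x the perennially reserved flag described below supersedes them.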

In ESXi 5.0 & 5.1, a new setting was introduced to make this whole process much smarter. A new flag allows an administrator to mark a LUN/RDM as ‘perennially reserved’. This is an indication to the SCSI mid-layer of the VMkernel not to query this device during the ‘discovery’ process. This speeds up the boot process when you have MSCS running in virtual machines and there is a need to boot any of the ESXi hosts which own these virtual machines and associated storage.

ESXi/ESX 4.x and ESXi 5.x hosts take a long time to start. This time depends on the number of RDMs that are attached to the ESXi/ESX host.

Note: In a system with 10 RDMs used in an MSCS cluster with two nodes, a restart of the ESXi/ESX host with the secondary node takes approximately 30 minutes. In a system with less RDMs, the restart time is less. For example, if only three RDMs are used, the restart time is approximately 10 minutes.
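Working from the figures in the note above, each reserved RDM adds on the order of three minutes to the boot. A throwaway estimate in shell (linear scaling is an assumption; real timings vary by storage array and path count):

```shell
#!/bin/sh
# Rough boot-delay estimate: ~3 minutes of storage-discovery time per
# SCSI-reserved RDM, derived from the 10-RDM/30-min figure above.
# Linear scaling is assumed and only approximate.
rdms=10
echo "Estimated extra boot time: $((rdms * 3)) minutes"
```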

ESXi 5.0

ESXi 5.0 uses a different technique to determine if Raw Device Mapped (RDM) LUNs are used for MSCS cluster devices, by introducing a configuration flag that marks each device participating in an MSCS cluster as perennially reserved. During the start of an ESXi host, the storage mid-layer attempts to discover all devices presented to the host during the device claiming phase. However, MSCS LUNs that have a permanent SCSI reservation lengthen the start process, because the ESXi host cannot interrogate a LUN that carries a persistent SCSI reservation placed on it by an active MSCS node hosted on another ESXi host.

Configuring the device to be perennially reserved is local to each ESXi host, and must be performed on every ESXi 5.0 host that has visibility to each device participating in an MSCS cluster. This improves the start time for all ESXi hosts that have visibility to the device(s).

There is no support to apply this setting using vSphere host profiles. ESXi 5.0 hosts deployed using vSphere Auto Deploy cannot take advantage of this feature.

Using Host profiles:

To mark the MSCS LUNs as perennially reserved on an already upgraded ESXi 5.1/5.5 host, set the perennially reserved flag in Host Profiles.

Using CLI:

To mark the MSCS LUNs as perennially reserved on an ESXi 5.0 host, run the esxcli command shown in the steps below; all subsequent rescans and restarts then proceed at normal speed.
  1. Determine which RDM LUNs are part of an MSCS cluster.
  2. From the vSphere Client, select a virtual machine that has a mapping to the MSCS cluster RDM devices.
  3. Edit your virtual machine settings and navigate to your Mapped RAW LUNs.
  4. Select Manage Paths to display the device properties of the Mapped RAW LUN and the device identifier (that is, the naa ID).
  5. Take note of the naa ID, which is a globally unique identifier for your shared device.
  6. Use the esxcli command to mark the device as perennially reserved:

    esxcli storage core device setconfig -d naa.id --perennially-reserved=true

  7. To verify that the device is perennially reserved, run this command:

    esxcli storage core device list -d naa.id

    In the output of the esxcli command, search for the entry Is Perennially Reserved: true. This shows that the device is marked as perennially reserved.

  8. Repeat the procedure for each Mapped RAW LUN that is participating in the MSCS cluster.
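The check in step 7 can be scripted: parse the device list output and fail if the flag is not set. A sketch using sample output (the naa ID and surrounding fields here are illustrative only; on a real host you would pipe `esxcli storage core device list -d <naa.id>` straight into the grep):

```shell
#!/bin/sh
# Sample output as it might appear from:
#   esxcli storage core device list -d <naa.id>
# (device ID, size, and field values below are illustrative only)
sample_output='naa.600508b4000971fa0000d00002270000
   Display Name: Fibre Channel Disk (naa.600508b4000971fa0000d00002270000)
   Size: 10240
   Device Type: Direct-Access
   Is Perennially Reserved: true'

# Verify the flag; exit non-zero if the device is not marked
if echo "$sample_output" | grep -q "Is Perennially Reserved: true"; then
    echo "device is perennially reserved"
else
    echo "device is NOT perennially reserved" >&2
    exit 1
fi
```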

Note: The configuration is permanently stored with the ESXi host and persists across restarts. To remove the perennially reserved flag, run this command:

esxcli storage core device setconfig -d naa.id --perennially-reserved=false
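With several RDMs in the cluster, the same setconfig call can be applied (or reverted) in a loop. A sketch with placeholder naa IDs (substitute the IDs you noted in step 5; remember this must be repeated on every host that sees the devices):

```shell
#!/bin/sh
# Placeholder naa IDs of the MSCS cluster RDMs -- replace with your own
DEVICES="naa.600508b4000971fa0000d00002270000 naa.600508b4000971fb0000d00002280000"

for dev in $DEVICES; do
    # Mark the device perennially reserved (use =false to revert)
    esxcli storage core device setconfig -d "$dev" --perennially-reserved=true
done
```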
