Why do I have trespassed LUNs or all LUNs on one SP?

This post focuses on trespassing in a mid-range array and how it relates to ensuring an optimal storage configuration for ESX/i.  In a VMware environment, storage is a critical component of the mix.  There are different protocols that allow a datastore to be advertised to a host; we are going to focus on ensuring that mid-range block storage array (FC/iSCSI/FCoE) datastores/LUNs are balanced properly.  The post is meant to be explanatory, to allow for a better understanding of multi-pathing, rather than a definitive guide to any specific situation.

We are going to make a big assumption here: that there has been planning ahead of time, and there is a pre-decided balance of LUNs on the storage array that we would like to maintain in order to have an optimal environment.  So basically the datastores (LUNs) are set for “default owner” on specific SPs (storage processors) in a balanced/predicted way.  Keep an eye on the ownership discussion below, as it is critical to seeing how VMware and an EMC mid-range (CX/NS) storage array play together.

Active-Passive/Active

The default owner refers to which storage processor (SP) owns a LUN and is responsible for sending IOs to the backend disk.  This is a critical differentiator between an enterprise and a mid-range array.  Simply put, on an enterprise active-active array any storage processor can send IO to any backend disk, whereas on a mid-range active-passive array only one SP can send IO to a disk at a time.  These SPs can, however, run active-passive for some LUNs and passive-active for others, which allows you to balance IO across the SPs.

A bit of complication comes in here, however.  Thanks to something called ALUA (Asymmetric Logical Unit Access), a mid-range array can now work similarly to an enterprise active-active array.  Note that ALUA is enabled by default when vSphere 4.1 ESX hosts are registered with an EMC CX/NS array running FLARE 29+.  It allows ESX/i to send IO down any path to a storage processor; that processor will accept the IO and either send it to backend disk directly or internally redirect it to its partner SP, which then sends the IO to backend disk.  Without ALUA, IO received down different paths would cause a ping-pong effect where a LUN would change owners constantly in order to service IO (very bad).  With ALUA, this does not happen.  What will happen, however, is that after a certain threshold the SP will recognize that it would be more efficient to have its partner SP own the LUN, and thus the backend disk for the LUN, and will pass its “current owner” status to its partner.
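To picture the redirect-then-hand-over behavior, here is a toy Python model. It is only a sketch under loud assumptions: the class name, the simple counter, and the threshold value are all hypothetical, and the real array logic and threshold are not described in this post.

```python
class LunOwnership:
    """Toy model of one LUN on a two-SP array with ALUA-style redirection."""

    def __init__(self, owner, threshold):
        self.current_owner = owner
        self.threshold = threshold  # hypothetical value, not a real array setting
        self.redirected = 0         # IOs that arrived on the non-owning SP

    def receive_io(self, sp):
        # IO arriving on the owning SP goes straight to backend disk.
        if sp == self.current_owner:
            return "direct"
        # IO arriving on the other SP is redirected over the internal link.
        self.redirected += 1
        if self.redirected >= self.threshold:
            # The owner passes "current owner" status to its partner.
            self.current_owner = sp
            self.redirected = 0
            return "redirected, ownership moved"
        return "redirected"


lun = LunOwnership(owner="SPA", threshold=3)
results = [lun.receive_io("SPB") for _ in range(3)]
print(results, lun.current_owner)
```

The point of the model is only that ownership moves once, after sustained traffic on the non-owning SP, instead of flipping on every IO.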

So in summary, ALUA enables a mid-range array to behave more like an active-active array from ESX/i’s perspective, but let’s look at some more details in the stack.

Pathing choices

The goal as I see it for running an efficient storage stack via block protocols is to maximize IO capabilities and minimize management overhead and risk.  So in simple terms, EMC has software called PowerPath/VE that in my opinion alleviates most of what I will describe next.

Trespasses

We’ve described a few things, and the most critical one going forward is LUN ownership, which is either current or default.  Current refers to the SP that can send IO to a LUN and the LUN’s backend disk; default refers to the SP that is set to be the owner on array bootup or in a non-trespassed situation.  A LUN being in a trespassed state simply means the LUN is currently owned by the SP that isn’t its default owner.
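The definition boils down to a single comparison. A minimal Python sketch, assuming ownership data has already been pulled from the array into simple records (the `Lun` class and the LUN names are made up for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lun:
    name: str
    default_owner: str  # SP set as owner at array bootup, e.g. "SPA"
    current_owner: str  # SP that can currently send IO to the backend disk


def is_trespassed(lun):
    # A LUN is trespassed when its current owner is not its default owner.
    return lun.current_owner != lun.default_owner


def trespassed_luns(luns):
    return [lun.name for lun in luns if is_trespassed(lun)]


luns = [
    Lun("LUN_10", default_owner="SPA", current_owner="SPA"),
    Lun("LUN_11", default_owner="SPB", current_owner="SPA"),
    Lun("LUN_12", default_owner="SPB", current_owner="SPB"),
]
print(trespassed_luns(luns))  # ['LUN_11']
```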

The question at this point is, why would a LUN enter a trespassed state?  There is a list of reasons why this may happen; let’s start from a general point and move into more cluster- and VMware-specific reasons.  Trespasses can happen manually, be initiated by the array, or be caused by ESX/i operations.  From a manual perspective, at any point you can force a trespass (move to the opposite SP) from the GUI/CLI of the storage array.  I may decide to do this to untrespass a LUN, or if I know I will be removing paths and want to force IO down a certain SP.  What’s important to get here is that even if I make trespassing decisions manually, they can be overridden by the array or the hypervisor right away.  On the more automatic side, a LUN can be trespassed by an SP under many conditions.  An NDU (“non-disruptive upgrade”) of an array will cause LUNs to be trespassed from one SP to the other to ensure LUN access during the whole upgrade.  After the upgrade, the LUNs may exist solely on one SP, and those LUNs whose current owner is not equal to their default owner are considered trespassed.  There are many other reasons from an array perspective.

Moving more into the VMware world, we are talking about a cluster of hosts, each making its own decision as to what path should be used.  These hosts use pre-determined decision trees to set these paths.  However, the decision trees run on the different hosts at different times.  For example, when an ESX/i server using FIXED ALUA pathing boots up, it will assign the active path based on the current owner of a LUN.  There may be a trespass of this LUN for some reason after this, and another ESX/i host then applies the same decision tree.  It then decides its active path will go down the new current owner SP.  At this point you have two ESX/i servers that made pathing choices down different SPs.  Problem here?  Not a huge one, since ALUA allows both SPs to receive IOs.  The problem comes in when the SPs trespass the LUN in the backend, whenever thresholds are met and the array decides it’s more efficient to service the LUN from the partner SP, and your predetermined balance of LUNs per SP is thrown off.  As well, there is extra overhead in sending IO through more channels, i.e. the internal link between SPs is an extra hop (more CPU cycles, less use of cache).
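The timing problem can be replayed in a few lines of Python. This is a simplified sketch of the FIXED behavior described above, not VMware's actual decision tree; the host names and the `boot_host` helper are hypothetical.

```python
def boot_host(host, current_owner, active_paths):
    """At boot, pin the host's active path to whichever SP owns the LUN now."""
    active_paths[host] = current_owner


active_paths = {}
current_owner = "SPA"  # the LUN starts on its default owner

boot_host("esx01", current_owner, active_paths)  # esx01 pins to SPA

current_owner = "SPB"  # the LUN is trespassed for some reason

boot_host("esx02", current_owner, active_paths)  # esx02 pins to SPB

# Hosts whose pinned SP is not the current owner send IO over the extra hop.
non_optimal = sorted(h for h, sp in active_paths.items() if sp != current_owner)
split = len(set(active_paths.values())) > 1

print(active_paths)  # {'esx01': 'SPA', 'esx02': 'SPB'}
print(non_optimal)   # ['esx01']
print(split)         # True
```

Both hosts followed the same rule; they just ran it at different moments, which is enough to split the cluster across SPs.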

There are other situations where you may run into this.  Say an array just did an NDU and all LUNs are currently owned by the secondary SP, which is not the default owner for all of them (it is the first to upgrade, so it will own the LUNs at the end of the NDU).  If you then boot up your ESX/i hosts, they will set active paths to that one SP.  So some of your LUNs will be trespassed, and stuck as trespassed if pathing is FIXED.  One more example, and this one may actually be the most relevant: if there is one host in a cluster of any size that does not have established pathing to one of the SPs, all of its IO for a LUN will attempt to traverse the only path it has for that LUN.  Sounds bad?  It is; in reality all it takes is one misconfigured host to throw a cluster out of balance.  This is why we suggest reviewing best practices and ensuring you have 4 paths to a mid-range array, 2 to each SP.  So there we have it: a few situations where systems making pathing decisions independently, without some authoritative source, cause inconsistent pathing to mid-range arrays.
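A quick way to spot the one misconfigured host is to count paths per SP for every host. The sketch below assumes the path inventory has already been collected into plain Python data; the host names, HBA names, and the `misconfigured_hosts` helper are illustrative only.

```python
from collections import Counter


def misconfigured_hosts(host_paths, sps=("SPA", "SPB"), min_per_sp=2):
    """Return hosts that have fewer than min_per_sp paths to any SP.

    host_paths maps a host name to a list of (hba, sp) tuples, one per path.
    """
    bad = []
    for host, paths in host_paths.items():
        counts = Counter(sp for _hba, sp in paths)
        # Counter returns 0 for an SP the host has no path to at all.
        if any(counts[sp] < min_per_sp for sp in sps):
            bad.append(host)
    return bad


host_paths = {
    # 4 paths, 2 to each SP: matches the suggested layout
    "esx01": [("vmhba1", "SPA"), ("vmhba1", "SPB"),
              ("vmhba2", "SPA"), ("vmhba2", "SPB")],
    # no path to SPB at all: every IO is forced through SPA
    "esx02": [("vmhba1", "SPA"), ("vmhba2", "SPA")],
}
print(misconfigured_hosts(host_paths))  # ['esx02']
```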

Ok, so I have a lot of trespassing going on, how do I fix it?

It depends =).  The larger the ESX/i environment, the more challenging this can be to fix.  If I have a cluster of four hosts, I can pretty easily go through and adjust active paths to match via the vCenter GUI.  However, the larger the cluster, the more difficult it is.  For example, 4 hosts times 4 datastores yields 16 checks.  Scale that up to 30 hosts and 30 datastores, and that’s 900 checks.  Ouch!
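The check count is simply hosts times datastores, and each check compares one host's active SP for a LUN against that LUN's default owner. A small Python sketch of that audit, with made-up inventory data and a hypothetical `audit` helper:

```python
from itertools import product


def audit(hosts, default_owner, active_sp):
    """Return (number of checks, list of (host, lun) pairs that need fixing).

    default_owner maps lun -> SP; active_sp maps (host, lun) -> SP in use.
    """
    checks = list(product(hosts, default_owner))  # every host/LUN pair
    mismatches = [(h, l) for h, l in checks
                  if active_sp[(h, l)] != default_owner[l]]
    return len(checks), mismatches


hosts = ["esx01", "esx02", "esx03", "esx04"]
default_owner = {"LUN_10": "SPA", "LUN_11": "SPB",
                 "LUN_12": "SPA", "LUN_13": "SPB"}

# Start with every host pathed to the default owner, then break one pair.
active_sp = {(h, l): sp for h in hosts for l, sp in default_owner.items()}
active_sp[("esx02", "LUN_11")] = "SPA"

n_checks, to_fix = audit(hosts, default_owner, active_sp)
print(n_checks, to_fix)  # 16 [('esx02', 'LUN_11')]
```

With 30 hosts and 30 datastores the same loop covers the 900 pairs mentioned above, which is where doing it by hand in a GUI stops being practical.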

The Easy Solution

PowerPath/VE is the slam dunk for this.  If we are hard set on using block protocols for datastore access and we don’t want to think about the management of the paths, then PP/VE is your software.  We describe it as path management software that adaptively manages paths.  Simply put, VMware’s NMP (native multipathing) does not make any decisions based on authoritative information, and I believe it’s critical in larger environments to do so.  PP/VE uses array-side information to make pathing decisions dynamically for you.  So in essence, paths that are uncongested and available are used at all times, and pathing adapts as things change on the fan-in/target array port.  In my opinion, this is far and away the most comprehensive way to ensure optimal block access to a storage array for VMware.

The reboot it method

Since the decision tree runs when an ESX/i server first boots, an option to fix your active pathing without using a GUI/CLI is to just place the host into maintenance mode and reboot.  Not a bad method, since there is no effect on VMs: they are migrated online to another host when the host enters maintenance mode.  When rebooting, however, you need to ensure that each LUN is owned by its default owner so the pathing decision can be correct.  So before booting any ESX/i server up, make sure your LUNs are currently owned by the right SP!  A manual trespass of a LUN or all LUNs to their default owner SP can be done via CLI/GUI.

The scripting method

Not for the faint of heart, and I really can’t support it.  See the following link for a script I wrote to choose active paths across an ESX/i cluster based on authoritative information from the array (default owner): https://community.emc.com/thread/113885?tstart=0
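The linked script is not reproduced here. As a rough illustration of the idea only, the Python sketch below plans a preferred path per LUN for one host, using the array's default owner as the authoritative input. The function name, the path naming, and the rotation across paths to the same SP are all assumptions made for this example; it only builds a plan and does not talk to vCenter or the array.

```python
def plan_preferred_paths(default_owner, host_paths):
    """Pick, for each LUN, a host path that lands on the LUN's default owner.

    default_owner maps lun -> SP. host_paths is a list of (path_name, sp)
    tuples for one host. Paths to the same SP are used in rotation so LUNs
    are spread across them (an assumption of this sketch).
    """
    by_sp = {}
    for name, sp in host_paths:
        by_sp.setdefault(sp, []).append(name)

    next_index = {sp: 0 for sp in by_sp}
    plan = {}
    for lun in sorted(default_owner):
        sp = default_owner[lun]
        candidates = by_sp.get(sp)
        if not candidates:
            plan[lun] = None  # no path to the default owner: fix zoning first
            continue
        plan[lun] = candidates[next_index[sp] % len(candidates)]
        next_index[sp] += 1
    return plan


paths = [("vmhba1:C0:T0", "SPA"), ("vmhba1:C0:T1", "SPB"),
         ("vmhba2:C0:T0", "SPA"), ("vmhba2:C0:T1", "SPB")]
owners = {"LUN_10": "SPA", "LUN_11": "SPB", "LUN_12": "SPA"}
plan = plan_preferred_paths(owners, paths)
print(plan)
```

Because every host derives its plan from the same array-side information, all hosts end up pointing at the same SP for a given LUN regardless of when they booted.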

Round robin

It is possible to fix the trespassing conditions by putting a host in maintenance mode, switching to round robin (RR) pathing, and then trespassing LUNs to their appropriate owners.  When round robin uses its decision tree for pathing choices, it chooses the paths to the current owner as its active paths.  So, similar to FIXED, it is susceptible to becoming unbalanced.  It does, however, act a bit differently under certain conditions.  Under FIXED, the active path never changes.  Under RR, the active paths will change when a LUN is trespassed via user intervention.  So if you’re digging into ALUA and RR, the thought may be that if I trespass a LUN from the array, the ESX/i server will still send traffic down what is set as the ACTIVE paths.  This is not the case: ALUA will keep active paths static only during array-initiated trespass conditions.  So all of the manual/scripting work to balance paths can be achieved by just enabling RR and ensuring there aren’t trespasses on the array.  All in all, a pretty easy solution if you’re willing to go to RR!

So what’s EMC’s official pathing stance for the mid-range?

The discussion above focused mostly on the storage stack, and on FIXED ALUA and PP/VE as a problem and a solution.  It is important to note that FIXED ALUA is currently what ESX/i 4/4.1 chooses automatically for EMC CX/NS arrays when a hypervisor first boots.  EMC’s best practice is to use ROUND ROBIN (the only caveat is for arrays running multiple iSCSI initiators and pre-FLARE 30 code).  RR is a common best practice among array vendors, and EMC is no different.  We do, however, highly suggest using PP/VE instead of round robin to attain adaptive load balancing.