
January 31st, 2011 11:00

Unable to service IO due to a storage pool problem

I have an NS-960 running FLARE 30 and am seeing an issue where vSphere hosts are unable to access a LUN (in my case 2 of 30, all FC). These are thick pool LUNs. The LUN was trespassed, then became unowned by either SP. We ended up rebooting SP A (it was out of resources, memory I believe). We are having the same issue today.

This is one of the errors:

LUN 60060160d0582900:d88b7a0d081de011 is unable to service IO due to a storage pool problem. Please resolve any hardware issues. If the problem persists, please contact your service provider.  Internal Information only: LUN OID A0000000A.   000004000300

Has anyone seen this issue before?

Thanks.

January 31st, 2011 12:00

Check the release notes for the latest version of FLARE 30 (available on Powerlink). If you are running FLARE 30 on a system, you should be working with your account team and support to get updated to patch 509.

Platforms: CX4-120, CX4-240, CX4-480, CX4-960

Brief description: During a storage processor reboot, LUNs may not be accessible to VMware hosts from the peer storage processor.
34841844/369464
Frequency of occurrence: Likely under a specific set of circumstances.
Severity: Medium

Symptom details: During a storage processor reboot, LUNs may not be accessible to VMware hosts from the peer storage processor.
KnowledgeBase ID: emc227055

Solution (or workaround): Fixed in code.
Exists in versions: 04.30.000.5.004, 04.30.000.5.005, 04.30.000.5.507, 04.30.000.5.508
Fixed in versions: 04.30.000.5.509
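
As a quick way to check an array against the release-note entry above, here is a minimal Python sketch. The revision strings are copied from the entry; the function name `classify_flare_revision` is made up for illustration, and anything not listed in the entry is reported as "unknown" rather than guessed at.

```python
# Revisions copied from the release-note entry quoted above.
AFFECTED = {
    "04.30.000.5.004",
    "04.30.000.5.005",
    "04.30.000.5.507",
    "04.30.000.5.508",
}
FIXED = {"04.30.000.5.509"}


def classify_flare_revision(revision: str) -> str:
    """Return 'affected', 'fixed', or 'unknown' for a FLARE revision string."""
    rev = revision.strip()
    if rev in AFFECTED:
        return "affected"
    if rev in FIXED:
        return "fixed"
    # The release note does not mention this revision, so make no claim.
    return "unknown"


if __name__ == "__main__":
    for rev in ("04.30.000.5.508", "04.30.000.5.509", "04.30.000.5.510"):
        print(rev, classify_flare_revision(rev))
```

The lookup is deliberately an exact match. The release note lists specific revisions and says nothing about how other revisions compare, so the sketch does not try to order them.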


Open a case with EMC Support, though, and work with them to resolve the issue. In some cases there are actions that must be taken before upgrading to prevent more problems. Here is the KB article referenced in the release notes:

Product: CLARiiON All
Persistent reservation conflicts cause a VMware ESX 3.5 server to lose access to some LUNs only from SP A.
The LUN is not accessible from SP A. Trespassing the LUN to SP B allows access to the LUN, but trespassing it back to SP A results in the VMware server losing access to the LUN.
Error messages like these appear on the VMware server:

Oct 27 09:59:01 xxxx335 vmkernel: 0:16:39:53.731 cpu3:1040)SCSI: vm 1040: 109: Sync CR at 0
Oct 27 09:59:01 xxxx335 vmkernel: 0:16:39:53.731 cpu3:1040)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
Oct 27 09:59:01 xxxx335 vmkernel: 0:16:39:53.731 cpu3:1040)WARNING: FS3: 2913: reservation error: SCSI reservation conflict
Oct 27 09:59:01 xxxx335 vmkernel: 0:16:39:53.731 cpu3:1040)WARNING: FS3: 3370: Failed with bad0022
Oct 27 09:59:01 xxxx335 vmkernel: 0:16:39:53.731 cpu3:1040)FSS: 390: Failed with status SCSI reservation conflict for f530 28 1 4a588dc4 6c9974b4 1        e00d266 7e9ecd0b 0 0 0 0 0 0 0

From CLARiiON ktrace Logs

<<<<<<<< LUN 29 owned by SPA and we see reservation conflicts >>>>>>>

A 10/28/09 17:13:31 TCD4 fd7619d0 CC 02\04\03 LUN 0x0 Initiator 10000000C9627F5C OpCode 0x00
A 10/28/09 17:13:31 TCD4 fd7619d0 CC 02\04\03 LUN 0x0 Initiator 10000000C9627F5C OpCode 0x00
A 10/28/09 17:13:31 TCD4 fd7619d0 CC 02\04\03 LUN 0x0 Initiator 10000000C9627F5C OpCode 0x00
A 10/28/09 17:13:31 TCD4 fd7619d0 CC 02\04\03 LUN 0x0 Initiator 10000000C9627F5C OpCode 0x00
A 10/28/09 17:13:31 TCD4 fd7619d0 CC 02\04\03 LUN 0x0 Initiator 10000000C9627F5C OpCode 0x00
A 10/28/09 17:13:31 TCD4 fd7619d0 CC 06\29\00 LUN 0xE Initiator 10000000C9627F5C OpCode 0x00

A 10/28/09 17:13:39 TDD 9f4836c0 Rsv6 Persistent Reservation conflict for Initiator 10000000C9627F5C LUN E Tag 32.
A 10/28/09 17:13:39 TDD 9f4836c0 Rsv6 Persistent Reservation conflict for Initiator 10000000C9627F5C LUN E Tag 345.
A 10/28/09 17:13:40 TDD 9f4836c0 Rsv6 Persistent Reservation conflict for Initiator 10000000C9627F5C LUN E Tag 157.

<<<<<<<< LUN 29 (FLU 259) trespassed to SPB and no conflicts observed >>>>>>>

A 10/28/09 17:14:39 LUSM ff24d040 Enter 259 LU_ENABLED op=LUSM_RELEASE_FOR_TRESPASS el.st=0x1901 [ShutdownRelease.ForTrespass]
A 10/28/09 17:14:39 LUSM ff24d040 Exit 259 LU_SHUTDOWN_TRESPASS op=LUSM_RELEASE_FOR_TRESPASS el.st=0x1901 [ShutdownRelease.ForTrespass]
A 10/28/09 17:14:40 LUSM ff24d040 Enter 259 LU_PEER_ASSIGN op=LUSM_ASSIGN_PEER_DONE el.st=0x0
A 10/28/09 17:14:40 LUSM ff24d040 Exit 259 LU_PEER_ENABLED op=LUSM_ASSIGN_PEER_DONE el.st=0x0
A 10/28/09 17:14:40 LUSM ff24d040 Enter 259 LU_PEER_ENABLED op=LUSM_ASSIGN_PEER_DONE el.st=0x0
A 10/28/09 17:14:40 LUSM ff24d040 Exit 259 LU_PEER_ENABLED op=LUSM_ASSIGN_PEER_DONE el.st=0x0


<<<<<<<< LUN 29 (FLU 259) trespassed back to SPA from SPB we see reservation conflicts >>>>>>>

A 10/28/09 17:21:41 LUSM ff24d040 Enter 259 LU_PEER_ENABLED op=LUSM_RELEASE_FOR_TRESPASS el.st=0x2 [Assign.StartAssign]
A 10/28/09 17:21:41 LUSM ff24d040 Exit 259 LU_PEER_SHUTDOWN_TRESPASS op=LUSM_RELEASE_FOR_TRESPASS el.st=0x2 [Assign.StartAssign]
A 10/28/09 17:21:41 LUSM ff24d040 Enter 259 LU_PEER_SHUTDOWN_TRESPASS op=LUSM_RELEASE_FOR_TRESPASS_DONE el.st=0x2 [Assign.StartAssign]
A 10/28/09 17:21:41 LUSM ff24d040 Exit 259 LU_ASSIGN op=LUSM_PROCEED_WITH_ASSIGN el.st=0x2 [Assign.StartAssign]
A 10/28/09 17:21:41 LUSM ff24d040 Enter 259 LU_ASSIGN op=LUSM_PROCEED_WITH_ASSIGN el.st=0x2 [Assign.StartAssign]
A 10/28/09 17:21:41 LUSM ff24d040 Exit 259 LU_ASSIGN op=LUSM_PROCEED_WITH_ASSIGN el.st=0x3 [Assign.GlutRead]
A 10/28/09 17:21:41 CACHE a4bb2040 Starting assignment of LUN 259
A 10/28/09 17:21:41 LUSM ff24d040 Enter 259 LU_ASSIGN op=LUSM_ASSIGN_DONE el.st=0xd [Assign.Done]
A 10/28/09 17:21:41 LUSM ff24d040 Exit 259 LU_ENABLED op=LUSM_ASSIGN_DONE el.st=0xd [Assign.Done]

A 10/28/09 17:22:13 TCD4 fd7619d0 CC 02\04\03 LUN 0x0 Initiator 10000000C9627F5C OpCode 0x00
A 10/28/09 17:22:13 TCD4 fd7619d0 CC 02\04\03 LUN 0x0 Initiator 10000000C9627F5C OpCode 0x00

A 10/28/09 17:22:21 TDD 9f4836c0 Rsv6 Persistent Reservation conflict for Initiator 10000000C9627F5D LUN E Tag 256.
A 10/28/09 17:22:21 FCDMTL 3 (FE2) fd762bf0 Target command error: loopID = 33., SCSI status = 18, instance 0
A 10/28/09 17:22:21 TDD 9f4836c0 Rsv6 Persistent Reservation conflict for Initiator 10000000C9627F5D LUN E Tag 257.
A 10/28/09 17:22:21 FCDMTL 3 (FE2) fd762bf0 Target command error: loopID = 33., SCSI status = 18, instance 0
A 10/28/09 17:22:21 TDD 9f4836c0 Rsv6 Persistent Reservation conflict for Initiator 10000000C9627F5D LUN E Tag 258.
A 10/28/09 17:22:21 TDD 9f4836c0 Rsv6 Persistent Reservation conflict for Initiator 10000000C9627F5D LUN E Tag 259.

From the available logs and ktraces, reservation conflict messages are seen for initiators 10000000C9627F5D and 10000000C9627F5C, which are connected to host "xxx.local" (part of Storage Group "xxxx335"), ONLY when the LUN [ALU 29; HLU 14 (hex E)] is owned by (or trespassed to) SP A.
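
To pull the same summary out of a longer ktrace, the conflict lines can be tallied per initiator and LUN. The following is a minimal Python sketch, not an EMC tool. It assumes the line layout matches the excerpts above, that the leading letter identifies the SP writing the trace, and that the LUN field is the HLU in hex (E = 14, as noted above). The names `CONFLICT_RE` and `count_conflicts` are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# A 10/28/09 17:13:39 TDD 9f4836c0 Rsv6 Persistent Reservation conflict
#   for Initiator 10000000C9627F5C LUN E Tag 32.
CONFLICT_RE = re.compile(
    r"^(?P<sp>[AB]) (?P<date>\S+) (?P<time>\S+) .*"
    r"Persistent Reservation conflict for Initiator (?P<wwn>[0-9A-Fa-f]{16}) "
    r"LUN (?P<hlu>[0-9A-Fa-f]+) Tag"
)


def count_conflicts(lines):
    """Count conflict lines keyed by (SP, initiator WWN, HLU as decimal)."""
    counts = Counter()
    for line in lines:
        m = CONFLICT_RE.match(line.strip())
        if m:
            # The LUN field is assumed to be hexadecimal (E -> 14).
            key = (m["sp"], m["wwn"].upper(), int(m["hlu"], 16))
            counts[key] += 1
    return counts


# Sample lines copied from the ktrace excerpts above.
SAMPLE = [
    "A 10/28/09 17:13:39 TDD 9f4836c0 Rsv6 Persistent Reservation conflict for Initiator 10000000C9627F5C LUN E Tag 32.",
    "A 10/28/09 17:13:39 TDD 9f4836c0 Rsv6 Persistent Reservation conflict for Initiator 10000000C9627F5C LUN E Tag 345.",
    "A 10/28/09 17:13:40 TDD 9f4836c0 Rsv6 Persistent Reservation conflict for Initiator 10000000C9627F5C LUN E Tag 157.",
    "A 10/28/09 17:22:21 TDD 9f4836c0 Rsv6 Persistent Reservation conflict for Initiator 10000000C9627F5D LUN E Tag 256.",
    "A 10/28/09 17:22:21 FCDMTL 3 (FE2) fd762bf0 Target command error: loopID = 33., SCSI status = 18, instance 0",
]

if __name__ == "__main__":
    for (sp, wwn, hlu), n in sorted(count_conflicts(SAMPLE).items()):
        print(f"SP {sp}  initiator {wwn}  HLU {hlu}: {n} conflict(s)")
```

Lines that do not carry the conflict message, such as the FCDMTL entry in the sample, are simply skipped.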

A reservation conflict means that an initiator trying to access a LUN is not being allowed to, because another initiator has already reserved the LUN for its own access. Reservations are not created by the array itself; they are placed in response to commands sent by an initiator. The host therefore controls the reservations, but they are also cleared when the SP reboots. SP A is both the default and the current owner of ALU 29, so a reboot of SP A should clear the reservation conflicts.

From the provided traces and logs, no reservation conflicts are seen while the LUN is on SP B. The moment the LUN is trespassed to SP A, the traces report reservation conflicts for that LUN with error code "02\04\03" (LUN unavailable on requested SP).
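
When scanning traces for these codes, it can help to split the CC field into its three parts. This is a minimal Python sketch under one loud assumption: that the three backslash-separated hex fields follow the usual SCSI layout of sense key, ASC, and ASCQ. The KB text does not state this, and the array's own meaning for "02\04\03" is the one quoted above. The function names are made up for illustration.

```python
# Sense key names from the SCSI standard; only the two keys that appear
# in the traces above are listed. Treating the first CC field as a sense
# key is an assumption, not something the KB article states.
SENSE_KEY_NAMES = {0x02: "NOT READY", 0x06: "UNIT ATTENTION"}


def split_check_condition(cc: str) -> tuple[int, int, int]:
    """Split a CC field into (sense_key, asc, ascq), parsing each as hex."""
    parts = cc.strip().split("\\")
    if len(parts) != 3:
        raise ValueError(f"expected three backslash-separated fields, got {cc!r}")
    sense_key, asc, ascq = (int(p, 16) for p in parts)
    return sense_key, asc, ascq


def describe(cc: str) -> str:
    """Render a CC field with the assumed sense key name, if listed."""
    sense_key, asc, ascq = split_check_condition(cc)
    name = SENSE_KEY_NAMES.get(sense_key, "unlisted sense key")
    return f"sense key 0x{sense_key:02X} ({name}), ASC 0x{asc:02X}, ASCQ 0x{ascq:02X}"


if __name__ == "__main__":
    # Raw strings keep the backslashes literal instead of octal escapes.
    for cc in (r"02\04\03", r"06\29\00"):
        print(cc, "->", describe(cc))
```

Note the raw string literals: written as an ordinary Python string, "02\04\03" would be read as octal escapes rather than three fields.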

A reboot of the owning SP (that is, SP A), which is also the SP reporting the reservation conflicts, should clear the conflicts, and the LUN should be accessible after the reboot.

The problem will be fixed in future revisions of FLARE Release 26 and 29. However, a Type 3 patch is now available that can be installed if the array is running Release 29.12 or later.

This Type 3 patch can be obtained by contacting Technical Support and citing this solution ID (emc227055).  EMC Technical Support Level 2 personnel can download this Type 3 patch from the Patches page of the EMC Services Partner Web via Powerlink.  Access to Type 3 patches is restricted.  If you believe that you should have access to this area, send an email to CLARiiONType3PatchAccess@emc.com for assistance.

February 1st, 2011 06:00

Thanks for the response.

Upgrading FLARE to 509 was always recommended, but no one ever found a smoking gun. It was not until we submitted a second SP A panic dump that engineering found the issue. They are recommending a Type III patch on top of 509. They are onsite now; let's see what happens.
