How to quickly troubleshoot a hardware problem of VNXe

Question

Introduction

This article introduces quick troubleshooting tips and resolutions for hardware problems with VNXe series storage. It helps users check the VNXe device promptly.

It also helps you understand the hardware differences between the VNXe3100, VNXe3150 and VNXe3300.

Detailed Information

The VNXe series has three models: VNXe3100, VNXe3150 and VNXe3300. Their hardware specifications differ somewhat.

I. Differences in appearance between the VNXe3100, VNXe3150 and VNXe3300

VNXe3100:

Front view:

The front view of the VNXe3100 platform having a 2U, 12 (3.5-inch) disk drive DPE:

Rear view:

The rear view of a VNXe3100 platform having a 2U DPE with a Cache Protection Module and a single storage processor (SP A):

The rear view of a VNXe3100 platform having a 2U DPE with two storage processors (SP B and SP A):

VNXe3150:

Front view:

The front view of the VNXe3150 platform having a 2U, 12 (3.5-inch) disk drive DPE:

The front view of the VNXe3150 platform having a 2U, 25 (2.5-inch) disk drive DPE:

Rear view:

The rear view of a VNXe3150 platform having a 2U DPE with a Cache Protection Module and a single storage processor (SP A):

The rear view of a VNXe3150 platform having a DPE with two storage processors (SP B and SP A):

VNXe3300:

Front view:

The front view of the VNXe3300 platform having a 3U, 15 (3.5-inch) disk drive DPE (DPE7):

The front view of the VNXe3300 platform having a 3U, 25 (2.5-inch) disk drive DPE (DPE8):

Rear view:

The rear view of the VNXe3300 platform having a DPE with two storage processors (SP B and SP A):

II. Check and resolve hardware problems

Once you have checked the hardware and found that a component has failed, resolve the problem promptly.

To restore your system to full operation, you need to replace the faulted hardware component. For example, if a disk has faulted, replace it immediately.

Before replacing a hardware component, you need to identify the faulted part. Follow the steps below:

1. Click System > System Health

2. The SP or DAE containing the faulted part will be marked with a health icon, or the top-level component (SP or DAE) will already be expanded. If it is marked with one of the icons in the following table, expand the list by clicking the expand icon next to the SP or DAE. The faulted part is highlighted in the graphical display.

Icon	Label
	Warning
	Major
	Critical

3. In the System Components list, select the faulted part to view a description of the part's properties.

4. You need the following information to order a replacement part:

· VNXe serial number - located in the System Info section

· Product ID - located in the Component Description section

· Serial Number (SN) - located in the Component Description section

Note: If the System Health page cannot determine the S/N and P/N of a part, read them from the labels on the part itself. Once you have confirmed the information needed, you can order the replacement part.

Note: Before you order a replacement part, you can try power-cycling the entire VNXe system. This can resolve low-level problems with the Storage Processors (SPs), I/O connections, disk-array enclosures, the system software, and other system components, and return the system to an operational state.

This procedure involves placing the SPs in Service Mode.

All hosts will lose access to the system. To prevent data loss, ensure that all host operations that require the VNXe system have completed.

Follow the steps below (in exact order):

1. Place both SPs in Service Mode. If the SP is already in Service Mode, you do not need to perform this action.

2. Disconnect the power cables from the disk-processor enclosure (DPE) to power down the SPs.

3. Disconnect the power cables from the power supplies on each disk-array enclosure (DAE) to power them down.

4. Reconnect the power cables to the power supplies on each DAE to power them up.

5. Reconnect the power cables to the DPE to power up the SPs.

6. Reboot each SP to return them to Normal Mode.

Note: When both Storage Processors (SPs) are in Service Mode, always return SPA to normal operation first, to avoid management software conflicts. Once SPA is operating normally, you can return SPB to normal operation.

If the problem persists, you need to contact EMC support and replace the component.

III. Summary

The EMC VNXe series is a unified storage solution. Designed for IT generalists with limited storage expertise, the VNXe abstracts the implementation of advanced storage functionality through an application-driven approach to managing shared storage. You will often use the hardware troubleshooting steps described above during daily VNXe maintenance.

It is strongly recommended to save this information for reference.

Author: Leo Li

iEMC APJ

asafayan1 · Answer

Great document. But what if you want to troubleshoot the VNXe at the CLI? What commands would you use? Thanks, Amir

ECN-APJ · Answer

The EMC VNXe series is an affordable unified storage platform with solution-focused software that is easy to manage, provision, and protect. In addition, VNXe Unisphere is a graphical, application-oriented interface with a web-familiar look and feel, so customers can easily manage and use VNXe storage through Unisphere. In most cases, you only need to check the hardware status in Unisphere. If you are interested in the VNXe CLI and service commands, see the VNXe Unisphere CLI User Guide and the VNXe Service Commands Technical Notes, which describe the commands in detail. In general, these two documents are mostly used by EMC employees and partner engineers.

asafayan1 · Answer

From a monitoring perspective, one significant problem with the VNXe is its lack of SNMP polling support. It is not practical to access the GUI to learn the health / environmental status of the various components of the platform. That health / environmental status is a fundamental aspect of regular SNMP polling.

The VNXe platforms support SNMP traps, but only on a very generic / high-level basis, as shown here:

MIB Details
TABLE 1:
The vnxe_alert.mib contains the following 8 traps:

EVENT	OID	DESCRIPTION
vnxeGenericTrapEmergency	.1.3.6.1.4.1.1139.18.1.18.2.0	This trap is generated when the system is unusable.
vnxeGenericTrapAlert	.1.3.6.1.4.1.1139.18.1.18.2.1	This trap is generated when action needs to be taken immediately.
vnxeGenericTrapCritical	.1.3.6.1.4.1.1139.18.1.18.2.2	This trap is generated when the system is in critical condition.
vnxeGenericTrapError	.1.3.6.1.4.1.1139.18.1.18.2.3	This trap is generated when there is an error in the system.
vnxeGenericTrapWarning	.1.3.6.1.4.1.1139.18.1.18.2.4	This trap is generated when there is a warning condition in the system.
vnxeGenericTrapNotice	.1.3.6.1.4.1.1139.18.1.18.2.5	This trap is generated when there is a normal but significant condition in the system.
vnxeGenericTrapInformational	.1.3.6.1.4.1.1139.18.1.18.2.6	This trap is generated when there is an informational message.
vnxeGenericTrapDebug	.1.3.6.1.4.1.1139.18.1.18.2.7	This trap is generated when there is a debug-level message.

TABLE 2:
The vnxe_alert.mib contains the following 5 Trap Variables:

VARIABLE	DESCRIPTION
::= { vnxeTrapVariable 1 }	"This is node/IP address of the system that causes the trap."
::= { vnxeTrapVariable 2 }	"This is the component that causes the trap."
::= { vnxeTrapVariable 3 }	"This is the symptom ID that causes the trap."
::= { vnxeTrapVariable 4 }	"This is the symptom description for SymptomID."
::= { vnxeTrapVariable 5 }	"This is the timestamp of the trap."
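
Because the eight traps differ only in the last OID component, a trap receiver can map an incoming OID to its severity with a small lookup. The following is a minimal Python sketch, not part of any EMC tooling: the OID prefix and names are copied from TABLE 1, while the function names and the "Warning or worse" threshold are assumptions made for illustration.

```python
# Severity names and the OID prefix are copied from TABLE 1 (vnxe_alert.mib).
TRAP_PREFIX = ".1.3.6.1.4.1.1139.18.1.18.2."
TRAP_NAMES = [
    "Emergency", "Alert", "Critical", "Error",
    "Warning", "Notice", "Informational", "Debug",
]


def trap_severity(oid):
    """Return (level, name) for a VNXe generic trap OID, or None if unknown."""
    normalized = "." + oid.strip().lstrip(".")  # accept OIDs with or without a leading dot
    if not normalized.startswith(TRAP_PREFIX):
        return None
    suffix = normalized[len(TRAP_PREFIX):]
    if not suffix.isdigit() or int(suffix) >= len(TRAP_NAMES):
        return None
    level = int(suffix)
    return level, TRAP_NAMES[level]


def needs_attention(oid, threshold=4):
    """True for Warning (4) or more severe. The threshold is an assumption."""
    result = trap_severity(oid)
    return result is not None and result[0] <= threshold
```

Lower numbers are more severe, so a receiver could page on levels 0 to 3 and only log the rest.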

As a result, I'm developing a script to leverage the CLI to gain insight into the health / environmental values typically ascertained via SNMP polling. It will leverage the following CLI commands:

1. ssh to "Management IP" of EMC VNXe Platform
2. Issue the following CLI command:

service@(none) spa:~> svc_diag -state=spinfo | less
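
For a script, the same svc_diag command could be issued without an interactive session. The sketch below uses Python's subprocess module to call the local ssh client. It assumes that the service account is named "service" (as the prompt suggests), that ssh authentication is already set up, and that the VNXe service shell accepts a command passed on the ssh command line; none of this is confirmed by EMC documentation. The function names and the placeholder IP address are hypothetical.

```python
import subprocess


def build_ssh_command(host, remote_command, user="service"):
    """Build the argv list for running one command on the VNXe over ssh."""
    return ["ssh", f"{user}@{host}", remote_command]


def run_service_command(host, remote_command, timeout=60):
    """Run the command and return its stdout as text; raises on a non-zero exit."""
    completed = subprocess.run(
        build_ssh_command(host, remote_command),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return completed.stdout


# Hypothetical usage (192.0.2.10 is a placeholder management IP; "| less" is
# dropped because nothing is paging the output):
# text = run_service_command("192.0.2.10", "svc_diag -state=spinfo")
```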

The output of svc_diag -state=spinfo provides the health status and environmental values of the DPE and DAE components, and the status of the individual disks, as shown below:
COMMAND: svc_diag -state=spinfo | less
SAMPLE OUTPUT:

service@(none) spa:~> svc_diag -state=spinfo | less
This SP's system type is: EMCHW SENTRY DUAL
This SP's ID is: SPA
Displaying all FRU statuses:
dpe: OK temp: 18
spa: OK : (0x2d) O/S running
dimm0: OK
dimm1: OK
dimm2: OK
ps: OK 229
bbu: OK
fan: OK
slic0: OK POSEIDON
slic1: REMOVED UNKNOWN
sas0: CONNECTED
sas1: DISCONNECTED
sasxp: OK 0144
spb: OK : (0x2d) O/S running
dimm0: UNKNOWN
dimm1: UNKNOWN
dimm2: UNKNOWN
ps: OK 215
bbu: OK
fan: OK
slic0: OK POSEIDON
slic1: REMOVED UNKNOWN
sas0: UNKNOWN
sas1: UNKNOWN
sasxp: OK 0144
dae_0_1: OK temp: 18
lcca: OK 0144
psa: OK 40
lccb: OK 0144
psb: OK 29
Displaying backend status:
disk state vendor type capacity bsize speed
---- ----- ------ ---- -------- ----- -----
0_0_00 OK SEAGATE SAS 0x218ceece 520 6GB
0_0_01 OK SEAGATE SAS 0x218ceece 520 6GB
0_0_02 OK SEAGATE SAS 0x218ceece 520 6GB
0_0_03 OK SEAGATE SAS 0x218ceece 520 6GB
0_0_04 OK SEAGATE SAS 0x218ceece 520 6GB
0_0_05 OK SEAGATE SAS 0x218ceece 520 6GB
0_0_06 OK SEAGATE SAS 0x218ceece 520 6GB
0_0_07 OK SEAGATE SAS 0x218ceece 520 6GB
0_0_08 OK SEAGATE SAS 0x72a5d655 520 6GB
0_0_09 OK SEAGATE SAS 0x72a5d655 520 6GB
0_0_10 OK SEAGATE SAS 0x72a5d655 520 6GB
0_0_11 OK SEAGATE SAS 0x72a5d655 520 6GB
0_0_12 OK SEAGATE SAS 0x72a5d655 520 6GB
0_0_13 OK SEAGATE SAS 0x72a5d655 520 6GB
0_0_14 OK SEAGATE SAS 0x72a5d655 520 6GB
0_0_15 REMOVED
0_0_16 REMOVED
0_0_17 REMOVED
0_0_18 REMOVED
0_0_19 REMOVED
0_0_20 REMOVED
0_0_21 REMOVED
0_0_22 REMOVED
0_0_23 REMOVED
0_0_24 REMOVED
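
The backend section of this output is regular enough to parse with one regular expression. The Python sketch below extracts each disk ID and its state and lists the disks that need a closer look. It assumes the layout matches the sample above, and it treats REMOVED as an empty slot rather than a fault, which is an assumption based only on this sample. The function names are hypothetical.

```python
import re

# A disk ID such as 0_0_07 followed by an upper-case state word (OK, REMOVED, ...).
DISK_RE = re.compile(r"\b(\d+_\d+_\d+)\s+([A-Z]+)\b")


def parse_disks(spinfo_text):
    """Return {disk_id: state} from the 'Displaying backend status:' section."""
    _, _, backend = spinfo_text.partition("Displaying backend status:")
    return dict(DISK_RE.findall(backend))


def disks_needing_attention(disks, ignore=("OK", "REMOVED")):
    """Return sorted disk IDs whose state is not in `ignore`.

    Ignoring REMOVED (treated as an empty slot) is an assumption.
    """
    return sorted(disk for disk, state in disks.items() if state not in ignore)
```

The regular expression matches on whitespace in general, so it works whether the output arrives one disk per line or as a single wrapped line.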

NOTE: When you issue the above command, you are connected to the primary SP (Storage Processor). In the example output, we are connected to SP "A". The dimmX values (where X is the index of the DIMM module) and the sas0 and sas1 values show as "UNKNOWN" for SP "B". To view these values for the secondary SP, issue the "ssh peer" command and then, once connected to the secondary SP, issue the "svc_diag -state=spinfo | less" command again, as follows:

service@(none) spa:~> ssh peer
Last login: Thu Sep 18 22:42:03 2014 from peer
service@(none) spb:~> svc_diag -state=spinfo | less
======== Now executing spinfo state ========
This SP's system type is: EMCHW SENTRY DUAL
This SP's ID is: SPB
Displaying all FRU statuses:
dpe: OK temp: 18
spa: OK : (0x2d) O/S running
dimm0: UNKNOWN
dimm1: UNKNOWN
dimm2: UNKNOWN
ps: OK 226
bbu: OK
fan: OK
slic0: OK POSEIDON
slic1: REMOVED UNKNOWN
sas0: UNKNOWN
sas1: UNKNOWN
sasxp: OK 0144
spb: OK : (0x2d) O/S running
dimm0: OK
dimm1: OK
dimm2: OK
ps: OK 218
bbu: OK
fan: OK
slic0: OK POSEIDON
slic1: REMOVED UNKNOWN
sas0: CONNECTED
sas1: DISCONNECTED
sasxp: OK 0144
dae_0_1: OK temp: 18
lcca: OK 0144
psa: OK 40
lccb: OK 0144
psb: OK 29
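
Since each SP reports UNKNOWN for its peer's DIMM and SAS values, a script has to combine the two outputs to get a complete picture. The Python sketch below parses the FRU section of each output into a dictionary keyed by component path (for example "spb/dimm0") and merges the two, preferring any value that is not UNKNOWN. It assumes the layout matches the samples above; the path naming and function names are hypothetical, and numeric readings such as temp are not captured.

```python
import re

# A lower-case component name, a colon, then an upper-case state word.
FRU_RE = re.compile(r"\b([a-z][a-z0-9_]*):\s+([A-Z]+)\b")
PARENTS = ("dpe", "spa", "spb")


def parse_frus(spinfo_text):
    """Return {component_path: state} from the FRU section of spinfo output."""
    _, _, rest = spinfo_text.partition("Displaying all FRU statuses:")
    fru_section, _, _ = rest.partition("Displaying backend status:")
    states, parent = {}, None
    for name, state in FRU_RE.findall(fru_section):
        if name in PARENTS or name.startswith("dae_"):
            parent = name  # later components belong to this enclosure or SP
            states[name] = state
        elif parent is not None:
            states[f"{parent}/{name}"] = state
    return states


def merge_views(view_a, view_b):
    """Combine two SP views, preferring any value that is not UNKNOWN."""
    merged = dict(view_b)
    for key, state in view_a.items():
        if state != "UNKNOWN" or key not in merged:
            merged[key] = state
    return merged
```

Any component that is still UNKNOWN after the merge was UNKNOWN from both SPs, which is probably worth an alert of its own.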

<<<<<<<<<<>>>>>>>>>>>

The following list contains all the DPE and DAE components and environmentals generated from the command:
DPE Status and Environmental Values
1.   DPE Status
2.   DPE Temperature
3.   SPA / SPB Status
4.   SPA / SPB DIMM Status
5.   SPA / SPB PSU Status
6.   SPA / SPB Battery Backup Unit Status
7.   SPA / SPB Fan Status
8.   SPA / SPB slic0 Status - IO Modules with dual 10Gbps northbound uplinks to the CORE
9.   SPA / SPB sas0 Status - 6 Gbps port that provides interconnect to DAE

DAE Status and Environmental Values
1. DAE Status
2. DAE Temperature
3. DAE LCC A / LCC B Status - Link Control Cards / 6 Gbps port that provides interconnect to DPE
4. DAE PSUs Status

Step 2

1. ssh to "Management IP" of EMC VNXe Platform
2. Issue the following CLI command on the primary SP: svc_storagecheck --sizes | less

service@(none) spb:~> svc_storagecheck --sizes | less
<<<<<<<<>>>>>>>>>>>>>>>>>>>>>
=======================
Now running ./server_df ALL ...
=======================
server_2 :
Filesystem kbytes used avail capacity Mounted on
vol_2_1405717341 531776624 201771512 330005112 38% /vol_2_1405717341
vol_1_1405710374 2167267416 144432720 2022834696 7% /vol_1_1405710374
NFS00_15K_SPA 1354475408 1145093152 209382256 85% /NFS00_15K_SPA
NFS01_7K_SPA 2114715632 1842266968 272448664 87% /NFS01_7K_SPA
root_fs_common 15368 5272 10096 34% /.etc_common
root_fs_2 129056 8504 120552 7% /
server_3 :
Filesystem kbytes used avail capacity Mounted on
NFS02_7K_SPB 2114715632 351572240 1763143392 17% /NFS02_7K_SPB
NFS00_7K_SPB 2114715632 1064917728 1049797904 50% /NFS00_7K_SPB
root_fs_common 15368 5272 10096 34% /.etc_common
root_fs_3 129056 8248 120808 6% /
=======================
[Fri Sep 19 00:07:48 UTC 2014] End of Run
=======================
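
The server_df section can be turned into capacity checks in the same way. The Python sketch below reads each file system row, remembers which server it belongs to, and reports the file systems at or above a usage threshold. It assumes the column layout matches the sample above; the 80% threshold and the function names are assumptions for illustration.

```python
import re

# Either a "server_N :" header, or a row: name, kbytes, used, avail, NN%, mount point.
ROW_RE = re.compile(
    r"(?P<server>server_\d+)\s*:"
    r"|(?P<fs>\S+)\s+(?P<kbytes>\d+)\s+(?P<used>\d+)\s+(?P<avail>\d+)"
    r"\s+(?P<pct>\d+)%\s+(?P<mount>\S+)"
)


def parse_server_df(text):
    """Return a list of dicts, one per file system row, tagged with its server."""
    rows, server = [], None
    for match in ROW_RE.finditer(text):
        if match.group("server"):
            server = match.group("server")
            continue
        rows.append({
            "server": server,
            "filesystem": match.group("fs"),
            "kbytes": int(match.group("kbytes")),
            "used": int(match.group("used")),
            "avail": int(match.group("avail")),
            "percent": int(match.group("pct")),
            "mount": match.group("mount"),
        })
    return rows


def over_threshold(rows, limit=80):
    """Return rows whose used capacity is at or above `limit` percent (assumed value)."""
    return [row for row in rows if row["percent"] >= limit]
```

With the sample output above and the default limit, this would flag NFS00_15K_SPA (85%) and NFS01_7K_SPA (87%).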

NOTE: If you run the command on the secondary SP, you will receive the following error message:

service@(none) spa:~> svc_storagecheck --sizes | less
=======================
[Fri Sep 0:12:45 UTC 2014] End of Run
=======================
--- ERROR: this utility can only be run on the master SP.
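
A script can use this error text to tell which SP is the master. The Python sketch below checks each SP's captured output for the message and returns the first one that ran successfully. The error string is copied from the output above; the function names and the dictionary of outputs keyed by SP name are hypothetical.

```python
# Error text copied from the sample output above.
MASTER_ONLY_ERROR = "this utility can only be run on the master SP"


def ran_on_secondary(output):
    """True when svc_storagecheck refused to run because this SP is not the master."""
    return MASTER_ONLY_ERROR in output


def first_master_output(outputs_by_sp):
    """Return (sp_name, output) for the first SP without the error, else None."""
    for sp_name, output in outputs_by_sp.items():
        if not ran_on_secondary(output):
            return sp_name, output
    return None
```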

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

I would love to know what other commands engineers are using to "poll" their VNXe platforms for health/environmentals and IO performance metrics. There seems to be a complete lack of performance data available from the GUI and the CLI on the VNXe platforms.

Regards,

Amir

ECN-APJ · Answer

VNXe engineers also use the commands listed in the VNXe Unisphere CLI User Guide and the VNXe Service Commands Technical Notes. There are no other official commands. If you are interested in the VNXe's hidden performance data, refer to the document How to check VNXe performance statistics data. You may find the information you want there.
