December 22nd, 2008 22:00

Error Stats counter in powermt display

I get a lot of complaints from the DB/apps team that they are seeing SCSI errors, performance problems, and other I/O-related issues on the host, and they feel it has something to do with the Error Stats counter in powermt display.

I have been seeing a lot of hosts where the "Stats Errors" column in powermt display is non-zero. I understand that this is the total number of times any logical I/O path on this bus transitioned from alive to dead. It is always equal to or less than the total number of HBA I/O path errors. It is cleared at boot time or when powermt restore executes.

The troubleshooting steps that I normally take are:
1) Check whether the errors are seen on all the LUNs on a particular FA / HBA.
2) Check whether there are errors reported by other hosts connected to the same FA.
3) Check for any link issues on the switch port (logging log, in a Cisco environment).
4) Check the host event log.

What I wanted to know is: is there any way to find out when the error counter was incremented (that is, when I/O paths on this bus transitioned from alive to dead), either through a powermt command or through any PowerPath logs that I can check, so that I can determine when it occurred or whether it is just a stale entry?

Also, please let me know if there is anything else that I need to check apart from the steps listed above.

We are running PowerPath (c) Version 4.3.1 (build 40).

Storage class = Symmetrix
==============================================================================
----------- Storage System ---------------   -- I/O Paths --    --- Stats ---
    ID          Interface         Wt_Q      Total    Dead       Q-IOs  Errors
==============================================================================
000xxxxxxxxx    FA  9bA            256         16       0           1      16
000xxxxxxxxx    FA  8bB            256         16       0           1      16
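
To make step 1 of the list above (checking whether errors cluster on a particular FA) quicker across many hosts, the summary could be parsed with a short script. This is only a rough sketch: the column layout is assumed from the sample output in this post, other PowerPath versions or storage classes may print it differently, and the function names are made up for illustration.

```python
import re

# Row layout assumed from the sample in this post:
# ID, "FA <port>", Wt_Q, Total, Dead, Q-IOs, Errors
ROW = re.compile(
    r"^(?P<id>\S+)\s+(?P<iface>FA\s+\S+)\s+(?P<wt_q>\d+)\s+(?P<total>\d+)"
    r"\s+(?P<dead>\d+)\s+(?P<q_ios>\d+)\s+(?P<errors>\d+)$"
)


def parse_powermt_summary(text):
    """Return one dict per data row; headers and separators are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        row = match.groupdict()
        # Collapse padding so "FA  9bA" and "FA 9bA" compare equal.
        row["iface"] = " ".join(row["iface"].split())
        for key in ("wt_q", "total", "dead", "q_ios", "errors"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


def interfaces_with_errors(rows):
    """List (interface, error count) for every row with a non-zero counter."""
    return [(r["iface"], r["errors"]) for r in rows if r["errors"] > 0]
```

Feeding it the captured output from several hosts and comparing the results would show whether the same FA is flagged everywhere. It cannot say when the counter was incremented, only which interfaces carry a non-zero value.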

December 22nd, 2008 23:00

Hi Chiren,

PowerPath only logs entries to the syslog (HP-UX), messages (Solaris/Linux), and errpt (AIX) files. As you have said, the error counters in the powermt display output are a historical record of the number of path status changes. Usually when I'm troubleshooting, my steps are as follows:

1. Check the powermt output for the error counters and see whether there is a commonality (all errors on one HBA, all on one FA, etc.).
2. Check the uptime on the host so you can get an idea of the frequency of these errors.
3. Search the syslog/messages/errpt for the string "dead" or "path state change", starting from the bottom (most recent) and working backwards, to see what caused the path to die.
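
Step 3 could be scripted roughly as follows. Treat it as a sketch: the two search strings are the ones mentioned in this thread, the exact message wording may differ between PowerPath versions and platforms, and the log file path is whatever applies on your host (on AIX you would feed it errpt output instead).

```python
import sys


def recent_path_events(lines, needles=("dead", "path state change"), limit=20):
    """Scan log lines from newest to oldest and return up to `limit` matches.

    Matching is a case-insensitive substring test, so unrelated lines that
    happen to contain "dead" will also show up and need a manual look.
    """
    hits = []
    for line in reversed(lines):
        lowered = line.lower()
        if any(needle in lowered for needle in needles):
            hits.append(line.rstrip("\n"))
            if len(hits) >= limit:
                break
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is supplied by the caller; nothing is assumed here.
    with open(sys.argv[1], "r", errors="replace") as handle:
        for event in recent_path_events(handle.readlines()):
            print(event)
```

The timestamps on the matching lines are what tell you when a path actually went dead, which the counter in powermt display does not.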

From the above information you can usually identify the failing component. If there is still doubt, move on to the switches and work your way through the fabric back to the FA port on the array.

For more real-time monitoring, perhaps you could ask whether the sysadmins can monitor the host's syslog for strings such as "Killing Bus", as this is the message you see when all paths to an FA/SP die or all paths from an HBA die, and raise an alert in real time if these occur.
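
If the sysadmins have no monitoring tool to hand, a very small watcher along these lines might do as a stopgap. It is a sketch only: "Killing Bus" is the string mentioned above, the function names are invented for the example, and the tail loop does not cope with log rotation.

```python
import time


def follow(path, poll_seconds=1.0):
    """Yield lines appended to `path` after start-up, like `tail -f`."""
    with open(path, "r", errors="replace") as handle:
        handle.seek(0, 2)  # jump to the end of the file; skip old entries
        while True:
            line = handle.readline()
            if line:
                yield line
            else:
                time.sleep(poll_seconds)


def killing_bus_alerts(lines, needle="Killing Bus"):
    """Yield every line that contains the alert string."""
    for line in lines:
        if needle in line:
            yield line.rstrip("\n")
```

Something like `for alert in killing_bus_alerts(follow(log_path)): print(alert)` would then print each hit as it arrives, with `log_path` set to the host's syslog or messages file; swapping the print for a mail or pager call is left to whatever alerting you already use.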