Odd lun behavior I/O errors

Question

Greetings,

I'm not sure whether this is a PowerPath issue, but I wasn't sure where else to start. We have a RHEL4 server connected via SAN to a CX3-40. It runs PP 5.1.0. All of this generally works fine.

The other day the admin created a new lun (55 gig) from RG3, assigned it to a server, rescanned the server bus, and then got I/O errors when running a pvscan. Navisphere and host agents were updated. An error like this shows up:
/dev/emcpowerci: read failed after 0 of 4096 at 37580898304: Input/output error

I removed the lun, destroyed it, recreated it and did the same steps. The same error occurred. I took a different lun from RG3 that wasn't assigned to anything, added it to the host and everything was happy. I destroyed the problem lun again, recreated it as a 1 gig lun and a 50 gig lun, then presented both to the same host. The 1 gig lun was fine and we got I/O errors on the 50 gig lun. We have tried creating luns of varying sizes. The last was a 35 gig lun. These messages show up at the end of the dmesg output:

SCSI device sdb: 73400320 512-byte hdwr sectors (37581 MB)
SCSI device sdb: drive cache: write through
sdb: unknown partition table
emcpowerci: unknown partition table
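
The size reported by dmesg and the offset in the failed read can be compared directly. The following is a minimal shell sketch that uses only the numbers quoted above; it assumes the failed read and the dmesg lines refer to the same 35 gig device (both name emcpowerci). The variable names are illustrative.

```shell
# Numbers copied from the dmesg output and the pvscan error above.
sectors=73400320          # 512-byte sectors reported for sdb
sector_size=512
fail_offset=37580898304   # byte offset of the failed 4096-byte read

# Device size in bytes, and how far the failed read sits from the end.
dev_bytes=$(( sectors * sector_size ))
gap=$(( dev_bytes - fail_offset ))

echo "device size: $dev_bytes bytes"
echo "failed read: offset $fail_offset"
echo "gap to end : $gap bytes ($(( gap / 1024 )) KiB)"
```

The gap works out to 65536 bytes, so the failing read starts 64 KiB before the end of the device and is inside the size the kernel reports. That only locates the read; it does not by itself say why the read fails.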

I'm at a loss to figure out what is going on. Any help would be appreciated.

Thanks,

Jeff

dynamox · Answer

Interesting indeed. Jeff, can you take this 35G LUN out of the storage group and hit Apply? Then put it back into the storage group, but before you click Apply, scroll to the left and you will see a column called 'Host id'. By default Navisphere will try to use the lowest number available, but go ahead and set it to something high: whatever the highest Host id is, plus 1. I am just thinking that maybe something is stuck in the kernel and moving it to a higher host id will let it come in OK. Worth a shot.

jeffc2 · Answer

I'll give it a try and see. It doesn't appear to be tied to the lun ID for the host though. The original 58 gig lun was lun ID 44 and had the errors. When it was destroyed and the 1 gig lun created first, the 1 gig lun became lun ID 44 and that worked. The remaining piece became lun 45 and both size creations of that have failed. The lun migrated from the other server worked on server A with the array ID and got a new lun ID when it went to server B. It has I/O errors on server B. When it was put back to server A, there were no errors for that same lun back on host A.

I've been more of a hpux guy so I don't know if the device file creation is correct for linux. I've noticed that each time we have introduced a lun to server B, it gets a new emcpowerxx device name. On hpux it would assign a new disk to the lowest available cxtxdx device name.

Is it possible that whatever is managing the lun mapping for PowerPath is somehow corrupted? If it's not updating mappings correctly I can see problems occurring. I just don't know how to trace or fix something like this.

Jeff

dynamox · Answer

Remove this lun from server B. On server B, run 'powermt check' and remove any dead luns/paths it finds. Then present the lun back (rescan the HBAs to make sure the LUN is seen by Linux), and then run powermt config and powermt save.
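
The sequence described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names and the `DRY_RUN` switch are hypothetical, the sysfs rescan is one generic way to rescan on a 2.6 kernel (a vendor rescan script is another), and removing and re-adding the lun in the storage group still happens by hand on the array between the two functions. With `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/bin/bash
# DRY_RUN=1 prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run after the lun has been removed from the storage group.
# powermt check reports dead paths and typically prompts before removing them.
clean_dead_paths() {
    run powermt check
}

# Run after the lun has been added back to the storage group.
bring_lun_back() {
    # Ask every SCSI host to rescan all channels, targets and luns.
    for host in /sys/class/scsi_host/host*; do
        [ -e "$host/scan" ] || continue
        run sh -c "echo '- - -' > $host/scan"
    done
    # Let PowerPath configure the new paths, then persist the configuration.
    run powermt config
    run powermt save
}
```

A dry run of `clean_dead_paths` followed by `bring_lun_back` prints the commands in the order they would execute, which makes it easy to review before setting `DRY_RUN=0` on the host.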

jeffc2 · Answer

The admin had already removed the lun from the server and run the powermt check. That worked fine and the pvscan showed no errors on any disks. I added a problem lun back in and ran the ql-scan-lun.sh script so the host and adapter could find the lun. I then ran the powermt config and the save, which ran with no errors. When I ran a pvscan I got this error on the new lun:

/dev/emcpowercg: read failed after 0 of 4096 at 104152891392: Input/output error

This is one of the luns that was working fine on server A previously. The local admin used another disk from server A that was previously in the same file system as this lun and it works fine. I guess I need to get a call open and have someone look at the SP collects. Nothing shows up in the SP event log.
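
One way to narrow this down before opening a call might be to repeat the exact read that pvscan failed on, outside of LVM. The helper below is a hypothetical sketch (the name `probe_cmd` is not from any tool); it only prints a `dd` command for a given device and byte offset so the command can be reviewed before it is run.

```shell
# Print (do not run) a dd command that reads the one block at a byte offset.
probe_cmd() {
    local dev="$1" offset="$2" bs="${3:-4096}"
    # skip= counts whole blocks, so the offset must be block-aligned.
    if [ $(( offset % bs )) -ne 0 ]; then
        echo "offset $offset is not a multiple of $bs" >&2
        return 1
    fi
    echo "dd if=$dev of=/dev/null bs=$bs skip=$(( offset / bs )) count=1"
}

# Device and offset taken from the pvscan error above.
probe_cmd /dev/emcpowercg 104152891392
```

Running the printed command against the emcpower pseudo device, and then against each native sd path listed by `powermt display dev=emcpowercg`, could show whether the error follows PowerPath or appears on the underlying paths as well.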

Jeff

SKT2 · Answer

First of all, which exact version of RHEL do you have (which update), and is it 32 or 64 bit? Also make sure the qlogic script you are running supports it. We have seen our Linux RHEL AS4 U6 64-bit servers rebooting when the qlogic script is used. (What we learned is that the script we used supports 32-bit.)

What you reported from dmesg is quite common; it appears when new devices are scanned. But why you get an I/O error during pvscan is a mystery.

jeffc2 · Answer

The qlogic scripts we use do seem to work fine. It is the 1.6 release version of their tools. The OS on this system is RHEL WS 4.5.

The admin at that office is going to open a call with EMC and see if they can get to the bottom of it. We'll see if they are able to figure anything out.

Thanks,

Jeff
