June 20th, 2012 08:00
Multipath (native) issues on CentOS 6.2 with CX700 (random I/O errors)
Hi there,
Well, I am quite aware that CentOS is *not* a supported OS, but perhaps there'll be someone out there that might at least throw a hint at the possible cause of the problem.
Having spent several days debugging this error, I'll try to be as systematic as possible in describing it - presenting the problem and the system diagnostic output that (I feel) is related to it in some way. If more information is needed, please let me know!
My sincere apologies for breaking netiquette with the length of the post - but I'm rather desperate at this point.
I did my best to at least keep it formatted in an easy-to-read way...
Thank you for taking a look at this.
Best regards,
Matt
TABLE OF CONTENTS:
----------------------------------------------------------
1.) Problem description
2.) Server basic info
3.) OS, boot and kernel configuration
4.) Diagnostic command results
4.1) Filesystem
4.2) Devices
4.3) LVM
4.4) Multipath
4.5) Modules
----------------------------------------------------------
1.) Problem description:
After multipath loads at system startup and partitions get remounted to mpathX devs, I am getting random buffer and r/w I/O errors from what I assume to be attempts to access data via passive paths. This happens randomly for all devices from what I've seen.
The system boots entirely from the CX700 SAN storage and as such has no locally attached drives.
Error excerpt from /var/log/messages (this stuff really spams the log, so I'm including only a single entry)
----------------------------------------------------------------------------------------------------
Jun 20 15:11:59 server kernel: sd 1:0:1:0: [sdf] Device not ready
Jun 20 15:11:59 server kernel: sd 1:0:1:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 20 15:11:59 server kernel: sd 1:0:1:0: [sdf] Sense Key : Not Ready [current]
Jun 20 15:11:59 server kernel: sd 1:0:1:0: [sdf] 80Add. Sense: Logical unit not ready, manual intervention required
Jun 20 15:11:59 server kernel: sd 1:0:1:0: [sdf] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
Jun 20 15:11:59 server kernel: 00
Jun 20 15:11:59 server kernel: Buffer I/O error on device sdf, logical block 0
Jun 20 15:11:59 server kernel: 00 08 00
Jun 20 15:11:59 server kernel: Buffer I/O error on device sdf, logical block 0
Jun 20 15:11:59 server kernel:
Jun 20 15:11:59 server kernel: end_request: I/O error, dev sdf, sector 41942912
Jun 20 15:11:59 server kernel: end_request: I/O error, dev sdf, sector 8
Jun 20 15:11:59 server kernel: end_request: I/O error, dev sdf, sector 8
Jun 20 15:11:59 server kernel: end_request: I/O error, dev sdf, sector 41943024
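The numbers in the log excerpt above can be unpacked a little. The following is a minimal sketch, not a diagnostic tool: it decodes the READ(10) CDB shown in the log and measures how far the failing sectors are from the end of the LUN. It assumes 512-byte logical sectors and that the LUN behind sdf is exactly 20 GiB (multipath -ll reports size=20G); neither is stated explicitly in the post.

```python
# Assumptions (not stated in the post): 512-byte logical sectors, and the
# LUN behind sdf is exactly 20 GiB (multipath -ll reports size=20G).
SECTOR_BYTES = 512
LUN_BYTES = 20 * 1024**3


def decode_read10(cdb_hex: str) -> dict:
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),     # bytes 2-5: logical block address
        "blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7-8: transfer length
    }


total_sectors = LUN_BYTES // SECTOR_BYTES

# CDB copied from the log line "CDB: Read(10): 28 00 ... 08 00".
print(decode_read10("28 00 00 00 00 00 00 00 08 00"))

# Sector numbers copied from the end_request lines.
for sector in (8, 41942912, 41943024):
    print(sector, "->", total_sectors - sector, "sectors before the end of the LUN")
```

Under those assumptions the failed reads sit at the very start of the device and within the last 128 sectors of it. That pattern would be consistent with something probing the raw sdf path for labels or partition tables rather than with ordinary filesystem I/O, though the log alone does not prove it.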
2.) Server and storage basic info:
Server: IBM eServer BladeCenter LS20,
BIOS [BKE121AUS-1.08]- 01/12/2006
Dedicated LUNs: 3 LUNs (default owner SP-B, trespassing is not enabled, prefetch set to variable / favor prefetch) - 1xRAID10 (20 GB boot LUN), 1xRAID10 (30 GB data LUN), 1xRAID5 (60 GB data LUN)
Storage: SAN CX700
Paths: 4 FC paths, two processors.
3.) OS, boot and kernel configuration:
OS: CentOS 6.2
Kernel: 2.6.32-220.23.1.el6.x86_64 (unchanged)
InitramFS: initramfs-2.6.32-220.23.1.el6.x86_64_MULTIPATH.img (dracut-generated initramfs file which contains multipath. A verbose check during creation showed no errors. Whenever multipath.conf is changed, the initramfs file is recreated via dracut so that the same changes apply at startup.)
Grub kernel boot line: kernel /vmlinuz-2.6.32-220.23.1.el6.x86_64 ro root=/dev/mapper/vg_root-lv_root rd_NO_LUKS rd_LVM_LV=vg_root/lv_root rd_LVM_LV=vg_root/lv_home rd_LVM_LV=vg_root/lv_swap LANG=en_US.UTF-8 quiet SYSFONT=latarcyrheb-sun16 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us rdloaddriver=scsi_dh_emc
4.) Diagnostics command results:
##################
4.1 FILESYSTEM #
#################
-------------------------------------------------------------------------
result of df -kh (shows that it has successfully mounted all mpath groups)
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vg_root-lv_root  8.9G  3.1G  5.4G  37% /        (this would be on mpathcp2)
tmpfs                        3.9G     0  3.9G   0% /dev/shm
/dev/mapper/vg_root-lv_home  9.2G  3.8G  4.9G  44% /home    (this would be on mpathcp2)
/dev/mapper/mpathcp1         485M  113M  347M  25% /boot
/dev/mapper/mpathap1          30G   17G   11G  62% /mnt/one
/dev/mapper/mpathbp1          59G   26G   30G  47% /mnt/two
-------------------------------------------------------------------------
##############
4.2 DEVICES #
############
-------------------------------------------------------------------------
result of ls /dev/sd* (shows all devices from mpath[a-c] groups)
/dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl
result of ls /dev/mapper/* (shows LVM and mpath[a-c][1-2] devices)
/dev/mapper/control /dev/mapper/mpathap1 /dev/mapper/mpathbp1 /dev/mapper/mpathcp1 /dev/mapper/vg_root-lv_home /dev/mapper/vg_root-lv_swap
/dev/mapper/mpatha /dev/mapper/mpathb /dev/mapper/mpathc /dev/mapper/mpathcp2 /dev/mapper/vg_root-lv_root
-------------------------------------------------------------------------
##########
4.3 LVM #
########
-------------------------------------------------------------------------
result of vgscan
Reading all physical volumes. This may take a while...
Found volume group "vg_root" using metadata type lvm2
result of lvscan
ACTIVE '/dev/vg_root/lv_root' [9.00 GiB] inherit
ACTIVE '/dev/vg_root/lv_swap' [1.00 GiB] inherit
ACTIVE '/dev/vg_root/lv_home' [9.51 GiB] inherit
key config parts of lvm file:
dir = "/dev"
scan = [ "/dev/mapper","/dev" ]
preferred_names = [ "^/dev/mpath", "^/dev/mapper/mpath" ]
filter = ["a/mpath*/", "r/sd.*/", "r/disk.*/", "r/.*/" ]
-------------------------------------------------------------------------
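To make the effect of the lvm.conf filter above easier to reason about, here is a rough simulation of how LVM evaluates it: patterns are unanchored regular expressions, the first one that matches a device path decides, and a device that matches nothing is accepted. This is a sketch under assumptions: Python's re module stands in for LVM's own regex engine, only a single path name per device is checked (real LVM looks at all aliases of a device), and the by-id path is a made-up example.

```python
import re

# Filter copied from the lvm.conf excerpt above.
FILTER = ["a/mpath*/", "r/sd.*/", "r/disk.*/", "r/.*/"]


def lvm_filter_accepts(path: str, patterns=FILTER) -> bool:
    """Approximate LVM's rule: the first matching pattern decides."""
    for entry in patterns:
        action, delim = entry[0], entry[1]
        regex = entry[2:entry.rindex(delim)]  # text between the delimiters
        if re.search(regex, path):
            return action == "a"
    return True  # LVM accepts a device that no pattern matches


# "/dev/disk/by-id/scsi-x" is a hypothetical path used only for illustration.
for dev in ("/dev/mapper/mpathcp2", "/dev/sdf", "/dev/disk/by-id/scsi-x"):
    print(dev, lvm_filter_accepts(dev))
```

One detail this makes visible: in "a/mpath*/" the * applies to the letter h, so the pattern also matches the bare string "mpat". It still accepts the mpath devices here, but "a/mpath.*/" is probably what was intended.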
#################
4.4 MULTIPATH #
###############
-------------------------------------------------------------------------
result of multipath -ll command
mpathc (3600601607d3817002efecece5adbda11) dm-0 DGC,RAID 10
size=20G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| |- 1:0:0:0 sda 8:0   active ready running
| `- 2:0:0:0 sdb 8:16  active ready running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 1:0:1:0 sdf 8:80  active ready running
  `- 2:0:1:0 sdh 8:112 active ready running
mpathb (3600601603e74160068c3fe709e2bdc11) dm-1 DGC,RAID 5
size=60G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| |- 1:0:0:1 sdc 8:32  active ready running
| `- 2:0:0:1 sdd 8:48  active ready running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 1:0:1:1 sdi 8:128 active ready running
  `- 2:0:1:1 sdk 8:160 active ready running
mpatha (3600601607d381700f91e24a1d9e0da11) dm-2 DGC,RAID 10
size=30G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| |- 1:0:0:2 sde 8:64  active ready running
| `- 2:0:0:2 sdg 8:96  active ready running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 1:0:1:2 sdj 8:144 active ready running
  `- 2:0:1:2 sdl 8:176 active ready running
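For readers who want to cross-check the device named in the kernel errors against the path groups, the output above can be parsed mechanically. The sketch below is only an illustration: the regular expressions are written against the exact output format shown in this post and may not hold for other multipath-tools versions, and the sample is limited to the mpathc map.

```python
import re

# Excerpt of the `multipath -ll` output above (mpathc only).
SAMPLE = """\
mpathc (3600601607d3817002efecece5adbda11) dm-0 DGC,RAID 10
size=20G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| |- 1:0:0:0 sda 8:0 active ready running
| `- 2:0:0:0 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 1:0:1:0 sdf 8:80 active ready running
  `- 2:0:1:0 sdh 8:112 active ready running
"""

MAP_RE = re.compile(r"^(\S+) \([0-9a-f]+\) dm-\d+")
GROUP_RE = re.compile(r"prio=(\d+) status=(\w+)")
PATH_RE = re.compile(r"\d+:\d+:\d+:\d+ (sd[a-z]+) ")


def paths_by_group(text: str) -> dict:
    """Map (map name, prio, status) of each path group to its sd devices."""
    result = {}
    map_name, key = None, None
    for line in text.splitlines():
        m = MAP_RE.match(line)
        if m:
            map_name, key = m.group(1), None
            continue
        g = GROUP_RE.search(line)
        if g and map_name:
            key = (map_name, int(g.group(1)), g.group(2))
            result[key] = []
            continue
        p = PATH_RE.search(line)
        if p and key:
            result[key].append(p.group(1))
    return result


print(paths_by_group(SAMPLE))
```

For the sample, sdf lands in the prio=0 / status=enabled group of mpathc, which matches the device reported in the /var/log/messages excerpt from section 1.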
content of the multipath.conf file (loading it via multipath -v2 gives no errors. I've tried various iterations of this configuration, but no luck so far. Old CentOS and Red Hat 4.x systems could run a similar config with no sweat - but back then I could set the qla2xxx argument ql2failover=0, which seemed to solve a lot of my problems in those days. However, this argument has been deprecated since the 8.0.3.x version of the driver, if I recall correctly).
defaults {
    find_multipaths yes
    user_friendly_names yes
    path_checker emc-clariion
    path_selector "round-robin 0"
    prio emc
    path_grouping_policy group_by_prio
    features "1 queue_if_no_path"
    no_path_retry 300
    failback immediate
    rr_weight priorities
}
blacklist {
    devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
    devnode "^hd[a-z]"
    # devnode "^sd[a-z]" ### commented out
    devnode "^dcssblk[0-9]*"
    devnode "vg_*"
    wwid "*"
}
blacklist_exceptions {
    wwid "3600601607d381700f91e24a1d9e0da11"
    wwid "3600601603e74160068c3fe709e2bdc11"
    wwid "3600601607d3817002efecece5adbda11"
}
multipaths {
    multipath {
        uid 0
        gid 0
        wwid "3600601607d381700f91e24a1d9e0da11"
        mode 0600
        # alias mpath-web - this is mpatha
    }
    multipath {
        uid 0
        gid 0
        wwid "3600601603e74160068c3fe709e2bdc11"
        mode 0600
        # alias mpath-cache - this is mpathb
    }
    multipath {
        uid 0
        gid 0
        wwid "3600601607d3817002efecece5adbda11"
        mode 0600
        # alias mpath-system - this is mpathc
    }
}
}
# EOF
###############
4.5 MODULES #
#############
-------------------------------------------------------------------------
result of modinfo qla2xxx command (8.03.07.05.06.2-k version of qla2xxx - this is the out-of-the-box driver that comes with the kernel. I have experienced the same problem even with the newest driver version compiled from qla2xxx-src-v8.03.07.14.5.6-k)
filename: /lib/modules/2.6.32-220.23.1.el6.x86_64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
firmware: ql2500_fw.bin
firmware: ql2400_fw.bin
firmware: ql2322_fw.bin
firmware: ql2300_fw.bin
firmware: ql2200_fw.bin
firmware: ql2100_fw.bin
version: 8.03.07.05.06.2-k
license: GPL
description: QLogic Fibre Channel HBA Driver
author: QLogic Corporation
srcversion: 267851F905EAF9E3EEB2F2E
result of lsmod command:
Module Size Used by
autofs4 26888 3
nfs 398769 2
lockd 74270 1 nfs
fscache 46859 1 nfs
nfs_acl 2647 1 nfs
auth_rpcgss 44895 1 nfs
sunrpc 243822 16 nfs,lockd,nfs_acl,auth_rpcgss
bonding 126087 0
iptable_nat 6158 0
nf_nat 22726 1 iptable_nat
iptable_mangle 3349 0
ipt_REJECT 2351 2
nf_conntrack_ipv4 9506 14 iptable_nat,nf_nat
nf_defrag_ipv4 1483 1 nf_conntrack_ipv4
xt_multiport 2700 1
iptable_filter 2793 1
ip_tables 17831 3 iptable_nat,iptable_mangle,iptable_filter
ip6t_REJECT 4628 2
nf_conntrack_ipv6 8748 2
nf_defrag_ipv6 12182 1 nf_conntrack_ipv6
xt_state 1492 13
nf_conntrack 79453 5 iptable_nat,nf_nat,nf_conntrack_ipv4,nf_conntrack_ipv6,xt_state
ip6table_filter 2889 1
ip6_tables 19458 1 ip6table_filter
ipv6 322029 40 bonding,ip6t_REJECT,nf_conntrack_ipv6,nf_defrag_ipv6
sg 30124 0
tg3 140883 0
microcode 112594 0
serio_raw 4818 0
k8temp 3901 0
amd64_edac_mod 21461 0
edac_core 46773 4 amd64_edac_mod
edac_mce_amd 15488 1 amd64_edac_mod
i2c_amd756 8058 0
amd_rng 1781 0
shpchp 33482 0
ext4 364410 5
mbcache 8144 1 ext4
jbd2 88866 1 ext4
dm_round_robin 2717 6
sd_mod 39488 12
crc_t10dif 1541 1 sd_mod
qla2xxx 366555 24
scsi_transport_fc 52241 1 qla2xxx
scsi_tgt 12173 1 scsi_transport_fc
mptspi 17051 0
mptscsih 36732 1 mptspi
mptbase 93845 2 mptspi,mptscsih
scsi_transport_spi 26151 1 mptspi
radeon 1023359 1
ttm 70328 1 radeon
drm_kms_helper 33236 1 radeon
drm 230707 3 radeon,ttm,drm_kms_helper
i2c_algo_bit 5762 1 radeon
i2c_core 31276 5 i2c_amd756,radeon,drm_kms_helper,drm,i2c_algo_bit
dm_multipath 17649 4 dm_round_robin
dm_mirror 14101 0
dm_region_hash 12170 1 dm_mirror
dm_log 10122 2 dm_mirror,dm_region_hash
dm_mod 81692 31 dm_multipath,dm_mirror,dm_log
scsi_dh_rdac 8804 0
scsi_dh_emc 8157 12


Matt_J1
June 20th, 2012 08:00
Forgot to add multipathd -k - stats and status command results:
multipathd> show maps stats
multipathd> show maps status
iop2go
June 20th, 2012 12:00
Have you looked at EMC KB emc18763,
"Buffer I/O errors occurring on CLARiiON devices presented to Linux host using Linux native multipathing (DM-MPIO)"? It seems like your problem. In essence, what they are saying is: use PowerPath, or filter these messages out of syslog.
I can only tell you I'm not experiencing these on SLES 11 SP1.
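As a hedged illustration of what "filtering" could look like without hiding real failures: instead of dropping every I/O error, a filter could tag only the lines that name a device in a standby path group and leave errors on active paths untouched. This is not taken from the EMC KB article, whose contents are not reproduced here. The PASSIVE set below is copied from the prio=0 groups in the multipath -ll output in the first post and would need rebuilding whenever paths change; the sda line in the usage loop is a made-up example.

```python
import re

# Devices in the prio=0 (standby) path groups of the `multipath -ll`
# output earlier in the thread; rebuild this set if the paths change.
PASSIVE = {"sdf", "sdh", "sdi", "sdk", "sdj", "sdl"}

# Matches "[sdf]", "on device sdf" and "dev sdf" as seen in the kernel log.
DEV_RE = re.compile(r"\[(sd[a-z]+)\]|(?:on device|dev) (sd[a-z]+)")


def classify(line: str) -> str:
    """Label a kernel log line by the kind of path its device belongs to."""
    m = DEV_RE.search(line)
    if not m:
        return "other"
    dev = m.group(1) or m.group(2)
    return "passive-path" if dev in PASSIVE else "active-path"


for line in (
    "Jun 20 15:11:59 server kernel: Buffer I/O error on device sdf, logical block 0",
    "Jun 20 15:11:59 server kernel: end_request: I/O error, dev sdf, sector 8",
    "Jun 20 15:11:59 server kernel: end_request: I/O error, dev sda, sector 8",  # hypothetical
):
    print(classify(line), "|", line)
```

Only the "passive-path" lines would be candidates for suppression; anything tagged "active-path" should still reach whoever watches the logs.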
Matt_J1
June 20th, 2012 13:00
Yes, I've seen similar advice elsewhere - but, alas, filtering error messages at the syslog level when it comes to storage devices is somewhat of a gamble in the long run... so I'm going to try to find a different way around it if it's at all possible.
I installed the newest PowerPath (EMCPower.LINUX-5.6.0.00.00-143.RHEL6.x86_64.rpm) on the CentOS 6.2 system a couple of days ago - that is, before considering giving DM-MPIO a try instead.
The installation of PowerPath itself was pretty much straightforward, since I've had experience setting it up on several older RHEL 4 systems - minor differences in config files, otherwise everything ran as advertised. The only problem was a few I/O errors during the initial initramfs udevadm trigger phase, but those could be safely ignored.
Btw, the reason why I'd even consider using DM-MPIO is the ease of implementing kernel updates - which is somewhat of a hassle when it comes to PowerPath. I'd prefer a near-automated updating/upgrading process if possible, as having to reinstall PowerPath every time someone performs a kernel update is somewhat of an unnecessary inconvenience...
I'm still roaming the net in search of answers, but most solutions unfortunately refer to older RHEL 4/5 problems, and a good deal of them cite either the qla2xxx module option ql2failover=0 in modprobe.conf (which no longer exists in new versions of the qla2xxx driver - to my knowledge it's deprecated) or LVM and DM-MPIO/PowerPath device filtering/blacklisting/whitelisting as the solution to similar problems.
I can't be 100% sure my configs are 100% correct... so I'm still hoping that perhaps someone can point me to an error I've made and fix this problem.
Thank you for the reply - hopefully it will be useful to others who read this post.
Anonymous User
July 20th, 2012 01:00
According to the latest EMC Support Policy, particular CentOS versions are supported, including by particular PowerPath versions.
Matt_J1
July 20th, 2012 02:00
Thanks for the tip Baif - I'll check it out.
Btw, I've been doing some more research... and this problem might be caused by a rather old version of FLARE OS on the Clariion.
I'm currently writing down an upgrade plan to get the OS upgraded to a newer version... one that actually supports ALUA.
Yeh... it seems to be -that- old.
Anonymous User
July 20th, 2012 06:00
Yes, a bit old.
I believe EMC Primus emc18763 answers your concern about the buffer I/O errors, since the errors are not hitting random sectors.
Meanwhile, for your PowerPath or Linux native multipathing configuration, please refer to the latest EMC Host Connectivity Guide for Linux. It should answer all your questions, and I notice that there are a few new updates for RHEL/CentOS 6.2 kernels and native multipathing.