June 20th, 2012 08:00

Multipath (native) issues on CentOS 6.2 with CX700 (random I/O errors)

Hi there,

Well, I am quite aware that CentOS is *not* a supported OS, but perhaps there'll be someone out there that might at least throw a hint at the possible cause of the problem.

Having spent several days debugging this error, I'll try to be as systematic in describing it as possible, presenting the problem and the system diagnostic output that (I feel) is related to it in some way. If more information is needed, please let me know!

My sincere apologies for breaking netiquette with the length of the post - but I'm rather desperate at this point.

I did my best to at least keep it formatted in an easy-to-read way...

Thank you for taking a look at this.

Best regards,

Matt

TABLE OF CONTENTS:

----------------------------------------------------------

1.) Problem description

2.) Server basic info

3.) OS, boot and kernel configuration

4.) Diagnostic command results

    4.1) Filesystem

    4.2) Devices

    4.3) LVM

    4.4) Multipath

    4.5) Modules
----------------------------------------------------------

1.) Problem description:

After multipath loads at system startup and partitions get remounted to mpathX devs, I am getting random buffer and r/w I/O errors from what I assume to be attempts to access data via passive paths. This happens randomly for all devices from what I've seen.

The system boots entirely from the CX700 SAN storage and as such has no locally attached drives.

Error excerpt from /var/log/messages (this stuff really spams the log, so I'm pasting only a single occurrence):

----------------------------------------------------------------------------------------------------

Jun 20 15:11:59 server kernel: sd 1:0:1:0: [sdf] Device not ready

Jun 20 15:11:59 server kernel: sd 1:0:1:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE

Jun 20 15:11:59 server kernel: sd 1:0:1:0: [sdf] Sense Key : Not Ready [current]

Jun 20 15:11:59 server kernel: sd 1:0:1:0: [sdf]  80Add. Sense: Logical unit not ready, manual intervention required

Jun 20 15:11:59 server kernel: sd 1:0:1:0: [sdf] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00

Jun 20 15:11:59 server kernel: 00

Jun 20 15:11:59 server kernel: Buffer I/O error on device sdf, logical block 0                 

Jun 20 15:11:59 server kernel: 00 08 00

Jun 20 15:11:59 server kernel: Buffer I/O error on device sdf, logical block 0

Jun 20 15:11:59 server kernel:

Jun 20 15:11:59 server kernel: end_request: I/O error, dev sdf, sector 41942912

Jun 20 15:11:59 server kernel: end_request: I/O error, dev sdf, sector 8

Jun 20 15:11:59 server kernel: end_request: I/O error, dev sdf, sector 8

Jun 20 15:11:59 server kernel: end_request: I/O error, dev sdf, sector 41943024
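
A note on the excerpt above: the failing sectors are not scattered across the device. The following is a minimal sketch, assuming that sdf is a path to the 20G LUN (mpathc in the multipath -ll output), that 20G means exactly 20 GiB, and that sectors are 512 bytes. None of these is stated outright in the post. Under those assumptions, every failed read lands in the first or last 64 KiB of the device. That pattern would fit something probing the raw path device for partition tables or signatures rather than ordinary filesystem traffic, but this is an inference, not something the log proves.

```python
# Sketch: where do the failing sectors from the log excerpt sit on the device?
# Assumptions (not confirmed in the post): 512-byte sectors, LUN is exactly 20 GiB.
SECTOR_BYTES = 512
LUN_BYTES = 20 * 1024**3
TOTAL_SECTORS = LUN_BYTES // SECTOR_BYTES

# Sectors reported by "end_request: I/O error, dev sdf" in the excerpt.
failed_sectors = [41942912, 8, 8, 41943024]

# Treat the first and last 64 KiB as the "edges" of the device.
EDGE_SECTORS = (64 * 1024) // SECTOR_BYTES


def classify(sector):
    """Return 'start', 'end' or 'middle' for a sector number."""
    if sector < EDGE_SECTORS:
        return "start"
    if sector >= TOTAL_SECTORS - EDGE_SECTORS:
        return "end"
    return "middle"


for s in failed_sectors:
    print(s, classify(s), "sectors before end of device:", TOTAL_SECTORS - s)
```

If the real LUN size differs from 20 GiB, the "end" classification has to be rechecked against the size the kernel reports for the device.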

2.) Server and storage basic info:

Server:                     IBM eServer BladeCenter LS20,

BIOS                        [BKE121AUS-1.08]- 01/12/2006

Dedicated LUNs:     3 LUNs (default owner SP-B, trespassing is not enabled, prefetch set to variable / favor prefetch) - 1xRAID10 (20 GB boot LUN), 1xRAID10 (30 GB data LUN), 1xRAID5 (60 GB data LUN)

Storage:                  SAN CX700

Paths:                      4 FC paths, two processors.

3.) OS, boot and kernel configuration:

OS:                 CentOS 6.2

Kernel:          2.6.32-220.23.1.el6.x86_64 (unchanged)

InitramFS:     initramfs-2.6.32-220.23.1.el6.x86_64_MULTIPATH.img (dracut-generated initramfs file that contains multipath. A verbose check during creation showed no errors. Whenever multipath.conf is changed, the initramfs file is recreated via dracut so that the same changes apply at startup.)

Grub kernel boot line: kernel /vmlinuz-2.6.32-220.23.1.el6.x86_64 ro root=/dev/mapper/vg_root-lv_root rd_NO_LUKS rd_LVM_LV=vg_root/lv_root rd_LVM_LV=vg_root/lv_home rd_LVM_LV=vg_root/lv_swap LANG=en_US.UTF-8 quiet SYSFONT=latarcyrheb-sun16 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us rdloaddriver=scsi_dh_emc

4.) Diagnostics command results:

##################

4.1 FILESYSTEM   #

#################

-------------------------------------------------------------------------

result of df -kh    (shows that it has successfully mounted all mpath groups)

Filesystem                         Size  Used Avail Use% Mounted on

/dev/mapper/vg_root-lv_root        8.9G  3.1G  5.4G  37% /             (this would be on mpathcp2)

tmpfs                              3.9G     0  3.9G   0% /dev/shm

/dev/mapper/vg_root-lv_home        9.2G  3.8G  4.9G  44% /home         (this would be on mpathcp2)

/dev/mapper/mpathcp1               485M  113M  347M  25% /boot

/dev/mapper/mpathap1                30G   17G   11G  62% /mnt/one

/dev/mapper/mpathbp1                59G   26G   30G  47% /mnt/two

-------------------------------------------------------------------------

##############

4.2 DEVICES  #

############

-------------------------------------------------------------------------

result of ls /dev/sd* (shows all devices from mpath[a-c] groups)

/dev/sda  /dev/sdb  /dev/sdc  /dev/sdd  /dev/sde  /dev/sdf  /dev/sdg  /dev/sdh  /dev/sdi  /dev/sdj  /dev/sdk  /dev/sdl

result of ls /dev/mapper/*   (shows LVM and mpath[a-c][1-2] devices)

/dev/mapper/control  /dev/mapper/mpathap1  /dev/mapper/mpathbp1  /dev/mapper/mpathcp1  /dev/mapper/vg_root-lv_home  /dev/mapper/vg_root-lv_swap

/dev/mapper/mpatha   /dev/mapper/mpathb    /dev/mapper/mpathc    /dev/mapper/mpathcp2  /dev/mapper/vg_root-lv_root

-------------------------------------------------------------------------

##########

4.3 LVM  #

########

-------------------------------------------------------------------------

result of vgscan

  Reading all physical volumes.  This may take a while...

  Found volume group "vg_root" using metadata type lvm2

result of lvscan

  ACTIVE            '/dev/vg_root/lv_root' [9.00 GiB] inherit

  ACTIVE            '/dev/vg_root/lv_swap' [1.00 GiB] inherit

  ACTIVE            '/dev/vg_root/lv_home' [9.51 GiB] inherit

key config parts of lvm file:

   dir = "/dev"

   scan = [ "/dev/mapper","/dev" ]

   preferred_names = [ "^/dev/mpath", "^/dev/mapper/mpath" ]

   filter = ["a/mpath*/", "r/sd.*/", "r/disk.*/", "r/.*/" ]                         
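
To check what the filter above actually lets through, here is a rough sketch of the matching rules. It assumes LVM's documented behaviour (each entry is an unanchored regular expression, the first entry that matches decides, and a device matching nothing is accepted) and uses Python's re module as a stand-in for LVM's own regex engine, which is an approximation. The function name lvm_accepts and the test device names are made up for illustration.

```python
import re

# Filter copied from the lvm.conf excerpt above.
LVM_FILTER = ["a/mpath*/", "r/sd.*/", "r/disk.*/", "r/.*/"]


def lvm_accepts(device, rules=LVM_FILTER):
    """Approximate LVM's device filter: the first matching rule decides."""
    for rule in rules:
        action, delim = rule[0], rule[1]
        # The pattern sits between the first and the last delimiter.
        pattern = rule[2:rule.rindex(delim)]
        if re.search(pattern, device):
            return action == "a"
    return True  # a device that matches no rule is accepted


for dev in ["/dev/mapper/mpathcp2", "/dev/mapper/mpatha", "/dev/sdf",
            "/dev/disk/by-id/example", "/dev/mapper/vg_root-lv_root"]:
    print(dev, "->", "accept" if lvm_accepts(dev) else "reject")
```

Two things fall out of this. First, "mpath*" is a regular expression, not a glob: it means "mpat" followed by zero or more "h", so it happens to match the mpath devices here, although "mpath.*" is probably what was intended. Second, under these rules the sdX path devices are rejected, so LVM scanning is unlikely to be what touches the passive paths.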

-------------------------------------------------------------------------

#################

4.4 MULTIPATH   #

###############

-------------------------------------------------------------------------

result of multipath -ll command

mpathc (3600601607d3817002efecece5adbda11) dm-0 DGC,RAID 10

size=20G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw

|-+- policy='round-robin 0' prio=1 status=active

| |- 1:0:0:0 sda 8:0   active ready running

| `- 2:0:0:0 sdb 8:16  active ready running

`-+- policy='round-robin 0' prio=0 status=enabled

  |- 1:0:1:0 sdf 8:80  active ready running

  `- 2:0:1:0 sdh 8:112 active ready running

mpathb (3600601603e74160068c3fe709e2bdc11) dm-1 DGC,RAID 5

size=60G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw

|-+- policy='round-robin 0' prio=1 status=active

| |- 1:0:0:1 sdc 8:32  active ready running

| `- 2:0:0:1 sdd 8:48  active ready running

`-+- policy='round-robin 0' prio=0 status=enabled

  |- 1:0:1:1 sdi 8:128 active ready running

  `- 2:0:1:1 sdk 8:160 active ready running

mpatha (3600601607d381700f91e24a1d9e0da11) dm-2 DGC,RAID 10

size=30G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw

|-+- policy='round-robin 0' prio=1 status=active

| |- 1:0:0:2 sde 8:64  active ready running

| `- 2:0:0:2 sdg 8:96  active ready running

`-+- policy='round-robin 0' prio=0 status=enabled

  |- 1:0:1:2 sdj 8:144 active ready running

  `- 2:0:1:2 sdl 8:176 active ready running
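
The suspicion that the errors come from passive paths can be cross-checked against this output. The sketch below is an illustration only: it parses the text layout shown above (which may differ between multipath-tools versions) and reports which path group each sdX device belongs to. The function name paths_by_group_status is hypothetical, and only the mpathc stanza is included as sample input.

```python
import re

# Sample copied from the "multipath -ll" output above (mpathc only).
SAMPLE = """\
mpathc (3600601607d3817002efecece5adbda11) dm-0 DGC,RAID 10
size=20G features='1 queue_if_no_path' hwhandler='1 emc' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| |- 1:0:0:0 sda 8:0   active ready running
| `- 2:0:0:0 sdb 8:16  active ready running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 1:0:1:0 sdf 8:80  active ready running
  `- 2:0:1:0 sdh 8:112 active ready running
"""

MAP_RE = re.compile(r"^(\S+) \(([0-9a-f]+)\) dm-\d+")
GROUP_RE = re.compile(r"policy='[^']*' prio=(-?\d+) status=(\w+)")
PATH_RE = re.compile(r"(\d+:\d+:\d+:\d+) (sd[a-z]+) ")


def paths_by_group_status(text):
    """Map each sd device to (map name, status of its path group)."""
    result = {}
    current_map = None
    current_status = None
    for line in text.splitlines():
        m = MAP_RE.match(line)
        if m:
            current_map = m.group(1)
            continue
        g = GROUP_RE.search(line)
        if g:
            current_status = g.group(2)
            continue
        p = PATH_RE.search(line)
        if p and current_map:
            result[p.group(2)] = (current_map, current_status)
    return result


groups = paths_by_group_status(SAMPLE)
for dev in sorted(groups):
    print(dev, groups[dev])
```

For the sample, sdf (the device named in the kernel errors) ends up in the prio=0 "enabled" group of mpathc, i.e. not in the group currently carrying I/O, which is consistent with the passive-path assumption.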

content of the multipath.conf file   (loading it via multipath -v2 gives no errors. I've tried various iterations of this configuration, but no luck so far. Old CentOS and Red Hat 4.x systems could run a similar config with no sweat - but back then I could set the qla2xxx argument ql2failover=0, which seemed to solve a lot of my problems at the time. However, this argument has been deprecated since the 8.0.3.x version of the driver, if I recall correctly).

defaults {

        find_multipaths yes

        user_friendly_names yes

        path_checker emc-clariion

        path_selector "round-robin 0"

        prio emc

        path_grouping_policy group_by_prio

        features "1 queue_if_no_path"

        no_path_retry 300

        failback immediate

        rr_weight priorities

}

blacklist {

        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"

        devnode "^hd[a-z]"

#       devnode "^sd[a-z]"   ### commented out

        devnode "^dcssblk[0-9]*"

        devnode "vg_*"

        wwid "*"

}

blacklist_exceptions {

        wwid "3600601607d381700f91e24a1d9e0da11"

        wwid "3600601603e74160068c3fe709e2bdc11"

        wwid "3600601607d3817002efecece5adbda11"

}

multipaths {

        multipath {

                uid 0

                gid 0

                wwid "3600601607d381700f91e24a1d9e0da11"

                mode 0600

#                alias mpath-web    - this is mpatha

        }

        multipath {

                uid 0

                gid 0

                wwid "3600601603e74160068c3fe709e2bdc11"

                mode 0600

#                alias mpath-cache   -this is mpathb

        }

        multipath {

                uid 0

                gid 0

                wwid "3600601607d3817002efecece5adbda11"

                mode 0600

#                alias mpath-system -this is mpathc

        }

}

}

# EOF

###############

4.5 MODULES  #

#############

-------------------------------------------------------------------------

result of modinfo qla2xxx command  (8.03.07.05.06.2-k version of qla2xxx - this is the out-of-the-box driver that comes with the kernel. I have experienced the same problem even with the newest driver version, compiled from qla2xxx-src-v8.03.07.14.5.6-k)

filename:       /lib/modules/2.6.32-220.23.1.el6.x86_64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko

firmware:       ql2500_fw.bin

firmware:       ql2400_fw.bin

firmware:       ql2322_fw.bin

firmware:       ql2300_fw.bin

firmware:       ql2200_fw.bin

firmware:       ql2100_fw.bin

version:        8.03.07.05.06.2-k

license:        GPL

description:    QLogic Fibre Channel HBA Driver

author:         QLogic Corporation

srcversion:     267851F905EAF9E3EEB2F2E

result of lsmod command:

Module                  Size  Used by

autofs4                26888  3

nfs                   398769  2

lockd                  74270  1 nfs

fscache                46859  1 nfs

nfs_acl                 2647  1 nfs

auth_rpcgss            44895  1 nfs

sunrpc                243822  16 nfs,lockd,nfs_acl,auth_rpcgss

bonding               126087  0

iptable_nat             6158  0

nf_nat                 22726  1 iptable_nat

iptable_mangle          3349  0

ipt_REJECT              2351  2

nf_conntrack_ipv4       9506  14 iptable_nat,nf_nat

nf_defrag_ipv4          1483  1 nf_conntrack_ipv4

xt_multiport            2700  1

iptable_filter          2793  1

ip_tables              17831  3 iptable_nat,iptable_mangle,iptable_filter

ip6t_REJECT             4628  2

nf_conntrack_ipv6       8748  2

nf_defrag_ipv6         12182  1 nf_conntrack_ipv6

xt_state                1492  13

nf_conntrack           79453  5 iptable_nat,nf_nat,nf_conntrack_ipv4,nf_conntrack_ipv6,xt_state

ip6table_filter         2889  1

ip6_tables             19458  1 ip6table_filter

ipv6                  322029  40 bonding,ip6t_REJECT,nf_conntrack_ipv6,nf_defrag_ipv6

sg                     30124  0

tg3                   140883  0

microcode             112594  0

serio_raw               4818  0

k8temp                  3901  0

amd64_edac_mod         21461  0

edac_core              46773  4 amd64_edac_mod

edac_mce_amd           15488  1 amd64_edac_mod

i2c_amd756              8058  0

amd_rng                 1781  0

shpchp                 33482  0

ext4                  364410  5

mbcache                 8144  1 ext4

jbd2                   88866  1 ext4

dm_round_robin          2717  6

sd_mod                 39488  12

crc_t10dif              1541  1 sd_mod

qla2xxx               366555  24

scsi_transport_fc      52241  1 qla2xxx

scsi_tgt               12173  1 scsi_transport_fc

mptspi                 17051  0

mptscsih               36732  1 mptspi

mptbase                93845  2 mptspi,mptscsih

scsi_transport_spi     26151  1 mptspi

radeon               1023359  1

ttm                    70328  1 radeon

drm_kms_helper         33236  1 radeon

drm                   230707  3 radeon,ttm,drm_kms_helper

i2c_algo_bit            5762  1 radeon

i2c_core               31276  5 i2c_amd756,radeon,drm_kms_helper,drm,i2c_algo_bit

dm_multipath           17649  4 dm_round_robin

dm_mirror              14101  0

dm_region_hash         12170  1 dm_mirror

dm_log                 10122  2 dm_mirror,dm_region_hash

dm_mod                 81692  31 dm_multipath,dm_mirror,dm_log

scsi_dh_rdac            8804  0

scsi_dh_emc             8157  12

June 20th, 2012 08:00

Forgot to add the multipathd -k stats and status command results:

multipathd> show maps stats

name   path_faults switch_grp map_loads total_q_time q_timeouts

mpatha 0           0          1         0            0        

mpathb 0           0          1         0            0        

mpathc 0           0          1         0            0   

multipathd> show maps status

name   failback  queueing paths dm-st  write_prot

mpatha immediate 60 chk   4     active rw       

mpathb immediate 60 chk   4     active rw       

mpathc immediate 60 chk   4     active rw       

June 20th, 2012 12:00

Have you looked at EMC KB emc18763, "Buffer I/O errors occurring on CLARiiON devices presented to Linux host using Linux native multipathing (DM-MPIO)"? It seems to match your problem. In essence, they say to either use PowerPath or filter these messages out of syslog.

I can only tell you I'm not experiencing these on SLES 11 SP1.

Root Cause: Errors occur when a command is run which results in attempted access to the CLARiiON native devices. The access can be a read or write attempt, and may occur with commands such as "fdisk -l". When using PowerPath, these errors are suppressed.  However, in the case where Linux native multipathing is used, there is no automatic provision for filtering these messages.
Fix: This is normal behavior for Linux native multipath, and the errors do not indicate an array issue. The errors can safely be filtered through the OS logging configuration or the user can avoid access to native devices (as opposed to using /dev/mapper devices). Alternatively, a qualified version of PowerPath may be installed, which will automatically filter these errors.
Notes: On systems with a large number of devices, repeated use of the commands that generate these errors can cause intermittent slow access. Through the generation of large messages files, it can also make other issues more difficult to diagnose.
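
If the messages are filtered as the KB suggests, the filter can be kept narrow so that errors on the active paths still get through. The sketch below is not from the KB; it is one possible approach, written as a standalone line classifier rather than as syslog configuration. The set of passive devices is taken by hand from the prio=0 groups in the multipath -ll output earlier in the thread, and the names PASSIVE_DEVICES and is_suppressible are made up for illustration.

```python
import re

# Assumption: devices in the prio=0 ("enabled") path groups, collected by
# hand from the multipath -ll output earlier in the thread.
PASSIVE_DEVICES = {"sdf", "sdh", "sdi", "sdk", "sdj", "sdl"}

# Message shapes seen in the /var/log/messages excerpt.
DEVICE_PATTERNS = [
    re.compile(r"\[(sd[a-z]+)\]"),             # sd 1:0:1:0: [sdf] Device not ready
    re.compile(r"on device (sd[a-z]+)"),       # Buffer I/O error on device sdf
    re.compile(r"I/O error, dev (sd[a-z]+)"),  # end_request: I/O error, dev sdf
]


def is_suppressible(line, passive=PASSIVE_DEVICES):
    """True only if the line names a device and that device is passive."""
    for pattern in DEVICE_PATTERNS:
        m = pattern.search(line)
        if m:
            return m.group(1) in passive
    return False  # lines that name no device are kept


lines = [
    "Jun 20 15:11:59 server kernel: end_request: I/O error, dev sdf, sector 8",
    "Jun 20 15:11:59 server kernel: end_request: I/O error, dev sda, sector 8",
    "Jun 20 15:11:59 server kernel: 00 08 00",
]
for line in lines:
    print("drop" if is_suppressible(line) else "keep", "|", line)
```

The obvious weakness is that a static device list goes stale: after a trespass or a rescan, the passive paths may be different sdX devices, so the list would have to be regenerated from the current multipath state.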

June 20th, 2012 13:00

Yes, I've seen similar advice elsewhere - but, alas, filtering error messages at the syslog level when it comes to storage devices is somewhat of a gamble in the long run... so I'm going to try to find a different way around it, if that's at all possible.

I installed the newest PowerPath, EMCPower.LINUX-5.6.0.00.00-143.RHEL6.x86_64.rpm, on the CentOS 6.2 system a couple of days ago - that is, before deciding to give DM-MPIO a try instead.

The installation of PowerPath itself was pretty much straightforward, since I've had experience setting it up on several older RHEL4 systems - minor differences in config files, otherwise everything ran as advertised. The only problem there was a few I/O errors during the initial initramfs udevadm triggering phase, but those could be safely ignored.

Btw, the reason why I'd even consider using DM-MPIO is the ease of implementing kernel updates - which is somewhat of a hassle when it comes to PowerPath. I'd prefer a near-automated updating/upgrading process if possible, as having to reinstall PowerPath every time someone performs a kernel update is somewhat of an unnecessary inconvenience...

I'm still roaming the net in search of answers, but most solutions unfortunately refer to older RHEL4/5 problems, and a good deal of them cite either the qla2xxx module option ql2failover=0 in modprobe.conf (which no longer exists in new versions of the qla2xxx driver - to my knowledge it's deprecated) or LVM and DM-MPIO/PowerPath device filtering/blacklisting/whitelisting as the solution to similar problems.

I can't be 100% sure my configs are 100% correct... so I'm still hoping that perhaps someone can point me to an error I've made and fix this problem.

Thank you for the reply - hopefully it will be useful to others who read this post.

July 20th, 2012 01:00

According to the latest EMC Support Policy, particular CentOS versions are supported, including by particular PowerPath versions.

July 20th, 2012 02:00

Thanks for the tip Baif - I'll check it out.

Btw, I've been doing some more research... and this problem might be caused by a rather old version of FLARE OS on the CLARiiON.

I'm currently writing up an upgrade plan to get the OS upgraded to a newer version... one that actually supports ALUA.

Yeah... it seems to be -that- old.

July 20th, 2012 06:00

Yes, a bit old.

I believe EMC Primus emc18763 answers your concern about the buffer I/O errors, since the failing sectors are not random.

Meanwhile, for your PowerPath or Linux native multipathing configuration, please refer to the latest EMC Host Connectivity Guide for Linux. It should answer all your questions, and I notice that there are a few new updates for RHEL/CentOS 6.2 kernels and native multipathing.
