March 14th, 2017 09:00

ScaleIO host with IO Error

ScaleIO is deployed in a VMware environment. On one host, datastore operations (migrating a VM, creating a VM, and so on) take a long time.

All physical SSD devices are working fine. The ScaleIO system reports no errors.

Statistics from the virtual device:

eui.7b850322485d5c3051e837c300000000

   Device: eui.7b850322485d5c3051e837c300000000

   Successful Commands: 82256182

   Blocks Read: 1484018027

   Blocks Written: 4402672521

   Read Operations: 12992279

   Write Operations: 68603358

   Reserve Operations: 0

   Reservation Conflicts: 0

   Failed Commands: 1902

   Failed Blocks Read: 0

   Failed Blocks Written: 790144

   Failed Read Operations: 0

   Failed Write Operations: 1376

   Failed Reserve Operations: 0
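To put the counters above in proportion, here is a minimal sketch that parses "Key: value" statistics text and computes the share of failed writes. The function names are made up for illustration, and it assumes "Write Operations" counts all attempted writes, which the post does not confirm.

```python
# Counters copied from the post; only a subset is needed for the ratio.
SAMPLE = """\
   Device: eui.7b850322485d5c3051e837c300000000
   Successful Commands: 82256182
   Read Operations: 12992279
   Write Operations: 68603358
   Failed Commands: 1902
   Failed Read Operations: 0
   Failed Write Operations: 1376
"""


def parse_device_stats(text):
    """Return a dict of the integer counters found in 'Key: value' lines."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Skip lines without a colon and non-numeric values (e.g. the device name).
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def ratio(stats, failed_key, total_key):
    """Failed / total, assuming total_key counts every attempted operation."""
    total = stats[total_key]
    return stats[failed_key] / total if total else 0.0


stats = parse_device_stats(SAMPLE)
print("failed write share: {:.6%}".format(
    ratio(stats, "Failed Write Operations", "Write Operations")))
```

All of the failures here are writes (no failed reads), which matches the WRITE-type opcodes in the log that follows.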

VMkernel.log:

2017-03-14T13:59:42.752Z cpu2:858185)J3: 3302: Aborting txn (0x43070883c2b0) callerID: 0xc1d00006 due to failure pre-committing: I/O error

2017-03-14T14:00:02.871Z cpu25:33353)WARNING: [972927942] IO-ERROR comb: 2da880000381. offsetInComb 2324480. SizeInLB 2048. SDS_ID 9a2ad19900000002. Comb Gen 28. Head Gen 5374.

2017-03-14T14:00:02.871Z cpu25:33353)WARNING: Vol ID 0x51e837c300000000. Last fault Status IO_HARD_ERROR(20).Last error Status NOT_CONN(4) Reason (ERROR) Retry count (20) chan (1)

2017-03-14T14:00:02.871Z cpu25:33353)scini: blkScsi_PrintIOInfo:3274: ScaleIO R2_01:hCmd 0x439d98caaa40, OpCode 0x93, rc 1 scsiStat 2, senseCode 4, asc 0, ascq 0

2017-03-14T14:00:02.871Z cpu7:33507)ScsiDeviceIO: 2651: Cmd(0x439d98caaa40) 0x93, CmdSN 0x7afd7d from world 858816 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0.

2017-03-14T14:00:04.935Z cpu1:33406)NMP: nmp_ResetDeviceLogThrottling:3349: last error status from device eui.7b850322485d5c3051e837c300000000 repeated 2 times

2017-03-14T14:00:23.064Z cpu16:33353)WARNING: [972948135] IO-ERROR comb: 2da880000381. offsetInComb 2324480. SizeInLB 2048. SDS_ID 9a2ad19900000002. Comb Gen 28. Head Gen 5374.

2017-03-14T14:00:23.064Z cpu16:33353)WARNING: Vol ID 0x51e837c300000000. Last fault Status IO_HARD_ERROR(20).Last error Status NOT_CONN(4) Reason (ERROR) Retry count (20) chan (1)

2017-03-14T14:00:23.064Z cpu16:33353)scini: blkScsi_PrintIOInfo:3274: ScaleIO R2_01:hCmd 0x439d98caaa40, OpCode 0x2a, rc 1 scsiStat 2, senseCode 4, asc 0, ascq 0

2017-03-14T14:00:23.064Z cpu3:33507)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x2a (0x439d98caaa40, 858816) to dev "eui.7b850322485d5c3051e837c300000000" on path "vmhba64:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0. Act:NONE

2017-03-14T14:00:23.064Z cpu3:33507)ScsiDeviceIO: 2651: Cmd(0x439d98caaa40) 0x2a, CmdSN 0x7afd9b from world 858816 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0.

2017-03-14T14:00:23.064Z cpu8:858816)FS3DM: 2767: status I/O error zeroing 1 extents (1048576 each)
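The status fields in the ScsiDeviceIO lines can be decoded mechanically. Below is a minimal sketch, assuming the commonly documented ESXi host-status numbering and the standard SCSI device status, sense key, and opcode values. The lookup tables cover only codes that appear in this thread, and the function name is hypothetical.

```python
import re

# Partial lookup tables: only the codes seen in this thread.
HOST_STATUS = {0x0: "OK", 0x5: "ABORT", 0x8: "RESET"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEY = {0x0: "NO SENSE", 0x4: "HARDWARE ERROR"}
OPCODE = {
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
    0x89: "COMPARE AND WRITE",
    0x93: "WRITE SAME(16)",
}

PATTERN = re.compile(
    r"Cmd\(0x[0-9a-f]+\) (0x[0-9a-f]+),.*?"
    r"failed H:(0x[0-9a-f]+) D:(0x[0-9a-f]+) P:(0x[0-9a-f]+)"
    r"(?: (?:Valid|Possible) sense data: (0x[0-9a-f]+) 0x[0-9a-f]+ 0x[0-9a-f]+)?"
)


def decode_scsi_failure(line):
    """Decode one 'ScsiDeviceIO ... failed H:.. D:.. P:..' line, or return None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    opcode, host, device = (int(m.group(i), 16) for i in (1, 2, 3))
    sense = int(m.group(5), 16) if m.group(5) else None
    return {
        "opcode": OPCODE.get(opcode, hex(opcode)),
        "host": HOST_STATUS.get(host, hex(host)),
        "device": DEVICE_STATUS.get(device, hex(device)),
        # sense is None when the line carries no sense data at all.
        "sense": None if sense is None else SENSE_KEY.get(sense, hex(sense)),
    }


line = ('2017-03-14T14:00:02.871Z cpu7:33507)ScsiDeviceIO: 2651: '
        'Cmd(0x439d98caaa40) 0x93, CmdSN 0x7afd7d from world 858816 to dev '
        '"eui.7b850322485d5c3051e837c300000000" failed H:0x0 D:0x2 P:0x0 '
        'Valid sense data: 0x4 0x0 0x0.')
print(decode_scsi_failure(line))
```

Read this way, the lines above are writes that the device itself rejected (CHECK CONDITION with sense key HARDWARE ERROR), in line with the IO_HARD_ERROR that the scini driver logs just before them.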

How can I fix this problem, or at least localize it?

Full log: http://pastebin.com/raw/tGqVghZq

I see this error after starting a backup process with Veeam Backup & Replication. The problem appears on only one node; the rest are all working well.

March 15th, 2017 04:00

Is this problem with only one ScaleIO volume or all the ScaleIO volumes on which you have created datastores?

16 Posts

March 15th, 2017 05:00

SanjeevMalhotra wrote:

Is this problem with only one ScaleIO volume or all the ScaleIO volumes on which you have created datastores?

There is one volume in ScaleIO and one ScaleIO datastore for ESXi, shared by all nodes (5 nodes).

The I/O error occurs on only one of the nodes (the other 4 nodes work fine).

306 Posts

March 16th, 2017 07:00

Hi Amama,

Check your SDC->SDS connectivity with 'esxcli ip neighbor list', and ping every SDS IP in your environment from the ESXi console. In most cases one or more SDS IP addresses are not reachable from the SDC, so it tries to connect to an SDS to retrieve a particular chunk and fails. Since all the other SDCs work fine, I would presume there are no errors on the SDS side, which points to a communication problem between this particular SDC and the SDSs.
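The ping step can be scripted. This is a minimal sketch, assuming a Python 3 interpreter and a `ping` binary that accepts `-c 1` are available where you run it; the function names are made up for illustration, and the addresses in the usage comment are placeholders, not values from this thread.

```python
import subprocess


def ping_once(ip, timeout_s=3):
    """Return True if a single ICMP echo to ip succeeds within timeout_s."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the ping binary could not be started.
        return False
    return result.returncode == 0


def unreachable_sds(ips, ping=ping_once):
    """Return the subset of ips that did not answer, in input order."""
    return [ip for ip in ips if not ping(ip)]


# Usage (placeholder addresses; substitute the real SDS data IPs):
#   print(unreachable_sds(["192.0.2.11", "192.0.2.12", "192.0.2.13"]))
```

A single successful ping only shows basic reachability at that moment, so for an intermittent fault it may help to repeat the check while the errors are being logged.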

Cheers,

Pawel

16 Posts

April 2nd, 2017 01:00

All connections are working.

The ARP table is up to date.

Log:

2017-04-02T06:29:32.948Z cpu16:32831)ScsiDeviceIO: 2595: Cmd(0x439d959c81c0) 0x2a, CmdSN 0x80000008 from world 1065468 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x8 D:0x0 P:0x0

....

2017-04-02T06:29:32.948Z cpu20:32835)ScsiDeviceIO: 2595: Cmd(0x439d8ff28f40) 0x2a, CmdSN 0x80000025 from world 37109 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x8 D:0x0 P:0x0

2017-04-02T06:29:32.948Z cpu9:32824)ScsiDeviceIO: 2651: Cmd(0x43a593ad2d40) 0x89, CmdSN 0x889a75 from world 32806 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2017-04-02T06:29:32.948Z cpu20:32835)ScsiDeviceIO: 2595: Cmd(0x439d80011e40) 0x2a, CmdSN 0x889a72 from world 32806 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x8 D:0x0 P:0x0

2017-04-02T06:29:32.948Z cpu20:32835)ScsiDeviceIO: 2595: Cmd(0x439d9a78dd40) 0x2a, CmdSN 0x889a73 from world 32806 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x8 D:0x0 P:0x0

2017-04-02T06:29:32.948Z cpu20:32835)ScsiDeviceIO: 2595: Cmd(0x439d8096aa40) 0x2a, CmdSN 0x889a74 from world 32806 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x8 D:0x0 P:0x0

2017-04-02T06:29:32.948Z cpu20:32835)ScsiDeviceIO: 2595: Cmd(0x439d8ff64240) 0x2a, CmdSN 0x800e000c from world 512010 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x8 D:0x0 P:0x0

2017-04-02T06:29:32.949Z cpu20:32835)ScsiDeviceIO: 2595: Cmd(0x43a5808f39c0) 0x28, CmdSN 0x889a76 from world 34151 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x8 D:0x0 P:0x0

2017-04-02T06:29:33.888Z cpu15:88644)HBX: 2802: 'scaleio01': HB at offset 3313664 - Waiting for timed out HB:

2017-04-02T06:29:33.888Z cpu15:88644)  [HB state abcdef02 offset 3313664 gen 39 stampUS 1541732280530 uuid 58c913fc-7eaac31d-56ea-002590f469b6 jrnl drv 14.61 lockImpl 4]

2017-04-02T06:29:35.492Z cpu16:34151)HBX: 2802: 'scaleio01': HB at offset 3313664 - Waiting for timed out HB:

2017-04-02T06:29:35.492Z cpu16:34151)  [HB state abcdef02 offset 3313664 gen 39 stampUS 1541732280530 uuid 58c913fc-7eaac31d-56ea-002590f469b6 jrnl drv 14.61 lockImpl 4]

2017-04-02T06:29:41.689Z cpu29:33353)WARNING: [1541753022] IO-ERROR comb: 2da880000321. offsetInComb 2690048. SizeInLB 128. SDS_ID 9a2ad19900000002. Comb Gen 25. Head Gen 5565.

2017-04-02T06:29:41.689Z cpu29:33353)WARNING: Vol ID 0x51e837c300000000. Last fault Status IO_HARD_ERROR(20).Last error Status SUCCESS(65) Reason (ERROR) Retry count (20) chan (1)

2017-04-02T06:29:41.689Z cpu29:33353)scini: blkScsi_PrintIOInfo:3274: ScaleIO R2_01:hCmd 0x439d939658c0, OpCode 0x2a, rc 1 scsiStat 2, senseCode 4, asc 0, ascq 0

2017-04-02T06:29:41.689Z cpu1:33507)ScsiDeviceIO: 2651: Cmd(0x439d939658c0) 0x2a, CmdSN 0x889934 from world 36004 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0.

.....

2017-04-02T06:29:41.691Z cpu1:33507)ScsiDeviceIO: 2651: Cmd(0x439d96158ac0) 0x2a, CmdSN 0x889938 from world 36004 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0.

2017-04-02T06:29:41.691Z cpu1:33507)NMP: nmp_ThrottleLogForDevice:3231: last error status from device eui.7b850322485d5c3051e837c300000000 repeated 40 times

2017-04-02T06:29:41.691Z cpu1:33507)ScsiDeviceIO: 2651: Cmd(0x439d93534b80) 0x2a, CmdSN 0x889927 from world 36004 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0.

.....

2017-04-02T06:29:41.730Z cpu1:33507)ScsiDeviceIO: 2651: Cmd(0x439d80972bc0) 0x2a, CmdSN 0x889a6d from world 36004 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0.

2017-04-02T06:29:41.730Z cpu0:36004)HBX: 2802: 'scaleio01': HB at offset 3313664 - Waiting for timed out HB:

2017-04-02T06:29:41.730Z cpu0:36004)  [HB state abcdef02 offset 3313664 gen 39 stampUS 1541732280530 uuid 58c913fc-7eaac31d-56ea-002590f469b6 jrnl drv 14.61 lockImpl 4]

2017-04-02T06:29:41.757Z cpu20:32857)HBX: 276: 'scaleio01': HB at offset 3313664 - Reclaimed heartbeat [Timeout]:

2017-04-02T06:29:41.757Z cpu20:32857)  [HB state abcdef02 offset 3313664 gen 39 stampUS 1541753089373 uuid 58c913fc-7eaac31d-56ea-002590f469b6 jrnl drv 14.61 lockImpl 4]

2017-04-02T06:29:49.948Z cpu9:32824)ScsiDeviceIO: 2651: Cmd(0x43a594d8d800) 0x89, CmdSN 0x889bf9 from world 32806 to dev "eui.7b850322485d5c3051e837c300000000" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

How can I localize and fix the problem?

16 Posts

April 3rd, 2017 03:00

After rebooting the SVM, everything is working fine.

How can I localize and fix the problem?

306 Posts

April 5th, 2017 01:00

Check whether there is anything in the ESXi logs and/or the SVM messages. Could there have been a disk problem on that SVM?
