PowerFlex SDCs Have IO Errors After Adding New SDS To Cluster
Summary: After adding a new SDS to the PowerFlex cluster and rebalancing data to it, some SDCs show I/O errors
Symptoms
- SDS was added to the cluster (from MDM events)
2022-03-07 18:28:07.080 MDM_CLI_CONF_COMMAND_RECEIVED INFO Command add_sds received, User: 'admin'. [4576114] protection_domain with ID a876543f00000000, fault_set N/A, New SDS name: sds12, Hostnames: 172.1.0.12, Port: 7072
2022-03-07 18:28:07.324 CLI_COMMAND_SUCCEEDED INFO Command add_sds succeeded. [4576114] ID: 7d120650000002f
- SDC logs show kernel stack traces while trying to connect to the new SDS. These can appear multiple times as the SDC retries the connection:
2022-03-07T18:29:02.880040-05:00 sdc32 kernel: net_sched: page allocation failure: order:4, mode:0x104020
2022-03-07T18:29:02.880170-05:00 sdc32 kernel: CPU: 16 PID: 5662 Comm: net_sched Kdump: loaded Tainted: P OE ------------ 3.10.0-1062.el7.x86_64 #1
2022-03-07T18:29:02.880242-05:00 sdc32 kernel: Hardware name: Dell Inc. PowerEdge R740xd/5XXXXX, BIOS 2.4.8 11/26/2019
2022-03-07T18:29:02.880355-05:00 sdc32 kernel: Call Trace:
2022-03-07T18:29:02.880453-05:00 sdc32 kernel: [<ffffffffbb179262>] dump_stack+0x19/0x1b
2022-03-07T18:29:02.880553-05:00 sdc32 kernel: [<ffffffffbabc23d0>] warn_alloc_failed+0x110/0x180
2022-03-07T18:29:02.880623-05:00 sdc32 kernel: [<ffffffffbb1747fc>] __alloc_pages_slowpath+0x6b6/0x724
2022-03-07T18:29:02.880709-05:00 sdc32 kernel: [<ffffffffc09c021e>] ? netAddress_ToStrCopy+0x2de/0x520 [scini]
2022-03-07T18:29:02.880787-05:00 sdc32 kernel: [<ffffffffbabc6b84>] __alloc_pages_nodemask+0x404/0x420
2022-03-07T18:29:02.880851-05:00 sdc32 kernel: [<ffffffffbac14c68>] alloc_pages_current+0x98/0x110
2022-03-07T18:29:02.880922-05:00 sdc32 kernel: [<ffffffffbabc117e>] __get_free_pages+0xe/0x40
2022-03-07T18:29:02.880995-05:00 sdc32 kernel: [<ffffffffbac2052e>] kmalloc_order_trace+0x2e/0xa0
2022-03-07T18:29:02.881053-05:00 sdc32 kernel: [<ffffffffbac24571>] __kmalloc+0x211/0x230
2022-03-07T18:29:02.881110-05:00 sdc32 kernel: [<ffffffffc0988274>] mapClass_AllocAndInitObj+0x44/0x140 [scini]
2022-03-07T18:29:02.881167-05:00 sdc32 kernel: [<ffffffffc098959e>] mapClass_UpdateAll+0x40e/0x9b0 [scini]
2022-03-07T18:29:02.881229-05:00 sdc32 kernel: [<ffffffffc09c9994>] ? mosMitSchedThrd_CurThrdOurs+0x64/0x90 [scini]
2022-03-07T18:29:02.881298-05:00 sdc32 kernel: [<ffffffffc099aa07>] mapMdm_HandleObjUpdate_CK+0x327/0x730 [scini]
2022-03-07T18:29:02.881360-05:00 sdc32 kernel: [<ffffffffc099c070>] ? mapMdm_SendUpdateReq_CK+0xf0/0xed0 [scini]
2022-03-07T18:29:02.881419-05:00 sdc32 kernel: [<ffffffffc099c4b0>] mapMdm_SendUpdateReq_CK+0x530/0xed0 [scini]
2022-03-07T18:29:02.881480-05:00 sdc32 kernel: [<ffffffffc09a704f>] ? netSock_DoIO+0xef/0x7d0 [scini]
2022-03-07T18:29:02.881543-05:00 sdc32 kernel: [<ffffffffc09aaf47>] ? netChanThrottler_TryToWakeupWaiterOfCond+0x27/0x100 [scini]
2022-03-07T18:29:02.881606-05:00 sdc32 kernel: [<ffffffffc09ab118>] ? netChanThrottler_TryToWakeupWaiter+0x58/0x80 [scini]
2022-03-07T18:29:02.881660-05:00 sdc32 kernel: [<ffffffffc095cc80>] ? netChan_SendReq_CK+0x90/0xa70 [scini]
2022-03-07T18:29:02.881716-05:00 sdc32 kernel: [<ffffffffc095cdd0>] netChan_SendReq_CK+0x1e0/0xa70 [scini]
2022-03-07T18:29:02.881775-05:00 sdc32 kernel: [<ffffffffc09699c5>] netCon_SendReq_CK+0x175/0x590 [scini]
2022-03-07T18:29:02.881835-05:00 sdc32 kernel: [<ffffffffc0963cf7>] ? netRPC_SendDone_CK+0x47/0xab0 [scini]
2022-03-07T18:29:02.881890-05:00 sdc32 kernel: [<ffffffffc0963dd3>] netRPC_SendDone_CK+0x123/0xab0 [scini]
2022-03-07T18:29:02.881959-05:00 sdc32 kernel: [<ffffffffc09c9c24>] mosMit_RunWithTLS+0x54/0x60 [scini]
2022-03-07T18:29:02.882017-05:00 sdc32 kernel: [<ffffffffc09cbb92>] mosMitSchedThrd_ThrdEntry+0x1a2/0x500 [scini]
2022-03-07T18:29:02.882071-05:00 sdc32 kernel: [<ffffffffc09c82f0>] ? mosTicks_DestroyEnvSpecific+0x10/0x10 [scini]
2022-03-07T18:29:02.882126-05:00 sdc32 kernel: [<ffffffffc09c8310>] mosOsThrd_Entry+0x20/0x50 [scini]
2022-03-07T18:29:02.882185-05:00 sdc32 kernel: [<ffffffffbaac50d1>] kthread+0xd1/0xe0
2022-03-07T18:29:02.882240-05:00 sdc32 kernel: [<ffffffffbaac5000>] ? insert_kthread_work+0x40/0x40
2022-03-07T18:29:02.882294-05:00 sdc32 kernel: [<ffffffffbb18bd37>] ret_from_fork_nospec_begin+0x21/0x21
2022-03-07T18:29:02.882351-05:00 sdc32 kernel: [<ffffffffbaac5000>] ? insert_kthread_work+0x40/0x40
2022-03-07T18:29:02.891431-05:00 sdc32 kernel: ScaleIO mapClass_AllocAndInitObj:1301 :Error: Failed to allocate memory 38480.Cannot process MDM response
- SDC logs then show I/O errors:
2022-03-07T18:49:03.652009-05:00 sdc32 kernel: ScaleIO mapMultiHead_UpdateInPlace:561 :Warning: Invalid primaryTgtIdx 47 at 2. Multi-head ID dd990042.
2022-03-07T18:49:03.652130-05:00 sdc32 kernel: ScaleIO mapClass_UpdateAll:614 :Error: Object ffff948c7f350000 (class multi_head) failed to update in place.status NO_RESOURCES (67)
2022-03-07T18:49:04.662144-05:00 sdc32 kernel: ScaleIO mapVolIO_ReportIOErrorIfNeeded:491 :[67641433832] IO-ERROR Type READ. comb: 6ecc80210154. offsetInComb 14536800. SizeInLB 256. SDS_ID 7dd0d24700000033. Comb Gen 21c1. Head Gen 222c. StartLB 5fb1ac860.
2022-03-07T18:49:04.662268-05:00 sdc32 kernel: blk_update_request: I/O error, dev scinic, sector 25687672672
Impact
I/O errors on some SDCs
Cause
OS memory fragmentation on the SDC host. When the new SDS is added, the MDM sends every SDC an updated data map, and the SDC kernel module (scini) must make a large contiguous memory allocation (order:4, that is 64 KB, in this example) to process it. On a host with fragmented memory this allocation fails ("Failed to allocate memory ... Cannot process MDM response"), leaving the SDC with a stale map that does not include the new SDS, so I/O directed at data rebalanced to that SDS fails.
Resolution
The I/O errors can be stopped immediately by placing the new SDS into Instant Maintenance Mode; the SDCs can then get their data from the remaining SDSs. Take the SDS out of maintenance mode once the SDC memory issue is resolved.
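As a sketch, maintenance mode can be entered and exited with scli from the primary MDM. The SDS name "sds12" is taken from the MDM events above; the exact flags for Instant Maintenance Mode can vary by PowerFlex version, so confirm against "scli --help" for your release:

```shell
# Put the newly added SDS into maintenance mode so SDCs read from the
# remaining SDSs (flag names may vary by PowerFlex version):
scli --enter_maintenance_mode --sds_name sds12

# After the SDC memory fragmentation is resolved, return the SDS to service:
scli --exit_maintenance_mode --sds_name sds12
```

Run these from a session already logged in to the MDM (scli --login) with a user that has administrative rights.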
There are a few things to do here; all should be considered for the action plan:
- If a guaranteed fix is needed, rebooting the SDC host resolves the issue: it reinitializes the OS, clearing the memory fragmentation that was present before
- If a reboot is not possible, use the following command. It tries to reclaim and compact fragmented memory as much as possible. This is not guaranteed to work and can be disruptive to running applications for the few seconds it takes to reclaim and compact memory:
sync; echo 1 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory
Check memory fragmentation with "cat /proc/buddyinfo" before and after running the above command to see whether it helped. Each column in the output counts free contiguous blocks of increasing size (order 0 = 4 KB, order 1 = 8 KB, and so on), so the failing order:4 (64 KB) allocation needs a non-zero count in the fifth column of the "Node 0, zone Normal" row. In this instance, before running the reclaim/compact command, everything from the fourth column (32 KB) onward is zero, so no 64 KB blocks are available:
sdc32:~ # cat /proc/buddyinfo
Node 0, zone      DMA      1       0       0      1      2  1  1  0  1  1    2
Node 0, zone    DMA32      9       5       7      5      9  6  8  6  7  5  295
Node 0, zone   Normal 909921 1263646  202478      0 <===  0  0  0  0  0  0    0
Node 1, zone   Normal  39085       0       0      0      0  0  0  0  0  0    0
- Check the memory overcommit settings on the kernel using "sysctl -a". If the kernel is using the defaults, as follows:
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
It would be advantageous to discuss changing these parameters with the OS vendor as follows, to give the OS more memory to work with after a reboot and avoid this issue in the future:
vm.overcommit_memory = 2
vm.overcommit_ratio = 98
- The Linux kernel may need to be upgraded if OS memory fragmentation becomes a recurring pattern
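To make the buddyinfo check above easier to repeat before and after the reclaim/compact command, the order-4 (64 KB) count can be pulled out per zone with a small helper. This is a sketch: "buddyinfo_order4" is a hypothetical name, and the field positions assume the standard /proc/buddyinfo layout:

```shell
# Hypothetical helper: print the free order-4 (64 KB) block count per zone.
# Each /proc/buddyinfo row looks like "Node 0, zone Normal <order0> <order1> ...",
# so the free-block counts start at field 5 and the order-4 count is field 9.
buddyinfo_order4() {
  awk '{ printf "%s %s zone %s: order4_free=%s\n", $1, $2, $4, $9 }'
}

# Usage: buddyinfo_order4 < /proc/buddyinfo
# A zero for "zone Normal" means no 64 KB contiguous blocks are available.
```

Run it before and after "sync; echo 1 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory" and compare the "zone Normal" counts.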
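If the OS vendor agrees with the overcommit change above, it can be applied at runtime and persisted across reboots roughly as follows. The file name "99-powerflex-sdc.conf" is an arbitrary example, and the commands require root:

```shell
# Apply the suggested overcommit settings immediately (requires root):
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=98

# Persist them across reboots; the file name here is an arbitrary example:
cat > /etc/sysctl.d/99-powerflex-sdc.conf <<'EOF'
vm.overcommit_memory = 2
vm.overcommit_ratio = 98
EOF

# Reload all sysctl configuration files to confirm the values take effect:
sysctl --system
```

Verify afterward with "sysctl vm.overcommit_memory vm.overcommit_ratio".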
Impacted Versions
All PowerFlex SDCs can be impacted by OS memory fragmentation issues.