PowerFlex SDCs Have IO Errors After Adding New SDS To Cluster

Summary: After a new SDS is added to the PowerFlex cluster and data is rebalanced to it, some SDCs report I/O errors.


Symptoms

- An SDS was added to the cluster (from the MDM events):

2022-03-07 18:28:07.080 MDM_CLI_CONF_COMMAND_RECEIVED INFO     	 Command add_sds received, User: 'admin'. [4576114] protection_domain with ID a876543f00000000, fault_set N/A, 

New SDS name: sds12, Hostnames: 172.1.0.12, Port: 7072
2022-03-07 18:28:07.324 CLI_COMMAND_SUCCEEDED         INFO     	 Command add_sds succeeded. [4576114] ID: 7d120650000002f

 - SDC logs show kernel stack traces while trying to connect to the new SDS. These can appear multiple times as the SDC retries the connection:

 
2022-03-07T18:29:02.880040-05:00 sdc32 kernel: net_sched: page allocation failure: order:4, mode:0x104020
2022-03-07T18:29:02.880170-05:00 sdc32 kernel: CPU: 16 PID: 5662 Comm: net_sched Kdump: loaded Tainted: P           OE  ------------   3.10.0-1062.el7.x86_64 #1
2022-03-07T18:29:02.880242-05:00 sdc32 kernel: Hardware name: Dell Inc. PowerEdge R740xd/5XXXXX, BIOS 2.4.8 11/26/2019
2022-03-07T18:29:02.880355-05:00 sdc32 kernel: Call Trace:
2022-03-07T18:29:02.880453-05:00 sdc32 kernel: [<ffffffffbb179262>] dump_stack+0x19/0x1b
2022-03-07T18:29:02.880553-05:00 sdc32 kernel: [<ffffffffbabc23d0>] warn_alloc_failed+0x110/0x180
2022-03-07T18:29:02.880623-05:00 sdc32 kernel: [<ffffffffbb1747fc>] __alloc_pages_slowpath+0x6b6/0x724
2022-03-07T18:29:02.880709-05:00 sdc32 kernel: [<ffffffffc09c021e>] ? netAddress_ToStrCopy+0x2de/0x520 [scini]
2022-03-07T18:29:02.880787-05:00 sdc32 kernel: [<ffffffffbabc6b84>] __alloc_pages_nodemask+0x404/0x420
2022-03-07T18:29:02.880851-05:00 sdc32 kernel: [<ffffffffbac14c68>] alloc_pages_current+0x98/0x110
2022-03-07T18:29:02.880922-05:00 sdc32 kernel: [<ffffffffbabc117e>] __get_free_pages+0xe/0x40
2022-03-07T18:29:02.880995-05:00 sdc32 kernel: [<ffffffffbac2052e>] kmalloc_order_trace+0x2e/0xa0
2022-03-07T18:29:02.881053-05:00 sdc32 kernel: [<ffffffffbac24571>] __kmalloc+0x211/0x230
2022-03-07T18:29:02.881110-05:00 sdc32 kernel: [<ffffffffc0988274>] mapClass_AllocAndInitObj+0x44/0x140 [scini]
2022-03-07T18:29:02.881167-05:00 sdc32 kernel: [<ffffffffc098959e>] mapClass_UpdateAll+0x40e/0x9b0 [scini]
2022-03-07T18:29:02.881229-05:00 sdc32 kernel: [<ffffffffc09c9994>] ? mosMitSchedThrd_CurThrdOurs+0x64/0x90 [scini]
2022-03-07T18:29:02.881298-05:00 sdc32 kernel: [<ffffffffc099aa07>] mapMdm_HandleObjUpdate_CK+0x327/0x730 [scini]
2022-03-07T18:29:02.881360-05:00 sdc32 kernel: [<ffffffffc099c070>] ? mapMdm_SendUpdateReq_CK+0xf0/0xed0 [scini]
2022-03-07T18:29:02.881419-05:00 sdc32 kernel: [<ffffffffc099c4b0>] mapMdm_SendUpdateReq_CK+0x530/0xed0 [scini]
2022-03-07T18:29:02.881480-05:00 sdc32 kernel: [<ffffffffc09a704f>] ? netSock_DoIO+0xef/0x7d0 [scini]
2022-03-07T18:29:02.881543-05:00 sdc32 kernel: [<ffffffffc09aaf47>] ? netChanThrottler_TryToWakeupWaiterOfCond+0x27/0x100 [scini]
2022-03-07T18:29:02.881606-05:00 sdc32 kernel: [<ffffffffc09ab118>] ? netChanThrottler_TryToWakeupWaiter+0x58/0x80 [scini]
2022-03-07T18:29:02.881660-05:00 sdc32 kernel: [<ffffffffc095cc80>] ? netChan_SendReq_CK+0x90/0xa70 [scini]
2022-03-07T18:29:02.881716-05:00 sdc32 kernel: [<ffffffffc095cdd0>] netChan_SendReq_CK+0x1e0/0xa70 [scini]
2022-03-07T18:29:02.881775-05:00 sdc32 kernel: [<ffffffffc09699c5>] netCon_SendReq_CK+0x175/0x590 [scini]
2022-03-07T18:29:02.881835-05:00 sdc32 kernel: [<ffffffffc0963cf7>] ? netRPC_SendDone_CK+0x47/0xab0 [scini]
2022-03-07T18:29:02.881890-05:00 sdc32 kernel: [<ffffffffc0963dd3>] netRPC_SendDone_CK+0x123/0xab0 [scini]
2022-03-07T18:29:02.881959-05:00 sdc32 kernel: [<ffffffffc09c9c24>] mosMit_RunWithTLS+0x54/0x60 [scini]
2022-03-07T18:29:02.882017-05:00 sdc32 kernel: [<ffffffffc09cbb92>] mosMitSchedThrd_ThrdEntry+0x1a2/0x500 [scini]
2022-03-07T18:29:02.882071-05:00 sdc32 kernel: [<ffffffffc09c82f0>] ? mosTicks_DestroyEnvSpecific+0x10/0x10 [scini]
2022-03-07T18:29:02.882126-05:00 sdc32 kernel: [<ffffffffc09c8310>] mosOsThrd_Entry+0x20/0x50 [scini]
2022-03-07T18:29:02.882185-05:00 sdc32 kernel: [<ffffffffbaac50d1>] kthread+0xd1/0xe0
2022-03-07T18:29:02.882240-05:00 sdc32 kernel: [<ffffffffbaac5000>] ? insert_kthread_work+0x40/0x40
2022-03-07T18:29:02.882294-05:00 sdc32 kernel: [<ffffffffbb18bd37>] ret_from_fork_nospec_begin+0x21/0x21
2022-03-07T18:29:02.882351-05:00 sdc32 kernel: [<ffffffffbaac5000>] ? insert_kthread_work+0x40/0x40
2022-03-07T18:29:02.891431-05:00 sdc32 kernel: ScaleIO mapClass_AllocAndInitObj:1301 :Error: Failed to allocate memory 38480.Cannot process MDM response



 - SDC logs then show I/O errors:
 

2022-03-07T18:49:03.652009-05:00 sdc32 kernel: ScaleIO mapMultiHead_UpdateInPlace:561 :Warning: Invalid primaryTgtIdx 47 at 2. Multi-head ID dd990042.
2022-03-07T18:49:03.652130-05:00 sdc32 kernel: ScaleIO mapClass_UpdateAll:614 :Error: Object ffff948c7f350000 (class multi_head) failed to update in place.status NO_RESOURCES (67)
2022-03-07T18:49:04.662144-05:00 sdc32 kernel: ScaleIO mapVolIO_ReportIOErrorIfNeeded:491 :[67641433832] IO-ERROR Type READ. comb: 6ecc80210154. offsetInComb 14536800. SizeInLB 256. SDS_ID 7dd0d24700000033. Comb Gen 21c1. Head Gen 222c. StartLB 5fb1ac860.
2022-03-07T18:49:04.662268-05:00 sdc32 kernel: blk_update_request: I/O error, dev scinic, sector 25687672672


 

Impact

I/O errors on some SDCs



 

Cause

The host OS where the SDC runs has run out of contiguous 64 KB memory chunks. The "page allocation failure: order:4" message indicates that an order-4 allocation (2^4 = 16 contiguous 4 KB pages, or 64 KB) could not be satisfied. The SDC requests this memory from the OS to create a new socket and connect to the newly added SDS, and that allocation is what fails. If the SDC cannot connect to the new SDS while data is being rebalanced onto it, the result is I/O errors on the SDC host.
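
To confirm this failure pattern, search the kernel log for order-4 allocation failures. The commands below are a minimal sketch using standard Linux tools (not PowerFlex-specific); the size math assumes the default 4 KB page size:

    # Search the kernel ring buffer and syslog for page allocation failures
    dmesg -T | grep -i "page allocation failure"
    grep -i "page allocation failure" /var/log/messages

    # order:4 means 2^4 = 16 contiguous pages; with 4 KB pages, that is a 64 KB chunk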

Resolution

The I/O errors can be stopped immediately by placing the new SDS into Instant Maintenance Mode, so that the SDCs get their data from the remaining SDSs. Take the SDS out of Maintenance Mode once the SDC memory issue is resolved.
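
For reference, Maintenance Mode can typically be entered and exited with scli, as in the sketch below. The SDS name comes from the example above, and the exact option names can vary by PowerFlex version, so confirm the syntax against the CLI reference for your release:

    # Place the new SDS (sds12 in this example) into Maintenance Mode
    scli --enter_maintenance_mode --sds_name sds12

    # Return it to service once the SDC memory issue is resolved
    scli --exit_maintenance_mode --sds_name sds12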

There are a few things to do here; all should be considered for the action plan:

  1. If a sure method is needed, a reboot of the SDC host fixes the issue. It reinitializes the OS memory, so the fragmentation that existed before the reboot is gone.
     
  2. If a reboot is not possible, use the following command. It tries to reclaim and compact the fragmented memory as much as possible. This is not guaranteed to work and can be disruptive to running applications for the few seconds it takes to reclaim and compact memory fragments: 

    sync; echo 1 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory



    Check "cat /proc/buddyinfo" before and after running the above command to see whether it helped (a small parsing sketch follows this list). The numeric columns in each row count free contiguous chunks of order 0 through 10, so the fifth column of the "Node 0, zone Normal" row is the count of free 64 KB (order-4) chunks; the goal is for that count to rise above zero. In this instance, before running the reclaim/compact command, it shows:

    sdc32:~ # cat /proc/buddyinfo
    Node 0, zone      DMA        1        0       0      1      2      1      1      0      1      1      2
    Node 0, zone    DMA32        9        5       7      5      9      6      8      6      7      5    295
    Node 0, zone   Normal   909921  1263646  202478      0      0<===  0      0      0      0      0      0
    Node 1, zone   Normal    39085        0       0      0      0      0      0      0      0      0      0

     

  3. Check the kernel memory overcommit settings using "sysctl -a". If the kernel is using the defaults:
    vm.overcommit_memory = 0
    vm.overcommit_ratio = 50

    It would be advantageous to talk to the OS vendor about changing these parameters as follows, to give the OS more memory to work with after a reboot and avoid this issue in the future (see the sketch after this list):
    vm.overcommit_memory = 2
    vm.overcommit_ratio = 98
     
  4. If OS memory fragmentation becomes a recurring pattern, the Linux kernel may need to be upgraded.
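
As referenced in step 2, the lines below are a minimal sketch for checking the free 64 KB (order-4) chunk count before and after reclaiming/compacting memory. The awk field number assumes the standard /proc/buddyinfo layout, where fields 1-4 are "Node N, zone NAME" and the order-0 count starts at field 5:

    # Print the free order-4 (64 KB) chunk count for each zone; field 9 is the order-4 column
    awk '{print $1, $2, $4, "free 64 KB chunks:", $9}' /proc/buddyinfo

    # Run it before and after the reclaim/compact command from step 2
    sync; echo 1 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory
    awk '{print $1, $2, $4, "free 64 KB chunks:", $9}' /proc/buddyinfo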

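Also for step 3, if the OS vendor agrees with the change, one common way to apply and persist the overcommit settings is sketched below; the /etc/sysctl.d file name is illustrative, so adjust it to your distribution's conventions:

    # Apply immediately
    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=98

    # Persist across reboots (the file name is an example), then reload
    echo "vm.overcommit_memory = 2" > /etc/sysctl.d/99-vm-overcommit.conf
    echo "vm.overcommit_ratio = 98" >> /etc/sysctl.d/99-vm-overcommit.conf
    sysctl --system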
 

Impacted Versions

All PowerFlex SDCs can be impacted by OS memory fragmentation issues.
 

Affected Products

PowerFlex Software
Article Properties
Article Number: 000197199
Article Type: Solution
Last Modified: 09 Jun 2025
Version:  2