PowerFlex SDSs Begin To Crash After Enabling FG Metadata Cache

Summary: PowerFlex SDSs begin to crash after enabling the Fine Granularity (FG) metadata cache.

This article is not tied to any specific product. Not all product versions are identified in this article.

Symptoms

Scenario

After enabling FG metadata cache on certain Protection Domains, some of the SDSs begin to crash and restart. 

Symptoms

 - FG metadata cache was enabled on the Protection Domain (from MDM events):

 2023-09-27 02:19:51.115000:4614824:CLI_COMMAND_SUCCEEDED            INFO     Command set_default_fgl_metadata_cache_size succeeded
2023-09-27 02:20:27.996000:4614851:MDM_CLI_CONF_COMMAND_RECEIVED    INFO     Command enable_fgl_metadata_cache received, User: 'admin'. Protection Domain: pd1

 - From the messages file, we can see that the kernel oom-killer (Out of Memory killer) is killing the SDS process, which causes the SDS service to restart (a way to search for these events is shown after the log extract below):

 Sep 27 02:20:28 sds60 kernel: sds-3.6.700.103 invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Sep 27 02:20:28 sds60 kernel: sds-3.6.700.103 cpuset=/ mems_allowed=0-1
Sep 27 02:20:28 sds60 kernel: CPU: 1 PID: 9615 Comm: sds-3.6.700.103 Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.80.1.el7.x86_64 #1
Sep 27 02:20:28 sds60 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Sep 27 02:20:28 sds60 kernel: Call Trace:
Sep 27 02:20:28 sds60 kernel: [] dump_stack+0x19/0x1f
Sep 27 02:20:28 sds60 kernel: [] dump_header+0x90/0x22d
Sep 27 02:20:28 sds60 kernel: [] ? ktime_get_ts64+0x52/0xf0
Sep 27 02:20:28 sds60 kernel: [] ? delayacct_end+0x8f/0xc0
Sep 27 02:20:28 sds60 kernel: [] oom_kill_process+0x2d5/0x4a0
Sep 27 02:20:28 sds60 kernel: [] ? oom_unkillable_task+0x93/0x120
Sep 27 02:20:28 sds60 kernel: [] out_of_memory+0x31a/0x500
Sep 27 02:20:28 sds60 kernel: [] __alloc_pages_nodemask+0xae4/0xbf0
Sep 27 02:20:28 sds60 kernel: [] alloc_pages_current+0x98/0x110
Sep 27 02:20:28 sds60 kernel: [] __page_cache_alloc+0x97/0xb0
Sep 27 02:20:28 sds60 kernel: [] filemap_fault+0x270/0x420
Sep 27 02:20:28 sds60 kernel: [] __xfs_filemap_fault+0x7e/0x1d0 [xfs]
Sep 27 02:20:28 sds60 kernel: [] xfs_filemap_fault+0x2c/0x40 [xfs]
Sep 27 02:20:28 sds60 kernel: [] __do_fault.isra.61+0x8a/0x100
Sep 27 02:20:28 sds60 kernel: [] do_read_fault.isra.63+0x4c/0x1b0
Sep 27 02:20:28 sds60 kernel: [] handle_mm_fault+0xa20/0xfb0
Sep 27 02:20:28 sds60 kernel: [] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
Sep 27 02:20:28 sds60 kernel: [] __do_page_fault+0x213/0x510
Sep 27 02:20:28 sds60 kernel: [] do_page_fault+0x35/0x90
Sep 27 02:20:28 sds60 kernel: [] page_fault+0x28/0x30
 Sep 27 02:20:28 sds60 kernel: Out of memory: Kill process 1262 (sds-3.6.700.103) score 240 or sacrifice child
Sep 27 02:20:28 sds60 kernel: Killed process 1262 (sds-3.6.700.103), UID 0, total-vm:75663912kB, anon-rss:9100672kB, file-rss:3944kB, shmem-rss:10796856kB
Sep 27 02:20:29 sds60 systemd: sds.service: main process exited, code=killed, status=9/KILL
Sep 27 02:20:29 sds60 systemd: Unit sds.service entered failed state.
Sep 27 02:20:29 sds60 systemd: sds.service failed.
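
To confirm that it is the SDS process being killed, the messages file can be searched for the OOM events and the resulting service restarts. This is standard grep usage against the same log file shown in the extract above:

 grep -E "oom-killer|Out of memory" /var/log/messages    # kernel OOM events
 grep "sds.service" /var/log/messages                    # systemd reporting the SDS service failing and restarting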

 - The SDSs that are crashing have two NUMA (Non-Uniform Memory Access) nodes. This can be seen with "numactl --hardware" (a quick way to run this check on each node is shown after the outputs below):

 available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6
node 0 size: 20923 MB
node 0 free: 1014 MB
node 1 cpus: 7 8 9 10 11 12 13
node 1 size: 21166 MB
node 1 free: 19481 MB

 - Other SDSs in the same storage Protection Domain are not crashing and only have one NUMA node:

 available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
node 0 size: 42089 MB
node 0 free: 19881 MB
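
To identify which SDS nodes span more than one NUMA node, the node count can be checked on each SDS with standard Linux commands (output format may vary slightly by distribution):

 numactl --hardware | grep available    # "available: 2 nodes (0-1)" indicates two NUMA nodes
 lscpu | grep -i "NUMA node(s)"         # NUMA node count as reported by lscpu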

 

Impact

This causes data to become degraded, triggering rebuilds, and if enough SDSs crash at the same time, it can lead to data unavailability.

Cause

The FG metadata cache allows FG-based volumes to cache reads, which increases read performance. Caching these reads requires additional memory from the OS, so that memory must be available for the SDS to claim. If it is not available, the OS invokes the oom-killer (Out of Memory killer) to terminate the largest consumer of memory, which is likely to be the SDS service.

In this case, both the crashing and non-crashing SDSs were SVMs. All of them had 40+ GB of RAM and 14 vCPUs allocated from their respective ESXi hosts. By default, the SDS can only pull memory from NUMA node 0. If there are multiple NUMA nodes, the SDS cannot pull from NUMA node 1 to satisfy the additional memory requirement when the FG metadata cache is enabled.

The divergence in SVM configurations with either one or two NUMA nodes can be attributed to the underlying hardware. In this case, the SVMs with one NUMA node have 18 cores per socket, so all 14 vCPUs fit onto one CPU socket. The SVMs with two NUMA nodes only have 12 cores per socket, so the 14 vCPUs could not all fit on one socket and had to be spread across both, resulting in two NUMA nodes.
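
The vCPU topology presented to the SVM can be checked from inside the guest with standard Linux tools (the physical cores per socket of the underlying ESXi host should be confirmed in vSphere); for example:

 lscpu | grep -E "Socket|Core|NUMA"    # sockets, cores per socket, and NUMA node count presented to the guest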

Resolution

Workaround

Verify that the SVM or OS has sufficient free memory to satisfy the additional memory requirement of the FG metadata cache.
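
A quick way to check how much memory is actually available, both overall and per NUMA node, is shown below (standard Linux commands; the amount required depends on the configured FG metadata cache size):

 free -h                           # overall used, free, and available memory
 numactl --hardware | grep free    # free memory per NUMA node (node 0 is what the SDS uses by default)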

There are two possible workarounds. Only one needs to be used:

1. Add the below line to the SDS conf.txt on the nodes that have two NUMA nodes and restart the SDS process (follow all regular maintenance procedures):

 numa_memory_affinity=0

 This line in the conf.txt allows the SDS to use memory from both NUMA nodes; a sketch of applying it is shown after this list. Additional documentation can be found here on why this setting is needed.

2. Reduce the number of vCPUs to match the underlying hardware. In this case, if the SVM had 12 vCPUs, then all vCPUs would have fit on one CPU socket and all memory would have been available in a single NUMA node.
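
A minimal sketch of applying the first workaround on an affected SDS node is shown below. The conf.txt path is the typical default location and is an assumption; verify it on your system, and follow the normal maintenance procedures before restarting the service:

 # Assumed default location of the SDS configuration file; confirm the path on your installation.
 echo "numa_memory_affinity=0" >> /opt/emc/scaleio/sds/cfg/conf.txt
 # Restart the SDS service (service name as seen in the systemd messages above).
 systemctl restart sds

After the restart, "numactl --hardware" can be used to observe free memory on both NUMA nodes.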

Impacted Versions

PowerFlex 3.5.x
PowerFlex 3.6.x
PowerFlex 4.x

Fixed In Version

This is working as designed.

Affected Products

PowerFlex rack, ScaleIO
Article Properties
Article Number: 000223657
Article Type: Solution
Last Modified: 03 Feb 2025
Version:  2