PowerFlex SDSs Begin To Crash After Enabling FG Metadata Cache
Summary: PowerFlex SDSs begin to crash after enabling the Fine Granularity (FG) metadata cache.
Symptoms
Scenario
After enabling FG metadata cache on certain Protection Domains, some of the SDSs begin to crash and restart.
Symptoms
- FG metadata cache was enabled on the Protection Domain (from MDM events):
2023-09-27 02:19:51.115000:4614824:CLI_COMMAND_SUCCEEDED INFO Command set_default_fgl_metadata_cache_size succeeded
2023-09-27 02:20:27.996000:4614851:MDM_CLI_CONF_COMMAND_RECEIVED INFO Command enable_fgl_metadata_cache received, User: 'admin'. Protection Domain: pd1
- From the messages file, we can see that the SDS service is being restarted because the kernel oom-killer is killing the SDS process:
Sep 27 02:20:28 sds60 kernel: sds-3.6.700.103 invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Sep 27 02:20:28 sds60 kernel: sds-3.6.700.103 cpuset=/ mems_allowed=0-1
Sep 27 02:20:28 sds60 kernel: CPU: 1 PID: 9615 Comm: sds-3.6.700.103 Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.80.1.el7.x86_64 #1
Sep 27 02:20:28 sds60 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Sep 27 02:20:28 sds60 kernel: Call Trace:
Sep 27 02:20:28 sds60 kernel: [] dump_stack+0x19/0x1f
Sep 27 02:20:28 sds60 kernel: [] dump_header+0x90/0x22d
Sep 27 02:20:28 sds60 kernel: [] ? ktime_get_ts64+0x52/0xf0
Sep 27 02:20:28 sds60 kernel: [] ? delayacct_end+0x8f/0xc0
Sep 27 02:20:28 sds60 kernel: [] oom_kill_process+0x2d5/0x4a0
Sep 27 02:20:28 sds60 kernel: [] ? oom_unkillable_task+0x93/0x120
Sep 27 02:20:28 sds60 kernel: [] out_of_memory+0x31a/0x500
Sep 27 02:20:28 sds60 kernel: [] __alloc_pages_nodemask+0xae4/0xbf0
Sep 27 02:20:28 sds60 kernel: [] alloc_pages_current+0x98/0x110
Sep 27 02:20:28 sds60 kernel: [] __page_cache_alloc+0x97/0xb0
Sep 27 02:20:28 sds60 kernel: [] filemap_fault+0x270/0x420
Sep 27 02:20:28 sds60 kernel: [] __xfs_filemap_fault+0x7e/0x1d0 [xfs]
Sep 27 02:20:28 sds60 kernel: [] xfs_filemap_fault+0x2c/0x40 [xfs]
Sep 27 02:20:28 sds60 kernel: [] __do_fault.isra.61+0x8a/0x100
Sep 27 02:20:28 sds60 kernel: [] do_read_fault.isra.63+0x4c/0x1b0
Sep 27 02:20:28 sds60 kernel: [] handle_mm_fault+0xa20/0xfb0
Sep 27 02:20:28 sds60 kernel: [] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
Sep 27 02:20:28 sds60 kernel: [] __do_page_fault+0x213/0x510
Sep 27 02:20:28 sds60 kernel: [] do_page_fault+0x35/0x90
Sep 27 02:20:28 sds60 kernel: [] page_fault+0x28/0x30
Sep 27 02:20:28 sds60 kernel: Out of memory: Kill process 1262 (sds-3.6.700.103) score 240 or sacrifice child
Sep 27 02:20:28 sds60 kernel: Killed process 1262 (sds-3.6.700.103), UID 0, total-vm:75663912kB, anon-rss:9100672kB, file-rss:3944kB, shmem-rss:10796856kB
Sep 27 02:20:29 sds60 systemd: sds.service: main process exited, code=killed, status=9/KILL
Sep 27 02:20:29 sds60 systemd: Unit sds.service entered failed state.
Sep 27 02:20:29 sds60 systemd: sds.service failed.
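To locate these events quickly (a minimal check; /var/log/messages is assumed as the log path, which is the default on the RHEL 7-based kernel shown above), grep for the oom-killer and the SDS service:
grep -E "oom-killer|Out of memory|sds.service" /var/log/messages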
- The SDSs that are crashing have two NUMA (Non-Uniform Memory Access) nodes. This can be seen with "numactl --hardware":
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6
node 0 size: 20923 MB
node 0 free: 1014 MB
node 1 cpus: 7 8 9 10 11 12 13
node 1 size: 21166 MB
node 1 free: 19481 MB
- Other SDSs in the same Protection Domain are not crashing; these SDSs have only one NUMA node:
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
node 0 size: 42089 MB
node 0 free: 19881 MB
Impact
This causes data to become degraded, which leads to rebuilds, and if enough SDSs crash at the same time, to data unavailability.
Cause
The FG metadata cache allows FG-based volumes to cache reads, increasing read performance. The cache requires additional memory from the operating system, so that memory must be free when the feature is enabled. If it is not, the OS invokes the oom-killer (out-of-memory killer) to terminate the largest consumer of memory, which is most likely the SDS service.
In this case, both the crashing and the non-crashing SDSs were SVMs (Storage VMs), each allocated 40+ GB of RAM and 14 vCPUs from their respective ESXi hosts. By default, the SDS only allocates memory from NUMA node 0. When there are multiple NUMA nodes, the SDS cannot pull memory from NUMA node 1 to satisfy the additional memory requirement of the FG metadata cache.
The difference in SVM configurations (one NUMA node versus two) comes down to the underlying hardware. The SVMs with one NUMA node run on hosts with 18 cores per socket, so all 14 vCPUs fit on a single CPU socket. The SVMs with two NUMA nodes run on hosts with only 12 cores per socket, so the 14 vCPUs could not fit on one socket and had to be spread across both sockets, resulting in two NUMA nodes being presented to the SVM.
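One way to confirm the topology presented to an SVM (a sketch; field names can vary slightly between lscpu versions) is to compare sockets, cores per socket, and NUMA nodes from inside the guest:
lscpu | grep -iE "socket|numa"
On the affected SVMs, this reports two NUMA nodes, matching the "numactl --hardware" output shown above.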
Resolution
Workaround
Verify that the SVM or OS has sufficient memory to satisfy the new memory requirements.
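For example (a sketch; the commands assume a Linux-based SVM), total free memory and free memory per NUMA node can be checked before enabling the cache:
free -m
numactl --hardware
If the free memory on NUMA node 0 is smaller than the metadata cache size being configured, the oom-killer behavior described above is likely to occur.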
There are two possible workarounds; only one is needed:
- Add the line below to the SDS conf.txt on the nodes that have two NUMA nodes, then restart the SDS process, following all regular maintenance procedures (see the sketch after this list):
numa_memory_affinity=0
This setting allows the SDS to use memory from both NUMA nodes. Additional information on why this setting is needed is available in the PowerFlex documentation.
- Reduce the number of vCPUs to match the underlying hardware. In this case, if the SVM had been allocated 12 vCPUs, they would have fit on a single socket and all memory would have been available in a single NUMA node.
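As an illustration of the first workaround (a sketch only; the conf.txt path shown is the common default location and must be verified on the system, and the SDS should only be restarted under regular maintenance procedures):
# Assumed default path for the SDS configuration file; verify before use
echo "numa_memory_affinity=0" >> /opt/emc/scaleio/sds/cfg/conf.txt
# Restart the SDS service so the new setting takes effect
systemctl restart sds.service
After the restart, confirm that the SDS rejoins the cluster and that any rebuilds complete before repeating the change on the next node.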
Impacted Versions
PowerFlex 3.5.x
PowerFlex 3.6.x
PowerFlex 4.x
Fixed In Version
This is working as designed.