PowerFlex: MDM switchover fails due to memory allocation failure — mos_MemMalloc
Summary: When MDM ownership is switched (manually or otherwise), the receiving MDM fails to start properly because of a memory allocation failure, leaving the cluster without a Master MDM.
Symptoms
The event output from the receiving MDM (/opt/emc/scaleio/mdm/bin/showevents.py) contains multiple entries for attempts to take over the Master MDM role, all within a short period of time:
2017-10-04 12:08:33.915 MDM_CLUSTER_BECOMING_MASTER WARNING This MDM, ID 394760fd6714xxxx, took control of the cluster and is now the Master MDM.
2017-10-04 12:08:33.915 MDM_BECOMING_MASTER WARNING This MDM is switching to Master mode. MDM will start running.
..
2017-10-04 12:08:34.309 MDM_CLUSTER_BECOMING_MASTER WARNING This MDM, ID 394760fd6714xxxx, took control of the cluster and is now the Master MDM.
2017-10-04 12:08:34.309 MDM_BECOMING_MASTER WARNING This MDM is switching to Master mode. MDM will start running.
The exp.0 file from the receiving MDM contains entries such as:
04/10 12:08:34.079823 Panic in file /data/build/workspace/ScaleIO-SLES12-2/src/mos/usr/mos_utils.c, line 73,
function mos_MemMalloc, PID 9978.Panic Expression bCanFail . /opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(
mosDbg_PanicPrepare+0x115) [0x6a86f5] /opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(mos_MemMalloc+0x81)
[0x6ac0d1] /opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(multiHeadMgr_GetUpdateMultiHeadsMsg+0x66) [0x57123c]
/opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(tgtMgr_ConfigureTgt+0x9c1) [0x4d579e]
/opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(tgtMgr_HandleWorkReq+0x41b) [0x4d6206]
/opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211() [0x6c57d8]
/opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(mosUmt_StartFunc+0xea) [0x6c51af]
/opt/emc/scaleio/mdm/bin/mdm-2.0.13000.211(mosUmt_SignalHandler+0x51) [0x6c65d1]
/lib64/libpthread.so.0(+0x10b00) [0x7f844e8a6b00] /lib64/libc.so.6(sleep+0xd4) [0x7f844d8911a4]
The /var/log/messages file shows multiple restarts of the MDM service, with entries like the following:
systemd[1]: mdm.service: Main process exited, code=exited, status=255/n/a
systemd[1]: mdm.service: Unit entered failed state.
systemd[1]: mdm.service: Failed with result 'exit-code'.
systemd[1]: mdm.service: Service has no hold-off time, scheduling restart.
systemd[1]: Stopped scaleio mdm.
systemd[1]: mdm.service: Start request repeated too quickly.
systemd[1]: Failed to start scaleio mdm.
systemd[1]: mdm.service: Unit entered failed state.
systemd[1]: mdm.service: Failed with result 'start-limit'.
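The restart loop above can be confirmed quickly with a small triage sketch. The log path and match strings are taken from the excerpt in this article; adjust both for your environment:

```shell
# Triage sketch: count MDM restart/start-limit evidence in a syslog file.
# Default path is /var/log/messages as in this article; pass another file
# (for example, one from an exported log bundle) as the first argument.
LOG="${1:-/var/log/messages}"

exits=$(grep -c 'mdm.service: Main process exited' "$LOG" 2>/dev/null)
limits=$(grep -c "Failed with result 'start-limit'" "$LOG" 2>/dev/null)
echo "mdm.service exits: ${exits:-0}, start-limit failures: ${limits:-0}"
```

A steadily growing exit count together with a final 'start-limit' failure matches the pattern shown above.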
Cause
The panic in mos_MemMalloc means the kernel refused a memory allocation requested by the MDM process. With vm.overcommit_memory = 2 and vm.overcommit_ratio = 50, the host does not overcommit memory, so any allocation that would push the committed address space past the commit limit fails (see Additional Information below).
Resolution
This is not a ScaleIO issue; ScaleIO is working as designed.
To check and, if needed, modify the vm.overcommit settings, perform the following steps:
1. Using SSH, log in to the SDS node as the root user.
2. On that node, run:
cat /etc/sysctl.conf | grep "vm.overcommit"
Example:
[root@sds-node logs]# cat /etc/sysctl.conf | grep "vm.overcommit"
vm.overcommit_memory = 2
vm.overcommit_ratio = 50
3. Run the following commands:
sed -i 's/vm\.overcommit_memory = .*/vm\.overcommit_memory = 2/g' /etc/sysctl.conf
sed -i 's/vm\.overcommit_ratio = .*/vm\.overcommit_ratio = 100/g' /etc/sysctl.conf
sysctl -p
4. Verify:
[root@sds-node logs]# cat /etc/sysctl.conf | grep "vm.overcommit"
vm.overcommit_memory = 2
vm.overcommit_ratio = 100
Repeat these steps on every affected SDS in the environment to ensure all nodes use the recommended best-practice settings. You do not need to place the SDS in maintenance mode to perform this operation.
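The steps above can be sketched as one idempotent helper. This is a sketch under the assumptions in this article (the vm.overcommit keys may or may not already be present in /etc/sysctl.conf); it runs here against a throwaway copy so the logic can be tried without root:

```shell
# Sketch: set the recommended overcommit values in a sysctl.conf-style file,
# adding the keys if they are missing and rewriting them if they exist.
apply_overcommit() {
    conf="$1"
    grep -q '^vm\.overcommit_memory' "$conf" || echo 'vm.overcommit_memory = 2' >> "$conf"
    grep -q '^vm\.overcommit_ratio'  "$conf" || echo 'vm.overcommit_ratio = 100' >> "$conf"
    sed -i 's/^vm\.overcommit_memory *= *.*/vm.overcommit_memory = 2/' "$conf"
    sed -i 's/^vm\.overcommit_ratio *= *.*/vm.overcommit_ratio = 100/' "$conf"
}

# Demo on a throwaway copy with the pre-change values from this article.
demo="$(mktemp)"
printf 'vm.overcommit_memory = 2\nvm.overcommit_ratio = 50\n' > "$demo"
apply_overcommit "$demo"
cat "$demo"
# On a real node (as root): apply_overcommit /etc/sysctl.conf && sysctl -p
```

Because the helper rewrites existing keys rather than appending duplicates, it is safe to run repeatedly across nodes.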
To learn more about these settings, see the Linux kernel documentation on overcommit accounting.
Additional Information
Check the sysctl kernel parameters for memory overcommit:
# sysctl -a | grep commit
vm.overcommit_memory = 2 (default is 0)
vm.overcommit_ratio = 50 (default is 50)
In this case, vm.overcommit_memory set to 2 means memory is not overcommitted: any memory allocation that would exceed the overcommit limit fails. The total committed address space on the system may not exceed swap plus a configurable percentage of physical RAM (50% by default). When this setting is 0, the kernel refuses obvious overcommit requests but allows root processes to allocate beyond the overcommit limit.
To check the current overcommit limit and the amount currently committed, see CommitLimit and Committed_AS, respectively, in the output of the following command:
#cat /proc/meminfo
MemTotal: 8174572 kB
..
CommitLimit: 4087284 kB
Committed_AS: 3879388 kB
This host has 8 GB of RAM, and CommitLimit is set to ~4 GB, that is, 50% of the total address space.
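The CommitLimit figure above can be reproduced from the formula. The kernel works in pages; swap is assumed to be 0 on the example host, since CommitLimit comes out to exactly 50% of RAM:

```shell
# Sketch: recompute CommitLimit from the example /proc/meminfo values.
# With vm.overcommit_memory = 2:
#   CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100  (in pages)
mem_total_kb=8174572   # MemTotal from the example output above
swap_total_kb=0        # assumption: no swap on the example host
ratio=50               # vm.overcommit_ratio
page_kb=4              # 4 KiB pages

total_pages=$(( mem_total_kb / page_kb ))
limit_pages=$(( total_pages * ratio / 100 + swap_total_kb / page_kb ))
commit_limit_kb=$(( limit_pages * page_kb ))
echo "CommitLimit = ${commit_limit_kb} kB"   # 4087284 kB, matching the output above
```

With Committed_AS at 3879388 kB, the host was already within ~200 MB of this limit, so a modest allocation by the incoming Master MDM was enough to trigger the mos_MemMalloc panic.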
To resolve the issue, add or edit one of the following in /etc/sysctl.conf:
- Change "vm.overcommit_ratio" to 100 so that the operating system can commit the total available address space, then restart.
To learn more about these settings, see the Linux kernel documentation on overcommit accounting.