VxRail:由于节点未响应“esxcli”命令,物理视图未显示

Summary: VxRail 群集节点物理视图缺失,节点未响应“esxcli”命令,NTP 未同步。

This article applies to This article does not apply to This article is not tied to any specific product. Not all product versions are identified in this article.

Symptoms

  1. 缺少所有节点的物理视图。
    从 web.log,API 网关在物理视图数据获取 10 分钟后超时:

    2023-11-20T09:24:31.039Z <7527c8d153655e9bbb43b32dcd312443> marvin [ERROR] <261> ApplianceServiceImpl.java populatePvCache() (276): failed to fetch data.
    javax.ws.rs.ServerErrorException: HTTP 504 Gateway Time-out
    	at org.glassfish.jersey.client.JerseyInvocation.createExceptionForFamily(JerseyInvocation.java:1125) ~[jersey-client-2.27.jar:?]
    	at org.glassfish.jersey.client.JerseyInvocation.convertToException(JerseyInvocation.java:1105) ~[jersey-client-2.27.jar:?]
    	at org.glassfish.jersey.client.JerseyInvocation.translate(JerseyInvocation.java:883) ~[jersey-client-2.27.jar:?]
    	at org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$1(JerseyInvocation.java:767) ~[jersey-client-2.27.jar:?]
    	at org.glassfish.jersey.internal.Errors.process(Errors.java:316) ~[jersey-common-2.27.jar:?]
    	at org.glassfish.jersey.internal.Errors.process(Errors.java:298) ~[jersey-common-2.27.jar:?]
    	at org.glassfish.jersey.internal.Errors.process(Errors.java:229) ~[jersey-common-2.27.jar:?]
    	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:414) ~[jersey-common-2.27.jar:?]
    	at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:765) ~[jersey-client-2.27.jar:?]
    	at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:456) ~[jersey-client-2.27.jar:?]
    	at org.glassfish.jersey.client.JerseyInvocation$Builder.post(JerseyInvocation.java:357) ~[jersey-client-2.27.jar:?]
    	at com.vce.commons.domainowner.graphq.DefaultQueryExecutorImpl.doJsonRequestExecution(DefaultQueryExecutorImpl.java:139) ~[commons-7.0.480.jar:?]
    	at com.vce.commons.domainowner.graphq.DefaultQueryExecutorImpl.execute(DefaultQueryExecutorImpl.java:94) ~[commons-7.0.480.jar:?]
    	at com.emc.mystic.manager.graphql.client.host.HostQuery.configuredHosts(HostQuery.java:138) ~[do-host-graphql-client-1.20.41.jar:?]
    	at com.emc.mystic.manager.graphql.client.host.HostQuery.configuredHosts(HostQuery.java:102) ~[do-host-graphql-client-1.20.41.jar:?]
    	at com.vce.commons.domainowner.node.NodeRepository.getAllClusterNodeData(NodeRepository.java:1997) ~[commons-7.0.480.jar:?]
    	at com.emc.mystic.manager.cluster.service.ApplianceServiceImpl.getAllHostData(ApplianceServiceImpl.java:543) ~[classes/:?]
    	at com.emc.mystic.manager.cluster.service.ApplianceServiceImpl.populatePvCache(ApplianceServiceImpl.java:260) ~[classes/:?]
    	at com.emc.mystic.manager.cluster.service.ApplianceServiceImpl.lambda$updateCacheTask$6(ApplianceServiceImpl.java:320) ~[classes/:?]
    	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    	at java.lang.Thread.run(Thread.java:829) [?:?]
    2023-11-20T09:24:31.045Z <7527c8d153655e9bbb43b32dcd312443> marvin [INFO] <261> ApplianceServiceImpl.java lambda$updateCacheTask$6() (321): Success to refresh cache for node: <node SN>
    2023-11-20T09:24:31.046Z <7527c8d153655e9bbb43b32dcd312443> marvin [INFO] <59> ApplianceDataRoot.java refreshVxRailClusterTag() (289): Skip refreshing VxRail-Cluster-Tag triggered by scheduled job as it has already been done within 15 minutes.
    2023-11-20T09:24:31.046Z <7527c8d153655e9bbb43b32dcd312443> marvin [INFO] <59> ApplianceDataRoot.java fetchData() (357): [VXMPERF] PV fetch data execution time in 602 seconds
    
  2. 在 VxRail Manager 上运行以下命令以查询节点物理视图数据:

    #curl -X GET --unix-socket /var/lib/vxrail/nginx/socket/nginx.sock http://127.0.0.1/rest/vxm/internal/do/v1/host/query -H 'Content-Type: application/json' -d '{"query":"{ configuredHosts { hardware { sn } } }"}' 2>/dev/null | jq | egrep "name|sn" | awk -F\" '/sn/{print $4}' | sort -u | while read sn; do time curl -X POST --unix-socket /var/lib/vxrail/nginx/socket/nginx.sock -H 'Content-Type: application/json' -d '{"variables":"{\"sn\":\"'${sn}'\"}","query":"query ($sn:[String]){ configuredHosts(sn:$sn) { moid name type summary{ hardware{ cpuNum}} config{ hostUUID isPrimary network{ vnic{ device ipv4 allIpv6s nonLinkLocalIpv6} idrac{ ipAddress ipAddressSource netmask gateway ipAddressV6 gatewayV6 prefixLen ipv6AutoConfig vlan{ enabled id priority}}} localSlotClaims{ slot bay type usage diskgroupId} diskgroup{ slotNum current{ type}} installedComponent{ displayName version model description installedTime}} runtime{ connectionState overallStatus powerState inMaintenanceMode} hardware{ sn psnt slot manufacturer name systemStatusLed tpm model firmware{ id model} firmwareRevisions{ idsdmFwRevision biosFwRevision bmcFwRevision diskCtrlFwRevision bossFwRevision cpldFwRevision expanderBackplane nonExpanderBackplane dcpmFwRevision percDiskCtrlFwRevision} baseline{ sn slot chassisId isMissing} chassis{ name model psnt partNumber serviceTag psus{ sn name slot manufacturer partNumber firmwareVersion baseline{ sn slot isMissing}}} disks{ sn guid capacity slot firmwareVersion diskType diskState manufacturer protocol maxCapableSpeed model ledStatus writeEndurance bay enclosure remainingWriteEnduranceRate encryptionAbility encryptionStatus baseline{ sn slot bay isMissing}} bootDevices{ sn firmwareVersion sataType powerOnHours powerCycleCount avrEraseCount maxEraseCount capacity deviceModel slot health bootDeviceType status blockSizeBytes partNumber manufacturer controllerFirmware controllerModel controllerStatus raidStatus} nics{ mac linkSpeed firmwareFamilyVersion linkStatus fqdd specificNicType wwnn wwpn drivers{ driverName driverVersion}} position{ rackName rackSlot} storageInstance{ securityStatus encryptionMode}}} }","operationName":null}' http://0/rest/vxm/internal/do/v1/host/query; echo ""; echo ""; echo "Done checking $sn"; done
  3. 结果显示一个或多个节点在 10 分钟内未返回数据:
    有问题的节点返回错误“504 网关超时”,而工作节点返回正确的物理视图数据。
    显示一个或多个节点在 10 分钟内未返回数据的结果

  4. 基于上述标识的节点SN,登录节点并运行 esxcli 命令,它卡住了,但 localcli 工作:
    运行 esxcli 命令卡住,但 localcli 工作正常

 

Cause

节点存在 NTP 同步问题,导致 esxcli 命令卡住,无响应。

VxRail 物理视图调用 esxcli 节点上用于检索固件信息的命令,当节点卡在运行时,它无法获取信息 esxcli 相同名称。

 

Resolution

解决方案是识别所有节点上的 NTP 同步问题并修复它。请查看以下步骤以验证节点上的 NTP 状态:

  1. 检查 /var/log/vobd.log是否有任何指示系统时钟不再与上游时间服务器同步的错误消息。

  2. 如果是这样,请检查 NTP 服务器状态。

    #ntpq -p

    ntpq -p 命令示例

    如果“reach”值不是 377,则主机缺少 NTP 事务,此 ESXi 上的时间可能不正确,需要注意。

    #ntpq -c

    ntpq -c 命令示例

    如果“层”值超出 2 到 6 的范围,则此 ESXi 上的时间同步可能会遇到额外的延迟,从而可能导致计时不准确。

  3. 使用 API 检查 VxRail Manager 数据库中的 NTP 服务器地址是否正确。

    curl -k --user "[vCenter account]:[vCenter password]" --request GET "https://localhost/rest/vxm/v1/system/ntp"
  4. 如果没有,请按照现有的“操作方法”过程(VxRail 过程→其他→“操作方法”过程→更改 VxRail IP 地址→重新指向新的 NTP 服务器 IP 地址)更新 NTP 服务器 IP 地址。

  5. 如果需要,请使用另一个正常工作的 NTP 服务器,重新启动节点 vpxa 和主机服务,请注意,重新启动节点 vpxa 和主机服务将暂时解决问题,但如果 NTP 无法再次同步,问题将再次出现。

    /etc/init.d/vpxa restart
    /etc/init.d/hostd restart

 

Affected Products

VxRail
Article Properties
Article Number: 000224344
Article Type: Solution
Last Modified: 24 May 2024
Version:  2
Find answers to your questions from other Dell users
Support Services
Check if your device is covered by Support Services.