Dell Technologies VxRail: High CPU contention on NSX edge node.
Summary: Dell Technologies VxRail: High CPU contention on an NSX Edge node. This article explains how to identify what is causing the high CPU usage on the NSX Edge node.
This article is not tied to any specific product.
Not all product versions are identified in this article.
Symptoms
There is high CPU contention on the ESXi host, specifically for the NSX Edge node running on it.
If you reboot this Edge node and it is using Equal Cost Multipath (ECMP), the high CPU contention, together with the high network traffic, appears on the next Edge node, and the original Edge node returns to normal.
From within the Edge node itself, the load is normal and no dropped packets are seen in network captures.
Cause
This is caused by high CPU usage resulting from high network traffic through a specific Edge vNIC.
CPU usage (%USED) and CPU run (%RUN) comparison:
Bad edge
ID        GID       NAME NWLD %USED  %RUN   %SYS  %WAIT   %VMWAIT %RDY  %IDLE  %OVRLP %CSTP %MLMTD %SWPWT
16580792  16580792  xxx  27   454.64 471.21 43.19 2307.95 7.32    13.72 334.52 2.67   0.00  0.00   0.00
Good edge
ID        GID       NAME NWLD %USED  %RUN   %SYS  %WAIT   %VMWAIT %RDY  %IDLE  %OVRLP %CSTP %MLMTD %SWPWT
10908367  10908367  xxx  27   240.09 225.96 20.80 2507.98 6.72    8.39  443.93 1.71   0.00  0.00   0.00
Network port comparison for RX and TX:
Bad edge
PORT-ID   USED-BY           TEAM-PNIC DNAME        PKTTX/s   MbTX/s  PSZTX  PKTRX/s   MbRX/s  PSZRX  %DRPTX %DRPRX
50331714  2666974:xxx.eth2  vmnic2    DvsPortset-1 519615.17 2729.88 688.00 128623.96 694.32  707.00 0.00   0.00
50331715  2666974:xxx.eth1  vmnic3    DvsPortset-1 76622.01  523.06  894.00 230747.22 1126.70 640.00 0.00   0.00
50331716  2666974:xxx.eth0  vmnic6    DvsPortset-1 51422.12  168.87  430.00 312557.22 1691.50 709.00 0.00   0.00
Good edge
PORT-ID   USED-BY           TEAM-PNIC DNAME        PKTTX/s   MbTX/s  PSZTX  PKTRX/s   MbRX/s  PSZRX  %DRPTX %DRPRX
50331744  1752165:xxx.eth2  vmnic3    DvsPortset-1 42856.22  238.49  729.00 50329.21  262.45  683.00 0.00   0.00
50331745  1752165:xxx.eth1  vmnic7    DvsPortset-1 22069.93  91.24   541.00 20044.33  96.35   630.00 0.00   0.00
50331746  1752165:xxx.eth0  vmnic2    DvsPortset-1 27771.00  169.72  801.00 23548.13  144.95  806.00 0.00   0.00
Packets per second comparison:
Bad edge
"rxqueue": { "count": 4, "details": [
{"intridx": 0, "pps": 30175, "mbps": 203.9, "errs": 0.0},
{"intridx": 1, "pps": 17175, "mbps": 61.1, "errs": 0.0},
{"intridx": 2, "pps": 15626, "mbps": 51.4, "errs": 0.0},
{"intridx": 3, "pps": 14596, "mbps": 57.4, "errs": 0.0} ]},
"txqueue": { "count": 4, "details": [
{"intridx": 0, "pps": 121634, "mbps": 828.2, "errs": 0.0},
{"intridx": 1, "pps": 105483, "mbps": 708.5, "errs": 0.0},
{"intridx": 2, "pps": 137687, "mbps": 1087.9, "errs": 0.0},
{"intridx": 3, "pps": 116488, "mbps": 831.6, "errs": 0.0} ]},
Good edge
"rxqueue": { "count": 4, "details": [
{"intridx": 0, "pps": 22388, "mbps": 115.1, "errs": 0.0},
{"intridx": 1, "pps": 54248, "mbps": 497.1, "errs": 0.0},
{"intridx": 2, "pps": 67004, "mbps": 650.2, "errs": 0.0},
{"intridx": 3, "pps": 22688, "mbps": 118.8, "errs": 0.0} ]},
"txqueue": { "count": 4, "details": [
{"intridx": 0, "pps": 21222, "mbps": 125.0, "errs": 0.0},
{"intridx": 1, "pps": 46125, "mbps": 384.3, "errs": 0.0},
{"intridx": 2, "pps": 22771, "mbps": 131.7, "errs": 0.0},
{"intridx": 3, "pps": 29040, "mbps": 162.0, "errs": 0.0} ]},
There is high network traffic on a specific vNIC of the Edge node. A packet capture on the Edge VM, which acts as the gateway, shows that a specific application is generating the high traffic, as summarized in the final Wireshark information.
Resolution
To resolve this issue:
- If a specific application is found to be generating high network traffic on a specific port, contact the application team.
- Review the design of the network components so that large amounts of traffic are not concentrated on specific nodes.
Use the following troubleshooting workflow to find the cause of your issue.
1. Enable the Edge node engineering mode so that you can capture the system load and run top as root.
Run the following script on the NSX Manager to obtain the Edge node password (edge-xx can be found in the NSX Manager GUI):
/home/secureall/secureall/sem/WEB-INF/classes/GetSpockEdgePassword.sh edge-xx
Then log on to the console of the Edge node as admin and run: enable -> debug engineeringmode enable -> st en
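For example, once engineering mode is enabled and you have a root shell, the guest-level load can be recorded to a file for later comparison. This is only a sketch; the interval, iteration count, and output path below are example values:
top -b -d 5 -n 3 > /tmp/edge-top.txt   # batch mode: 3 samples, 5 seconds apart
This confirms whether the CPU usage seen at the ESXi level is also visible inside the guest (in this case it is not).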
2. Capture the esxtop information on the ESXi host. It is best to compare the result from the ESXi host that is running the normal Edge node with the result from the ESXi host that is running the problematic Edge node.
A. Run 'esxtop' on the ESXi host to which the Edge node migrated.
B. Run 'esxtop' followed by 'n' (network view) on the same ESXi host.
C. Collect 'esxtop' per-CPU-core data using the current GID of the problematic VM: get the GID value, press 'E', and enter the GID number.
D. Review all the data for this specific Edge VM.
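If you prefer to collect the esxtop data non-interactively so that both hosts can be compared offline, esxtop batch mode can be used. This is a sketch only; the delay, iteration count, and datastore path are example values:
esxtop -b -d 5 -n 12 > /vmfs/volumes/<Datastore-name>/esxtop-bad-edge.csv   # 12 samples, 5 seconds apart
Run the same command on the ESXi host that runs the normal Edge node and compare the %USED, %RUN, and %RDY values of the Edge VM worlds.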
3. Run net-stats for statistical information. Check the Packets per Second statistics in the output and compare them with those from the ESXi host that is running the normal Edge node.
'net-stats -A -t WwQqihVvh -i 5 -n 2' - run on the ESXi host to which the Edge node migrated; the following high figures were returned:
"txqueue": { "count": 4, "details": [
{"intridx": 0, "pps": 121634, "mbps": 828.2, "errs": 0.0},
{"intridx": 1, "pps": 105483, "mbps": 708.5, "errs": 0.0},
{"intridx": 2, "pps": 137687, "mbps": 1087.9, "errs": 0.0},
{"intridx": 3, "pps": 116488, "mbps": 831.6, "errs": 0.0} ]},
4. Use Wireshark to determine which application is generating the most traffic.
A. In the ESXi host shell, get the switchport details of the ESG VM using the "net-stats -l" command. Note the switchport of each vNIC of the Edge VM in question, so you can capture and identify the type of traffic flowing through that vNIC.
B. Perform a packet capture on each related switchport, one by one, for one minute and save each capture to a .pcap file. Change the <values> to match your setup.
pktcap-uw --switchport <switchport-id> --capture VnicTx,VnicRx -o /vmfs/volumes/<Datastore-name>/<switchport-id>.pcap
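pktcap-uw keeps capturing until it is stopped, so one common way to limit the capture to about one minute is to start it in the background and stop it after 60 seconds. The switchport and datastore below are example values, and the kill command follows the generally documented pattern for stopping all running pktcap-uw sessions:
pktcap-uw --switchport 50331714 --capture VnicTx,VnicRx -o /vmfs/volumes/datastore1/50331714.pcap &
sleep 60
kill $(lsof | grep pktcap-uw | awk '{print $1}' | sort -u)   # stop the background capture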
5. Load all the collected .pcap files into Wireshark to generate an overall report in chronological order. Work out which port most of the traffic is coming from by examining the source and destination IP addresses.
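If a command-line summary is more convenient than the Wireshark GUI, the same analysis can be done with the mergecap and tshark tools that ship with Wireshark on your workstation. The file names are examples; the conversation statistics list the top talkers so the heaviest source/destination pair and port stand out:
mergecap -w merged.pcap 50331714.pcap 50331715.pcap 50331716.pcap
tshark -r merged.pcap -q -z conv,ip    # top talkers by IP pair
tshark -r merged.pcap -q -z conv,tcp   # top TCP conversations, including ports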
6. In an ECMP environment, a given traffic flow is pinned to one Edge node by ECMP hashing. If that ESG is reloaded or redeployed, the flow moves to another ESG, which then starts reporting high CPU usage.
By default, traffic is distributed among all ECMP paths based on an internal hashing algorithm that uses a two-tuple (source IP + destination IP), so traffic on a given port, such as TCP/1556, is not all pinned to one specific Edge as long as it involves different source/destination pairs.
In this instance, heavy backup traffic between one source and destination IP pair is pinned to this Edge, causing ESXi to give this ESG VM more CPU cycles to handle that traffic. That is why high CPU utilization is seen at the ESXi/vCenter level while CPU utilization inside the ESG guest operating system remains normal. Overall, this is expected behavior.
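The pinning behavior can be illustrated with a simplified two-tuple hash. This is only a conceptual sketch, not the actual NSX hashing algorithm, and the IP addresses and Edge count are made-up example values:
# A flow keyed only by (source IP, destination IP) always lands on the same Edge,
# regardless of how much traffic it carries or which TCP port it uses.
SRC_IP="10.10.10.21"; DST_IP="10.20.30.40"; NUM_EDGES=4
HASH=$(printf '%s-%s' "$SRC_IP" "$DST_IP" | cksum | awk '{print $1}')
echo "Flow ${SRC_IP} -> ${DST_IP} is pinned to Edge index $((HASH % NUM_EDGES))"
Because the TCP port is not part of the key, a single heavy backup stream such as TCP/1556 between one source/destination pair cannot be spread across the other Edges.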
Affected Products
VxRail Appliance Family, VxRail Appliance Series
Article Properties
Article Number: 000202066
Article Type: Solution
Last Modified: 16 May 2023
Version: 3