Dell Technologies VxRail: High CPU contention on NSX edge node.
Summary: Dell Technologies VxRail: High CPU contention on an NSX Edge node. This article explains how to identify what is causing the high CPU usage on the NSX Edge node.
This article is not tied to any specific product.
Not all product versions are identified in this article.
Symptoms
There is high CPU contention on the ESXi host, specifically for the NSX Edge node running on it.
If you reboot this Edge node and it is using Equal Cost Multipath (ECMP), the high CPU contention, together with the high network traffic, appears on the next Edge node, and the original Edge node returns to normal.
From within the Edge node itself, the load is normal and no dropped packets are seen in network captures.
Cause
This is caused by high CPU usage resulting from high network traffic through a specific Edge vNIC.
CPU usage (%USED) and CPU run (%RUN) comparison:
Bad edge
ID        GID       NAME NWLD %USED  %RUN   %SYS  %WAIT   %VMWAIT %RDY  %IDLE  %OVRLP %CSTP %MLMTD %SWPWT
16580792  16580792  xxx  27   454.64 471.21 43.19 2307.95 7.32    13.72 334.52 2.67   0.00  0.00   0.00
Good edge
ID        GID       NAME NWLD %USED  %RUN   %SYS  %WAIT   %VMWAIT %RDY  %IDLE  %OVRLP %CSTP %MLMTD %SWPWT
10908367  10908367  xxx  27   240.09 225.96 20.80 2507.98 6.72    8.39  443.93 1.71   0.00  0.00   0.00
Network port comparison for RX and TX:
Bad edge
PORT-ID   USED-BY           TEAM-PNIC DNAME        PKTTX/s   MbTX/s  PSZTX  PKTRX/s   MbRX/s  PSZRX  %DRPTX %DRPRX
50331714  2666974:xxx.eth2  vmnic2    DvsPortset-1 519615.17 2729.88 688.00 128623.96 694.32  707.00 0.00   0.00
50331715  2666974:xxx.eth1  vmnic3    DvsPortset-1 76622.01  523.06  894.00 230747.22 1126.70 640.00 0.00   0.00
50331716  2666974:xxx.eth0  vmnic6    DvsPortset-1 51422.12  168.87  430.00 312557.22 1691.50 709.00 0.00   0.00
Good edge
PORT-ID   USED-BY           TEAM-PNIC DNAME        PKTTX/s   MbTX/s  PSZTX  PKTRX/s   MbRX/s  PSZRX  %DRPTX %DRPRX
50331744  1752165:xxx.eth2  vmnic3    DvsPortset-1 42856.22  238.49  729.00 50329.21  262.45  683.00 0.00   0.00
50331745  1752165:xxx.eth1  vmnic7    DvsPortset-1 22069.93  91.24   541.00 20044.33  96.35   630.00 0.00   0.00
50331746  1752165:xxx.eth0  vmnic2    DvsPortset-1 27771.00  169.72  801.00 23548.13  144.95  806.00 0.00   0.00
Packets per second comparison:
Bad edge
"rxqueue": { "count": 4, "details": [
{"intridx": 0, "pps": 30175, "mbps": 203.9, "errs": 0.0},
{"intridx": 1, "pps": 17175, "mbps": 61.1, "errs": 0.0},
{"intridx": 2, "pps": 15626, "mbps": 51.4, "errs": 0.0},
{"intridx": 3, "pps": 14596, "mbps": 57.4, "errs": 0.0} ]},
"txqueue": { "count": 4, "details": [
{"intridx": 0, "pps": 121634, "mbps": 828.2, "errs": 0.0},
{"intridx": 1, "pps": 105483, "mbps": 708.5, "errs": 0.0},
{"intridx": 2, "pps": 137687, "mbps": 1087.9, "errs": 0.0},
{"intridx": 3, "pps": 116488, "mbps": 831.6, "errs": 0.0} ]},
Good edge
"rxqueue": { "count": 4, "details": [
{"intridx": 0, "pps": 22388, "mbps": 115.1, "errs": 0.0},
{"intridx": 1, "pps": 54248, "mbps": 497.1, "errs": 0.0},
{"intridx": 2, "pps": 67004, "mbps": 650.2, "errs": 0.0},
{"intridx": 3, "pps": 22688, "mbps": 118.8, "errs": 0.0} ]},
"txqueue": { "count": 4, "details": [
{"intridx": 0, "pps": 21222, "mbps": 125.0, "errs": 0.0},
{"intridx": 1, "pps": 46125, "mbps": 384.3, "errs": 0.0},
{"intridx": 2, "pps": 22771, "mbps": 131.7, "errs": 0.0},
{"intridx": 3, "pps": 29040, "mbps": 162.0, "errs": 0.0} ]},
There is high network traffic on a specific vNIC of the Edge node. A packet capture on the Edge VM, which acts as the gateway, shows that a specific application is generating the high traffic, as summarized in the final Wireshark information.
Resolution
To resolve this issue:
- If a specific application is found to be generating high network traffic on a specific port, contact the application team.
- Review the design of the network components so that large amounts of traffic are not concentrated on specific nodes.
Use the following troubleshooting workflow to find the cause of your issue.
1. Enable the Edge node engineering mode so that you can capture the system load and run top as root.
Run the following script on the NSX Manager to obtain the Edge node password (edge-xx can be found in the NSX Manager GUI):
/home/secureall/secureall/sem/WEB-INF/classes/GetSpockEdgePassword.sh edge-xx
Then log on to the console of the Edge node as admin and run: enable -> debug engineeringmode enable -> st en
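For example, once engineering mode is enabled and you have a root shell, the guest-level load can be recorded to a file for later comparison. This is only a sketch; the interval, iteration count, and output path below are example values:
top -b -d 5 -n 3 > /tmp/edge-top.txt   # batch mode: 3 samples, 5 seconds apart
This confirms whether the CPU usage seen at the ESXi level is also visible inside the guest (in this case it is not).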
2. Capture the esxtop information on the ESXi host. It is best to compare the result from the ESXi host that is running the normal Edge node with the result from the ESXi host that is running the problematic Edge node.
A. Run 'esxtop' on the ESXi host to which the Edge node migrated.
B. Run 'esxtop' followed by 'n' (network view) on the same ESXi host.
C. Collect 'esxtop' per-CPU-core data using the current GID of the problematic VM: get the GID value, press 'E', and enter the GID number.
D. Review all the data for this specific Edge VM.
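If you prefer to collect the esxtop data non-interactively so that both hosts can be compared offline, esxtop batch mode can be used. This is a sketch only; the delay, iteration count, and datastore path are example values:
esxtop -b -d 5 -n 12 > /vmfs/volumes/<Datastore-name>/esxtop-bad-edge.csv   # 12 samples, 5 seconds apart
Run the same command on the ESXi host that runs the normal Edge node and compare the %USED, %RUN, and %RDY values of the Edge VM worlds.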
3. Run net-stats for statistical information. Check the Packets per Second statistics in the output and compare them with those from the ESXi host that is running the normal Edge node.
'net-stats -A -t WwQqihVvh -i 5 -n 2' - run on the ESXi host to which the Edge node migrated; the following high figures were returned:
"txqueue": { "count": 4, "details": [
{"intridx": 0, "pps": 121634, "mbps": 828.2, "errs": 0.0},
{"intridx": 1, "pps": 105483, "mbps": 708.5, "errs": 0.0},
{"intridx": 2, "pps": 137687, "mbps": 1087.9, "errs": 0.0},
{"intridx": 3, "pps": 116488, "mbps": 831.6, "errs": 0.0} ]},
4. Use Wireshark to determine which application is generating the most traffic.
A. In the ESXi host shell, get the switchport details of the ESG VM using the "net-stats -l" command. Note the switchport of each vNIC of the Edge VM in question, so you can capture and identify the type of traffic flowing through that vNIC.
B. Perform a packet capture on each related switchport, one by one, for one minute and save each capture to a .pcap file. Change the <values> to match your setup.
pktcap-uw --switchport <switchport-id> --capture VnicTx,VnicRx -o /vmfs/volumes/<Datastore-name>/<switchport-id>.pcap
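pktcap-uw keeps capturing until it is stopped, so one common way to limit the capture to about one minute is to start it in the background and stop it after 60 seconds. The switchport and datastore below are example values, and the kill command follows the generally documented pattern for stopping all running pktcap-uw sessions:
pktcap-uw --switchport 50331714 --capture VnicTx,VnicRx -o /vmfs/volumes/datastore1/50331714.pcap &
sleep 60
kill $(lsof | grep pktcap-uw | awk '{print $1}' | sort -u)   # stop the background capture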
5. Load all the collected .pcap files into Wireshark to generate an overall report in chronological order. Work out which port most of the traffic is coming from by examining the source and destination IP addresses.
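If a command-line summary is more convenient than the Wireshark GUI, the same analysis can be done with the mergecap and tshark tools that ship with Wireshark on your workstation. The file names are examples; the conversation statistics list the top talkers so the heaviest source/destination pair and port stand out:
mergecap -w merged.pcap 50331714.pcap 50331715.pcap 50331716.pcap
tshark -r merged.pcap -q -z conv,ip    # top talkers by IP pair
tshark -r merged.pcap -q -z conv,tcp   # top TCP conversations, including ports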
6. In an ECMP environment, a given traffic flow is pinned to one Edge node by ECMP hashing. If that ESG is reloaded or redeployed, the flow moves to another ESG, which then starts reporting high CPU usage.
By default, traffic is distributed among all ECMP paths based on an internal hashing algorithm that uses a two-tuple (source IP + destination IP), so traffic on a given port, such as TCP/1556, is not all pinned to one specific Edge as long as it involves different source/destination pairs.
In this instance, heavy backup traffic between one source and destination IP pair is pinned to this Edge, causing ESXi to give this ESG VM more CPU cycles to handle that traffic. That is why high CPU utilization is seen at the ESXi/vCenter level while CPU utilization inside the ESG guest operating system remains normal. Overall, this is expected behavior.
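The pinning behavior can be illustrated with a simplified two-tuple hash. This is only a conceptual sketch, not the actual NSX hashing algorithm, and the IP addresses and Edge count are made-up example values:
# A flow keyed only by (source IP, destination IP) always lands on the same Edge,
# regardless of how much traffic it carries or which TCP port it uses.
SRC_IP="10.10.10.21"; DST_IP="10.20.30.40"; NUM_EDGES=4
HASH=$(printf '%s-%s' "$SRC_IP" "$DST_IP" | cksum | awk '{print $1}')
echo "Flow ${SRC_IP} -> ${DST_IP} is pinned to Edge index $((HASH % NUM_EDGES))"
Because the TCP port is not part of the key, a single heavy backup stream such as TCP/1556 between one source/destination pair cannot be spread across the other Edges.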
Affected Products
VxRail Appliance Family, VxRail Appliance Series
Article Properties
Article Number: 000202066
Article Type: Solution
Last Modified: 16 May 2023
Version: 3