VNX: kernel.cpu.utilization.cpuutil is > 90 for 15 minutes

Oversigt: kernel.cpu.utilization.cpuutil is > 90 for 15 minutes.

Denne artikel gælder for Denne artikel gælder ikke for Denne artikel er ikke knyttet til et bestemt produkt. Det er ikke alle produktversioner, der er identificeret i denne artikel.

Symptomer

Degraded CIFS, NFS performance, and excessively high CPU utilization on the date movers for prolonged periods of time

The /nas/log/sys_log files are flooded with the following alerts:

Oct 20 10:19:21 2016:CS_PLATFORM:PERFSTATS:NOTICE:3:::::nas_alerterd: Clearing event for policy: default:server_2:kernel.cpu.utilization.cpuutil is > 90 for 15 minutes due to value = 88
Oct 20 10:20:26 2016:CS_PLATFORM:PERFSTATS:NOTICE:3:::::nas_alerterd: Clearing event for policy: default:server_3:kernel.cpu.utilization.cpuutil is > 90 for 15 minutes due to value = 89
Oct 20 10:35:21 2016:CS_PLATFORM:PERFSTATS:WARNING:2:::::nas_alerterd: Raising event for policy: default:server_2:kernel.cpu.utilization.cpuutil is > 90 for 15 minutes. The last sample value was 94
Oct 20 10:48:26 2016:CS_PLATFORM:PERFSTATS:WARNING:2:::::nas_alerterd: Raising event for policy: default:server_3:kernel.cpu.utilization.cpuutil is > 90 for 15 minutes. The last sample value was 95
Oct 20 11:01:26 2016:CS_PLATFORM:PERFSTATS:NOTICE:3:::::nas_alerterd: Clearing event for policy: default:server_3:kernel.cpu.utilization.cpuutil is > 90 for 15 minutes due to value = 87
Oct 20 11:17:21 2016:CS_PLATFORM:PERFSTATS:NOTICE:3:::::nas_alerterd: Clearing event for policy: default:server_2:kernel.cpu.utilization.cpuutil is > 90 for 15 minutes due to value = 89
Oct 20 11:21:26 2016:CS_PLATFORM:PERFSTATS:WARNING:2:::::nas_alerterd: Raising event for policy: default:server_3:kernel.cpu.utilization.cpuutil is > 90 for 15 minutes. The last sample value was 96
[nasadmin@storage ~]$ server_sysstat server_2
server_2 :
threads runnable = 216
threads blocked = 6859
threads I/J/Z = 1
memory free(kB) = 6925375
cpu idle_% = 2 < ------- 98% utilized

[nasadmin@storage ~]$ server_sysstat server_3
server_3 :
threads runnable = 61
threads blocked = 6940
threads I/J/Z = 1
memory free(kB) = 6683987
cpu idle_% = 1 < ------- 99% utilized

 

Årsag

An overall system configuration and capacity analysis were completed to determine what types of demands (number of replication, deduplication, checkpoint schedules and so on) are being placed on each data mover. It was determined that a VMware application called Mirage was being used for image management. This software was backing up thousands of work stations in the customers environment to the CIFS, NFS shares by creating multiple small CVD files.

In this example, 200 VMware Mirage sessions were configured taking hourly snapshots of 4,000 machines. One machine requires one CVD file which in turn needs 1.5 IOPS for the snapshot to complete. When the backups were running, performance was slow and huge latency in the region of 626.78 KB per ms was experienced on the Mirage side.

A data mover profile of server_2 was captured when CPU utilization was excessively high and the VMware Mirage application was running. The profile was configured to run for 60 seconds:
Example:

[nasadmin@storage ~]$ /nas/tools/profile_slot -slot 2 -method function -seconds 60 -output /root_vdm_3/FS_Backup_01/profile_slot2.out
Starting profile on slot 2 with the following params...
Slot = 2
Method = function
Seconds = 60
Ignorebounds = no
Frequency = 256
Outfile = /root_vdm_3/FS_Backup_01/profile_slot2.out
Profile started. Waiting for 60 seconds...
Profile stopped.
Profile output has been written to /root_vdm_3/FS_Backup_01/profile_slot2.out on server in slot 2.

An analysis of the data mover profile was completed, and it was found that the primary bottleneck consuming the majority of CPU was an SMB security encryption process called "AES_encrypt." This SMB process is used to provide end-to-end encryption of SMB data and protects data from eavesdropping occurrences on untrusted networks.

[nasadmin@storage ~]$ more profile_slot2.out | grep -i aes
0.0% (          1 ) EVP_aes_128_cbc
0.0% (          2 ) aes_init_key
1.0% (        631 ) aes_cbc_cipher
0.0% (         16 ) private_AES_set_encrypt_key
44.8% (      27454 ) AES_encrypt < ---------------------
0.9% (        581 ) AES_cbc_encrypt
0.0% (          1 ) EVP_aes_128_cbc
0.8% (        123 ) aes_cbc_cipher
0.0% (          1 ) private_AES_set_encrypt_key
37.0% (       5676 ) AES_encrypt < ---------------------
0.8% (        128 ) AES_cbc_encrypt
0.0% (          1 ) aes_init_key
0.9% (        140 ) aes_cbc_cipher
0.0% (          3 ) private_AES_set_encrypt_key
47.1% (       7219 ) AES_encrypt < ---------------------
0.9% (        146 ) AES_cbc_encrypt
0.0% (          1 ) aes_init_key
1.3% (        204 ) aes_cbc_cipher
0.0% (          7 ) private_AES_set_encrypt_key
48.2% (       7388 ) AES_encrypt < ---------------------
0.9% (        151 ) AES_cbc_encrypt
1.0% (        164 ) aes_cbc_cipher
0.0% (          5 ) private_AES_set_encrypt_key
46.8% (       7171 ) AES_encrypt < ---------------------
1.0% (        156 ) AES_cbc_encrypt

 

Løsning

Two options are available:

Option 1:
Make no changes, allow the max protocol for SMB communication to remain at SMB3, endure the high CPU and poor performance.

Option 2:
Implement a workaround to drop the max protocol for SMB communication from SMB3 to SMB2. The main difference between SMB3 and SMB2 is "AES_Encrypt." By dropping the max protocol to SMB2, this drops the encryption process, the CPU utilization drops, and performance should improve.

To enable the SMB2 protocol on the data mover:

  1. Log in to the Primary Control Station as 'root' user in putty/SSH

  2. Request permission from the customer to temporarily stop the CIFS service on the data mover. This causes a small outage to CIFS access while the CIFS service is stopped so it must be scheduled accordingly with the customer.

    server_setup server_x -P cifs -o stop
  3. Change the max protocol for SMB communication from SMB3 to SMB2:

    server_cifs server_x -add security=NT,dialect=SMB2
  4. Restart the CIFS service:

    server_setup server_x -P cifs -o start
  5. Ensure that the CIFS service has restarted successfully and the max protocol is set to SMB2:

    server_cifs server_x

    Example:

    [root@Bstorage]# server_cifs server_2
    server_2 :
    384 Cifs threads started
    Security mode = NT
    Max protocol = SMB2.1 < -----

 

Berørte produkter

VNX2 Series

Produkter

VNX2 Series
Artikelegenskaber
Artikelnummer: 000056854
Artikeltype: Solution
Senest ændret: 20 okt. 2025
Version:  5
Find svar på dine spørgsmål fra andre Dell-brugere
Supportservices
Kontrollér, om din enhed er dækket af supportservices.