VNX: Clients Disconnected from CIFS server during internal checkpoint refresh


Symptoms




Large Directories
Nasdirtool confirms that the impacted production file systems contain multiple directories with over 500,000 files in a single directory.

From nasdirtool output:
.....
/root_vdm_5/Applications/Appstorage/Images,95616,1458761 <=== 95MB in size and 1.4 million files
/root_vdm_6/Production/SubDirectory2/REP,150731,2104554 <=== 150MB in size and 2.1 million files
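
The check above can be automated once the nasdirtool output has been saved to a file. The following is a minimal Python sketch, not part of nasdirtool itself. It assumes each record has the form `<path>,<directory size in KB>,<file count>`, which is inferred only from the two sample lines above. The third sample line is hypothetical and is included only to show a directory that stays under the limit.

```python
# Assumed record format (inferred from the sample output, not documented):
#   <directory path>,<directory size in KB>,<file count>
MAX_FILES_PER_DIR = 500_000  # maximum tested value from the release notes


def find_large_directories(lines, limit=MAX_FILES_PER_DIR):
    """Return (path, size_kb, file_count) for directories above the limit."""
    flagged = []
    for line in lines:
        # Drop any trailing annotation such as "<=== 95MB in size ..."
        record = line.split("<===")[0].strip()
        # rsplit from the right, because the path itself may contain commas
        parts = record.rsplit(",", 2)
        if len(parts) != 3:
            continue
        path, size_kb, count = parts
        if not (size_kb.isdigit() and count.isdigit()):
            continue
        if int(count) > limit:
            flagged.append((path, int(size_kb), int(count)))
    return flagged


sample = [
    "/root_vdm_5/Applications/Appstorage/Images,95616,1458761",
    "/root_vdm_6/Production/SubDirectory2/REP,150731,2104554",
    "/root_vdm_5/Applications/Small,8,1200",  # hypothetical line
]

for path, size_kb, count in find_large_directories(sample):
    print(f"{path}: {count} files, directory size {size_kb} KB")
```

Lines that do not match the assumed three-field layout are skipped rather than treated as errors, so separator lines such as "....." are ignored.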

 

Some CIFS Clients are disconnected from the VNX CIFS Server during the update of the internal checkpoints used for replication on the source side array.

Other CIFS Clients and NFS Clients on other shares are operating normally.

High CPU utilization is frequently seen on the Data Mover. Depending on how large the directory contents are, Data Mover CPU utilization may reach 100%.


[nasadmin@VNX-CS0 tmp]$ server_stats server_2 -i 60
server_2   CPU    Network     Network       dVol        dVol
Timestamp  Util      In         Out         Read       Write
            %      KiB/s       KiB/s       KiB/s       KiB/s
10:41:25     99       16123       62578       61912       28048
10:42:25     98        4242       63170       62433        9793
10:43:25     99        2935       46987       48618        8918
10:44:25     99        7499       45901       46373       13019
10:45:25     99        4564       47836       48018        9625
10:46:25     98        3973       52316       52167        9035
10:47:25     98        9777       60167       55127       16238
10:48:25     97       18513       76583       70269       26258
10:49:25     98       11885       43789       43595       17238
10:50:25     99       17868       55491       52966       21029
10:51:25     99        8171       43491       43013       11961
10:52:25     99        8835       50947       50328       13369
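
If the server_stats output is captured to a file, sustained high CPU intervals can be picked out with a short script. This is a hedged Python sketch that assumes the six-column layout shown above (timestamp, CPU %, network in, network out, dVol read, dVol write). The threshold value is an arbitrary choice for illustration.

```python
def busy_samples(lines, threshold=95):
    """Return (timestamp, cpu_percent) for rows at or above the threshold."""
    busy = []
    for line in lines:
        fields = line.split()
        # Data rows have six columns and begin with an HH:MM:SS timestamp;
        # header rows fail one of these checks and are skipped.
        if len(fields) != 6 or fields[0].count(":") != 2:
            continue
        if not fields[1].isdigit():
            continue
        cpu = int(fields[1])
        if cpu >= threshold:
            busy.append((fields[0], cpu))
    return busy


sample = [
    "server_2   CPU    Network     Network       dVol        dVol",
    "Timestamp  Util      In         Out         Read       Write",
    "10:41:25     99       16123       62578       61912       28048",
    "10:48:25     97       18513       76583       70269       26258",
]

for timestamp, cpu in busy_samples(sample, threshold=98):
    print(f"{timestamp}: CPU {cpu}%")
```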


A network capture taken during the incident showed that TCP communication from client to server was working normally, but the CIFS server did not respond at the SMB protocol level to the specific client experiencing the issue, resulting in a client timeout.

Cause

The source-side file system used for replication contains directories that exceed 500,000 files in a single directory. As documented in the EMC VNX OE for File Release Notes, exceeding 500,000 files in a single directory results in performance problems.

The following events are logged in the Data Mover log during the issue:

2016-08-12 12:58:40: SMB: 6:[VDM2] Quota:getFsAndLock for Thread 1SMB415 aborted (client WINCLIENT01 disconnected)
2016-08-12 12:58:49: SMB: 6:[VDM2] Quota:getFsAndLock for Thread 1SMB034 aborted (client WINCLIENT02 disconnected) 

2016-08-12 13:09:29: SMB: 6:[VDM2] Quota:getFsAndLock for Thread 1SMB356 aborted (client WINCLIENT03 disconnected)
2016-08-12 13:09:29: SMB: 6:[VDM2] Quota:getFsAndLock for Thread 1SMB358 aborted (client WINCLIENT04 disconnected) 



The Data Mover log shows that the issue coincides with an internal replication checkpoint refresh.

Example of a normal, quick file system pause for a checkpoint refresh on this source-side array:
2016-08-19 12:33:39: 26042826752: SVFS: 6: pause() requested on fsid:1103
2016-08-19 12:33:39: 26042826752: SVFS: 6: pause done on fsid:1103
   
In this case, some operation is delaying the pause:
2016-08-19 12:42:36: 26042826752: SVFS: 6: pause() requested on fsid:1103
...
2016-08-19 12:45:17: 26041909248: SMB: 6:[VDM2] Quota:getFsAndLock for Thread 1SMB396 aborted (client WINCLIENT01 disconnected)
2016-08-19 12:45:26: 26041909248: SMB: 6:[VDM2] Quota:getFsAndLock for Thread 1SMB478 aborted (client WINCLIENT02 disconnected)
...
2016-08-19 13:00:47: 26041909248: SMB: 6:[VDM2] Quota:getFsAndLock for Thread 1SMB298 aborted (client WINCLIENT03 disconnected)
2016-08-19 13:00:52: 26042826752: SVFS: 6: pause done on fsid:1103
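
To compare pause times across a longer log, the "pause() requested" and "pause done" entries can be paired per fsid. The following Python sketch is illustrative only. It assumes the log line layout shown in the excerpts above, and the regular expression is written against those samples rather than against any documented log format.

```python
import re
from datetime import datetime

# Matches the SVFS pause entries shown in the log excerpts above.
PAUSE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}):.*SVFS:.*"
    r"pause(\(\) requested| done) on fsid:(\d+)"
)


def pause_durations(lines):
    """Return (fsid, requested_at, seconds) for each completed pause."""
    pending = {}  # fsid -> time the pause was requested
    results = []
    for line in lines:
        match = PAUSE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        fsid = int(match.group(3))
        if "requested" in match.group(2):
            pending[fsid] = stamp
        elif fsid in pending:
            start = pending.pop(fsid)
            results.append((fsid, start, (stamp - start).total_seconds()))
    return results


sample = [
    "2016-08-19 12:33:39: 26042826752: SVFS: 6: pause() requested on fsid:1103",
    "2016-08-19 12:33:39: 26042826752: SVFS: 6: pause done on fsid:1103",
    "2016-08-19 12:42:36: 26042826752: SVFS: 6: pause() requested on fsid:1103",
    "2016-08-19 13:00:52: 26042826752: SVFS: 6: pause done on fsid:1103",
]

for fsid, start, seconds in pause_durations(sample):
    print(f"fsid {fsid}: pause requested at {start} took {seconds:.0f} s")
```

With the entries from this article, the first pause completes within the same second and the second takes 1096 seconds (about 18 minutes), which matches the window in which the client disconnects are logged.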

The source-side internal checkpoint refresh pause shown above is abnormal: it took over 18 minutes to complete (12:42:36 to 13:00:52). A forced panic was performed to determine what was delaying the pause, and analysis of the panic dump file confirmed that the file system contains directories with millions of files in a single directory.


Resolution

A new subdirectory structure should be put in place on the production file system. The files in the problematic directories must be distributed across the new directories so that no single directory exceeds 500,000 files. The original problematic directories should then be deleted by the VNX Administrator.
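
One possible way to carry out the redistribution is sketched below in Python. This is an illustration under stated assumptions, not a Dell-provided tool: it assumes the script runs on a client that has the share mounted, the `part_NNNN` subdirectory naming is hypothetical, and any application that references the old paths would need to be updated separately. Files are assigned to subdirectories sequentially so that no subdirectory receives more than the chosen limit. The default is a dry run that moves nothing.

```python
import os
import shutil

MAX_FILES_PER_DIR = 500_000  # maximum tested value from the release notes


def plan_moves(names, max_per_dir):
    """Yield (file name, subdirectory name); each subdirectory gets at most max_per_dir files."""
    for index, name in enumerate(sorted(names)):
        yield name, f"part_{index // max_per_dir:04d}"


def redistribute(src_dir, dest_root, max_per_dir=MAX_FILES_PER_DIR, dry_run=True):
    """Move the regular files in src_dir into numbered subdirectories of dest_root.

    Returns the number of files planned (dry run) or moved.
    """
    # Collect the names first so the directory is not modified while it is scanned.
    with os.scandir(src_dir) as entries:
        names = [e.name for e in entries if e.is_file(follow_symlinks=False)]
    count = 0
    for name, bucket in plan_moves(names, max_per_dir):
        target_dir = os.path.join(dest_root, bucket)
        if not dry_run:
            os.makedirs(target_dir, exist_ok=True)
            shutil.move(os.path.join(src_dir, name), os.path.join(target_dir, name))
        count += 1
    return count


# Small demonstration of the plan only; no files are touched here.
for name, bucket in plan_moves(["c.jpg", "a.jpg", "b.jpg"], max_per_dir=2):
    print(f"{name} -> {bucket}")
```

A limit lower than 500,000 can be passed as `max_per_dir` to leave room for growth. Running with `dry_run=True` first and checking the returned count against the expected number of files is a reasonable precaution before moving production data.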

Additional Information

 
 
EMC VNX Operating Environment for File Version 7.1.79.8 Release Notes

Guideline/Specification        | Maximum tested value | Comment
Number of files per directory  | 500,000              | Exceeding this number will cause performance problems.

Affected Products

VNX1 Series

Products

VNX1 Series, VNX2 Series
Article Properties
Article Number: 000052074
Article Type: Solution
Last Modified: 06 Nov 2025
Version:  3