Data Domain : Unexpected DDFS (Data Domain File System) restart occurred with either alert EVT-FILESYS-00008 / FILESYS-00008 or EVT-FILESYS-00010 / FILESYS-00010 or EVT-FILESYS-00011 / FILESYS-00011

Summary: This knowledge base article explains what happens when an unexpected filesystem restart occurs, the potential alerts encountered and what information to capture for triage purposes.

This article is not tied to any specific product. Not all product versions are identified in this article.

Symptoms

The DDFS process is the main process responsible for the operation of the DDOS (Data Domain Operating System) de-duplication filesystem.

If this process encounters a problem, an alert is created, which will be one of the following:
  • EVT-FILESYS-00008 / FILESYS-00008
  • EVT-FILESYS-00010 / FILESYS-00010
  • EVT-FILESYS-00011 / FILESYS-00011
The above alerts indicate the problem encountered was unexpected and further information is required to ascertain the cause.

The alert is sent via the alerting mechanism configured on the Data Domain system, for example email, SNMP, or connectemc. The alert also appears in the 'alerts show history' output.

Cause

The DDFS process running in DDOS is a complex piece of software and, like any software, it may have defects that cause it to fail unexpectedly. Hardware-related issues may also induce conditions in the DDFS process that cannot be handled safely.

Whether the cause is software or hardware, DDFS may end up restarting in a few different ways:
  • A direct PANIC: the process runs a piece of code that results in a handled or unhandled error, for example an explicit code bug or an unexpected condition being met.
  • An internal timeout is encountered. DDFS has an internal heartbeat monitor thread (called hmon) which monitors the health of the various subsystems within the DDFS process. If hmon ascertains that a subsystem has either hung or been waiting too long, it terminates the DDFS process to try to recover from a possible deadlock (a situation in which two work items depend on each other and will never complete).
  • An external timeout is encountered. A process called ddr_stated is responsible for externally monitoring the DDFS process by a heartbeat mechanism. If DDFS does not send a heartbeat to ddr_stated within a certain duration, ddr_stated assumes DDFS has hung and terminates the DDFS process.
  • The process requests more memory than it is allowed (although DDFS is allowed to grab 90% or more of the installed RAM in a system)
  • An internal sanity check failed
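
The two timeout mechanisms in the list above share the same basic idea: a monitor records when each party last reported progress and treats prolonged silence as a hang. The following is a minimal Python sketch of that idea only. It is not the hmon or ddr_stated implementation; the class name, the subsystem names and the 30-second timeout are all hypothetical values chosen for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per subsystem and reports the ones that went quiet."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be exercised without waiting
        self.last_beat = {}

    def beat(self, subsystem):
        """Record that `subsystem` is still making progress."""
        self.last_beat[subsystem] = self.clock()

    def stalled(self):
        """Return the subsystems whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            name
            for name, seen in self.last_beat.items()
            if now - seen > self.timeout_seconds
        )


# Usage with a fake clock; subsystem names and the timeout are made up.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=30, clock=lambda: fake_now[0])
monitor.beat("subsystem_a")
monitor.beat("subsystem_b")
fake_now[0] = 20.0
monitor.beat("subsystem_a")  # subsystem_a reports again, subsystem_b stays silent
fake_now[0] = 45.0
print(monitor.stalled())
```

In a real monitor, a non-empty result is the point at which the watched process would be terminated so that it can restart, which is the behaviour described for both the internal and the external timeout.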

When any of these conditions is encountered, the filesystem attempts to restart automatically to resume normal operation.

During the DDFS restart, any operations that were ongoing, such as restores/backups, i.e. reads/writes, will be interrupted and need to be restarted. Most backup applications can recognize that the reads/writes were interrupted and restart these operations automatically.

When an unexpected DDFS restart occurs, the following things happen:
  • The process is halted.
  • The memory footprint that the process was using is written as a 'core file' to a core dump device, which is a special area on one of the head unit disks. A core file contains the information needed to debug why the unexpected restart occurred.
  • Once the above step completes, the DDFS process can restart.
  • In parallel, i.e. once DDFS is restarting, the core file needs to be extracted from the core dump device to a DDOS filesystem so that it can be accessed. The process that accomplishes this task is called 'savecore'.
  • Savecore creates an initial temporary directory in /ddvar/core. The directory is named 'app-<date and time the core file occurred>'.
  • As DDFS uses the majority of the memory on the system, the memory footprint for DDFS can be large. To keep the core file as small as possible, savecore reads from the core dump device, passes the data through gzip, and starts writing to a file called 'core-incomplete.gz'.
  • Once this process completes, the core file is renamed and placed in /ddvar/core, and the temporary directory is removed. The name of a core file is made up of the following, in order:
    • The process name.
    • The string "core".
    • The process ID.
    • The date/time the core was generated, in UNIX epoch format.
    • For example, a core file for DDFS could be called 'ddfs.core.14226.1469256407.gz'.
Because the memory footprint is large, creating a core file is not immediate and can take a number of minutes to complete. Finalizing the compressed version of the core file can take hours on lower-end DDs or on those with very large amounts of memory.
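
To read the generation time out of a core file name, the name can be split into its fields and the epoch value converted to a calendar date. The Python sketch below assumes the field order seen in the example name 'ddfs.core.14226.1469256407.gz' (process name, "core", process ID, epoch, then the .gz suffix). The function name and the regular expression are illustrative and are not part of DDOS.

```python
import re
from datetime import datetime, timezone

# Field order inferred from the example name 'ddfs.core.14226.1469256407.gz'.
CORE_NAME = re.compile(r"^(?P<process>.+)\.core\.(?P<pid>\d+)\.(?P<epoch>\d+)\.gz$")


def parse_core_name(filename):
    """Split a core file name into process, PID and UTC timestamp; None if no match."""
    match = CORE_NAME.match(filename)
    if match is None:
        return None
    return {
        "process": match.group("process"),
        "pid": int(match.group("pid")),
        "generated_utc": datetime.fromtimestamp(
            int(match.group("epoch")), tz=timezone.utc
        ),
    }


info = parse_core_name("ddfs.core.14226.1469256407.gz")
print(info["process"], info["pid"], info["generated_utc"].isoformat())
```

The converted timestamp can then be compared with the time of the alert to confirm that a given core file belongs to the restart being investigated.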

Resolution

As mentioned above, the creation of the core file is not immediate. The /ddvar/core directory can be checked periodically via an NFS or CIFS share to ascertain when the core file creation has completed.
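
The periodic check can be scripted from a host that has the share mounted. The Python sketch below is one possible approach, under the following assumptions: /ddvar/core is reachable through a local mount point, a directory whose name starts with 'app-' means savecore is still working, and a finished core file matches '<process>.core.*.gz'. The function name and the 'app-example' directory are hypothetical, and the demonstration runs against a throwaway local directory rather than a real Data Domain system.

```python
import tempfile
from pathlib import Path


def core_dump_status(core_dir, process="ddfs"):
    """Report finished core files and in-progress savecore directories under core_dir."""
    core_dir = Path(core_dir)
    finished = sorted(
        p.name for p in core_dir.glob(f"{process}.core.*.gz") if p.is_file()
    )
    in_progress = sorted(p.name for p in core_dir.glob("app-*") if p.is_dir())
    return {"finished": finished, "in_progress": in_progress}


# Demonstration against a temporary directory that mimics the described layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "app-example").mkdir()
    (root / "app-example" / "core-incomplete.gz").touch()
    before = core_dump_status(root)

    # Simulate savecore finishing: temporary directory gone, final core file present.
    (root / "app-example" / "core-incomplete.gz").unlink()
    (root / "app-example").rmdir()
    (root / "ddfs.core.14226.1469256407.gz").touch()
    after = core_dump_status(root)

print(before)
print(after)
```

Against a real mount, the function would be called at an interval of a few minutes until 'in_progress' is empty and 'finished' lists the expected core file.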

Once the core file creation has been completed, two items of information are required in order to triage what caused the unexpected restart. These are:
  1. A new support bundle. Please refer to the following article on how to capture and upload a support bundle: https://support.emc.com/kb/323283
  2. The core file generated when the problem occurred. Please refer to the following knowledge base article on the various methods that can be used to upload and access a core file: https://support.emc.com/kb/457974
Please upload the above items to the support case.

In some situations, determining the cause of the FS restart may be easier: the PANIC string, which for the majority of unexpected FS restarts is printed to the logs (including those in an alert ASUP), may be an easy match for earlier, well-known code defects or situations. For example:
# log view debug/ddfs.info
08/18 07:38:30.576 (tid 0xa4444b0): ERROR: MSG-INTRNL-00001: PANIC: ddr/segstore/ss_nvram.c: ssnv_cp_append: 1887: Failed in cp_append_container: Err = [No more blocks to allocate in cset]

This one, for example, points to an issue related to the NVRAM being unable to flush further data to disk, as there are no more free blocks in the container set (the collection partition is full). In this case, there would be no need for a support bundle (SUB), still less for a ddfs.core file, to initially determine the problem and propose a solution.
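
When several log files have to be searched for such strings, the PANIC line can be broken into its parts with a regular expression. The Python sketch below is modelled only on the single sample line quoted above, so the assumed layout (source file, function, line number, message) may not hold for every PANIC message. The function and field names are illustrative.

```python
import re

# Pattern modelled only on the one sample line quoted in this article.
PANIC_LINE = re.compile(
    r"PANIC:\s*(?P<source>\S+):\s*(?P<function>\w+):\s*(?P<line>\d+):\s*(?P<message>.*)$"
)


def extract_panic(log_line):
    """Return the parts of a PANIC log line as a dict, or None if the line has none."""
    match = PANIC_LINE.search(log_line)
    if match is None:
        return None
    return {
        "source": match.group("source"),
        "function": match.group("function"),
        "line": int(match.group("line")),
        "message": match.group("message").strip(),
    }


sample = (
    "08/18 07:38:30.576 (tid 0xa4444b0): ERROR: MSG-INTRNL-00001: PANIC: "
    "ddr/segstore/ss_nvram.c: ssnv_cp_append: 1887: Failed in cp_append_container: "
    "Err = [No more blocks to allocate in cset]"
)
panic = extract_panic(sample)
print(panic["source"], panic["function"], panic["line"])
print(panic["message"])
```

The source file, function name and message together form a compact signature that can be compared against previously seen PANIC strings.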

Affected Products

Data Domain

Products

Data Domain, DD OS, Data Domain Virtual Edition
Article Properties
Article Number: 000064290
Article Type: Solution
Last Modified: 19 Sept 2022
Version: 4