...
The DDFS process is the main process responsible for the operation of the DDOS (Data Domain Operating System) de-duplication filesystem. If this process encounters a problem, an alert is created, which will be one of the following:

- EVT-FILESYS-00008 / FILESYS-00008
- EVT-FILESYS-00010 / FILESYS-00010
- EVT-FILESYS-00011 / FILESYS-00011

These alerts indicate that the problem encountered was unexpected and that further information is required to ascertain the cause. The alert is sent via the alerting mechanism configured on the Data Domain system (for example email, SNMP, or connectemc) and also appears in the 'alerts show history' output.
The DDFS process running in DDOS is a complex piece of software and, like any software, it may have defects that cause it to fail unexpectedly. Hardware-related issues may also induce conditions in the DDFS process that cannot be handled safely. Whether the cause is software or hardware, DDFS may end up restarting in a few different ways:

- A direct PANIC. The process runs a piece of code that results in a handled or unhandled error, for example an explicit code bug or an unexpected condition being met.
- An internal timeout is encountered. DDFS has an internal heartbeat monitor thread (called hmon) which monitors the health of the various subsystems within the DDFS process. If hmon ascertains that a subsystem has hung or has been waiting too long, it terminates the DDFS process to try to recover from a possible deadlock (a situation in which two work items depend on each other and will never complete).
- An external timeout is encountered. A process called ddr_stated is responsible for externally monitoring the DDFS process through a heartbeat mechanism. If DDFS does not send a heartbeat to ddr_stated within a certain duration, ddr_stated assumes DDFS has hung and terminates the DDFS process.
- The process requests more memory than it is allowed (although DDFS is allowed to use 90% or more of the installed RAM in a system).
- An internal sanity check fails.

When any of these conditions is encountered, the filesystem attempts to restart automatically and resume normal operation. During the DDFS restart, any operations that were ongoing, such as restores and backups (that is, reads and writes), are interrupted and need to be restarted.
Most backup applications can recognize that the reads/writes were interrupted and restart these operations automatically.

When an unexpected DDFS restart occurs, the following things happen:

- The process is halted.
- The memory footprint that the process was using is written as a 'core file' to a core dump device, which is a special area on one of the head unit disks. A core file contains the information necessary to debug why the unexpected restart occurred.
- Once the above step completes, the DDFS process can restart.
- In parallel, while DDFS is restarting, the core file needs to be extracted from the core dump device to a DDOS filesystem so that it can be accessed. The process that accomplishes this task is called 'savecore'.
- Savecore creates an initial temporary directory in /ddvar/core. The directory name will be 'app-'.
- As DDFS uses the majority of the memory on the system, the memory footprint for DDFS can be large. To keep the core file as small as possible, savecore reads from the core dump device, passes the data through gzip, and writes the result to a file called 'core-incomplete.gz'.
- Once this process completes, the core file is placed in /ddvar/core and renamed, and the temporary directory is removed.

The name of a core file is made up of the following components:

- The process name.
- The string "core".
- The process ID.
- The date/time the core was generated, in UNIX epoch format.

So, for example, a core file for DDFS could be called 'ddfs.core.14226.1469256407.gz'.

Because the memory footprint is large, creating a core file is not immediate and can take a number of minutes to complete. Finalizing the compressed version of the core file can take hours on lower-end DDs or on those with very large amounts of memory.
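
As an illustration of the naming convention, the following Python sketch splits a core file name into its components and converts the epoch timestamp to a readable UTC time. It is not part of DDOS: the function name `parse_core_file_name` and the regular expression are assumptions inferred only from the single example name 'ddfs.core.14226.1469256407.gz', and other core files may not follow exactly this layout.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple


class CoreFileName(NamedTuple):
    process: str       # name of the process that dumped core, e.g. "ddfs"
    pid: int           # process ID at the time of the dump
    created: datetime  # time the core was generated, in UTC


# Assumed layout, inferred from the example in the text:
#   <process>.core.<pid>.<epoch seconds>.gz
_CORE_NAME = re.compile(
    r"^(?P<process>[^.]+)\.core\.(?P<pid>\d+)\.(?P<epoch>\d+)\.gz$"
)


def parse_core_file_name(name: str) -> CoreFileName:
    """Split a finished core file name into process, PID and creation time."""
    match = _CORE_NAME.match(name)
    if match is None:
        # Also rejects in-progress files such as 'core-incomplete.gz'.
        raise ValueError(f"not a recognised core file name: {name!r}")
    created = datetime.fromtimestamp(int(match["epoch"]), tz=timezone.utc)
    return CoreFileName(match["process"], int(match["pid"]), created)


if __name__ == "__main__":
    info = parse_core_file_name("ddfs.core.14226.1469256407.gz")
    print(info.process, info.pid, info.created.isoformat())
```

Because the pattern only matches the final, renamed form, a name that fails to parse can be treated as a hint that savecore has not yet finished writing the core file.
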
As mentioned above, the creation of the core file is not immediate. The /ddvar/core directory can be checked periodically via an NFS or CIFS share to ascertain when the core file creation has completed. Once it has completed, two items of information are required to triage what caused the unexpected restart:

- A new support bundle. Please refer to the following article on how to capture and upload a support bundle: https://support.emc.com/kb/323283
- The core file generated when the problem occurred. Please refer to the following knowledge base article on the various methods that can be used to upload and access a core file: https://support.emc.com/kb/457974

Please upload the above items to the support case.

In some situations, determining the cause of the FS restart may be easier, because the PANIC string, which for the majority of unexpected FS restarts is printed to the logs (including those in an alert ASUP), may be an easy match for earlier, well-known code defects or situations. For example:

    # log view debug/ddfs.info
    08/18 07:38:30.576 (tid 0xa4444b0): ERROR: MSG-INTRNL-00001: PANIC: ddr/segstore/ss_nvram.c: ssnv_cp_append: 1887: Failed in cp_append_container: Err = [No more blocks to allocate in cset]

This example points to an issue in which the NVRAM is unable to flush further data to disk because there are no more free blocks in the container set (the collection partition is full). In this case, there would be no need for a support bundle (SUB), and even less for a ddfs.core file, to initially determine the problem and propose a solution.
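
When a copy of ddfs.info is available off-box (for example from a support bundle), the PANIC string can be picked out with a short script. The Python sketch below is a minimal illustration, not a Data Domain tool: the field layout in the regular expression is an assumption based only on the single sample line above, and `KNOWN_SIGNATURES` is a hypothetical lookup table holding just the one case described in this article.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional


class Panic(NamedTuple):
    source_file: str            # e.g. "ddr/segstore/ss_nvram.c"
    function: str               # e.g. "ssnv_cp_append"
    line: int                   # source line number reported in the PANIC
    message: str                # remainder of the PANIC string
    known_cause: Optional[str]  # set when the message matches a known signature


# Assumed layout: "PANIC: <file>: <function>: <line>: <message>"
_PANIC = re.compile(
    r"PANIC: (?P<file>\S+): (?P<func>\w+): (?P<line>\d+): (?P<msg>.*)$"
)

# Hypothetical table; only the case discussed in the text is listed.
KNOWN_SIGNATURES = {
    "No more blocks to allocate in cset":
        "no free blocks left in the container set (collection partition is full)",
}


def find_panics(log_lines: Iterable[str]) -> Iterator[Panic]:
    """Yield one Panic record for each log line that contains a PANIC string."""
    for raw in log_lines:
        match = _PANIC.search(raw.rstrip("\n"))
        if match is None:
            continue
        message = match["msg"]
        cause = next(
            (c for sig, c in KNOWN_SIGNATURES.items() if sig in message), None
        )
        yield Panic(match["file"], match["func"], int(match["line"]), message, cause)


if __name__ == "__main__":
    sample = (
        "08/18 07:38:30.576 (tid 0xa4444b0): ERROR: MSG-INTRNL-00001: PANIC: "
        "ddr/segstore/ss_nvram.c: ssnv_cp_append: 1887: Failed in "
        "cp_append_container: Err = [No more blocks to allocate in cset]"
    )
    for panic in find_panics([sample]):
        print(panic.function, panic.line, panic.known_cause)
```

A PANIC whose `known_cause` is `None` is one for which the support bundle and core file are still needed for triage.
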