Data Domain: FS process PANIC in the inode cache when running out of memory in cache element pool

Summary: A defect in some recent DDOS versions (confirmed in 7.7.4, 7.9.0.10, and 7.10.0; not confirmed whether DDOS 7.7.3 is also affected) can cause an FS process PANIC in the inode cache code when, depending on the workload, a cache element pool runs out of memory for further allocations. ...


Symptoms

This issue causes no degradation and gives no advance warning. It manifests as an FS process failure (PANIC), after which the process restarts automatically and comes back up normally.
Depending on the code path being exercised, the FS process may PANIC with any of several signatures, including the following:
PANIC: ddr/sm/ddfs/ddfs_mtree.c: ddfs_mtree_list: 829: !((dd_errno(e) == ENOENT) || (dd_errno(e) == DD_ERR_FM_EATTRNOENT) || (dd_errno(e) == DD_ERR_STALE))
PANIC: ddr/fv/file_verify.c: file_verify_update_marker_attrs: 4872: Fatal Error
PANIC: ddr/fv/file_verify.c: file_verify_update_snap_attr: 4446: Fatal Error
PANIC: ddr/fv/file_verify.c: file_verify_update_marker_attrs: 4860: Fatal Error
In the FS process log file (ddfs.info), the following messages appear before each process crash:
01/17 20:21:59.292947 [7fbbf4f98f50] dd_cache_elem_reclaim: Evict count=256, Visited count=257, Skipped elem count=0, Skipped bucket count=0, Time threshold=1539816333626910. (99% full) Complete=True
01/17 20:22:04.662303 [7fbb031ad4f0] ERROR: FM fm_iget:355 - fm_iget failed to allocate elem in dd_cache 5001

These messages indicate that the cache element pool was 99% full and then could not allocate any further elements, which led to the process crash.
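
As a rough aid for spotting this signature, the two log messages above could be searched for with a short script. This is only a sketch: the regular expressions are derived from the two sample lines shown in this article, the function name `scan_ddfs_log` and the `full_threshold` parameter are hypothetical, and the exact message wording may differ between DDOS releases.

```python
import re

# Patterns derived from the two sample ddfs.info lines in this article.
# They are assumptions based on those samples, not an official format spec.
RECLAIM_RE = re.compile(r"dd_cache_elem_reclaim: .*\((\d+)% full\)")
ALLOC_FAIL_RE = re.compile(r"fm_iget failed to allocate elem in dd_cache")


def scan_ddfs_log(lines, full_threshold=99):
    """Return (reclaim_hits, alloc_failures) as two lists of matching lines.

    reclaim_hits:   reclaim messages reporting the pool at or above
                    full_threshold percent full.
    alloc_failures: fm_iget allocation failure messages.
    """
    reclaim_hits = []
    alloc_failures = []
    for line in lines:
        match = RECLAIM_RE.search(line)
        if match and int(match.group(1)) >= full_threshold:
            reclaim_hits.append(line.rstrip("\n"))
        elif ALLOC_FAIL_RE.search(line):
            alloc_failures.append(line.rstrip("\n"))
    return reclaim_hits, alloc_failures
```

The function accepts any iterable of lines, so it can be fed an open file handle for a copy of ddfs.info. An allocation failure logged shortly after a reclaim message at 99% full is consistent with the symptom described here, but on its own it is not proof that a PANIC followed.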

NOTE: This issue is only known to affect the following versions:
  • DDOS 7.7.3.x : Not fully confirmed
  • DDOS 7.7.4.x
  • DDOS 7.9.0.10
  • DDOS 7.10.0.x

Cause

For any file operation, such as a read or write, an inode structure is allocated from the dd_cache element pool.
If this cache is full when a new request comes in, an element is evicted from the cache so that the new request can be fulfilled.
Eviction is based on a time policy: an element is evicted only if it has not been accessed in the last 'x' seconds.
If the cache becomes too hot (all elements have been accessed within the last 'x' seconds) and no element can be evicted even after multiple retries, fm_iget returns DD_ERR_NOMEM.
Some callers of this element pool allocation cannot handle the error gracefully, so the FS process PANICs and dumps core whenever the function "fm_iget" returns an error. This is why several different PANIC signatures correspond to the same underlying code defect.
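
The eviction behavior described above can be illustrated with a small model. This is a simplified sketch and not the DDOS implementation: the class `TimedElemCache`, the exception `CacheFullError` (a stand-in for DD_ERR_NOMEM), and the parameters `capacity`, `idle_threshold`, and `max_retries` are all hypothetical names chosen for illustration.

```python
class CacheFullError(Exception):
    """Hypothetical stand-in for fm_iget returning DD_ERR_NOMEM."""


class TimedElemCache:
    """Toy model of a fixed-size element pool with time-based eviction."""

    def __init__(self, capacity, idle_threshold, max_retries=3):
        self.capacity = capacity              # maximum number of elements
        self.idle_threshold = idle_threshold  # the 'x' seconds in the text
        self.max_retries = max_retries
        self.last_access = {}                 # key -> time of last access

    def _evict_idle(self, now):
        """Evict one element idle for at least idle_threshold, if any."""
        for key, ts in list(self.last_access.items()):
            if now - ts >= self.idle_threshold:
                del self.last_access[key]
                return True
        return False

    def get(self, key, now):
        """Return the element for key, allocating it if needed."""
        if key in self.last_access:
            self.last_access[key] = now       # cache hit refreshes the element
            return key
        if len(self.last_access) >= self.capacity:
            # Pool is full: retry eviction a bounded number of times.
            # In this toy model time does not advance between retries.
            for _ in range(self.max_retries):
                if self._evict_idle(now):
                    break
            else:
                # Every element is "hot", so nothing could be evicted.
                raise CacheFullError("no evictable element in pool")
        self.last_access[key] = now
        return key
```

In this model, a caller that does not catch `CacheFullError` terminates with an unhandled exception, which loosely mirrors the callers that PANIC when fm_iget returns an error. A caller that handles the error, for example by failing only the single file operation, would leave the process running.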

Resolution

The underlying code issue that causes these FS process crashes is fixed under DDOS-168410 in the following versions (and all later ones in the same code branches):
  • DDOS 7.7.5.1
  • DDOS 7.10.1.0
  • DDOS 7.11.0
Customers impacted by this problem who cannot immediately upgrade to one of the releases above can try a workaround, for which they need to contact Dell Support.
If you are running an affected version (one of those listed above) but have not yet experienced an unexpected FS process crash matching the symptoms in this KB, we recommend that you do not apply the workaround proactively. Instead, upgrade to one of the fixed releases above (or any of their successors) to benefit from the latest updates and code fixes.

Affected Products

Data Domain
Article Properties
Article Number: 000207919
Article Type: Solution
Last Modified: 21 Dec 2023
Version: 17