PowerScale: Nodes panic with Assertion Failure kernel:pdm_entries_get_cacheable_entry+0x48e
Summary: Multiple PowerScale nodes panic with the stack signature Assertion Failure kernel:pdm_entries_get_cacheable_entry+0x48e, kernel:pdm_member_generate_entry+0xda.
Symptoms
Multiple PowerScale nodes panic with stack:
ipfw2 (+ipv6) initialized, divert loadable, nat loadable, default to accept, logging disabled
mce2: Interface stopped DISTRIBUTING, possible flapping
panic @ time 1642530761.327, thread 0xfffffea7a8a76580: Assertion Failure
time = 1642530761
cpuid = 24, TSC = 0xe457b789d574ac
Panic occurred in module kernel loaded at 0xffffffff80200000:
Stack: --------------------------------------------------
kernel:isi_assert_halt+0x2e
kernel:pdm_entries_get_cacheable_entry+0x48e
kernel:pdm_entries_get+0x95
kernel:pdm_entries_get_global+0x33
kernel:pdm_get_painted_domids+0x2e2
kernel:pdm_member_generate_entry+0xda
kernel:pdm_member_get_membership_entry+0x1b7
kernel:pdm_member_init_operation+0x16d
kernel:ifm_init_operation+0xb7
kernel:txn_i_include_vnode_to_list+0xc0
kernel:pdm_vget_adsio_txn_include+0x12c
kernel:pdm_domain_paint_adsdir+0x41
kernel:pdm_unlink+0x1f0
kernel:bam_rename+0x2bcb
kernel:ifs_vnop_wraprename+0x96
kernel:VOP_RENAME_APV+0x9b
isi_lwext.ko:lwextsvc_rename+0xe35
kernel:amd64_syscall+0x380
--------------------------------------------------
*** FAILED ASSERTION !pdm_lk_is_held(domid, PDM_EXCLUSIVE) @ /b/mnt/src/sys/ifs/pdm/pdm_ops.c:443:
Disabling swatchdog
Dumping stacks (40960 bytes)
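To check whether a panic you have already collected matches this signature, a simple text match on the failed assertion and the two frames named in the summary is usually enough. The following is a minimal sketch, assuming you already have the panic output as a string (how you retrieve it from the node is outside the scope of this sketch). The function name matches_pscale_63084 is hypothetical and is not part of any OneFS tool.

```python
# Hypothetical helper: matches the panic signature described in this article.
# It performs plain substring matching, so it works whether the stack frames
# are separated by spaces (as in a flattened log) or by newlines.

ASSERTION = "!pdm_lk_is_held(domid, PDM_EXCLUSIVE) @ /b/mnt/src/sys/ifs/pdm/pdm_ops.c:443"

SIGNATURE_FRAMES = (
    "kernel:pdm_entries_get_cacheable_entry+0x48e",
    "kernel:pdm_member_generate_entry+0xda",
)


def matches_pscale_63084(panic_text: str) -> bool:
    """Return True if panic_text contains the failed assertion and both signature frames."""
    if "FAILED ASSERTION" not in panic_text or ASSERTION not in panic_text:
        return False
    return all(frame in panic_text for frame in SIGNATURE_FRAMES)


if __name__ == "__main__":
    sample = (
        "kernel:isi_assert_halt+0x2e kernel:pdm_entries_get_cacheable_entry+0x48e "
        "kernel:pdm_member_generate_entry+0xda "
        "*** FAILED ASSERTION !pdm_lk_is_held(domid, PDM_EXCLUSIVE) "
        "@ /b/mnt/src/sys/ifs/pdm/pdm_ops.c:443:"
    )
    print(matches_pscale_63084(sample))
```

A match only indicates that the panic looks like this defect; PowerScale Support should still confirm the diagnosis.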
Cause
This issue is caused by defect PSCALE-63084: FAILED ASSERTION !pdm_lk_is_held(domid, PDM_EXCLUSIVE) @ /b/mnt/src/sys/ifs/pdm/pdm_ops.c:443 during ADS rename.
The issue occurs in workflows that use ADS (alternate data streams). Alternate data streams are a Windows (NTFS) feature that lets a file carry additional named streams of data, typically used to store metadata such as comments about the file. FreeBSD, the operating system on which OneFS is based, does not support ADS natively, but OneFS does, and it treats each stream as a file.
The race condition occurs when OneFS renames an ADS file across a domain boundary (into or out of a snapshot) to overwrite a hard-linked ADS file. The assertion fails and the node panics, so it becomes unresponsive while the operation is in progress. As a result, data unavailability (DU) may occur.
How can you determine whether a cluster is at risk for this issue?
A cluster is at risk if it is running a OneFS version earlier than 9.3.0.0.
If the workflow does not use ADS together with rename operations, the cluster will not encounter this issue.
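The two conditions above can be combined into a single check. The following is a minimal sketch, assuming the OneFS version is available as a plain dotted-numeric string such as "9.2.1.0" (optionally with a leading "v"), and that you determine yourself whether the workflow uses ADS with rename operations. The function names are hypothetical and are not part of OneFS.

```python
from __future__ import annotations

# First OneFS release that contains the fix, per this article.
FIXED_VERSION = (9, 3, 0, 0)


def parse_onefs_version(version: str) -> tuple[int, ...]:
    """Parse a dotted-numeric version string such as "9.2.1.0" into a tuple.

    Assumption: the string contains only numeric components, optionally
    prefixed with "v". Short versions are padded so "9.3" equals "9.3.0.0".
    """
    parts = tuple(int(p) for p in version.strip().lstrip("vV").split("."))
    return parts + (0,) * (4 - len(parts))


def is_at_risk(version: str, uses_ads_with_rename: bool) -> bool:
    """Return True if the cluster runs a pre-9.3.0.0 release AND uses ADS with renames."""
    return parse_onefs_version(version) < FIXED_VERSION and uses_ads_with_rename


if __name__ == "__main__":
    print(is_at_risk("9.2.1.0", uses_ads_with_rename=True))
    print(is_at_risk("9.3.0.0", uses_ads_with_rename=True))
```

Python compares tuples element by element, so padding every version to four components keeps the comparison with 9.3.0.0 consistent.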
Resolution
Upgrade to OneFS 9.3.0.0 or later.
Workaround:
If needed, PowerScale Support can analyze the minidump collected from the panicked node to help identify the file or folder associated with the panic.