Unsolved

1 Rookie

April 15th, 2024 22:44

Memory Channel Error Identification on PowerEdge R6515

Hello

I recently bought 16x Dell Part AA783423 as part of a memory upgrade, but one of the sticks seems bad.

I am experiencing memory errors on my Dell PowerEdge R6515 server and need assistance with identifying the problematic DIMM slot. The edac-util tool reports errors specifically at "mc#0csrow#3channel#2":

edac-util -v 
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#2: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#3: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#4: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#5: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#2: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#3: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#4: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#5: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#2: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#3: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#4: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#5: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#2: 74 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#3: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#4: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#5: 0 Corrected Errors

I would like guidance on which DIMM slot corresponds to this particular memory channel. Could you provide me with the memory layout or any specific documentation that could help me isolate and address this issue? Any additional troubleshooting steps or advice would also be appreciated.
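
For reference, the non-zero counter can be pulled out of `edac-util -v` output mechanically. This is a minimal Python sketch, assuming the per-channel line format shown above; the function name `nonzero_counters` is illustrative and not part of edac-utils.

```python
import re

# Matches per-channel lines such as:
#   mc0: csrow3: mc#0csrow#3channel#2: 74 Corrected Errors
# The pattern is inferred from the output pasted in this post.
LINE_RE = re.compile(
    r"^mc(?P<mc>\d+): csrow(?P<csrow>\d+): "
    r"mc#\d+csrow#\d+channel#(?P<channel>\d+): "
    r"(?P<count>\d+) Corrected Errors$"
)


def nonzero_counters(output: str) -> list[tuple[int, int, int, int]]:
    """Return (mc, csrow, channel, count) for every channel with count > 0."""
    hits = []
    for line in output.splitlines():
        m = LINE_RE.match(line.strip())
        if m and int(m["count"]) > 0:
            hits.append((int(m["mc"]), int(m["csrow"]),
                         int(m["channel"]), int(m["count"])))
    return hits


sample = """\
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#2: 74 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#3: 0 Corrected Errors
"""

print(nonzero_counters(sample))
```

This only narrows the problem down to a logical csrow/channel pair; mapping that pair to a physical slot label still needs the platform documentation.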

Thank you for your assistance.

Moderator

April 16th, 2024 03:25

Hi,

 

It is hard to map an EDAC error message to a physical DIMM slot, as we would need to refer to the architectural schematics, which usually requires engineering to be involved.

I would suggest disabling EDAC and letting the server's Lifecycle Controller capture the error; this would be easier and faster. These errors occur when the Error Detection and Correction (EDAC) module reads the registers from the chipset. You may not notice any memory or CPU errors in the ESM/BMC/IPMI/iDRAC log because the registers are read-once and, when enabled, EDAC will get them first.

1 Rookie

April 16th, 2024 10:44

@DELL-Joey C​ Hello

I disabled EDAC by unloading its modules: rmmod amd64_edac edac_mce_amd

Now dmesg prints:

mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 17: d42040000000011b
mce: [Hardware Error]: TSC 0 ADDR 15250b5100 PPIN 2b4a63d009dc115 SYND bb4400800a800403 IPID 9600250f00
mce: [Hardware Error]: PROCESSOR 2:830f10 TIME 1713263957 SOCKET 0 APIC 2 microcode 830107a

and:
edac-util -v

edac-util: Error: No memory controller data found.

Checking the iDRAC Lifecycle log, I'm not seeing anything picked up there.
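
The raw machine-check record above still carries location information. Below is a rough Python sketch of decoding it. It assumes the MCA_STATUS bit layout and the channel/csrow extraction that the Linux `amd64_edac` driver applies to family 17h memory-controller (UMC) banks. Those bit positions come from a reading of the kernel source, so treat them as assumptions and check them against the kernel version in use.

```python
# Values copied from the dmesg lines above.
STATUS = 0xD42040000000011B
SYND = 0xBB4400800A800403
IPID = 0x9600250F00


def bit(value: int, n: int) -> bool:
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_mce(status: int, synd: int, ipid: int) -> dict:
    """Decode a few MCA fields (bit positions are assumptions, see above)."""
    return {
        "valid": bit(status, 63),          # VAL
        "overflow": bit(status, 62),       # OVER: more errors than were logged
        "uncorrected": bit(status, 61),    # UC
        "addr_valid": bit(status, 58),     # ADDRV
        "synd_valid": bit(status, 53),     # SYNDV (AMD)
        "corrected_ecc": bit(status, 46),  # CECC (AMD)
        "hwid": (ipid >> 32) & 0xFFF,      # hardware ID of the reporting block
        "channel": (ipid & 0xFFFFFFFF) >> 20,  # UMC channel, as in amd64_edac
        "csrow": synd & 0x7,               # chip-select row, as in amd64_edac
    }


info = decode_mce(STATUS, SYND, IPID)
for key, value in info.items():
    print(f"{key}: {value}")
```

Under those assumptions the record decodes to a corrected ECC error on channel 2, csrow 3, which matches the "mc#0csrow#3channel#2" counter that edac-util reported before the modules were unloaded.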

Moderator

April 16th, 2024 12:37

Hello, 

If there is nothing in the iDRAC LCC log, then it's hard to say there is a memory error. See the Dell article "EDAC Errors in 'messages' Log in RedHat Enterprise Linux (RHEL) and PowerEdge":

These errors occur when the Error Detection and Correction (EDAC) module reads the registers from the chipset. You may not notice any memory or CPU errors in the ESM/BMC/IPMI/iDRAC log because the registers are read-once and when enabled, EDAC will get them first.

Resolution:
  • Blacklist the EDAC driver:
    • List the EDAC modules:
      • # lsmod | grep -i edac
    • Blacklist the modules from that output:
      • Edit '/etc/modprobe.d/blacklist.conf' with your favorite editor
      • Add the modules at the bottom of the file
      • Example:
        • blacklist i7core_edac
        • blacklist edac_core
  • Reboot
  • Run hardware diagnostics
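
The list-and-blacklist steps above can be sketched in Python. This assumes the usual `lsmod` column layout (module name first). The sample text is illustrative: the module names are taken from this thread, the sizes and use counts are invented, and `blacklist_lines` is a hypothetical helper.

```python
def blacklist_lines(lsmod_output: str) -> list[str]:
    """Build 'blacklist <module>' lines for every loaded module whose
    name contains 'edac' (case-insensitive)."""
    lines = []
    for row in lsmod_output.splitlines():
        fields = row.split()
        if not fields or fields[0] == "Module":  # skip blanks and the header
            continue
        name = fields[0]
        if "edac" in name.lower():
            lines.append(f"blacklist {name}")
    return lines


# Illustrative lsmod output; sizes and use counts are invented.
sample = """\
Module                  Size  Used by
amd64_edac             61440  0
edac_mce_amd           40960  1 amd64_edac
kvm_amd               208896  0
"""

print("\n".join(blacklist_lines(sample)))
```

The printed lines are what would be appended to '/etc/modprobe.d/blacklist.conf'. On the R6515 in this thread the loaded modules were amd64_edac and edac_mce_amd rather than the i7core_edac shown in the example.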

Hope that helps!

1 Rookie

July 9th, 2024 08:19

@DELL-Erman O Hello. I tried this, and now dmesg -T no longer prints any memory-related errors, but iDRAC 9 doesn't show any errors either. I ran memtester for one pass over 26 hours and it found nothing. I suspect that is because ECC is correcting the errors before memtester can see them.

The system isn't stable :( As soon as I start it and RAM usage goes up, applications start freezing and dmesg -T shows this:

[Tue Jul  9 05:54:51 2024] Modules linked in: ib_core xt_recent xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter act_police cls_u32 sch_ingress cls_fw sch_sfq sch_htb kcare(OE) tls nft_meta_bridge ebt_arp ebt_ip6 ebt_ip nft_counter vhost_net vhost vhost_iotlb tap xt_physdev nft_compat nf_tables nfnetlink tun wireguard(X) ip6_udp_tunnel udp_tunnel curve25519_x86_64 libcurve25519_generic bridge stp llc sunrpc vfat fat ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm mgag200 i2c_algo_bit drm_shmem_helper dell_wmi ledtrig_audio sparse_keymap rfkill drm_kms_helper video syscopyarea irqbypass dcdbas dell_smbios sysfillrect wmi_bmof dell_wmi_descriptor sysimgblt acpi_ipmi rapl pcspkr acpi_cpufreq fb_sys_fops ipmi_si i2c_piix4 ptdma k10temp ipmi_devintf ipmi_msghandler acpi_power_meter joydev raid10 fuse drm xfs libcrc32c raid1 ahci nvme crct10dif_pclmul libahci crc32_pclmul crc32c_intel nvme_core libata ghash_clmulni_intel tg3 nvme_common ccp t10_pi
[Tue Jul  9 05:54:51 2024]  sp5100_tco wmi dm_mirror dm_region_hash dm_log dm_mod
[Tue Jul  9 05:54:51 2024] CPU: 19 PID: 45636 Comm: CPU 2/KVM Kdump: loaded Tainted: G        W  OEL X  -------  ---  5.14.0-427.16.1.el9_4.x86_64 #1
[Tue Jul  9 05:54:51 2024] Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.13.3 09/12/2023
[Tue Jul  9 05:54:51 2024] RIP: 0010:_raw_spin_unlock_irqrestore+0x1c/0x30
[Tue Jul  9 05:54:51 2024] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 f7 c6 00 02 00 00 74 01 fb 65 ff 0d 8c cf 59 55 <74> 05 e9 1d 1a 00 00 0f 1f 44 00 00 e9 13 1a 00 00 0f 1f 00 90 90
[Tue Jul  9 05:54:51 2024] RSP: 0018:ffff9ac406d675a8 EFLAGS: 00000246
[Tue Jul  9 05:54:51 2024] RAX: 0000000000000000 RBX: ffff894bd80485b8 RCX: 0000000000000001
[Tue Jul  9 05:54:51 2024] RDX: ffff8998bfe55008 RSI: 0000000000000213 RDI: ffff8998bfe7fb28
[Tue Jul  9 05:54:51 2024] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
[Tue Jul  9 05:54:51 2024] R10: ffff894b5db585b8 R11: ffff89569ade75b8 R12: ffff893473a3b000
[Tue Jul  9 05:54:51 2024] R13: ffff9ac406d67718 R14: 0000000000000001 R15: ffff8a3b9f602000
[Tue Jul  9 05:54:51 2024] FS:  00007f1dc35fe640(0000) GS:ffff8996bf8c0000(0000) knlGS:0000000000000000
[Tue Jul  9 05:54:51 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Jul  9 05:54:51 2024] CR2: 0000021488556000 CR3: 0000000e83cf6000 CR4: 0000000000350ee0
[Tue Jul  9 05:54:51 2024] watchdog: BUG: soft lockup - CPU#31 stuck for 26s! [migration/31:206]
[Tue Jul  9 05:54:51 2024] Call Trace:
[Tue Jul  9 05:54:51 2024]  <IRQ>
[Tue Jul  9 05:54:51 2024] Modules linked in:
[Tue Jul  9 05:54:51 2024]  ? srso_return_thunk+0x5/0x5f
[Tue Jul  9 05:54:51 2024]  ib_core
[Tue Jul  9 05:54:51 2024]  ? show_trace_log_lvl+0x26e/0x2df
[Tue Jul  9 05:54:51 2024]  xt_recent
[Tue Jul  9 05:54:51 2024]  ? show_trace_log_lvl+0x26e/0x2df
[Tue Jul  9 05:54:51 2024]  xt_conntrack
[Tue Jul  9 05:54:51 2024]  ? shrink_many+0xd4/0x270
[Tue Jul  9 05:54:51 2024]  nf_conntrack
[Tue Jul  9 05:54:51 2024]  ? watchdog_timer_fn+0x1b2/0x210
[Tue Jul  9 05:54:51 2024]  nf_defrag_ipv6
[Tue Jul  9 05:54:51 2024]  ? __pfx_watchdog_timer_fn+0x10/0x10
[Tue Jul  9 05:54:51 2024]  nf_defrag_ipv4
[Tue Jul  9 05:54:51 2024]  ? __hrtimer_run_queues+0x12a/0x2c0
[Tue Jul  9 05:54:51 2024]  br_netfilter
[Tue Jul  9 05:54:51 2024]  ? hrtimer_interrupt+0xfc/0x210
[Tue Jul  9 05:54:51 2024]  act_police
[Tue Jul  9 05:54:51 2024]  ? __do_softirq+0x16a/0x2ac
[Tue Jul  9 05:54:51 2024]  cls_u32
[Tue Jul  9 05:54:51 2024]  ? __sysvec_apic_timer_interrupt+0x5f/0x110
[Tue Jul  9 05:54:51 2024]  sch_ingress
[Tue Jul  9 05:54:51 2024]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[Tue Jul  9 05:54:51 2024]  cls_fw
[Tue Jul  9 05:54:51 2024]  </IRQ>
[Tue Jul  9 05:54:51 2024]  sch_sfq
[Tue Jul  9 05:54:51 2024]  <TASK>
[Tue Jul  9 05:54:51 2024]  sch_htb
[Tue Jul  9 05:54:51 2024]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[Tue Jul  9 05:54:51 2024]  kcare(OE)
[Tue Jul  9 05:54:51 2024]  ? _raw_spin_unlock_irqrestore+0x1c/0x30
[Tue Jul  9 05:54:51 2024]  tls
[Tue Jul  9 05:54:51 2024]  shrink_many+0xd4/0x270
[Tue Jul  9 05:54:51 2024]  nft_meta_bridge
[Tue Jul  9 05:54:51 2024]  shrink_node+0x406/0x4a0
[Tue Jul  9 05:54:51 2024]  ebt_arp ebt_ip6
[Tue Jul  9 05:54:51 2024]  shrink_zones.constprop.0+0x88/0x280
[Tue Jul  9 05:54:51 2024]  ebt_ip
[Tue Jul  9 05:54:51 2024]  do_try_to_free_pages+0x92/0x2d0
[Tue Jul  9 05:54:51 2024]  nft_counter
[Tue Jul  9 05:54:51 2024]  try_to_free_pages+0xd8/0x200
[Tue Jul  9 05:54:51 2024]  vhost_net
[Tue Jul  9 05:54:51 2024]  __alloc_pages_slowpath.constprop.0+0x344/0x960
[Tue Jul  9 05:54:51 2024]  vhost
[Tue Jul  9 05:54:51 2024]  ? srso_return_thunk+0x5/0x5f
[Tue Jul  9 05:54:51 2024]  vhost_iotlb
[Tue Jul  9 05:54:51 2024]  ? srso_return_thunk+0x5/0x5f
[Tue Jul  9 05:54:51 2024]  tap
[Tue Jul  9 05:54:51 2024]  ? get_page_from_freelist+0x2ab/0x530
[Tue Jul  9 05:54:51 2024]  xt_physdev
[Tue Jul  9 05:54:51 2024]  ? post_alloc_hook+0xb6/0xd0
[Tue Jul  9 05:54:51 2024]  nft_compat
[Tue Jul  9 05:54:51 2024]  __alloc_pages+0x21d/0x250
[Tue Jul  9 05:54:51 2024]  nf_tables
[Tue Jul  9 05:54:51 2024]  __folio_alloc+0x17/0x50
[Tue Jul  9 05:54:51 2024]  nfnetlink
[Tue Jul  9 05:54:51 2024]  vma_alloc_folio+0x281/0x390
[Tue Jul  9 05:54:51 2024]  tun
[Tue Jul  9 05:54:51 2024]  do_huge_pmd_anonymous_page+0xb6/0x390
[Tue Jul  9 05:54:51 2024]  wireguard(X)
[Tue Jul  9 05:54:51 2024]  ? ttwu_queue_wakelist+0xf2/0x110
[Tue Jul  9 05:54:51 2024]  ip6_udp_tunnel
[Tue Jul  9 05:54:51 2024]  __handle_mm_fault+0x661/0x670
[Tue Jul  9 05:54:51 2024]  udp_tunnel curve25519_x86_64
[Tue Jul  9 05:54:51 2024]  handle_mm_fault+0xcd/0x290
[Tue Jul  9 05:54:51 2024]  libcurve25519_generic
[Tue Jul  9 05:54:51 2024]  __get_user_pages+0x1df/0x470
[Tue Jul  9 05:54:51 2024]  bridge
[Tue Jul  9 05:54:51 2024]  get_user_pages_unlocked+0xcc/0x320
[Tue Jul  9 05:54:51 2024]  stp
[Tue Jul  9 05:54:51 2024]  hva_to_pfn+0xf9/0x360 [kvm]
[Tue Jul  9 05:54:51 2024]  llc
[Tue Jul  9 05:54:51 2024]  ? srso_untrain_ret+0x2/0x2
[Tue Jul  9 05:54:51 2024]  sunrpc
[Tue Jul  9 05:54:51 2024]  ? xas_load+0x9/0xa0
[Tue Jul  9 05:54:51 2024]  vfat
[Tue Jul  9 05:54:51 2024]  ? srso_return_thunk+0x5/0x5f
[Tue Jul  9 05:54:51 2024]  fat
[Tue Jul  9 05:54:51 2024]  ? xa_load+0x70/0xb0
[Tue Jul  9 05:54:51 2024]  ipmi_ssif
[Tue Jul  9 05:54:51 2024] watchdog: BUG: soft lockup - CPU#63 stuck for 26s! [php:3671602]
[Tue Jul  9 05:54:51 2024] Modules linked in: ib_core xt_recent xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter act_police cls_u32 sch_ingress cls_fw sch_sfq sch_htb kcare(OE) tls nft_meta_bridge ebt_arp ebt_ip6 ebt_ip nft_counter vhost_net vhost vhost_iotlb tap xt_physdev nft_compat nf_tables nfnetlink tun wireguard(X) ip6_udp_tunnel udp_tunnel curve25519_x86_64 libcurve25519_generic bridge stp llc sunrpc vfat fat ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm mgag200 i2c_algo_bit drm_shmem_helper dell_wmi ledtrig_audio sparse_keymap rfkill drm_kms_helper video syscopyarea irqbypass dcdbas dell_smbios sysfillrect wmi_bmof dell_wmi_descriptor sysimgblt acpi_ipmi rapl pcspkr acpi_cpufreq fb_sys_fops ipmi_si i2c_piix4 ptdma k10temp ipmi_devintf ipmi_msghandler acpi_power_meter joydev raid10 fuse drm xfs libcrc32c raid1 ahci nvme crct10dif_pclmul libahci crc32_pclmul crc32c_intel nvme_core libata ghash_clmulni_intel tg3 nvme_common ccp t10_pi
[Tue Jul  9 05:54:51 2024]  sp5100_tco wmi dm_mirror dm_region_hash dm_log dm_mod
[Tue Jul  9 05:54:51 2024] CPU: 63 PID: 3671602 Comm: php Kdump: loaded Tainted: G        W  OEL X  -------  ---  5.14.0-427.16.1.el9_4.x86_64 #1
[Tue Jul  9 05:54:51 2024] Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.13.3 09/12/2023
[Tue Jul  9 05:54:51 2024] RIP: 0010:_raw_spin_unlock_irqrestore+0x1c/0x30
[Tue Jul  9 05:54:51 2024] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 f7 c6 00 02 00 00 74 01 fb 65 ff 0d 8c cf 59 55 <74> 05 e9 1d 1a 00 00 0f 1f 44 00 00 e9 13 1a 00 00 0f 1f 00 90 90
[Tue Jul  9 05:54:51 2024] RSP: 0018:ffff9ac3f56f79d8 EFLAGS: 00000246
[Tue Jul  9 05:54:51 2024] RAX: 0000000000000000 RBX: ffff8946173435b8 RCX: 0000000000000001
[Tue Jul  9 05:54:51 2024] RDX: ffff8998bfe55008 RSI: 0000000000000216 RDI: ffff8998bfe7fb28
[Tue Jul  9 05:54:51 2024] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
[Tue Jul  9 05:54:51 2024] R10: ffff892bc44315b8 R11: ffff896766de25b8 R12: ffff89242de1c000
[Tue Jul  9 05:54:51 2024] R13: ffff9ac3f56f7b48 R14: 0000000000000001 R15: ffff8a265c5a0000
[Tue Jul  9 05:54:51 2024] FS:  00007f6c65029800(0000) GS:ffff8996bfdc0000(0000) knlGS:0000000000000000
[Tue Jul  9 05:54:51 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Jul  9 05:54:51 2024] CR2: 0000021488657000 CR3: 00000032bc02e000 CR4: 0000000000350ee0
[Tue Jul  9 05:54:51 2024] Call Trace:
[Tue Jul  9 05:54:51 2024]  <IRQ>
[Tue Jul  9 05:54:51 2024]  ? srso_return_thunk+0x5/0x5f
[Tue Jul  9 05:54:51 2024]  ? show_trace_log_lvl+0x26e/0x2df
[Tue Jul  9 05:54:51 2024]  ? show_trace_log_lvl+0x26e/0x2df
[Tue Jul  9 05:54:51 2024]  ? shrink_many+0xd4/0x270
[Tue Jul  9 05:54:51 2024]  ? watchdog_timer_fn+0x1b2/0x210
[Tue Jul  9 05:54:51 2024]  ? __pfx_watchdog_timer_fn+0x10/0x10
[Tue Jul  9 05:54:51 2024]  ? __hrtimer_run_queues+0x12a/0x2c0
[Tue Jul  9 05:54:51 2024]  ? hrtimer_interrupt+0xfc/0x210
[Tue Jul  9 05:54:51 2024]  ? __sysvec_apic_timer_interrupt+0x5f/0x110
[Tue Jul  9 05:54:51 2024]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[Tue Jul  9 05:54:51 2024]  </IRQ>
[Tue Jul  9 05:54:51 2024]  <TASK>
[Tue Jul  9 05:54:51 2024]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[Tue Jul  9 05:54:51 2024]  ? _raw_spin_unlock_irqrestore+0x1c/0x30
[Tue Jul  9 05:54:51 2024]  shrink_many+0xd4/0x270
[Tue Jul  9 05:54:51 2024]  shrink_node+0x406/0x4a0
[Tue Jul  9 05:54:51 2024]  shrink_zones.constprop.0+0x88/0x280
[Tue Jul  9 05:54:51 2024]  do_try_to_free_pages+0x92/0x2d0
[Tue Jul  9 05:54:51 2024]  try_to_free_pages+0xd8/0x200
[Tue Jul  9 05:54:51 2024]  __alloc_pages_slowpath.constprop.0+0x344/0x960
[Tue Jul  9 05:54:51 2024]  ? srso_return_thunk+0x5/0x5f
[Tue Jul  9 05:54:51 2024]  ? free_p4d_range+0xcf/0x230
[Tue Jul  9 05:54:51 2024]  ? srso_return_thunk+0x5/0x5f
[Tue Jul  9 05:54:51 2024]  ? get_page_from_freelist+0x2ab/0x530
[Tue Jul  9 05:54:51 2024]  ? post_alloc_hook+0xb6/0xd0
[Tue Jul  9 05:54:51 2024]  __alloc_pages+0x21d/0x250
[Tue Jul  9 05:54:51 2024]  __folio_alloc+0x17/0x50
[Tue Jul  9 05:54:51 2024]  __kvm_faultin_pfn+0xb6/0x470 [kvm]
[Tue Jul  9 05:54:51 2024]  ? srso_return_thunk+0x5/0x5f
[Tue Jul  9 05:54:51 2024]  kvm_faultin_pfn+0x24/0x1b0 [kvm]
[Tue Jul  9 05:54:51 2024]  kvm_tdp_page_fault+0xf4/0x150 [kvm]
[Tue Jul  9 05:54:51 2024]  kvm_mmu_do_page_fault+0x16d/0x1a0 [kvm]
[Tue Jul  9 05:54:51 2024]  kvm_mmu_page_fault+0x81/0x1f0 [kvm]
[Tue Jul  9 05:54:51 2024]  vcpu_enter_guest.constprop.0+0x60e/0xf20 [kvm]
[Tue Jul  9 05:54:51 2024]  ? srso_return_thunk+0x5/0x5f
[Tue Jul  9 05:54:51 2024]  ? kvm_apic_local_deliver+0x7b/0xb0 [kvm]
[Tue Jul  9 05:54:51 2024]  vcpu_run+0x130/0x260 [kvm]
[Tue Jul  9 05:54:51 2024]  kvm_arch_vcpu_ioctl_run+0x17a/0x490 [kvm]
[Tue Jul  9 05:54:51 2024]  kvm_vcpu_ioctl+0x271/0x6b0 [kvm]
[Tue Jul  9 05:54:51 2024]  ? srso_safe_ret+0xb/0x20
[Tue Jul  9 05:54:51 2024]  __x64_sys_ioctl+0x8a/0xc0
[Tue Jul  9 05:54:51 2024]  do_syscall_64+0x5c/0x90
[Tue Jul  9 05:54:51 2024]  ? do_syscall_64+0x69/0x90
[Tue Jul  9 05:54:51 2024]  ? do_syscall_64+0x69/0x90
[Tue Jul  9 05:54:51 2024]  ? sysvec_apic_timer_interrupt+0x3c/0x90
[Tue Jul  9 05:54:51 2024]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[Tue Jul  9 05:54:51 2024]  intel_rapl_msr
[Tue Jul  9 05:54:51 2024] RIP: 0033:0x7f21d463ec6b
[Tue Jul  9 05:54:51 2024] Code: 73 01 c3 48 8b 0d b5 b1 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 85 b1 1b 00 f7 d8 64 89 01 48
[Tue Jul  9 05:54:51 2024] RSP: 002b:00007f1dc35fd3e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Tue Jul  9 05:54:51 2024] RAX: ffffffffffffffda RBX: 000055e466f60fa0 RCX: 00007f21d463ec6b
[Tue Jul  9 05:54:51 2024] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000001a
[Tue Jul  9 05:54:51 2024] RBP: 000000000000ae80 R08: 000055e4680d2ef0 R09: 0000000000000000
[Tue Jul  9 05:54:51 2024] R10: 00007f1dc3dffdb0 R11: 0000000000000246 R12: 00007f1dc3dffe50
[Tue Jul  9 05:54:51 2024] R13: 9ef165b6e25d7500 R14: 0000000000000000 R15: fffffffffffffd00
[Tue Jul  9 05:54:51 2024]  </TASK>
[Tue Jul  9 05:54:51 2024]  vma_alloc_folio+0x281/0x390
[Tue Jul  9 05:54:51 2024]  intel_rapl_common
[Tue Jul  9 05:54:51 2024]  do_huge_pmd_anonymous_page+0xb6/0x390
[Tue Jul  9 05:54:51 2024]  amd64_edac
[Tue Jul  9 05:54:51 2024]  ? srso_return_thunk+0x5/0x5f
[Tue Jul  9 05:54:51 2024]  edac_mce_amd
[Tue Jul  9 05:54:51 2024]  ? __do_huge_pmd_anonymous_page+0x26f/0x500
[Tue Jul  9 05:54:51 2024]  kvm_amd
[Tue Jul  9 05:54:51 2024]  __handle_mm_fault+0x661/0x670
[Tue Jul  9 05:54:51 2024]  kvm mgag200
[Tue Jul  9 05:54:51 2024]  handle_mm_fault+0xcd/0x290
[Tue Jul  9 05:54:51 2024]  i2c_algo_bit
[Tue Jul  9 05:54:51 2024]  do_user_addr_fault+0x1b4/0x6a0
[Tue Jul  9 05:54:51 2024]  drm_shmem_helper
[Tue Jul  9 05:54:51 2024]  ? syscall_exit_work+0x103/0x130
[Tue Jul  9 05:54:51 2024]  dell_wmi
[Tue Jul  9 05:54:51 2024]  exc_page_fault+0x62/0x150
[Tue Jul  9 05:54:51 2024]  ledtrig_audio
[Tue Jul  9 05:54:51 2024]  asm_exc_page_fault+0x22/0x30
[Tue Jul  9 05:54:51 2024]  sparse_keymap
[Tue Jul  9 05:54:51 2024] RIP: 0033:0x7f6c64959b86
[Tue Jul  9 05:54:51 2024]  rfkill
[Tue Jul  9 05:54:51 2024] Code: 0f 1f 40 00 c5 fe 6f 4e 60 c5 fe 6f 56 40 c5 fe 6f 5e 20 c5 fe 6f 26 48 83 c6 80 c5 fd 7f 49 60 c5 fd 7f 51 40 c5 fd 7f 59 20 <c5> fd 7f 21 48 83 c1 80 48 39 cf 72 cd c5 fe 7f 07 c5 fe 7f 6f 20
[Tue Jul  9 05:54:51 2024]  drm_kms_helper
[Tue Jul  9 05:54:51 2024] RSP: 002b:00007ffc16b83858 EFLAGS: 00010207
[Tue Jul  9 05:54:51 2024]  video

[Tue Jul  9 05:54:51 2024]  syscopyarea
[Tue Jul  9 05:54:51 2024] RAX: 0000000000600000 RBX: 00007ffc16b83890 RCX: 00000000007fffe0
[Tue Jul  9 05:54:51 2024]  irqbypass
[Tue Jul  9 05:54:51 2024] RDX: 0000000000375000 RSI: 00007f6c60a97f60 RDI: 0000000000600000
[Tue Jul  9 05:54:51 2024]  dcdbas
[Tue Jul  9 05:54:51 2024] RBP: 0000000000400000 R08: 0000000000000000 R09: 0000000000000000
[Tue Jul  9 05:54:51 2024]  dell_smbios
[Tue Jul  9 05:54:51 2024] R10: 0000000000400000 R11: 0000000000000246 R12: 0000000000600000
[Tue Jul  9 05:54:51 2024]  sysfillrect
[Tue Jul  9 05:54:51 2024] R13: 0000000002c4f030 R14: 0000000000375000 R15: 00007f6c60898000
[Tue Jul  9 05:54:51 2024]  wmi_bmof
[Tue Jul  9 05:54:51 2024]  </TASK>
[Tue Jul  9 05:54:51 2024]  dell_wmi_descriptor sysimgblt acpi_ipmi rapl pcspkr acpi_cpufreq fb_sys_fops ipmi_si i2c_piix4 ptdma k10temp ipmi_devintf ipmi_msghandler acpi_power_meter joydev raid10 fuse drm xfs libcrc32c raid1 ahci nvme crct10dif_pclmul libahci crc32_pclmul crc32c_intel nvme_core libata ghash_clmulni_intel tg3 nvme_common ccp t10_pi sp5100_tco wmi dm_mirror dm_region_hash dm_log dm_mod
[Tue Jul  9 05:54:51 2024] CPU: 31 PID: 206 Comm: migration/31 Kdump: loaded Tainted: G        W  OEL X  -------  ---  5.14.0-427.16.1.el9_4.x86_64 #1
[Tue Jul  9 05:54:51 2024] Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 2.13.3 09/12/2023
[Tue Jul  9 05:54:51 2024] Stopper: multi_cpu_stop+0x0/0x100 <- migrate_swap+0xb4/0x110
[Tue Jul  9 05:54:51 2024] RIP: 0010:multi_cpu_stop+0x98/0x100
[Tue Jul  9 05:54:51 2024] Code: 0f 8b 43 20 8b 4b 10 83 c0 01 89 4b 24 89 43 20 e8 bd 31 fa ff 41 83 ff 04 74 2e 45 89 fc 4c 89 ef e8 4c ff ff ff 44 8b 7b 20 <45> 39 fc 75 aa 41 83 ff 01 76 0a e8 a8 08 02 00 e8 b3 fc 01 00 e8
[Tue Jul  9 05:54:51 2024] RSP: 0018:ffff9ac3d953be78 EFLAGS: 00000246
[Tue Jul  9 05:54:51 2024] RAX: 0000000000000000 RBX: ffff9ac40b3db6e8 RCX: 0000000000000002
[Tue Jul  9 05:54:51 2024] RDX: 0000000000000004 RSI: ffff9ac40b3db710 RDI: ffffffffaae3efc0
[Tue Jul  9 05:54:51 2024] RBP: ffff9ac40b3db70c R08: ffff8a963e5e3f90 R09: ffff8a188fcf3000
[Tue Jul  9 05:54:51 2024] R10: 0000000000000034 R11: 0000000000000000 R12: 0000000000000001
[Tue Jul  9 05:54:51 2024] R13: ffffffffaae3efc0 R14: 0000000000000000 R15: 0000000000000001
[Tue Jul  9 05:54:51 2024] FS:  0000000000000000(0000) GS:ffff8a963e5c0000(0000) knlGS:0000000000000000
[Tue Jul  9 05:54:51 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Jul  9 05:54:51 2024] CR2: 00007f3790185f50 CR3: 0000012b05810000 CR4: 0000000000350ee0
[Tue Jul  9 05:54:51 2024] Call Trace:
[Tue Jul  9 05:54:51 2024]  <IRQ>
[Tue Jul  9 05:54:51 2024]  ? srso_return_thunk+0x5/0x5f
[Tue Jul  9 05:54:51 2024]  ? show_trace_log_lvl+0x26e/0x2df
[Tue Jul  9 05:54:51 2024]  ? show_trace_log_lvl+0x26e/0x2df
[Tue Jul  9 05:54:51 2024]  ? cpu_stopper_thread+0x93/0x140
[Tue Jul  9 05:54:51 2024]  ? watchdog_timer_fn+0x1b2/0x210
[Tue Jul  9 05:54:51 2024]  ? __pfx_watchdog_timer_fn+0x10/0x10
[Tue Jul  9 05:54:51 2024]  ? __hrtimer_run_queues+0x12a/0x2c0
[Tue Jul  9 05:54:51 2024]  ? hrtimer_interrupt+0xfc/0x210
[Tue Jul  9 05:54:51 2024]  ? __sysvec_apic_timer_interrupt+0x5f/0x110
[Tue Jul  9 05:54:51 2024]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[Tue Jul  9 05:54:51 2024]  </IRQ>
[Tue Jul  9 05:54:51 2024]  <TASK>
[Tue Jul  9 05:54:51 2024]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[Tue Jul  9 05:54:51 2024]  ? multi_cpu_stop+0x98/0x100
[Tue Jul  9 05:54:51 2024]  ? multi_cpu_stop+0x94/0x100
[Tue Jul  9 05:54:51 2024]  ? __pfx_multi_cpu_stop+0x10/0x10
[Tue Jul  9 05:54:51 2024]  cpu_stopper_thread+0x93/0x140
[Tue Jul  9 05:54:51 2024]  ? __pfx_smpboot_thread_fn+0x10/0x10
[Tue Jul  9 05:54:51 2024]  smpboot_thread_fn+0xd6/0x1a0
[Tue Jul  9 05:54:51 2024]  kthread+0xe0/0x100
[Tue Jul  9 05:54:51 2024]  ? __pfx_kthread+0x10/0x10
[Tue Jul  9 05:54:51 2024]  ret_from_fork+0x2c/0x50
[Tue Jul  9 05:54:51 2024]  </TASK>
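
Because several CPUs print at once, the trace above is interleaved and hard to read. This is a small Python sketch for listing just the soft-lockup reports, assuming the watchdog line format seen in this log; `soft_lockups` is an illustrative name.

```python
import re

# Matches the kernel watchdog line, e.g.
#   watchdog: BUG: soft lockup - CPU#31 stuck for 26s! [migration/31:206]
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def soft_lockups(log: str) -> list[dict]:
    """Return one dict per soft-lockup report found in the log text."""
    reports = []
    for line in log.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            reports.append({
                "cpu": int(m["cpu"]),
                "secs": int(m["secs"]),
                "comm": m["comm"],
                "pid": int(m["pid"]),
            })
    return reports


# The two watchdog lines present in the trace above.
sample = """\
[Tue Jul  9 05:54:51 2024] watchdog: BUG: soft lockup - CPU#31 stuck for 26s! [migration/31:206]
[Tue Jul  9 05:54:51 2024] watchdog: BUG: soft lockup - CPU#63 stuck for 26s! [php:3671602]
"""

for report in soft_lockups(sample):
    print(report)
```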

I removed all 16x Dell Part AA783423 that I had installed and went back to the old 16 x 32 GB sticks. With the old sticks, all the issues go away. I think these CPU lockups and freezes are caused by that one bad RAM stick that I still can't find!

Could you give me any other advice on how best to solve this, other than removing 1-2 sticks at a time until the issue is gone? I live far away from the DC, so that would be really hard for me.

Moderator

July 9th, 2024 08:29

Hello,

This is very hard to diagnose, as the only way is to isolate the faulty memory module.

Also, I suggest keeping the BIOS and iDRAC firmware up to date.

Thanks

 

1 Rookie

July 9th, 2024 08:51

@DELL-Marco B​ Yes :(

Is there any specific way that I should do it? Do you know of a memory-intensive Linux command that would reliably reproduce the issue? Or should I just go to the datacenter, plug in, say, 8x Dell Part AA783423, wait a day, and see if the issue comes back?
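
The halving idea (install 8, then 4, and so on) can be sketched as a bisection. This is only a simulation in Python. It assumes exactly one faulty stick, that the fault follows the stick rather than the slot, and that each subset is a population the platform supports. `find_bad_dimm` and the stick labels are hypothetical.

```python
from typing import Callable, Sequence


def find_bad_dimm(dimms: Sequence[str],
                  fails: Callable[[Sequence[str]], bool]) -> tuple[str, int]:
    """Halve the candidate set until one DIMM remains.

    `fails(subset)` stands for one manual test round: install only
    `subset`, run the workload, and report whether the errors came back.
    Assumes exactly one faulty DIMM.
    """
    candidates = list(dimms)
    rounds = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        rounds += 1
        if fails(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0], rounds


# Simulated run: 16 sticks, with "D12" as the (hypothetical) bad one.
sticks = [f"D{i}" for i in range(1, 17)]
bad, rounds = find_bad_dimm(sticks, lambda subset: "D12" in subset)
print(bad, rounds)
```

With 16 sticks this needs four test rounds, compared with up to 15 when swapping one stick at a time.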

Moderator

July 9th, 2024 10:23

Which CPU is installed?
These memory modules are not compatible with Skylake CPUs.

Dell 64GB RAM Memory Upgrade - DDR4; 3200MHz (Cascade Lake, Ice Lake & AMD CPU only) | Dell USA

1 Rookie

July 9th, 2024 10:42

@DELL-Marco B

AMD EPYC 7R32


Moderator

July 9th, 2024 16:57
