Unsolved

May 21st, 2016 09:00

T7500 Riser Card broken?

Hi,

I have a T7500 server (or is it still considered a workstation?) which stopped one day with "uncorrectable memory error in Riser DIMM 3". I tried reseating the memory without success (same error). After a few changes to the memory (it has 12x 4GB Hynix 2Rx8 PC3L-10600R modules), it turns out it runs just fine when I remove all riser memory and keep only the 6 motherboard modules. My understanding is that this is not such a great memory configuration, so this morning, after an overnight job that ran just fine, I removed DIMM 4, 5 and 6 from the MB and moved them to Riser DIMM 1, 2 and 3, which I believe is an acceptable configuration. A crash happened after about 10 min. with "Uncorrectable memory error has been previously detected in RISER DIMM 3". Since the memory module came from the MB, where it had worked well overnight, does this mean that it is not the memory that is broken but perhaps the riser card? If so, can I plug a new one right in after I transfer the memory and the CPU from the old one? Can this work?

BTW, when this error occurs, the system offers F5 to run tests: the memory tests run for a while until they freeze without any additional error message. At this point I can do nothing but reboot the hard way.

Thanks,

Lothar

10 Elder

 • 

43.5K Posts

May 21st, 2016 17:00

has been previously detected in RISER DIMM 3

I'd interpret that to mean it's referring to the error before you swapped the RAM around, and the system may be "afraid" to use any module in riser DIMM slot 3.

You might want to clear the error logs. Not exactly sure how that's done on this model, but maybe by clearing BIOS:

  1. Reboot and press F2 to open BIOS setup
  2. Copy down all current settings
  3. Power off and unplug
  4. Press/hold power button for ~15 sec
  5. Open case and remove motherboard battery
  6. Press/hold power button for ~30 sec
  7. Install RAM with 12 GB on motherboard and 12 GB on riser, in slots 1, 2, 3 on each
  8. Reinstall battery
  9. Close case and reboot
  10. If you get any error messages at boot, open BIOS setup and make sure all settings match what you copied

I'd try that before replacing the riser card. And if that solves the problem, you can order replacement RAM, if you need it. Hope you know which module was in riser 3!  And you may have to buy more than one module so things match...

25 Posts

May 22nd, 2016 10:00

Hi,

Given the short space I allowed myself to pose my question, combined with a certain awkwardness in the way I express problems, there are a few things I need to correct.

This workstation came from our organization's surplus (so I did not place the original order), but it worked brilliantly running CentOS 6.7 for well over a year. One day, running a hefty parallel job, it crashed (stopped dead). When I rebooted, it stopped before booting with the message "An uncorrectable memory error had occurred in Riser DIMM 3" (I write this from memory). At that time it had 12 4GB Hynix memory modules installed, 6 on the MB and 6 on the riser. Every time I replaced this module and ran the very same job that had crashed the machine before, I ended up getting the same message about faulty memory in Riser DIMM 3. Curiously, I can still run it in the original configuration with a bunch of single-threaded jobs without any problems. Also, one of the first things I did was to run Memtest86, and it ran and ran without any problem whatsoever, again with the 'bad' memory installed. So I was puzzled as to what is going on.

After a while I tried different recommended and non-standard memory configurations with mixed success. What worked with the program that crashed the machine every time was to remove *all* riser memory and keep 24 GB on the MB. This is *not* a recommended configuration, as it is probably very slow compared to what it could do with 12 GB + 12 GB distributed over MB and riser. Maybe so. Slow but works. When I set up the 12+12 configuration, it crashes with the above message about Riser DIMM 3. This time I know the memory module came from a configuration that definitely worked, so I have no reason to believe it is broken.

So one question is, why did Memtest86 fail to detect this apparently bad memory chip or bad memory access?

Before I thought of the riser board as the culprit, I strongly suspected the memory modules. I had decided to buy six new 8GB modules and put them into DIMM 1, 2 and 3 of both MB and riser, as this is a recommended memory configuration. Now, with the observation that 6 x 4GB modules in this configuration still don't seem to work, I am in doubt.

BTW, the BIOS seems to be very good at detecting changes and does not dwell on old error states. When I change the system memory configuration, it tells me that it found changes, but this is okay: it normally boots and runs. The job that crashes the machine typically runs for a few minutes, presumably until it uses all 16 threads, and then crashes.

Is there anything else I can try? One more thing: it seems that refurbished riser boards (H236F) are not that expensive (anymore). Is getting a refurbished one a bad idea?


Thanks.

Lothar

10 Elder

 • 

43.5K Posts

May 22nd, 2016 11:00

I'd still try clearing BIOS before buying another riser. Clearing BIOS is easy and it's free.

I guess it could be a slot failure on the riser. Have you tried using canned air to blow out any dirt that might have accumulated in that slot?

Only you can decide how next to proceed...

25 Posts

May 22nd, 2016 13:00

Hi,

Can you give me a hint as to which settings I have to write down? The only thing I remember changing was the boot device order; the rest I did not touch.

Yes I did use compressed air to clean the sockets and the rest of the fans / heat sinks. No change in the end.

I do appreciate your time and advice!


Lothar

10 Elder

 • 

43.5K Posts

May 23rd, 2016 10:00

Copy all BIOS settings. I have no way to know which ones have been changed to something other than the defaults. And when you clear BIOS all of them will be reset to the defaults.

You might be able to take digital photos of the BIOS screens instead of writing the settings down, but make sure the flash is off, and the photos are readable before you clear BIOS.

8 Wizard

 • 

47K Posts

May 23rd, 2016 11:00

Memory that isn't all from the exact same vendor and speed will have issues.

Very unlikely the RISER is bad and not the RAM.

If you have RISERS you CANNOT have ANY ram on the motherboard.

This is not a valid configuration.

10 Elder

 • 

43.5K Posts

May 23rd, 2016 14:00

If you have RISERS you CANNOT have ANY ram on the motherboard.

This is not a valid configuration.

Not correct....

Here's the table from the T7500 manual showing RAM configs on motherboard and riser for dual CPU systems:

You originally had 48 GB (12x4 GB) with all slots filled on both motherboard and riser. After removing RAM from the riser and redistributing the remaining 6x4 GB, you have 3 slots filled on each. Both are acceptable configs.

When you bought replacement RAM, was it Dual Rank (DR)? Single Rank (SR) RAM won't work.
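
One way to check what is actually installed without opening the case is to read the SMBIOS memory table. The following is only a sketch: it assumes a Linux system where `dmidecode --type memory` is available (it normally needs root), and it parses the usual "Memory Device" blocks of that output. The SAMPLE text, the slot names and the vendor/part strings in it are made up for illustration and are not taken from this machine. Whether a "Rank" field is reported depends on the dmidecode version and the BIOS.

```python
def parse_dmidecode_memory(text):
    """Return one dict per 'Memory Device' block in `dmidecode --type memory` output."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped:
            current = None  # a blank line ends the current block
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def populated(devices):
    """Keep only slots that report an installed module."""
    return [d for d in devices if d.get("Size", "").lower() != "no module installed"]


def mismatched_fields(devices, fields=("Size", "Speed", "Manufacturer", "Part Number")):
    """Return {field: sorted distinct values} for fields that differ between modules."""
    result = {}
    for field in fields:
        values = {d.get(field, "?") for d in populated(devices)}
        if len(values) > 1:
            result[field] = sorted(values)
    return result


# Hypothetical sample output, for illustration only.
SAMPLE = """\
Handle 0x1100, DMI type 17, 28 bytes
Memory Device
    Size: 4096 MB
    Locator: DIMM 1
    Speed: 1333 MHz
    Manufacturer: ExampleVendor
    Part Number: EXAMPLE-PN
    Rank: 2

Handle 0x1101, DMI type 17, 28 bytes
Memory Device
    Size: 4096 MB
    Locator: RISER DIMM 3
    Speed: 1333 MHz
    Manufacturer: ExampleVendor
    Part Number: EXAMPLE-PN
    Rank: 2

Handle 0x1102, DMI type 17, 28 bytes
Memory Device
    Size: No Module Installed
    Locator: DIMM 4
"""

for dev in populated(parse_dmidecode_memory(SAMPLE)):
    print(dev.get("Locator"), dev.get("Size"), dev.get("Speed"), "rank:", dev.get("Rank", "n/a"))
```

On the real machine, the text passed to `parse_dmidecode_memory` would be the saved output of `dmidecode --type memory`. An empty result from `mismatched_fields` means the installed modules report the same size, speed, vendor and part number.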

25 Posts

May 24th, 2016 07:00

Hi,

Thanks to all of the rockstar experts who are helping to debug this system. It seems that we ran into questions about the proper configuration of the system. It is a T7500 and it has a riser, I believe an H236F. It does not interfere with the mainboard memory as you suggested. Maybe this is the case in one of the other systems?

I had initially 8 Sockets (white and black tab) on the Mother board filled with 4GB modules and the Riser CPU/Memory board (your 2nd picture is correct) filled with 6x 4GB modules (same type and specs). The last 3 pictures are unfamiliar to me and depict hardware that is not part of this configuration.

I have now run some tests: I cleared the BIOS as suggested in the first reply I got, reset all the necessary parameters in the BIOS (time, boot order, asset tag) and installed 6 modules (4GB each): 3 on the MB and 3 in the riser, in the DIMM 1, 2 and 3 positions on each. I believe this is an acceptable memory configuration.

All jobs I can throw at it run well, except the one that crashed it the first time. It is surprising that just this one creates the problem. Odd.

As the last test I ran said job and it froze. After restart, it stops, saying "Alert! Uncorrectable Memory Error has been previously detected in RISER DIMM 3". It offers F1, F2 and F5. F5 runs diagnostics, and when I do this it freezes at some point during the memory checks. No error message, nothing.

To me all this points more to an internal hardware error at RISER DIMM 3. Since refurbished riser boards are on the order of $120, it seems to me that I could replace the board without breaking the bank. Are there any concerns? Is it difficult to move the CPU?

Thanks, Lothar

8 Wizard

 • 

47K Posts

May 24th, 2016 07:00

The chart is not allowing for risers and RAM on the motherboard.

I am correct. 

We do need to differentiate T7400 from T7500.

There is no RISER #3 on a T7500. The T7500 max RAM is 192 Gig, where the additional 6 slots are on the 2nd CPU Riser Card. RAM is not Dual Channel, it's Tri Channel on the T7500, aka the white ears get RAM first. 1, 2 and 3 are not physically in order. The banks are codified by the White EARS being 1st.



16 Sockets (4 Sockets (2 banks of 2) per Riser)  16 X 8 =128 MAX

8 Sockets (4 banks of 2) (standard) YOU CANNOT MIX THESE.

There are 16 slots with the risers installed which eliminates the 5 6 7 8 SLOTS on the motherboard.

The cooling fan for the ram is also different which would cause issues.

The risers are not ronco set and forget and have power cables as well as daughterboards.




 


 

8 Wizard

 • 

47K Posts

May 24th, 2016 08:00

Test the memory by marking it and moving 3 pieces at a time to the motherboard with the riser installed.

Use White Ears first. So it's not 1 2 3, it's 1 3 5, then 2 4 6.

If the RAM doesn't fail in each SET of 3 then the riser could be a problem, but it could also be MISMATCHED CPU steppings.

25 Posts

May 24th, 2016 09:00

Thanks, SpeedStep, for the clarification. I did read the manual and positioned the memory (I strongly believe) correctly in DIMM1, DIMM2 and DIMM3, using the slots with white tabs only. I once had an incorrect setting and the BIOS complained about it, so it seems that the BIOS knows those things. I had the correct configuration and it failed. No matter which module I put in riser slot 3, it eventually crashes. This suggests that slot 3 has some problem and not the memory module, unless all of them are broken, or at least the 3-4 that I tried.

25 Posts

June 13th, 2016 17:00

Hi all,

It has been a while, since it took me some time to order the riser board. Recap: jobs crashed, and after restart, before boot, it says "Alert! Uncorrectable memory error has been detected in Riser DIMM 3 Strike F1 ...."

Well, I installed a new riser board bought directly from Dell and it happened again. Same error. Riser DIMM 3 seems at fault. Now I can do some more serious swapping of memory modules, but I kind of suspect it is not the module because it has been swapped a few times. I am a bit stumped, but then again I would not be writing this if I wasn't.

Other tests: I ran memtest86+ as a boot option and never encountered any problems.

When I get the "Alert!" message, it offers F5 to run diagnostics. When I do that, it freezes after a short while (the CPU bars are no longer moving), but I get no error message (or I don't know where to look for one).

Under Linux (CentOS 6.8), when I run "memtester 40G 5", it crashes completely after a while (dark screen, not rebootable with Ctrl+Shift+Del or so). But again I don't seem to get any additional information as to what is wrong. After the crash that memtester produced, a restart gives me the "Alert! Uncorrectable memory error...Riser DIMM 3" message.

For lack of other ideas: is it possible that the motherboard is broken?

Any ideas that could help diagnose the problem?

Thanks.
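
Since memtester and the F5 diagnostics both freeze without leaving a message, one place that might still hold a hint is the kernel's EDAC error counters, which can be polled while the failing job runs. This is only a sketch under assumptions: it assumes an EDAC driver for this chipset is loaded (not confirmed for this machine) and that it exposes the usual `ce_count` (corrected errors) and `ue_count` (uncorrected errors) files under /sys/devices/system/edac/mc. The counters start from zero at boot, so they only help if corrected errors show up before the freeze.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int or None, "ue": int or None}} per mc* directory."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # EDAC driver not loaded, or not supported on this platform
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # file missing or unreadable
        counts[mc.name] = entry
    return counts


for name, entry in read_edac_counts().items():
    print(name, "corrected:", entry["ce"], "uncorrected:", entry["ue"])
```

Running this in a loop in a second terminal, and logging the output to disk, would show whether corrected errors climb on one memory controller shortly before the crash. If the directory does not exist, the function returns an empty dict and nothing is printed.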

8 Wizard

 • 

47K Posts

June 14th, 2016 06:00

MISMATCHED CPU steppings.

What is the part number and "S" spec of each CPU?

25 Posts

June 14th, 2016 09:00

Hi,

  this is what I found:

part:

J131J 2 PROCESSOR, W5580, 3.2/6.4, 8MB, XDN, D0
From /proc/cpuinfo:
processor    : 15
vendor_id    : GenuineIntel
cpu family    : 6
model        : 26
model name    : Intel(R) Xeon(R) CPU           W5580  @ 3.20GHz
stepping    : 5
microcode    : 25
cpu MHz        : 1596.000
cache size    : 8192 KB
physical id    : 0
siblings    : 8
core id        : 3
cpu cores    : 4
apicid        : 7
initial apicid    : 7
fpu        : yes
fpu_exception    : yes
cpuid level    : 11
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips    : 6382.93
clflush size    : 64
cache_alignment    : 64
address sizes    : 40 bits physical, 48 bits virtual
power management:

Actually, I don't know what the "S" spec of each CPU means. Where can I find this information?
Thanks.
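
The excerpt above shows only one logical CPU (processor 15, physical id 0), so it cannot say whether the two sockets match. A rough way to compare them is to group all entries of /proc/cpuinfo by "physical id" and look at the stepping and microcode fields. The sketch below does that; the function names are made up for illustration. The S-spec itself is a code marked on the processor package and is not reported in /proc/cpuinfo, so this only compares what the kernel can see.

```python
def steppings_by_socket(cpuinfo_text):
    """Map each 'physical id' to the set of (model name, stepping, microcode) seen."""
    sockets = {}
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "physical id" not in fields:
            continue  # not a per-CPU block
        ident = (fields.get("model name"), fields.get("stepping"), fields.get("microcode"))
        sockets.setdefault(fields["physical id"], set()).add(ident)
    return sockets


def sockets_match(sockets):
    """True if every logical CPU on every socket reports the same identity."""
    seen = set()
    for idents in sockets.values():
        seen |= idents
    return len(seen) <= 1


try:
    with open("/proc/cpuinfo") as fh:
        info = steppings_by_socket(fh.read())
except OSError:
    info = {}  # not a Linux system, or /proc not mounted
for socket_id, idents in sorted(info.items()):
    print("socket", socket_id, sorted(idents, key=str))
print("sockets match:", sockets_match(info))
```

On a dual-CPU system this should list two sockets. Differing stepping or microcode values between them would support the mismatched-steppings idea; identical values would make it less likely.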

25 Posts

June 16th, 2016 15:00

Another thought: since it repeatedly singled out Riser DIMM 3, no matter what module and no matter what riser card, is it possible that it overheats? How would I check that, and is there a way to read the speed of the fan that cools the memory?

Lothar
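
On Linux, temperatures and fan speeds that the kernel knows about are exposed through the hwmon sysfs interface, so a first check could look like the sketch below. It is hedged on several points: whether this machine's memory fan or any DIMM temperature is exposed through hwmon at all is not known, and the right sensor driver has to be loaded. The code only reads the standard `temp*_input` (millidegrees Celsius) and `fan*_input` (RPM) files, looking both in hwmonN/ and in hwmonN/device/, because older kernels place the attributes in the latter.

```python
from pathlib import Path


def read_hwmon(root="/sys/class/hwmon"):
    """Return a list of (chip name, attribute, value, unit) for temps and fans."""
    readings = []
    base = Path(root)
    if not base.is_dir():
        return readings  # no hwmon support, or not a Linux system
    for chip in sorted(base.glob("hwmon*")):
        for directory in (chip, chip / "device"):
            if not directory.is_dir():
                continue
            name_file = directory / "name"
            name = name_file.read_text().strip() if name_file.is_file() else chip.name
            attrs = sorted(directory.glob("temp*_input")) + sorted(directory.glob("fan*_input"))
            for attr in attrs:
                try:
                    raw = int(attr.read_text().strip())
                except (OSError, ValueError):
                    continue  # sensor present but unreadable
                if attr.name.startswith("temp"):
                    readings.append((name, attr.name, raw / 1000.0, "C"))
                else:
                    readings.append((name, attr.name, float(raw), "RPM"))
    return readings


for chip_name, attr_name, value, unit in read_hwmon():
    print(chip_name, attr_name, value, unit)
```

Logging these readings every few seconds while the 16-thread job runs would show whether anything heats up or a fan stalls before the crash. If nothing is printed, no sensors are exposed this way on this system.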
