
1 Rookie

 • 

38 Posts


October 18th, 2014 01:00

Multiple K20 Kepler GPU cards incompatible with C6100 host server?

We have a Dell C410x PCI-e expansion chassis fitted with 3 x M2090 GPU cards and 2 x K20 cards. The host server is a C6100 fitted with two blades, with iPASS connections to the C410x so that the M2090 cards connect to one blade and the K20 cards connect to the other blade. All of this is working fine with Ubuntu 14.04 installed on the host blades.

Recently we purchased three more K20 GPU cards from Dell and fitted them alongside the existing two K20 cards. Although all 5 cards are listed when running the lspci utility, the nVidia driver does not load. After some experimentation I found that the kernel cannot allocate memory for the PCI-e BARs (base address registers) if more than 2 GPUs are installed - for the nVidia K20 cards, lspci -vv reports things like:

Region 0: Memory at c1000000 (32-bit, non-prefetchable)
        Region 1: Memory at <unassigned> (64-bit, prefetchable)
        Region 3: Memory at <unassigned> (64-bit, prefetchable)
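For anyone reproducing this, unassigned BARs can be picked out of a saved lspci dump with a quick grep. The snippet below fabricates a small sample dump (the file path and its contents are just illustrative) rather than querying real hardware:

```shell
# Create a sample of what 'lspci -vv' shows for a card whose BARs the
# kernel could not place (real output would come from the live system):
cat > /tmp/lspci-k20.txt <<'EOF'
Region 0: Memory at c1000000 (32-bit, non-prefetchable)
Region 1: Memory at <unassigned> (64-bit, prefetchable)
Region 3: Memory at <unassigned> (64-bit, prefetchable)
EOF

# Count regions with no assigned base address:
grep -c 'unassigned' /tmp/lspci-k20.txt
```

On the live system the dump would be produced with something like `lspci -vv > /tmp/lspci-k20.txt` first.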

and there are lots of errors in dmesg like:

[    0.573153] pci 0000:04:00.0: BAR 13: can't assign io (size 0xe000)
[    0.573236] pci 0000:05:08.0: BAR 14: can't assign mem (size 0x200000)
[    0.573320] pci 0000:05:08.0: BAR 15: can't assign mem pref (size 0x200000)

The nVidia driver reports

[   17.028050] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[   17.028050] NVRM: BAR1 is 0M @ 0x0 (PCI:0000:14:00.0)
[   17.028052] NVRM: The system BIOS may have misconfigured your GPU.
[   17.028056] nvidia: probe of 0000:14:00.0 failed with error -1
[   17.028241] NVRM: The NVIDIA probe routine failed for 1 device(s).
[   17.028243] NVRM: None of the NVIDIA graphics adapters were initialized!
[   17.028244] [drm] Module unloaded
[   17.028314] NVRM: NVIDIA init module failed!
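My rough mental model of the failure is simple window exhaustion: each card needs a fixed slab of MMIO address space for its BARs, and the BIOS only sets aside so much. The sketch below uses made-up, illustrative sizes (neither the window size nor the per-card figure is a measured C6100/K20 value), chosen only to show the shape of the problem:

```python
# Illustrative BAR-exhaustion arithmetic - all numbers are assumptions,
# not real C6100 or K20 values.
MMIO_WINDOW_MB = 768   # assumed usable 32-bit MMIO space below 4 GB
BAR_PER_GPU_MB = 288   # assumed total BAR space needed per card

for gpus in range(1, 6):
    needed = gpus * BAR_PER_GPU_MB
    status = "fits" if needed <= MMIO_WINDOW_MB else "cannot be assigned"
    print(f"{gpus} GPU(s): {needed} MB -> {status}")
```

With these invented numbers, two cards fit and a third overflows the window, which matches the ">2 GPUs fail" behaviour we are seeing.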

The BIOS on the C6100 blade was version 1.66 and I have updated it to the latest 1.71 version, but the problem remains.

There is another interesting thing: the original K20 GPU cards we bought last year have serial nos ending in 0009 and 0010 - the new ones we have just received have serial nos 0012, 0058 and 0064.
Now, the card with serial # 0012 works on its own or with either of the existing cards fitted, but cards 0058 and 0064 do not work even if fitted on their own - we are seeing the same error messages reported by dmesg and lspci as we get when fitting more than 2 GPU cards. I'm guessing these are later K20 cards that are different from the earlier ones?

It seems there is a limit to the number of K20s that can be used with the C6100 and/or compatibility issues with later cards. Before I return these GPUs to Dell, does anyone know of a solution?

Thanks,

Andy

12 Elder

 • 

6.2K Posts

October 18th, 2014 13:00

Hello Andy

Before I return these GPUs to Dell, does anyone know of a solution?

No, I don't have a solution. I did want to make sure that you are aware that this is not a validated configuration. It is very likely that the firmware on the new K20s is not the same as on the old K20s. That is likely why they are not working.

The K20s are not a validated GPGPU for the C6100. Only M series are validated for the C6100. Also, Ubuntu is not a validated OS for the C410x. I'm glad you were able to get everything working with the original setup. I just want you to be aware that this is not a validated configuration so that you understand getting these GPGPUs to function in this configuration will be trial and error.

Returning the K20s and ordering the same cards again will likely produce the same results. You may be able to flash the firmware on your new K20s to what the older K20s are running. I'm not sure if that is possible or how to do it though.

Thanks

1 Rookie

 • 

38 Posts

October 18th, 2014 16:00

Thanks for your very quick reply - I had a feeling the C6100 BIOS might be an issue here, as other servers often have BIOS options to configure pre-boot PCI resource allocation but the C6100 BIOS doesn't. The fact that our first two K20s worked 'out of the box' was sheer luck, I suppose.
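For the record, since the C6100 BIOS has no such option, the one kernel-side knob I may still experiment with is Linux's PCI resource reallocation switch. This is untested here and just a sketch, not a known fix:

```shell
# Untested idea: ask the kernel to redo PCI resource assignment itself
# rather than trusting the BIOS layout. In /etc/default/grub, add
# pci=realloc=on to the kernel command line, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc=on"
# then regenerate the GRUB config and reboot:
#   sudo update-grub && sudo reboot
```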

I'll see on Monday if we can exchange these K20 GPGPUs for M2090 GPGPUs.

Thanks for your help,

Andy

12 Elder

 • 

6.2K Posts

October 18th, 2014 16:00

I'll see on Monday if we can exchange these K20 GPGPUs for M2090 GPGPUs.

You should also consider one other issue that you may encounter. My documentation states that mixing GPGPU models from a C410x to a single host is not validated. You may encounter issues mixing your current K20s with M2090s on the same module. If it is possible, I would plan on putting the new M2090s on a different module if one is available.

Thanks

1 Rookie

 • 

38 Posts

October 18th, 2014 17:00

Yes, I knew it was a bad idea to mix M-series and K-series GPGPUs on the same bus - our existing two K20s are on one C410x bus attached to one C6100 host, and we have three M2090s on another C410x bus attached to a second C6100 host. Hopefully I can return the K20s and fit 3 replacement M2090s alongside the three we already have.

Longer term, I will be looking into which servers have been validated as compatible with K20 GPGPUs in the C410x, as this would allow us to add more K20s in the future.

Thanks for your help & advice,

Andy
