June 17th, 2021 06:00
R740 CPU overloading with real-time scientific floating-point calculations
Greetings,
I am the simulator hardware and configuration engineer at a nuclear power plant, and am tasked with maintaining the various computer and network systems as well as the physical interfaces of a nuclear reactor training simulator. I am currently struggling with a performance issue related to ongoing software upgrades. Apologies in advance for the lengthy post, but I wanted to be as thorough as possible.
Our infrastructure begins with a primary and a backup R740 server, each with two Xeon Gold 6136 12-core 3.0 GHz processors, 96 GB DDR4-2666, and 6x SSD in a striped AHCI configuration, running Server 2016. At any given time, one server is connected to a multitude of TCP/IP-based instrumentation interfaces as well as a GE-Fanuc VMIS system that manages physical inputs and outputs (switches, lights, meters, etc.). The other server is attached to the network but carries no active simulation load; it serves as both a redundant backup and a development test bed. The software is a simulator suite that performs real-time scientific floating-point calculations to mimic the various systems associated with nuclear reactors. Each system model runs in a separate thread, and an executive program verifies the calculation timing of each model and coordinates variable data transfer to the various I/Os. A complete calculation cycle runs 4 times per second, as a balance between calculation loading and operational realism. The executive program can also monitor processor demand above 100% and flag and count incomplete calculation cycles. Because of regulatory requirements, the maximum acceptable number of incomplete calculation cycles is zero.
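For readers unfamiliar with this kind of executive, the frame-timing check described above can be sketched roughly as follows. This is a minimal illustration only, not the vendor's implementation: `run_frames`, `demand_percent`, and the `step` callback are hypothetical names, and it assumes "demand" means a model's compute time expressed as a percentage of the 250 ms frame period.

```python
import time

FRAME_PERIOD_S = 0.25  # 4 calculation cycles per second, as stated in the post


def demand_percent(elapsed_s, frame_period_s=FRAME_PERIOD_S):
    """Compute time as a percentage of the frame budget (assumed definition)."""
    return 100.0 * elapsed_s / frame_period_s


def run_frames(step, n_frames, frame_period_s=FRAME_PERIOD_S,
               clock=time.perf_counter, sleep=time.sleep):
    """Run `step` once per frame; count frames that exceed the frame budget.

    `step` is a hypothetical stand-in for one model's calculation cycle.
    Returns (number of incomplete frames, peak demand in percent).
    """
    overruns = 0
    peak = 0.0
    for _ in range(n_frames):
        start = clock()
        step()
        elapsed = clock() - start
        peak = max(peak, demand_percent(elapsed, frame_period_s))
        if elapsed > frame_period_s:
            overruns += 1  # incomplete cycle: the model missed its deadline
        else:
            sleep(frame_period_s - elapsed)  # idle until the next frame starts
    return overruns, peak


if __name__ == "__main__":
    # Tiny demo with a trivial model and a short frame so it finishes quickly.
    print(run_frames(lambda: sum(range(1000)), n_frames=2, frame_period_s=0.01))
```

Under this definition, any frame with demand above 100% counts as an incomplete cycle, which is why the regulatory target of zero missed cycles requires peak demand, not average demand, to stay below 100%.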
During previous optimization tests, we determined that max performance was found with 6 of the 12 cores per CPU enabled and logical CPUs disabled.
We are currently coordinating with our software vendor to upgrade our thermal-hydraulic model to one that more accurately encompasses accident-range scenarios, and have run into a single-thread bottleneck in the processing of the upgraded model. On initial testing, the thermal-hydraulic model demanded roughly 265% processor load on the individual core running it. For data collection purposes, I re-enabled logical cores, at which point demand rose to 350%. We were able to run the model successfully on a development workstation with a single 4-core 3.6 GHz i7 processor and 16 GB DDR4-2400, running Windows 10 on NVMe, but given our existing infrastructure, including the TCP/IP connections, a Win10 workstation is not an option.
Based on the successful workstation run, I updated the BIOS to 2.11.2, then upgraded the CPUs to 8-core Xeon Gold 6244 processors at 3.6 GHz, and I have been massaging performance-oriented BIOS settings. I have gotten CPU demand down to 105% at 6 cores per CPU with logical processors disabled, RAM optimized, MWait disabled, all prefetchers enabled, turbo enabled, C-states disabled, and power management set to performance, but the model is still missing calculations and I have not been able to find any way to reduce demand further.
Based on concurrent data from Performance Monitor, it does not appear that RAM, pagefile, power, or cooling is the issue, and based on the successful workstation runs I believe reaching 90-95% core demand with no data slips is still possible on the server. While I understand my use case is fairly unusual, I imagine other users running high-performance scientific applications have experienced and overcome bottlenecks on this or a similar platform, and might be able to shed some light on something I am likely overlooking. Any help or suggestions would be greatly appreciated.
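To put the figures quoted above in perspective, the short calculation below converts each demand percentage into per-frame compute time and estimates how much the remaining gap is. It is only a sketch: it assumes demand is measured against the 250 ms frame implied by 4 cycles per second, and `compute_ms` and `required_reduction` are hypothetical helper names.

```python
FRAME_MS = 250.0  # one frame at 4 calculation cycles per second


def compute_ms(demand_pct, frame_ms=FRAME_MS):
    """Per-frame compute time implied by a demand percentage (assumed definition)."""
    return frame_ms * demand_pct / 100.0


def required_reduction(current_pct, target_pct):
    """Fraction by which per-frame compute time must shrink to reach the target."""
    return 1.0 - target_pct / current_pct


# Demand figures quoted in the post.
for label, pct in [("initial test, 6136", 265.0),
                   ("logical cores enabled", 350.0),
                   ("6244 after BIOS tuning", 105.0)]:
    print(f"{label}: {pct:.0f}% -> {compute_ms(pct):.1f} ms per 250 ms frame")

for target in (95.0, 90.0):
    cut = required_reduction(105.0, target)
    print(f"105% -> {target:.0f}% needs about {100 * cut:.1f}% less compute time")
```

Read this way, the remaining shortfall is on the order of 10 to 15 percent of single-thread compute time, far smaller than the original gap at 265%.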
Thanks in advance,
Buck


DELL-Charles R
Moderator
June 17th, 2021 11:00
Hello Buck
My first recommendation would be to make sure the system board management firmware is current: BIOS, CPLD, and iDRAC.
The BIOS you have, 2.11.2, is current.
CPLD Version 1.1.4
https://dell.to/3wvJV9n
Note: the CPLD update should be installed alone. See Important Information at the bottom of the page.
iDRAC 4.40.40.00
https://dell.to/3gvPtv0
Are you seeing any errors or faults in the System Event Log or the Lifecycle Controller Log in the iDRAC?
Try these System Setup selections.
Processor Settings screen: https://dell.to/3gzn2wb
System Profile Settings screen: https://dell.to/3q2zgRf
In the Processor Settings, set the Number of Cores per Processor to the desired value.
In the Processor Settings, set Dell Controlled Turbo to Enabled.
Set the System Profile option in BIOS to Custom mode.
Set CPU Power Management to Maximum Performance mode.
Set the Turbo Boost mode to Enabled.
Please let me know how it goes.