BIOS characterization for HPC with Intel Cascade Lake processors



Article written by Varun Bawa, Savitha Pareek, and Ashish K Singh of the HPC and AI Innovation Lab in April 2019



With the release of the 2nd Generation Intel® Xeon® Scalable processors (architecture codenamed "Cascade Lake"), Dell EMC has updated its PowerEdge 14th generation servers to take advantage of the increased core counts and higher memory speeds, both of which benefit HPC applications.

This blog presents the first set of results and discusses the impact of the different BIOS tuning options available on the Dell EMC PowerEdge C6420 with the latest Intel Xeon® Cascade Lake processors for some HPC benchmarks and applications. A brief description of the Cascade Lake processor, the BIOS options, and the HPC applications used in this study is provided below.

Cascade Lake is Intel's successor to Skylake. The Cascade Lake processor supports up to 28 cores and six DDR4 memory channels with speeds up to 2933 MT/s. Like Skylake, Cascade Lake provides additional vectorization power through the AVX-512 instruction set, allowing 32 double-precision FLOP/cycle. Cascade Lake introduces the Vector Neural Network Instructions (VNNI), which accelerate AI and deep learning workloads such as image classification, speech recognition, language translation, and object detection. VNNI also supports 8-bit integer instructions to accelerate inference performance.

Cascade Lake includes hardware mitigations for some side-channel vulnerabilities. This is expected to improve performance on storage workloads; look for future studies from the Innovation Lab.

Because Skylake and Cascade Lake are socket compatible, the processor tuning knobs exposed in the system BIOS are similar across the two processor generations. The following BIOS tuning options were explored in this study, similar to work published in the past on Skylake.

Processor settings:

  • Adjacent Cache Line Prefetch: This mechanism enables automatic hardware prefetching without programmer intervention. When enabled, the processor treats two 64-byte cache lines as a 128-byte sector and fetches the adjacent line, regardless of whether that additional line has been requested.
  • Software Prefetcher: Software prefetching avoids stalls by loading data into the cache before it is needed. For example, an L2 prefetch instruction can bring data from main memory into the L2 cache far ahead of its use, and an L1 prefetch instruction can then move it from L2 to L1 just before use. When this option is enabled, the processor prefetches an extra cache line for every memory request.
  • SNC (Sub-NUMA Cluster): Enabling SNC is akin to splitting the single socket into two NUMA domains, each with half the physical cores and half the memory of the socket. If this sounds familiar, it is similar in utility to the Cluster-on-Die option that was available in Intel Xeon E5-2600 v3 and v4 processors. SNC is implemented differently from COD, and these changes improve remote socket access in Cascade Lake compared to previous generations that used the Cluster-on-Die option. At the operating system level, a dual-socket server with SNC enabled will display four NUMA domains. Two of the domains will be closer to each other (on the same socket), and the other two will be a larger distance away, across the UPI link to the remote socket. This can be seen using OS tools such as numactl -H, and is illustrated in Figure 1.
NUMA nodes layout
Figure 1: NUMA nodes Layout
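The four-domain layout described above can be confirmed with numactl -H. Below is a minimal Python sketch that parses numactl -H style output into a per-node CPU count; the sample text is hypothetical output for a dual-socket, 20-core-per-socket server with SNC enabled (the CPU numbering and memory sizes are illustrative assumptions, not measured values).

```python
import re

# Hypothetical output resembling `numactl -H` on a dual-socket server
# with SNC enabled: four NUMA nodes, ten cores each.
sample = """\
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 47820 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 48379 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29
node 2 size: 48379 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39
node 3 size: 48379 MB
node distances:
node   0   1   2   3
  0:  10  11  21  21
  1:  11  10  21  21
  2:  21  21  10  11
  3:  21  21  11  10
"""

def parse_numa_nodes(text):
    """Return {node_id: cpu_count} from numactl -H style output."""
    nodes = {}
    for m in re.finditer(r"node (\d+) cpus: ([\d ]+)", text):
        nodes[int(m.group(1))] = len(m.group(2).split())
    return nodes

nodes = parse_numa_nodes(sample)
print(nodes)  # four NUMA domains, ten cores each with SNC enabled
```

Note how the distance matrix shows two pairs of close nodes (distance 11, same socket) and larger distances (21) across the UPI link, matching the layout in Figure 1.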

System profiles:

The system profiles are meta options that, in turn, set multiple performance and power management focused BIOS options such as Turbo mode, C-states, C1E, P-state management, and uncore frequency. The different system profiles compared in this study include:
  • Performance
  • PerformancePerWattDAPC
  • PerformancePerWattOS
We used two HPC benchmarks and two HPC applications to understand the impact of these BIOS options on Cascade Lake performance. The server configuration and the HPC applications used for this study are described in Table 1 and Table 2.
Application | Domain | Version | Benchmark
High Performance Linpack (HPL) | Computation: solves a dense system of linear equations | From Intel MKL 2019 Update 1 | Problem sizes of 90%, 92% and 94% of total memory
STREAM | Memory bandwidth | 5.4 | Triad
WRF | Weather Research and Forecasting | 3.9.1 | Conus 2.5km
ANSYS® Fluent® | Fluid dynamics | 19.2 | Ice_2m, Combustor_12m, Aircraft_wing_14m, Exhaust_System_33m

Table 1: Applications and Benchmarks

Component | Details
Server | Dell EMC PowerEdge C6420
Processor | Intel® Xeon® Gold 6230 CPU @ 2.1 GHz, 20 cores
Memory | 192 GB – 12 x 16 GB DDR4 @ 2933 MT/s
Operating System | Red Hat Enterprise Linux 7.6
Kernel | 3.10.0-957.el7.x86_64
Compiler | Intel Parallel Studio Cluster Edition 2019 Update 1

Table 2: Server Configuration

All the results shown here are based on single-server tests; cluster-level performance will be bounded by single-server performance. The following metrics were used to compare performance:
  • Stream – Triad score as reported by the stream benchmark.
  • HPL – GFLOP/second.
  • Fluent - Solver rating as reported by Fluent.
  • WRF – Average time step computed over the last 719 intervals for Conus 2.5km.
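As an illustration of the WRF metric, the sketch below averages the per-step elapsed times that WRF prints to its rsl.out log. The three-line log excerpt is a hypothetical example (the real Conus 2.5km run produces 719 such intervals), and the exact line format is assumed for illustration.

```python
import re

# Hypothetical excerpt from a WRF rsl.out log; the real file contains
# one "Timing for main" line per model time step.
log = """\
Timing for main: time 2001-10-25_03:00:00 on domain 1: 4.10 elapsed seconds
Timing for main: time 2001-10-25_03:00:15 on domain 1: 3.95 elapsed seconds
Timing for main: time 2001-10-25_03:00:30 on domain 1: 4.05 elapsed seconds
"""

def average_timestep(text, last_n=719):
    """Average the elapsed seconds of the last `last_n` main time steps."""
    times = [float(m.group(1)) for m in re.finditer(
        r"Timing for main:.*?([\d.]+) elapsed seconds", text)]
    tail = times[-last_n:]
    return sum(tail) / len(tail)

print(round(average_timestep(log), 3))  # mean of the sample steps
```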

Benchmarks and Application Results

Graph notation abbreviations:

System Profiles:

  • Perf – Performance
  • OS – PerformancePerWattOS
  • DAPC – PerformancePerWattDAPC

Sub-NUMA Clustering: SNC = 0 (Disabled); SNC = 1 (Enabled, shown striped in the graphs)
SW – Software Prefetcher: SW = 0 (Disabled); SW = 1 (Enabled)

High Performance Linpack
Figure 2: High Performance Linpack

Figure 2 compares HPL results at a problem size of 90% of total memory (N = 144476) across the different BIOS options. The graph plots the absolute GFLOP/s obtained while running HPL under each BIOS configuration on the y-axis; higher is better.
Below are the observations from the graph:
  • Less than 1% difference in HPL performance due to software prefetch.
  • No major effect of SNC on HPL Performance (0.5% better with SNC=Disabled).
  • The Performance system profile is up to 6% better than the OS and DAPC profiles.
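A common rule of thumb for sizing HPL is to choose N so that the N×N double-precision matrix fills the target fraction of memory, rounded down to a multiple of the block size NB. The sketch below illustrates this; NB = 384 is an assumption, and the result is slightly larger than the published N = 144476, which presumably leaves additional headroom for the OS, so treat this as an approximation rather than the exact sizing used in the study.

```python
import math

def hpl_problem_size(mem_bytes, fraction, nb=384):
    """Rule-of-thumb HPL problem size: the NxN double-precision matrix
    should fill `fraction` of total memory (8 bytes per element),
    rounded down to a multiple of the block size NB."""
    n = math.isqrt(int(fraction * mem_bytes / 8))
    return (n // nb) * nb

# 192 GB on the dual-socket node, 90% memory fill
print(hpl_problem_size(192 * 1024**3, 0.90))
```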
Stream
Figure 3: Stream

Figure 3 compares STREAM results across the different BIOS configurations.
The graph plots the memory bandwidth in GB/s obtained while running STREAM Triad on the y-axis; higher is better. The BIOS configurations are plotted on the x-axis.
Below are the observations from the graph:
  • Up to 3% better memory bandwidth with SNC=enabled.
  • Not much deviation in performance due to Software prefetch on STREAM memory bandwidth.
  • No deviation across System Profiles.
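For reference, the Triad kernel that STREAM times is a simple scaled vector addition over three large arrays. The NumPy sketch below mimics it; the real STREAM benchmark is compiled C/Fortran with OpenMP threading, and the temporary array NumPy allocates adds extra traffic, so treat the reported figure as a rough single-threaded approximation, well below the full-system numbers in this study.

```python
import time
import numpy as np

# STREAM-Triad-style measurement (a sketch, not the real benchmark).
n = 10_000_000
a = np.empty(n)
b = np.random.rand(n)
c = np.random.rand(n)
scalar = 3.0

t0 = time.perf_counter()
np.add(b, scalar * c, out=a)   # triad: a[i] = b[i] + scalar * c[i]
elapsed = time.perf_counter() - t0

# STREAM counts 3 arrays of 8-byte doubles per iteration (2 reads + 1 write).
gbs = 3 * 8 * n / elapsed / 1e9
print(f"Triad bandwidth: {gbs:.1f} GB/s")
```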
Memory Bandwidth - SNC
Figure 4: Memory Bandwidth – SNC

Figure 4 plots the STREAM Triad memory bandwidth with SNC enabled. The full system memory bandwidth is ~220 GB/s. When the 20 cores on one socket access their local memory, the bandwidth is ~109 GB/s, half of the full-system bandwidth. Half of that again, ~56 GB/s, is the memory bandwidth of 10 threads on a single NUMA node accessing their own local memory; bandwidth is lower when threads on one NUMA node access memory belonging to the other NUMA node on the same socket. There is a 42% drop in memory bandwidth, to ~33 GB/s, when the threads access remote memory across the UPI link on the remote socket. This tells us there is a significant bandwidth penalty in SNC mode when data is not local.
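Using the rounded GB/s figures quoted above, the remote-socket penalty can be recomputed directly; the small difference from the stated 42% presumably comes from the unrounded measurements.

```python
# Rounded STREAM Triad figures from the SNC discussion above (GB/s).
full_system = 220.0   # all cores, all memory
one_socket = 109.0    # 20 cores, local socket memory
local_node = 56.0     # 10 threads, own NUMA node
remote_socket = 33.0  # 10 threads, memory across the UPI link

drop_remote = (local_node - remote_socket) / local_node * 100
print(f"Remote-socket penalty: {drop_remote:.0f}%")  # ~41-42% with rounding
```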

WRF
Figure 5: WRF

Figure 5 compares WRF results across the different BIOS options; the dataset used is Conus 2.5km with the default "namelist.input" file.
The graph plots the absolute average time step in seconds obtained while running the Conus 2.5km dataset under each BIOS configuration on the y-axis; lower is better. The BIOS configurations are plotted on the x-axis.
Below are the observations from the graph:
  • 2% better performance with SNC=Enabled.
  • No performance difference for Software Prefetch Enabled vs Disabled.
  • The Performance profile is 1% better than the PerformancePerWattDAPC profile.
Figures 6 and 7: ANSYS Fluent (Ice_2m and Combustor_12m)
Figures 8 and 9: ANSYS Fluent (Aircraft_wing_14m and Exhaust_System_33m)
Figures 6 through 9 plot the Solver Rating obtained while running Fluent with the Ice_2m, Combustor_12m, Aircraft_wing_14m, and Exhaust_System_33m datasets, respectively. The Solver Rating is plotted on the y-axis; higher is better. The BIOS configurations are plotted on the x-axis.
Below are the overall observations from the above graphs:
  • Up to 4% better performance with SNC=Enabled.
  • No effect of Software Prefetch on performance.
  • Up to 2% better performance with Performance profile compared with DAPC and OS profiles.
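For context on the metric in these graphs, Fluent's solver rating corresponds to the number of benchmark jobs that fit in a 24-hour day (86400 seconds divided by the wall-clock time of one run). A minimal sketch, with a hypothetical 432-second run time:

```python
def solver_rating(seconds_per_run):
    """Fluent solver rating: benchmark jobs per 24-hour day.
    Higher is better."""
    return 86400.0 / seconds_per_run

# A hypothetical benchmark run taking 432 s of wall-clock time
print(solver_rating(432.0))  # → 200.0
```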

Conclusion

In this study, we evaluated the impact of different BIOS tuning options on performance when using the Intel Xeon Gold 6230 processor. Observing the performance of different BIOS options across different benchmarks and applications, the following is concluded:
  • Software Prefetch has no significant performance impact on the datasets that were tested. Thus, we recommend leaving the Software Prefetcher at its default setting, Enabled.
  • SNC=Enabled gives a 2-4% performance increase in Fluent and STREAM and approximately 1% in WRF compared with SNC=Disabled. Thus, we recommend enabling SNC to attain better performance.
  • The Performance profile is 2-4% better than PerformancePerWattDAPC and PerformancePerWattOS. Thus, we recommend the Performance profile for HPC.
It is recommended that Hyper-Threading be turned off for general-purpose HPC clusters. Depending on the applications used, the benefit of this feature should be tested and enabled as appropriate.

Not discussed in this study is a memory RAS feature called Adaptive Double DRAM Device Correction (ADDDC), which is available when a system is configured with memory that uses x4 DRAM organization (32GB, 64GB DIMMs). ADDDC is not available when a system has x8-based DIMMs (8GB, 16GB) and is immaterial in those configurations. For HPC workloads, it is recommended that ADDDC be set to Disabled when available as a tunable option.



Article ID: SLN316864

Last Date Modified: 04/23/2019 12:16 PM

