Benchmarking Intel Systems and Understanding the Results
By Cornell Theory Center (Issue 4 2001)
The Cornell Theory Center recently ran a set of popular high-performance computing benchmarks for the latest Xeon and Itanium processor-based systems. This article presents the results, which can provide relevant insight for determining which systems will best suit the needs of high-performance applications.
Over the past several years, Cornell Theory Center (CTC) has implemented a desktop to high-performance computing (HPC) environment based on Dell® systems and Microsoft® Windows® 2000. CTC has gained experience with a variety of Intel-based architectures and recently ran benchmarks for the latest Intel® Pentium® III, Xeon™, and Itanium™ processor-based systems. These popular benchmarks, adopted by the high-performance computing community, include STREAM, the NAS Parallel Benchmarks, and Linpack.
The CTC wanted to demonstrate the applicability of these systems to various high-performance applications that exhibit performance and memory requirements comparable to the benchmarks. The results can provide relevant insight for determining which systems will best suit the needs of these applications today and as the systems evolve.
Appropriately applying benchmarks
Benchmarks are an important factor when evaluating systems, and they are most effective when appropriately applied. When applied in different ways, benchmarks can support opposing viewpoints. The characteristics of the application should determine which set of benchmarks to apply to the analysis.
Three key system components are the memory interface, the processor's integer arithmetic units, and its floating-point units. The following sections show results for three benchmarks, each of which highlights particular strengths and weaknesses of the test systems with respect to one of these components. Because a real application is an intricate interplay of these and other components, only limited conclusions can be drawn about application performance. The soundness of those conclusions depends entirely on how well both the application and the benchmarks are understood.
For each platform, the benchmarks were compiled using the following compiler options:
- Pentium III: icl/ifl -Ox -G6 -QxK
- Xeon: icl/ifl -Ox -G7 -QxW
- Itanium: ecl/efl -O3
Configuring the test environment
The three Intel processor-based systems used in the tests are the basis for current and future systems at CTC. CTC has several Pentium III-based HPC clusters, and the center plans to implement both Xeon and Itanium processor-based systems for the specific applications that can leverage their relative advantages.
CTC assembled the computer systems "out-of-the-box" as provided by the vendors for testing and did not customize them for application-specific performance. The following sections outline the test configurations as well as the tools that were used to build the benchmarks.
Hardware
The names in the first column of Figure A (Pentium III, Itanium, and Xeon) will identify the configurations throughout the article.
Operating systems
- Microsoft Windows 2000 Advanced Server SP2 (Pentium III, Xeon)
- Microsoft Windows Whistler Advanced Server, 64-bit Edition, Build 2462 (Itanium)
Runtime environment
- MPI Software Technology MPI/Pro® 1.6.3 (Pentium III, Xeon)
- MPI Software Technology MPI/Pro 1.6.3, 64-bit Edition (Itanium)
Compilers
- Intel OpenMP FORTRAN Compiler for 32-bit applications. Version 5.0.1, Build 010727Z
- Intel OpenMP C++ Compiler for 32-bit applications. Version 5.0.1, Build 010727Z
- Intel OpenMP FORTRAN Compiler for Itanium-based applications. Version 5.0.1, Build 20010529
- Intel OpenMP C++ Compiler for Itanium-based applications. Version 5.0.1, Build 20010529
Figure A. Hardware used for benchmark tests
Testing memory interface
Intel Pentium III processor-based systems have demonstrated a memory bottleneck in symmetric multiprocessing (SMP) systems with two, four, or eight processors. A single processor can completely utilize the front-side bus of these systems. As the number of processors in a particular system increases, all processors must share access to the front-side bus. In memory-intensive applications, processors remain idle while waiting for their memory requests to be satisfied. For this reason, Pentium III-based HPC clusters often use single- or dual-processor systems. Xeon and Itanium processor-based systems have been designed to address this limitation.
STREAM is a standard benchmark that measures sustainable memory bandwidth (in megabytes per second) and the corresponding computation rate for simple vector kernels. The benchmark comes in two versions: serial (single-threaded) and parallel (multithreaded with OpenMP). Threads are lightweight execution contexts within a process that allow the process to run in parallel on an SMP system. The operating system handles context switching between threads. Windows 2000 has exceptionally good thread support.
Figures 1 and 2 show the results for the serial and parallel STREAM benchmarks. For all three systems, the sustained memory bandwidth reported by STREAM is roughly half of the advertised peak (read-only) memory bandwidth. As expected, the Xeon shows the highest bandwidth. The number of threads does not seem to affect sustained memory bandwidth. In other words, a single CPU alone already saturates the memory bus for all three systems (with a slight decrease for the Xeon).
Figure 1. STREAM serial
Figure 2. STREAM parallel OpenMP
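To make concrete what a STREAM-style bandwidth figure counts, the following is a minimal sketch of the triad kernel and its byte accounting. It is not the STREAM source: the function names and the array size are hypothetical, and a pure Python loop is limited by the interpreter rather than by the memory bus, so the printed number only illustrates the bookkeeping (three 8-byte words moved per element, reported in units of 10^6 bytes per second).

```python
import time


def triad(a, b, c, scalar):
    """STREAM-style triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bytes(n, word_bytes=8):
    """Bytes counted for one triad pass: two reads and one write per element."""
    return 3 * n * word_bytes


def bandwidth_mb_per_s(n, seconds, word_bytes=8):
    """Bandwidth in MB/s (1 MB = 10**6 bytes) for one triad pass."""
    return triad_bytes(n, word_bytes) / seconds / 1.0e6


if __name__ == "__main__":
    n = 200_000  # hypothetical array length, chosen only to keep the run short
    a = [0.0] * n
    b = [1.0] * n
    c = [2.0] * n

    start = time.perf_counter()
    triad(a, b, c, 3.0)
    elapsed = time.perf_counter() - start

    # Interpreter-bound: this is not a valid hardware bandwidth measurement.
    print(f"triad: {bandwidth_mb_per_s(n, elapsed):.1f} MB/s")
```

A real measurement would use a compiled kernel, arrays much larger than the caches, and the best time over several repetitions.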
Testing integer performance
The NAS Parallel Benchmarks (NPB) help evaluate the performance of parallel supercomputers. The benchmarks are derived from computational fluid dynamics (CFD) applications widely used in automotive, aircraft, and spacecraft design, as well as weather prediction and combustion modeling.
The benchmarks consist of eight programs: five kernels and three pseudo-applications. The NAS integer sort (IS) kernel ranks a large array of small integers as fast as possible. For each kernel, the NAS benchmarks specify five classes (S, W, A, B, and C) of increasing workloads. Class S is intended for testing, classes W and A typically run on desktop workstations, and classes B and C are intended for multiprocessor systems.
The array to be sorted has 2^20 and 2^23 elements for classes W and A, respectively. Figures 3 and 4 show the serial and parallel results for the test configurations. The time (y) axis is expressed in a logarithmic scale, which shows that Xeon is significantly faster. Both Xeon and Pentium III are faster than Itanium for NAS IS. Itanium processors are optimized for floating-point calculations.
Figure 3. NAS IS serial
Figure 4. NAS IS parallel message-passing interface (MPI)
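The kind of work the IS kernel exercises can be sketched as a counting-based ranking of small integer keys. This is an illustrative sketch only, not the NPB implementation: the function name, the key range, and the demo size are assumptions. The class sizes in the dictionary are the ones quoted above.

```python
import random

# Array sizes quoted in the article for NAS IS classes W and A.
CLASS_KEYS = {"W": 2 ** 20, "A": 2 ** 23}


def rank_keys(keys, max_key):
    """Return ranks[i], the position keys[i] would take in sorted order.

    Keys must be integers in [0, max_key). Ties are ranked in input order.
    """
    # Histogram of key values.
    counts = [0] * max_key
    for k in keys:
        counts[k] += 1

    # Exclusive prefix sum: starts[k] = number of keys smaller than k.
    starts = [0] * max_key
    total = 0
    for k in range(max_key):
        starts[k] = total
        total += counts[k]

    # Assign each key the next free slot for its value.
    ranks = [0] * len(keys)
    for i, k in enumerate(keys):
        ranks[i] = starts[k]
        starts[k] += 1
    return ranks


if __name__ == "__main__":
    rng = random.Random(0)
    max_key = 1 << 10   # hypothetical key range
    n = 1 << 12         # hypothetical demo size, far smaller than class W
    keys = [rng.randrange(max_key) for _ in range(n)]

    ranks = rank_keys(keys, max_key)
    ordered = [0] * n
    for i, r in enumerate(ranks):
        ordered[r] = keys[i]
    print("sorted correctly:", ordered == sorted(keys))
```

The loops contain only integer increments, comparisons, and indexed loads and stores, which is why a kernel of this shape stresses integer and memory performance rather than the floating-point units.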
Testing floating-point performance
Linpack is a class of standard floating-point benchmarks that measure a computer's floating-point rate of execution by running a computer program that solves a dense system of linear equations. Many high-performance computing applications, such as dense linear equation solving commonly used in computational fracture mechanics, are floating-point intensive.
Figure 5 shows results for the High-Performance Linpack (HPL) multiprocessor implementation of Linpack. The Itanium system far outperforms its competitors and scales well. The Itanium's large Level 3 caches and issue rate of four floating-point operations (two fma instructions) per cycle differentiate it from the other processors.
Figure 5. HPL 10,000
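The arithmetic behind a Linpack-style rate can be sketched as follows. The sketch uses only the leading-order operation count for an LU-based dense solve, (2/3)n^3, and omits lower-order terms. The four operations per cycle come from the issue rate stated above; the clock rate and the run time are placeholders, not measured or quoted values, and the function names are hypothetical.

```python
def lu_flops(n):
    """Leading-order flop count for an LU-based solve of a dense n x n system."""
    return (2.0 / 3.0) * n ** 3


def gflops(n, seconds):
    """Achieved rate in Gflop/s for a solve of order n taking `seconds`."""
    return lu_flops(n) / seconds / 1.0e9


def peak_gflops(clock_ghz, flops_per_cycle):
    """Theoretical per-processor peak: clock rate times flops issued per cycle."""
    return clock_ghz * flops_per_cycle


if __name__ == "__main__":
    n = 10_000               # problem size used for the HPL runs in Figure 5
    clock_ghz = 1.0          # placeholder clock rate (assumption)
    flops_per_cycle = 4      # two fma instructions per cycle
    seconds = 600.0          # placeholder run time (assumption)

    achieved = gflops(n, seconds)
    peak = peak_gflops(clock_ghz, flops_per_cycle)
    print(f"work:     {lu_flops(n):.3e} flops")
    print(f"achieved: {achieved:.3f} Gflop/s")
    print(f"fraction of peak: {achieved / peak:.1%}")
```

Comparing the achieved rate with the peak gives a rough measure of how well an implementation keeps the floating-point units busy.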
The results for the classic Linpack1000 demonstrate a benchmark's potential to be misleading. The results in Figure 6 show no overwhelming Linpack performance advantage of the Itanium over Xeon. It is not the different problem sizes (10,000 in Figure 5 and 1,000 in Figure 6), but rather the different underlying implementations of the matrix factorization in HPL and Linpack1000 that cause such a dramatic shift in performance.
Figure 6. Linpack1000
Linpack1000 is a vectorized version of Linpack, and its performance is completely determined by DAXPY operations. Linpack1000 resembles STREAM in this respect. Cache utilization is poor for Linpack1000. HPL, on the other hand, is a blocked implementation that, by design, attempts to optimize cache utilization.
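The contrast between a DAXPY-bound code and a blocked one can be illustrated by counting floating-point operations per word of memory traffic. The sketch below is an idealized model, not code from Linpack1000 or HPL: the function names and the block size are hypothetical, and the blocked estimate assumes that three b x b blocks fit in cache for the duration of one block update.

```python
def daxpy(alpha, x, y):
    """DAXPY: y <- alpha * x + y, updated in place."""
    for i in range(len(x)):
        y[i] += alpha * x[i]


def daxpy_flops_per_word(n):
    """2n flops (multiply and add) over 3n words (load x, load y, store y)."""
    return (2 * n) / (3 * n)


def blocked_update_flops_per_word(b):
    """Block update C <- C - A*B on b x b blocks held in cache.

    2*b**3 flops over 4*b**2 words (load A, B, C; store C).
    """
    return (2 * b ** 3) / (4 * b ** 2)


if __name__ == "__main__":
    y = [1.0, 2.0]
    daxpy(2.0, [3.0, 4.0], y)
    print("daxpy result:", y)

    b = 64  # hypothetical block size
    print(f"daxpy:   {daxpy_flops_per_word(1000):.2f} flops per word")
    print(f"blocked: {blocked_update_flops_per_word(b):.2f} flops per word")
```

In this model the DAXPY ratio stays fixed at 2/3 regardless of vector length, so its speed follows memory bandwidth, while the blocked ratio grows with the block size, which lets a large cache and a high floating-point issue rate pay off.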
Look beyond the numbers
Benchmarks, while not real applications, do capture important measurable characteristics that are often kernels of real applications. Industry-standard HPC benchmarks can help assess performance characteristics of various systems, especially when planning deployment of a new system for a specific application. A system that supports several applications, particularly ones that exhibit different requirements, will involve more detailed analysis.
It is important to understand how benchmarks are implemented with respect to the particular targeted application. Otherwise, benchmarks can be very misleading.
Supporting tools, such as compilers, also must be considered. For example, the benchmarks for the Itanium use recently developed compilers subject to ongoing improvement. In the case of an Itanium processor-based system, the compiler exercises full control over the processor's resources. This architecture has no runtime instruction-level parallelism (ILP) detection. Therefore, advances in compiler technology may yield significant gains in processor performance.
Cornell Theory Center (www.tc.cornell.edu) is a high-performance computing and interdisciplinary computational research center located at Cornell University. Researchers associated with the center work in fields such as genomics, digital materials, drug design, and financial risk analysis. CTC supports faculty and staff from more than 100 different research areas as well as corporate clients that require leading-edge computational resources.
For more information
High Performance Linpack (HPL):
http://www.netlib.org/benchmark/hpl/
Linpack:
http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
Linpack1000:
http://www.netlib.org/benchmark/1000d
STREAM:
http://www.cs.virginia.edu/stream/