System Performance Analysis: Tools, Techniques, and Methodology
By Ramesh Radhakrishnan, Ph.D.; Humayun Khalid, Ph.D.; and David J. Morse (Issue 3 2001)
Current computer designs are extremely complex. System performance analysis is a requirement for understanding and tuning computer systems to perform efficiently. The fact that application performance does not scale proportionately with CPU speed is leading to an increased awareness of the total system behavior. This article introduces various approaches to experimental, simulation, and modeling techniques used in the Dell System Performance and Analysis (SPA) Laboratory.
To design high-performance and cost-effective systems, we must first understand and then remove the system bottlenecks. Many factors, such as software inefficiency or inadequate bus bandwidth, can cause these bottlenecks. A modern computer system consists of application software running on an operating system (OS) that executes on multiple processors. The system bottlenecks could appear at various levels, or they could be a combination of events occurring at different levels. The CPU, Northbridge, Southbridge, memory system, I/O system, BIOS programming of the chipsets, the OS components, and the application software stack all influence the overall system performance. An optimal system will have the software architecture mapped correctly to the hardware.
Techniques for studying system performance
Studies related to computer system performance involve modeling, simulation, and measuring real runtime system parameters during the execution of real-world applications.
Modeling. An analytical performance evaluation approach is better than the simulation and measurement methodologies in terms of the time required to build a model, collect data, and get results. However, one drawback of this approach is that it is often less accurate than simulation or measurement.
Simulation. Simulation techniques require detailed system knowledge. Since these techniques are also generally trace- or execution-based, they are often more accurate than analytical models. However, the time to develop a simulation engine can be prohibitive: it can take months to create and validate a model of even simple devices and systems. Furthermore, trace-based simulation models need traces to be gathered and validated for real programs executed on real platforms, which is time consuming and not always possible.
Measurement. The measurement approach is highly accurate, but this technique is only available when a system has been developed. It is sometimes difficult to perform trade-off studies once a system has been built because it is not easy or feasible to change the hardware. On the other hand, many problems that relate to a new architecture or product do not surface until the systems are built and experiments are conducted on real hardware. Performance bottlenecks can be easily studied by performing careful measurements on real systems running real-world applications.
This article describes some commonly used experimental and analytical techniques for studying system performance and examples of these techniques. The Dell® System Performance and Analysis Laboratory (SPA) employs these techniques to tune and optimize performance of PowerEdge® systems.
Experimental methodologies for system performance analysis
One methodology for analyzing system performance is the use of performance counters.
On-chip performance counters
All current Intel® processors implement two internal counters that can be configured to count a variety of events or measure their duration (in CPU cycles), as well as a cycle counter that counts clock cycles. A software driver must be implemented to provide a device-style interface to these capabilities.
Hardware performance counters can profile applications and the OS to determine their performance on the processor. This information, in turn, can be used to tune the application and OS for better performance.
Various events can be monitored through the instructions used to set up and read the performance counters (see Figure 1). The performance counters can easily measure the cache performance, branch predictor performance, bus protocol efficiency, and so on. To eliminate system bottlenecks, a thorough understanding of the architecture and its limitations is necessary. This helps performance engineers to accurately interpret the results obtained using the performance counters and to use the counters effectively.
Figure 1. Several events that can be measured using the Pentium III processor counters
The Intel VTune™ Performance Analyzer
VTune (http://developer.intel.com/software/products/vtune/) is a tool that uses the performance counters on Intel microprocessors to generate call graphs, perform hot-spot analysis, and display performance information at the application or system level. Event-based or time-based sampling can create profiles that indicate where key performance events occur within an application or an OS. This data can help locate hot spots or performance bottlenecks that can be removed by modifying the application code or by other means.
VTune provides source-level tuning advice for the latest Intel processors, including the Pentium® 4 processor, and for multiple languages such as C/C++, Java™, FORTRAN, and assembly. Based on the performance bottleneck, VTune can identify problems such as poor cache utilization, inefficient branch statements, and redundant operations, and suggest algorithmic improvements such as loop unrolling. It can also identify array or pointer usage that limits the ability of a compiler to optimize code. In addition, VTune recognizes opportunities to improve code performance by using features such as Streaming SIMD (single instruction, multiple data) Extensions or MMX™ technology, and provides examples of how to implement these features in code.
Performance analysis of individual subsystems
Certain scenarios require a closer look at a specific system component and the analysis of its performance. This may occur when different vendors provide the component, or if a particular subsystem is suspected of being a bottleneck in system performance. It is important to have the right tools to analyze the performance of major components of a server. Comparing benchmark scores or application performance can indicate the relative performance of different implementations of a component, but having the tools to analyze the performance will lead to efficient design decisions in the future.
The Iometer software tool from Intel can vary the workload, perform controlled tests, and evaluate the performance of both Peripheral Component Interconnect (PCI) and I/O subsystems. As an I/O workload generator and performance analysis tool for single and clustered servers, it measures the system I/O performance while stressing it with a controlled workload.
Iometer can be configured to simulate the workload of any application or benchmark, or create an artificial workload to stress the system in specific ways. While running the specified workload, Iometer collects important data such as throughput, latency, and CPU utilization that can determine the I/O system performance.
The PCI bus is the most commonly used bus technology in Intel-based servers and workstations. This bus allows devices such as network and I/O controllers to have independent access to the memory. Inefficient utilization of the PCI bus in system design can lead to system bottlenecks and poor application performance. PCI design knowledge and access to tools that help to measure and understand PCI system efficiency enable system designers to determine whether the PCI system is designed and working efficiently. Third-party PCI analyzers also provide this type of functionality to study the performance issues related to PCI devices and the PCI bus.
A PCI transaction or data transfer involves request, arbitration, grant, address, and data transfer phases. It also requires a turnaround phase because the same lines are used to transfer data and address.
Basic measurements for PCI performance include:
- Bus throughput quantifies the amount of data transferred. Although the theoretical throughput for a 33 MHz, 32-bit bus is 132 MB/sec, sustainable throughput in practice is closer to 110 MB/sec.
- Utilization denotes the percentage of time that the bus is busy.
- Efficiency shows how well the bus is being used; high overhead indicates poor efficiency even when utilization is high.
The Dell SPA team measures these basic aspects of PCI performance on Dell servers using the PCI analyzer, and examines other details, such as burst sizes or first-word latency, if it finds a bottleneck in PCI performance.
Figure 2 shows a snapshot of the measurement taken on a PowerEdge server while running a Web server benchmark. It shows a high throughput of 102 MB/sec and 90 percent PCI bus utilization for the system being measured. Because of low overhead and optimized commands being used to enable longer burst transfers, the result is a high PCI efficiency of 85 percent.
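As a rough sanity check on numbers like these, sustained throughput can be approximated as peak bandwidth multiplied by utilization and efficiency. This simple relation is an illustrative assumption, not the analyzer's exact definition, but it reproduces the measurement above closely:

```python
def pci_peak_mb_per_sec(clock_mhz=33, bus_width_bits=32):
    """Theoretical peak bandwidth: one transfer per clock across the bus width."""
    return clock_mhz * (bus_width_bits / 8)  # MB/sec

def approx_throughput(peak, utilization, efficiency):
    """Rough model (assumption): sustained throughput ~ peak x utilization x efficiency."""
    return peak * utilization * efficiency

peak = pci_peak_mb_per_sec()                # 132.0 MB/sec for 33 MHz, 32-bit PCI
print(approx_throughput(peak, 0.90, 0.85))  # ~101 MB/sec, close to the measured 102
```

The small gap between ~101 MB/sec and the measured 102 MB/sec is within the rounding of the reported utilization and efficiency percentages.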
Disk subsystem performance is often the major bottleneck in overall server system performance. The disk subsystem consists of multiple disks configured to provide fault tolerance and high throughput; it is accessed using a controller interface. Regardless of the speed in other parts of the computer system, delays because of disk latencies are inevitable if an application is I/O bound.
Many popular application environments, such as relational databases, file servers, or Web servers, are I/O bound. Depending on the type and intensity of the workload, the I/O subsystem can seriously degrade overall performance, nullifying the positive effects of other high-performance components.
Because of the important impact on overall system performance, the Dell SPA team benchmarks disk subsystems separately and specifically. Because Dell offers such a wide variety of disk subsystem-related products—from hard drives to hard drive enclosures to IDE/SCSI controllers to RAID (Redundant Array of Inexpensive Disks) controllers—the SPA Lab performs benchmarks of the disk subsystems on Dell servers and analyzes the characteristic behavior of subsystem elements. Analysis reports based on these studies are available (see http://inside.us.dell.com/pg/spa/Servers/Index.html). Dell also cooperates with product development labs, providing necessary performance feedback on new products at the design stage.
The disk I/O performance of an application environment depends not only on the workload generated, but also on the technology and configuration of RAID controllers, hard drives, and drive enclosures, and on software factors such as BIOS settings, drivers, and OS optimizations. Disk scaling and fault tolerance are also major design parameters that directly affect disk I/O performance.
Using OS counters for tuning system performance
Windows NT® and Windows® 2000 users can use the Performance Monitor (perfmon) tool to identify system- and application-level performance issues on their systems. This tool operates in different modes: chart, log, and report modes to collect and display performance data.
Chart mode shows a view of real-time performance (Figure 3); however, the data cannot be stored for future analysis. Log and report modes allow users to capture data over a long period of time to perform analysis later. The SPA Lab uses perfmon extensively to ensure that the benchmark runs are stable and to detect any existing bottlenecks.
Figure 3. Viewing real-time performance using chart mode
Some benchmarks like TPC-W™ and LoadSim2000 require in their run rules that certain perfmon counters be turned on during the benchmark and that data be logged so it can be viewed later. Software products such as SQL Server and Microsoft® Exchange® provide additional monitors that provide more insight into the performance of these applications.
The performance monitoring tool
The Windows 2000 performance console measures a variety of system components to ensure that the system is tuned optimally. The performance console consists of two main parts: System Monitor, and Performance Logs and Alerts.
System Monitor shows real-time or logged data, whereas Performance Logs and Alerts logs data and creates alerts. Real-time or logged data can be viewed in System Monitor in three ways:
- Chart view: Graphically displays performance counters as a chart that updates in response to real-time changes or logged data
- Histogram: A bar graph that displays the value of performance counters based on real-time changes or logged data; useful for displaying the peak values of the counters
- Report: Displays the numeric value of the counters; useful for displaying a large number of counters or for comparing different servers
Performance Logs and Alerts allows performance data to be collected manually or automatically. The saved data can be displayed in System Monitor or exported to a spreadsheet for further analysis. Performance Logs and Alerts provides the following functions:
- Counter logs: Allow the creation of a log file with specific objects and counters. Logging can be scheduled to start at a specific time, or it can be started manually. The log files can be saved in different formats (text, binary, or other) and viewed using System Monitor or exported to database or spreadsheet applications.
- Trace logs: Contain data from trace data provider objects; they differ from counter logs in that they record data continuously rather than sampling at specific intervals. OS events such as page faults or disk I/O can be logged, as can application-specific events for applications that provide their own trace providers.
- Alerts: Track objects and counters to ensure that they stay within a specified range. An alert is generated if the value of a counter rises above or falls below the specified value. Different actions, such as logging the alert in the application event log or running a command from the command line, can be programmed to occur when the alert is generated.
Using software profiling for performance tuning
Software profiling tools can provide program analysis and tuning information that helps users to write efficient programs for optimal system performance. A commonly used profiling tool is a call graph profiler, which visually shows the amount of CPU time utilized by each primary procedure and the subsequent procedures they call. These tools can easily identify the longest running portions of code that can be streamlined to make the application execute more efficiently.
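As a simple illustration of what a call graph profiler reports, Python's built-in cProfile module attributes CPU time to each function and its callees. The workload below is hypothetical, constructed so that one function dominates the profile:

```python
import cProfile
import io
import pstats

def hot_loop(n):
    """A deliberately expensive function the profiler should flag."""
    return sum(i * i for i in range(n))

def workload():
    """Calls hot_loop repeatedly, mimicking a dominant code path."""
    return [hot_loop(20_000) for _ in range(25)]

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Report the five entries with the most cumulative time; hot_loop should
# dominate, which is exactly the signal a call graph profiler provides.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

Commercial call graph profilers add a visual tree of callers and callees, but the underlying data, time attributed per function along the call chain, is the same.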
Program analysis tools that profile heap-memory for programs help to identify memory bugs such as memory leaks, reading uninitialized memory, or accessing an invalid memory address. Left undetected, these bugs can significantly degrade system performance and cause the application or benchmark to return erroneous data or to terminate.
Certain profiling tools count the number of times each program instruction is executed. Other software optimization tools can use this data to restructure the program for more efficient execution. These tools reorder the program's procedures so that the most frequently executed instructions are stored in the cache rather than main memory. This minimizes instruction cache misses to greatly improve the speed of the application. The VTune performance analyzer includes some of this functionality.
Case study: Profiling the Linux kernel to detect performance bottlenecks
A profile of the Linux® kernel can show where the kernel spends most of its time. The readprofile tool can create this profile. To ensure that profiling is enabled, the line append="profile=2" must be added to /etc/lilo.conf and /sbin/lilo needs to be run for this change to take effect. This tells the kernel to store profiling information in /proc/profile. This file is stored in binary format, however, so the readprofile utility must parse it into human-readable ASCII format.
Kernel profiles are most useful when obtained during steady state; that is, when the system under test has reached peak utilization (CPU, I/O) during a benchmark. Once this occurs, the following command should be executed:
readprofile -r; sleep 60; readprofile | sort -nr | head
The command resets the profiling counters, waits 60 seconds, and then displays the 10 kernel functions that consumed the most clock ticks during that span. For a custom kernel, the -m parameter can specify the location of the custom System.map file; see the readprofile man page for more information about this performance analysis tool.
Figure 4 shows a machine under heavy network load. This is apparent from the high number of do_tcp_sendpages clock ticks (indicating TCP/IP activity) followed by the ace_interrupt call, indicating the Alteon® ACEnic network cards are generating a high number of interrupts. This information indicates that the TCP/IP stack should be tuned, or that the network interface card (NIC) parameters (such as transmit or receive coalescing clock ticks) might need adjustment to generate fewer interrupts.
Figure 4. Sample readprofile output taken under heavy network load
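A small script can post-process this text output, for example to track the top tick consumers across benchmark runs. The sketch below assumes each readprofile output line has the form "ticks symbol load"; the sample symbols are modeled on the heavy-network-load example discussed here:

```python
def top_kernel_symbols(readprofile_output, n=10):
    """Return the n kernel symbols with the most clock ticks, given text
    output assumed to be in readprofile's 'ticks symbol load' format."""
    rows = []
    for line in readprofile_output.strip().splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].isdigit():
            rows.append((int(parts[0]), parts[1]))
    rows.sort(reverse=True)  # most clock ticks first
    return rows[:n]

# Hypothetical lines modeled on the heavy-network-load example.
sample = """\
 4821 do_tcp_sendpages 12.31
 1950 ace_interrupt 8.77
  312 default_idle 5.20
"""
print(top_kernel_symbols(sample, 2))  # [(4821, 'do_tcp_sendpages'), (1950, 'ace_interrupt')]
```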
Analytical modeling for performance projection
Analytical models developed at Dell allow performance engineers, architects, and designers to get fast answers to design and trade-off questions. Quick, preliminary performance projections are critical for identifying and rectifying potential problems and bottlenecks early in the design cycle. This, in turn, can reduce costs and help set expectations for marketing, customers, and project leaders.
Analytical models should be simple to use, but sophisticated enough to provide meaningful performance data for complex server systems. In the early phases of our analytical performance modeling, we lacked meaningful and credible performance data on processors, internal subunits of the processor, front side bus (FSB) bandwidth and utilization, memory bus bandwidth and utilization, and chipset details. So we began with a full queuing model for the entire system based on several statistical processes.
We analyzed and studied Markov, Birth-Death, Poisson, Hyper-Exponential, and Logarithmic processes for the queuing model. Most models had similar parameters for the inputs and outputs: arrival process, service time distribution, number of servers, system capacity, population size, and service discipline. These parameters could be obtained by assuming various distributions. We soon realized that the analytical complexity of the model was a limiting factor for developing simple models. There were also issues about our assumptions regarding various distributions.
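To make the queuing vocabulary above concrete, here is the classic M/M/1 single-server result (Poisson arrivals, exponential service times). It is purely illustrative; the full system model described in this article involved far more parameters and distributions:

```python
def mm1_metrics(arrival_rate, service_rate):
    """Classic M/M/1 queue results (Poisson arrivals, exponential service,
    one server): utilization, mean jobs in system, mean response time."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrivals must be slower than service")
    utilization = arrival_rate / service_rate
    mean_jobs = utilization / (1 - utilization)
    mean_response = 1 / (service_rate - arrival_rate)
    return utilization, mean_jobs, mean_response

# Hypothetical subsystem: 80 requests/sec arriving at a unit that can
# service 100 requests/sec.
print(mm1_metrics(80, 100))  # utilization 0.8, ~4 jobs in system, 0.05 sec response
```

Even this simplest model shows the nonlinearity that makes queuing analysis valuable: as utilization approaches 1, queue length and response time grow without bound.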
The problems and concerns with the full queuing model led us to look at the modeling aspect from the benchmarking and computing view rather than from the modeling theories. This means that our models are based on analytical equations for which the parameters are obtained not only from the queuing models based on certain distribution assumptions, but also from various sources such as direct measurements, system specifications, and estimations.
We chose the performance metric for each model from the relevant benchmark(s). For example, our analytical model for servers that use TPC-C® as the key benchmark produces tpmC™ as the output performance metric. Therefore, some level of benchmark characterization also becomes an implicit part of the overall equation.
Case study: TPC-C model
This section demonstrates how the TPC-C benchmark model was developed for Dell servers. For the TPC-C case study, the target was the tpmC (database transactions per minute) performance metric. We started to develop an equation that a computer program would use to compute tpmC for the TPC-C benchmark. A computer program can compute tpmC in several different ways.
One method is to compute tpmC from three basic parameters:
- Instructions per cycle (IPC)
- Cycles per minute (FREQuency in MHz x 60 seconds/minute)
- Transactions per instruction (TPI)
tpmC = IPC x (FREQ x 60) x TPI — (Equation 1)
Since reciprocal parameters are often used and reported, we can modify equation 1 to include the following reciprocal parameters:
- CPI = 1/IPC = cycles per instruction
- PL = 1/TPI = path length
Thus, our tpmC equation becomes:
tpmC = (FREQ x 60) / (CPI x PL) — (Equation 2)
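Equation 2 translates directly into code. In the sketch below, the CPI and path length values are purely hypothetical placeholders; real TPC-C parameters are workload- and platform-specific:

```python
def tpmc(freq_mhz, cpi, path_length):
    """Equation 2: tpmC = (FREQ x 60) / (CPI x PL).

    freq_mhz:    processor clock in MHz (converted to cycles per minute below)
    cpi:         cycles per instruction (1/IPC)
    path_length: instructions per transaction (1/TPI)
    """
    cycles_per_minute = freq_mhz * 1_000_000 * 60
    return cycles_per_minute / (cpi * path_length)

# Hypothetical example: 667 MHz CPU, CPI of 3.0, 500,000 instructions per transaction.
print(tpmc(667, 3.0, 500_000))  # 26680.0 transactions per minute
```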
We need to isolate and break up the parameters in equation 2 with respect to their affiliation with various computer system units. The breakup helps us to understand the trade-offs. However, the parameters are so intertwined and interconnected that a clear-cut and simple breakup is not always possible. Each parameter value is dependent on several system subunits.
We attempted to split the parameters as much as possible so that it is relatively easy to obtain the numbers or values for the final parameters.
Equation 2 is refined iteratively, adding more levels of detail. The final equation models:
- Effects due to finite L2 cache and main memory (CPI_MEMORY)
- Degradation effects because of I/O, interrupts, and snoops (CPI_DEG)
- The calibration factor due to the assumptions in the entire analytical model
- Dependencies on the arrival rate of the requests, the arrival pattern, FSB and Memory Bus utilization, and the service rate of the FSB and Memory subsystem
- Response times, latencies, latency cover-up factors, queuing at various subunits such as the chipset
Because of space constraints, the derivation of the full detailed model is not included in this article.
Validating the analytical model
Derivation of the analytical model described previously is only one part of the job. Model validation and calibration is another important aspect of modeling that requires significant work.
The model validation steps include generating parametric values from several sources such as experimental data, data sheets, calculations, guesses, calibration data, projection data, and simulation data. The next step is computing the tpmC numbers for various platform configurations. Finally, we compare the result with actual measured numbers for the given configurations to determine the accuracy of the model.
The difference between the two numbers can be attributed not only to inaccuracies in the model, but also to inaccuracies and variations in the measured data itself. Our tolerance limit for such aggregate variations is usually about 5 percent; that is, the models are calibrated to generate reasonably accurate results.
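The comparison step amounts to a relative-error check against the measured result. A minimal sketch, using hypothetical tpmC numbers:

```python
def relative_error(modeled, measured):
    """Relative error of the model's prediction versus the measured value."""
    return abs(modeled - measured) / measured

def within_tolerance(modeled, measured, tolerance=0.05):
    """True if the model agrees with the measurement to within the tolerance."""
    return relative_error(modeled, measured) <= tolerance

# Hypothetical tpmC numbers for one platform configuration.
print(within_tolerance(modeled=26_680, measured=25_900))  # True (~3 percent error)
```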
FSB performance trade-off analysis
This section demonstrates the use of our model in evaluating two alternatives. Consider a Dell server running the TPC-C benchmark with a 667 MHz processor. Suppose a designer, architect, or customer has two choices: L2=256 KB with a 133 MHz FSB, or L2=2 MB with a 100 MHz FSB. The analytical model described above makes the choice easy.
Suppose the parameters shown in Figure 5 are generated from the model. The choice is quite clear: the 2 MB/100 MHz configuration. A 33 percent increase in FSB frequency from 256 KB/100 MHz to 256 KB/133 MHz results in only about an 11 percent improvement in performance (see tpmC performance difference, columns 1 and 2, row 7). This is because only 37.25 percent (CPI_MEMORY / CPI) of the total CPU cycles in the 256 KB/100 MHz case are spent going to memory. A 33 percent increase in FSB frequency improves only that 37.25 percent of the cycles (by about 24 percent); the remaining 62.75 percent of the cycles see no improvement.
Figure 5. 667 MHz/256 KB/133 MHz versus 667 MHz/2 MB/100 MHz
On the other hand, the 2 MB/100 MHz scenario improves system performance by a much larger margin because the miss rates and bus utilization (vis-à-vis queuing times) values drop drastically. This results in significant overall improvement in system performance compared to the 256 KB/133 MHz case.
This simple example demonstrates that the analytical model is not only useful in evaluating design choices, but it also provides insight and explains why a particular phenomenon is happening.
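The arithmetic behind the 11 percent figure is an Amdahl's-law-style estimate: only the memory-bound fraction of cycles benefits from the faster bus. A quick sketch using the fractions quoted above:

```python
def speedup_from_fsb(memory_fraction, fsb_speedup):
    """Amdahl's-law-style estimate: only the memory-bound fraction of
    CPU cycles shrinks when the front side bus gets faster."""
    new_cpi = (1 - memory_fraction) + memory_fraction / fsb_speedup
    return 1 / new_cpi

# 37.25 percent of cycles are memory-bound; FSB goes from 100 MHz to 133 MHz.
gain = speedup_from_fsb(0.3725, 133 / 100) - 1
print(f"{gain:.1%}")  # roughly 10 percent, in line with the ~11 percent above
```

The small difference from the quoted 11 percent is expected: the full model also accounts for second-order effects, such as reduced bus queuing, that this one-line estimate ignores.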
Analysis techniques can help improve system performance
Performance analysis for servers means understanding server architectures thoroughly in order to tune the servers for maximum performance and to recognize, isolate, and eliminate performance bottlenecks using detailed tools or back-of-the-envelope calculations.
The Dell System Performance and Analysis team uses the performance analysis techniques, tools, and methodologies described in this article to perform the following important functions:
- Tune PowerEdge systems to achieve high performance on industry-standard benchmarks
- Test prototype systems to optimize performance of the system and subsystems
- Make design decision trade-offs
- Optimize the performance of the system during design and development stages
- Develop performance tuning guides and performance briefs for customers
Performance analysis is an ongoing process. It begins during the design and development stage of the server and continues even after customers begin using the server.
Thanks to Serdar Acir for his help on providing details about the I/O subsystem performance testing performed in the SPA Lab.
Ramesh Radhakrishnan, Ph.D. (email@example.com) is a design engineer consultant with the System Performance and Analysis Lab in Round Rock, Texas. His responsibilities include performance analysis of Dell servers and characterization of enterprise-level benchmarks. Ramesh has a Ph.D. in Computer Engineering from the University of Texas at Austin.
Humayun Khalid, Ph.D. (firstname.lastname@example.org) is a senior consultant for the Dell System Performance and Analysis Lab. His responsibilities include performance evaluation, analysis, and projections for Dell servers. He is also responsible for design support, performance modeling, marketing support, competitive analysis, and chipset-architecture-technology (CAT) evaluation. Humayun has a Ph.D. in Electrical Engineering from the City University of New York.
David J. Morse (email@example.com) is a senior performance engineer for the Dell System Performance and Analysis Lab. He specializes in Web server performance and is responsible for running the industry-standard SPECweb® Web server benchmark across Dell's line of PowerEdge servers. Prior to joining Dell, he spent two years at NCR in the performance integration and testing group. David has a B.S. in Computer Engineering from the University of South Carolina and is a Red Hat® Certified Engineer (RHCE).
For more information
PowerEdge Servers: System Performance Analysis
VTune Performance Analyzer
Iometer: The I/O Performance Analysis Tool for Servers