GenIDLEST Performance Analysis with PerfSuite

The following examples illustrate collaborative work in performance engineering and analysis on Linux clusters using PerfSuite.

Note: these examples are out-of-date and do not reflect current enhancements in PerfSuite. More up-to-date information will appear here soon.

The application driver is GenIDLEST (Danesh Tafti, Virginia Tech), a computational fluid dynamics application that is being developed and tuned for use on Pentium III, Pentium 4, Xeon, Itanium, and Itanium 2 Linux clusters located at the University of Illinois/NCSA and Virginia Tech. GenIDLEST is used for academic research as well as by community and corporate researchers.

GenIDLEST development lead Tafti has been working with colleagues to isolate, diagnose, and resolve performance issues affecting the application in the Linux cluster environment. Several performance tools have been employed to identify problem areas and to help point the way toward improvements.


GenIDLEST, PerfSuite, and PAPI

Reliably measuring performance is key to the GenIDLEST performance analysis effort. The team has measured the application with processor hardware performance counters using the PerfSuite hardware performance support library and the PAPI library. PAPI, which in turn uses IA-64 perfmon performance monitoring support, allows the team to effectively "X-ray" what is occurring in the Itanium processor as GenIDLEST executes. This makes it possible to accurately assess and improve processor utilization by examining and interpreting reports like the following (postprocessed for display by PerfSuite's psprocess utility):

PerfSuite 1.0 summary for execution of gen.ver2.3.inte (PID=9867, domain=user)

                          Based on 800 MHz GenuineIntel CPU
                          CPU revision 6.000


Event Counter Name                                                    Counter Value
===================================================================================

    0 Conditional branch instructions mispredicted.............          3006093956
    1 Conditional branch instructions correctly predicted......         32974709880
    2 Conditional branch instructions taken....................         26952022279
    3 Floating point instructions..............................         44525980237
    4 Total cycles.............................................        353262206234
    5 Instructions completed...................................        489764680025
    6 Level 1 data cache accesses..............................         56390921533
    7 Level 1 data cache hits..................................         41911206947
    8 Level 1 data cache misses................................         14615753570
    9 Level 1 load misses......................................         17611912424
   10 Level 1 cache misses.....................................         17597248300
   11 Level 2 data cache accesses..............................         53158617899
   12 Level 2 data cache misses................................          8440205387
   13 Level 2 data cache reads.................................         43528651785
   14 Level 2 data cache writes................................         10240563775
   15 Level 2 load misses......................................          3615923337
   16 Level 2 store misses.....................................           667575973
   17 Level 2 cache misses.....................................          8529931717
   18 Level 3 data cache accesses..............................          3826843278
   19 Level 3 data cache hits..................................          2799591986
   20 Level 3 data cache misses................................           999714206
   21 Level 3 data cache reads.................................          3573882130
   22 Level 3 data cache writes................................           171800425
   23 Level 3 load misses......................................           944624814
   24 Level 3 store misses.....................................            49427000
   25 Level 3 cache misses.....................................          1024569375
   26 Load instructions........................................         84907675686
   27 Load/store instructions completed........................         95346092870
   28 Cycles Stalled Waiting for memory accesses...............        140032176122
   29 Store instructions.......................................         10267472354
   30 Cycles with no instruction issue.........................         67247126931
   31 Data translation lookaside buffer misses.................             8365029


Statistics
===================================================================================
Graduated instructions/cycle...................................            1.386406
Graduated floating point instructions/cycle....................            0.126042
Graduated loads & stores/cycle.................................            0.269902
Graduated loads & stores/floating point instruction............            2.141359
L1 Cache Line Reuse............................................            5.523515
L2 Cache Line Reuse............................................            0.731682
L3 Cache Line Reuse............................................            7.442618
L1 Data Cache Hit Rate.........................................            0.846708
L2 Data Cache Hit Rate.........................................            0.422527
L3 Data Cache Hit Rate.........................................            0.881553
% cycles w/no instruction issue................................           19.036037
% cycles waiting for memory access.............................           39.639729
Correct branch predictions/branches taken......................            1.000000
MFLOPS.........................................................          100.833839
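
The derived statistics follow directly from the raw counts. As a check, using the table above and the reported 800 MHz clock:

    Graduated instructions/cycle = 489764680025 / 353262206234            ~ 1.3864
    Run time                     = 353262206234 cycles / 800e6 cycles/s   ~ 441.58 s
    MFLOPS                       = 44525980237 FP instr. / 441.58 s / 1e6 ~ 100.83

both of which agree with the reported 1.386406 and 100.833839.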

Pentium (IA-32) and Itanium (IA-64) hardware performance counter summaries provide insight into how effectively processor resources are used. Raw event counts, together with derived statistics (presented here in a manner similar to the IRIX perfex utility), are very useful in assessing the effects of code, data, and algorithmic modifications, as well as in determining the compiler's ability to emit well-optimized code.
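
As a rough illustration of how such counts are gathered programmatically, the following minimal C sketch uses PAPI's classic high-level counter interface to bracket a region of code. It is a generic example, not GenIDLEST's actual instrumentation; the loop merely stands in for a solver iteration.

#include <stdio.h>
#include <papi.h>

int main(void)
{
    int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };
    long long counts[2];
    double s = 0.0;
    int i;

    /* Initialize PAPI and start counting cycles and instructions. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    if (PAPI_start_counters(events, 2) != PAPI_OK)
        return 1;

    /* Region of interest: stands in for a solver iteration. */
    for (i = 1; i <= 10000000; i++)
        s += 1.0 / (double) i;

    /* Stop counting and report the raw event counts. */
    if (PAPI_stop_counters(counts, 2) != PAPI_OK)
        return 1;
    printf("cycles=%lld instructions=%lld IPC=%.3f (s=%g)\n",
           counts[0], counts[1],
           (double) counts[1] / (double) counts[0], s);
    return 0;
}

Linked against PAPI (-lpapi), this prints raw counts of the same kind shown in the report above; tools such as psprocess then turn full event sets into derived statistics.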

GenIDLEST, OptView, and the Intel IA-64 compiler

GenIDLEST is used frequently as an application driver for the development of tools that assist not only with this effort but are also of general use. One example of this work is the PerfSuite OptView tool, which is being developed to assist in understanding the effectiveness of compiler optimizations.

An example of OptView screens showing compiler optimization effectiveness. In this case, a loop within GenIDLEST's solvers failed to software-pipeline and needs reworking to enable more efficient execution on the Itanium processor.
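
For illustration only (these are not the actual GenIDLEST loops, and the function name is hypothetical), two common loop shapes that inhibit software pipelining are an opaque call in the loop body and a loop-carried dependence:

#include <stddef.h>

double update(double x);   /* opaque external call (hypothetical) */

void solver_sketch(double *a, const double *b, const double *c, size_t n)
{
    size_t i;

    /* An opaque call inside the loop prevents the compiler from
       overlapping iterations, so this loop cannot software-pipeline. */
    for (i = 0; i < n; i++)
        a[i] = update(a[i]);

    /* A dependence through a[i-1] serializes iterations and limits
       how tightly a pipelined schedule can be packed. */
    for (i = 1; i < n; i++)
        a[i] = a[i-1] * b[i] + c[i];
}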

GenIDLEST and VProf

GenIDLEST performance analysis also draws on externally developed tools where they can help improve the application's performance. Source-level profiling is an important mechanism for isolating problem areas in the application based on data collected during an actual run. For example, researchers working on GenIDLEST have used the profiling tool VProf to relate performance information to the source code.

The VProf visual profiler provides the important capability of relating events measured during profiling runs back to the original source code.
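
The general mechanism behind such profilers can be sketched briefly: a periodic timer signal interrupts the program, and the handler records where execution currently is. The C sketch below is a simplified illustration of that idea only; VProf itself records the program counter and maps it back to source lines through debug information, rather than using an explicit phase flag.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t phase;     /* set by the main program     */
static volatile long samples[2];        /* profiling hits per phase    */

static void on_prof(int sig)
{
    (void) sig;
    samples[phase]++;                   /* attribute sample to a phase */
}

int main(void)
{
    struct sigaction sa;
    struct itimerval it;
    double x = 0.0;
    long i;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_prof;
    sigaction(SIGPROF, &sa, NULL);

    it.it_interval.tv_sec = 0;          /* sample every 10 ms of CPU   */
    it.it_interval.tv_usec = 10000;
    it.it_value = it.it_interval;
    setitimer(ITIMER_PROF, &it, NULL);

    phase = 0;                          /* compute-heavy phase         */
    for (i = 1; i <= 200000000; i++)
        x += 1.0 / (double) i;

    phase = 1;                          /* second phase                */
    for (i = 0; i < 50000000; i++)
        x *= 1.0000001;

    printf("phase 0: %ld samples, phase 1: %ld samples (x=%g)\n",
           (long) samples[0], (long) samples[1], x);
    return 0;
}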

GenIDLEST and FPMPIview

The initial port of GenIDLEST to the Itanium cluster showed very poor performance relative to other platforms. Using the FPMPI profiling library and FPMPIview (a graphical tool prototyped in early versions of the PerfSuite software), the team traced the problem to excessive synchronization time between nodes caused by I/O issues. Appropriate code modifications to GenIDLEST significantly reduced the total runtime.

GenIDLEST before modifications: over half the run time was spent by processors waiting for I/O to complete; MPI profiling identified this wait time as synchronization overhead.

GenIDLEST after modifications: synchronization overhead was reduced to less than 20% of overall execution time, which in turn was reduced by more than 60%.
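
Profiling libraries of this kind work through the MPI standard's profiling interface (PMPI), which lets a library interpose on any MPI call. A minimal sketch of the idea (not FPMPI's actual code) that accumulates time spent in barriers on each rank:

#include <stdio.h>
#include <mpi.h>

static double barrier_seconds;   /* accumulated wait time in barriers */
static long   barrier_calls;

/* This wrapper is linked in place of the real MPI_Barrier and
   forwards to the underlying implementation via its PMPI_ name. */
int MPI_Barrier(MPI_Comm comm)
{
    double t0 = PMPI_Wtime();
    int rc = PMPI_Barrier(comm);
    barrier_seconds += PMPI_Wtime() - t0;
    barrier_calls++;
    return rc;
}

/* Report per-rank totals when the application shuts down. */
int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld barriers, %.3f s waiting\n",
           rank, barrier_calls, barrier_seconds);
    return PMPI_Finalize();
}

A rank held up by another rank's serial I/O spends its time inside calls like these, which is how the wait appeared as synchronization overhead in the profiles described above.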

GenIDLEST Benchmarking and System Evaluation

The GenIDLEST team benchmarks the application for scalability studies, not only of the code itself but also of the underlying platform. These studies are useful in assessing how well the processor and communication facilities can support "Grid"-scale applications.

In the following case, the team noted a substantial performance loss caused by processor-to-memory bandwidth limits: when both processors on a node are used, the per-processor MFLOP rate drops significantly because of high demand on the shared bus. The cross-hatched columns show the total MFLOP rate for the two-processor-per-node configuration, while the solid columns show the rate when the node's memory is devoted to a single CPU.
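
This kind of contention can be reproduced with a simple memory-bandwidth microbenchmark. The following is a minimal sketch (a STREAM-style triad, not the team's actual benchmark): running one copy on an otherwise idle node, then one copy per CPU simultaneously, makes the per-processor bandwidth drop visible.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)   /* 4M doubles per array (~32 MB): exceeds cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    struct timespec t0, t1;
    double secs;
    long i;

    if (!a || !b || !c)
        return 1;
    for (i = 0; i < N; i++) {
        b[i] = 1.0;
        c[i] = 2.0;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];       /* triad: 2 reads, 1 write */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("triad bandwidth: %.1f MB/s (check: %g)\n",
           3.0 * N * sizeof(double) / 1e6 / secs, a[N - 1]);
    free(a); free(b); free(c);
    return 0;
}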

GenIDLEST performance analysis and benchmarking help the team gain insight into the strengths and limitations of the target processors and the cluster environment.


PerfSuite
perfsuite@ncsa.uiuc.edu
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

Last modified: February 20, 2004