GenIDLEST Performance Analysis with PerfSuite

The following examples illustrate collaborative work in performance engineering and analysis on Linux clusters using PerfSuite.

Note: these examples are out-of-date and do not reflect current enhancements in PerfSuite. More up-to-date information will appear here soon.

The application driver is GenIDLEST (Danesh Tafti, Virginia Tech), a computational fluid dynamics application that is being developed and tuned for use on Pentium III, Pentium 4, Xeon, Itanium, and Itanium 2 Linux clusters located at the University of Illinois/NCSA and Virginia Tech. GenIDLEST is used for academic research as well as by community and corporate researchers.

GenIDLEST development lead Tafti has been working with colleagues to isolate, diagnose, and resolve performance issues affecting the application in the Linux cluster environment. Several performance tools have been employed to identify problem areas and to help point the way toward improvements.


GenIDLEST, PerfSuite, and PAPI

Reliably measuring performance is key to the GenIDLEST performance analysis effort. The team has measured the application with processor hardware performance counters using the PerfSuite hardware performance support library and the PAPI library. PAPI, which in turn uses IA-64 perfmon performance monitoring support, allows the team to effectively "X-ray" what is occurring in the Itanium processor as GenIDLEST executes. This makes it possible to accurately assess and improve processor utilization by examining and interpreting reports like the following (postprocessed for display by PerfSuite's psprocess utility):

PerfSuite 1.0 summary for execution of gen.ver2.3.inte (PID=9867, domain=user)

                          Based on 800 MHz GenuineIntel CPU
                          CPU revision 6.000


Event Counter Name                                                    Counter Value
===================================================================================

    0 Conditional branch instructions mispredicted.............          3006093956
    1 Conditional branch instructions correctly predicted......         32974709880
    2 Conditional branch instructions taken....................         26952022279
    3 Floating point instructions..............................         44525980237
    4 Total cycles.............................................        353262206234
    5 Instructions completed...................................        489764680025
    6 Level 1 data cache accesses..............................         56390921533
    7 Level 1 data cache hits..................................         41911206947
    8 Level 1 data cache misses................................         14615753570
    9 Level 1 load misses......................................         17611912424
   10 Level 1 cache misses.....................................         17597248300
   11 Level 2 data cache accesses..............................         53158617899
   12 Level 2 data cache misses................................          8440205387
   13 Level 2 data cache reads.................................         43528651785
   14 Level 2 data cache writes................................         10240563775
   15 Level 2 load misses......................................          3615923337
   16 Level 2 store misses.....................................           667575973
   17 Level 2 cache misses.....................................          8529931717
   18 Level 3 data cache accesses..............................          3826843278
   19 Level 3 data cache hits..................................          2799591986
   20 Level 3 data cache misses................................           999714206
   21 Level 3 data cache reads.................................          3573882130
   22 Level 3 data cache writes................................           171800425
   23 Level 3 load misses......................................           944624814
   24 Level 3 store misses.....................................            49427000
   25 Level 3 cache misses.....................................          1024569375
   26 Load instructions........................................         84907675686
   27 Load/store instructions completed........................         95346092870
   28 Cycles Stalled Waiting for memory accesses...............        140032176122
   29 Store instructions.......................................         10267472354
   30 Cycles with no instruction issue.........................         67247126931
   31 Data translation lookaside buffer misses.................             8365029


Statistics
===================================================================================
Graduated instructions/cycle...................................            1.386406
Graduated floating point instructions/cycle....................            0.126042
Graduated loads & stores/cycle.................................            0.269902
Graduated loads & stores/floating point instruction............            2.141359
L1 Cache Line Reuse............................................            5.523515
L2 Cache Line Reuse............................................            0.731682
L3 Cache Line Reuse............................................            7.442618
L1 Data Cache Hit Rate.........................................            0.846708
L2 Data Cache Hit Rate.........................................            0.422527
L3 Data Cache Hit Rate.........................................            0.881553
% cycles w/no instruction issue................................           19.036037
% cycles waiting for memory access.............................           39.639729
Correct branch predictions/branches taken......................            1.000000
MFLOPS.........................................................          100.833839
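
The derived statistics follow directly from the raw counts. As a check, using the table above and the reported 800 MHz clock:

    Graduated instructions/cycle = 489764680025 / 353262206234            ~ 1.3864
    Run time                     = 353262206234 cycles / 800e6 cycles/s   ~ 441.58 s
    MFLOPS                       = 44525980237 FP instr. / 441.58 s / 1e6 ~ 100.83

both of which agree with the reported 1.386406 and 100.833839.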

Pentium (IA-32) and Itanium (IA-64) hardware performance counter summaries provide insight into how effectively processor resources are used. Raw event counts, together with derived statistics (presented here in a manner similar to the IRIX perfex utility), are very useful in assessing the effects of code, data, and algorithmic modifications, as well as in determining the compiler's ability to emit well-optimized code.
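
As a rough illustration of how such counts are gathered programmatically, the following minimal C sketch uses PAPI's classic high-level counter interface to bracket a region of code. It is a generic example, not GenIDLEST's actual instrumentation; the loop merely stands in for a solver iteration.

#include <stdio.h>
#include <papi.h>

int main(void)
{
    int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };
    long long counts[2];
    double s = 0.0;
    int i;

    /* Initialize PAPI and start counting cycles and instructions. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    if (PAPI_start_counters(events, 2) != PAPI_OK)
        return 1;

    /* Region of interest: stands in for a solver iteration. */
    for (i = 1; i <= 10000000; i++)
        s += 1.0 / (double) i;

    /* Stop counting and report the raw event counts. */
    if (PAPI_stop_counters(counts, 2) != PAPI_OK)
        return 1;
    printf("cycles=%lld instructions=%lld IPC=%.3f (s=%g)\n",
           counts[0], counts[1],
           (double) counts[1] / (double) counts[0], s);
    return 0;
}

Linked against PAPI (-lpapi), this prints raw counts of the same kind shown in the report above; tools such as psprocess then turn full event sets into derived statistics.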

GenIDLEST, OptView, and the Intel IA-64 compiler

GenIDLEST is used frequently as an application driver for the development of tools that assist not only with this effort but are also of general use. One example of this work is the PerfSuite OptView tool, which is being developed to assist in understanding the effectiveness of compiler optimizations.

An example of OptView screens showing compiler optimization effectiveness. In this case, a loop within GenIDLEST's solvers failed to software-pipeline and needs reworking to enable more efficient execution on the Itanium processor.
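
For illustration only (these are not the actual GenIDLEST loops, and the function name is hypothetical), two common loop shapes that inhibit software pipelining are an opaque call in the loop body and a loop-carried dependence:

#include <stddef.h>

double update(double x);   /* opaque external call (hypothetical) */

void solver_sketch(double *a, const double *b, const double *c, size_t n)
{
    size_t i;

    /* An opaque call inside the loop prevents the compiler from
       overlapping iterations, so this loop cannot software-pipeline. */
    for (i = 0; i < n; i++)
        a[i] = update(a[i]);

    /* A dependence through a[i-1] serializes iterations and limits
       how tightly a pipelined schedule can be packed. */
    for (i = 1; i < n; i++)
        a[i] = a[i-1] * b[i] + c[i];
}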

GenIDLEST and VProf

GenIDLEST performance analysis also draws on externally developed tools where they can help improve the application's performance. Source-level profiling is an important mechanism for isolating problem areas in the application based on data collected during an actual run. For example, researchers working on GenIDLEST have used the profiling tool VProf to relate performance information to the source code.

The VProf visual profiler provides the important capability of relating events measured during profiling runs back to the original source code.
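
The general mechanism behind such profilers can be sketched briefly: a periodic timer signal interrupts the program, and the handler records where execution currently is. The C sketch below is a simplified illustration of that idea only; VProf itself records the program counter and maps it back to source lines through debug information, rather than using an explicit phase flag.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t phase;     /* set by the main program     */
static volatile long samples[2];        /* profiling hits per phase    */

static void on_prof(int sig)
{
    (void) sig;
    samples[phase]++;                   /* attribute sample to a phase */
}

int main(void)
{
    struct sigaction sa;
    struct itimerval it;
    double x = 0.0;
    long i;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_prof;
    sigaction(SIGPROF, &sa, NULL);

    it.it_interval.tv_sec = 0;          /* sample every 10 ms of CPU   */
    it.it_interval.tv_usec = 10000;
    it.it_value = it.it_interval;
    setitimer(ITIMER_PROF, &it, NULL);

    phase = 0;                          /* compute-heavy phase         */
    for (i = 1; i <= 200000000; i++)
        x += 1.0 / (double) i;

    phase = 1;                          /* second phase                */
    for (i = 0; i < 50000000; i++)
        x *= 1.0000001;

    printf("phase 0: %ld samples, phase 1: %ld samples (x=%g)\n",
           (long) samples[0], (long) samples[1], x);
    return 0;
}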

GenIDLEST and FPMPIview

The initial port of GenIDLEST to the Itanium cluster showed very poor performance relative to other platforms. Using the FPMPI profiling library and FPMPIview (a graphical tool prototyped in early versions of the PerfSuite software), the team traced the problem to excessive synchronization time between nodes caused by I/O issues. Appropriate code modifications to GenIDLEST significantly reduced the total runtime.

GenIDLEST before modifications: over half the run time was spent by processors waiting for I/O to complete; MPI profiling identified this wait time as synchronization overhead.

GenIDLEST after modifications: synchronization overhead was reduced to less than 20% of overall execution time, which in turn was reduced by more than 60%.
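
Profiling libraries of this kind work through the MPI standard's profiling interface (PMPI), which lets a library interpose on any MPI call. A minimal sketch of the idea (not FPMPI's actual code) that accumulates time spent in barriers on each rank:

#include <stdio.h>
#include <mpi.h>

static double barrier_seconds;   /* accumulated wait time in barriers */
static long   barrier_calls;

/* This wrapper is linked in place of the real MPI_Barrier and
   forwards to the underlying implementation via its PMPI_ name. */
int MPI_Barrier(MPI_Comm comm)
{
    double t0 = PMPI_Wtime();
    int rc = PMPI_Barrier(comm);
    barrier_seconds += PMPI_Wtime() - t0;
    barrier_calls++;
    return rc;
}

/* Report per-rank totals when the application shuts down. */
int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld barriers, %.3f s waiting\n",
           rank, barrier_calls, barrier_seconds);
    return PMPI_Finalize();
}

A rank held up by another rank's serial I/O spends its time inside calls like these, which is how the wait appeared as synchronization overhead in the profiles described above.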

GenIDLEST Benchmarking and System Evaluation

The GenIDLEST team benchmarks the application for scalability studies, not only of the code itself but also of the underlying platform. These studies are useful in assessing how well the processor and communication facilities can support "Grid"-scale applications.

In the following case, the team noted a substantial performance loss caused by processor-to-memory bandwidth limits: when both processors on a node are used, the per-processor MFLOP rate drops significantly because of high demand on the shared bus. The cross-hatched columns show the total MFLOP rate for the two-processor-per-node configuration, while the solid columns show the rate when the node's memory is devoted to a single CPU.
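
This kind of contention can be reproduced with a simple memory-bandwidth microbenchmark. The following is a minimal sketch (a STREAM-style triad, not the team's actual benchmark): running one copy on an otherwise idle node, then one copy per CPU simultaneously, makes the per-processor bandwidth drop visible.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)   /* 4M doubles per array (~32 MB): exceeds cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    struct timespec t0, t1;
    double secs;
    long i;

    if (!a || !b || !c)
        return 1;
    for (i = 0; i < N; i++) {
        b[i] = 1.0;
        c[i] = 2.0;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];       /* triad: 2 reads, 1 write */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("triad bandwidth: %.1f MB/s (check: %g)\n",
           3.0 * N * sizeof(double) / 1e6 / secs, a[N - 1]);
    free(a); free(b); free(c);
    return 0;
}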

GenIDLEST performance analysis and benchmarking help the team gain insight into the strengths and limitations of the target processors and the cluster environment.


PerfSuite
perfsuite@ncsa.uiuc.edu
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

Last modified: February 20, 2004