The following graphics provide some examples of collaborative work in performance engineering and analysis on Linux clusters using PerfSuite.
Note: these examples are out-of-date and do not reflect current enhancements in PerfSuite. More up-to-date information will appear here soon.
The application driver is GenIDLEST (Danesh Tafti, Virginia Tech), a computational fluid dynamics application that is being developed and tuned for use on Pentium III, Pentium 4, Xeon, Itanium, and Itanium 2 Linux clusters located at the University of Illinois/NCSA and Virginia Tech. GenIDLEST is used in academic research as well as by community and corporate researchers.
GenIDLEST development lead Tafti has been working with colleagues to isolate, diagnose, and resolve performance-related issues of this application in the Linux cluster environment. Several performance tools have been employed to identify problem areas and to help point the way towards improvement.
Reliably measuring performance is key to the GenIDLEST performance analysis effort. The team has measured the application with processor hardware performance counters using the PerfSuite hardware performance support library and the PAPI library. PAPI, which in turn uses IA-64 perfmon performance monitoring support, allows the team to effectively "X-ray" what is occurring in the Itanium processor as GenIDLEST executes. This makes it possible to accurately assess and improve processor utilization by examining and interpreting reports like the following (postprocessed for display by PerfSuite's psprocess utility):
PerfSuite 1.0 summary for execution of gen.ver2.3.inte (PID=9867, domain=user)

Based on 800 MHz -1
GenuineIntel 0 CPU
CPU revision 6.000

Event Counter Name                                            Counter Value
===================================================================================
  0 Conditional branch instructions mispredicted.............    3006093956
  1 Conditional branch instructions correctly predicted......   32974709880
  2 Conditional branch instructions taken....................   26952022279
  3 Floating point instructions..............................   44525980237
  4 Total cycles.............................................  353262206234
  5 Instructions completed...................................  489764680025
  6 Level 1 data cache accesses..............................   56390921533
  7 Level 1 data cache hits..................................   41911206947
  8 Level 1 data cache misses................................   14615753570
  9 Level 1 load misses......................................   17611912424
 10 Level 1 cache misses.....................................   17597248300
 11 Level 2 data cache accesses..............................   53158617899
 12 Level 2 data cache misses................................    8440205387
 13 Level 2 data cache reads.................................   43528651785
 14 Level 2 data cache writes................................   10240563775
 15 Level 2 load misses......................................    3615923337
 16 Level 2 store misses.....................................     667575973
 17 Level 2 cache misses.....................................    8529931717
 18 Level 3 data cache accesses..............................    3826843278
 19 Level 3 data cache hits..................................    2799591986
 20 Level 3 data cache misses................................     999714206
 21 Level 3 data cache reads.................................    3573882130
 22 Level 3 data cache writes................................     171800425
 23 Level 3 load misses......................................     944624814
 24 Level 3 store misses.....................................      49427000
 25 Level 3 cache misses.....................................    1024569375
 26 Load instructions........................................   84907675686
 27 Load/store instructions completed........................   95346092870
 28 Cycles Stalled Waiting for memory accesses...............  140032176122
 29 Store instructions.......................................   10267472354
 30 Cycles with no instruction issue.........................   67247126931
 31 Data translation lookaside buffer misses.................       8365029

Statistics
===================================================================================
Graduated instructions/cycle...................................      1.386406
Graduated floating point instructions/cycle....................      0.126042
Graduated loads & stores/cycle.................................      0.269902
Graduated loads & stores/floating point instruction............      2.141359
L1 Cache Line Reuse............................................      5.523515
L2 Cache Line Reuse............................................      0.731682
L3 Cache Line Reuse............................................      7.442618
L1 Data Cache Hit Rate.........................................      0.846708
L2 Data Cache Hit Rate.........................................      0.422527
L3 Data Cache Hit Rate.........................................      0.881553
% cycles w/no instruction issue................................     19.036037
% cycles waiting for memory access.............................     39.639729
Correct branch predictions/branches taken......................      1.000000
MFLOPS.........................................................    100.833839
Pentium (IA-32) and Itanium (IA-64) hardware performance counter
summary information provides insight into the effective use of
the processor resources. Raw event counts in addition to derived
statistics (here, presented in a similar manner to the IRIX
perfex utility) are very useful in assessing the effects
of code, data, and algorithmic modifications as well as in determining
the compiler's ability to emit well-optimized code.
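The derived statistics in the report can be recomputed from the raw event counts, which is a useful sanity check when interpreting them. The following is a minimal sketch, not PerfSuite's own implementation: the formulas are inferred because they reproduce the printed values (they follow the perfex-style convention of measuring cache line reuse against the misses at each level), and the dictionary keys and function name are hypothetical labels chosen for this example. The branch prediction statistic is left out because its exact definition is not evident from the report.

```python
# Raw event counts copied from the psprocess report above (800 MHz Itanium).
# The keys are hypothetical labels for this sketch, not PerfSuite or PAPI names.
COUNTERS = {
    "fp_instructions": 44525980237,
    "total_cycles": 353262206234,
    "instructions_completed": 489764680025,
    "l1_data_cache_misses": 14615753570,
    "l2_data_cache_misses": 8440205387,
    "l3_data_cache_misses": 999714206,
    "load_store_instructions": 95346092870,
    "memory_stall_cycles": 140032176122,
    "no_issue_cycles": 67247126931,
}
CLOCK_MHZ = 800


def line_reuse(references, misses):
    """References satisfied by a cache level per miss at that level."""
    return (references - misses) / misses


def hit_rate(references, misses):
    """Fraction of references to a cache level that do not miss."""
    return 1.0 - misses / references


def derived_statistics(c, clock_mhz):
    """Recompute the report's derived statistics (formulas are assumptions)."""
    cycles = c["total_cycles"]
    mem_refs = c["load_store_instructions"]
    l1_miss = c["l1_data_cache_misses"]
    l2_miss = c["l2_data_cache_misses"]
    l3_miss = c["l3_data_cache_misses"]
    seconds = cycles / (clock_mhz * 1e6)  # user-domain time implied by cycles
    return {
        "instructions_per_cycle": c["instructions_completed"] / cycles,
        "fp_instructions_per_cycle": c["fp_instructions"] / cycles,
        "loads_stores_per_cycle": mem_refs / cycles,
        "loads_stores_per_fp_instruction": mem_refs / c["fp_instructions"],
        # Each level's references are taken to be the misses of the level above.
        "l1_line_reuse": line_reuse(mem_refs, l1_miss),
        "l2_line_reuse": line_reuse(l1_miss, l2_miss),
        "l3_line_reuse": line_reuse(l2_miss, l3_miss),
        "l1_hit_rate": hit_rate(mem_refs, l1_miss),
        "l2_hit_rate": hit_rate(l1_miss, l2_miss),
        "l3_hit_rate": hit_rate(l2_miss, l3_miss),
        "pct_cycles_no_issue": 100.0 * c["no_issue_cycles"] / cycles,
        "pct_cycles_memory_stall": 100.0 * c["memory_stall_cycles"] / cycles,
        "mflops": c["fp_instructions"] / seconds / 1e6,
    }


if __name__ == "__main__":
    for name, value in derived_statistics(COUNTERS, CLOCK_MHZ).items():
        print(f"{name:35s} {value:12.6f}")
```

Read this way, the low L2 reuse and hit rate stand out next to the L1 and L3 figures, which is consistent with the large share of cycles the report attributes to waiting on memory.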
GenIDLEST is frequently used as an application driver for developing tools that assist with this effort and are also of general use. One example is the PerfSuite OptView tool, which is being developed to help in understanding the effectiveness of compiler optimizations.
Example OptView screens showing compiler optimization effectiveness. In this case, a loop in GenIDLEST's solvers failed to software-pipeline and needs reworking to enable more efficient execution on the Itanium processor.
GenIDLEST performance analysis also draws on externally developed tools that help improve the application's performance. Source-level profiling is an important mechanism for isolating problem areas in the application based on data collected during an actual run. For example, researchers working on GenIDLEST have used the profiling tool VProf to relate performance information to the source code.
The VProf visual profiler provides the important capability of relating events measured during profiling runs back to the original source code.
The initial port of GenIDLEST to the Itanium cluster showed very poor performance relative to other platforms. Using the FPMPI profiling library and FPMPIview (a graphical tool previously developed as a prototype in early versions of the PerfSuite software), the team traced the problem to excessive synchronization time between nodes caused by I/O issues. Appropriate code modifications to GenIDLEST significantly reduced the total runtime.
GenIDLEST before modifications. Over half the run-time was due to processors waiting for I/O to complete. MPI profiling identified this wait time as synchronization overhead.
GenIDLEST after modifications. Synchronization overhead was reduced to less than 20% of overall execution time, which in turn was reduced by more than 60%.
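Taken together, the before and after captions bound how much absolute synchronization time was removed. The sketch below works through that arithmetic. It uses only the three figures quoted in the captions, treated as bounds rather than measurements, and normalizes the original runtime to 1.0. The function name and result keys are hypothetical.

```python
def sync_time_bounds(total_before=1.0):
    """Bound absolute synchronization time from the quoted percentages.

    All three inputs are bounds stated in the captions, not measured values.
    """
    sync_fraction_before = 0.5  # "over half the run-time": lower bound
    runtime_reduction = 0.6     # runtime "reduced by more than 60%": lower bound
    sync_fraction_after = 0.2   # "less than 20%" of the new runtime: upper bound

    sync_before_min = sync_fraction_before * total_before
    total_after_max = (1.0 - runtime_reduction) * total_before
    sync_after_max = sync_fraction_after * total_after_max
    # Smallest reduction in absolute synchronization time consistent with the bounds.
    min_sync_reduction = 1.0 - sync_after_max / sync_before_min
    return {
        "sync_before_min": sync_before_min,
        "total_after_max": total_after_max,
        "sync_after_max": sync_after_max,
        "min_sync_reduction": min_sync_reduction,
    }


if __name__ == "__main__":
    for name, value in sync_time_bounds().items():
        print(f"{name:20s} {value:.3f}")
```

Under these bounds, synchronization time fell from more than half of the original runtime to less than about 8% of it, a reduction of roughly 84% or more in absolute terms.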
The GenIDLEST team benchmarks the application for scalability studies, not only of the code itself but also of the underlying platform. These studies are useful in assessing how well the processor and communication facilities might support "Grid"-scale applications.
In the following case, the team observed a substantial performance loss caused by processor-to-memory bandwidth limits. When both processors on a node are used, the per-processor MFLOP rate drops significantly because of heavy demand on the shared bus. The cross-hatched columns show the total MFLOP rate for the two-processor-per-node configuration, while the solid columns show the configuration in which the node's memory is devoted to a single CPU.
GenIDLEST performance analysis and benchmarking help provide insight into the strengths and limitations of the target processors and cluster environment.
PerfSuite
perfsuite@ncsa.uiuc.edu
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Last modified: February 20, 2004