PerfSuite Hardware Performance Monitoring Library


libpshwpc, the PerfSuite supporting software library for hardware performance event counting, contains a small number of routines that are used to collect hardware performance event data for use within your program or by the PerfSuite graphical, command-line, or web-based tools.

libpshwpc supports both single-threaded programs and programs that use the POSIX threads standard (pthreads) for multithreaded execution. For pthreads programs, each thread will maintain copies of its own performance counter data.

The library is targeted for Linux-Intel/AMD (x86/x86-64/ia64) platforms.

The routines within libpshwpc provide output and functionality that are useful in their own right, so the library can also be used entirely independently of the graphical tools.

The routines currently contained in the library are summarized on this web page. All PerfSuite libpshwpc library routines begin with the prefix "ps_hwpc" (for C) and "PSF_hwpc" (for Fortran).

An Example Single-Threaded Program

Here's an example Fortran matrix-multiply loop that uses libpshwpc for (possibly multiplexed) hardware performance counting. The additions necessary to use libpshwpc are the include line and the PSF_hwpc_* calls with their error checks:

      program mxm
      include 'fperfsuite.h'
      (... declare and initialize arrays ...)

c Initialize libpshwpc
      call PSF_hwpc_init(ierr)
      if (ierr .ne. 0) then
        print*, 'Error initializing libpshwpc!'
      end if

c Start performance counting using libpshwpc
      call PSF_hwpc_start(ierr)
      if (ierr .ne. 0) then
        print*, 'Error starting performance counting!'
      end if

c Do the matrix multiply
      do j = 1, n
        do i = 1, m
          do k = 1, l
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          end do
        end do
      end do

c Stop hardware performance counting and write the
c results to a file named 'perf.PID.xml' (PID will be
c replaced by the process ID of the program)

      call PSF_hwpc_stop('perf', ierr)
      if (ierr .ne. 0) then
        print*, 'Error stopping hardware performance counting!'
      end if

c Shut down use of libpshwpc and the underlying libraries

      call PSF_hwpc_shutdown(ierr)
      if (ierr .ne. 0) then
        print*, 'Error terminating libpshwpc!'
      end if

      end

What To Do With The Output

The output generated from a program that uses the libpshwpc library is an XML document in a standard PerfSuite format. Because it is based on the XML standard for data representation, there are many possibilities for working with this document to obtain insight into the behavior of your application.

The PerfSuite command-line tool psprocess is a convenient utility for post-processing the results; for some examples and suggestions, you can refer to the documentation for psrun.

Compiling and Linking with libpshwpc

These instructions are specific to the PerfSuite installation at NCSA, which is rooted at the directory /usr/apps/tools/perfsuite. For other installations, you should substitute the local PerfSuite top-level directory instead.


All C/C++-based applications should include the main PerfSuite header file <perfsuite.h> and also the libpshwpc header file <pshwpc.h>. Fortran-based applications should include <fperfsuite.h>. No other header files are necessary to use these routines.

When you compile your program, include the flag:



When you link your program, include the flags:

     -L/usr/apps/tools/perfsuite/lib -lpshwpc

Programs that use POSIX threads should instead link with the threaded version of libpshwpc, as follows:

     -L/usr/apps/tools/perfsuite/lib -lpshwpc_r

The libpshwpc shared library automatically links the necessary low-level hardware performance counter support library (default: PAPI). You'll still have to add the directory /usr/apps/tools/perfsuite/lib to your LD_LIBRARY_PATH environment variable in order for your program to locate the PerfSuite shared libraries (or use other link-time options).

If you link statically, you'll have to specify the PerfSuite core, PAPI, and Expat XML parser libraries also, as follows:

    -L/usr/apps/tools/perfsuite/lib -L/usr/apps/tools/papi/lib \
         -lpshwpc -lperfsuite -lpapi -lexpat

Also note that a static link will remove the requirement to set your LD_LIBRARY_PATH environment variable.

More complete information about PAPI and its installation at NCSA can be found on the PAPI at NCSA web page.

Running Your Program

Assuming that you've successfully compiled and linked your program as described above and that you've set your LD_LIBRARY_PATH environment variable if necessary, just run your program as you normally would, possibly setting run-specific environment variables (described next).

Note: you should not run a program linked with libpshwpc with psrun (or any other software that would simultaneously attempt to access the hardware performance counters). Doing so will result in a conflict and a run-time error.

Environment Variables

libpshwpc recognizes the following environment variables:


PS_HWPC

Controls the run-time behavior of the library. If set to "off" or "no" (case is not significant), then no hardware performance counting will take place and all libpshwpc routines will return a success status without actually doing anything.


This variable allows you to provide an optional annotation string to the resulting XML output file (you may want to use this to keep track of specific information regarding a particular run). The value of this variable is copied verbatim as the text associated with the element <annotation>.


PS_HWPC_CONFIG

Specifies an event configuration file that is used to determine which hardware performance events will be counted. The file name may be absolute or relative to the current working directory. See Selecting Events to Count for more information about the configuration file.


Specifies the "counting domain" at which measurement will take place. Recognized values (case is not significant) are "user" (default), "kernel", or "all".


PS_HWPC_FILE

Specifies the base prefix to be used for the resulting XML output document (see the routine ps_hwpc_stop() below for details).

Selecting Events to Count

libpshwpc uses an event configuration file to decide at runtime what performance events should be counted. This file is an XML document with a very simple syntax that can be modified with any text editor.

Here's a sample event configuration file and instructions for creating your own custom configuration file.

PerfSuite provides several default configuration files, each targeted to a different architecture, that are located in the directory share/perfsuite/xml/pshwpc (relative to the PerfSuite top-level installation directory). For Pentium (except Pentium 4/Xeon) and Itanium machines, these files are named papi2_p6.xml and papi2_itanium.xml, respectively. You can use these default files as a basis for creating your own desired configuration (just copy them to a private location and modify appropriately).

There's also a "do-nothing" configuration file called null.xml that can be used to obtain general run information without using performance counters at all. In this case, the resulting XML output will contain information about the machine and the date and wall clock time elapsed between the call to ps_hwpc_init() and ps_hwpc_stop(). See the C/Fortran API section for more details on these routines.

You can also use the graphical tool PSConfig to create or modify event configuration files. This tool provides a convenient point-and-click interface along with several other features to make it easy to work with event selection.

You cannot control the event selection programmatically - the only way to specify events other than the default is through the environment variable PS_HWPC_CONFIG and a custom configuration file.

Note that libpshwpc will not accept the PAPI 2 "rate events" (PAPI_FLOPS and PAPI_IPS) because they are not true events but derived metrics. These metrics are provided anyway if you post-process your program's output with the PerfSuite utility psprocess (described in the documentation for psrun), as long as your event configuration includes the underlying raw events: PAPI_FP_INS, PAPI_TOT_INS, and PAPI_TOT_CYC.

Performance Counter Multiplexing

Processors typically have a limited set of registers for use in hardware performance counting. One technique for counting more performance events simultaneously than would otherwise be possible with the number of available registers is to use multiplexing, which causes the available physical counters to be time-shared over the desired events.

At the end of the measurement, the number of events read during each time-slice is then adjusted according to the total run time over all measurements to provide a statistical estimate of the actual number of events that would have been observed if the register had been devoted to a single event. This is a very convenient method of measuring a large number of events when only a few performance counters are available, especially if it's not convenient to make multiple non-multiplexed runs of the program.

libpshwpc uses PAPI as its default performance counter access method. PAPI implements multiplexing based on John May's MPX software, so libpshwpc supports counter multiplexing through PAPI. This is done automatically for you and is noted in the final output file. Multiplexing is enabled only if required (i.e., the software detects more requested events than can be counted on the available counters).

C / Fortran API

The libpshwpc C/Fortran API allows you to insert calls to the library into your application, enabling you to control the collection of hardware event data. The API is intentionally simple but is sufficient for the needs of many people doing performance analysis in practice. More complex needs (e.g., writing tools, profilers, etc.) are probably better served by one of the many academic, research, and commercial products that are available.

The API consists of 5 "core" routines that allow you to control and configure hardware performance measurement. Additionally, two convenience routines (ps_hwpc_PAPI_write and ps_hwpc_PAPI_hl_write) are provided that are intended for applications that already use PAPI directly. These routines convert an existing PAPI event set and associated counter values to PerfSuite's XML format and write the XML document to a disk file. The document can then be used by other PerfSuite command-line, graphical, or Web-based tools.

The following diagram shows the typical sequence of calls to the routines in libpshwpc. libpshwpc routines displayed in a red font (ps_hwpc_init() and ps_hwpc_shutdown()) may only be called by a single thread in a program (note: this need not be the same thread). libpshwpc routines displayed in a blue font (ps_hwpc_start() and ps_hwpc_suspend()) may be called repeatedly by any thread; the remaining routine, ps_hwpc_stop(), should be called exactly once by each thread. Dashed lines indicate the typical path for threads created with pthread_create().

Note that some variations are possible (for example, a thread may create other threads after having already started performance counting), but this diagram covers the most common case. The main things to keep in mind are:

  1. observe the restrictions on which routines can only be called once by a single thread (usually the main thread, but this isn't a requirement)
  2. don't call ps_hwpc_shutdown() until all threads have finished with performance counting (if this is a serious problem, then another option is to not call ps_hwpc_shutdown() at all)

int ps_hwpc_init(void)

subroutine PSF_hwpc_init(ierr)

integer ierr

This routine initializes the library, causes the event configuration file to be read and validated, and arranges for initialization of the underlying hardware counter support.

This must be the first routine from the library executed by your program.

When using libpshwpc in a multithreaded program, you should ensure that only one thread calls ps_hwpc_init(). Further, you should also make sure that no threads call any libpshwpc routines until ps_hwpc_init() has been called. The safest way to guarantee this is to call ps_hwpc_init() from the main thread, before any threads have been created. If you don't follow this approach, then you'll have to coordinate things in another manner (for example, via the standard pthread_once() POSIX routine).

Return status/ierr value: 0 on success, non-zero if an error occurred.

int ps_hwpc_start(void)

subroutine PSF_hwpc_start(ierr)

integer ierr

This routine causes hardware performance counting to begin or to resume after having been previously suspended.

Single- and multithreaded programs may call this routine any number of times, but should ensure that it is paired properly with ps_hwpc_suspend() or ps_hwpc_stop().

Return status/ierr value: 0 on success, non-zero if an error occurred.

int ps_hwpc_suspend(void)

subroutine PSF_hwpc_suspend(ierr)

integer ierr

This routine causes hardware performance counting to be suspended. Any events that occur after this call but before another call to ps_hwpc_start will not be recorded.

Single- and multithreaded programs may call this routine any number of times, but should ensure that it is paired properly with ps_hwpc_start().

Return status/ierr value: 0 on success, non-zero if an error occurred.

int ps_hwpc_stop(const char *filename)

subroutine PSF_hwpc_stop(filename, ierr)

character*(*) filename
integer ierr

This routine will terminate hardware performance counting from the calling application/thread and write the performance event data gathered during the run to a disk file with the prefix filename (or the current setting of the environment variable PS_HWPC_FILE, which will take precedence over the character string contained in your source code).

The resulting file will be named:


where .THREAD-ID will be used in the case of a POSIX threads-based application and will be an integer in the range 0 to P (where P is the number of additional threads that your program created). For both single and multithreaded applications, PID will be replaced by the process ID of the calling application/thread.

filename can be supplied as an absolute or relative path name; however, you should ensure that all intermediate subdirectories already exist and that you have the proper access permissions for each.

There is no guarantee that the main thread of the program will correspond to THREAD-ID 0 (although that's usually the case). The ID is incremented each time any thread begins performance counting, so that if a thread created by pthread_create() is the first to start counting, then that thread will be assigned THREAD-ID 0.

There are a number of different ways that you can post-process the output files generated by libpshwpc. The PerfSuite command-line utility psprocess is a very convenient method for working with these files; please refer to the documentation for that utility for information about its use as well as suggestions for other techniques for generating useful information from the output of libpshwpc.

This routine can (and should!) be called once by your application, as well as being called from any POSIX threads that your application creates. Failure to call this routine will result in no performance data being written.

Return status/ierr value: 0 on success, non-zero if an error occurred.

int ps_hwpc_shutdown(void)

subroutine PSF_hwpc_shutdown(ierr)

integer ierr

This routine ends the use of libpshwpc and any supporting software libraries it uses. It should be called once, and only by a single thread in a multithreaded application after all other threads have finished counting. Usually, it's safest to have the main thread call this routine, but this isn't a requirement. If you don't follow this approach, you'll have to coordinate things properly on your own.

The main purpose of this routine is to free resources used for performance counting. If resource conservation isn't a concern to you, then you can also choose to not call ps_hwpc_shutdown() at all. In this case, resources will be freed when your program exits.

Return status/ierr value: 0 on success, non-zero if an error occurred.

int ps_hwpc_PAPI_write(const char *filename, int eventset, long long *values)

int ps_hwpc_PAPI_hl_write(const char *filename, int *eventcodes, int eventcodeslength, long long *values)

subroutine PSF_hwpc_PAPI_write(filename, eventset, values, ierr)

character*(*) filename
integer eventset, ierr
integer*8 values(*)

subroutine PSF_hwpc_PAPI_hl_write(filename, eventcodes, eventcodeslength, values, ierr)

character*(*) filename
integer eventcodeslength, eventcodes(eventcodeslength), ierr
integer*8 values(*)

These are convenience routines intended for use by applications that already use the PAPI libraries for performance counting.

Only PAPI "standard" (as opposed to "native") events are supported.

These routines accept a PAPI event set descriptor (or in the case of ps_hwpc_PAPI_hl_write, an array of event codes and its length) and an array of performance counter values that have been returned by the PAPI software. The calling program has the responsibility of ensuring that PAPI has been initialized, that PAPI event sets have been configured and that performance counter measurements have been properly done.

PAPI-based programs that use these routines should not call any of the other routines in libpshwpc (the routines listed above). Doing so will result in a run-time error.

ps_hwpc_PAPI_write and ps_hwpc_PAPI_hl_write translate the data from the PAPI-specific format to the PerfSuite XML format and write the result out to a disk file with the prefix filename (or the current setting of the environment variable PS_HWPC_FILE). You can then use the resulting PerfSuite XML file(s) with any of the appropriate PerfSuite command-line, graphical, or Web-based tools.

These routines may be called any number of times, although you should take care to provide a unique filename for each call (otherwise previous data will be overwritten).

Return status/ierr value: 0 on success, non-zero if an error occurred.

Guidelines / Suggestions for Use

One way of using the libpshwpc routines is as follows:

  1. Obtain a standard (timing-based) profile of your application using a good profiler such as gprof. This will let you know which routines in your program deserve your attention. Keep it simple and don't worry about hardware performance information at this stage.
  2. Insert a call to ps_hwpc_init() at the very beginning of your program. Insert calls to ps_hwpc_stop() and (optionally) ps_hwpc_shutdown() at the place(s) where your program normally exits. Put any string you like as the argument to ps_hwpc_stop() (you can always change it at runtime via the environment variable PS_HWPC_FILE).
  3. Once you've determined which code regions are the major time-consumers, then isolate those areas with ps_hwpc_start() and ps_hwpc_suspend() calls in order to "zoom in" on the region of interest. Try to keep these calls at the outermost level possible. Minimize the perturbation of your program's execution by only concentrating on one region at a time and minimize the calls you make to libpshwpc.
  4. Experiment with the events that you measure to see which are of most interest. You don't need to change the code to modify the measurements - just change your event configuration file and/or relevant environment variables. Experiment with available compiler options.
  5. Don't forget to include any other tools and resources you may have available to you during the process of optimizing your program. Keep track of changes in runtime with additional profiles. Pay attention to compiler optimization reports and listings as you make changes to your source code. Vary the conditions under which your program runs (e.g., use different input files, data sizes, parameters, etc.). Keep in mind that after you've inserted calls to libpshwpc in your program, you can turn off hardware performance counting by setting the environment variable PS_HWPC to "off" in order to minimize effects of the library itself.

It's usually a good idea not to get too "fancy". If there's not much time being spent in a particular function or subroutine, then it's probably not necessary to go through the steps of instrumenting (by hand or other means) that function.

The key phrase is: keep it simple. Focus on where the time is spent, and see what's happening in those portions of your application.

An Example POSIX Threads Program

Here's a complete POSIX threads program. The modifications required to use libpshwpc are the two PerfSuite #include lines and the ps_hwpc_* calls with their error checks. We won't describe the program here, but this program is also used as an example in the documentation for the PerfSuite psrun tool (see the next section), so you can read about it in more depth in that document.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include <perfsuite.h>
#include <pshwpc.h>

pthread_mutex_t reduction_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_t *tid;
int n, num_threads;
double pi, w;

double
f(double a)
{
  return ( 4.0 / (1.0 + a*a) );
}

void *
PIworker(void *arg)
{
  int i, myid;
  double sum, mypi, x;

  /* individual ids start at 0; main passes the thread index through
     arg, because pthread_t values cannot portably be subtracted */
  myid = (int) (intptr_t) arg;

  if (ps_hwpc_start() != 0) {
    fprintf(stderr, "Error starting performance counting!\n");
  }

  /* integrate function */
  sum = 0.0;
  for (i=myid+1; i<=n; i+=num_threads) {
    x = w * ((double) i - 0.5);
    sum += f(x);
  }

  if (ps_hwpc_stop("PIworker") != 0) {
    fprintf(stderr, "Error stopping performance counting!\n");
  }

  mypi = w*sum;
  /* reduce value; the mutex protects the shared variable pi */
  pthread_mutex_lock(&reduction_mutex);
  pi += mypi;
  pthread_mutex_unlock(&reduction_mutex);

  return NULL;
}

int
main(int argc, char **argv)
{
  int i;
  /* check command line */
  if (argc != 3) {
    printf("Usage: %s num-intervals num-threads\n", argv[0]);
    exit(1);
  }

  /* get num intervals and num threads from command line */
  n = atoi(argv[1]);
  num_threads = atoi(argv[2]);
  w = 1.0 / (double) n;
  pi = 0.0;
  tid = (pthread_t *) calloc(num_threads, sizeof(pthread_t));

  if (ps_hwpc_init() != 0) {
    fprintf(stderr, "Error initializing libpshwpc!\n");
  }
  if (ps_hwpc_start() != 0) {
    fprintf(stderr, "Error starting performance counting!\n");
  }

  /* create the threads, passing each its index */
  for (i=0; i<num_threads; i++) {
    if (pthread_create(&tid[i], NULL, PIworker, (void *) (intptr_t) i)) {
      fprintf(stderr, "Cannot create thread %d\n", i);
      exit(1);
    }
  }

  /* join threads */
  for (i=0; i<num_threads; i++) {
    pthread_join(tid[i], NULL);
  }

  printf("computed pi = %.16f\n", pi);

  if (ps_hwpc_stop("PImaster") != 0) {
    fprintf(stderr, "Error stopping performance counting!\n");
  }
  if (ps_hwpc_shutdown() != 0) {
    fprintf(stderr, "Error terminating libpshwpc!\n");
  }

  free(tid);
  return 0;
}

Enabling Automatic Hardware Performance Counting with psrun

If you'd rather not modify your application's source code or relink your program, you can instead cause hardware performance counting based on libpshwpc to be enabled automatically for your program by using the PerfSuite command-line utility psrun. This utility is very convenient and simple to use, and arranges for performance counter measurement to be enabled just before your main program begins execution and to be reported when your program terminates (i.e., the entire application is monitored). psrun can be used with an unmodified executable and also supports POSIX threads.

Last modified: Sunday, 02-Jan-2005 13:50:03 CST

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign