PAPI FLOPS Calibration Data
Introduction
This page presents the results for a calibration program that displays
PAPI floating point instruction overhead for three different sample cases.
The tests were performed using the standard PAPI_flops() call from
the high level PAPI interface.
The output presents an iteration count, the instruction count measured by PAPI,
the theoretical number of operations, and the difference between the two.
The final column presents the percent error to four decimal places, that is,
to a resolution of 1 ppm.
Typical output is displayed as shown below:
-----------------------------------------------------------
Inner Product Test:
i       papi    theory  diff    %error
-----------------------------------------------------------
1       2       2       0       0.0000
2       4       4       0       0.0000
.
.
.
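The diff and %error columns can be derived from the other two as in the minimal sketch below. It is not taken from calibrate.c: the function names percent_error and format_row are hypothetical, the counts are passed in as plain integers rather than read from PAPI, and the tab-separated layout is an assumption.

```c
#include <stdio.h>

/* Percent error of a measured count against the theoretical count.
   Printed with four decimal places, the smallest visible step is
   0.0001%, which is 1 ppm. */
double percent_error(long long measured, long long theory)
{
    if (theory == 0)
        return 0.0; /* assumption: report an empty test as error-free */
    return 100.0 * (double)(measured - theory) / (double)theory;
}

/* Format one output row: i, papi, theory, diff, %error.
   Returns the value returned by snprintf. */
int format_row(char *buf, size_t len, int i,
               long long measured, long long theory)
{
    return snprintf(buf, len, "%d\t%lld\t%lld\t%lld\t%.4f",
                    i, measured, theory, measured - theory,
                    percent_error(measured, theory));
}
```

For example, a measured count of 1000009 against a theoretical count of 1000000 gives a percent error of 0.0009, or 9 ppm.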
The three tests performed, with their theoretical operation counts, are:
-
Inner Product: (2*n)
-
for i = 1:n; a = a + x(i)*y(i); end
-
Matrix Vector Multiply: (2*n^2)
-
for i = 1:n; for j = 1:n; x(i) = x(i) + a(i,j)*y(j); end; end;
-
Matrix Matrix Multiplication: (2*n^3)
-
for i = 1:n; for j = 1:n; for k = 1:n; c(i,j) = c(i,j) + a(i,k)*b(k,j); end; end; end;
The code for this test is found in calibrate.c,
which can be downloaded from the PAPI website.
This test is part of the standard PAPI test suite.
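For readers who want to see the three kernels in C, the following is a minimal sketch. It is not the code from calibrate.c: the function names, the row-major matrix layout, and the use of double precision are all assumptions made for illustration. Each innermost statement performs one multiply and one add, which is where the theoretical counts 2*n, 2*n^2, and 2*n^3 come from.

```c
#include <stddef.h>

/* Inner product: n multiply-add pairs, 2*n operations. */
double inner_product(size_t n, const double *x, const double *y)
{
    double a = 0.0;
    for (size_t i = 0; i < n; i++)
        a += x[i] * y[i];
    return a;
}

/* Matrix vector multiply, x = x + A*y, with A stored row-major
   as an n-by-n array: 2*n^2 operations. */
void matrix_vector(size_t n, const double *a, const double *y, double *x)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            x[i] += a[i * n + j] * y[j];
}

/* Matrix matrix multiply, C = C + A*B, all n-by-n row-major:
   2*n^3 operations. */
void matrix_matrix(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Theoretical operation counts for problem size n. */
long long theory_inner(long long n)  { return 2 * n; }
long long theory_matvec(long long n) { return 2 * n * n; }
long long theory_matmat(long long n) { return 2 * n * n * n; }
```

A measurement would wrap each kernel call between two readings of the floating point counter and compare the difference with the matching theory_ function. As the Solaris results suggest, an optimizing compiler may remove or restructure loops like these, so the optimization level matters when reproducing the counts.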
Errors
Errors can occur from several sources:
-
There can be an initial offset added to all the measured values
due to startup costs.
-
There can be a constant positive slope introduced by overhead associated
with the PAPI_flops() call.
-
There can be random, quantized positive errors introduced by other processes
or threads when counts are not kept separately for each process or thread.
-
There can be negative slopes associated with undercounting of combined
operations, such as FMA.
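One way to separate an offset from a slope in a set of results is to fit a straight line to the diff column as a function of the theoretical count. The sketch below uses an ordinary least-squares fit. This is an illustrative technique, not something the calibrate test is stated to do, and the function name fit_offset_slope and the linear error model diff = offset + slope * theory are assumptions.

```c
#include <stddef.h>

/* Least-squares fit of diff[k] = offset + slope * theory[k].
   Returns 0 on success, or -1 when the fit is undetermined
   (fewer than two points, or all theory values equal). */
int fit_offset_slope(size_t n, const double *theory, const double *diff,
                     double *offset, double *slope)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    double denom;

    if (n < 2)
        return -1;
    for (size_t k = 0; k < n; k++) {
        sx  += theory[k];
        sy  += diff[k];
        sxx += theory[k] * theory[k];
        sxy += theory[k] * diff[k];
    }
    denom = (double)n * sxx - sx * sx;
    if (denom == 0.0)
        return -1;
    *slope  = ((double)n * sxy - sx * sy) / denom;
    *offset = (sy - *slope * sx) / (double)n;
    return 0;
}
```

Under this model, a nonzero offset with zero slope points to startup cost, a positive slope to per-call overhead or interference from other threads, and a negative slope to undercounting.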
These errors can be minimized through several different approaches:
-
The initial offset error is usually an artifact of careless coding.
This was the case in early versions of calibrate.c. These errors
can usually be minimized or eliminated through careful coding during initialization.
-
A constant positive slope can be corrected by applying a correction factor
within the PAPI_flops() call itself. A series of such correction factors
was developed empirically for specific machines here at ICL. If you have
other architectures that exhibit this behaviour, let us know and we will
work with you to develop appropriate corrections.
-
The 'spurious process' error is uncorrectable, but usually small. The solution
is for us to implement PAPI with process or thread granularity on all processors.
This may be difficult or impossible on some systems, but remains a goal
of the project. Meanwhile, the magnitude of the error can be assessed by
examining the output of the calibrate test. For example, on Windows 2000
the error is often 0 for small counts and converges to a limit of about
9 ppm for larger counts.
-
Currently, the only known combined operation for which undercounting occurs
is the combined floating point multiply-add (FMA) instruction. Most
architectures that support FMA provide no way to count it explicitly. We
are then left with an unsavory choice:
-
Assume NO operations are FMA. Reported FLOPS will be low
by the fraction of FMAs in the sample;
-
Assume ALL operations are FMA. Reported FLOPS will be high
by the fraction of operations in the sample that are not FMAs.
We have chosen the second option and assumed all operations are FMA for
those machines that support FMA. This is a reasonable choice for the matrix
manipulations represented in the calibrate test and found in many high
performance codes. It may produce large positive errors in other more general
cases where there is a broader distribution of floating point operations.
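The effect of the two assumptions can be worked through with a small sketch. It assumes a counter that records each FMA as a single instruction, and it takes the number of FMAs as a known input purely for illustration, since in practice that number is usually not available. The function names are hypothetical.

```c
/* fp_ins:  floating point instructions recorded by the counter
   fma_ins: how many of those instructions are FMAs (0 <= fma_ins <= fp_ins) */

/* True operation count: every FMA contributes one extra operation. */
long long true_flops(long long fp_ins, long long fma_ins)
{
    return fp_ins + fma_ins;
}

/* Assume no instruction is an FMA: report the raw count. */
long long assume_no_fma(long long fp_ins)
{
    return fp_ins;
}

/* Assume every instruction is an FMA: report twice the raw count. */
long long assume_all_fma(long long fp_ins)
{
    return 2 * fp_ins;
}
```

With 100 counted instructions of which 25 are FMAs, the true count is 125. Assuming no FMAs reports 100, which is low by 25. Assuming all FMAs reports 200, which is high by 75. When all 100 instructions are FMAs, as in an idealized matrix kernel, the all-FMA assumption reports exactly 200.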
Summaries
Below are summaries of results for the processors and operating systems
on which the calibrate test was performed:
Windows / Intel
This was the first platform tested. Initial results indicated both
a small positive initial offset and a small positive slope.
Final results show no offset and a quantized positive slope induced
by other threads performing floating point operations.
This slope is random and often appears only in the higher order tests.
It appears to converge to about 9 ppm in the higher order limit.
AIX / Power
The Power series includes an FMA instruction that performs two floating
point operations in a single instruction cycle.
This is apparently accounted for in the (derived) floating point instruction
value produced by PAPI.
The results of these tests initially showed a significant offset, which
was eliminated by recoding the startup sequence.
A constant, unexplained positive slope of 0.1% appears in the
Matrix Vector Multiply test.
This slope drops to 1 ppm or less for the Matrix Matrix Multiply test. Because
it was small and variable, no correction was applied.
The Power series includes a separate FMA metric. It is apparently not
needed, since FMAs appear to be recorded as 2 FP operations.
Solaris / UltraSparc
On the UltraSparc on which these tests were performed, the compiler ordinarily
optimizes out most of the computations in simple for loops.
Thus calibrate.c must be compiled with optimization turned down (-xO1
instead of -xO4) to obtain meaningful results.
Even so, the higher order tests show a small and variable negative
slope of less than 0.2% for the Matrix Vector Multiply test.
For the larger counts of that test and for the Matrix Matrix Multiply
Test, this negative slope dropped to between 0.05% and 0.02%.
The source of this negative slope is unexplained and uncorrected, but
may still result from some unknown optimization.
UltraSparc supports an FMA instruction and typically undercounts FMAs
by a factor of 2. Because of this, actual floating point counts are multiplied
by 2 in the PAPI_flops call before reporting.
Irix / MIPS
Irix / MIPS initially showed a small positive offset of about 34,
which was eliminated by careful coding as described above for the Power
architecture.
Also, a constant positive slope of 9 was observed for each call to
PAPI_flops().
Further, MIPS, like UltraSparc, supports an FMA instruction and under-reports
floating point operations as a result.
By applying a slope correction and an FMA scaling factor, the error
in all three tests was reduced to 0.
Remember that actual floating point counts are multiplied by 2 in the
PAPI_flops call before reporting.
Linux / Intel
Not tested yet.
Linux / Itanium
Not tested yet.
Linux / AMD
The Athlon does not have a floating point metric.
Tru64 / Alpha
The Alpha does not have a floating point metric.
Cray T3E
Not tested yet.