The standard caveats of profiling apply in interpreting the results from OProfile: profile realistic situations, profile different scenarios, profile for as long a time as possible, avoid system-specific artifacts, and don't trust the profile data too much. Also bear in mind the comments on the performance counters above: you cannot rely on totally accurate instruction-level profiling. However, for almost all circumstances the data can be useful. Ideally a utility such as Intel's VTune would be available to allow careful instruction-level analysis; go hassle Intel for this, not me ;)
This is an example of how the latency of delivery of profiling interrupts can impact the reliability of the profiling data. This is pretty much a worst-case-scenario example: these problems are fairly rare.
double fun(double a, double b, double c)
{
    double result = 0;
    for (int i = 0 ; i < 10000; ++i) {
        result += a;
        result *= b;
        result /= c;
    }
    return result;
}
Here the last operation in the loop body, the division, is very costly, and you would expect the profile to reflect that, but (showing only the instructions inside the loop):
$ opannotate -a -t 10 ./a.out
88 15.38% : 8048337: fadd %st(3),%st
48 8.391% : 8048339: fmul %st(2),%st
68 11.88% : 804833b: fdiv %st(1),%st
368 64.33% : 804833d: inc %eax
: 804833e: cmp $0x270f,%eax
: 8048343: jle 8048337
The problem comes from the x86 hardware: when the counter overflows, the IRQ is asserted, but several hardware features can delay the NMI. x86 hardware is synchronous (i.e. an instruction cannot be interrupted partway through); there is also a latency when the IRQ is asserted; and the multiple execution units and the out-of-order model of modern x86 CPUs also cause problems. This is the same function, with source annotation:
$ opannotate -s -t 10 ./a.out
:double fun(double a, double b, double c)
:{ /* _Z3funddd total: 572 100.0% */
: double result = 0;
368 64.33% : for (int i = 0 ; i < 10000; ++i) {
88 15.38% : result += a;
48 8.391% : result *= b;
68 11.88% : result /= c;
: }
: return result;
:}
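The percentages in both listings are each line's sample count divided by the function total of 572. As a small sketch of that arithmetic (the struct and function names here are made up for illustration and are not part of OProfile; only the counts are taken from the listing above):

```cpp
#include <array>

// One annotated source line and the number of samples charged to it.
// Hypothetical type, used only for this illustration.
struct LineSamples {
    const char* text;
    long samples;
};

// Sample counts copied from the opannotate -s output above.
constexpr std::array<LineSamples, 4> kAnnotated = {{
    {"for (int i = 0 ; i < 10000; ++i) {", 368},
    {"result += a;", 88},
    {"result *= b;", 48},
    {"result /= c;", 68},
}};

// Total samples for the function (the "total: 572" in the listing).
long total_samples()
{
    long total = 0;
    for (const LineSamples& line : kAnnotated)
        total += line.samples;
    return total;
}

// Share of the function total, in percent, for a given sample count.
double percent_of_total(long samples)
{
    return 100.0 * static_cast<double>(samples)
                 / static_cast<double>(total_samples());
}
```

The total for the whole loop does not depend on which line inside the loop a sample was charged to, so it is less affected by the delay than the per-line split is.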
The conclusion: don't trust samples coming at the end of a loop, particularly if the last instruction generated by the compiler is costly. This case can also occur for branches. Always bear in mind that a sample can be delayed by a few cycles from its real position. That's a hardware problem and OProfile can do nothing about it.
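To see why a costly instruction pushes its samples onto its successor, here is a deliberately crude toy model, not a description of real hardware. It assumes each instruction takes a fixed number of cycles (the costs below are invented, not measured), that the counter overflows every `period` cycles, and that the interrupt is only taken once the running instruction has finished, so the sample is charged to the next instruction. It ignores out-of-order execution and interrupt latency entirely.

```cpp
#include <array>
#include <cstddef>

// Instruction names follow the opannotate listing above; the cycle
// costs are hypothetical values chosen only to make fdiv expensive.
struct Insn {
    const char* name;
    int cycles;
};

constexpr std::array<Insn, 4> kLoop = {{
    {"fadd", 3}, {"fmul", 5}, {"fdiv", 30}, {"inc", 1},
}};

struct Counts {
    std::array<long, 4> executing{};  // instruction running at overflow
    std::array<long, 4> reported{};   // instruction the sample is charged to
};

// Run the loop for total_cycles cycles, taking one sample every
// period cycles.
Counts simulate(long total_cycles, long period)
{
    Counts c;
    long loop_cycles = 0;
    for (const Insn& insn : kLoop)
        loop_cycles += insn.cycles;

    for (long t = period; t <= total_cycles; t += period) {
        // Position of this overflow within one loop iteration.
        long offset = t % loop_cycles;

        // Find the instruction that is executing at that position.
        std::size_t idx = 0;
        long end = kLoop[0].cycles;
        while (offset >= end) {
            ++idx;
            end += kLoop[idx].cycles;
        }
        ++c.executing[idx];

        // Model assumption: the interrupt waits for the instruction to
        // finish, so the saved address is that of the next instruction.
        ++c.reported[(idx + 1) % kLoop.size()];
    }
    return c;
}
```

In this model, `simulate(27300, 7)` takes 3900 samples; 3000 of them overflow while `fdiv` is executing, yet all 3000 are reported against `inc`. That has the same shape as the listing above, where `inc` collects most of the samples.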