The standard caveats of profiling apply when interpreting the results from OProfile: profile realistic situations, profile different scenarios, profile for as long a time as possible, avoid system-specific artifacts, and don't trust the profile data too much. Also bear in mind the comments on the performance counters above: you cannot rely on totally accurate instruction-level profiling. However, in almost all circumstances the data can be useful. Ideally a utility such as Intel's VTune would be available to allow careful instruction-level analysis; go hassle Intel for this, not me ;)
This is an example of how the latency of delivery of profiling interrupts can impact the reliability of the profiling data. This is pretty much a worst-case-scenario example: these problems are fairly rare.
    double fun(double a, double b, double c)
    {
        double result = 0;
        for (int i = 0 ; i < 10000; ++i) {
            result += a;
            result *= b;
            result /= c;
        }
        return result;
    }
Here the last operation in the loop body, the division, is very costly, and you would expect the results to reflect that. However (showing only the instructions inside the loop):
    $ opannotate -a -t 10 ./a.out

        88 15.38% : 8048337: fadd %st(3),%st
        48 8.391% : 8048339: fmul %st(2),%st
        68 11.88% : 804833b: fdiv %st(1),%st
       368 64.33% : 804833d: inc %eax
                  : 804833e: cmp $0x270f,%eax
                  : 8048343: jle 8048337
The problem comes from the x86 hardware: when the counter overflows, the IRQ is asserted, but several hardware features can delay the NMI. x86 hardware is synchronous (i.e. it cannot interrupt during an instruction); there is also a latency when the IRQ is asserted; and the multiple execution units and the out-of-order model of modern x86 CPUs also cause problems. This is the same function, with source annotation:
    $ opannotate -s -t 10 ./a.out

                  :double fun(double a, double b, double c)
                  :{ /* _Z3funddd total: 572 100.0% */
                  :    double result = 0;
       368 64.33% :    for (int i = 0 ; i < 10000; ++i) {
        88 15.38% :        result += a;
        48 8.391% :        result *= b;
        68 11.88% :        result /= c;
                  :    }
                  :    return result;
                  :}
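The shift seen in these listings can be illustrated with a toy model. This is only a sketch under loud assumptions: the per-instruction cycle costs are invented for illustration (they are not measured), the instructions are assumed to run strictly one after another, and the delay is modelled as exactly one instruction. The names `Insn`, `kLoop`, `insn_at` and `take_samples` are hypothetical and are not part of OProfile.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// One instruction of the loop with a HYPOTHETICAL cost in cycles.
struct Insn {
    const char* name;
    int cycles;
};

// Illustrative costs only (assumed, not measured): fdiv dominates.
const std::array<Insn, 6> kLoop = {{
    {"fadd", 3}, {"fmul", 5}, {"fdiv", 30},
    {"inc", 1}, {"cmp", 1}, {"jle", 1},
}};

// Index of the instruction in flight at the given cycle, treating the
// loop as repeating forever with strictly sequential execution.
std::size_t insn_at(long cycle) {
    long period = 0;
    for (const Insn& insn : kLoop) period += insn.cycles;
    long offset = cycle % period;
    for (std::size_t i = 0; i < kLoop.size(); ++i) {
        if (offset < kLoop[i].cycles) return i;
        offset -= kLoop[i].cycles;
    }
    return kLoop.size() - 1;  // not reached: offset is always < period
}

// Takes one sample every `interval` cycles. Without skid, the sample is
// charged to the instruction in flight when the counter overflows. With
// skid, it is charged to the following instruction, modelling an
// interrupt that is only taken once the current instruction completes.
std::vector<int> take_samples(long num_samples, long interval, bool skid) {
    assert(num_samples >= 0 && interval > 0);
    std::vector<int> counts(kLoop.size(), 0);
    for (long k = 0; k < num_samples; ++k) {
        std::size_t i = insn_at(k * interval);
        if (skid) i = (i + 1) % kLoop.size();
        ++counts[i];
    }
    return counts;
}
```

In this model, `take_samples(4100, 7, false)` charges most samples to `fdiv`, while `take_samples(4100, 7, true)` charges those same samples to `inc`, which is the pattern seen in the listings. Real hardware is less regular than this: the delay varies from sample to sample.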
The conclusion: don't trust samples coming at the end of a loop, particularly if the last instruction generated by the compiler is costly. This case can also occur for branches. Always bear in mind that samples can be delayed by a few cycles from their real position. That's a hardware problem and OProfile can do nothing about it.
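One possible way to act on this, offered as a suggestion rather than as an OProfile feature, is to sum the samples over the whole loop before drawing conclusions, since a delay of a few cycles usually keeps a sample inside the same loop. The sketch below uses the addresses and counts from the assembly listing above; lines shown without a count are assumed to have zero samples. The names `Sample`, `kLoopSamples`, `samples_in_range` and `share_percent` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One row of per-address output: instruction address and sample count.
struct Sample {
    std::uint32_t addr;
    unsigned count;
};

// Counts copied from the assembly listing; rows printed without a
// count are ASSUMED to mean zero samples.
const std::vector<Sample> kLoopSamples = {
    {0x8048337, 88},  {0x8048339, 48}, {0x804833b, 68},
    {0x804833d, 368}, {0x804833e, 0},  {0x8048343, 0},
};

// Sums the samples whose address lies in [first, last], inclusive.
unsigned samples_in_range(const std::vector<Sample>& samples,
                          std::uint32_t first, std::uint32_t last) {
    assert(first <= last);
    unsigned total = 0;
    for (const Sample& s : samples) {
        if (s.addr >= first && s.addr <= last) total += s.count;
    }
    return total;
}

// Percentage of `whole` represented by `part` (0 if `whole` is 0).
double share_percent(unsigned part, unsigned whole) {
    return whole == 0 ? 0.0 : 100.0 * part / whole;
}
```

For the listing above, `samples_in_range(kLoopSamples, 0x8048337, 0x8048343)` gives 572, which matches the function total reported in the source annotation. The loop as a whole is clearly the hot spot, even though the split between its instructions is unreliable.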