Chapter 5. Interpreting profiling results

Table of Contents

1. Profiling interrupt latency
2. Kernel profiling
2.1. Interrupt masking
2.2. Idle time
2.3. Profiling kernel modules
3. Interpreting call-graph profiles
4. Inaccuracies in annotated source
4.1. Side effects of optimizations
4.2. Prologues and epilogues
4.3. Inlined functions
4.4. Inaccuracy in line number information
5. Assembly functions
6. Overlapping symbols in JITed code
7. Using operf to profile fork/execs
8. Other discrepancies

The standard caveats of profiling apply in interpreting the results from OProfile: profile realistic situations, profile different scenarios, profile for as long as a time as possible, avoid system-specific artifacts, don't trust the profile data too much. Also bear in mind the comments on the performance counters above - you cannot rely on totally accurate instruction-level profiling. However, for almost all circumstances the data can be useful. Ideally a utility such as Intel's VTUNE would be available to allow careful instruction-level analysis; go hassle Intel for this, not me ;)

1. Profiling interrupt latency

This is an example of how the latency of delivery of profiling interrupts can impact the reliability of the profiling data. This is pretty much a worst-case-scenario example: these problems are fairly rare.

double fun(double a, double b, double c)
{
 double result = 0;
 for (int i = 0 ; i < 10000; ++i) {
  result += a;
  result *= b;
  result /= c;
 }
 return result;
}

Here the last instruction of the loop is very costly, and you would expect the result reflecting that - but (cutting the instructions inside the loop):

$ opannotate -a -t 10 ./a.out

     88 15.38% : 8048337:       fadd   %st(3),%st
     48 8.391% : 8048339:       fmul   %st(2),%st
     68 11.88% : 804833b:       fdiv   %st(1),%st
    368 64.33% : 804833d:       inc    %eax
               : 804833e:       cmp    $0x270f,%eax
               : 8048343:       jle    8048337

The problem comes from the x86 hardware; when the counter overflows the IRQ is asserted but the hardware has features that can delay the NMI interrupt: x86 hardware is synchronous (i.e. cannot interrupt during an instruction); there is also a latency when the IRQ is asserted, and the multiple execution units and the out-of-order model of modern x86 CPUs also causes problems. This is the same function, with annotation :

$ opannotate -s -t 10 ./a.out

               :double fun(double a, double b, double c)
               :{ /* _Z3funddd total:     572 100.0% */
               : double result = 0;
    368 64.33% : for (int i = 0 ; i < 10000; ++i) {
     88 15.38% :  result += a;
     48 8.391% :  result *= b;
     68 11.88% :  result /= c;
               : }
               : return result;
               :}

The conclusion: don't trust samples coming at the end of a loop, particularly if the last instruction generated by the compiler is costly. This case can also occur for branches. Always bear in mind that samples can be delayed by a few cycles from its real position. That's a hardware problem and OProfile can do nothing about it.