dprof
dprof runs a program and, each time the program is interrupted for a sample, records the operand address of the interrupted instruction.
It accumulates a histogram of accesses to data addresses, grouped by virtual page.
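
The grouping by virtual page can be illustrated with a short sketch. This is not dprof's own code. The function name page_histogram, the sample addresses, and the 16 KB page size are all assumptions made for illustration; the text above does not state the page size.

```python
from collections import Counter

# Assumed page size in bytes; the actual value depends on the system.
PAGE_SIZE = 16 * 1024


def page_histogram(sampled_addresses, page_size=PAGE_SIZE):
    """Count samples per virtual page (page number = address // page_size)."""
    histogram = Counter()
    for address in sampled_addresses:
        histogram[address // page_size] += 1
    return histogram


# Hypothetical sampled operand addresses, for illustration only.
samples = [0x10000, 0x10008, 0x13FF8, 0x14000, 0x20010]
hist = page_histogram(samples)

# Report each page by its base address, in ascending order.
for page, count in sorted(hist.items()):
    print(f"page base {page * PAGE_SIZE:#x}: {count} samples")
```

Pages with high counts are the ones the program touches most often, which is the information needed to judge whether data sits close to the threads that use it.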
The time base is either the interval timer (-itimer) or the R10000 cycle counter (-hwpc).
Parallel programs that incur many cache misses perform best when each thread primarily accesses its local memory.