I know, and that usually works. If you remember, Newton had a runtime profiler.
What happened was that as multithreading started to become more important, the profiler kept breaking each time, and worse than that, many times it gave wrong information because of thread timing.
Then I spent a lot of time writing the visualization and explaining to people how to use it. At some point the profiler was an API in itself, with about half a dozen functions, and even with that it was extremely limited. It became a nightmare to maintain, and early in 3.14 I decided to get rid of it altogether.
That caused a problem because I was unable to optimize the engine. I was using Very Sleepy and sometimes VTune, but sampling profilers are only good for optimizing functions individually, not for optimizing the engine architecture.
I am very happy with this simple implementation and grateful to Google Chrome for writing a visualization tool.
I believe that as people start to discover the Google Chrome sample visualizer, it will become the standard.
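To give an idea of how little code this kind of profiler needs, here is a minimal sketch of a scoped timer that records events and writes them in the JSON array form of the Chrome Trace Event Format ("complete" events, ph = "X", times in microseconds). This is not the code that is in Newton; TraceEvent, TraceLog and ScopedTrace are hypothetical names made up for the example, and it assumes event names contain no quotes or backslashes, since nothing is escaped.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical names throughout; this is an illustration, not Newton's API.
struct TraceEvent {
    std::string name;      // label shown in the viewer
    std::uint64_t tid;     // thread lane in the timeline
    std::int64_t startUs;  // start time in microseconds
    std::int64_t durUs;    // duration in microseconds
};

class TraceLog {
public:
    // Thread safe, so worker threads can record into the same log.
    void Add(const TraceEvent& e) {
        assert(e.durUs >= 0);
        std::lock_guard<std::mutex> lock(m_mutex);
        m_events.push_back(e);
    }

    // Serialize every event as a "complete" event (ph = "X") in a JSON array.
    std::string ToJson() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        std::ostringstream out;
        out << "[";
        for (std::size_t i = 0; i < m_events.size(); ++i) {
            const TraceEvent& e = m_events[i];
            out << (i ? "," : "")
                << "{\"name\":\"" << e.name << "\",\"cat\":\"engine\","
                << "\"ph\":\"X\",\"pid\":1,\"tid\":" << e.tid
                << ",\"ts\":" << e.startUs << ",\"dur\":" << e.durUs << "}";
        }
        out << "]";
        return out.str();
    }

private:
    mutable std::mutex m_mutex;
    std::vector<TraceEvent> m_events;
};

// Records one event covering the lifetime of the object.
class ScopedTrace {
public:
    ScopedTrace(TraceLog& log, std::string name)
        : m_log(log), m_name(std::move(name)), m_start(Clock::now()) {}

    ~ScopedTrace() {
        using std::chrono::duration_cast;
        using std::chrono::microseconds;
        const Clock::time_point end = Clock::now();
        // Truncated hash of the thread id, used only as a lane label.
        const std::uint32_t lane = static_cast<std::uint32_t>(
            std::hash<std::thread::id>{}(std::this_thread::get_id()));
        m_log.Add({m_name, lane,
                   duration_cast<microseconds>(m_start.time_since_epoch()).count(),
                   duration_cast<microseconds>(end - m_start).count()});
    }

private:
    using Clock = std::chrono::steady_clock;
    TraceLog& m_log;
    std::string m_name;
    Clock::time_point m_start;
};
```

Put a ScopedTrace at the top of each function you want to see, save the string from ToJson() to a file at the end of the run, and load that file in chrome://tracing. Each thread shows up as its own lane, which is what a sampling profiler does not give you.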
Intel, RAD Game Tools and all the third parties who make commercial profilers charge ridiculous amounts of money for tools that are in fact quite mediocre.
What I miss is info about bandwidth limits and cache misses.
It has been a long time since I really cared about cache misses. I used to, but in my experience that is something better left to the compiler and the hardware.
Take for example the prefetch instruction. In the Pentium 4 days it was actually a good thing to use.
But today if you use that instruction there is a high chance, almost 90%, that you add a severe performance penalty to your code. In fact, on new hardware that instruction is just ignored.
The reason is that modern CPUs have a few streaming channels. Basically, each time the code does more than two fetches from consecutive memory locations, the hardware goes ahead and prefetches the next location. It has several channels for that. But if you go and execute an explicit prefetch instruction, then the streaming hardware is nullified.
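To make the two cases concrete, here is a small sketch of the same sequential loop written both ways. PrefetchHint, SumPlain and SumWithPrefetch are hypothetical names for this example. The explicit version assumes GCC or Clang and uses their __builtin_prefetch builtin; on other compilers the hint is a no-op in this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: issue an explicit prefetch where the compiler offers one.
inline void PrefetchHint(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p);  // GCC/Clang builtin
#else
    (void)p;                // no-op on other compilers in this sketch
#endif
}

// Plain sequential walk: consecutive fetches, the pattern the
// streaming hardware is meant to pick up on its own.
long long SumPlain(const std::vector<int>& data) {
    long long sum = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        sum += data[i];
    }
    return sum;
}

// Same walk with a hand-placed prefetch 'ahead' elements in front.
long long SumWithPrefetch(const std::vector<int>& data, std::size_t ahead) {
    assert(ahead > 0);
    long long sum = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i + ahead < data.size()) {
            PrefetchHint(&data[i + ahead]);  // stay inside the array
        }
        sum += data[i];
    }
    return sum;
}
```

Both functions return the same value, so the only difference is timing. If you are tempted to keep the second form, time both on the hardware you actually ship on before you decide.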
The moral is that with today's hardware, it is better to concentrate on algorithm optimization and leave the hardcore hardware stuff to the hardware and the compiler.