What profiler do people use?

by **Julio Jerez** » Fri Mar 11, 2016 6:49 pm

Darn, that tool is coollllll :mrgreen:

It is everything I was looking for, and if you ask me it is far superior to RAD's Telemetry,
and they want $7000+ for a one-year license. This is awesome.

But with Google putting the events into a file format, it is very easy to expose the information you want, plus you can use any database reader to present the data.
Not that it needs a better one; the Chrome one is outstanding.
I got the first pass working and now I only need to place the macros in the places that I want to profile.
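For readers following along, here is a minimal sketch of what such macros could look like. It is not Newton's actual profiler code: `TraceLog`, `ScopedTrace` and `TRACE_SCOPE` are hypothetical names, the pid and tid are hard-coded to 1, and events are simply kept in memory as begin/end ("B"/"E") pairs in the JSON array form that Chrome's trace viewer reads.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical in-memory sink for Trace Event Format lines.
struct TraceLog {
    std::ostringstream out;
    bool first = true;

    // Appends one event. The name is written as-is, so it must not
    // contain characters that need JSON escaping.
    void Emit(const char* name, char phase, std::int64_t tsMicroseconds) {
        assert(name != nullptr);
        out << (first ? "[\n" : ",\n");
        first = false;
        out << "{\"name\": \"" << name << "\", \"cat\": \"PERF\", \"ph\": \"" << phase
            << "\", \"pid\": 1, \"tid\": 1, \"ts\": " << tsMicroseconds << "}";
    }

    // Returns the complete JSON array.
    std::string Finish() const { return out.str() + (first ? "[]" : "\n]"); }
};

// Timestamps in the trace format are expressed in microseconds.
inline std::int64_t NowMicroseconds() {
    using namespace std::chrono;
    return duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count();
}

// Emits a "B" event on construction and the matching "E" event on destruction.
class ScopedTrace {
public:
    ScopedTrace(TraceLog& sink, const char* name) : m_sink(sink), m_name(name) {
        m_sink.Emit(m_name, 'B', NowMicroseconds());
    }
    ~ScopedTrace() { m_sink.Emit(m_name, 'E', NowMicroseconds()); }
    ScopedTrace(const ScopedTrace&) = delete;
    ScopedTrace& operator=(const ScopedTrace&) = delete;

private:
    TraceLog& m_sink;
    const char* m_name;
};

#define TRACE_CONCAT2(a, b) a##b
#define TRACE_CONCAT(a, b) TRACE_CONCAT2(a, b)
// Place at the top of any scope that should show up in the trace.
#define TRACE_SCOPE(sink, name) ScopedTrace TRACE_CONCAT(traceScope_, __LINE__)(sink, name)
```

Putting `TRACE_SCOPE(sink, "Update");` at the top of a function records how long that scope took, and nested scopes show up as nested bars in the viewer. This sketch is single-threaded; a real profiler would also need one sink per thread or a lock.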

Thanks for that info, it saved me a week of work :mrgreen:

by **godlike** » Sat Mar 12, 2016 1:04 pm

Julio Jerez wrote: Thanks for that info, it saved me a week of work.

My pleasure

by **Julio Jerez** » Sat Mar 12, 2016 4:32 pm

This is how the profiles look so far.

[Attachment: profile.png (41.76 KiB)]

Now I need to place more strategic calls in the critical places of the engine.

by **Julio Jerez** » Sat Mar 12, 2016 6:49 pm

Wow, this is a huge problem. Google Chrome has a really hard time opening a relatively large file.
When I add events for the dynamics part of the engine the JSON file is 95 MB, but when I add events for collision it explodes to 751 MB, and Chrome just silently rejects the file for 500 frames.
Then if I try to reopen it I get a funny "aw, something went wrong" error.

Anyway, I will keep making small traces, and then maybe in a few days I will write a trace viewer in C#.

One thing I would like to do is add compression, so that writing the output will not slow things down so much.
A raw text file is nice, but it generates a huge amount of data.

by **Stucuk** » Sat Mar 12, 2016 11:23 pm

I would imagine that people using Chrome to view profile information would not be doing it on a per-frame basis but at a greater interval.

If you are capturing the data per frame wouldn't that cause slowdown in the simulation?

by **Julio Jerez** » Sat Mar 12, 2016 11:59 pm

Actually no, because I am buffering the events in a queue and saving them per frame.
This eliminates the slowdown between events, but it slows down a lot from frame to frame, so I added a double-buffered queue.
With this it is near perfect. The second queue is like a level-two cache that gets flushed when it is full, and at that point it stops the application to write something like a 3-megabyte file, but this does not interfere with the relative timing between frames.
It is pretty cool for such a simple tool.
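A rough sketch of the two-level buffering described above, under stated assumptions: `FrameBufferedTrace` and its methods are hypothetical names, events are already formatted as text lines, and everything runs on one thread. Recording only appends to memory, the end of a frame moves the frame's events into a larger pending buffer, and the sink is written only when that buffer passes a size threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>   // std::ostream; std::ostringstream works as an in-memory sink
#include <string>
#include <vector>

class FrameBufferedTrace {
public:
    FrameBufferedTrace(std::ostream& sink, std::size_t flushBytes)
        : m_sink(sink), m_flushBytes(flushBytes) {
        assert(flushBytes > 0);
    }

    // Called from profiled code: an in-memory append only, no I/O.
    void Record(const std::string& eventLine) { m_frameEvents.push_back(eventLine); }

    // Called once per frame: move this frame's events to the second-level
    // buffer and write it out only when it has grown large enough.
    void EndFrame() {
        for (const std::string& line : m_frameEvents) {
            m_pending += line;
            m_pending += '\n';
        }
        m_frameEvents.clear();
        if (m_pending.size() >= m_flushBytes) {
            Flush();
        }
    }

    // The only place that touches the (slow) sink.
    void Flush() {
        m_sink << m_pending;
        m_pending.clear();
        ++m_flushCount;
    }

    std::size_t FlushCount() const { return m_flushCount; }

private:
    std::ostream& m_sink;
    std::size_t m_flushBytes;
    std::vector<std::string> m_frameEvents;  // first level: current frame
    std::string m_pending;                   // second level: waiting to be written
    std::size_t m_flushCount = 0;
};
```

Because the write happens between frames, the stall lands outside the timed regions, which is why the relative timing inside a frame stays intact. A final `Flush()` is still needed at shutdown to write whatever is left.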

Here is the latest capture, which compares Newton in its own thread versus Newton with three worker threads. Remarkably, Newton's load balance is near perfect.
I managed to distribute the workload almost equally to each thread.
I also found a surprise: the collision is slower than the solver when the solver is not dealing with large islands.
I also saw which parts of the engine are not parallel, so that is a future task, because they take a significant portion of the frame.

I still have not solved the problem with loading very long files, like 1 GB or larger, so my only option is to profile smaller scenes, but all in all this is a very good tool.

Also, the tool is standalone; anyone can use it if they want to.

[Attachment: profile.png (130.52 KiB)]

by **godlike** » Mon Mar 14, 2016 4:49 am

Hi Julio. I briefly checked your code. There is a way to save some space in the json file. Currently you record events by using one line to start the event and one line to end it. The first line is with "ph":"B" and the second with "ph":"E". There is a way to write that in a single line by using "ph":"X" and providing both the start of the event and its duration. For example:

Code:

    {"name": "GL_THREAD", "cat": "PERF", "ph": "X", "pid": 666, "tid": 139793878943488, "ts": 376671, "dur": 10},

It's true that the json file tends to become quite huge and sometimes chrome chokes.
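As an illustration of the "ph":"X" form, here is a small sketch that folds a begin timestamp and an end timestamp into one complete event. `FormatCompleteEvent` is a hypothetical helper, the category is fixed to "PERF" to match the example above, and the name is assumed not to need JSON escaping.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Builds a single complete ("X") event from a begin and an end timestamp,
// both in microseconds, replacing a separate "B" line and "E" line.
std::string FormatCompleteEvent(const std::string& name,
                                std::int64_t pid,
                                std::int64_t tid,
                                std::int64_t beginUs,
                                std::int64_t endUs) {
    assert(endUs >= beginUs);
    std::ostringstream out;
    out << "{\"name\": \"" << name << "\", \"cat\": \"PERF\", \"ph\": \"X\""
        << ", \"pid\": " << pid << ", \"tid\": " << tid
        << ", \"ts\": " << beginUs << ", \"dur\": " << (endUs - beginUs) << "}";
    return out.str();
}
```

Calling `FormatCompleteEvent("GL_THREAD", 666, 139793878943488, 376671, 376681)` gives the same line as the example, without the trailing comma. Since every scope then costs one line instead of two, the file shrinks by roughly half for the same capture.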

by **Julio Jerez** » Mon Mar 14, 2016 9:07 am

Oh thanks, I will use that option.
Still, I do not mind that Chrome chokes; it is still quite good.
I can reduce the profile to a limited number of frames. Right now I have 500, but I can make it, say, 10 frames.
I have already determined hot places in the engine that need attention, so to me this is a very good investment of time.
And like I said, from here I can always write my own dedicated viewer in C#.

by **Julio Jerez** » Mon Mar 14, 2016 10:31 am

I made that change and this is how the engine looks when running asynchronously on one thread.

[Attachment: profile.png (34.57 KiB)]

It consistently shows that for a simple scene, with everything active, the collision costs more than a single solve pass.
I was always under the impression that it was the other way around.

I need to test stacks to see if this holds true, but first I need to make some optimizations to the collision, and this tool is helping a lot.

by **godlike** » Mon Mar 14, 2016 11:47 am

Looks interesting. Is that only 2 threads? Another question: do you take mutex locks into account?

I wish I could test it but at the moment the linux build is broken.

by **Julio Jerez** » Mon Mar 14, 2016 1:35 pm

Yes, that is just the main thread, which does graphics and the high-level logic, and the Newton thread.

On the mutex, no, I have not added those kinds of events yet, but it should be easy by adding single events. I thought of doing it as an event with a special label, but it is better to use the special feature of the format because the viewer may represent it differently.
We can place them around calls like WaitForSingleObject or locks, semaphores and Sleep.
That will label when the thread goes to sleep and when it wakes up, as child events of the parent track.
This will indicate what part of the time the thread was sleeping.
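One possible shape for this, sketched with the standard library rather than WaitForSingleObject: wrap the blocking call and record how long the thread waited. `WaitEvent` and `LockWithTrace` are hypothetical names, the event list is assumed to belong to the calling thread only, and the result is an ordinary timed span, not any special event type of the trace format.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

// One recorded wait: when it started and how long the thread was blocked.
struct WaitEvent {
    std::string name;
    std::int64_t startUs;
    std::int64_t durationUs;
};

inline std::int64_t ToMicroseconds(std::chrono::steady_clock::time_point t) {
    using namespace std::chrono;
    return duration_cast<microseconds>(t.time_since_epoch()).count();
}

// Acquires the mutex and appends the time spent waiting for it to `events`.
// `events` must only be touched by the calling thread.
std::unique_lock<std::mutex> LockWithTrace(std::mutex& mutex,
                                           const char* label,
                                           std::vector<WaitEvent>& events) {
    assert(label != nullptr);
    const auto begin = std::chrono::steady_clock::now();
    std::unique_lock<std::mutex> lock(mutex);  // may block here
    const auto end = std::chrono::steady_clock::now();

    const std::int64_t startUs = ToMicroseconds(begin);
    const std::int64_t durationUs = ToMicroseconds(end) - startUs;
    events.push_back({label, startUs, durationUs});
    return lock;
}
```

Each recorded wait can then be written out with the same thread id as the enclosing scope, so it nests under the parent bar and shows how much of that bar was spent blocked.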

Yes, I have not made it work for Linux and Mac yet. I need to make it use pthreads, but that is very easy; I will do it during the week.

by **Julio Jerez** » Tue Mar 15, 2016 2:34 pm

Check this out, here is a listing of the number of events executed by the engine in a 20-frame capture.

I found a huge optimization point. Look at the function that checks if two shapes are close enough that they need to calculate contacts:

dgBroadPhase::TestOverlaping 152643
dgWorld::CalculateContacts 133171

The overlap test was called 152,643 times, and of those 133,171 passed the test.
This is because the test is an AABB test on the shapes; those are the pairs pruned by the broadphase. However, if you look at the number of pairs that actually made it to the solver, it was only

dgWorldDynamicUpdate::BuildJacobianMatrix 11591

We are talking about a factor of more than ten to one. This is what explains why the collision is taking so much longer than the solver.

What happens is that many of the calls to calculate contacts are coming back with zero contacts.
What this means is that if I change the AABB test for an OBB test, the number of pairs should be reduced by a factor of two or more. I suspect it may be a factor of five.

So the result should be something like

dgBroadPhase::ObbOverlapingTest 152643
dgWorld::CalculateContacts 30000

But CalculateContacts is the expensive call, while ObbOverlapingTest is negligible.
Newton could get a 50% performance boost. :shock:
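To illustrate why an OBB test rejects pairs that an AABB test lets through, here is a sketch in 2D using the separating axis test. It is a simplification for illustration only: Newton works in 3D, where the OBB test needs more candidate axes, and `Box2D`, `AabbOverlap` and `ObbOverlap` are hypothetical names, not the `dgBroadPhase` code.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { double x, y; };

struct Box2D {
    double cx, cy;  // center
    double hx, hy;  // half extents along the box's own axes
    double angle;   // rotation in radians
};

double Dot(Vec2 a, Vec2 b) { return a.x * b.x + a.y * b.y; }

// Half extents of the world-aligned box that encloses the rotated box.
Vec2 AabbHalfExtents(const Box2D& b) {
    const double c = std::fabs(std::cos(b.angle));
    const double s = std::fabs(std::sin(b.angle));
    return {b.hx * c + b.hy * s, b.hx * s + b.hy * c};
}

// Cheap test: compares the enclosing axis-aligned boxes only.
bool AabbOverlap(const Box2D& a, const Box2D& b) {
    const Vec2 ea = AabbHalfExtents(a);
    const Vec2 eb = AabbHalfExtents(b);
    return std::fabs(a.cx - b.cx) <= ea.x + eb.x &&
           std::fabs(a.cy - b.cy) <= ea.y + eb.y;
}

// Tighter test: the boxes are disjoint if their projections are separated
// along any of the four box axes (two per box in 2D).
bool ObbOverlap(const Box2D& a, const Box2D& b) {
    const Vec2 axesA[2] = {{std::cos(a.angle), std::sin(a.angle)},
                           {-std::sin(a.angle), std::cos(a.angle)}};
    const Vec2 axesB[2] = {{std::cos(b.angle), std::sin(b.angle)},
                           {-std::sin(b.angle), std::cos(b.angle)}};
    const Vec2 d = {b.cx - a.cx, b.cy - a.cy};
    const Vec2 candidates[4] = {axesA[0], axesA[1], axesB[0], axesB[1]};
    for (const Vec2& axis : candidates) {
        const double ra = a.hx * std::fabs(Dot(axesA[0], axis)) +
                          a.hy * std::fabs(Dot(axesA[1], axis));
        const double rb = b.hx * std::fabs(Dot(axesB[0], axis)) +
                          b.hy * std::fabs(Dot(axesB[1], axis));
        if (std::fabs(Dot(d, axis)) > ra + rb) {
            return false;  // found a separating axis
        }
    }
    return true;
}
```

Two thin boxes rotated 45 degrees and lying side by side are the typical case: their enclosing AABBs overlap heavily, so the AABB test passes, yet the OBB test finds the gap between them and the expensive contact calculation is skipped.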

I am going to work on that tonight until I get it right. I am loving this tool :mrgreen:

by **JoeJ** » Tue Mar 15, 2016 5:30 pm

Hmm, i always thought my way of profiling is lazy and unprofessional, but i detect cases like this much earlier, so i'll mention...
What i do is displaying info like your listing on screen all the time in a debug window.
That includes counters and also time duration measurements.
I average durations over multiple frames to get more precise values, and if i want to see fluctuations i plot curves (like in your sandbox).
So i get most of the information without the need of a visualization tool and it's always there - no surprises (but also, no exciting surprises, hehe).
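A small sketch of that kind of always-on averaging, assuming a fixed window of recent frames. `RollingAverage` is a hypothetical name and the unit of the samples (for example milliseconds per frame) is up to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Keeps the mean of the last `window` samples using a ring buffer
// and a running sum, so adding a sample is O(1).
class RollingAverage {
public:
    explicit RollingAverage(std::size_t window) : m_samples(window, 0.0) {
        assert(window > 0);
    }

    void Add(double value) {
        m_sum += value - m_samples[m_next];  // replace the oldest sample
        m_samples[m_next] = value;
        m_next = (m_next + 1) % m_samples.size();
        if (m_count < m_samples.size()) {
            ++m_count;
        }
    }

    // Mean of the samples seen so far, at most `window` of them.
    double Average() const {
        return m_count ? m_sum / static_cast<double>(m_count) : 0.0;
    }

private:
    std::vector<double> m_samples;
    double m_sum = 0.0;
    std::size_t m_next = 0;
    std::size_t m_count = 0;
};
```

One instance per measured section, fed once per frame, is enough to print stable numbers in a debug window. Over very long runs the running sum can drift slightly from rounding, so recomputing it from the buffer now and then is a reasonable precaution.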

What i miss is info about bandwidth limits and cache misses, just guessing about that.

by **Julio Jerez** » Tue Mar 15, 2016 5:57 pm

I know, and that usually works. If you remember, Newton had a runtime profiler.

What happened was that as multithreading started to become more important, the profiler kept breaking each time, and worse than that, many times it gave wrong information because of thread timing.

Then I spent a lot of time writing the visualization and explaining to people how to use it. At some point the profiler was an API in itself, with about half a dozen functions, and even with that it was extremely limited. It became a nightmare to maintain, and early in 3.14 I decided to get rid of it altogether.

That caused a problem because I was unable to optimize the engine. I was using Very Sleepy and sometimes VTune, but sampling profilers are only good for optimizing functions individually, not for optimizing engine architecture.
I am very happy with this simple implementation and grateful to Google Chrome for writing a visualization tool.

I believe that as people start to discover Google Chrome's sample visualizer it will become the standard.
Intel, RAD Game Tools and all the third parties who make commercial profilers charge ridiculous amounts of money for tools that are in fact quite mediocre.

JoeJ wrote: What i miss is info about bandwidth limits and cache misses

It has been a long time since I really cared about cache misses. I used to, but in my experience that is something better left to the compiler and the hardware.
Take for example the prefetch instruction: in the Pentium 4 days it was actually a good thing to use.
But today if you use that instruction there is a high chance, almost 90%, that you add a severe performance penalty to your code. In fact that instruction is just ignored on new hardware.

The reason is that modern CPUs have streaming channels. Basically, each time the code does more than two consecutive memory fetches, the hardware goes ahead and prefetches the next location; for that it has several channels. But if you go and execute a manual prefetch, then the streaming hardware is nullified.
The moral is that with today's hardware, it is better to concentrate on algorithm optimization and leave the hardcore hardware stuff to the hardware and the compiler.

by **Julio Jerez** » Tue Mar 15, 2016 7:16 pm

Newton just got almost 50% faster :mrgreen:

Check this out:

dgBroadPhase::AddPair 62825
dgBroadPhase::TestOverlaping 62825
dgWorld::CalculateContacts 12263

I reduced the number of frames from 20 to 10, because Chrome does not really like big files. I am guessing they have a cutoff of 100 MB or so, because as soon as the file reaches that size it simply fails.

Anyway, I wrote the OBB test, and it costs from 300 to 600 nanoseconds, but in return it cuts the number of calls to CalculateContacts by about a factor of five, and that call is about 5 to 10 times slower, depending on the collision shape.

dgBroadPhase::TestOverlaping 62825
dgWorld::CalculateContacts 12263

This now brings the cost of collision to just about the same as the cost of the solver, while before it was much more expensive.
There are two more optimizations that I can still make, so I estimate that the cost of collision will always be less than the solver, as it should be.
