optimization tricks

Post by Julio Jerez » Fri Jun 02, 2017 10:24 am

I was reading the Intel optimization guide, and it talks about some things that can be done to optimize code for certain architectures. The programmer has very little control over most of it, since it can only be enforced by using assembly language, which is no longer allowed inline in 64-bit mode, or by using an intrinsic function if one is available. I would never go back to writing assembly; the most I do is intrinsics, and even those I do not like.

Anyway, among the generic optimizations I found the instruction _mm_pause, which pauses the core executing the thread until the pipeline is completed.
This is much better than using yield because it does not have to release the thread by calling sleep(0).
I tried it in Newton and it seems to be very good: multithreaded code is a lot smoother and faster, and it does approach the expected performance gains.

I modified the spin lock like this and totally ignored the yield call, which is sleep(0), meaning relinquish the thread to another thread, and is an expensive kernel call.
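Code:
DG_INLINE void dgSpinLock (dgInt32* const ptr, bool yield)
{
   #ifndef DG_USE_THREAD_EMULATION
/*
      // previous version: yield the thread (sleep(0)) after every failed attempt
      while (dgInterlockedExchange(ptr, 1)) {
         if (yield) {
            dgThreadYield();
         }
      }
*/
   // new version: the yield argument is ignored, the core pauses before each attempt
   do {
      _mm_pause();
   } while (dgInterlockedExchange(ptr, 1));

   #endif
}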
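
For anyone who wants to try the same idea outside of Newton, here is a minimal sketch using std::atomic instead of dgInterlockedExchange. The names PauseSpinLock, SPIN_PAUSE and count_with_two_threads are made up for this example and are not part of the engine, and the fallback to std::this_thread::yield() on non-x86 targets is only an assumption so the sketch compiles everywhere. Unlike the function above, it tries the exchange first and only pauses after a failed attempt.

```cpp
#include <atomic>
#include <thread>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>                              // _mm_pause
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield()      // assumed fallback for non-x86 targets
#endif

// Hypothetical stand-in for dgSpinLock / dgSpinUnlock, not the Newton code.
class PauseSpinLock {
public:
    void lock() {
        // exchange returns the previous value: 0 means the lock was free and is now ours
        while (m_flag.exchange(1, std::memory_order_acquire)) {
            SPIN_PAUSE();                           // stay on the core, no kernel call
        }
    }
    void unlock() {
        m_flag.store(0, std::memory_order_release);
    }
private:
    std::atomic<int> m_flag{0};
};

// Two threads increment a shared counter, each increment protected by the lock.
int count_with_two_threads(int iterations) {
    PauseSpinLock lock;
    int counter = 0;
    auto work = [&]() {
        for (int i = 0; i < iterations; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return counter;
}
```

If the lock is correct, count_with_two_threads(n) returns 2 * n no matter how the two threads interleave.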


If anyone has experience with _mm_pause, let me know.

Another thing that I saw is that Intel is making the Core i9 with up to 18 cores.
That's 36 hardware threads!! And the mid range is 10 cores, 20 threads.
They claim that it can do one teraflop, which is GPU-class performance on a CPU.

I was trying to look at GPU again (OpenCL), but it makes me sick each time I have to think of all the mumbo jumbo that is needed.

What I want to do is refactor the parallel solver to use AVX2 and a large number of threads, with low overhead, but on the CPU. Maybe we can get close to GPU physics, but with higher quality than what the commercial engines are doing, using the CPU, even if it is with fewer bodies.
We do not really have to do 100000 jittery bodies; if we can do 10000 stable bodies, that can be enough for almost anything.

Re: optimization tricks

Post by JoeJ » Sun Jun 04, 2017 4:26 pm

I was not aware of _mm_pause(), highly interesting! Seems like a must-do, so thanks for sharing.

Tested this with my lighting stuff on an old i7:
Using std::this_thread::yield(): 1159 ms
_mm_pause(): 953 ms
:mrgreen:

I guess it's an even bigger win on Ryzen, if they expose this too?

Edit:
My numbers above are a bit meaningless because I forgot to disable the OpenCL GPGPU stuff that runs just to verify some things.
However, after disabling the GPU, both methods result in 270 ms.
I do not understand what's going on here. I do not run CPU and GPU at the same time here; one of them should always be idle. Maybe some driver work.
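
For reference, a comparison like the one above could be set up roughly as follows. This is only a sketch, not the lighting code the numbers came from: YieldWait, PauseWait, LockTiming and time_contended_lock are names invented for the example, and the workload (threads doing nothing but incrementing a counter under the lock) is an assumed worst case for contention. The absolute times depend heavily on the machine and on whatever else is running, as the edit above shows.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>                              // _mm_pause
#define HAVE_MM_PAUSE 1
#endif

// Wait policy 1: give the time slice back to the scheduler.
struct YieldWait {
    static void wait() { std::this_thread::yield(); }
};

// Wait policy 2: stay on the core and execute the pause instruction.
struct PauseWait {
    static void wait() {
#ifdef HAVE_MM_PAUSE
        _mm_pause();
#else
        std::this_thread::yield();                  // assumed fallback for non-x86 targets
#endif
    }
};

struct LockTiming {
    long long counter;   // total increments, used to check correctness
    double ms;           // wall-clock time for the whole run
};

// Runs `threads` workers that each take the lock `iterations` times.
template <class Wait>
LockTiming time_contended_lock(int threads, int iterations) {
    std::atomic<int> flag{0};
    long long counter = 0;
    auto work = [&]() {
        for (int i = 0; i < iterations; ++i) {
            while (flag.exchange(1, std::memory_order_acquire)) {
                Wait::wait();
            }
            ++counter;
            flag.store(0, std::memory_order_release);
        }
    };
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back(work);
    }
    for (auto& th : pool) {
        th.join();
    }
    auto stop = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    return {counter, ms};
}
```

Calling time_contended_lock<YieldWait>(8, 100000) and time_contended_lock<PauseWait>(8, 100000) and comparing the ms fields gives the two timings; the counter field should equal threads * iterations in both cases.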

Re: optimization tricks

Post by Julio Jerez » Sun Jun 04, 2017 8:51 pm

That instruction should be available on all CPUs supporting SSE2.
Basically, the way I understand it, it improves performance in two ways.
Say you have a spin lock like this:
Code:
// interchange() stands for an atomic exchange that returns the previous value
void spin (int* ptr)
{
    while (interchange (ptr, 1));
}


Because the interchange has exclusive access to the bus, a modern CPU prefetches the entire loop, and the loop is executed by one core until the deep pipeline is executed completely, preventing any of the other cores from having access to the memory bus.

In order to make this better, a delay needs to be inserted in the loop so that the other cores have time to read and write. Something like this:

Code:
void spin (int* ptr)
{
    while (interchange (ptr, 1))
    {
        Sleep(0);   // give the rest of the time slice to another thread
    }
}


Here, each time the code enters the loop it executes Sleep(0), which is a thread yield. This call is on the order of tens of thousands of cycles at best. Basically it makes a kernel call to see if another thread of equal priority is available, and if one is, the cost is whatever the other thread takes, which is what you see as spikes in a thread profiler.
The result is that, at best, each time a thread enters the loop, hundreds or thousands of cycles are lost.

What some people do is add delay functions made of nops, but this is hard to calibrate.

This is where the function _mm_pause becomes very useful. The function can be rewritten like this:

Code:
void spin (int* ptr)
{
    do
    {
        _mm_pause();   // pause the core before each attempt
    } while (interchange (ptr, 1));
}


Each time the thread hits _mm_pause(), the core flushes the prefetch queue, running all the prefetched code but not doing anything else. This way the bus is only accessed once per clock cycle, and any other thread can get access to the bus while the instruction is executed.

The result is that the thread does not have to relinquish control to another thread, and only a few dozen cycles are wasted.
There is also another big effect: the core executes that instruction at low power consumption.
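
A related variant that is not what the code in this thread does, but follows from the same concern about bus traffic, is the test-and-test-and-set lock: after a failed exchange the thread waits on a plain read of the flag, pausing between reads, and only tries the exchange again once the flag looks free. A rough sketch, where TtasSpinLock, cpu_relax and count_under_lock are names invented for the example and the yield fallback on non-x86 targets is an assumption:

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>                              // _mm_pause
static inline void cpu_relax() { _mm_pause(); }
#else
static inline void cpu_relax() { std::this_thread::yield(); }   // assumed fallback
#endif

// Hypothetical test-and-test-and-set spin lock, usable with std::lock_guard.
class TtasSpinLock {
public:
    void lock() {
        for (;;) {
            // attempt: the exchange returns 0 only if the lock was free
            if (m_flag.exchange(1, std::memory_order_acquire) == 0) {
                return;
            }
            // wait: plain reads, pausing between them, until the flag looks free
            while (m_flag.load(std::memory_order_relaxed) != 0) {
                cpu_relax();
            }
        }
    }
    void unlock() {
        m_flag.store(0, std::memory_order_release);
    }
private:
    std::atomic<int> m_flag{0};
};

// Several threads increment one counter, each increment guarded by the lock.
long long count_under_lock(int threads, int per_thread) {
    TtasSpinLock lock;
    long long total = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&]() {
            for (int i = 0; i < per_thread; ++i) {
                std::lock_guard<TtasSpinLock> guard(lock);
                ++total;
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return total;
}
```

Whether this is any faster than the plain exchange loop with _mm_pause would have to be measured on the target machine.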

In Newton, eliminating Sleep() does make a significant difference, because in Newton the thread model is like a queue where threads are competing for items, so the Sleep is hit quite a few times.
When I ran VTune at work, using Sleep, almost all the time was spent in that function, and if I set 8 threads I saw all threads cycling a lot.
Now if I set 8 threads, only 4 threads use time and dgSpin almost disappears.
The way I see it, this is the closest a CPU can get to the thread sync of GPU programming.
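
To make the queue model concrete, here is a small sketch of worker threads competing for items through a spin lock that pauses instead of sleeping. It is not the Newton job system: JobQueue, cpu_relax and process_items are hypothetical names, the "work" is just marking an item as visited, and the yield fallback on non-x86 targets is an assumption.

```cpp
#include <atomic>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>                              // _mm_pause
static inline void cpu_relax() { _mm_pause(); }
#else
static inline void cpu_relax() { std::this_thread::yield(); }   // assumed fallback
#endif

// Hypothetical job queue: threads compete for the next item index.
struct JobQueue {
    std::atomic<int> lock{0};
    int next = 0;    // index of the next item to hand out
    int count = 0;   // total number of items

    // Returns the next item index, or -1 when the queue is empty.
    int pop() {
        while (lock.exchange(1, std::memory_order_acquire)) {
            cpu_relax();                            // short wait, no Sleep(0)
        }
        int item = (next < count) ? next++ : -1;
        lock.store(0, std::memory_order_release);
        return item;
    }
};

// Processes item_count items on thread_count threads and returns how many
// times each item was handed out (1 for every item if the lock is correct).
std::vector<int> process_items(int item_count, int thread_count) {
    JobQueue queue;
    queue.count = item_count;
    std::vector<int> hits(item_count, 0);
    auto worker = [&]() {
        for (int item = queue.pop(); item >= 0; item = queue.pop()) {
            ++hits[item];   // each index goes to exactly one thread
        }
    };
    std::vector<std::thread> pool;
    for (int t = 0; t < thread_count; ++t) {
        pool.emplace_back(worker);
    }
    for (auto& th : pool) {
        th.join();
    }
    return hits;
}
```

Because every item is short, the lock in pop() is hit constantly, which is the situation where the cost of the wait inside the spin loop matters most.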

Re: optimization tricks

Post by Julio Jerez » Fri Nov 10, 2017 1:33 pm

Does anyone know what this function does?
0001:000c5130 __twoToTOS 100c6130 f LIBCMT:common.obj

In the Newton profile it takes:

[attachment: profiler.png (28.79 KiB)]


_twoToTOS and _pow_default account for 6% of the total execution time in the Newton update.
That function is more expensive than the solver itself, and I do not know where it is called from.
Google searches come up with nothing.

Re: optimization tricks

Post by JernejL » Sat Nov 11, 2017 1:07 pm

Looks like some sort of math function; maybe check the debug symbols and disassembly?

Re: optimization tricks

Post by Julio Jerez » Tue Nov 28, 2017 9:26 pm

I found this project that may be useful for hardcore optimization in the future:
https://github.com/opcm/pcm