Anyway, on the generic optimizations: I found the intrinsic _mm_pause, which pauses the core executing the thread for a short delay, roughly until the pipeline drains.
This is much better than using yield, because it does not have to release the thread by calling sleep(0).
I tried it on Newton and it seems to be very good: multithreaded code is a lot smoother and faster, and it approaches the expected performance gains.
I modified the spin lock like this, ignoring the yield call entirely, since yield is sleep(0), meaning relinquish the thread to another thread, which is an expensive kernel call.
- Code: Select all
DG_INLINE void dgSpinLock (dgInt32* const ptr, bool yield)
{
#ifndef DG_USE_THREAD_EMULATION
	/*
	while (dgInterlockedExchange(ptr, 1)) {
		if (yield) {
			dgThreadYield();
		}
	}
	*/
	do {
		// spin-wait hint to the core; needs <emmintrin.h>
		_mm_pause();
	} while (dgInterlockedExchange(ptr, 1));
#endif
}
If anyone has experience with _mm_pause, let me know.
Another thing I saw is that Intel is making the Core i9 with up to 18 cores.
That's 36 hardware threads!! And the mid range is 10 cores, 20 threads.
They claim it can do one teraflop of ops; that is GPU-class performance on a CPU.
I was trying to look at GPUs again (OpenCL), but it makes me sick each time I have to think of all the mumbo jumbo that is needed.
What I want to do is refactor the parallel solver to use AVX2 and a large number of threads, with low overhead, but on the CPU. Maybe we can get close to GPU physics, but with higher quality than what the commercial engines are doing on the CPU, even if it is fewer bodies.
We do not really have to do 100000 all-jittery bodies; if we can do 10000 stable bodies, that can be enough for almost anything.