Parallel solver experiments

Re: Parallel solver experiments

Postby Julio Jerez » Sun Apr 08, 2018 4:21 pm

Joe, I do not think it generates binary code; it simply translates C++ to the platform's native compile language, so whatever is supported on the platform can be supported from the high level if it is exposed.

Maybe I do not understand it well, but they are not doing themselves a favor by marketing it as a CUDA translator when the CUDA-to-HIP conversion is just a small feature.

But anyway, I downloaded it and as usual it was too good to be true. It is one of those open source BS projects that is not complete unless you download every other open source app on the internet.
I get this error:
Selecting Windows SDK version to target Windows 10.0.16299.
CMake Error at CMakeLists.txt:26 (string):
string sub-command REPLACE requires at least four arguments.


HIP Platform:
HIP will be installed in: C:/Program Files (x86)/hip
CMake Error at hipify-clang/CMakeLists.txt:4 (find_package):
By not providing "FindLLVM.cmake" in CMAKE_MODULE_PATH this project has
asked CMake to find a package configuration file provided by "LLVM", but
CMake did not find one.

Could not find a package configuration file provided by "LLVM" with any of
the following names:

LLVMConfig.cmake
llvm-config.cmake


It wants LLVM, and if you get LLVM, then LLVM asks you for * like Boost and Google stuff, and it is all a rabbit hole of downloading, running CMake, and installing. I am not spending a second on that. If there is not a prebuilt runtime then I am not using it.

Re: Parallel solver experiments

Postby JoeJ » Sun Apr 08, 2018 4:41 pm

Not really a surprise, I did not expect it to be easy. AMD is not really a software company and the marketing is even worse, or did you ever hear them point out their compute advantage? GCN was an unbelievable 5 times faster per buck than Kepler, and I never knew until I saw it myself. Trying to compete against CUDA comes really late...

Re: Parallel solver experiments

Postby JoeJ » Mon Apr 09, 2018 12:48 am

[Attachment: runtime.JPG]


Just got the idea to look for what I have installed from Intel. The runtime is all I have, and OpenCL works.

Re: Parallel solver experiments

Postby Julio Jerez » Mon Apr 09, 2018 5:51 pm

Working on the hand-written AVX module; here is a code comparison of the function.

SSE version, one body per call
Code: Select all
void dgDynamicBody::AddDampingAcceleration(dgFloat32 timestep)
00205770  push        ebp 
00205771  mov         ebp,esp 
00205773  push        ecx 
00205774  movss       xmm2,dword ptr [timestep] 
00205779  mov         eax,ecx 
0020577B  mov         dword ptr [ebp-4],eax 
0020577E  movss       xmm0,dword ptr [eax+210h] 
00205786  subss       xmm0,xmm2 
0020578A  andps       xmm0,xmmword ptr [__xmm@7fffffff7fffffff7fffffff7fffffff (0258A40h)] 
00205791  comiss      xmm0,dword ptr [__real@358637bd (0257EA4h)] 

002057E3  mov         eax,dword ptr [this] 
002057E6  pop         edi 
002057E7  pop         esi 
002057E8  movaps      xmm3,xmmword ptr [eax+200h] 
002057EF  movaps      xmm0,xmm3 
002057F2  shufps      xmm0,xmm3,0FFh 
002057F6  shufps      xmm0,xmm0,0 
002057FA  mulps       xmm0,xmmword ptr [eax+0C0h] 
00205801  movups      xmmword ptr [eax+0C0h],xmm0 
00205808  movaps      xmm2,xmmword ptr [eax+0D0h] 
0020580F  movaps      xmm1,xmmword ptr [eax+70h] 
00205813  movaps      xmm0,xmmword ptr [eax+60h] 
00205817  mulps       xmm1,xmm2 
0020581A  mulps       xmm0,xmm2 
0020581D  mulps       xmm2,xmmword ptr [eax+50h] 
00205821  andps       xmm3,xmmword ptr [dgVector::m_triplexMask (0280B30h)] 
00205828  haddps      xmm0,xmm0 
0020582C  haddps      xmm2,xmm2 
00205830  haddps      xmm1,xmm1 
00205834  haddps      xmm2,xmm2 
00205838  haddps      xmm0,xmm0 
0020583C  andps       xmm2,xmmword ptr [dgVector::m_xMask (0280AF0h)] 
00205843  andps       xmm0,xmmword ptr [dgVector::m_yMask (0280B00h)] 
0020584A  haddps      xmm1,xmm1 
0020584E  addps       xmm2,xmm0 
00205851  andps       xmm1,xmmword ptr [dgVector::m_zMask (0280B10h)] 
00205858  addps       xmm2,xmm1 
0020585B  mulps       xmm2,xmm3 
0020585E  movaps      xmm1,xmm2 
00205861  movaps      xmm0,xmm2 
00205864  shufps      xmm1,xmm2,55h 
00205868  mulps       xmm1,xmmword ptr [eax+60h] 
0020586C  shufps      xmm0,xmm2,0 
00205870  mulps       xmm0,xmmword ptr [eax+50h] 
00205874  shufps      xmm2,xmm2,0AAh 
00205878  mulps       xmm2,xmmword ptr [eax+70h] 
0020587C  addps       xmm1,xmm0 
0020587F  addps       xmm1,xmm2 
00205882  movups      xmmword ptr [eax+0D0h],xmm1 
00205889  mov         esp,ebp 
0020588B  pop         ebp 
0020588C  ret         4 


AVX version, 8 bodies per call
Code: Select all
void dgDynamicBody::AddDampingAcceleration(dgFloat32 timestep)
54CA248E  xor         ecx,ecx 
54CA2490  mov         dword ptr [esp+20h],ecx 
54CA2494  mov         eax,dword ptr [edi+0CCh] 
54CA249A  vmovups     ymm0,ymmword ptr [eax+ecx] 
54CA249F  mov         eax,dword ptr [edi+0E0h] 
54CA24A5  vmulps      ymm0,ymm3,ymm0 
54CA24A9  vmovups     ymm1,ymmword ptr [eax+ecx] 
54CA24AE  mov         eax,dword ptr [edi+0F4h] 
54CA24B4  vmulps      ymm1,ymm3,ymm1 
54CA24B8  vmulps      ymm2,ymm3,ymmword ptr [eax+ecx] 
54CA24BD  mov         eax,dword ptr [edi+0CCh] 
54CA24C3  vmovdqu     ymmword ptr [eax+ecx],ymm0 
54CA24C8  mov         eax,dword ptr [edi+0E0h] 
54CA24CE  vmovdqu     ymmword ptr [eax+ecx],ymm1 
54CA24D3  mov         eax,dword ptr [edi+0F4h] 
54CA24D9  vmovdqu     ymmword ptr [eax+ecx],ymm2 
54CA24DE  mov         eax,dword ptr [edi+180h] 
54CA24E4  mov         esi,dword ptr [edi+108h] 
54CA24EA  mov         ecx,dword ptr [edi+194h] 
54CA24F0  mov         edx,dword ptr [edi+1A8h] 
54CA24F6  mov         edi,dword ptr [edi+11Ch] 
54CA24FC  mov         dword ptr [esp+34h],eax 
54CA2500  mov         eax,dword ptr [esp+20h] 
54CA2504  mov         dword ptr [esp+30h],esi 
54CA2508  mov         dword ptr [esp+38h],ecx 
54CA250C  mov         dword ptr [esp+3Ch],edx 
54CA2510  vmovups     ymm2,ymmword ptr [esi+eax] 
54CA2515  mov         esi,dword ptr [esp+24h] 
54CA2519  vmovups     ymm5,ymmword ptr [edi+eax] 
54CA251E  mov         edx,dword ptr [esp+20h] 
54CA2522  mov         ecx,dword ptr [esi+54h] 
54CA2525  mov         edi,dword ptr [esi+130h] 
54CA252B  vmovups     ymm7,ymmword ptr [ecx+edx] 
54CA2530  mov         ecx,dword ptr [esi+68h] 
54CA2533  vmovups     ymm6,ymmword ptr [edi+eax] 
54CA2538  mov         edi,dword ptr [esi+90h] 
54CA253E  mov         eax,dword ptr [esi+0A4h] 
54CA2544  mov         dword ptr [esp+2Ch],ecx 
54CA2548  mov         ecx,edx 
54CA254A  mov         edx,dword ptr [esi+18h] 
54CA254D  vmovups     ymm4,ymmword ptr [edx+ecx] 
54CA2552  mov         edx,dword ptr [esi+2Ch] 
54CA2555  mov         esi,ecx 
54CA2557  vmulps      ymm1,ymm5,ymmword ptr [eax+esi] 
54CA255C  vmulps      ymm0,ymm2,ymmword ptr [edi+esi] 
54CA2561  mov         esi,dword ptr [esp+24h] 
54CA2565  vaddps      ymm1,ymm1,ymm0 
54CA2569  mov         eax,dword ptr [esi+0B8h] 
54CA256F  mov         esi,dword ptr [esp+20h] 
54CA2573  vmulps      ymm0,ymm6,ymmword ptr [eax+ecx] 
54CA2578  mov         ecx,dword ptr [esp+2Ch] 
54CA257C  vaddps      ymm3,ymm1,ymm0 
54CA2580  vmulps      ymm0,ymm2,ymm7 
54CA2584  vmulps      ymm1,ymm5,ymmword ptr [ecx+esi] 
54CA2589  mov         ecx,dword ptr [esp+24h] 
54CA258D  vaddps      ymm1,ymm1,ymm0 
54CA2591  mov         eax,dword ptr [ecx+7Ch] 
54CA2594  vmulps      ymm0,ymm6,ymmword ptr [eax+esi] 
54CA2599  mov         eax,dword ptr [esp+34h] 
54CA259D  vaddps      ymm2,ymm1,ymm0 
54CA25A1  vmulps      ymm1,ymm5,ymmword ptr [edx+esi] 
54CA25A6  mov         edx,dword ptr [esp+20h] 
54CA25AA  mov         esi,dword ptr [esp+30h] 
54CA25AE  vmulps      ymm0,ymm4,ymmword ptr [esi+edx] 
54CA25B3  mov         esi,dword ptr [ecx+40h] 
54CA25B6  mov         ecx,dword ptr [esp+38h] 
54CA25BA  vaddps      ymm1,ymm1,ymm0 
54CA25BE  vmulps      ymm0,ymm6,ymmword ptr [esi+edx] 
54CA25C3  vmulps      ymm6,ymm3,ymmword ptr [eax+edx] 
54CA25C8  vmulps      ymm2,ymm2,ymmword ptr [ecx+edx] 
54CA25CD  mov         ecx,dword ptr [esp+20h] 
54CA25D1  mov         edx,dword ptr [esp+3Ch] 
54CA25D5  vaddps      ymm0,ymm1,ymm0 
54CA25D9  vmulps      ymm1,ymm2,ymm7 
54CA25DD  vmulps      ymm3,ymm0,ymmword ptr [edx+ecx] 
54CA25E2  vmulps      ymm0,ymm6,ymmword ptr [edi+ecx] 
54CA25E7  mov         edi,dword ptr [esp+24h] 
54CA25EB  vaddps      ymm1,ymm1,ymm0 
54CA25EF  vmulps      ymm0,ymm3,ymm4 
54CA25F3  vaddps      ymm5,ymm1,ymm0 
54CA25F7  mov         eax,dword ptr [edi+68h] 
54CA25FA  vmulps      ymm1,ymm2,ymmword ptr [eax+ecx] 
54CA25FF  mov         eax,dword ptr [edi+0A4h] 
54CA2605  vmulps      ymm0,ymm6,ymmword ptr [eax+ecx] 
54CA260A  mov         eax,dword ptr [edi+2Ch] 
54CA260D  vaddps      ymm1,ymm1,ymm0 
54CA2611  vmulps      ymm0,ymm3,ymmword ptr [eax+ecx] 
54CA2616  mov         eax,dword ptr [edi+7Ch] 
54CA2619  vaddps      ymm4,ymm1,ymm0 
54CA261D  vmulps      ymm1,ymm2,ymmword ptr [eax+ecx] 
54CA2622  mov         eax,dword ptr [edi+0B8h] 
54CA2628  vmulps      ymm0,ymm6,ymmword ptr [eax+ecx] 
54CA262D  mov         eax,dword ptr [edi+108h] 
54CA2633  vaddps      ymm1,ymm1,ymm0 
54CA2637  vmulps      ymm0,ymm3,ymmword ptr [esi+ecx] 
54CA263C  vmovups     ymm3,ymmword ptr [esp+40h] 
54CA2642  vmovdqu     ymmword ptr [eax+ecx],ymm5 
54CA2647  mov         eax,dword ptr [edi+11Ch] 
54CA264D  vaddps      ymm0,ymm1,ymm0 
54CA2651  vmovdqu     ymmword ptr [eax+ecx],ymm4 
54CA2656  mov         eax,dword ptr [edi+130h] 
54CA265C  vmovdqu     ymmword ptr [eax+ecx],ymm0 
54CA2661  add         ecx,20h 
54CA2664  sub         dword ptr [esp+28h],1 
54CA2669  mov         dword ptr [esp+20h],ecx 
54CA266D  jne         dgNewtonCpu::InityBodyArray+104h (54CA2494h) 
54CA2673  vzeroupper 


As you can see, the AVX version comes out to a little less than twice the number of instructions, but it executes 8 bodies per call.
On paper this looks much faster, but I do not know about memory latency.
The way I see it, the AVX version has a lot more coherence, since the data is all adjacent in memory.
I expect this code to be about three to four times faster than the normal CPU one.

In theory one core should be faster than four cores running the sequential code.
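
For illustration, here is a minimal C++ sketch (not the engine's actual code) of the eight-bodies-per-call idea, assuming a hypothetical structure-of-arrays layout where each velocity component of all bodies sits in its own contiguous array; the type and function names are made up.
Code: Select all
// Sketch only: with a hypothetical SoA layout, one AVX register holds the same
// component of eight different bodies, so each multiply damps eight bodies at once.
#include <immintrin.h>

struct SoaBodies
{
    float* m_velX;        // contiguous, 32-byte aligned, one float per body
    float* m_velY;
    float* m_velZ;
    float* m_linearDamp;  // per-body damping factor already raised to the timestep power
    int m_count;          // assumed to be a multiple of 8 for this sketch
};

void ApplyLinearDamping8(SoaBodies& bodies)
{
    for (int i = 0; i < bodies.m_count; i += 8) {
        __m256 damp = _mm256_load_ps(&bodies.m_linearDamp[i]);
        __m256 vx = _mm256_mul_ps(_mm256_load_ps(&bodies.m_velX[i]), damp);
        __m256 vy = _mm256_mul_ps(_mm256_load_ps(&bodies.m_velY[i]), damp);
        __m256 vz = _mm256_mul_ps(_mm256_load_ps(&bodies.m_velZ[i]), damp);
        _mm256_store_ps(&bodies.m_velX[i], vx);
        _mm256_store_ps(&bodies.m_velY[i], vy);
        _mm256_store_ps(&bodies.m_velZ[i], vz);
    }
}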

Re: Parallel solver experiments

Postby Julio Jerez » Tue Apr 10, 2018 7:35 am

OK, now it is far more compact and better; it does not run out of integer registers, at least this one.
It is remarkable that it has almost the same number of operations as the non-SoA SIMD version.

Code: Select all
void dgDynamicBody::AddDampingAcceleration(dgFloat32 timestep)
58772517  jmp         dgNewtonCpu::InityBodyArray+110h (58772520h) 
58772519  lea         esp,[esp] 
58772520  mov         eax,dword ptr [esi+40h] 
58772523  mov         ecx,dword ptr [esi+18h] 
58772526  vmovups     ymm1,ymmword ptr [eax+edx] 
5877252B  mov         eax,dword ptr [esp+10h] 
5877252F  vmulps      ymm3,ymm1,ymmword ptr [eax+ecx] 
58772534  vmulps      ymm2,ymm1,ymmword ptr [eax+ecx+20h] 
5877253A  vmulps      ymm0,ymm1,ymmword ptr [eax+ecx+40h] 
58772540  vmovdqu     ymmword ptr [eax+ecx],ymm3 
58772545  vmovdqu     ymmword ptr [eax+ecx+20h],ymm2 
5877254B  vmovdqu     ymmword ptr [eax+ecx+40h],ymm0 
58772551  mov         ecx,dword ptr [esi+90h] 
58772557  mov         edx,dword ptr [esi+2Ch] 
5877255A  vmovups     ymm4,ymmword ptr [eax+edx+20h] 
58772560  vmovups     ymm3,ymmword ptr [eax+edx] 
58772565  vmovups     ymm5,ymmword ptr [eax+edx+40h] 
5877256B  mov         eax,dword ptr [esp+14h] 
5877256F  vmulps      ymm0,ymm3,ymmword ptr [eax+ecx] 
58772574  vmulps      ymm1,ymm4,ymmword ptr [eax+ecx+20h] 
5877257A  vaddps      ymm1,ymm1,ymm0 
5877257E  vmulps      ymm0,ymm5,ymmword ptr [eax+ecx+40h] 
58772584  vaddps      ymm6,ymm1,ymm0 
58772588  vmulps      ymm0,ymm4,ymmword ptr [eax+ecx+80h] 
58772591  vmulps      ymm1,ymm3,ymmword ptr [eax+ecx+60h] 
58772597  vaddps      ymm1,ymm1,ymm0 
5877259B  vmulps      ymm0,ymm5,ymmword ptr [eax+ecx+0A0h] 
587725A4  vaddps      ymm2,ymm1,ymm0 
587725A8  vmulps      ymm0,ymm4,ymmword ptr [eax+ecx+0E0h] 
587725B1  vmulps      ymm1,ymm3,ymmword ptr [eax+ecx+0C0h] 
587725BA  vaddps      ymm1,ymm1,ymm0 
587725BE  vmulps      ymm0,ymm5,ymmword ptr [eax+ecx+100h] 
587725C7  mov         eax,dword ptr [esi+54h] 
587725CA  add         eax,dword ptr [esp+10h] 
587725CE  vaddps      ymm1,ymm1,ymm0 
587725D2  vmulps      ymm5,ymm1,ymmword ptr [eax+40h] 
587725D7  vmulps      ymm4,ymm2,ymmword ptr [eax+20h] 
587725DC  vmulps      ymm6,ymm6,ymmword ptr [eax] 
587725E0  mov         eax,dword ptr [esp+14h] 
587725E4  vmulps      ymm0,ymm6,ymmword ptr [eax+ecx] 
587725E9  vmulps      ymm1,ymm4,ymmword ptr [eax+ecx+60h] 
587725EF  vaddps      ymm1,ymm1,ymm0 
587725F3  vmulps      ymm0,ymm5,ymmword ptr [eax+ecx+0C0h] 
587725FC  vaddps      ymm3,ymm1,ymm0 
58772600  vmulps      ymm0,ymm6,ymmword ptr [eax+ecx+20h] 
58772606  vmulps      ymm1,ymm4,ymmword ptr [eax+ecx+80h] 
5877260F  vaddps      ymm1,ymm1,ymm0 
58772613  vmulps      ymm0,ymm5,ymmword ptr [eax+ecx+0E0h] 
5877261C  vaddps      ymm2,ymm1,ymm0 
58772620  vmulps      ymm0,ymm6,ymmword ptr [eax+ecx+40h] 
58772626  vmulps      ymm1,ymm4,ymmword ptr [eax+ecx+0A0h] 
5877262F  vaddps      ymm1,ymm1,ymm0 
58772633  vmulps      ymm0,ymm5,ymmword ptr [eax+ecx+100h] 
5877263C  mov         ecx,dword ptr [esp+10h] 
58772640  add         eax,120h 
58772645  vaddps      ymm0,ymm1,ymm0 
58772649  mov         dword ptr [esp+14h],eax 
5877264D  vmovdqu     ymmword ptr [ecx+edx],ymm3 
58772652  vmovdqu     ymmword ptr [ecx+edx+20h],ymm2 
58772658  vmovdqu     ymmword ptr [ecx+edx+40h],ymm0 
5877265E  mov         edx,dword ptr [esp+18h] 
58772662  add         ecx,60h 
58772665  add         edx,20h 
58772668  mov         dword ptr [esp+10h],ecx 
5877266C  mov         dword ptr [esp+18h],edx 
58772670  dec         edi 
58772671  jne         dgNewtonCpu::InityBodyArray+110h (58772520h) 
58772677  vzeroupper 


One thing is clear: the plugin version will be far better in 64-bit, because some of the more complex kernels will run out of registers with only 8, but with 16 they will all stay in cache and registers, therefore this code should run at full theoretical speed.
So I expect the AVX solver to be much faster using one core than the CPU version using 4 cores,
and about 3 to 4 times faster using 4 cores.

Re: Parallel solver experiments

Postby JoeJ » Tue Apr 10, 2018 10:14 am

I like this. I am also thinking a CPU core is at least 4 times faster than a GPU core, plus there is the additional complexity of GPU contact generation and transfers... maybe AVX turns out to be the most practical. Also, almost everybody has it.

Re: Parallel solver experiments

Postby Julio Jerez » Tue Apr 10, 2018 3:57 pm

Yes, I am very optimistic about this. It could come close to a mid-range GPU, provided that the code approaches the theoretical throughput of operations per clock, but that is hard to achieve with sequential programming.
The trick is to have tight loops that preload data into level-one cache and do a significant number of multiplies and adds, none of that shuffling or special instructions. That is what the plugin solver is designed to do.

One thing that baffled me is that I was using the fused multiply-add operation, which I thought was standard for all AVX-ready CPUs, but it turns out that some do not support it. It appears that for the processors that do support it the throughput is not just double, it is four times, since an add done with fmadd and a constant of 1.0 is twice as fast as a simple add, plus it does two operations.
But for the first version I will stick to simple add and simple mul operations.
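
To illustrate the difference (a sketch, not the plugin code; the helper names are hypothetical): on an FMA-capable CPU the accumulation collapses into one fused instruction, while the AVX-only fallback keeps the separate mul and add.
Code: Select all
#include <immintrin.h>

// acc + a * b for eight floats at a time, fused: one instruction, needs FMA support
inline __m256 MulAdd_Fma(__m256 acc, __m256 a, __m256 b)
{
    return _mm256_fmadd_ps(a, b, acc);
}

// same result with plain AVX: separate multiply and add
inline __m256 MulAdd_Avx(__m256 acc, __m256 a, __m256 b)
{
    return _mm256_add_ps(acc, _mm256_mul_ps(a, b));
}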

I was searching to see if it was true that some Core i7s do not support multiply-add, and it seems that Sandy Bridge does not.
Intel Core 2 and Nehalem:

4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
Intel Sandy Bridge/Ivy Bridge:

8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication
Intel Haswell/Broadwell/Skylake/Kaby Lake:

16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
AMD K10:

4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores):

8 DP FLOPs/cycle: 4-wide FMA
16 SP FLOPs/cycle: 8-wide FMA
AMD Ryzen

8 DP FLOPs/cycle: 4-wide FMA
16 SP FLOPs/cycle: 8-wide FMA
Intel Atom (Bonnell/45nm, Saltwell/32nm, Silvermont/22nm):

1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
6 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication every other cycle
AMD Bobcat:

1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
4 SP FLOPs/cycle: 4-wide SSE addition every other cycle + 4-wide SSE multiplication every other cycle
AMD Jaguar:

3 DP FLOPs/cycle: 4-wide AVX addition every other cycle + 4-wide AVX multiplication in four cycles
8 SP FLOPs/cycle: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle
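
As a worked example of what those per-cycle figures mean for a single core (a sketch; the 3.5 GHz clock is an illustrative assumption, not a measured value):
Code: Select all
#include <cstdio>

int main()
{
    const double clockGHz = 3.5;         // assumed core clock, for illustration only
    const double spFlopsPerCycle = 32.0; // Haswell/Skylake row above: two 8-wide FMAs
    printf("theoretical SP peak: %.0f gigaflops per core\n", clockGHz * spFlopsPerCycle);
    // prints 112; real kernels land well below this theoretical peak
    return 0;
}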


What this means is that since this is an external plugin, we can have different versions that check the hardware support; see the sketch below.
It seems that the trick to get close to theoretical performance is to have enough instructions in the loop body, so that the compiler can schedule muls and adds in parallel and keep the units busy.
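
A minimal sketch of that hardware check (hypothetical names, shown with GCC/Clang's __builtin_cpu_supports; MSVC would use __cpuid instead):
Code: Select all
#include <cstdio>

enum SolverPath { kSolverSse, kSolverAvx, kSolverAvx2Fma };

// pick the widest code path the CPU supports when the plugin is loaded
SolverPath SelectSolverPath()
{
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma")) {
        return kSolverAvx2Fma;  // 8-wide with fused multiply-add
    }
    if (__builtin_cpu_supports("avx")) {
        return kSolverAvx;      // 8-wide, separate mul and add
    }
    return kSolverSse;          // 4-wide baseline
}

int main()
{
    printf("selected solver path: %d\n", (int) SelectSolverPath());
    return 0;
}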

This appears to be true: the profile is showing that the same code is not just faster, it seems an order of magnitude faster so far. This could turn out very well; I am very optimistic about this now.

Re: Parallel solver experiments

Postby Julio Jerez » Thu Apr 12, 2018 9:18 pm

I am doing the AVX and AVX2 versions in one plugin.
According to this doc
[url]https://computing.llnl.gov/tutorials/linux_clusters/intelAVXperformanceWhitePaper.pdf[/url]

the expected performance gain from going from SSE to AVX2 is not just double, it is about 2.8x.
I expect about 3.5x or better, since the plugin is all packed data that fits in level-one cache, without branches.

In fact, even an SSE plugin should be faster, since the engine SSE code is AoS, which can only use 75% of a register and requires a lot of data swizzling, while the plugin is all SoA, with no swizzling and no branches; see the sketch below.
But we will see.
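
Here is a small sketch of the layout difference (hypothetical types, not the engine's):
Code: Select all
// AoS: one struct per body; a 4-wide SSE register loaded from m_veloc carries
// x, y, z plus one unused lane, and cross-lane math needs shuffles.
struct BodyAos
{
    float m_veloc[4];   // x, y, z, unused
    float m_omega[4];
};

// SoA: one array per component; every SIMD lane does useful work and consecutive
// bodies are adjacent in memory, so 4 (SSE) or 8 (AVX) bodies are processed per
// instruction with no swizzling and no branches.
struct BodySoa
{
    float* m_velocX;
    float* m_velocY;
    float* m_velocZ;
    float* m_omegaX;
    float* m_omegaY;
    float* m_omegaZ;
};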

Re: Parallel solver experiments

Postby Julio Jerez » Sat Apr 14, 2018 1:07 pm

Well, I have now added the first of the four functions that make up the solver.
The solver has four major parts:
Setup bodies
Setup joints
Solve for joint forces
Integrate bodies

I completed the first one and added a floating-point counter to measure the effective floating-point operations per second.
So far with the first function I am getting an anemic 3.6 gigaflops,
and no difference when using AVX2; if anything it seems to be slower.
The papers say a single core should yield about 70, but I know that is a ridiculous exaggeration. I was expecting 10 to 20, but less than 4 is a great disappointment.
Maybe it is because this part is dominated by memory bandwidth.
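
A sketch of that kind of measurement (hypothetical names and workload, not the engine's counter): count the floating-point operations the kernel is known to perform and divide by the elapsed time.
Code: Select all
#include <chrono>
#include <cstdio>

int main()
{
    const int bodyCount = 1 << 20;
    const int flopsPerBody = 6;   // mul/add count in the stand-in loop body below

    auto t0 = std::chrono::high_resolution_clock::now();
    float acc = 0.0f;
    for (int i = 0; i < bodyCount; ++i) {
        float v = float(i) * 0.001f;          // 1 mul
        acc += v * v * 0.5f + v * 0.25f;      // 3 mul + 2 add
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double gflops = double(bodyCount) * flopsPerBody / seconds * 1.0e-9;
    printf("effective gigaflops: %.2f (checksum %f)\n", gflops, acc);
    return 0;
}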

Re: Parallel solver experiments

Postby Julio Jerez » Sat Apr 14, 2018 3:42 pm

What I need to do is add the SSE module first; that way we can use it as the control to measure against.
3.7 gigaflops is not really a bad figure; I was just expecting something better because of the hype,
but in reality this is quite good.
The SSE module will indicate whether the low value is because of bandwidth.

Re: Parallel solver experiments

Postby Julio Jerez » Sat Apr 14, 2018 11:39 pm

At least there is a substantial gain between SSE and AVX:
AVX can do about 1.7 times the flops of SSE.
[Attachment: Untitled.png]


This seems a good result so far, and it should get better as I write the next parts, which are heavier in floating-point operations than in memory reads and writes.
I also think the code is far better than the sequential solver, but that is hard to measure since it is too difficult to measure flops in the sequential solver.
I will continue developing the SSE solver first.

Re: Parallel solver experiments

Postby JoeJ » Sun Apr 15, 2018 3:10 am

Just stumbled upon this: https://troddenpath.wixsite.com/troddenpath/msc-thesis
Did not look yet, but having the penetration volume sounds very good.
(I still think a single contact joint can be generated easily from that, and multiple contacts between two bodies could be avoided. I got much better stacks by doing so in the past.)

Re: Parallel solver experiments

Postby Julio Jerez » Sun Apr 15, 2018 9:12 am

I clicked the PDF and read the intro; they say this:
Since 30 FPS is considered the bare minimum for any real-time application (i.e. a maximum of 33 ms per frame), any pairwise computation in the order of 1 ms constitutes a serious performance bottleneck where many such pairs exist.


And further down they say this:
When processing multiple collisions simultaneously on a 4-core processor, the average running cost is as low as 5µs.


The first statement is simply not true; not even the Newton engine has a cost of 1 ms per collision pair, not even in debug mode.
And I am very skeptical of the second claim, 5 µs per pair independent of mesh topology complexity, including concave meshes; that I have to see. Their videos do not indicate that kind of performance even with the simple shapes.

I do agree the quickhull is quite expensive for collision, especially for Newton, which uses it heavily since collision in Newton is based on body intersection rather than body distance. Therefore a faster way to calculate penetration depth could be quite a help in that area.
I will read the paper, but those two claims are not giving me a lot of confidence.

They are saying that their system is on the order of 500 times faster than other collision systems and independent of shape complexity. That is a really extraordinary claim, but their evidence is not extraordinary.

Re: Parallel solver experiments

Postby JoeJ » Sun Apr 15, 2018 12:48 pm

Julio Jerez wrote: ... collision in Newton is based on body intersection rather than body distance.


So you calculate the intersection volume? I did not know that; sometime I'll try the single-contact idea with this...

Maybe the paper requires one precomputed database per potential pair; that wouldn't be practical. Still have not read it.

Re: Parallel solver experiments

Postby Julio Jerez » Fri Jun 15, 2018 2:23 pm

Hey Joe, are you still there?
I finally perfected the production version of the parallel solver.

Remarkably, it seems to have the same stability as the sequential solver at the same number of iterations.
It is slower because it has more overhead, but the slowdown is about 10 to 20%, as opposed to before, when it was 400% slower.

What this means is that when we move the parallel solver to use SIMD, since the sequential version uses only 3 floats of a SIMD register while the parallel solver uses all the floats, that will make up for the 20% lost per joint.
But the parallel solver will solve 8 joints per iteration, so in theory it should be 4 times better in SSE and 8 times better in AVX; I will take 2 to 4 times better.

I committed the test with the 30 x 30 pyramid, and it manages to bring it to sleep with no problems.
I do not know if it is just me, but it seems to be even more stable.

There is some overhead because it operates on the islands generated by the sequential solver, but if this is so good then maybe we can optimize that out and use only the parallel solver for everything.

Of course for that it needs to handle joints, which I have not done yet, but I will work on that next week.
This week I will finish the SIMD parallelization to see where we stand.

Please check it out; there are no test joints yet, because it does not support them yet, and there is no filter for that yet.
