Yes, I am very optimistic about this. It could come close to a mid-range GPU, provided the code approaches the theoretical throughput of operations per clock, but that is hard to achieve with sequential programming.
The trick is to have tight loops that preload data into the L1 cache and do a significant number of multiplies and adds, none of that shuffling or special instructions. This is what the plugin solver is designed to do.
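A minimal sketch of what such a kernel looks like (the function name and shape are illustrative, not the actual plugin code): the loop body is nothing but a load, a multiply and an add per element, over a block small enough to stay in L1, which is exactly the pattern auto-vectorizers handle well.

```c
#include <stddef.h>

/* Illustrative inner kernel (hypothetical name): y[i] += a * x[i] over a
 * block that fits in L1.  One multiply + one add per element, no shuffles,
 * no special instructions -- the compiler can turn this into packed SSE/AVX. */
static void axpy_block(float *restrict y, const float *restrict x,
                       float a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = y[i] + a * x[i];
}
```

Compiled with something like `-O2 -march=native`, a loop of this shape should vectorize without hand-written intrinsics.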
One thing that baffled me is that I was using the fmuladd operation, which I thought was standard for all AVX-ready CPUs, but it turns out that some do not support it. It appears that on processors that do support it, the throughput is not just double, it is four times: an add done as fmuladd with a constant of 1.0 is twice as fast as a simple add, plus it does two operations.
But for the first version I will stick to simple add and simple mul operations.
I was searching to see whether it was true that some Core i7 chips do not support fused multiply-add, and it seems that Sandy Bridge does not.
Intel Core 2 and Nehalem:
4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
Intel Sandy Bridge/Ivy Bridge:
8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication
Intel Haswell/Broadwell/Skylake/Kaby Lake:
16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
AMD K10:
4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores):
8 DP FLOPs/cycle: 4-wide FMA
16 SP FLOPs/cycle: 8-wide FMA
AMD Ryzen:
8 DP FLOPs/cycle: 4-wide FMA
16 SP FLOPs/cycle: 8-wide FMA
Intel Atom (Bonnell/45nm, Saltwell/32nm, Silvermont/22nm):
1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
6 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication every other cycle
AMD Bobcat:
1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
4 SP FLOPs/cycle: 4-wide SSE addition every other cycle + 4-wide SSE multiplication every other cycle
AMD Jaguar:
3 DP FLOPs/cycle: 4-wide AVX addition every other cycle + 4-wide AVX multiplication in four cycles
8 SP FLOPs/cycle: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle
What this means is that, since this is an external plugin, we can ship different versions that check for hardware support.
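On GCC and Clang the check itself is a one-liner; a hedged dispatch sketch (the kernel names are placeholders), where `__builtin_cpu_supports` queries CPUID once and caches the result, so the plugin loader can choose a build at startup:

```c
/* Runtime dispatch sketch (GCC/Clang builtin; not portable to MSVC).
 * Returns which kernel flavor the loader should pick on this CPU. */
static const char *pick_kernel(void)
{
    if (__builtin_cpu_supports("fma"))
        return "fma";       /* Haswell+, Ryzen, Piledriver+ */
    return "muladd";        /* Sandy Bridge and older: separate mul + add */
}
```

On MSVC the equivalent check would go through `__cpuid`/`__cpuidex` instead.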
It seems that the trick to getting close to theoretical performance is to have enough instructions in the loop body, so that the compiler can schedule the muls and adds in parallel and keep the units busy.
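One common way to get enough independent work into the loop body is multiple accumulators; a sketch (illustrative, not the plugin's code): a single-accumulator sum is limited by the latency of one dependent add per iteration, while four independent accumulators let the scheduler keep several adds in flight.

```c
#include <stddef.h>

/* Summing with four independent accumulators breaks the add-latency
 * dependency chain: each s0..s3 chain is independent, so the FP units
 * can work on them in parallel instead of waiting on one running total. */
static float sum4(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)      /* leftover tail elements */
        s += x[i];
    return s;
}
```

Note this changes the order of the floating-point additions, so strict-IEEE compilers will not do this transformation on their own without `-ffast-math` or similar.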
This appears to be true: the profiler shows the same code is not just faster, it seems an order of magnitude faster so far. This could turn out very well; I am very optimistic about this now.