Parallel solver experiments

by **Julio Jerez** » Tue Jun 19, 2018 11:12 am

Well, at least it is not a regression in single core mode, and we can claim it is equal or better.
But when adding extra resources we do see significant gains, therefore it is good.
Also remember that with SSE registers we only have 25% headroom for improvement (vector operations map a three-float array onto a four-float register), so this is the only thing that can counter the extra overhead.
When we move this to AVX registers we should expect a theoretical four times speed up, but more realistically I expect a 25% to 50% gain.

Also, like I said before, this was the brute force version; I have now committed the next optimization.

Basically this solver needs to deal with dynamic sleeping, or else large islands will cause unnecessary slow down. This may not be much help on tall stacks, but it is a huge help on piles of rubble.

The second optimization is for small islands. The overhead is too big for a small island, so they are now discriminated. An island has to be more than 128 joints large to be dispatched to that solver.
This may later be a parameter, but for now it is fixed.
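To make the two optimizations concrete, here is a minimal sketch of the dispatch rule, assuming each island is just a record with a joint count and a sleeping flag. The names `dispatch`, `PARALLEL_JOINT_THRESHOLD` and the dictionary layout are hypothetical and are not taken from the engine source; only the 128 joint cutoff comes from the post.

```python
# Hypothetical sketch, not engine code: islands are plain dictionaries.
PARALLEL_JOINT_THRESHOLD = 128  # fixed for now, as stated in the post


def dispatch(islands, threshold=PARALLEL_JOINT_THRESHOLD):
    """Split awake islands into (parallel, default) solver queues.

    Sleeping islands are skipped entirely, and an island needs MORE than
    `threshold` joints to be sent to the parallel solver.
    """
    parallel, default = [], []
    for island in islands:
        if island["sleeping"]:
            continue  # dynamic sleeping: no solver work at all
        if island["joints"] > threshold:
            parallel.append(island)
        else:
            default.append(island)
    return parallel, default
```

Passing `threshold` as an argument is one way the cutoff could later become a parameter.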

This is all committed now, can you please try again?

I will play with it during the week, then on the weekend I will start adding the support for joints.

by **Dave Gravel** » Tue Jun 19, 2018 11:59 am

It is very slow now, with one core I get 0 to 3 fps.
If I enable cores I get around 5fps, at that speed it is hard to open the option menu.

by **Julio Jerez** » Tue Jun 19, 2018 12:26 pm

Do you mean it is slow all around?
With and without the island solver?
Are you sure you are not compiling a debug build?

by **Dave Gravel** » Tue Jun 19, 2018 12:35 pm

Yes very sorry, it is set to debug mode :oops:

Now in release mode the speed is back to normal.

The speed doesn't seem to have changed in one core mode for me.
With 4 or 8 threads I think I get around 5 fps more.

by **Julio Jerez** » Tue Jun 19, 2018 1:33 pm

Ah ok, the important part is that it is not worse and that it is a gain in some configurations.
It is funny how that works. I was expecting better performance on AMD in single core, but I guess this exposes the fact that the Intel arch is better at float than AMD, at least on older generations.
Basically, until the new Ryzen, AMD processors share one floating point unit per two cores, and this is why they underperform Intel.
However the achievement is that there is no regression from doing this, so we can continue.
It is possible that single core gets better when we re-enable the plugins.

This is why: in the baseline solver I can only use instructions that are SSE, or it will crash on some systems, but that does not apply to the DLLs.

The SSE version of the DLL will use the FMA instruction, which will literally cut the number of instructions in half on the CPUs that support it.
Then there will be the AVX one, which will cut the register pressure by half again,
and finally there will be AVX2, which will cut both the registers and the instructions by 2.
All this on a high end system should yield a theoretical 4x gain in single core, with zero loss.
I will be happy if it makes it faster than the normal one even by a little bit.

Here is the road map from now for the next few days (weeks):
-Add joint support
-Re-enable the plugins
-Start work on cloth solver

by **Dave Gravel** » Tue Jun 19, 2018 3:56 pm

Yes, AMD has always gotten a bit lower speed.
I have played a lot with emulation, virtualization and the Linux kernel before, and I can say Intel has always given better results. It is more critical for older AMD models; some models need SSE and some important instructions for speedup or multithreading. AMD is the winner at giving pretty good, stable and long-lived products for a low price. When I disable virtualization and enable turbo or overclock my CPU to 4.8 GHz, it performs pretty nicely with a good gain. I wish I could get my hands on a good Ryzen model for my next configuration.

by **JoeJ** » Wed Jun 20, 2018 3:04 am

Just tried the new build with the smaller stack, which settles too quickly to do detailed comparisons.
But i can see parallel solver still seems as fast as disabling it, both peak at 11-12 ms, using only one core.

So do i get this right and Jacobi is as fast as Gauss-Seidel? I'm still curious how you managed this. Better SSE utilization for the former?

by **Julio Jerez** » Wed Jun 20, 2018 3:56 am

If doing one joint at a time, the Jacobi solver is slower than Gauss-Seidel because it has the overhead of calculating the weights, plus it converges more slowly.
However, since Jacobi is not recurrent, you can rearrange the joints in such a way that you can use each lane of a SIMD register as an independent scalar arithmetic unit. Therefore an SSE register is like four cores that can do four joints per iteration.
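A small illustration of why that rearrangement is possible, using a plain textbook linear system rather than the engine's joint rows (the weighting mentioned above is left out, and the function names are made up for this sketch). A Jacobi sweep reads only the previous iterate, so rows can be processed in any order, or side by side in SIMD lanes. A Gauss-Seidel sweep reads values written earlier in the same sweep, so its result depends on the order.

```python
def jacobi_sweep(A, b, x, order=None):
    """One Jacobi sweep: every row reads only the previous iterate `x`."""
    n = len(b)
    order = range(n) if order is None else order
    new_x = list(x)
    for i in order:
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        new_x[i] = (b[i] - s) / A[i][i]
    return new_x


def gauss_seidel_sweep(A, b, x, order=None):
    """One Gauss-Seidel sweep: each row sees rows already updated this sweep."""
    n = len(b)
    order = range(n) if order is None else order
    x = list(x)
    for i in order:
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]  # written in place, read by later rows
    return x
```

Running both sweeps on the same system with the row order reversed gives an identical Jacobi result and a different Gauss-Seidel result, which is the "not recurrent" property in a nutshell.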

Because a vector only uses three elements when used as an array of structures, only 75% of the computing power is used at most.
My gamble is that the unused 25% is sufficient to make up for the overhead of arranging the data.
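As a rough sketch of the data arrangement being described, assuming plain Python lists stand in for SIMD registers (`to_soa`, `dot_soa` and `aos_lane_utilization` are hypothetical helper names, not engine functions): four three-component vectors are transposed so that each "register" holds the same component of four different joints, and every lane does useful work.

```python
LANES = 4  # 32-bit floats per SSE register


def aos_lane_utilization(components=3, lanes=LANES):
    """Fraction of a register used when one xyz vector sits in one register."""
    return components / lanes


def to_soa(vectors, lanes=LANES):
    """Transpose `lanes` xyz vectors into three lane arrays (xs, ys, zs)."""
    if len(vectors) != lanes:
        raise ValueError("expected exactly %d vectors" % lanes)
    xs = [v[0] for v in vectors]
    ys = [v[1] for v in vectors]
    zs = [v[2] for v in vectors]
    return xs, ys, zs


def dot_soa(a, b):
    """Dot products for all lanes at once; lane i belongs to joint i."""
    ax, ay, az = a
    bx, by, bz = b
    return [ax[i] * bx[i] + ay[i] * by[i] + az[i] * bz[i] for i in range(len(ax))]
```

The transposition in `to_soa` is the arranging overhead; the bet is that the quarter of each register that the array-of-structures layout leaves idle pays for it.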

In the end it is marginally better; in fact, for very large islands you start to see the SoA solver pulling ahead.

This is for SSE; when it comes to AVX we now get 5 extra lanes (62% more floats), something Gauss-Seidel can never do.

I just finished the AVX one but I have not tested it yet; I will do it tomorrow morning. I expect a substantial, measurable gain.
And from there it can only get better as more resources become available. For example AVX2 can do fmadd, which doubles the floats.

by **Julio Jerez** » Wed Jun 20, 2018 11:22 am

Ok Joe and David, I just committed the AVX plugin.
Did not have time to make a movie, I am just going to say :mrgreen:

I am still in shock.
Of course not all CPUs that support AVX do it equally; for example AMD emulates it in hardware by issuing two SSE instructions, but I think that should be better than not doing it at all.
I committed it with the 40 x 40 pyramid, and on my system it peaks at 22 ms in single core.

Joe, your system was crashing before and you said it did not support AVX, but that was a bug in the plugin.
When you have time, please test it; if the option menu shows the AVX solver then it should work on your system.

Anyway, tonight I will add the AVX2, which I hope can resolve that test at about 16 ms in single core.
That's probably too optimistic, but it is my goal.

The one thing for sure is that now it can do it at over 60 fps with only two threads, and that is some serious performance even on entry level systems.

by **Dave Gravel** » Wed Jun 20, 2018 12:45 pm

Cool. OK, I have tested here on my AMD.
The AVX plugin runs a bit slower.

I get around 2 or 3 ms faster without the avx plugin.

by **Julio Jerez** » Wed Jun 20, 2018 2:37 pm

Yes, that's why I was so surprised when I saw an almost 50% speed up.
But I did not expect a regression on AMD.

Here is how I can explain it: essentially both AMD and Intel are 64 bit processors with 128 bit internal buses.
But AMD decided to keep the internal buses at 128 bit, whereas Intel went ahead and made the internal buses 256 bit to support AVX. I believe this is the case for CPUs after Sandy Bridge.

AMD went for more cores and more integer performance, while Intel went for wider float and integer units. So what AMD did is support the AVX instruction set, but they did not do anything to the micro architecture other than widening the register set; they did not widen the execution units.
Instead AMD put all the silicon into the APUs.

Here are two images that compare the Bulldozer and the Ivy Bridge microarchitectures; as you can see, the main difference is the internal memory.

Attachment: amdvsintel.png (345.52 KiB)

For us what this means is that if we want to take full advantage of the arch we need to make three plugins:
SSE4.2, AVX, AVX2

The SSE4.2 one will take advantage of the mul-add instructions that are available on AMD, which will cut the number of instructions almost by half while keeping the float unit busy.

The AVX one will take advantage of the higher bandwidth, but cannot use mul-add, because most older AMD chips do not support AVX2.

The AVX2 one will take advantage of both the bandwidth and the mul-add instructions that are available on the Intel Core i7.

Therefore the workhorse for Core i7 and Bulldozer should be the SSE4 solver.
The AVX is for newer generations like AMD Zen and Core i9.
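One way the plugin choice could be expressed is a preference table checked against the features the CPU reports. This is only a sketch under stated assumptions: the plugin names follow the list above, but the feature strings, the `PLUGINS` table and `select_plugin` are hypothetical, and a real build would query CPUID instead of receiving a set of labels. The table treats mul-add as a separate "fma" feature, which is an assumption of this sketch.

```python
# Hypothetical preference table, widest plugin first.
# Feature strings are illustrative labels, not real CPUID queries.
PLUGINS = [
    ("avx2", {"avx2", "fma"}),
    ("avx", {"avx"}),
    ("sse4.2", {"sse4.2", "fma"}),
]


def select_plugin(cpu_features, plugins=PLUGINS, fallback="baseline sse"):
    """Return the first plugin whose required features are all present."""
    available = set(cpu_features)
    for name, required in plugins:
        if required <= available:  # subset test: every requirement is met
            return name
    return fallback  # the base line solver, which only uses SSE
```

Given the AMD result reported earlier in the thread, the order of the table is the part most likely to need tuning per vendor; keeping it as data makes that a one line change.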

Here is what I do not understand: why AMD keeps making these mistakes. For the Ryzen arch, instead of making the internal bus 256 bit, they kept it 128 bit and added two more 128 bit float units.

Intel went a different route: they made a 256 bit bus and two full 256 bit float units that are fully orthogonal. AMD is always one step behind Intel in its vision of the future. They bet that Ryzen will beat Intel because there are more apps using SSE than AVX, since they can issue 4 SSE instructions on the fly while Intel can only issue two using the AVX float units. But what they don't count on is that on Intel the SSE unit and the AVX float unit are different.
This is the reason Intel consistently beats AMD in core per core benchmarks, while we may see some Ryzen benchmarks do better than Intel in multithreaded apps just because they have more cores.

In my opinion AMD made a huge mistake keeping the 128 bit bus and float units, but I am always wrong on predictions.

For us it reduces to this: we should make an SSE4.2 plugin so that we can use fmadd instructions.

by **JoeJ** » Wed Jun 20, 2018 2:59 pm

Edit: This is in response to you answering my Jacobi vs. Gauss-Seidel question. I've missed the posts in between and now i can't quote anything.

Thanks.
This makes me think that using AoS simd vectors for gauss seidel is only a small improvement over using regular old school FPU instructions. Did you ever compare those two options?

(Personally i never did - i use simd vectors sometimes but i guess this alone has no benefit.)

by **Julio Jerez** » Wed Jun 20, 2018 3:13 pm

JoeJ wrote:Thanks.
This makes me think that using AoS simd vectors for gauss seidel is only a small improvement over using regular old school FPU instructions. Did you ever compare those two options?
(Personally i never did - i use simd vectors sometimes but i guess this alone has no benefit.)

As far as I know this is not possible with a recurrent algorithm. Gauss-Seidel is based on using the result of the previous row for calculating the new row's result, so it is virtually not possible. There are tricks like running red-black coloring, but what they do is reduce the convergence rate by a great deal, plus they have a high overhead.

On the SIMD side you do get better performance, but it is never even close to the theoretical one because of all the data swizzling required, since the SSE instruction set does not have gather and scatter instructions.

Anyway, in case you synced, you should do it again. I made the same mistake of adding global constants that call constructors to initialize their values. That is a mistake because the plugin will call these functions before checking for support for the instruction set. This was why the old plugin crashed.

by **JoeJ** » Wed Jun 20, 2018 3:27 pm

I still get a crash from this release 'fixed initlization of plugin globals before validation of instruction…'
But only in release. Debug starts up without issues.

Attachment: crash2.JPG (205.5 KiB)

by **JoeJ** » Wed Jun 20, 2018 3:35 pm

Julio Jerez wrote: Here is what I do not understand: why AMD keeps making these mistakes. For the Ryzen arch, instead of making the internal bus 256 bit, they kept it 128 bit and added two more 128 bit float units.

I assume they did it because software needs to be written to utilize SIMD properly, and they know this is rarely done.
Better spending half chip area on branch prediction, so they can process more useless work
