Parallel solver experiments

by **Julio Jerez** » Thu Mar 29, 2018 1:25 pm

I am opening this thread to talk about the frustrations implementing a Jacobi parallel solver.

I spent the last two weeks working on that solver just to find out that is was fundamnatially wrong, basically is showed that is was possible that after one interation step the solution diverged instead of converge to a solution.
In general this is not a problem if you can run a sufficiently large number of iteration, and some iteration the residual decreases on average, but that is one of the problems with real time you can no run a unlimited number of iterations, so what is needed is an algorithm with a monotonically decreasing convergent rate per iteration.
for PSD matrices Gauss sidle and gradient descent fall on that category.
Jacobi and conjugant gradient do not. so the are far slower because they need more iterations for the type of matrices for which the are known to converge.

It is know that for diagonal dominant matrices Jacobi converge, however not all PSD matrices are diagonal dominant, and that was my huge mistake

People has tried to implement variants of Jacobi, to make it fixable for high perforce computing, by threading convergence rate for parallel execution. I tried one of those variance without success, or so I thought.
I spent some time debugging the code, and there are not bug, the solver simply diverge, so I spent time reading some of the literature again to see what I have wrong.

The variant of the Jacobi solver I am trying is called Weighted Jacobi method
https://en.wikipedia.org/wiki/Jacobi_method

basically what this is, is that each partial variable is calculated, they to do not replace old variable value, instead only a fraction contribut to the total solution. like this.
say x0 is a partiltial solution and you calculate an new x1, you do not replace x0 with x1, instead you do
x0 = x0 + (x1 - x0) * w
where w is a weight factor between 0 and 2.0

as you can see a value of w = 0 make make the solution always x0, and value of 1 is a full Jacobi and a value larger than 1.0 make the solution very unstable.

for PSD matrices using Gauss sidle, a value of w from 1.1 to 1.3 accelerates converge a great deal,
however for Jacobi w should always be less than 1.0, many books on numerical method recommend 2/3

my big mistake was how much lower than 2/3 than value should be, it turned out that values should be very, very small, something like 0.66 or even 0.05
So as you can see is not only that Jacobi has poor convergence rate is that it is much, much worse that I could possible imagine it was going should be.

I was trying with the fix values of 2/3 (0.66) as many people recommend and all I got was explosions for any relatively complex set ups, this wikipead article say that a way to calute the weigh is by doing this.

Weighted Jacobi method[edit]
The weighted Jacobi iteration uses a parameter
ω {\displaystyle \omega }
to compute the iteration as
x ( k + 1 ) = ω D − 1 ( b − R x ( k ) ) + ( 1 − ω ) x ( k ) {\displaystyle \mathbf {x} ^{(k+1)}=\omega D^{-1}(\mathbf {b} -R\mathbf {x} ^{(k)})+\left(1-\omega \right)\mathbf {x} ^{(k)}}

with
ω = 2 / 3 {\displaystyle \omega =2/3}
being the usual choice.[2]
Convergence in the symmetric positive definite case[edit]
In case that the system matrix
A {\displaystyle A}
is of symmetric positive-definite type one can show convergence.
Let
C = C ω = I − ω D − 1 A {\displaystyle C=C_{\omega }=I-\omega D^{-1}A}
be the iteration matrix. Then, convergence is guarenteed for
ρ ( C ω ) < 1 ⟺ 0 < ω < 2 λ max ( D − 1 A ) , {\displaystyle \rho (C_{\omega })<1\quad \Longleftrightarrow \quad 0<\omega <{\frac {2}{\lambda _{\text{max}}(D^{-1}A)}}\,,}

where
λ max {\displaystyle \lambda _{\text{max}}}
is the maximal eigenvalue.
The spectral radius can be minimized for a particular choice of
ω = ω opt {\displaystyle \omega =\omega _{\text{opt}}}
as follows
min ω ρ ( C ω ) = ρ ( C ω opt ) = 1 − 2 κ ( D − 1 A ) + 1 for ω opt := 2 λ min ( D − 1 A ) + λ max ( D − 1 A ) , {\displaystyle \min _{\omega }\rho (C_{\omega })=\rho (C_{\omega _{\text{opt}}})=1-{\frac {2}{\kappa (D^{-1}A)+1}}\quad {\text{for}}\quad \omega _{\text{opt}}:={\frac {2}{\lambda _{\text{min}}(D^{-1}A)+\lambda _{\text{max}}(D^{-1}A)}}\,,}

where
κ {\displaystyle \kappa }
is the matrix condition number.
Recent developments[edit]

so I tried to experiment with arbitrary low w value to see is I get something stable. and I got at 0.2f

I you guy sync I committed the solve using 4 thread and solving a stack of variable size masses and at a value of 0.2 the system beaocme stable.

It just happens that the inverse of the highest eigen value of teh interationmass matrix is about 1 / 7 = 0.143

but as you can see the solver apply that value all joints, this is why people has come up with schedules relaxation, this is jsut a fancy way to apply different weighs to different joints base on th inverse of the Eigen value.

anyway the moral is that as you can see is possible to have a parallel solver working stable if we accept that we need many more iterations. and this is the solver become an option.
Basically for CPU with low core count is only of theoretical value.

But if we move this to a GPU with say 32 or more cores, then it becomes a practical solution.

Imagine a GPU with 128 cores, doing say 4 time as many passes, that's like using 32 cores on a cpu for the same perfomance per core.
there is no way even a 16 cores CPU can compete with that.
Therefore the parallel solve project is not dead.

Joe or sweenie if you sync, there is a stack that run with run in for threads, if you switch to
"solve large island in parallel" you will see that is about twice as fast, but soft and smoggy, not too jittery which is important.
the stack does not explode or colapses, it does not goes to rest either but we knew that.

This seem like a good sign.

by **JoeJ** » Thu Mar 29, 2018 3:14 pm

I'll try out later, but have you tried to increase your weight after each iteration, e.g. 1st 0.02, 2nd 0.1, 3rd 0.3, 4th 0.5...

I usually do this when using a kind of solver that math guys would probably name 'jacobi method'

, and i end up with some hand tuned formula to calculate it looking somehow like this:

float corrF = sqrt(float(iter+0.5f) / float(iterations));
corrF = 0.15f + corrF * 0.35f;

But just to mention - I assume the weighted jacobi method is already like this but better.
(Still wondering what 'eigen Value' and 'spectral' stuff means, reading that often... :mrgreen:

)

by **Julio Jerez** » Thu Mar 29, 2018 3:59 pm

No I do not try those tricks, because that the weight is not a function of the residual,
for example say you has

A * x = B

this mean that

to find x you do this

A * x - B= r

now r is a new vector that represent the name the residual or error.
when r = zero, it means, there is a value of X that solves the equation.

a monotically convergent iterative algorithm is one that given some initial guess solution X0
a new iteration compute a new solution X1,
so we have

A * x0 - B = r0
and an new

A * x1 - B = r1

if mag(r1) < mag(r0)

then the algorith converge monotonically and asymptotically to a solution.

for Jacobi it happens that the only way to guarantee that Mag(r1) < Mag(r0) is if
we make
x1 = x0 + (x1 - x0) * w

where w is a vectors of coefficient that all have to be less than 1.0

Making w larger after some iteration means that some residual will diverge by some amont the effect will be small only is because r is already small for somecorresponding x, and that is unpredictable.
we need each w required to make sure the r decreased after each iteration regardless of the value of r.

a simple system is to make make w a vectors of constant value so is just a number by selecting the worse w. for some well behave system w is set to 2/3 and this comes from spectral analysis of the mass matrix, by just use the largest Eigen value.

a better system is to come up with a heuristic than can estimate vectors w from the properties of the system and use variable w for different variable.

for a rigid body system is easy to show that the Eigen values are correlated to how many joints are entries of the off diagonal matrix are non zero, which is the same as how many joints are connected to the bodies connected by a joint.
this is naïve because is does not consider the value of the masses but is close enough to the actual solution which is reducing the matrix to a QR system, but QR is extremely memory and time consuming.

what it will happened is that for stacks with relatively equal masses the w vectors is the number of joint connected to each body because the masses are more of less equal therefore they factor out and the system converge nicely and slowly. for systems with uneven masses the system will have hard time converging and it may even explode.
this is why this will be an option only for many cores systems where convergence will be achieved by doing more passes at the expense of many cores.
But remember the condition is that in each pass |r(I + 1)| < |r(I)|
unconditionally, if this is no quarante the system blows up.

That is the theory, the goal now is just to have a solve that do not explode. and that the quality is a function of the number of passes.

Here is a question that I have, why do we insist is using programming like CUDA, OpenCL or AMP for high performance computing?
what is wrong with Direct Computer and compute shaders.

for what I can see Direct Computer and compute shaders are far more flexible that those languages and have more features and do not lack any feature of those languages.

can we just try Compute shader with opengl 4.xx?

by **JoeJ** » Thu Mar 29, 2018 6:29 pm

Julio Jerez wrote:Here is a question that I have, why do we insist is using programming like CUDA, OpenCL or AMP for high performance computing?
what is wrong with Direct Computer and compute shaders.

for what I can see Direct Computer and compute shaders are far more flexible that those languages and have more features and do not lack any feature of those languages.

can we just try Compute shader with opengl 4.xx?

Yep - there is no more need for GPGPU API for games.

Compute shaders have the same functionality than OpenCL 1.x, (they miss 2.0 features like device side enqueue, but because NV does not support 2.0 that's no option anyways).
But with compute shader, you can write the workload of an upcoming dispatch from a current shader to avoid the need to know the workload on CPU (Indirect dispatch).

With low level API (DX12 or Vulkan), you can further prerecord all these dispatches to a command buffer, upload it to GPU and execute all with a single call from CPU, so the CPU <-> GPU interaction can be limited to almost zero, even for complex algorithms. To make this work you need to prerecord all potential dispatches once. At runtime some of them might end up having zero or little work, which is the remaining problem with current low level APIs. Device side enqueue or other options to start shaders directly from within shaders would help here, currently async compute can help as well.

Vulkan Compute Shader code == OpenGL Compute Shader code, both use GLSL.
My experience with OpenGL compute is dated, but back the time it was slow. NV was 2 times faster in OpenCL.
Now with Vulkan it's the opposite. Vulkan compute is faster than OpenCL both on NV and AMD, due to precompiled command buffers and also compilers being simply better. (My VK branch is almost 2 times faster than my CL branch.)

For research and testing, i still stick to OpenCL because it's so much easier to use than VK and also OpenGL. Shaders are pretty easy to port anyways (i do this mostly with preprocessor and little extra work to write function headers twice, once for OpenCL and once for GLSL).

by **Julio Jerez** » Sun Apr 01, 2018 10:14 am

I do not know why by I installed intel opencl, and it say that does no find any device, and is all disabled in VS 2013 and 2015. This is very annoying, I am inclined to do this with Compute shaders.

anyway, I did some more work on getting weigh per variable instead of an one fix all weigh, and the results seem very encouraging. It is a lot more tricky to get this going, I have to do a lot of brute force implementation by baking some values and the running an extra loop to recover the result. later for a production solver this variables can be maintained incrementally. And I expect
that optimization to yield from 10 to 20% faster, but for now I will keep like that until I add the joints.
if you sync you will see the wood stack runs a little over twice as fast.
It still hurts, but the jitter now is not the kind of random vibration that we saw with the fix weigh, is more like the tower moves up an down periodically because the forces are not close enough to the solution, but this expected of a jacoby step.
This is a very good sign that indicates the solution is stable and monotonically decrease the error.
basically it means that more iterations will make it more stable, and if the solver is faster the this is possible.

The moves the solver from theoretical value category when using the CPU to an optional possibility that application can enable it want to.
The next step after optimizations and some more tweaks is to see if we can make
a default options for cpu the user can disabled if he wants for large stacks.

Since most games and simulations do not really make large stacks, if we we can make this stable say on stacks that are 10 body high, them we can make it a default option, and for High performance computing mandatory.

I took some traces of the difference, here is with sequential solver, the entire stack goes to one thread. with the little slivers on teh other thread doing the contacts calculation.

: sequential.png (45.04 KiB) Viewed 9395 times

here is the same problem with parallel solver, you can see thet work load goes almost even to all four thread, Even with two core it is a win, therefore the only limitation now is convergence.

: parallel.png (50 KiB) Viewed 9395 times

I am going to add the Joint before moving on, so that we have the full picture of what it is going to be

by **JoeJ** » Sun Apr 01, 2018 10:55 am

Intel CL SDK messed some things up for me too, but it is the only way to debug CL kernels (!), so you might keep trying (have not tried the debugger myself.)
You can install multiple or even none OpenCL SDKs, it should still work. But you need runtime drivers - i think they are a seperate download, and hard to find.
I've used AMD driver on Intel (CPU only) and this worked (and sometimes was much faster, sometimes slower than Intel driver.)

Edit https://software.intel.com/en-us/articles/opencl-drivers
Site says driver comes with GPU driver, so should already have it

Just running your new commit 'getting working parallel solver'.
(Note: See my joint test jittering heavy with the parallel solver enabled.)

Some numbers using the default demo (wooden box stacks):

VS2017 x64 release: Fluctuating runtime up to 15 ms, then sleeping at 0.19 ms after some seconds.
Enabling parallel quickly after start up: consistent 5-6 ms, no sleeping. After a while stack starts vibrating vertically, 6-7 ms.

VS2015 x64 release: Smoother runtime peaking at 13 ms, then sleeping at 0.19 ms after some seconds.
Parallel as above.

Repeating tests with 2017 i also see smoother runtimes, so no difference for me. (AMD says fur Ryzen 2017 can be 50% better.)

It's very interesting the parallel solver is faster in the worst case. I think runtime peaks are Newtons biggest performance problem. I don't have big hopes you can ever fix this, because in physics you can not distribute workloads over time as in graphics, but 2 times speed up would be awesome!

by **Julio Jerez** » Sun Apr 01, 2018 11:39 am

tso you got about three time speed up,

That is awesome.
Remember this is a killer stress test, that stack is very hard to get stables because the high dregree of coupling between the prices.
For example the long log are connected to 14 small pieces, and the have a nasty skew inertia, the make that very small changes in torque translate to large changes on angular velocity in some axis, this is probably the most difficult stack in all of the demos.

Now imagine the solver optimized, and some normal pile of debry, in these cases what st get is a very large island that is not more than 4 or 5 body high, maybe 6, but that is very wide.
Or this cases as the simulation advance, many large pieces go to rest, but some do not.
The solver has to continue working as if all are active, because that optimization is not on.

The sequential solver has all those tweak, but the are not very effective since this happens on one thread.
With this solver once we add the same tweaks,
Plus more iteratione, them as the pieces his to rest the extra iteration only apply to where they are needed.
I start to ave a high hope for this, but I am getting a head of myself.

by **Julio Jerez** » Sun Apr 01, 2018 2:17 pm

Hey Joe can you do the test again.

To test my hypeotesi that more iteration most make this solver converge assytotically, I doube teh passes for 4 to 8, and Badabin Badaboom teh stack is almos at rest, and the perfoemcne is about teh same as teh sequntial solver.

so with get for 130 joint with 4 cores teh same pefromce that teh sequancial solver doing half the work in one cores.
Her ss teh kicked tshi si teh bracking psoint at whic teh break even, the parallal solver will no slow down linerally by a factor of #ofJoints / #ofcores
while the sequantial will slow down by a fact of #ofjoints.

Now I cna do focus on few things
1-see who to add teh sleeping mechanix
2-add teh inceralemat optimizations
3-add teh joints support.

for the Joint supoort I was thinking of this algorthm.

Modie the teh isaland generatos so that is generates joint and teh immidiste contacts as separete islands.
then the parallel solver will only oprated wih contacts joints.

teh idea for tshi si athat if we are planing to do teh High Perfpmace computing, we will nee to do some part on CPU, teh contact joint call back, and some oethe stuff tah I do no remember now.

if we can seperate the insland into Joints + contacts and Contacts only islands
the we can write GOPU kernles for the contacts Joints callback, sinc ethey are similar,
the we can dispach the paralles solver to eth GPU, and whiel teh paralles solever work on large constact island, the CPU work on solving the small island and the aritulated joints.
that was witeh hiode teh GPU latancy.

also I onle se teh pass to 8, whci meand is a lot fo rteh CPU, but if we have 128 cores (on teh small GPU size these days) we can mak ethe number say 32, an dteh solver will be rock solid an dteh expnace of harwared. This could be really, really good.

what do you think of that?

by **Julio Jerez** » Sun Apr 01, 2018 3:36 pm

Ha an with more passes the solve is capable of even handling the joints.
if you play the demo servo joint you can even drive the forklift vehicle and pick stuff,
of course I imagine the is will be soft but this is a really good sight.
this mean that teh joint will work quite well,

but even that is the case, for GPU I believe is better to do what I said before.
I believe we are in a good track now Joe.

I changed the demo to the building with 1000 bolders falling for the sky, and both solver are almost equally fast, with the sequential eventually beating the parallel but that's because the sequential is bring the islands to sleep. when the are not sleeping in my machine the parallel is about 26 ms and the sequential 31, so if we cannot optimized maybe this should be optional and opted in for CPU.

But I believe there are some optimizations to make still, I would like to see it with a 10 or 16 cores system :shock:

by **JoeJ** » Sun Apr 01, 2018 5:02 pm

Ha, i see it works well now

I would not wonder if this ends up with some advantages other than performance.
If i'm right, Gauss Seidel means sloving variables in order where solved variables affect remaining unsolved ones.
And Jakobi means you solve for each indipendent, store the correction but apply it only after the iteration has finished?
If so, the Jakobi method appears more correct because it is order independent like reality is as well.
(I preferred it for my own physics engine because of this.)

So maybe joints end up more soft but less wrong in extreme cases. Reminds me of how iterative solver worked always better for me than exact solver, even if unexpected.

by **Julio Jerez** » Sun Apr 01, 2018 5:19 pm

Oh man, just to go crazy and see what the end game will be, I set the max count to 16, and guess what the stack goes to sleep without any more tuning and it is still marginally faster than the sequential solver.

you know what at this mean, This mean that the GPU solve had a kick into high priority.
I would like to see this baby running with say 128 cores, that will be a Gforce 8800, tah today you probable get paied to take out if you fodn any.
These days any GPU comes with at least 32 compute units, and that a very vey entry level.

by **JoeJ** » Sun Apr 01, 2018 5:25 pm

Julio Jerez wrote:I would like to see this baby running with say 128 cores, that will be a Gforce 8800

You sound like you wanna get hired as a PhysX developer, hihihi :mrgreen:

by **Julio Jerez** » Sun Apr 01, 2018 6:13 pm

JoeJ wrote:Ha, i see it works well now

I would not wonder if this ends up with some advantages other than performance.
If i'm right, Gauss Seidel means solving variables in order where solved variables affect remaining unsolved ones.
And Jakobi means you solve for each independent, store the correction but apply it only after the iteration has finished?
If so, the Jakobi method appears more correct because it is order independent like reality is as well.
(I preferred it for my own physics engine because of this.)

yes you are right but that's because you are seen the problem from the physics point of view,
or from what mathematician call form the application point of view.
That naïve approach, even when correct form the mathematical point of vies is a recipe for total disaster.

To see whey Jacobi has better converges you need to apply those pure mathematical concepts that at first seem to have no relation to physics whatsoever, but the when you get the practical solution the the math start to make sense.

That is where the spectral radio of the iteration matrix is. mathematically this seem like a pedant exercise only a nerd will be interested in doing, but after the solution works, you can see that what happens with Jacobi is that the iteration is too strong, so you need to apply a damping effect to each contribution.
an analogy to this is what you make a low-high pass filter, there are some values of the spring and the damper that make stable and some other values don't not. here is very similar.
in the end it comes down to this, if you solve a joint in insulation, the joint will produce a force that will push the two connected bodies.
say there no other bodies are connected to those two by other joints, then you can apply that full reaction force to each body.
Now let us say that there are other bodies attached to any of the two bodies. when this is the case, each other joint calculate a full contribution without knowing that there are other bodies connected.
the contribution to a bodies that has more than one incident joint will be summation of all the incident joints, and that obviously will be too strong and the

total force will overshoot but if you take all force and you average by the number of incidents joints, now you apply a lower contribution by it will never overshoot.

This immediately suggest that if we take the body with the maximum number of incident joints and we use that and the weight factor we get a stable Jacobi solver, but one that will converge a lot more slowly that it really is necessary.
A more sophisticated solver is take each joint and calculation a table of weighs, that is the inverse fo eth number of incident joint to each body and use that a variable weight table.

The reason this works is that for a system where the mass matrix is form by bodies of more of less equal masses, there will be an average mass that can be facto out foe the mass matrix, and the result matrix will be one with a main diagonal equal 1 and few of diagonal value that will be in the vicinity of 1.0
When you do the calculation of the Eigen values by reduce to QR matrix, we will see that the Eigen values are propotional to the number of non zero of diagonal elements of the original matrix.
Basically we get a lower triangular matrix with the mantic diagonal set to say 2.x if a body only have tow joints, 3.x if si has three joint, 4.x for forei joint and so on.
This brake down really bad if the masses are of very on equal sizes, so that is a criteria for determining is an island can be solve using this solver.
But this intuitive explanation you will never get without the mathematical analysis disregarding the physics.

For a Gauss Sidle the solver do not overshoot because every calculation is applied to the next, reducing the net acceleration, so Gauss Sidle will converge a lot faster and therefore is preferable for small number of iteration and small core count.
naive Jacobi will overshoot, unless we do what we did, knowing that it will reduce convergence rate but we will make up with more cores.
once you get it, it is so intuitive and simple that you wonder who it is that was not done before. Ha but like I say this is where math and physics part ways.

for thousands of years philosophers, mathematicians and physicist has always wondered why it is that mathematics since to be way the natural world woks, but no one has been able to come up with an answer, is just works. Like Carl Sagan once said Mathematics seems to be the language of the Universe and we do not know why.

by **Julio Jerez** » Sun Apr 01, 2018 11:00 pm

when I try to make a opencl project, it crate one but no matter what I do it fail to build and open cl program, I get thsi error..

1>------ Build started: Project: OpenCLProject1, Configuration: Debug Win32 ------
1> Preprocessing: Template.cl
1> Setting target instruction set architecture to: Default (Advanced Vector Extension 2 (AVX2))
1>C:\Users\julio\Downloads\OpenCLProject1\Template.cl : error CL: Failed to get platform id...: 0 (CL_SUCCESS)
1> Build failed!
1>
1>C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V120\BuildCustomizations\IntelOpenCL.targets(98,5): error MSB3721: The command ""C:\Intel\OpenCL\sdk\bin\x86\ioc32.exe" -cmd=build -input="C:\Users\julio\Downloads\OpenCLProject1\Template.cl" -output="Debug\Template.out" -VS -device=CPU -simd=default -targetos=current -bo=" "" exited with code -1.
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

I read that you say the drive has to be installed separate? maybe that's the problem.

by **JoeJ** » Mon Apr 02, 2018 1:25 am

Is this a project coming with the Intel SDK?

Just changed one of my solvers to Gauss Seidel method - you're right, converges a lot faster. I really did not knew or forgot about it, awesome! :mrgreen:

Parallel solver experiments

Parallel solver experiments

Re: Parallel solver experiments

Re: Parallel solver experiments

Re: Parallel solver experiments

Re: Parallel solver experiments

Re: Parallel solver experiments

Re: Parallel solver experiments

Re: Parallel solver experiments

Re: Parallel solver experiments

Re: Parallel solver experiments

Re: Parallel solver experiments

Re: Parallel solver experiments

Re: Parallel solver experiments

Re: Parallel solver experiments

Re: Parallel solver experiments

Who is online