Parallel solver experiments

Re: Parallel solver experiments

Postby Julio Jerez » Mon Apr 02, 2018 7:12 am

No. After the SDK is installed, it adds a couple of menu items to Visual Studio:
one is Code Builder, which is used for project settings and other stuff.

The other is under Create New Project: there is a new project type, "OpenCL",
and that guides you through making a new solution.

After clicking through all the defaults, it makes a new solution with a very simple CL function in a file named template.cl

Code:
/*****************************************************************************
 * Copyright (c) 2013-2016 Intel Corporation
 * All rights reserved.
 *
 * WARRANTY DISCLAIMER
 *
 * THESE MATERIALS ARE PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL INTEL OR ITS
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
 * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY OR TORT (INCLUDING
 * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THESE
 * MATERIALS, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * Intel Corporation is the author of the Materials, and requests that all
 * problem reports or change requests be submitted to it directly
 *****************************************************************************/

__kernel void Add(__global int* pA, __global int* pB, __global int* pC)
{
    const int x     = get_global_id(0);
    const int y     = get_global_id(1);
    const int width = get_global_size(0);

    const int id = y * width + x;

    pC[id] = pA[id] + pB[id];
}
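
For readers without a working OpenCL device, here is a minimal CPU-side sketch of what the template kernel computes. It is plain C++, not OpenCL host code; the function name `AddOnCpu` and the explicit `width` and `height` parameters are assumptions that stand in for the global sizes the runtime would supply.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for the template kernel: runs the kernel body once per (x, y) work-item.
// `width` and `height` play the role of get_global_size(0) and get_global_size(1).
std::vector<int> AddOnCpu(const std::vector<int>& a, const std::vector<int>& b,
                          std::size_t width, std::size_t height)
{
    assert(a.size() == width * height && b.size() == width * height);
    std::vector<int> c(width * height);
    for (std::size_t y = 0; y < height; ++y) {      // get_global_id(1)
        for (std::size_t x = 0; x < width; ++x) {   // get_global_id(0)
            const std::size_t id = y * width + x;   // same flattening as the kernel
            c[id] = a[id] + b[id];
        }
    }
    return c;
}
```

Comparing this against the device output is one way to check a kernel once the compiler problem is sorted out.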


Then clicking Build fails with that error message;
basically the OpenCL compiler, C:\Intel\OpenCL\sdk\bin\x86\ioc32.exe,
fails compiling template.cl.

But that is the same error I get each time I launch Visual Studio: it says that there are no recognizable devices in the machine.
But this is a pure Core i7-7700 CPU, which has a 630 GPU,
and even if it did not, it still has AVX2, so it should recognize the OpenCL on CPU configuration,
but they all fail.
It is as if the SDK thinks this is an AMD CPU.

I tried to run the driver setup, and it says the driver is already installed.

This is extremely frustrating. I will try to make a simple compute shader external DLL and see if that works.

Basically I will make an abstract interface for plugging external solvers into the engine.

The first one will be the same one that is in the file ../sdk\dgPhysics\dgWorldDynamicsParallelSolver.cpp,
just as it is now, and that will be the base one. Then, after it is working, I will try to convert that to use AVX2, so that will be the CPU solution to emulate from.

Then from that basic one we can make GPU versions, but I do not want to mess up the engine by adding external headers like OpenCL, CUDA, OpenGL, Vulkan or DirectX.
Instead there will be an abstract interface, and the engine will load those solvers as runtime DLLs if they are in some folder. Then the user can select them.

That's the idea, but so far the first attempt is a failure.
Julio Jerez
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Parallel solver experiments

Postby Julio Jerez » Mon Apr 02, 2018 7:40 am

If I cannot determine why it is not working, I am giving up on the Intel OpenCL SDK, removing it from the system, and trying AMD.
After all, my GPU is an AMD.
Even if I try to remove the SDK, or to repair it, I get these messages, so that is a sign that something is very wrong with the installation, or that somehow the SDK does not recognize my CPU as an Intel. But this is a genuine Core i7-7700, which I know for a fact has a GPU built in: if I plug a monitor into the HDMI port on the motherboard, the graphics are even better than the AMD's; at least the font is sharper.

[Attachment: opencl.png (47.25 KiB)]

Re: Parallel solver experiments

Postby JoeJ » Mon Apr 02, 2018 8:24 am

https://github.com/JoeJGit/OpenCL_Fiji_Bug_Report

This is a small project in a zip file i've sent to AMD to show a bug; you could try to compile it.
Maybe just the VS template is wrong.

I've had the Intel, AMD and NV SDKs installed and there were no conflicts.
JoeJ
 
Posts: 1453
Joined: Tue Dec 21, 2010 6:18 pm

Re: Parallel solver experiments

Postby Julio Jerez » Mon Apr 02, 2018 9:53 am

Well, I uninstalled the Intel SDK and installed the AMD SDK.
The AMD one is far, far less developed; it does not integrate with Visual Studio. But maybe that is a good thing.

Anyway, I am not doing this right away; it will take a while. I now know the parallel solver works, so I can now complete the vehicle and the balancing player, then resume the GPU solver.

On this, Joe:
It's very interesting the parallel solver is faster in the worst case. I think runtime peaks are Newton's biggest performance problem. I don't have big hopes you can ever fix this, because in physics you cannot distribute workloads over time as in graphics, but 2 times speed up would be awesome!


The solver proves that it is possible to distribute the workload, if you use the proper algorithm and powerful enough hardware to make up for the poor convergence rate.
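
To make the trade-off concrete, here is a small sketch on a toy tridiagonal linear system, not on Newton's actual constraint solver. The function name `SweepsToConverge` and the system itself are assumptions chosen for illustration. A Jacobi sweep reads only the previous sweep, so every unknown could be updated in parallel, while a Gauss-Seidel sweep is order dependent but needs fewer sweeps.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Solves the toy system  4*x[i] - x[i-1] - x[i+1] = b[i]  iteratively and returns
// how many sweeps were needed to bring the largest update below `tol`.
// jacobi == true : every unknown is updated from the previous sweep only, so the
//                  updates are independent and could run in parallel.
// jacobi == false: Gauss-Seidel, each unknown immediately sees updated neighbors.
int SweepsToConverge(const std::vector<double>& b, bool jacobi, double tol, int maxSweeps)
{
    assert(!b.empty());
    const std::size_t n = b.size();
    std::vector<double> x(n, 0.0);
    for (int sweep = 1; sweep <= maxSweeps; ++sweep) {
        const std::vector<double> prev = x;               // snapshot of the last sweep
        const std::vector<double>& src = jacobi ? prev : x;
        double maxDelta = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            const double left  = (i > 0)     ? src[i - 1] : 0.0;
            const double right = (i + 1 < n) ? src[i + 1] : 0.0;
            const double updated = (b[i] + left + right) / 4.0;
            maxDelta = std::fmax(maxDelta, std::fabs(updated - prev[i]));
            x[i] = updated;
        }
        if (maxDelta < tol) return sweep;
    }
    return maxSweeps;
}
```

On this toy system the Jacobi variant needs more sweeps than Gauss-Seidel for the same tolerance, which is the "poor convergence rate" that extra hardware has to pay for.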

But this opens the door to a whole new range of possibilities, Joe. We can now integrate other types of solvers that all cooperate with the rigid body solver: think of a fluid simulator, a particle-based spring-mass system for soft bodies, particle rigid bodies for things like ropes.

Up until now these solvers did not mix, because they use different algorithms. But with Jacobi, basically they work in phases: one pass does all the rigid bodies and accumulates the forces, then the spring solver can go over and do the spring-mass system and accumulate the forces, then the fluid, and so on.
In general, a unified solver that will solve everything at once, and we can set it to, say, 100 passes for everything.

The only part that is up in the air is the joints, because of the callbacks, but we can think about that later; for now we can start by singling out contacts.

Re: Parallel solver experiments

Postby JoeJ » Mon Apr 02, 2018 10:25 am

Julio Jerez wrote: But this opens the door to a whole new range of possibilities, Joe. We can now integrate other types of solvers that all cooperate with the rigid body solver: think of a fluid simulator, a particle-based spring-mass system for soft bodies, particle rigid bodies for things like ropes.


Yeah, i see this. The question is how many GPU cycles we can afford to move from graphics to physics. But in the long run you're right...

Julio Jerez wrote: The only part that is up in the air is the joints, because of the callbacks, but we can think about that later; for now we can start by singling out contacts.


I'd be happy with coding those things on GPU. It would be very nice to have your own shader language that automatically translates to HLSL, GLSL (or SPIR-V directly) and eventually OpenCL C, to fit everyone's needs. Like those UE4 or Unity shaders.

But first you could try to make callbacks mostly unnecessary by offering defaults for some things:
Gravity.
User Forces for a selected list of bodies.
Contact filtering with some bit masks like Havok does.
...?
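
As a sketch of the bit-mask idea in the list above: the struct and function names below are hypothetical and are not taken from the Newton or Havok APIs. Each body carries a category mask and a mask of the categories it accepts, and a pair produces contacts only when both sides accept each other, so no user callback is needed for the common case.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-body filter data; not part of the Newton or Havok API.
struct CollisionFilter {
    std::uint32_t category;      // bits describing what this body is
    std::uint32_t collidesWith;  // bits describing which categories it accepts
};

// Builds a filter; a body with no category bit could never be matched by anyone.
inline CollisionFilter MakeFilter(std::uint32_t category, std::uint32_t collidesWith)
{
    assert(category != 0);
    return CollisionFilter{category, collidesWith};
}

// Two bodies generate contacts only if each one accepts the other's category.
inline bool ShouldCollide(const CollisionFilter& a, const CollisionFilter& b)
{
    return (a.category & b.collidesWith) != 0 && (b.category & a.collidesWith) != 0;
}
```

Because the test is two ANDs on plain integers, it is the kind of default that could run on a GPU without calling back into user code.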

Julio Jerez wrote: The AMD one is far, far less developed; it does not integrate with Visual Studio. But maybe that is a good thing.

To me it is a good thing. The Intel SDK messed up auto-completion for me. A drop-down menu opens after each key stroke. I was unable to fix that and ended up editing outside Visual Studio :)

But i don't know what those SDKs are good for at all, other than providing examples (and Intel's debugger).

But you should use CodeXL for AMD GPU profiling - it's awesome. It tells you what to improve to become faster.

Re: Parallel solver experiments

Postby Julio Jerez » Mon Apr 02, 2018 2:43 pm

JoeJ wrote: Yeah, i see this. The question is how many GPU cycles we can afford to move from graphics to physics. But in the long run you're right...

In my experience, most games that claim to be GPU bound aren't really GPU bound; it is that they have a rendering thread dispatching tens of thousands of draw calls with very small payloads to keep the GPU busy.

Graphics programmers for the most part are very run-of-the-mill programmers: they understand vertex shaders and pixel shaders and that's it. If you ask them about compute or geometry shaders, they will immediately give you the same answer: they are slow. A game programmer will rather do render-to-texture image processing, like a cascaded shadow or an occlusion pass, by doing a render pass and getting the surface back to the system, rather than writing a compute shader to generate the image in memory and use it in the render pipeline.

The only way they would do something like that is if they read it from some self-appointed expert at GDC, or from an Nvidia or AMD website. That is the reason why you do not see compute shaders catching on.

Anyway, I ran the profiler and I have to come up with a lock-free database for producing colliding pairs. Check out this image:
[Attachment: profile.png (52.94 KiB)]


You can see that the second most expensive call is in the spin lock or thread yield. I was thinking about how to make that better, and it turns out that the search for new pairs in the body/joint graph can be unsafe (no locks) if you use this strategy:
-new joints are placed on a temporary list,
-finding new joints can now be unsafe, because we know that no joint will be added or removed, so the graph topology will not change,
-all new joints are added sequentially.

This will require that the broad phase is no longer incremental; instead it does a sweep over the list in which it is guaranteed that no duplicates will be generated.

At the moment the sweep produces duplicates, so the add-pair function uses the find-pair function to see if the joint was already in the graph.

From what I can see the code spends more than half the time in locks, so it makes sense to spend more time on a full sweep, knowing that the scan will be lock free, with the exception of the allocation part, which must be thread safe.
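
The strategy above can be sketched roughly as follows. This is an illustration only: `PairGraph`, `FindNewPairs` and `CommitNewPairs` are hypothetical names, a `std::set` stands in for the body/joint graph, and the workers are expressed as plain function calls so the sketch has no threading dependency. Each `FindNewPairs` call only reads the graph and writes to its own list, so the calls could run on separate threads without locks.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;   // (bodyA, bodyB) with bodyA < bodyB

struct PairGraph {
    std::set<Pair> joints;          // stand-in for the body/joint graph
    bool Contains(const Pair& p) const { return joints.count(p) != 0; }
};

// Phase 1: worker `w` of `workerCount` scans its slice of the candidates. The graph
// is only read here, never modified, and results go to a worker-private list.
std::vector<Pair> FindNewPairs(const PairGraph& graph, const std::vector<Pair>& candidates,
                               std::size_t w, std::size_t workerCount)
{
    assert(workerCount > 0 && w < workerCount);
    std::vector<Pair> found;
    for (std::size_t i = w; i < candidates.size(); i += workerCount) {
        if (!graph.Contains(candidates[i])) found.push_back(candidates[i]);
    }
    return found;
}

// Phase 2: a single thread commits every temporary list, so the graph topology
// changes only here. Returns how many joints were actually added.
std::size_t CommitNewPairs(PairGraph& graph, const std::vector<std::vector<Pair>>& perWorker)
{
    std::size_t added = 0;
    for (const std::vector<Pair>& list : perWorker) {
        for (const Pair& p : list) {
            if (graph.joints.insert(p).second) ++added;
        }
    }
    return added;
}
```

The sketch assumes the candidate list is already free of duplicates, which is the guarantee the full sweep is meant to provide.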

On a side note, there are memory managers that do pool-based, lock-free allocation; jemalloc is one such system. Maybe one day we should look at that, if it turns out that memory becomes a bottleneck.

Re: Parallel solver experiments

Postby Julio Jerez » Mon Apr 02, 2018 4:24 pm

[Attachment: profile1.png (45.94 KiB)]

Wow, I just quickly made that change and it was a huge win: the add-pair function went from 13.xx% to less than 3%. In the test scene that is about a 5 ms win.
I have to review the usage of locks in the update.
This is not committed yet because it is not robust, but I will fix it tonight.

Re: Parallel solver experiments

Postby JoeJ » Mon Apr 02, 2018 4:50 pm

Julio Jerez wrote: The only way they would do something like that is if they read it from some self-appointed expert at GDC, or from an Nvidia or AMD website. That is the reason why you do not see compute shaders catching on.


Haha, there is this guy claiming that only 20 people in the world know how to optimize graphics engines. I thought he's arrogant, but maybe he is right?
But i agree compute is underutilized and that's a real shame. Even worse, graphics programmers seem to think compute (and graphics as a whole) is still subject to brute force, as if the rise of pixel shaders has frozen our brains forever. :)

Personally i'm hard to convince of GPU physics because i have high compute demand for my GI, so i'm totally not your average user here and my doubts have little weight. :wink:
I'm fine as long as you do not drop support for ancient CPUs - they need some work to do as well :mrgreen:

Re: Parallel solver experiments

Postby Julio Jerez » Mon Apr 02, 2018 5:22 pm

Well, let us see how it goes; CPU support will never be dropped.
What we have now that we did not have before is an algorithm that is data parallel, but it is clear that it needs a lot of hardware support. We see that using 4 times as many passes it nearly breaks even with the sequential solver, so we need even more passes.
On top of that, I see that using atomics, even for a very short time, results in a drastic performance cost. This means that force accumulation cannot be done by adding each joint's contribution; instead the forces have to be calculated by iterating over all joints.
Therefore the optimization I was planning for that does not work well with multicores.
On top of that, atomics on GPU are far more costly than on PC.
So the parallel solver is of practical value, but only for many-core systems.
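
A rough sketch of the two accumulation styles follows. The `Joint` layout, the sign convention and the function names are assumptions made for illustration, not the engine's data structures. The scatter version is shown serially; run in parallel over joints, two joints sharing a body would write the same slot, which is where atomics would come in. The gather version gives each body sole ownership of its output slot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Joint { int body0; int body1; float impulse; };

// Scatter: each joint adds its contribution into both of its bodies.
std::vector<float> AccumulateScatter(const std::vector<Joint>& joints, std::size_t bodyCount)
{
    std::vector<float> force(bodyCount, 0.0f);
    for (const Joint& j : joints) {
        force[static_cast<std::size_t>(j.body0)] += j.impulse;
        force[static_cast<std::size_t>(j.body1)] -= j.impulse;
    }
    return force;
}

// Precomputes, for every body, the indices of the joints attached to it.
std::vector<std::vector<int>> BuildJointsOfBody(const std::vector<Joint>& joints,
                                                std::size_t bodyCount)
{
    std::vector<std::vector<int>> jointsOfBody(bodyCount);
    for (std::size_t i = 0; i < joints.size(); ++i) {
        assert(joints[i].body0 >= 0 && static_cast<std::size_t>(joints[i].body0) < bodyCount);
        assert(joints[i].body1 >= 0 && static_cast<std::size_t>(joints[i].body1) < bodyCount);
        jointsOfBody[static_cast<std::size_t>(joints[i].body0)].push_back(static_cast<int>(i));
        jointsOfBody[static_cast<std::size_t>(joints[i].body1)].push_back(static_cast<int>(i));
    }
    return jointsOfBody;
}

// Gather: each body walks its own joint list and writes only its own slot, so the
// outer loop iterations are independent and need no atomics.
std::vector<float> AccumulateGather(const std::vector<Joint>& joints,
                                    const std::vector<std::vector<int>>& jointsOfBody)
{
    std::vector<float> force(jointsOfBody.size(), 0.0f);
    for (std::size_t b = 0; b < jointsOfBody.size(); ++b) {
        float sum = 0.0f;
        for (int jointIndex : jointsOfBody[b]) {
            const Joint& j = joints[static_cast<std::size_t>(jointIndex)];
            sum += (j.body0 == static_cast<int>(b)) ? j.impulse : -j.impulse;
        }
        force[b] = sum;
    }
    return force;
}
```

The price of the gather form is the extra per-body joint list and reading each joint twice, once from each of its bodies.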

This changes the plan a little.
What I will do is the full interface for the parallel solution, and the first solver will be this solver but set up for more iterations and more threads.
Then that will be the template for other GPU solvers, and we can try OpenCL, or even CUDA or compute shaders.

There are also two other points. GPUs have the possibility of interoperability, but I do not put too much hope there.
The other is that I believe the new generations of GPUs have support for interruption. For example, before, a compute shader would have to be queued and wait for the GPU to finish the current queue of commands.
Now a queue can preempt the GPU, save the state, execute a shader, and then resume what it was doing before. This seriously reduces latency.

Re: Parallel solver experiments

Postby JoeJ » Mon Apr 02, 2018 6:10 pm

Julio Jerez wrote: Now a queue can preempt the GPU, save the state, execute a shader, and then resume what it was doing before. This seriously reduces latency.


That's what's still missing for compute shaders. :( I hoped the new raytracing API would expose those things to general compute, but unfortunately that's not the case (you can start new rays within a thread and continue using the results, but you can't do anything different than raytracing).
Cuda / OpenCL 2.0 seem the only options if you want to experiment there (i never did - the loss of being able to do compute in parallel with graphics probably outweighs the win i guess.)

Re: Parallel solver experiments

Postby Julio Jerez » Wed Apr 04, 2018 12:00 pm

OK Joe, I started the parallel solver project.
I added the first one, which will serve as the template for making all the other platforms.
It does not do anything yet; it is just adding the project.
After I complete the setup, I will implement the CPU version, which will be similar to the one in the physics library, except that it will only use a proxy representation.

The goal is that this solver will use AVX and will solve 8 joints per CPU thread, as opposed to the current one, which is one joint per thread.

So in the end the solver will do 8 × 4 = 32
joints per call, all scalar operations.
This will require some extra overhead, but the point is to get as close to a GPU model as possible, so that the port to GPU is straightforward.
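
One way to picture the 8-joints-per-thread layout is a structure-of-arrays batch, sketched below in portable C++ without intrinsics. The field names and the one-line impulse formula are placeholders, not the solver's actual math; `kLanes` is 8 on the assumption that one AVX register holds 8 floats.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 8;   // assumed lane count: 8 floats per AVX register

// Structure-of-arrays batch: lane i of every member belongs to joint i of the batch.
struct JointBatch {
    std::array<float, kLanes> relVelocity{};
    std::array<float, kLanes> effectiveMass{};
    std::array<float, kLanes> impulse{};
};

// The same scalar formula is applied to every lane; this loop shape is what a
// compiler or hand-written AVX can map onto 8-wide instructions.
void SolveBatch(JointBatch& batch)
{
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        batch.impulse[lane] = -batch.relVelocity[lane] * batch.effectiveMass[lane];
    }
}

// Packs a flat joint list into batches; unused lanes of the last batch stay zero.
std::vector<JointBatch> PackBatches(const std::vector<float>& relVelocity,
                                    const std::vector<float>& effectiveMass)
{
    assert(relVelocity.size() == effectiveMass.size());
    std::vector<JointBatch> batches((relVelocity.size() + kLanes - 1) / kLanes);
    for (std::size_t i = 0; i < relVelocity.size(); ++i) {
        JointBatch& batch = batches[i / kLanes];
        batch.relVelocity[i % kLanes] = relVelocity[i];
        batch.effectiveMass[i % kLanes] = effectiveMass[i];
    }
    return batches;
}
```

With four worker threads each taking one batch per call, this layout gives the 8 × 4 = 32 joints per call mentioned above; the padding lanes are part of the extra overhead.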

The second part of this is that we probably need a different contact solver for GPU. GJK is too complex for solving on the GPU; it will end up doing everything sequentially. But let us see, that is still far away. The thing I will not do is particle collision; that approach never looked sensible to me, and we do not want to go down that road.

Re: Parallel solver experiments

Postby Julio Jerez » Wed Apr 04, 2018 2:47 pm

this is the high level interface that I am thinking about
Code:
void* NewtonCurrentPlugin(const NewtonWorld* const newtonWorld);
void* NewtonGetFirstPlugin(const NewtonWorld* const newtonWorld);
void* NewtonGetNextPlugin(const NewtonWorld* const newtonWorld, const void* const plugin);
const char* NewtonGetPluginString(const NewtonWorld* const newtonWorld, const void* const plugin);
void NewtonSelectPlugin(const NewtonWorld* const newtonWorld, const void* const plugin);


Basically, when the engine is launched, it will scan the current directory for plugins and load them in. The application can use the interface to select any of the found solvers.
Setting the current plugin to NULL will run the default.
And there will no longer be a separate "solve larger islands in parallel" mode; that will be done by the reference solver, which will be one mode.
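
A sketch of how an application might walk this interface and pick a solver by name. Only the five function signatures come from the post; the `NewtonWorld` struct and the function bodies in the first half are stand-ins written so the example compiles without the engine, and `SelectPluginByName` and the plugin names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// --- Stand-ins only, so the sketch builds without Newton. Not the real engine. ---
struct NewtonWorld {
    std::vector<std::string> plugins;       // names of the "loaded" plugins
    mutable const void* current = nullptr;  // currently selected plugin, or NULL
};
void* NewtonGetFirstPlugin(const NewtonWorld* const world)
{
    return world->plugins.empty() ? nullptr : const_cast<std::string*>(&world->plugins.front());
}
void* NewtonGetNextPlugin(const NewtonWorld* const world, const void* const plugin)
{
    const std::string* next = static_cast<const std::string*>(plugin) + 1;
    const std::string* end = world->plugins.data() + world->plugins.size();
    return next < end ? const_cast<std::string*>(next) : nullptr;
}
const char* NewtonGetPluginString(const NewtonWorld* const, const void* const plugin)
{
    return static_cast<const std::string*>(plugin)->c_str();
}
void NewtonSelectPlugin(const NewtonWorld* const world, const void* const plugin)
{
    world->current = plugin;
}
void* NewtonCurrentPlugin(const NewtonWorld* const world)
{
    return const_cast<void*>(world->current);
}

// --- Application side: enumerate the plugins and select one by its string. ---
bool SelectPluginByName(const NewtonWorld* const world, const std::string& wanted)
{
    assert(world != nullptr);
    for (void* p = NewtonGetFirstPlugin(world); p; p = NewtonGetNextPlugin(world, p)) {
        if (wanted == NewtonGetPluginString(world, p)) {
            NewtonSelectPlugin(world, p);
            return true;
        }
    }
    NewtonSelectPlugin(world, nullptr);   // NULL falls back to the default solver
    return false;
}
```

The first/next pair keeps the plugin handle opaque, so the application never needs a plugin header, which fits the goal of keeping OpenCL or CUDA headers out of the engine.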

Re: Parallel solver experiments

Postby Julio Jerez » Thu Apr 05, 2018 4:07 pm

OK Joe, I now have the plugin interface design in.

This is what a plugin looks like:
Code:
class dgWorldPlugin
{
   public:
   dgWorldPlugin() {};
   virtual ~dgWorldPlugin() {};
   virtual const char* GetId() const = 0;
};


Of course it does not do anything other than identify itself, but what we do is add abstract functions and implement them in the respective module.
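
As an illustration of "implement them in the respective module", here is a minimal sketch of a module-side class. The interface is repeated from the post; `ReferenceCpuPlugin`, its id string, the `IsPlugin` helper and the exported `CreateExamplePlugin` factory are all hypothetical, since the thread does not say what entry point the engine looks for in the DLL.

```cpp
#include <cassert>
#include <cstring>

// Interface as posted above.
class dgWorldPlugin
{
   public:
   dgWorldPlugin() {}
   virtual ~dgWorldPlugin() {}
   virtual const char* GetId() const = 0;
};

// Hypothetical module-side implementation; the class name and id are made up.
class ReferenceCpuPlugin : public dgWorldPlugin
{
   public:
   const char* GetId() const override { return "reference_cpu"; }
};

// Hypothetical factory a DLL could export so the loader can obtain the plugin.
// The real entry point name is not given in the thread.
extern "C" dgWorldPlugin* CreateExamplePlugin()
{
    static ReferenceCpuPlugin plugin;   // lives as long as the module is loaded
    return &plugin;
}

// Engine-side helper: checks a loaded plugin against an expected id string.
bool IsPlugin(const dgWorldPlugin* plugin, const char* id)
{
    assert(plugin != nullptr && id != nullptr);
    return std::strcmp(plugin->GetId(), id) == 0;
}
```

Returning a pointer to a module-owned object avoids allocating in the DLL and freeing in the engine, which can matter when the two are built against different runtimes.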

It is only hooked up to the Visual Studio 2015 project in debug; later I will add it to all projects using the proper macros.
So all the end user has to do to get a plugin is to copy the DLL into the NewtonPlugin folder, relative to where the executable is.

After I have all the configurations set up I will move on to implementing the CPU solver, which will be the one used as the reference for other plugins.
There could be as many as we want: CPU, SSE, AVX, OpenCL, CUDA, DirectCompute, OpenGL compute shader, etc. This includes multiple versions for the same platform, so comparison will be easy.

Re: Parallel solver experiments

Postby Julio Jerez » Sun Apr 08, 2018 2:07 pm

Hey Joe, have you read about HIP?
I thought it was a mumbo-jumbo translator from CUDA to OpenCL.
But apparently this is AMD's response to CUDA.

It translates from a high-level, C++-like language to native high-performance computing code, so it can make code that runs on AMD hardware or Nvidia hardware. I guess that will also include some CPUs and whatnot.

Have you any experience with that? It sounds too good to be true, but they say it is even open source.
If this is true, then that is our solution.
I am making progress on the plugin, but maybe I should consider this HIP.

Re: Parallel solver experiments

Postby JoeJ » Sun Apr 08, 2018 3:07 pm

Seems interesting. I did not consider this yet - pretty new and probably not intended for games, but let me know if you try... (could have more features than compute)
You should use what you like most anyways, and C++ sounds nicer than low level C like language. Users could contribute ports to other APIs.


This is something i always wanted to have (accessing registers of other threads), and HIP can do it: https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

But newest Vulkan seems to have it as well finally, and there is a catch anyways:
Code depends on CU thread width (AMD 64, NV 32, Intel 8/16)
Newest AMD (Vega) does not support the instruction anymore, but i may be wrong.
So it's very hardware dependent and not something you use from the beginning, but i expect huge speedups. Doing a prefix sum in LDS is still quite slow for example.
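
To show why such code depends on the thread width, here is a plain C++ emulation of a prefix sum done with cross-lane reads. It is not HIP or Vulkan code and uses no GPU API; the function name is hypothetical and the vector size stands in for the hardware wave width. Each step lets every lane read the value held by a lane a fixed distance below it, so the number of steps is the base-2 logarithm of the width.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum inside one wave, written the way cross-lane code does it:
// in each step every lane reads the value held by the lane `offset` positions below.
// `lanes.size()` plays the role of the hardware wave width (e.g. 64, 32, 16 or 8).
void WaveInclusiveScan(std::vector<int>& lanes)
{
    const std::size_t width = lanes.size();
    assert(width != 0 && (width & (width - 1)) == 0);   // assumes a power-of-two width
    for (std::size_t offset = 1; offset < width; offset *= 2) {
        const std::vector<int> before = lanes;          // all lanes read, then all write
        for (std::size_t lane = offset; lane < width; ++lane) {
            lanes[lane] = before[lane] + before[lane - offset];
        }
    }
}
```

A width of 64 takes six steps and a width of 8 takes three, so a kernel tuned for one vendor's width does not carry over unchanged to another.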

The only doubt i have for HIP is, probably it misses prerecorded command buffers like low level graphics APIs have. That's not very interesting for business applications, but for games it is - if you have many dispatches at least.
