About OpenCl solver

A place to discuss everything related to Newton Dynamics.

Moderators: Sascha Willems, walaber

Re: About OpenCl solver

Postby Julio Jerez » Sat Mar 20, 2021 3:11 pm

JoeJ wrote:
Julio Jerez wrote:another thing I read is that, apparently, 2.0 supports memory mapping between GPU and CPU, so that you do not need to copy the data.

Oh, never paid attention to this because i have no big data to transfer, so IDK.
But i know AMDs GCN has a memory range of 256 MB which can be addressed quickly from both GPU and CPU, and Vulkan exposes this while DX12 does not. (DX12 seems simpler overall, with less details.)


yeah, that's one of my concerns. Take for example the proxy that represents a body.
To start, the first thing is the matrix and the mass. This is a class like this:

Code: Select all
class ndOpenclBodyProxy
{
   public:
   cl_float4 m_matrix[4];
   cl_float4 m_invMass;
   ndBodyKinematic* m_body;
};


that proxy is already 88 bytes in size.

if we are aiming for something along the lines of 8k bodies, that's
88 * (1024 * 8) = 720,896 bytes

that will probably grow to twice the size, so about 1.5 to 2 MB of data just to load the body representations; joints are about 10 to 20 times that.
So if the implementation is too naive it could be a problem.
I do not think a PCI bus has that kind of bandwidth, but we will see.

I will go with the naïve version first and then we can think of sophisticated caching techniques.

I was successful in adding all the boilerplate code to initialize the context and create one test kernel and its buffers. Let us see if this can simulate 10,000 bodies.
I am now writing the first conversion, which is simple integration of free bodies.
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: About OpenCl solver

Postby Julio Jerez » Mon Mar 29, 2021 10:42 am

I have to take a step back and replace some algorithms in Newton that have O(n log(n)) time complexity with less sophisticated O(n) ones before going on with the GPU solver.

This is one of the reasons Newton has a dramatic performance drop beyond 1000 rigid bodies.

So I will spend about a week optimizing that, then keep going. I have to do it anyway because some of those algorithms do not translate easily to GPU hardware.
One nasty thing is that I use too many pointers, for example.
So even if an algorithm translates, I still need to write an equivalent using indices.
So I might as well replace the CPU version with indices.
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: About OpenCl solver

Postby Julio Jerez » Thu Apr 15, 2021 12:49 pm

It is taking longer than I estimated, but I am working on it.

on a side note, I am trying Visual Studio 2019,

and I see that among the platforms we can now select the Intel platforms as the default, and optionally we can install the Clang compiler. There are two versions: the real one with the LLVM back-end code generator, and one that is just a front end but still uses Visual Studio as the back end, which to me is useless.
Visual Studio 2017 had an experimental one, which seemed useless too.

from what I hear on the internet, the Clang version really makes very good binaries, generating far better code than the native Visual Studio code generator.

I have seen the Visual Studio code generation, and indeed it leaves a lot to be desired.
anyway, I am trying the LLVM toolset but I get a few warnings like these:

1>C:\newton-dynamics\newton-dynamics-master\newton-4.00\sdk\dCollision/ndJointBilateralConstraint.h(51,10): warning : 'const' type qualifier on return type has no effect [-Wignored-qualifiers]
1>C:\newton-dynamics\newton-dynamics-master\newton-4.00\sdk\dCollision/ndJointBilateralConstraint.h(135,8): warning : 'const' type qualifier on return type


and these two errors:
1>C:\newton-dynamics\newton-dynamics-master\newton-4.00\sdk\dCore/dVectorSimd.h(276,15): error : always_inline function '_mm_hadd_ps' requires target feature 'sse3', but would be inlined into function 'AddHorizontal' that is compiled without support for 'sse3'
1>C:\newton-dynamics\newton-dynamics-master\newton-4.00\sdk\dCore/dVectorSimd.h(277,10): error : always_inline function '_mm_hadd_ps' requires target feature 'sse3', but would be inlined into function 'AddHorizontal' that is compiled without support for 'sse3'


the warnings are ridiculous, and my guess is they are probably an LLVM pedantry bug. Having a function return a const is perfectly ANSI C++ compliant, and tells the compiler that it can cache the variable in a register. Not having the const will actually force the compiler to either issue a function call or to read the variable from memory on each loop iteration.


as for the errors, I do not know how to set this "requires target feature 'sse3'";
there does not seem to be any compile option to select that when selecting the LLVM toolset.
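For what it's worth, one way to hand that flag to the Clang toolset without touching the IDE options is per target from CMake (a sketch; the target name `ndNewton` is a placeholder for whatever target owns the offending header):

```cmake
# Sketch: pass -msse3 only when the Clang/LLVM toolset is in use,
# so the MSVC toolset keeps its default options.
if (CMAKE_CXX_COMPILER_ID MATCHES "Clang")
    target_compile_options(ndNewton PRIVATE -msse3)
endif()
```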
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: About OpenCl solver

Postby Julio Jerez » Fri Apr 16, 2021 1:15 pm

well, it seems that the rumors about Clang being better than Visual Studio are true.

I just got the engine compiling with Clang in VS 2019, and looking at the generated code, it seems indeed far better than VS.
It is clear to see that the generated code has no explicit register dependencies like VS.
Intel claims that register dependencies do not matter because the superscalar design and register renaming solve those problems, but I have seen tests showing this is not completely true; in fact it seems far from true.

to me, a compiler that generates consecutive muladds using different destination registers is better than one that reuses the same target register to accumulate the result; and that is what VS does, even when the C++ code explicitly uses two different temp variables.

the one regret is that there does not seem to be a way to get CMake to generate a solution that selects a compiler other than Visual Studio.
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: About OpenCl solver

Postby JoeJ » Sat Apr 17, 2021 8:43 am

Did you do any general perf. comparison of clang vs. MSVC?
User avatar
JoeJ
 
Posts: 1453
Joined: Tue Dec 21, 2010 6:18 pm

Re: About OpenCl solver

Postby Julio Jerez » Sat Apr 17, 2021 1:09 pm

I have not yet, but to give an idea of how poor the code generated by VS 2019 with the VS compiler is versus Clang, here are two code snippets from this C++ code:

Code: Select all
ndAvxFloat a0(row->m_JMinv.m_jacobianM0.m_linear.m_x * forceM0.m_linear.m_x);
ndAvxFloat a1(row->m_JMinv.m_jacobianM1.m_linear.m_x * forceM1.m_linear.m_x);
a0 = a0.MulAdd(row->m_JMinv.m_jacobianM0.m_angular.m_x, forceM0.m_angular.m_x);
a1 = a1.MulAdd(row->m_JMinv.m_jacobianM1.m_angular.m_x, forceM1.m_angular.m_x);
a0 = a0.MulAdd(row->m_JMinv.m_jacobianM0.m_linear.m_y, forceM0.m_linear.m_y);
a1 = a1.MulAdd(row->m_JMinv.m_jacobianM1.m_linear.m_y, forceM1.m_linear.m_y);
a0 = a0.MulAdd(row->m_JMinv.m_jacobianM0.m_angular.m_y, forceM0.m_angular.m_y);
a1 = a1.MulAdd(row->m_JMinv.m_jacobianM1.m_angular.m_y, forceM1.m_angular.m_y);
a0 = a0.MulAdd(row->m_JMinv.m_jacobianM0.m_linear.m_z, forceM0.m_linear.m_z);
a1 = a1.MulAdd(row->m_JMinv.m_jacobianM1.m_linear.m_z, forceM1.m_linear.m_z);
a0 = a0.MulAdd(row->m_JMinv.m_jacobianM0.m_angular.m_z, forceM0.m_angular.m_z);
a1 = a1.MulAdd(row->m_JMinv.m_jacobianM1.m_angular.m_z, forceM1.m_angular.m_z);


as you can see, I even tried to write the code so that there are two temporaries, hoping it would produce code with no register dependencies. According to Intel and AMD, the independent internal channels should generate code that maximizes float throughput, up to 32 floats per clock, using the muladd instructions. Instead, the Microsoft compiler produces this, which clearly should halve the throughput since it uses the same accumulator.

Microsoft VS 2019 code generation:
Code: Select all
00007FFD7CE0BCD7  vmovups     ymm7,ymmword ptr [rax+120h] 
00007FFD7CE0BCDF  vmulps      ymm1,ymm15,ymmword ptr [rax+60h] 
00007FFD7CE0BCE4  vfmadd231ps ymm1,ymm6,ymmword ptr [rax+0C0h] 
00007FFD7CE0BCED  vfmadd231ps ymm1,ymm4,ymmword ptr [rax+80h] 
00007FFD7CE0BCF6  vfmadd231ps ymm1,ymm3,ymmword ptr [rax+0E0h] 
00007FFD7CE0BCFF  vfmadd231ps ymm1,ymm5,ymmword ptr [rax+0A0h] 
00007FFD7CE0BD08  vfmadd231ps ymm1,ymm8,ymmword ptr [rax+100h] 
00007FFD7CE0BD11  vmulps      ymm0,ymm9,ymmword ptr [rax-60h] 
00007FFD7CE0BD16  vfmadd231ps ymm0,ymm12,ymmword ptr [rax] 
00007FFD7CE0BD1B  vfmadd231ps ymm0,ymm10,ymmword ptr [rax-40h] 
00007FFD7CE0BD21  vfmadd231ps ymm0,ymm13,ymmword ptr [rax+20h] 
00007FFD7CE0BD27  vfmadd231ps ymm0,ymm11,ymmword ptr [rax-20h] 
00007FFD7CE0BD2D  vfmadd231ps ymm0,ymm14,ymmword ptr [rax+40h] 

you can see how VS uses registers ymm0 and ymm1 to accumulate the intermediate values, one full chain after the other.
Intel says that since the processor is superscalar and has register renaming, this in fact does not make a difference. I do not think that is the case; in fact I have seen tests where this proved not to be true.

on the other side, using Clang we get the following code sequence:
Code: Select all
00007FFD757B4C72  nop         word ptr cs:[rax+rax] 
00007FFD757B4CA1  vmulps      ymm7,ymm1,ymmword ptr [rax-240h] 
00007FFD757B4CA9  vmulps      ymm12,ymm8,ymmword ptr [rax-180h] 
00007FFD757B4CB1  vfmadd231ps ymm7,ymm4,ymmword ptr [rax-1E0h] 
00007FFD757B4CBA  vfmadd231ps ymm12,ymm11,ymmword ptr [rax-120h] 
00007FFD757B4CC3  vfmadd231ps ymm7,ymm2,ymmword ptr [rax-220h] 
00007FFD757B4CCC  vfmadd231ps ymm12,ymm9,ymmword ptr [rax-160h] 
00007FFD757B4CD5  vfmadd231ps ymm7,ymm5,ymmword ptr [rax-1C0h] 
00007FFD757B4CDE  vfmadd231ps ymm12,ymm6,ymmword ptr [rax-100h] 
00007FFD757B4CE7  vmovaps     ymm4,ymmword ptr [rsp+0A0h] 
00007FFD757B4CF0  vfmadd231ps ymm7,ymm4,ymmword ptr [rax-200h] 
00007FFD757B4CF9  vfmadd231ps ymm12,ymm10,ymmword ptr [rax-140h] 
00007FFD757B4D02  vfnmsub231ps ymm7,ymm14,ymmword ptr [rax-1A0h] 
00007FFD757B4D0B  vfnmsub231ps ymm12,ymm15,ymmword ptr [rax-0E0h] 
00007FFD757B4D14  vaddps      ymm7,ymm12,ymm7 


here the partial values are accumulated in two registers as well, ymm7 and ymm12, but the two chains are explicitly interleaved so that the sequence of instructions has no dependencies.
That's what I would expect to see if I were hand-writing assembly code.

there are other factors that can keep the code from yielding the expected 200% gain, like memory bandwidth when the bus is saturated; but there is also the chance that, since the solver packs 8 joints each with up to 48 rows, the cache is blown as it goes over the entire group of 8 joints.

I would expect, everything else being equal, the code generated by Clang to be significantly faster than the code produced by Microsoft, and that's the best we can hope for, in my opinion.

if what Intel says is correct then both code sequences will perform equally, but I have a hard time believing that the CPU will keep an island of instructions in flight so large that it will execute, for example,

00007FFD7CE0BCE4 vfmadd231ps ymm1,ymm6,ymmword ptr [rax+0C0h]
together with
00007FFD7CE0BD16 vfmadd231ps ymm0,ymm12,ymmword ptr [rax]
which is about six instructions ahead in the queue.

the superscalar logic is in microcode, and keeping that many instructions in flight makes a very deep pipe,
which would have disastrous effects on a branch misprediction; a CPU must balance both effects.
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: About OpenCl solver

Postby Dave Gravel » Sat Apr 17, 2021 7:17 pm

You search a nice physics solution, if you can read this message you're at the good place :wink:
OrionX3D Projects & Demos:
https://orionx3d.sytes.net
https://www.facebook.com/dave.gravel1
https://www.youtube.com/user/EvadLevarg/videos
User avatar
Dave Gravel
 
Posts: 800
Joined: Sat Apr 01, 2006 9:31 pm
Location: Quebec in Canada.

Re: About OpenCl solver

Postby Julio Jerez » Sat Apr 17, 2021 9:34 pm

Nice that they include support for shared virtual memory.
If this is the case, it makes things much simpler because there is no need for so much buffer duplication.

To me that is the worst part of GPU programming.
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: About OpenCl solver

Postby Julio Jerez » Sun Apr 18, 2021 1:10 pm

Ok, I committed the changes for compiling using the Clang toolset.
I generated the solution from CMake by selecting that option, as shown in the image below.

Untitled.png (20.55 KiB)


it is not as flexible as the Visual Studio toolset;
for example, Clang does not like the Visual Studio precompiled headers if they include system header files, so that option has to be disabled for non-Visual-Studio compilers.

another issue is that Clang or the Intel build reports errors if an intrinsic function is used in a header file that is exported as part of a library. For example,
take the instruction _mm_hadd_ps().

this was introduced with SSE3 about 15 to 20 years ago, but Intel and Clang will not use it by default if it is inlined in a header file unless you pass the option -msse3;
the problem is that the options in the IDE are different from toolset to toolset, so I am limited to the default options.

it is not all losses: if the intrinsics are used in a cpp file, like we do for the AVX2 solver now, then it does compile correctly even without setting the option on the command line, so that is good.

other than that, we can now make a Clang solution from CMake when using Visual Studio 2019.
I only tested it in 64-bit mode, which I assume is what everyone uses.
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: About OpenCl solver

Postby Julio Jerez » Sun Apr 18, 2021 2:37 pm

I think I am making a big mistake.
I am trying to arrange the solver so that it is friendly to the GPU using OpenCL.

for example, I am trying to make an island that uses indices instead of pointers on the CPU so that I can load it to the GPU, but this is the second time I have tried, and both times I got errors.

I am not going to do that. I will see if the new OpenCL supports shared virtual memory so that I can load pointers, and we will only support OpenCL 2.0 minimum.

the local code is becoming really messy, with so much duplication that it is propagating to other areas.
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: About OpenCl solver

Postby JoeJ » Sun Apr 18, 2021 3:52 pm

Julio Jerez wrote:and we only support OpenCL 2.0 minimum.

Notice Nvidia's 3.0 support does not mean they support 2.0 as well. 3.0 is a confusing step back, not requiring advanced 2.0 features like device-side enqueue. Probably that's one reason why NV now proudly presents 3.0 support. :?
Though, i don't think that's a problem if you reduce to the features NV announced on the webpage. AMD GPUs support full 2.0, so no problems there either. AFAIK you use an AMD GPU, so we rely on user feedback to report issues on NV, if any. But the 'minimum spec' would then more likely be 3.0 than 2.0.

Curious about shared memory performance. Personally i expect it ends up bad, assuming the GPU needs to access all memory over the bus. If this really works well, offloading stuff to the GPU would become a lot more attractive. Currently the hurdle of rearranging all data and memory layout into GPU buffer objects is just too high for me as well.
User avatar
JoeJ
 
Posts: 1453
Joined: Tue Dec 21, 2010 6:18 pm

Re: About OpenCl solver

Postby JoeJ » Tue Apr 20, 2021 2:57 am

Got an interesting answer about CL's (non)future on AMD GPUs: https://forum.beyond3d.com/posts/2200246/
Usually the guy knows what he's talking about. Maybe pointer support for Vulkan is interesting too, although i'm not sure if or when this will be exposed to GLSL.

Well, personally i'll stick at Vulkan for compute. Writing some abstraction to get rid of API clutter is little work, and after that it's as comfortable as CL. :|
There are also such things as this: https://github.com/EthicalML/vulkan-kompute
User avatar
JoeJ
 
Posts: 1453
Joined: Tue Dec 21, 2010 6:18 pm

Re: About OpenCl solver

Postby Julio Jerez » Tue Apr 20, 2021 1:14 pm

well, in my case I will see if I can get any OpenCL version going, and from there I will see if I can add other solvers.
after I get one solver there is a lot of learning that we can use to improve and make other solutions.
nothing stops us from making a special CUDA or Vulkan version, or both, if we have to.

anyway, in the refactoring of the solver I am making the decision that the solver is going to take a hit in the single-threaded version to favor the multithreaded one.

this is because, as I mentioned before, some loops in the solver fall into the category of reductions; basically these are loops that in my time we used to call systems with memory.
for example, a loop of the type:
Code: Select all
for (int i = 1; i < count; i++)
   a[i] = a[i - 1] + a[i];


these kinds of loops can be parallelized, but in my experience the overhead of doing so on a CPU results in a net loss if count is not large enough; in Newton, count is on the order of 2 to 4 thousand.
so as a result there are many algorithms in Newton like this that are implemented single-threaded.

once count becomes 16 thousand or more, the parallel version starts to make sense and we see some marginal gains; however, even when we start to see gains, the final performance is so slow that it is not worth making a parallel version.
this changes for the GPU: on a GPU the loop above is a killer, so we need a parallel version, and there we start to see the proliferation of multiple algorithms that do the same thing.

what I am doing now is going with the parallel version even if the single-threaded version takes a hit. Here is an example from the engine:

take for example this operation that takes a joint and determines whether its bodies are at rest:

Code: Select all
ndConstraint* const joint = jointArray[i];
ndBodyKinematic* const body0 = joint->GetBody0();
ndBodyKinematic* const body1 = joint->GetBody1();
body0->m_bodyIsConstrained = 1;
body0->m_resting = body0->m_resting & resting;
body1->m_bodyIsConstrained = 1;
body1->m_resting = body1->m_resting & resting;


this is in a loop where a few joints can share a body, so as a result it cannot be done multithreaded as written, because we would take the risk of a race condition.

the thread-safe version would be this:

Code: Select all
ndConstraint* const joint = jointArray[i];
ndBodyKinematic* const body0 = joint->GetBody0();
ndBodyKinematic* const body1 = joint->GetBody1();
ndSetData& data0 = body0->m_setData;
ndSetData& data1 = body1->m_setData;

const bool resting = body0->m_equilibrium & body1->m_equilibrium;
data0.SetConstrainedAndRestingState(true, resting);
data1.SetConstrainedAndRestingState(true, resting);


where SetConstrainedAndRestingState is this function:

Code: Select all
inline void ndSetData::SetConstrainedAndRestingState(bool constrained, bool resting)
{
   ndSetData oldData(m_transaction);
   ndSetData newData(m_transaction);
   oldData.m_resting = 1;
   oldData.m_bodyIsConstrained = 0;
   newData.m_resting = resting;
   newData.m_bodyIsConstrained = constrained;
   m_transaction.compare_exchange_weak(oldData.m_value, newData.m_value);
}


as you can see, it is quite a bit more expensive single-threaded: a lot more instructions, and it uses a CAS operation.
the CAS is not really as bad as it looks, since the write to memory is conditional, so it only touches the memory buffer when there are real changes; that part is not too bad, but it does still write at least once for each item.

so these are the things I have to do first to get the engine conditioned for the GPU. What I am not doing is making it use indices; to me that is just ridiculous, to the point that if I can't use pointers, then so be it, there will not be a GPU version.
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: About OpenCl solver

Postby Julio Jerez » Sat Jul 10, 2021 10:51 am

Hey guys,
has anyone used SYCL?
I heard of it a while back but ignored it because, it was said, no one took it seriously.

Today, browsing the state of HPC, I keep reading about it.

The description says that this is actually C++ built on top of OpenCL.

If this is true, it is similar to what Microsoft wanted to do with AMP, but it was ignored. Of course this is made by Khronos, so I do not expect it to be anything as clean and well thought out as Microsoft's, but they say it is C++, so that should be better.

That would be the dream path to compute shaders I have been waiting on for almost 15 years.

And apparently this is supported natively by Microsoft starting with Visual Studio 2019; they call it ComputeCpp.
I am going to try that.

Has anyone tried this SYCL yet?
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: About OpenCl solver

Postby JoeJ » Sun Jul 11, 2021 2:23 am

I have not used SYCL.
Intel uses it for its oneAPI, so at least they take it very seriously. IDK if it can be used on any HW which has OpenCL support. Surely worth a try!

Meanwhile i've learned AMP does have support for LDS memory (initially i assumed that was abstracted away for simplicity, so i'd ruled AMP out). I might try that some time. The downside is it uses DX11 and has not been updated, so i'm not sure if things like async compute work well.

I also have some hope Windows 11 will help AMD make its HIP (or whatever it's called) compatible with Windows. If so, Cuda might become an option. Differences could be handled with some #ifdefs, i guess. But then we still have a problem with Intel GPUs.
User avatar
JoeJ
 
Posts: 1453
Joined: Tue Dec 21, 2010 6:18 pm
