Cuda Solver

A place to discuss everything related to Newton Dynamics.

Moderators: Sascha Willems, walaber

Re: Cuda Solver

Postby Julio Jerez » Wed Mar 23, 2022 4:01 pm

I think what is becoming obsolete is the PCI bus.

Video game consoles do not have a PCI bus, and they run circles around PCs which in general are about two to three times more powerful.

But anyway, I believe that somehow APIs like DX and OpenGL figure out how to split the device, so that different wavefronts run different kernels.
If you think about it, any GPU these days has at least 16 or more multiprocessors, and each one of these cores needs at least 64 x 10 items = 640 threads to be efficient.
A block running just one thread has the same cost as one running 10 wavefronts.

The latest GPUs come with 40 and even 60 multiprocessors.
It is ridiculous to try to keep those busy by running one shader at a time.
That's where streams and dependency graphs become very useful; Vulkan and DX12 expose that with command buffers.
For legacy apps the driver tries to figure it out, but that's very hard.

For a physics library, and for some of the new game engines like UE and Unity, the focus is on big data, so they can keep a GPU busy with a single shader.

For us, what it means is that there is a high number of objects that we can process, and above that there will be a dramatic cutoff in performance; but that number could be quite high as physics goes.
I now can see how an SPH fluid can handle millions of particles: it just capitalizes on the high core count and hides the huge latency.

As for all those memory modes, I do not think I will try many of them; from what I have read they run quite a bit slower, because of the hardware exception handling used to trigger hidden memory copies behind the user's back.

The async memcpy is also something that has to be used with care, because it uses committed virtual pages, which Nvidia calls pinned memory. But that could be bad for the OS.
Julio Jerez
Moderator
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Julio Jerez » Wed Mar 23, 2022 4:16 pm

After I make the last common optimization, which will be getting the transforms (that is less than half the memory traffic), I will have enough to continue adding the skeleton of the engine.
Next I can write the broad phase and a box collision calculator, so that it can feed the constraint solver.

What the async copy will do is keep the CPU broadphase for the app interface. These updates can run while the GPU is doing work.

But the heavy load of pairs will be a brute-force sweep and prune.
Right now that costs 60 ms on the CPU for that scene.

And we keep adding systems one at a time, from the top down.

Re: Cuda Solver

Postby JoeJ » Wed Mar 23, 2022 4:56 pm

Julio Jerez wrote:It is ridiculous to keep those busy by running one shader at a time.
That's where streams and dependency graphs become very useful; Vulkan and DX12 expose that with command buffers.
The legacy apps try to figure it out in the driver, but that's very hard.

I did some testing on this years ago.
It was not really easy to get working. For example, AMD exposes one gfx queue and one compute queue to Vulkan. So I tried the latter for async compute tests. But it turned out this compute queue cannot run at full throughput; I got only about half the performance compared to using the gfx queue. And that's documented nowhere: you just figure it out by testing, and you don't know how other GPUs behave. Also, AMD did not respect the given queue priorities, so I could not change this behavior. Maybe they have changed some things in the meantime. Still, in the end I got pretty good scaling with async compute. While one queue is blocked on sync, the GPU still keeps working on the other. Pretty nice. I did not test this on NV, which back then had no proper async yet. I've heard they only got this finally working well with Ampere.

However, the preferred option is to use just one queue and fill it with different workloads, while making sure sync does not create too many stalls. In this case the GPU processes multiple dispatches automatically and the programmer does not even notice (without using profilers). It only shows up as strange timings, which only make sense after realizing that workload execution may overlap.

This method gave me perfect performance, but usually you need the results of one task to process the next one, so it rarely is an option. I guess it works best if we mix totally independent workloads, e.g. raster, lighting, physics, audio, etc. Then we always have enough work to saturate, but surely we have to tweak ideal execution graphs for different hardware. :roll:
...working on consoles really must be total fun!
JoeJ
 
Posts: 1453
Joined: Tue Dec 21, 2010 6:18 pm

Re: Cuda Solver

Postby Julio Jerez » Wed Mar 23, 2022 6:00 pm

If you look at the image:
Untitled.png (52.98 KiB) Viewed 2344 times

The first bar is the time taken by the GPU memory copy.

In the final optimized version that bar should disappear altogether, because there will be two streams: one stream will be copying the data to one buffer while the other thread is reading from the second buffer; then the buffers will be swapped for the next frame.

That happens in the code below.
Code: Select all
void ndWorldSceneCuda::UpdateTransform()
{
   D_TRACKTIME();
   GetBodyTransforms();
   ndScene::UpdateTransform();
}


GetBodyTransforms() should not block, and that is what cudaMemcpyAsync is for.

So if I am correct, that will remove about 4 to 5 ms from the scene.
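A minimal CPU-side sketch of that double-buffer swap might look like this (my own illustration, not the engine's code; in the real version the fill of the write buffer would be a cudaMemcpyAsync on its own stream):

```cpp
#include <array>
#include <utility>
#include <vector>

// Two buffers: the GPU copies into one while the CPU reads the other.
// Swapping the indices at the end of the frame means neither side ever
// waits on the other; the reader just sees data that is one frame old.
struct TransformBuffers {
    std::array<std::vector<float>, 2> buf;
    int write = 0;   // the async copy targets this buffer
    int read  = 1;   // the app reads transforms from this buffer

    void beginFrame(const std::vector<float>& gpuData) {
        buf[write] = gpuData;          // stands in for cudaMemcpyAsync
    }
    const std::vector<float>& transforms() const { return buf[read]; }
    void endFrame() { std::swap(write, read); }  // flip for next frame
};
```

The one-frame latency of the transforms is the price paid for never blocking on the copy.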

Re: Cuda Solver

Postby Julio Jerez » Wed Mar 23, 2022 11:12 pm

Oh, darn, this is very scary. I did a stress test with a block of 40 x 40 x 40 = 64000 bodies.
The GPU does not even blink; in fact it wants more data.
Here is an image.

Untitled.png (61.42 KiB) Viewed 2338 times

The physics with 4 threads takes 9 ms, but almost all of that is copying the transforms; the GPU is not even at 100 microseconds, and we know that is correct because the memcpy synchronizes with the kernels.
Rendering 64k boxes brings OpenGL to its knees at 60 ms.
At that level we will have to add some special extension like interop buffers, so that the engine can copy the transforms GPU to GPU memory.
But this seems to indicate that it could easily do more than a million of those babies.

Of course it will get slower as more systems get added, but there is all that time below the rendering.
So as long as we do better than the renderer we will be OK.

So now I will continue with the other parts of the scene stuff.

Re: Cuda Solver

Postby Julio Jerez » Thu Mar 24, 2022 12:00 pm

I have to get rid of the algorithms with n log(k* n) time complexity.

They are elegant and served well in the past, but now, with so many cores and us trying to manipulate big data, they are much harder to implement.

One such algorithm in the engine is the construction of islands. In that scene it takes about 25% of the time for zero gain.
I tried to remove it altogether last year and it sent me into months of rewriting.
But I will try again.
The method I am using is a sort by sleeping state, where each body gets the info about its neighbors,
and then we can apply the sleep state to each body based on that state.

It would seem that sorting is probably worse, and it really is if we only have one or two cores.
Radix and counting sort are not really faster than quicksort if the number of elements is around 2000.
That is because the memory bandwidth it takes to move data from buffer to buffer is not cache friendly and kills the linear time complexity, while the quicksort inner loop is very cache coherent.
Once we pass several thousand elements, both algorithms are dominated by memory bandwidth, and then quicksort is no longer the winner.

So for big data it becomes more important to have algorithms that minimize memory accesses and do more in code, and algorithms with linear time complexity by definition do that better than methods with n log(n).

For example, a quicksort will end up reading each element log(n) times, which for a small data set
is not a big problem since the elements will probably be cached; but if the array is very large then that data will be evicted from cache a lot more.
But if we have a counting sort, in each pass each data element will be accessed twice, although at a random location, so it is much better even if it is a cache miss each time, since the quicksort will also be a cache miss for a large set.
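As a concrete illustration of the two-pass access pattern described above, here is a minimal counting sort over 8-bit keys (a sketch of the general technique, not Newton's implementation):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Counting sort: each element is read exactly twice per pass,
// once to build the histogram and once to scatter into the output,
// regardless of how many elements there are -- linear time complexity.
std::vector<uint8_t> countingSort(const std::vector<uint8_t>& in) {
    size_t hist[257] = {0};
    for (uint8_t k : in) ++hist[k + 1];          // pass 1: histogram
    for (int i = 1; i < 257; ++i)                 // prefix sum gives
        hist[i] += hist[i - 1];                   // each key's start slot
    std::vector<uint8_t> out(in.size());
    for (uint8_t k : in) out[hist[k]++] = k;      // pass 2: scatter
    return out;
}
```

A real radix sort would run this once per key byte; the point is only that the per-element work is constant, unlike quicksort's log(n) revisits.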

Anyway, I will try to remove islands again and see how that works.
Then the next change is to go for a multigrid sweep and prune, so we keep two broad phases: one for the game (raycasts, AABB tests, convex casts and that sort of thing), and one that is rebuilt each frame, again using grid sort and color coding.

The result will be that we will lose some performance with small scenes, but the growth with scene size is linear instead of the n log(n) that we have now.

Re: Cuda Solver

Postby JoeJ » Fri Mar 25, 2022 7:35 am

It would be very nice if you could keep the old methods as well, so we can choose which to use.
Probably too much to maintain, but would be nice.
It's hard to predict the future. Will core counts keep increasing? Will they instead decrease again, because increasing power draw becomes too much of an issue? Will lower core counts, combined with larger on-chip caches, do better for meaningful consumer workloads?
Hard to say, but eventually you might regret decisions at some point.

Btw, what's your proposal regarding a scrolling open world? Maybe that's related to such decisions too.
Should we move all bodies every frame, or would this trigger some rebuild of acceleration structures?
Could we avoid this, eventually by using a small set of parent spaces, so we do not need to move bodies but just space transforms, and Newton handles the boundaries of overlap automatically?

Re: Cuda Solver

Postby Julio Jerez » Fri Mar 25, 2022 8:05 am

So far they are both there, via #ifdef.
But when a new method is clearly better than the old,
keeping the old around becomes a hassle to maintain.

On the big open world.
For the most part all methods in the engine are local.
So making an open world would be as simple as just making the offset base a double.
So far no one has asked for that functionality.

But yes, that could be interesting to think about as a feature.

Right now, unless the size of the world is larger than what can be kept in the mantissa of a float32, the engine has no problem,
since the calculations are local relative to the lowest point, and proportional to the world size.
But yes, you are right, it will not scroll since it does not have a notion of a virtual space.

In 4.00 this can be controlled from the app level using the new property that bodies, joints, models, and all other objects can be added to and removed from the world.
So an application can keep track of where the active world island is and remove the objects that sit outside it.

Then for multiple active islands, it can make multiple worlds, and move objects from one world to another.

The only limitation is that an object can only belong to one world at a time, so no overlapping worlds are possible.

So the answer is: there is some support for that, but if I get to do it, it will be application assisted, with a couple of functions.

Re: Cuda Solver

Postby Julio Jerez » Fri Mar 25, 2022 11:52 am

OK, here is the proof of what I said about time complexity.
Here is the old method in the 64k bodies scene:

Untitled.png (79.11 KiB) Viewed 2303 times


As you can see, sort islands takes about 7.x ms; that's because it runs an n log(n) pass on 64k elements and comes out with 64k islands of one body each. You can also see that, since the algorithm is recursive, it cannot be parallelized without atomic locks, and I bet that would actually make it slower.

Here is the new method:
suspension.png (60.22 KiB) Viewed 2303 times


As you can see, the sort islands takes under 1 ms using 4 cores; it parallelizes almost linearly with core count.

It is not without problems: the new method can only check one layer of neighbors, so it can only put an island to sleep one layer at a time.

For example, the pyramid will take several frames to go to sleep, since the first row and second row all have to go to sleep at once in order to make the next row sleep. Then in another frame, if all of the first layer is sleeping, all their neighbors can go to sleep, and then their neighbors, and so on.
This means that a tall island will take many frames to go to sleep and may collapse.

The old method does not have that problem, because since it has the complete island, it could process all elements at once and copy the data to each as a frame count.
So when a tall island went through the process, bodies on top were considered as if they had the same state as the oldest bodies that went into that state, as long as the island did not change configuration.

But that is exactly what makes the algorithm expensive for very large data sets, because it needs to run a disjoint-set algorithm, and those are usually recursive, or recurrent.
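For reference, a disjoint-set structure of the kind being discussed typically looks something like this sketch (illustrative only, not the engine's code); the parent-chasing loop in find() is the serial, pointer-hopping part that resists parallelization:

```cpp
#include <numeric>
#include <vector>

// Classic union-find with path halving.  Merging bodies that share a
// contact into one island is a sequence of unite() calls; find() then
// labels each body with its island representative.  The data-dependent
// parent chain is why this is hard to split across threads.
struct DisjointSet {
    std::vector<int> parent;
    explicit DisjointSet(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);  // each body alone
    }
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // path halving
            x = parent[x];
        }
        return x;
    }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};
```

Every unite() must see the result of earlier ones, which is the serial dependency the sort-by-sleep-state method avoids.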

In any case, the new method puts the entire system to sleep if we do 3 substeps, and that is in line with the goal, which is to go for GPU and high core count systems.

I think I can make the current sleep actually be more aggressive if we use a two-level (neighbor-of-neighbor) scheme, which essentially is a multigrid. What this does is keep more sleeping bodies every frame, so it catches up faster. But that could be an option.

There are many tweaks that can be added.
But in the end, given the performance factor, more than five-fold when we tune this method, there is no reason to keep the older one, especially when the older one is not friendly to high core counts.

If you want to test the differences, there is a define in file ..\newton-4.00\sdk\dCollision\ndBodyKinematic.h
#define D_USE_ISLAND_WIP

It is checked in set to the old method until I do more testing, but you can try it out if you want.

Re: Cuda Solver

Postby JoeJ » Fri Mar 25, 2022 12:02 pm

Julio Jerez wrote:On the big open world.
For the most part all methods in the engine are local.
So making an open world would be as simple as just making the offset base a double.
So far no one has asked for that functionality.


It would be an important feature, I think.
So far I never did anything big, so I lack experience and have not spent much thought on it.
It became a topic recently on gamedev.net, and a guy dragonmagi chimed in, who did research papers and afaik also runs a company to help devs implement large worlds with fp32: https://www.gamedev.net/forums/topic/711954-issues-with-floating-origin/5446370/?page=2
Maybe you're interested in the papers, although it's somewhat obvious and was something new only to me, I guess :)
However, his proposals only cover visuals, which is the easy part. The part which matters is how we represent our coordinates for actual processing, e.g. physics. And I think it's thus the physics engine which dictates the solution we should use. So the ball is in your court.

I'm convinced even small and linear games will be all based on open world tech soon. No matter if it's UE5 level detail forcing us to stream levels of detail, or if it's robotic ragdolls which introduce smaller bodies and so shrink our spatial range of precision.

At the end of the thread i came up with a solution which would work for both gfx and physics.
We have a set of spaces, and all objects are local to that, so their local coordinates are always small and precise.
To model interactions across multiple adjacent spaces, I see two options:
1. Transform all objects to a shared uniform space. But unlike worldspace, the shared space has the player close (or exactly) to its origin. This way we keep best precision where we need it the most.
2. Keep objects local, but resolve adjacency offsets across spaces on demand internally.
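The precision argument for keeping the shared origin near the player can be checked numerically; this little helper (my own, hypothetical) reports the float32 step size at a given magnitude:

```cpp
#include <cmath>

// Returns the gap to the next representable float above x, i.e. the
// best positional resolution float32 can offer at that distance from
// the origin.  The spacing doubles with every power-of-two magnitude.
float floatSpacingAt(float x) {
    return std::nextafterf(x, 2.0f * x + 1.0f) - x;
}
```

At 1,000,000 units from the origin the spacing is 0.0625 units (about 6 cm if units are meters), while at 1 unit it is roughly 1.2e-7, which is why rebasing coordinates around the player preserves precision where it matters.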

The first solution would be much easier to use, of course, but it really depends on what's more efficient.
Maybe worth its own thread, to see what other users think, if you get to it... ;)

For my current work on the editor I'll simply use doubles, which will be easier and also faster.
Later, if I get to working on destroyed cities, that would be a nice test case.
But maybe I need to drop this idea for now. My work on terrain simulation already takes much too long... :roll:

Re: Cuda Solver

Postby Julio Jerez » Fri Mar 25, 2022 2:21 pm

Going double is the trivial thing to do and would work. But it is a big performance cost.
For almost 30 years we have been spoiled by CPUs having caches that hide the memory cost for us.
On x86 the cost of double and float is the same.
But that's a hugely misleading statement by Intel and AMD.
What they do not mention is that when using doubles in a vector class, the code uses twice as many registers and twice as many memory accesses, and the output code is prone to register spills.
Most people do not see that because all compilers are mostly scalar; a register can hold a scalar float or a double without problems.
That's not the case for SIMD, so it is sometimes better not to use SIMD doubles and let the compiler do the scalar optimization.

So the way I would do an open world is that I would resurrect the segments of the old x86.
You make a volumetric tile.
The coordinates of the volumes are world aligned and they can overlap.
Inside a tile all data is float32.

That way you can have a bunch of adjacent tiles, and rebasing is just a matter of iterating over the tiles changing the exponent of the double.

If the tiles are set such that the sizes are powers of two,
then changing a tile base does not change the mantissa of any of the elements inside that tile.

Then your broad phase is just a tile base, which is usually a 64-bit integer.

Two overlapping objects in two adjacent tiles will only differ by exponent, so the collision routines do not have to change anything, since overlapping tiles never change value when rebasing them.

If you look at the collision system in Newton, it does that extensively.
When two shapes collide, it finds the minimum of the AABBs of the two combined, subtracts that from both, does the calculation, and then adds the base back.

If you look at file ndContactSolver, line 2543, you will see that code.

This is part of what makes the calculation very accurate.
What would be different for an open world is that it would not be using the min of one AABB; instead it subtracts a power of two, which does not change the mantissa of the values in the cell.
Also, the values in the cells are identified by the exponent of the double, which will always have integer mantissas that are exactly divisible by powers of two.
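A small sketch of the power-of-two rebasing (my illustration with an assumed 4096-unit tile size, not Newton's code):

```cpp
#include <cmath>

// Convert a double world coordinate into a float32 coordinate local to
// its tile.  Because the tile base is an exact power-of-two multiple,
// the subtraction is exact in double precision; the only rounding that
// ever happens is the final narrowing to float32.
float toTileLocal(double worldX, double* tileBaseOut,
                  double tileSize = 4096.0) {
    double tileBase = std::floor(worldX / tileSize) * tileSize;
    *tileBaseOut = tileBase;
    return float(worldX - tileBase);  // exact subtract, then one round
}
```

Rebasing to an adjacent tile just adds or subtracts one tile size to the base, which leaves the local mantissa bits untouched.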

That's actually a big exercise that, without demand, I do not see a reason to put much time into.

Re: Cuda Solver

Postby JoeJ » Fri Mar 25, 2022 4:03 pm

Julio Jerez wrote:For example, the pyramid will take several frames to go to sleep, since the first row and second row all have to go to sleep at once in order to make the next row sleep. Then in another frame, if all of the first layer is sleeping, all their neighbors can go to sleep, and then their neighbors, and so on.
This means that a tall island will take many frames to go to sleep and may collapse.

I don't think that's a problem.
Reading your post, it sounds like the new method distributes work over several frames, while the old one can cause runtime spikes.
Then I clearly like the new one better and my concerns were unfounded.

Julio Jerez wrote:On x86 the cost of double and float is the same.

Ha, I always thought doubles have twice the cost, speaking of cycles alone.
So can any AVX CPU do groups of 4 floats, but also 4 doubles as well?
And for scalar ops the same: processing a float or a double takes the same cycles?
I'm very surprised.
On GPU however the ratio is something like 1:32, so doubles in inner loops are surely no option.

I don't know about your proposed exponent-bit trickery either, but it sounds interesting.
Will look at your code to get the idea...

Re: Cuda Solver

Postby Julio Jerez » Fri Mar 25, 2022 4:23 pm

Yes, AVX can do 4 doubles, but it is not as rosy as you think.

When using AVX there is a lot of glue code you have to apply if you are writing cross-platform code, or even when using other libraries. The AVX units are not really a monolithic FPU; instead it is two SSE units placed together, so when not using structures of arrays, the normal swizzling that you do in SSE to move data to different lanes requires two instructions in AVX: one to move from the upper 128 bits and one for the swizzle. It is better in AVX2, but really bad in AVX.
What I did was put the AVX2 code in a separate library that does all the AVX as if it were a GPU register: each lane of the AVX acts as a scalar unit, no swizzling.
That's when AVX2 code really shines, but there is the extra overhead of transposing your data.
Overall, even with the extra cost, it is a net win.

But if you try to make a 4 x 1 normal vector class using AVX or AVX2 registers, it is in fact a net loss, or at least I have never seen it be better than SSE, which is why I do not use it that way.
Your binary code is full of extra instructions the compiler has to insert to make it compatible with the rest of the code that is going to use it.

To me, the only time I have gotten better results using AVX is by making separate libraries that you compile with compiler flags so that it does not issue all that extra code.
It still issues some glue code, but only when you have to make calls to functions outside the library.
That's part of the reason so much code is duplicated in the solver, and also why lots of the shared functions are inlined.

Here is an example: memcpy is much slower than a loop that copies the data.
But a loop that copies the data has to be inlined if you want the AVX to be 256 bits instead of 128.
Those little things, when considered in isolation, do not seem to make a big difference, but when put together in lots of consecutive code, they add up.
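For illustration, the kind of inlinable copy loop meant here could be as simple as this (my sketch, not the engine's code):

```cpp
#include <cstddef>

// A trivially auto-vectorizable copy.  Because the loop body is visible
// at the call site (inline + restrict), the compiler can unroll it and
// use whatever register width the compile flags allow (e.g. full
// 256-bit AVX), whereas memcpy is an opaque library call whose width
// and glue code the caller cannot control.
inline void copyFloats(float* __restrict dst,
                       const float* __restrict src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```

Whether this beats memcpy depends on the compiler and flags; the point is that inlining removes the call boundary that forces the conservative code path.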

Re: Cuda Solver

Postby Julio Jerez » Fri Mar 25, 2022 4:38 pm

The ratio of 1:32 on GPUs is because the Nvidia GPUs do not really do doubles. Instead, each compute unit has one special double precision unit.
Since the core has 32 float lanes and one double unit, that is where the 32 factor comes from.
For f16 the ratio is twice as many, because they are done by the shader cores.
My guess is that sometime in the future new GPUs will have 64-bit float units, so that ratio will become 1:2.

If I had to bet, I think that's a sure thing, since doubles are important for scientific apps.

Re: Cuda Solver

Postby Julio Jerez » Fri Mar 25, 2022 6:02 pm

JoeJ wrote:I don't think that's a problem.
Reading your post, it sounds like the new method distributes work over several frames, while the old one can cause runtime spikes.
Then I clearly like the new one better and my concerns were unfounded.


I do not think it is a problem either; most practical scenes do not really do much stacking.
However, it would be nice if we could improve it a little.
I now added one of the tweaks from the island method: in the full island method, when a body goes to rest it does not have zero velocity and acceleration; instead, when that happens, there is a damping value that reduces the velocity to zero gradually. That has the effect that spiky small velocity changes are zeroed out, smoothing the behavior. In the new method that is not possible, since there is nowhere to save the data, unless I add space on objects to save it. But I can just set it to zero.
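A minimal sketch of that gradual put-to-sleep damping (illustrative only; the constant and function name are my own, not Newton's):

```cpp
// Instead of snapping a resting body's velocity straight to zero,
// scale it down each frame so small velocity spikes decay smoothly
// until the value drops below the sleep threshold.
float dampTowardSleep(float velocity, int frames,
                      float damping = 0.5f) {
    for (int i = 0; i < frames; ++i)
        velocity *= damping;   // smooth exponential decay, not a snap
    return velocity;
}
```

With a damping factor of 0.5, a unit velocity falls below 1/16 of its value within four frames, so the body settles quickly but without a visible pop.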

The effect is that the stacking test scene now goes to rest as before, but it seems the AVX2 solver does not like that. This is temporary; I am hoping I can come up with better tweaks.
But all in all, this method seems not only equal but much better.
At the very least it is more accurate.

If you test it, you can see the differences by setting this define:
//#define D_USE_ISLAND_WIP

You will see that with the define commented out, the scene stays active for a longer time, while the full island method has a truncating effect: the island reaches a point where the global acceleration and velocity are below a threshold, and then the whole island goes to sleep in one frame.
This method instead has to keep bringing neighbors to rest layer by layer,
and it is during that process that, if a body is about to fall, the whole island can collapse.

It is not easy to tweak that, because the whole scene is just one giant island, but let us see what we can do.

I will do some more tuning to get to a stable state, and then move on to the scene changes and the pair calculation.
