Cuda Solver


Re: Cuda Solver

Postby JoeJ » Tue Jun 21, 2022 9:27 am

But the problem is that the majority of the work happens at the beginning of the construction.

It's not really the majority of the work, but actually a small workload, leading to bad GPU utilization due to a high-complexity / low-workload ratio. At least regarding building a tree.

To me this is not that bad, because in such cases I can use some other, massive workload to be executed in parallel. Thus, if we have such a workload handy, which we always have for a full game, the problem solves itself.

That's one reason why I say we have to use just one API for both gfx and physics, so we can make an optimized async execution graph of all our workloads, to hide those setup and low-workload costs.

But if you find another way to saturate the GPU on your own, that's easier for all of us, of course.
Still, parallel computing should not lead us to a culture of 'first: saturate HW at any price, second: reduce work eventually', like NV usually proposes.

As for whatever it is Nvidia does, it seems they are building their trees on the CPU, in the driver.

I assumed the same initially, but now I'm sure they build the RTX BVH on the GPU. People show the costs of this in their GPU profiler outputs, and it's one of those tasks they usually hide under async execution, which works well.
But there is a big CPU and memory cost with RT too, so maybe the CPU is still involved as well, maybe to build just the TLAS or the top levels.

What you say about a bottom-up tree sounds very interesting. I do this for my offline BVH, but lack realtime experience.
You surely mean something different than the conventional way of using a second tree for agglomerative builds.
I'll try to follow this.
But when I was peeking at your grid for fluid particles, for example, I got no clear idea how it was intended to work at all.
Maybe because I know nothing about Sweep And Prune either.

Re: Cuda Solver

Postby Julio Jerez » Tue Jun 21, 2022 12:51 pm

On the fluid, the sweep and prune was a compromise I made, where I reduce the space by one dimension.
When I implemented the sort in 3D, each cell can potentially generate 8 intersections.
So a block of 40 x 40 x 40 is 64k particles and can produce up to 512k intersections; assuming the struct is 64 bits, that's about 4 MB of data, and it needs two buffers.
When I ran that, I was using VTune, which has a mode to measure memory bandwidth, and according to VTune memory bandwidth was a problem. I do not have it anymore because I could only use it for 30 days.
By reducing the space to 2 dimensions, the memory was cut in half. However, when I get around to it again, I will make it pure 3D. I think I now have a better way to reduce the duplicates, plus I am tuning it for newer CPUs, with better memory bandwidth and larger caches. My CPU only has 8 MB of cache, and if we keep coding for legacy hardware, it will never make progress.
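Just to put numbers on that, a quick back-of-the-envelope check (the 64-bit intersection struct and the two buffers are taken from the description above; this is only arithmetic, not Newton code):

Code: Select all
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main()
{
   // 40 x 40 x 40 grid, one particle per cell.
   const size_t cells = 40 * 40 * 40;              // 64,000
   const size_t pairs = cells * 8;                 // up to 8 intersections each -> 512k
   const size_t bytes = pairs * sizeof(uint64_t);  // 512k * 8 bytes per pair struct
   std::printf("%.1f MB per buffer, %.1f MB for two buffers\n",
      bytes / 1.0e6, 2.0 * bytes / 1.0e6);         // ~4.1 MB and ~8.2 MB: roughly that 8 MB cache
   return 0;
}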


On the top-down tree builder, the function below is the algorithm that makes it.

Code: Select all
ndSceneNode* ndScene::BuildTopDown(ndSceneNode** const leafArray, ndInt32 firstBox, ndInt32 lastBox, ndFitnessList::ndNode** const nextNode)
{
   class ndBlockSegment
   {
      public:
      ndSceneTreeNode* m_node;
      ndInt32 m_start;
      ndInt32 m_count;
   };

   class ndBoxStats
   {
      public:
      ndVector m_median;
      ndVector m_varian;
      ndVector m_minP;
      ndVector m_maxP;
   };

   ndInt32 stack = 1;
   ndBoxStats boxStats[D_MAX_THREADS_COUNT];
   ndBlockSegment stackPool[D_SCENE_MAX_STACK_DEPTH];

   dAssert(firstBox >= 0);
   dAssert(lastBox >= 0);

   if (lastBox == firstBox)
   {
      return leafArray[firstBox];
   }

   D_TRACKTIME();
   ndSceneTreeNode* const root = (*nextNode)->GetInfo();
   root->m_left = nullptr;
   root->m_right = nullptr;
   root->m_parent = nullptr;
   *nextNode = (*nextNode)->GetNext();

   stackPool[0].m_node = root;
   stackPool[0].m_start = firstBox;
   stackPool[0].m_count = lastBox - firstBox + 1;

   m_sceneBodyArray.SetCount(stackPool[0].m_count);
   ndSceneNode** const tmpBuffer = (ndSceneNode**)&m_sceneBodyArray[0];

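   // Iterative top-down build: pop a segment off the explicit stack, split it
   // along the axis of largest center variance, and push both halves back.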
   while (stack)
   {
      stack--;
      ndBlockSegment block = stackPool[stack];

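      // Only two leaves left in this segment: link them directly under the node.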
      if (block.m_count == 2)
      {
         ndSceneTreeNode* const node = block.m_node;
         dAssert(node->m_left == nullptr);
         dAssert(node->m_right == nullptr);

         node->m_left = leafArray[block.m_start];
         node->m_left->m_parent = node;

         node->m_right = leafArray[block.m_start + 1];
         node->m_right->m_parent = node;

         const ndVector minP(node->m_left->m_minBox.GetMin(node->m_right->m_minBox));
         const ndVector maxP(node->m_left->m_maxBox.GetMax(node->m_right->m_maxBox));
         node->SetAabb(minP, maxP);
      }
      else
      {
         ndInt32 boxCount = block.m_count;
         ndSceneNode** const boxArray = &leafArray[block.m_start];

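         // Parallel pass: each thread accumulates the bounds of its slice plus the
         // sum and sum-of-squares of the box centers (for the mean and variance).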
         auto CalculateBoxStats = ndMakeObject::ndFunction([this, boxArray, boxCount, tmpBuffer, &boxStats](ndInt32 threadIndex, ndInt32 threadCount)
         {
            D_TRACKTIME();

            ndVector median(ndVector::m_zero);
            ndVector varian(ndVector::m_zero);
            ndVector minP(ndFloat32(1.0e15f));
            ndVector maxP(ndFloat32(-1.0e15f));

            const ndStartEnd startEnd(boxCount, threadIndex, threadCount);
            for (ndInt32 i = startEnd.m_start; i < startEnd.m_end; ++i)
            {
               ndSceneNode* const node = boxArray[i];
               tmpBuffer[i] = node;

               dAssert(node->GetAsSceneBodyNode());
               minP = minP.GetMin(node->m_minBox);
               maxP = maxP.GetMax(node->m_maxBox);
               ndVector p(ndVector::m_half * (node->m_minBox + node->m_maxBox));
               median += p;
               varian += p * p;
            }
            boxStats[threadIndex].m_median = median;
            boxStats[threadIndex].m_varian = varian;
            boxStats[threadIndex].m_minP = minP;
            boxStats[threadIndex].m_maxP = maxP;
         });
         ParallelExecute(CalculateBoxStats);

         ndVector minP(boxStats[0].m_minP);
         ndVector maxP(boxStats[0].m_maxP);
         ndVector median(boxStats[0].m_median);
         ndVector varian(boxStats[0].m_varian);

         const ndInt32 threadCount = GetThreadCount();
         for (ndInt32 i = 1; i < threadCount; ++i)
         {
            minP = minP.GetMin(boxStats[i].m_minP);
            maxP = maxP.GetMax(boxStats[i].m_maxP);
            median += boxStats[i].m_median;
            varian += boxStats[i].m_varian;
         }

         block.m_node->SetAabb(minP, maxP);
         varian = varian.Scale(ndFloat32(boxCount)) - median * median;

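         // Pick the split axis: the dimension with the largest variance of the centers.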
         ndInt32 index = 0;
         ndFloat32 maxVarian = ndFloat32(-1.0e15f);
         for (ndInt32 i = 0; i < 3; ++i)
         {
            if (varian[i] > maxVarian)
            {
               index = i;
               maxVarian = varian[i];
            }
         }

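         // Partition with a one-bit counting sort: boxes whose center is above the
         // mean along the split axis go right; on near-zero variance, split in half.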
         ndUnsigned32 prefixScan[32];
         if (maxVarian > ndFloat32(1.0e-3f))
         {
            class ndSpliteTest
            {
               public:
               ndSpliteTest(const ndFloat32 dist, ndInt32 index)
                  :m_dist(dist)
                  ,m_index(index)
               {
               }

               ndFloat32 m_dist;
               ndInt32 m_index;
            };
            ndVector center = median.Scale(ndFloat32(1.0f) / ndFloat32(boxCount));
            ndFloat32 test = center[index];

            class ndEvaluateKey
            {
               public:
               ndEvaluateKey(void* const context)
               {
                  ndSpliteTest* const test = (ndSpliteTest*)context;
                  m_dist = test->m_dist;
                  m_index = test->m_index;
               }

               ndUnsigned32 GetKey(const ndSceneNode* const node) const
               {
                  const ndFloat32 val = (node->m_minBox[m_index] + node->m_maxBox[m_index]) * ndFloat32(0.5f);
                  const ndUnsigned32 key = (val > m_dist) ? 1 : 0;
                  return key;
               }

               ndFloat32 m_dist;
               ndInt32 m_index;
            };

            ndSpliteTest context(test, index);
            ndCountingSort<ndSceneNode*, ndEvaluateKey, 1>(*this, tmpBuffer, boxArray, boxCount, prefixScan, &context);
         }
         else
         {
            prefixScan[0] = 0;
            prefixScan[1] = boxCount / 2;
            prefixScan[2] = boxCount;
         }

         const ndInt32 leftCount = prefixScan[1] - prefixScan[0];
         if (leftCount == 1)
         {
            block.m_node->m_left = leafArray[block.m_start];
            block.m_node->m_left->m_parent = block.m_node;
         }
         else
         {
            ndSceneTreeNode* const node = (*nextNode)->GetInfo();
            *nextNode = (*nextNode)->GetNext();

            node->m_left = nullptr;
            node->m_right = nullptr;
            node->m_parent = block.m_node;
            block.m_node->m_left = node;

            stackPool[stack].m_node = node;
            stackPool[stack].m_count = leftCount;
            stackPool[stack].m_start = block.m_start;
            stack++;
            dAssert(stack < sizeof(stackPool) / sizeof(stackPool[0]));
         }

         const ndInt32 rightCount = prefixScan[2] - prefixScan[1];
         if (rightCount == 1)
         {
            block.m_node->m_right = leafArray[block.m_start + leftCount];
            block.m_node->m_right->m_parent = block.m_node;
         }
         else
         {
            ndSceneTreeNode* const node = (*nextNode)->GetInfo();
            *nextNode = (*nextNode)->GetNext();

            node->m_left = nullptr;
            node->m_right = nullptr;
            node->m_parent = block.m_node;
            block.m_node->m_right = node;

            stackPool[stack].m_node = node;
            stackPool[stack].m_count = rightCount;
            stackPool[stack].m_start = block.m_start + leftCount;
            stack++;
            dAssert(stack < sizeof(stackPool) / sizeof(stackPool[0]));
         }
      }
   }
   return root;
}


When I said the first pass does the heavy lifting, I misspoke; I meant the top-level splits are the ones doing the heavy lifting.
As you can see, when it starts the array is large, so the parallel classifier and box stats are very effective; but as it keeps splitting the array into smaller sub-chunks, each sub-call gets smaller and smaller.

This is not a problem for a small-core-count CPU, but on a GPU it ends up launching n kernels of very few items each.

I think I can probably make a modification where all the splits of a level are processed at once, but that method requires a lot of processing to condition the sub-splits.
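The usual shape of that modification is to go breadth-first: process the whole tree one level per pass, so every split of that level runs in a single parallel launch. A hypothetical sketch, not Newton code (PartitionAllSegments stands in for the batched stats/partition pass):

Code: Select all
#include <cstddef>
#include <utility>
#include <vector>

struct Segment { int m_start; int m_count; };

// Placeholder for one parallel pass that computes the stats and partitions
// every segment of the current level at once; returns each left-side count.
std::vector<int> PartitionAllSegments(const std::vector<Segment>& level);

void BuildLevelByLevel(std::vector<Segment> level)
{
   while (!level.empty())
   {
      const std::vector<int> leftCounts = PartitionAllSegments(level);

      // Emit the next level: both halves of every segment that still needs a split.
      std::vector<Segment> next;
      for (size_t i = 0; i < level.size(); ++i)
      {
         const Segment& s = level[i];
         const int left = leftCounts[i];
         if (left > 1)
         {
            next.push_back({ s.m_start, left });
         }
         if (s.m_count - left > 1)
         {
            next.push_back({ s.m_start + left, s.m_count - left });
         }
      }
      level = std::move(next);
   }
}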
After I complete the bottom-up method, I will see if I can make that change, and whichever method is faster and can be parallelized for many cores is the one I will use.

So far the top-down is probably the winner if I can solve that issue.
But as I said, the bottom-up is an attractive method because it can be incremental, which means it can be built across many frames, reducing the footprint.
I find 1 ms for that unacceptable on a GPU, if we consider that a physics lib cannot take too much of the GPU's time.

Re: Cuda Solver

Postby Julio Jerez » Tue Jun 21, 2022 1:24 pm

Also, I am just doing the first tests, and so far the bottom-up build generates a tighter tree, using the same metric for both. The metric I am using is the sum of the surface areas of all the sub-nodes:

bottom up entropy: 42584.000000000000 (double)
top down entropy:  42911.375000000000 (double)

So this is a point in favor of the bottom-up build.
That came as a surprise to me; I thought that a top-down build would be tighter. But now that I think about it, a bottom-up build uses the spatial position of the items, so when it groups two nodes they are guaranteed to be close in all three dimensions, while in the top-down build, when it groups two nodes, only one of the three dimensions is guaranteed to be the closest while the other two are close only in the statistical sense.

So top-down makes very good selections when the array has many items, but as it gets close to the bottom, the mean and the deviation cannot really determine the best split axis.
Bottom-up does not use that notion; it simply places a grid filter, and the items fully inside a grid cell are considered close; then the grid size is doubled and the process is repeated.
Since the grid cells are in space, it only needs to group the items by their grid id, which is just a sort by grid key.
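In sketch form, that fitness metric is just a recursive surface-area sum over the internal nodes. GetAsSceneTreeNode() is assumed here by symmetry with the GetAsSceneBodyNode() call in the builder above; this is an illustration, not the actual fitness code:

Code: Select all
// Sum of the surface areas of all internal nodes: a smaller total means
// tighter boxes and therefore a cheaper tree to traverse.
static ndFloat32 SurfaceAreaSum(ndSceneNode* const node)
{
   ndSceneTreeNode* const tree = node ? node->GetAsSceneTreeNode() : nullptr;
   if (!tree)
   {
      return ndFloat32(0.0f);   // null child or leaf (body) node
   }
   const ndVector size(tree->m_maxBox - tree->m_minBox);
   const ndFloat32 area = ndFloat32(2.0f) *
      (size.m_x * size.m_y + size.m_y * size.m_z + size.m_z * size.m_x);
   return area + SurfaceAreaSum(tree->m_left) + SurfaceAreaSum(tree->m_right);
}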

I'm starting to like this method better.
If anyone wants to follow the development: in file ...\newton-4.00\sdk\dCollision\ndScene.h, enable
#define D_NEW_SCENE

and the function is ndSceneNode* ndScene::BuildBottomUp(ndFitnessList& fitness).

I will run more stress tests, and if these results are all consistent, I will just replace the top-down
and move to part two: the incremental build.
That will be a big win for all systems.

Then back to the GPU implementation.

Re: Cuda Solver

Postby Julio Jerez » Tue Jun 21, 2022 3:43 pm

JoshKlint wrote: Something you might want to consider is a dynamically resizing hierarchy structure, to eliminate the use of world boundaries.

Newton does not have world boundaries; it never had them after the second release.

Re: Cuda Solver

Postby Julio Jerez » Tue Jun 21, 2022 7:23 pm

Wow, Joe.
The results are far better than my wildest expectations. Some of it is good, and some of it can be really good. We may have stumbled onto a milestone here.

Here is a picture of the results, with a scene of 700 capsules moving around:

[attachment: Untitled.png — profiler capture]

As you can see, the bottom-up build is about three times faster with four threads.
In fact, since about half of it is still sequential, this can be even faster once I remove that part.
With one thread it is about twice as slow, so a case can be made about core count.
The scaling with cores is quite good for the bottom-up.

The one con is that the generated tree is not as tight when building the large scenes,
so that's a point against, but I believe it can afford that cost if it in fact has so much in its favor.

Basically, the polished version should practically disappear from the profiler.

Also, you can see that the top-down, even though it is also parallel, does not yield the expected parallelism because, as I said before, as it goes down the parallel jobs do not have enough workload.

So my expectation is that a GPU version should practically eliminate that concern.

Re: Cuda Solver

Postby JoeJ » Wed Jun 22, 2022 12:12 pm

Interesting. I looked up the implementation, but it did not ring any bells.

I wonder, how do you cluster a set of nodes at the bottom?
Any method I could think of would not be parallel-friendly at all, beginning with the problem of nodes being assigned to multiple clusters.
Do you use some world-space grid to get explicit cluster assignments?

Re: Cuda Solver

Postby Julio Jerez » Wed Jun 22, 2022 1:56 pm

I ran more tests last night and added the rest of the parallelization.
The results dealt a death blow to the top-down method: it is about five times faster with four threads. This seems to indicate that even the single-threaded version would be very competitive.

If we add to that the fact that the top-down also has some passes to precondition the data, which are not tallied in the profiler, the bottom-up comes out even faster by comparison.

The preconditioning is stuff like one pass to filter objects which are very large, so they are located near the root of the tree. Imagine a tree in which you have a very big object and very small ones, like that scene. A naïve top-down build will place the big object anywhere in the tree. Imagine you get bad luck and the big box ends up at the bottom. That means the box immediately on top of it will be very large, and so will its parent, and so on. That produces a very poor tree.
Now imagine you classify the items by size and make one tree with the very big objects; then you add, as one node, the tree of the small objects.
Now you get the large items at the top, and the traversal tests far fewer objects.
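In sketch form (all names are hypothetical placeholders, not Newton's actual types):

Code: Select all
#include <algorithm>
#include <vector>

struct Node
{
   float m_minBox[3];
   float m_maxBox[3];
   Node* m_left = nullptr;
   Node* m_right = nullptr;
};

Node* BuildTree(std::vector<Node*>& leaves);   // placeholder: any top-down builder

// Partition the leaves by size and build the small-object tree first; its
// root then enters the big-object build as one more "leaf", so the very
// large bodies always sit within a few hops of the root.
Node* BuildWithSizeFilter(std::vector<Node*>& leaves, float bigThreshold)
{
   std::vector<Node*> big;
   std::vector<Node*> small;
   for (Node* const leaf : leaves)
   {
      const float dx = leaf->m_maxBox[0] - leaf->m_minBox[0];
      const float dy = leaf->m_maxBox[1] - leaf->m_minBox[1];
      const float dz = leaf->m_maxBox[2] - leaf->m_minBox[2];
      const float maxDim = std::max(dx, std::max(dy, dz));
      (maxDim > bigThreshold ? big : small).push_back(leaf);
   }
   if (!small.empty())
   {
      big.push_back(BuildTree(small));   // the whole small-object tree is one node
   }
   return BuildTree(big);
}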

This is similar to when people make octrees or quadtrees and promote large bodies to the top level if they go across a cell edge, but in an octree that's a terrible idea for many reasons:
1- it makes the tree top-heavy,
2- it does not consider the object size, so even a small object gets promoted to the top if it goes across cell edges,
3- maintenance of octrees is quite expensive for that reason.
For these reasons, since Newton 1.00 I abandoned the use of octrees.
Anyway, below is a picture of what I am trying to explain.

[attachment: Untitled.png — diagram of a big object placed deep in the tree]


The bottom-up build does not need those pre-passes, since a large object will not fit in the small box resolution, so it will not be considered in that pass until the resolution increases.

JoeJ wrote: I wonder, how do you cluster a set of nodes at the bottom?
Any method I could think of would not be parallel-friendly at all, beginning with the problem of nodes being assigned to multiple clusters.

On the contrary, Joe: bottom-up algorithms are more parallel-friendly than top-down.
The reason is that a bottom-up method divides the problem into a series of small problems of equal complexity which are independent of each other.

The way my method works is as follows:

1- Initialization: calculate the resolution of the items (this is the quantized AABB size of the smallest box); at the same time calculate the AABB of each item and quantize the world box size.
2- Make a virtual grid over the world-size box at that resolution.
3- In a vector, determine the position of each item on the grid. Some items will be fully contained in a cell and some won't.
4- Classify the item array in three groups: the ones that are already linked by a parent node, the ones that are fully contained by a grid cell, and the ones that go across cells.
5- Sorting that array, you get the set of items that are fully contained in one cell; this array will have several runs of items which all fall in the same cell.
6- Flag all those items as linked, and form a new item using a brute-force top-down build of a small tree; the new item is added to the end of the array as an unlinked item.
7- Double the resolution and go to step 3.

This repeats until the grid resolution is the size of the world box, or until the array reduces to only one item.

There are a few caveats and edge cases, but that's the general algorithm.
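In compact sketch form (hypothetical names throughout; the real code sorts by grid id rather than using a map, and BuildSmallTopDownTree is left as a placeholder):

Code: Select all
#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

struct Item
{
   float m_minBox[3];
   float m_maxBox[3];
   bool m_linked = false;
};

// An item is fully contained if min and max land in the same cell on every axis.
static bool FitsInCell(const Item& item, float cellSize)
{
   for (int i = 0; i < 3; ++i)
   {
      if (std::floor(item.m_minBox[i] / cellSize) != std::floor(item.m_maxBox[i] / cellSize))
      {
         return false;
      }
   }
   return true;
}

// Pack the cell coordinates of the item's min corner into a single sort key.
static uint64_t CellId(const Item& item, float cellSize)
{
   const uint64_t x = (uint64_t)(int64_t)std::floor(item.m_minBox[0] / cellSize) & 0xFFFFF;
   const uint64_t y = (uint64_t)(int64_t)std::floor(item.m_minBox[1] / cellSize) & 0xFFFFF;
   const uint64_t z = (uint64_t)(int64_t)std::floor(item.m_minBox[2] / cellSize) & 0xFFFFF;
   return (x << 40) | (y << 20) | z;
}

Item* BuildSmallTopDownTree(std::vector<Item*>& group);   // placeholder: marks children as linked

void BuildBottomUp(std::vector<Item*>& items, float smallestSize, float worldSize)
{
   for (float cellSize = smallestSize; cellSize <= worldSize; cellSize *= 2.0f)
   {
      // Throw the net: bucket the unlinked items fully contained in one cell.
      std::map<uint64_t, std::vector<Item*>> cells;
      for (Item* const item : items)
      {
         if (!item->m_linked && FitsInCell(*item, cellSize))
         {
            cells[CellId(*item, cellSize)].push_back(item);
         }
      }
      // Each occupied cell becomes a small brute-force subtree; its root goes
      // back into the pool as a new, bigger item for the next, coarser pass.
      for (auto& cell : cells)
      {
         if (cell.second.size() > 1)
         {
            items.push_back(BuildSmallTopDownTree(cell.second));
         }
      }
   }
}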

Re: Cuda Solver

Postby Julio Jerez » Wed Jun 22, 2022 3:27 pm

Oh man, I have some really bad news. But first the good news:
here is the performance in the scene with 32k items.
[attachment: Untitled.png — profiler capture, 32k items]


It is just a blowout for the top-down method, so the bottom-up wins performance-wise hands down.

The bad news is that the quality of the tree is really bad. That's a big red flag.
The second problem is that the number of items that get accumulated as the grid resolution is enlarged is much larger than I expected.

For example, in that scene the last box can contain 32 x 32 x 32 items if they all go across the edges,
and that's too many for a single core to generate the brute-force tree.
That part I can probably make better, but the tree quality is a big concern.

So it is not an all-rosy picture.

Re: Cuda Solver

Postby Julio Jerez » Wed Jun 22, 2022 5:39 pm

Wow, Joe, I take that last comment back.
It turned out there was a bug, and I fixed it; not only is it faster, but so far it also generates much tighter trees, by a big margin. It went from about 4 to 5 times faster to about 10 to 15 times faster with four threads.

I am posting the capture, because this is not to be believed.
[attachment: Untitled.png — profiler capture after the bug fix]


Now what is left is to stress test it before starting to write the GPU version.

Re: Cuda Solver

Postby JoeJ » Fri Jun 24, 2022 8:17 am

Sounds like good news :)

But I still fail to understand, with most questions arising here:
4- Classify the item array in three groups: the ones that are already linked by a parent node, the ones that are fully contained by a grid cell, and the ones that go across cells.


I imagine at this point we have just one global grid, with the cell size determined from the smallest object and the extent large enough to cover all objects.

But if so, all objects would be 'fully contained'. What exactly do you mean by this? Objects which fit inside a single cell?

And how could objects already have a parent node at this time? It's not yet clear how you establish any parent relationship at all. I guess this happens later, so there are no parents in the first iteration, but I still don't get how it happens later either.

7- Double the resolution and go to step 3.

Likely you meant 'halve the (grid) resolution'?

If you don't mind, could you elaborate in more detail? I just don't understand the overall concept of forming groups within a hierarchy.

Re: Cuda Solver

Postby JoeJ » Fri Jun 24, 2022 8:36 am

This is similar to when people make octrees or quadtrees and promote large bodies to the top level if they go across a cell edge, but in an octree that's a terrible idea for many reasons:
1- it makes the tree top-heavy,
2- it does not consider the object size, so even a small object gets promoted to the top if it goes across cell edges.


But aren't those problems solved by using loose octrees?
I came up with this myself when I made my own physics engine; later I learned it's not a new idea but is called a 'loose' octree.

Here, the size of the object gives the tree level we will store it in.
E.g. if the largest dimension of the object box is 100, it will go into the level of hierarchy where the cells have a size of 128, but not into other levels of sizes 64 or 256.
We also put the object into the single cell which covers the center of the object, so objects never go to multiple cells.
This means objects won't be bounded by the cell box. But if we double the cell box size, they are guaranteed to be bounded.
If we do some range query, we thus need to double the node boxes to see if they intersect our query region, or we extend the query region individually per tree level.

This worked well for me. It sits somewhere between a traditional octree and a BVH, but still needs no individual bounding box per node, as we can calculate the worst-case bounding box by doubling the implicit box given by the cell coords and hierarchy level.
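In code, the insertion rule is roughly this (a minimal sketch with made-up names, assuming power-of-two cell sizes):

Code: Select all
#include <algorithm>
#include <cmath>

struct LooseCell { int m_level; int m_x; int m_y; int m_z; };

// The object's largest dimension picks the level (e.g. 100 -> cell size 128),
// and the single cell covering the object's center picks the cell.
LooseCell Classify(const float boxMin[3], const float boxMax[3])
{
   const float sx = boxMax[0] - boxMin[0];
   const float sy = boxMax[1] - boxMin[1];
   const float sz = boxMax[2] - boxMin[2];
   const float maxDim = std::max(sx, std::max(sy, sz));

   // Smallest power-of-two cell size that is >= maxDim.
   const int level = (int)std::ceil(std::log2(maxDim));
   const float cellSize = std::exp2((float)level);

   LooseCell cell;
   cell.m_level = level;
   cell.m_x = (int)std::floor((boxMin[0] + sx * 0.5f) / cellSize);
   cell.m_y = (int)std::floor((boxMin[1] + sy * 0.5f) / cellSize);
   cell.m_z = (int)std::floor((boxMin[2] + sz * 0.5f) / cellSize);
   return cell;
}

Since an object can overhang its cell by up to half a cell on each side, a query tests against the cell box doubled, exactly as described above.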

Btw, back then I was crazy enough to store 26 neighbour pointers per node :D
I could just have used a regular multi-level grid instead, I guess :D
But it was cool and gave runtime cost linear in body count, so I was happy.

Re: Cuda Solver

Postby Julio Jerez » Fri Jun 24, 2022 2:21 pm

OK, I have now made the new builder the default scene manager.
When I said in step 7 'double the resolution', I meant double the grid cell size.

A simple explanation can be as follows.
Imagine you have a bunch of items distributed in space; say they are fish in a pond.
Imagine you are fishing with a net.

The net will capture only the fish that fit inside one cell of the net without touching the cell edges.
It will also capture fish that are larger than the net's mesh, if they touch an edge but are no larger than the next, coarser net.

Now, after you lift the net from the water, there will be a group of fish inside the net and a group of fish still in the pond.

Now you take all the cells in the net and identify them by their grid id.
This will give you an array of cells, each with a set of fish.

You take each cell and build a top-down BVH; this is a brute-force build because the number of fish is very small. This can also be run in parallel.

Now, in each cell, the root of the small BVH does not have a parent, but all its children do.
So you take the root, use its bounding box to make a new fish the size of that bounding box, and you return those back to the pond.

Now what you have is a pond with fewer, bigger fish.

So now you take a different net, in which the cells are twice the size of the first net's, and you throw it in the pond.

This time it will do the same, but it will capture the fish that are larger, plus the new fish that were returned to the pond on the first pass; as far as the net knows, those are just fish.

This process repeats until all the fish become one whale the size of the pond, which means the net is one single cell and will capture only one fish.

Maybe I should name this the fish net algorithm. It is remarkably effective so far; it surpassed my very best hopes.

I was only hoping it was embarrassingly parallel so that it could be ported to the GPU, but as it turned out it is not only that: it generates better-quality, better-balanced BVHs, plus it can be made incremental, which solves the maintenance problem.

Maintenance is the biggest problem I have with octrees. As items move from cell to cell, the amount of entropy increases a lot, and in the octree case it starts to add a lot of dangling nodes.

For a BVH this is a problem as well, but not as severe; in a BVH it can be solved with node rotations, but only for small movements. When items start to move large distances, the BVH becomes very unbalanced and generates more pairs than it should.

The way I maintain it now is that after 250 updates I rebuild it from scratch; but tree rotations are too complex for the GPU, and so it would have to rebuild more frequently.
And that is where the incremental build becomes the biggest winner.

Imagine having two BVHs: one is current and is not maintained at all.

The second is under construction.
The construction is a state machine that executes one pass per state.
The first state prepares the data;
then each subsequent state is one pass (throwing the net and converting the captured fish into larger fish).
The final stage is a new BVH; but since this one was built from a snapshot of the scene a few frames behind, the final pass is the one that swaps the BVHs and goes over the tree enlarging the nodes along the root paths to match the current sizes. This is the same process that the tree rotation does.

But now all of this can be done on the CPU and the GPU, so the GPU scene will have both.
This is important because there is a lot of functionality that is just too hard to do on the GPU, and even if you do it, it defeats the purpose of acceleration, since the application needs the result immediately.

For example, imagine an app doing a ray cast, or a vehicle checking some contacts.
If the GPU scene is a shadow of what is on the CPU, the CPU can simply calculate the pairs for that query and execute it on the CPU, while the majority of the heavy lifting is done by the GPU.
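As a rough sketch of that state machine (all names made up, one state advancing per physics frame):

Code: Select all
#include <utility>

// Hypothetical double-buffered incremental build, as described above.
class IncrementalBuilder
{
   public:
   enum State { m_takeSnapshot, m_netPass, m_swapAndPatch };

   // Called once per physics frame; each call advances the build a little.
   void Update()
   {
      switch (m_state)
      {
         case m_takeSnapshot:
            SnapshotScene();   // copy the body boxes; the build works on this snapshot
            m_state = m_netPass;
            break;

         case m_netPass:
            // One "net throw" per frame: cluster, build the small subtrees,
            // then double the grid size for the next frame.
            if (RunOneNetPass())
            {
               m_state = m_swapAndPatch;
            }
            break;

         case m_swapAndPatch:
            // The new tree was built from a snapshot a few frames old: swap it
            // in and enlarge the boxes along the root paths to the live sizes.
            std::swap(m_currentTree, m_buildingTree);
            EnlargeStaleBoxes();
            m_state = m_takeSnapshot;
            break;
      }
   }

   private:
   void SnapshotScene() {}                 // placeholder
   bool RunOneNetPass() { return true; }   // placeholder: true when the last pass ran
   void EnlargeStaleBoxes() {}             // placeholder

   State m_state = m_takeSnapshot;
   void* m_currentTree = nullptr;          // placeholder tree handles
   void* m_buildingTree = nullptr;
};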

Re: Cuda Solver

Postby JoeJ » Sat Jun 25, 2022 5:26 am

Now I can picture it much better, thanks for the explanation.
And I can rephrase my question:
How do you decide where to throw a net?

It's easy if we rely on some given discrete space partition, like an octree has.
But it's hard if we want clustering that depends only on the fish, without any given world-space grid.

The other question is: how do you find nearby fish without having an acceleration structure yet?
You could start with a global sort by Morton code, but this has similar issues as 'bad' octree approaches.
So I guess you use the acceleration structure from the previous step, which should work much better?
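For reference, this is the standard 30-bit Morton encoding such a sort would use (the usual bit-interleaving trick, nothing Newton-specific; inputs are assumed pre-clamped to [0,1]):

Code: Select all
#include <cstdint>

// Expand 10 bits so two zero bits separate each original bit.
static uint32_t ExpandBits(uint32_t v)
{
   v = (v * 0x00010001u) & 0xFF0000FFu;
   v = (v * 0x00000101u) & 0x0F00F00Fu;
   v = (v * 0x00000011u) & 0xC30C30C3u;
   v = (v * 0x00000005u) & 0x49249249u;
   return v;
}

// 30-bit Morton code for a point with coordinates already normalized to [0,1].
static uint32_t Morton3D(float x, float y, float z)
{
   const uint32_t xi = (uint32_t)(x * 1023.0f);
   const uint32_t yi = (uint32_t)(y * 1023.0f);
   const uint32_t zi = (uint32_t)(z * 1023.0f);
   return (ExpandBits(xi) << 2) | (ExpandBits(yi) << 1) | ExpandBits(zi);
}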

Imagine having two BVHs: one is current and is not maintained at all.

The second is under construction.
The construction is a state machine that executes one pass per state.

I see that's the big win here. That's how stuff should work. I like it. :D

Maintenance is the biggest problem I have with octrees. As items move from cell to cell, the amount of entropy increases a lot, and in the octree case it starts to add a lot of dangling nodes.

Yeah, the last time I had this issue was with the sparse grid for fluid.
But interestingly it's negligible. As the simulation is smooth and the timesteps are small, the number of nodes that become empty, or new nodes which need to be created, is a small double-digit number even with hundreds of thousands of particles. I was very surprised by this, and all other parts of the grid update can be multi-threaded easily.
But just saying. I don't want to convince you about any alternatives.

Re: Cuda Solver

Postby JoshKlint » Sun Jun 26, 2022 12:11 pm

Julio Jerez wrote:
JoshKlint wrote: Something you might want to consider is a dynamically resizing hierarchy structure, to eliminate the use of world boundaries.

Newton does not have world boundaries; it never had them after the second release.

Wow, I keep learning new things every day. :oops: :mrgreen:

Re: Cuda Solver

Postby Julio Jerez » Sun Aug 07, 2022 12:48 am

I posted this question about four months ago on the Nvidia forum:
https://forums.developer.nvidia.com/t/w ... ble/215471

I got an answer today. It kind of brings a new meaning to the phrase
"better late than never".
