Cuda Solver

A place to discuss everything related to Newton Dynamics.

Moderators: Sascha Willems, walaber

Re: Cuda Solver

Postby Bird » Mon May 30, 2022 6:59 pm

Okay I updated to driver 516.01 and Newton is working fine now.

Physics time is about 5 ms
GPU time is about 0.33 ms
Bird
 
Posts: 623
Joined: Tue Nov 22, 2011 1:27 am

Re: Cuda Solver

Postby Julio Jerez » Mon May 30, 2022 7:05 pm

Awesome, that gives me hope.

The GPU time is not accurate.
Nvidia has three ways to measure GPU time, and all of them are flawed.
Later I will add a macro, placed in each kernel, that measures the time of each kernel using clock64().

clock64() and events are useless when used across kernels, since you are at the mercy of the dispatcher, and that changes from run to run.

I have no idea how they measure frame rate in games, because that timing machinery is by far the most inaccurate tooling I have ever seen.

But I think adding the ticks per kernel will provide a good approximation that factors out the dispatch time. I guess that is better than nothing.
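
For reference, a minimal sketch of the kind of per-kernel clock64() macro described above (hypothetical names, not the actual engine code; clock64() is a per-SM cycle counter, so timing only block 0 is a rough approximation of a kernel's duration):

// Minimal sketch: each kernel accumulates its own elapsed clocks into a
// device counter with clock64(); the host reads the total back afterwards.
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned long long g_kernelTicks = 0;

#define D_BEGIN_KERNEL_TIMER() long long dTimerStart__ = clock64()
#define D_END_KERNEL_TIMER()                                               \
    if ((threadIdx.x == 0) && (blockIdx.x == 0))                           \
        atomicAdd(&g_kernelTicks, (unsigned long long)(clock64() - dTimerStart__))

__global__ void ScaleKernel(float* data, int count)
{
    D_BEGIN_KERNEL_TIMER();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        data[i] = data[i] * 2.0f;
    D_END_KERNEL_TIMER();
}

int main()
{
    const int count = 1 << 20;
    float* data = nullptr;
    cudaMalloc(&data, count * sizeof(float));

    ScaleKernel<<<(count + 255) / 256, 256>>>(data, count);
    cudaDeviceSynchronize();

    unsigned long long ticks = 0;
    cudaMemcpyFromSymbol(&ticks, g_kernelTicks, sizeof(ticks));
    printf("accumulated kernel ticks: %llu\n", ticks);

    cudaFree(data);
    return 0;
}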
Julio Jerez
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Julio Jerez » Mon May 30, 2022 7:19 pm

I posted this over on the Nvidia forum:

https://forums.developer.nvidia.com/t/c ... ted/215900
and this is the kind of * answer you get.

Basically they just tell you tough luck. The whole idea of going to dynamic parallelism is to reduce the number of calls, and now it turns out that if you need to get the results, you have to issue individual calls.
Julio Jerez
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Bird » Mon May 30, 2022 7:23 pm

The author of the OptiX7 wrapper that I use made this GPU timer, which seems to work well:

https://gist.github.com/Hurleyworks/eb2 ... 1151533ff9
Bird
 
Posts: 623
Joined: Tue Nov 22, 2011 1:27 am

Re: Cuda Solver

Postby JoeJ » Tue May 31, 2022 4:44 am

Julio Jerez wrote:I posted this over on the Nvidia forum:

https://forums.developer.nvidia.com/t/c ... ted/215900
and this is the kind of * answer you get.

Basically they just tell you tough luck. The whole idea of going to dynamic parallelism is to reduce the number of calls, and now it turns out that if you need to get the results, you have to issue individual calls.


Hmm, in this case I do not see the advantage, because I could build the same working solution with compute shaders. This is how it would look:

Make command list:
Dispatch 1: Calculate the workload, store the number on VRAM
barrier()
Indirect Dispatch 2: Process the work
barrier()
Dispatch 3: Now we can use the results
barrier()

Upload the command list at application start, then process it each frame. Because the commands are already on the GPU, no CPU roundtrips are needed, and so we can run our whole engine with a single draw call.
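
For comparison, the closest CUDA-side analogue to such a prerecorded command list is probably a captured graph. A rough sketch with placeholder kernels (not the Vulkan setup above; note the grid sizes get baked in at capture time, so the "indirect" part is emulated by launching an upper bound and letting extra threads exit early):

// Record a fixed sequence of kernels once into a CUDA graph, then replay the
// whole sequence each frame with a single cudaGraphLaunch call.
#include <cuda_runtime.h>

__global__ void ComputeWorkload(int* workCount) { if (threadIdx.x == 0) *workCount = 1024; }
__global__ void ProcessWork(const int* workCount, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < *workCount)            // "indirect" part: extra threads just exit
        out[i] = float(i);
}
__global__ void UseResults(float* out) { out[0] += 1.0f; }

int main()
{
    int* workCount = nullptr;
    float* out = nullptr;
    cudaMalloc(&workCount, sizeof(int));
    cudaMalloc(&out, 4096 * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the three dispatches once, at application start.
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    ComputeWorkload<<<1, 32, 0, stream>>>(workCount);
    ProcessWork<<<4096 / 256, 256, 0, stream>>>(workCount, out);   // upper-bound grid
    UseResults<<<1, 1, 0, stream>>>(out);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    // Per frame: one CPU call replays the whole sequence.
    for (int frame = 0; frame < 3; frame++)
        cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaFree(out);
    cudaFree(workCount);
    return 0;
}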

Imo that's fine; we can not expect to launch a big workload and then return to the same calling kernel. To make that work, they would need to cache the state of the calling kernel to VRAM so the GPU has all its resources available to process the big workload, and after it's done, reload the state of the calling kernel and continue executing it with the now available results.

That would be convenient, but it does not give any performance win, and we can just use a new kernel to react to the results. So I agree with the answer you got; it makes sense to me.
Maybe Dynamic Dispatch is implemented similarly to NV's device-generated command lists Vulkan extension, which allows creating command lists like the one above directly on the GPU, with the same options and restrictions.
The remaining problem with all those methods is zero workloads. If we have no work to process, we still need to execute the memory barrier, flushing GPU caches and causing a sync for no reason. I discussed this issue with a dev of NV's extension, and he said he might add the necessary tokens so we could skip over unneeded barriers. I have not checked whether they have improved this yet.

The only API I know of that could properly skip (or loop) over sections of a command buffer, including barriers, was AMD's Mantle. Vulkan recently adopted parts of this idea, calling it 'Conditional Draws', but again the conditions can not include barriers.

Either they are just dumb, or - and that's what I think - some GPU hardware simply can not implement such specs. Which reminds me of OpenCL 2.0, which was ignored by a certain vendor.

But I have said this before. :)
What I would expect from 'kernels can call kernels' is more a kind of subroutine, as we see in mesh shaders or in the raytracing APIs. There only a small workload is processed, so they can just pause the calling shader but keep its resources on chip.
This is surely difficult already, as it's probably hard to schedule multiple kernels, each with a different precomputed register allocation and LDS budget, in an optimal way. I can see that giving us fine-grained flexibility is hard.

But they should be able to give us conditional command list execution. There is no excuse for failing at such coarse functionality. :evil:

I wonder about your problems regarding profiling. In Vulkan, getting timestamps is easy and they are super accurate.
JoeJ
 
Posts: 1453
Joined: Tue Dec 21, 2010 6:18 pm

Re: Cuda Solver

Postby Julio Jerez » Tue May 31, 2022 6:49 am

Either that dude is just a moron who does not know what he is talking about, or someone in charge at Nvidia is sabotaging the entire thing.

The whole idea of dynamic parallelism is to move kernel logic to the GPU to minimize the number of CPU dispatches.
But if they remove the synchronization, then it is limited to a very small subset of algorithms, mainly only algorithms that are data parallel by nature. Everything else is not possible.

Take for example a quick sort.
In each pass you need to split an array into two sub-arrays in a child kernel.
But you do not know the split index; if there is no synchronization, then bye bye to any divide and conquer algorithm, because the only way to get the result of the split pass is at the CPU level.
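
To make the issue concrete, here is a schematic of that scenario (placeholder code, not the engine's; the device-side cudaDeviceSynchronize() below is exactly the call being deprecated; build with -rdc=true, and on a CUDA 11.x toolkit it still compiles, with a warning):

// The parent launches a parallel partition pass as a child grid, but it needs
// the split index back before it can recurse on the two halves. Without
// device-side synchronization that read is not possible.
#include <cuda_runtime.h>

__global__ void CountLeftOfPivot(const float* data, int lo, int hi, float pivot, int* splitIndex)
{
    int i = lo + blockIdx.x * blockDim.x + threadIdx.x;
    if ((i < hi) && (data[i] < pivot))
        atomicAdd(splitIndex, 1);
}

// Launched with a single thread; it only coordinates child grids.
__global__ void QuickSortStep(float* data, int lo, int hi, int* splitIndex, int depth)
{
    if ((hi - lo < 2) || (depth > 8))   // a real version falls back to a device sort here
        return;
    float pivot = data[lo];
    *splitIndex = lo;                   // visible to the child grid below
    int count = hi - lo;
    CountLeftOfPivot<<<(count + 255) / 256, 256>>>(data, lo, hi, pivot, splitIndex);
    cudaDeviceSynchronize();            // deprecated in device code: without it the
                                        // parent has no way to read *splitIndex here
    int split = *splitIndex;
    // (the element shuffle that actually moves the pivot to 'split' is omitted)
    QuickSortStep<<<1, 1>>>(data, lo, split, splitIndex, depth + 1);
    QuickSortStep<<<1, 1>>>(data, split + 1, hi, splitIndex, depth + 1);
}

int main()
{
    const int count = 1 << 14;
    float* data = nullptr;
    int* splitIndex = nullptr;
    cudaMalloc(&data, count * sizeof(float));
    cudaMemset(data, 0, count * sizeof(float));
    cudaMalloc(&splitIndex, sizeof(int));
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 10);   // allow nested device sync
    QuickSortStep<<<1, 1>>>(data, 0, count, splitIndex, 0);
    cudaDeviceSynchronize();
    cudaFree(splitIndex);
    cudaFree(data);
    return 0;
}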

It is a retarded move to remove any kind of synchronization for child kernels.
In fact I have seen many of the demos in the CUDA samples and they still use it.
There are entire presentations by Nvidia where the keynote point is to show how cool that functionality is.

To me that dude does not really know what he is talking about.

What I think is that they are privately making a more specific sync function, because the generic one is not quite right: it syncs all kernels in flight, and what is needed is one that syncs only the child kernels.
Without that, the usefulness of streams is also very limited.

To this day I have not found a single useful answer in that forum. In fact it is as if they purposely mislead the users.
Julio Jerez
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Julio Jerez » Tue May 31, 2022 7:02 am

The problem with timing is that in the engine
an update happens at a fixed rate, but the GPU is also running a bunch of OpenGL shaders.

So the timeline in Newton is not continuous; it has big silent gaps, and the driver decides when to run the kernels.
For example say you have three kernels a, b, c.

The sequence in the timeline ideally should be

a b c -------- a b c ------- a b c -------

So if you take the time at the beginning of kernel a, and then again at the end of kernel c, you should get an accurate timing.
But that is not how the driver runs the kernels. The kernels run sequentially, but the gaps land arbitrarily anywhere. What you get is something like

a ---- b c -- a b -------c a ------ b -- c --- a b ------ c

And that makes it impossible to get an accurate timing across kernels.
The dashes are the places where the driver inserts the graphics shaders running in OpenGL. So essentially what you get is the timing from frame to frame, and there is no way around that.
Once the silent gaps are very large, it does not matter what method you use to measure time across kernels, since the app can not control when the gaps are going to happen.

I did use events to measure time and it was worse, since an event has the effect of blocking the CPU. So I ended up with two syncs instead of one, and the frame time was about 22 ms, just the render time.

This applies to all Nvidia methods.
It seems the only way to measure the time accurately is by using clock64() in each kernel and adding the results together,
very much the same way a profiler instruments C++ code.
Other than that, no matter how you measure, Nvidia will give you the frame time, which will always be whatever was taking the most time: graphics, physics, or whatever else was using the GPU.
Julio Jerez
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby JoeJ » Tue May 31, 2022 9:07 am

Julio Jerez wrote:The whole idea of dynamic parallelism is to move kernel logic to the GPU to minimize the number of CPU dispatches.
But if they remove the synchronization, then it is limited to a very small subset of algorithms, mainly only algorithms that are data parallel by nature. Everything else is not possible.


I guess this means you have to launch every little kernel from the CPU? No support for packing your pipeline into a GPU command list?
This really * for realtime, if so.
It is like this with OpenCL 1.x, and that's the reason I got a 2x speedup after moving to Vulkan.

Julio Jerez wrote:There are entire presentations by Nvidia where the keynote point is to show how cool that functionality is.

Snake oil is cool too :mrgreen:

Maybe you will still discover some good solution.
But maybe the focus is just business and enterprise, not realtime.

Julio Jerez wrote:But the GPU is also running a bunch of OpenGL shaders.

Ha, ok, this would obscure my Vulkan timestamps as well.
But couldn't you just disable rendering while profiling?

However, I just recently saw another guy posting and asking about mysterious gaps on his GPU.
But nobody could really help.

Well, it's just no longer our business what work our computers do.
They know better than we do what we want.
Doing updates, HD indexing, uploading our favorite brands of dog food and knickers, encoding streamed gameplay... all of this is important background work.
JoeJ
 
Posts: 1453
Joined: Tue Dec 21, 2010 6:18 pm

Re: Cuda Solver

Postby Julio Jerez » Tue May 31, 2022 10:36 am

JoeJ wrote:I guess this means you have to launch every little kernel from the CPU? No support for packing your pipeline into a GPU command list?


It is not completely useless; it is still possible to package a few calls into a single function call.
It is just that you can not get the result of a child kernel in the parent. But if two child kernels run on the same stream, they run sequentially, so one child kernel can put results in a vector and the next one can read them.

That's still better than not having the feature.
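
A small sketch of that pattern (hypothetical kernels, not Newton's code): two child grids launched into the same default stream are ordered, so the second one can consume what the first one produced, even though the parent itself never reads the intermediate result.

// The parent only coordinates. BuildPairList writes a compacted list plus a
// count; ResolvePairs is launched into the same stream, so it starts only
// after BuildPairList is done and can read both safely. Build with -rdc=true
// (this relies on same-stream ordering of device launches, as on CUDA 11.x).
#include <cuda_runtime.h>

__global__ void BuildPairList(const float* bodies, int count, int* pairs, int* pairCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i < count) && (bodies[i] > 0.0f))            // placeholder broad-phase test
        pairs[atomicAdd(pairCount, 1)] = i;
}

__global__ void ResolvePairs(const int* pairs, const int* pairCount, float* bodies)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < *pairCount)                               // upper-bound launch, early out
        bodies[pairs[i]] -= 1.0f;                     // placeholder narrow-phase work
}

// Launched with a single thread; it only coordinates the child grids.
__global__ void UpdateStep(float* bodies, int count, int* pairs, int* pairCount)
{
    *pairCount = 0;                                   // visible to the children below
    int blocks = (count + 255) / 256;
    BuildPairList<<<blocks, 256>>>(bodies, count, pairs, pairCount);
    ResolvePairs<<<blocks, 256>>>(pairs, pairCount, bodies);  // same stream: starts
                                                              // after BuildPairList ends
}

int main()
{
    const int count = 1 << 15;
    float* bodies = nullptr;
    int* pairs = nullptr;
    int* pairCount = nullptr;
    cudaMalloc(&bodies, count * sizeof(float));
    cudaMemset(bodies, 0, count * sizeof(float));
    cudaMalloc(&pairs, count * sizeof(int));
    cudaMalloc(&pairCount, sizeof(int));
    UpdateStep<<<1, 1>>>(bodies, count, pairs, pairCount);
    cudaDeviceSynchronize();
    cudaFree(pairCount);
    cudaFree(pairs);
    cudaFree(bodies);
    return 0;
}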

What scares me is if they decide to cut that too.

If you read their sample demos, they all still do it that way, and if you read the blogs about dynamic parallelism, especially the ones by one of their big kahunas, Mark Harris,
they all say that this function has to be used.

But what bothers me the most is how misleading and frankly dishonest they are when they just say "this is deprecated" but nowhere say the functionality is no longer supported.

That kind of omission is not just misleading; it is borderline a lie that costs developers hundreds of hours of research, and a lot of money.

We already saw that all it takes is making a function call on one driver and having the next or previous generation fail.

The fact that something works with one driver and fails on others is a very strong indication that this is a software issue that someone pulled out of his ass.

It seems the Nvidia software department is more like amateur hour at the comedy club than a professional team like Apple or Microsoft.
There are too many issues for these to be simple mistakes.
Julio Jerez
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby JoeJ » Tue May 31, 2022 12:43 pm

Julio Jerez wrote:That kind of omission is not just misleading; it is borderline a lie that costs developers hundreds of hours of research, and a lot of money.


Hehe, and after that the dev is so desperate and mad that other devs do not take him seriously, but rather continue to believe in the shiny marketing lies and progressive expertise of NV.

'It just works' :mrgreen: :mrgreen: :mrgreen:

Julio Jerez wrote:a professional team like Apple or Microsoft

With all respect, I do not think they do much better.
Apple would be fine, but even their Xcode is a 'click one button to do everything' bag of restrictions to feed dumb hipsters in a walled garden with gadgets.
And MS never invents anything, their bugs are permanent, and their APIs are about establishing strange conventions so people can't leave after getting used to them.

But that is just said to distribute my rant fairly among all the tech mega corps.
People often wrongly assume I am just an NV hater. 8)
JoeJ
 
Posts: 1453
Joined: Tue Dec 21, 2010 6:18 pm

Re: Cuda Solver

Postby JoshKlint » Fri Jun 17, 2022 9:14 am

Julio Jerez wrote:It seems the Nvidia software department is more like amateur hour at the comedy club than a professional team like Apple or Microsoft.

Julio, one of your big problems in life is you never say what you really think. :lol:
JoshKlint
 
Posts: 163
Joined: Sun Dec 10, 2017 8:03 pm

Re: Cuda Solver

Postby Julio Jerez » Mon Jun 20, 2022 1:49 pm

Wow,
building a good, usable bounding box hierarchy that is practical but that can be built using many cores is quite challenging.

My first thought was to just not do it and use GPU sweep and prune.
In fact that works well for stuff like fluids, because most entities are of very regular size.

I did build one, but after testing I was not happy with it, because it requires too much overhead and the result is very impractical.

So I scrapped that idea and said that if I am going to do this, I might as well build the same structure that is used on the CPU side.

At first I thought that I could simply use the CPU one and apply updates using memcpy,
but after I tested the scene with 32k bodies, it took around 40 ms, since all bodies are active, so that is not a solution.

The only solution is that there have to be two scenes, one in CPU and one in GPU, that do not see each other.

So that begs the question: how to build the scene on the GPU very fast?

I tried a few methods that actually improve the builds on the CPU, but since they are all intrinsically recursive, they do not really work on the GPU.

So the solution I am taking now is to construct the tree in a bottom-up fashion.
From what I can see, it is the only method that can capitalize on many cores.

So I decided to try that and started writing the GPU kernels, but boy, that was just a ton of aggravation.
There are too many edge cases that are extremely difficult to debug on the GPU,
so after two weeks of trying and trying I decided to implement it on the CPU first. Then, after I have it working, I will port it to the GPU.

I almost have it; the algorithm is sound. But man, since I am making it with thousands of cores in mind, it has to be non-recursive and embarrassingly parallel, and that is not trivial even on the CPU.

But I believe the effort will pay off. Later I will try it out to see how it performs compared to the current method.
Julio Jerez
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby JoeJ » Tue Jun 21, 2022 4:14 am

To build an acceleration structure on the GPU, I used an approach similar to the BVH tutorial I posted here: https://www.gamedev.net/forums/topic/712135-lots-and-lots-of-triangles-how-to-accelerate-ray-triangle-intersection/?page=2

It's a conventional top-down approach, requiring one barrier per tree level.
To minimize this, it was a win for me to build the top N levels (like 5) in a single large workgroup of 1024 threads.

There is an alternative that gets rid of barriers completely: https://developer.nvidia.com/blog/thinking-parallel-part-iii-tree-construction-gpu/
But it comes at the cost of doing more work in terms of searching.
Later papers improved this a bit.
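
The core of that barrier-free approach is to give each primitive a Morton code, sort, and then derive the hierarchy from the sorted order. For reference, the standard 30-bit Morton encoding (the usual bit-interleaving trick, essentially the same as in the linked post):

#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Expands 10 bits into 30 bits by inserting two zeros between each bit.
__host__ __device__ unsigned int ExpandBits(unsigned int v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Maps a point inside the unit cube to a 30-bit Morton code (10 bits per axis).
__host__ __device__ unsigned int Morton3D(float x, float y, float z)
{
    x = fminf(fmaxf(x * 1024.0f, 0.0f), 1023.0f);
    y = fminf(fmaxf(y * 1024.0f, 0.0f), 1023.0f);
    z = fminf(fmaxf(z * 1024.0f, 0.0f), 1023.0f);
    unsigned int xx = ExpandBits((unsigned int)x);
    unsigned int yy = ExpandBits((unsigned int)y);
    unsigned int zz = ExpandBits((unsigned int)z);
    return xx * 4 + yy * 2 + zz;
}

int main()
{
    printf("morton(0.5, 0.5, 0.5) = 0x%08x\n", Morton3D(0.5f, 0.5f, 0.5f));
    return 0;
}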

I prefer the conventional method, because we can run it async, so it might even become free in practice. At the time Nvidia proposed their solution, their GPUs could not do async compute well, so nowadays I consider the idea outdated. But I'm not sure.

For my GI stuff, BVH build costs are negligible: much less than 1 ms for 200k surfels.
But this is because I only have to build a top-level hierarchy over a precomputed BVH per model, similar to how DXR works. So I have no idea of the expected cost of building for a large number of bodies from scratch.

Btw, I think we face a growing problem of duplicated acceleration structures.
Imagine this scenario, which might happen to me:
Physics builds its own structure for collision detection.
I build my own structure for GI.
DXR builds its own structure for raytracing.
Because all this software is specialized (or even blackboxed, like DXR), it becomes hard to have just one structure for everything, although it might or should work in theory.
That kinda * :|
JoeJ
 
Posts: 1453
Joined: Tue Dec 21, 2010 6:18 pm

Re: Cuda Solver

Postby JoshKlint » Tue Jun 21, 2022 4:16 am

Something you might want to consider is a dynamically resizing hierarchy structure, to eliminate the use of world boundaries. I've been thinking about this for some of the space simulation stuff I do. My idea is that when an object goes outside the current boundaries of the octree, the eight top-level nodes of the octree then become children and another level is added to the structure. That way the scene structure can just keep dynamically resizing as needed, without recalculating the entire structure each time it gets bigger.

[Attachment: Untitled.jpg]
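
One way that growth step could look in code (a rough sketch; Vec3 and OctreeNode here are made-up placeholder types, not from any particular engine):

// When a body escapes the current root bounds, wrap the old root in a new,
// twice-as-large root so the existing tree becomes one octant of the new level.
#include <cmath>

struct Vec3 { float x, y, z; };

struct OctreeNode
{
    Vec3 center;
    float halfExtent;
    OctreeNode* children[8] = { nullptr };
};

OctreeNode* GrowRoot(OctreeNode* oldRoot, const Vec3& escapedPoint)
{
    // Grow toward the escaping point: the new root is offset by one half
    // extent along each axis, so the old root lands exactly on an octant.
    float sx = (escapedPoint.x >= oldRoot->center.x) ? 1.0f : -1.0f;
    float sy = (escapedPoint.y >= oldRoot->center.y) ? 1.0f : -1.0f;
    float sz = (escapedPoint.z >= oldRoot->center.z) ? 1.0f : -1.0f;

    OctreeNode* newRoot = new OctreeNode();
    float h = oldRoot->halfExtent;
    newRoot->halfExtent = 2.0f * h;
    newRoot->center = { oldRoot->center.x + sx * h,
                        oldRoot->center.y + sy * h,
                        oldRoot->center.z + sz * h };

    // The old root occupies the octant on the opposite side of the offset.
    int octant = ((sx < 0.0f) ? 1 : 0) | ((sy < 0.0f) ? 2 : 0) | ((sz < 0.0f) ? 4 : 0);
    newRoot->children[octant] = oldRoot;
    return newRoot;    // repeat until the escaped point is inside the new root
}

int main()
{
    OctreeNode* root = new OctreeNode{ {0.0f, 0.0f, 0.0f}, 16.0f };
    Vec3 escaped = { 40.0f, -3.0f, 5.0f };
    while (std::fabs(escaped.x - root->center.x) > root->halfExtent ||
           std::fabs(escaped.y - root->center.y) > root->halfExtent ||
           std::fabs(escaped.z - root->center.z) > root->halfExtent)
        root = GrowRoot(root, escaped);
    return 0;
}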
JoshKlint
 
Posts: 163
Joined: Sun Dec 10, 2017 8:03 pm

Re: Cuda Solver

Postby Julio Jerez » Tue Jun 21, 2022 7:19 am

JoeJ wrote:To build an acceleration structure on the GPU, I used an approach similar to the BVH tutorial I posted here: https://www.gamedev.net/forums/topic/712135-lots-and-lots-of-triangles-how-to-accelerate-ray-triangle-intersection/?page=2

It's a conventional top-down approach, requiring one barrier per tree level.


Yes, but that is precisely what I try to avoid.
Top-down build methods are intrinsically sequential and recursive, and that makes them very unsuitable for many cores.

Yes, you can put in some effort and get some parts to run on many cores, but the scaling is always poor.

An example of that is stuff like quick sort: you can do a few top-level splits, and after you have enough splits, dispatch each one to an independent thread. But the problem is that the majority of the work happens at the beginning of the construction.

On the other hand, if you use a counting sort, a radix sort, or a bitonic sort, those are naturally parallel, but they also have higher memory bandwidth requirements and time complexity.
But the premise is that as you get more and more cores, the efficiency of quick sort is beaten by a naturally parallel algorithm; then the system with thousands of cores and much larger memory bandwidth wins.

It is the same for building a BVH: the top-down method that I use is very elegant, and it has some parts that are multi-core. But if I move to the GPU, or to a system with say 32 cores, then it is not efficient, since there are parts that still have to iterate over the entire array in a single thread.

I think I now have my method, which is a full bottom-up build.
I suspect it is not as efficient as the top-down one for small core counts, and maybe even for medium core counts, say 16 or even 32.
However, if we are talking thousands of cores, then the bottom-up method approaches O(k) time complexity, and at worst O(k log(n)),
where k is large, while the top-down one is still O(k * n log(n) / p),

where p is the core count and k is much smaller.

So the point is that if the core count is the deciding factor, the GPU, with several orders of magnitude more cores, is the clear winner. If the algorithm is naturally parallel and lock free, they call that embarrassingly parallel.

I have not ported it to the GPU yet; I need to stress test it first, but so far I am happy with the preliminary results.

As for whatever it is Nvidia does, it seems they build their trees on the CPU in the driver. I have not seen any paper, but to me that suggests they too are using a top-down build method. What they accelerate with hardware is the tree traversal. So they added hardware, similar to the vertex assembly or rasterization stages of the pipeline, to generate rays, but they still use the CPU, which means ray tracing will tax the CPU for very dynamic scenes.
That may be part of the reason Nvidia is thinking of building their own CPU.

Anyway,
another point that I find very attractive is that the bottom-up method can be made incremental.

Basically you can have two trees, one active and one under construction.
In each frame you just do one pass; each pass does a partial build, and when the tree is complete the last pass adds the stack and swaps it with the old one, and then the process starts over for the old one.

This way I estimate the cost of maintaining the tree reduces to a few microseconds, less than a hundred for any size of tree, even with millions of items.

But anyway, all this is still speculative, so let us see what the actual results are.
Julio Jerez
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles
