Cuda Solver

by **Bird** » Mon Mar 21, 2022 8:58 am

Thanks for making my day!

by **Julio Jerez** » Mon Mar 21, 2022 9:52 am

Yes, after cycling through all the HPC APIs
(OpenCL, AMP, SYCL, Vulkan, OpenCL compute, DirectX 12),
I found them all quite lacking.
They are the kind of languages that take center stage in the app, and very soon you start bending your algorithm to fit the language's constraints rather than using the language as a tool.

I was one of the very early adopters of CUDA, about 15 years ago when it came out, and it was just as disappointing as those others.
But over the years CUDA has made incredible progress.
It is now legitimate C++11 and more.
It has a build system that integrates with CMake, does not need separate compilation, and has very good tools: a profiler and even a debugger, though I have not been able to get the debugger to run.

I started last week, and since last Friday I have been able to port at least one small part of the Newton solver almost verbatim to CUDA, and it worked almost the very first time.

I never thought I would say this about an NVIDIA product, but it is really remarkable.

I made the very first kernel, and I am running it side by side with the default solver to check the results.

I ran a quick test and profiled it, and the GPU time looked as if it was not working; the problem was that it was too small to measure.

I guess that CUDA will be the GPU API of choice.

It is a shame that AMD, Intel, Apple, and all these other software and hardware companies have not been able to get their act together to solve the HPC problem in more than 20 years. Instead, all they do is follow these moronic pie-in-the-sky specifications that the Khronos Group releases every year, which are all dead ends. Meanwhile NVIDIA is eating their lunch.

by **Julio Jerez** » Mon Mar 21, 2022 10:54 am

The one thing that I need to do is encapsulate the broadphase into its own class, like the solver, so that I can use different algorithms.

Right now the tree-based hierarchical AABB is very good for most typical scenes, that is, scenes where most objects do not move outside their AABB in each frame.
But for scenes where this does happen, like that GPU demo, the performance is atrocious.
For those cases a brute-force method like sweep and prune does much better.
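
For readers who have not seen it, here is a minimal sketch of a single-axis sweep and prune. It is not the Newton broadphase; the `Aabb` struct and the function names are made up for illustration, and a production version would sort incrementally and test more than one axis.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical axis-aligned box; only the fields needed for the sketch.
struct Aabb {
    float minX, minY, minZ;
    float maxX, maxY, maxZ;
};

// True when the two boxes overlap on the y and z axes.
static bool overlapYZ(const Aabb& a, const Aabb& b) {
    return a.minY <= b.maxY && b.minY <= a.maxY &&
           a.minZ <= b.maxZ && b.minZ <= a.maxZ;
}

// Single-axis sweep and prune: sort boxes by minX, then for each box scan
// forward only while the next box starts before the current one ends.
std::vector<std::pair<std::size_t, std::size_t>>
sweepAndPrune(const std::vector<Aabb>& boxes) {
    std::vector<std::size_t> order(boxes.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return boxes[a].minX < boxes[b].minX;
    });

    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < order.size(); ++i) {
        const Aabb& a = boxes[order[i]];
        assert(a.minX <= a.maxX);  // boxes are expected to be well formed
        for (std::size_t j = i + 1; j < order.size(); ++j) {
            const Aabb& b = boxes[order[j]];
            if (b.minX > a.maxX) break;  // no later box can overlap on x
            if (overlapYZ(a, b)) {
                pairs.emplace_back(std::min(order[i], order[j]),
                                   std::max(order[i], order[j]));
            }
        }
    }
    return pairs;
}
```

Each returned pair holds the indices of two boxes whose AABBs overlap, smaller index first. The whole pass is a sort followed by flat array scans, with no tree to refit when bodies move.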

I will not do it now, but after running the test it becomes more and more apparent that sequential algorithms, as smart as they are, do not do well with high-volume data, and that is an emerging truth even on PCs, which have more and more cores.

by **Bird** » Mon Mar 21, 2022 11:15 am

I've been hoping you'd end up going this route. I have to do a little Cuda programming now since OptiX exposes it and it's amazing how far it has come.

Nick, as always, has some very simple but useful videos that you might find helpful. I believe there's one there on GPU profiling and debugging.
https://www.youtube.com/c/CoffeeBeforeArch/playlists

There's been a lot of buzz lately about this NVIDIA project. And it looks like there's a lot of examples of very advanced cuda programming that's way over my head.

https://github.com/NVlabs/instant-ngp

by **Julio Jerez** » Mon Mar 21, 2022 2:30 pm

Yes, and one of the cool things about using a language that shares most of the same features is that the port can be debugged a lot more easily.
I wrote many of the support functions, and I now have the second kernel, the one that integrates the velocity, and the one that integrates the positions. The bodies rotate, but after a few frames it blows up, which means I have some bug in the math functions. But it is actually working.

On those videos: yes, I have seen many of those tutorials. They will be useful when we are at the point of optimization. It seems there is a lot of black magic involved in getting a really high-performance kernel in CUDA.
Out of curiosity I ran the profiler on the first kernel. It is so straightforward that I thought it would hit all the performance counters, but the profiler keeps saying that I only hit 77% occupancy.
That was with the first setup, but after doing what they suggest I can only get it to go lower.
In any case I am not worried about that now. I just wanted to check where I was, and if it can get better than that with tweaks, even better.

by **Julio Jerez** » Mon Mar 21, 2022 5:18 pm

Wow man.
This is not to be believed.
I completed the second kernel, and the demo has 27000 boxes spinning.

At first I did not see any difference, so I took a profile trace.
The profile only shows the time spent in the engine.
The GPU time is like microseconds.
It is mind blowing. All indications are that we have to rewrite many of the containers to be vector based.

I will try to #ifdef out as much as I can so that we can follow the progress.
But from what I can see, I have never seen such a blowout.

I think what we have to do is a dirty update system,
where we keep the flexible data structures for the CPU
but update flat vectors for the GPU.
In a way 4.00 is doing that already, but it seems this trick has to become far more important.

by **Julio Jerez** » Mon Mar 21, 2022 6:21 pm

Actually I take that back. I had the profile traces at the kernel call, which is async.
In reality it is running like 3 times slower than the same code on the CPU.
It seems to take 7.5 ms on the GPU and only 1.5 ms on the CPU.
I think it is the memory copy from and to the GPU.
When running the Nsight profiler, it tells me this:

This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

The profiler seems quite shitty. It looks much better when people use it in YouTube videos.

Anyway, I will see what else could be wrong. I need to watch some videos to see how to extract information from the profiles, because I do not understand it.

by **Julio Jerez** » Mon Mar 21, 2022 6:55 pm

The standalone profiler is so much better than the cheesy Visual Studio integration.
I took a quick capture, and it shows this:

[Attachment: Untitled.png (34.03 KiB)]

All the time is spent copying data from CPU to GPU, back and forth.
Anyway, I will continue and see where it leads.

by **Julio Jerez** » Tue Mar 22, 2022 11:27 pm

I now have a test scene with 27000 bodies.
That scene takes about 65 ms in AVX2,
but it takes 25 ms in CUDA.

The thing is that it is a naive system of the form:

load data
run shader
store data.

Load and store take about 20 times the time of the shader, so when we move all the data to GPU buffers, that scene would be around 1 to 2 ms.
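
As a rough check on that estimate, assume the 25 ms frame is nothing but shader time plus transfer time, and take the 20x ratio at face value. Both are simplifications, since the real frame also contains CPU work.

```latex
t_{\text{total}} = t_{\text{shader}} + t_{\text{copy}}, \qquad
t_{\text{copy}} \approx 20\, t_{\text{shader}}
\;\Longrightarrow\;
t_{\text{shader}} \approx \frac{25\ \text{ms}}{21} \approx 1.2\ \text{ms}
```

That lands inside the 1 to 2 ms range quoted above.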

I read a little about CUDA optimization, and from what I can see, other than using the intrinsic types like float4, there is not really much that can be done, since loading data from memory takes such a huge toll on the shader. In fact I think the shader could run in debug and it would not make much difference.

The real way to make an impact is by using what they call streams. Not sure why NVIDIA calls it a stream when everyone else calls it a command queue, but essentially this is what makes it possible to code dependencies and also run async.

Anyway, so far this is not too bad.
Next I will try to make a shadow copy of each body on the GPU; these buffers will then be updated when they are changed by the application. Then by using double buffers we can get this scene to run almost at full speed.
This PowerPoint gives an idea of how to remove or hide some data transfers:
https://developer.download.nvidia.com/C ... ebinar.pdf

by **Julio Jerez** » Wed Mar 23, 2022 12:23 pm

Now I added the very first optimization, which is to keep a shadow copy of each body in GPU memory.
Then this copy is only uploaded when the scene changes, using an indirect buffer.
That simple optimization cut the timing in half.
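
A host-side sketch of the shadow copy plus indirect buffer idea, in plain C++ so it runs without a GPU. The `Body` fields and the `ShadowBodies` class are hypothetical, not the engine's types. The `shadow_` vector stands in for the GPU-resident array; in a real port the loop in `flush()` would presumably become an upload of the packed dirty records followed by a scatter kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, trimmed-down body record; the real engine body holds far more.
struct Body {
    float position[4];
    float rotation[4];
    float veloc[4];
    float omega[4];
};

class ShadowBodies {
public:
    explicit ShadowBodies(std::size_t count)
        : bodies_(count), shadow_(count), isDirty_(count, false) {}

    // Application-side write: update the CPU body and record its index once.
    void setBody(std::size_t index, const Body& body) {
        assert(index < bodies_.size());
        bodies_[index] = body;
        if (!isDirty_[index]) {
            isDirty_[index] = true;
            dirty_.push_back(index);  // the "indirect buffer" of changed indices
        }
    }

    // Copy only the changed bodies into the shadow array.
    // Returns how many records were transferred.
    std::size_t flush() {
        const std::size_t transferred = dirty_.size();
        for (std::size_t index : dirty_) {
            shadow_[index] = bodies_[index];
            isDirty_[index] = false;
        }
        dirty_.clear();
        return transferred;
    }

    const Body& shadow(std::size_t index) const { return shadow_[index]; }

private:
    std::vector<Body> bodies_;        // flexible CPU-side data
    std::vector<Body> shadow_;        // stand-in for the GPU copy
    std::vector<bool> isDirty_;       // one flag per body
    std::vector<std::size_t> dirty_;  // indices changed since the last flush
};
```

With this shape, a frame in which nothing changed transfers zero records, which is where the saving comes from.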

[Attachment: Untitled.png (26.57 KiB)]

The second memory transfer is trickier; it needs a few stages.
1- Make the data smaller. Right now it is copying the entire body array, but what we have to do is make a selective transfer; for example, getting transforms should use an array of rotations and positions.
2- It should only transfer an indirect array of the data that changes.
3- It will use a double, maybe triple, buffer that will do async transfers, so in one frame the GPU will update one buffer while the CPU is updating data collected in the second buffer. Then the buffers swap. For that it will use the streams, or a background thread.
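
Stages 1 and 3 can be sketched on the host like this. Everything here is hypothetical naming: `Transform` is a compact position-plus-rotation record, and `TransformDoubleBuffer` only models the buffer roles. The asynchronous copy itself (a stream or a background thread) is not shown, and stage 2 is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical compact record: only what the application reads back.
struct Transform {
    float position[4];
    float rotation[4];  // quaternion
};

// The producer (standing in for the asynchronous GPU-to-host copy) fills the
// back buffer while the consumer (the CPU transform update) reads the front
// buffer; swap() flips the roles once per frame.
class TransformDoubleBuffer {
public:
    explicit TransformDoubleBuffer(std::size_t count) {
        assert(count > 0);
        buffers_[0].resize(count);  // resize value-initializes to zero
        buffers_[1].resize(count);
    }

    std::vector<Transform>& back() { return buffers_[1 - front_]; }
    const std::vector<Transform>& front() const { return buffers_[front_]; }

    // Call only after the copy into back() has completed (in a CUDA port,
    // after synchronizing whatever stream or event owns that copy).
    void swap() { front_ = 1 - front_; }

private:
    std::vector<Transform> buffers_[2];
    int front_ = 0;
};
```

A frame would write the new transforms into `back()`, let the CPU consume `front()` in the meantime, and call `swap()` at the sync point. A triple buffer would follow the same pattern with one more slot.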

After that we will have a good idea of what the skeleton of the GPU version will be. Here is a CPU trace:

[Attachment: suspension.png (41.41 KiB)]

The tiny sliver under the substeps is the time spent in the shaders, while getting the data back after processing takes about 20 to 30 times that.
That tells me that we do not have to worry at all about optimizing shaders; the biggest bang in performance will come from using clever tricks to hide memory transfer latency.

Anyway, this seems to be going in the right direction.

by **Bird** » Wed Mar 23, 2022 12:57 pm

Nice to see such quick progress!

I tried this morning and the physics time was around 18ms. I just downloaded again and now it's around 15ms

I'm able to debug in Cuda using NSight Next-Gen debugger

[Attachment: newton_cuda_debug.jpg (192.9 KiB)]

by **Julio Jerez** » Wed Mar 23, 2022 2:16 pm

Oh, please check it again; I have now moved the transform update to the scene manager,
and I am getting 6 ms.
But that can probably be reduced to 4, since it is still copying the full body.

After that it should come down to probably 1 ms using the double buffer with streams, but that requires more planning. These are just run-of-the-mill optimizations, but they are the ones that provide the bigger gain, in my opinion. This is how it looks now:

[Attachment: Untitled.png (10.02 KiB)]

As you can see, the memcpy happens after the engine update, and that is where some game logic will be applied, because after the memory copy comes an equally long segment that updates the transforms of the CPU bodies. So with NVIDIA streams, which can do cudaMemcpyAsync to a double buffer, the CPU and GPU sections can run in parallel.

I have to say that GeForce hardware packs some serious floating-point punch :mrgreen:

That scene on the CPU is about 10 times slower, and we have not even scratched the surface of the possibilities.
The Nsight profiler keeps finding fault with the shader: it is either poor occupancy, poor float throughput, poor memory bandwidth, and so on.
I am starting to believe that it is just a strategy of never admitting a shader is adequate, and never taking responsibility when something is not right.
I am not pursuing shader optimization anymore; if we get a factor of 10x I will be more than satisfied.

One shader capture told me that the float throughput was a ridiculous value of 3.x%.
Yes, the whole thing is almost not measurable.
The one thing I got from that is that the native types are important. I made the classes using float, and the shader came out with several dozen 32-bit loads.
After changing them to use a float4,
the same shader came out with just load128 instructions, but with a higher register count.
I guess that makes sense, since a float4 takes a third more space than a float3 (put the other way, a float3 is 25% smaller).
So it seems we have to weigh every strategy: a float4 increases memory and register usage, but somehow the cores like it better in terms of load and store.

I first tried using float3, but it seems float3 resolves to three single loads while float4 is one load128.
And since loads and stores are several hundred times more expensive than everything else, it all comes down to loading the biggest native type at the beginning, doing the calculations, and then storing the results.
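
The size and alignment trade-off can be checked on the host with stand-in structs. `Float3` and `Float4` below are hypothetical mirrors of CUDA's `float3` and `float4`, assuming the layout described in the CUDA documentation: `float3` is three packed floats with 4-byte alignment, `float4` is 16-byte aligned. This only shows the storage cost; it says nothing about register usage.

```cpp
#include <cstddef>

// Host-side stand-ins for the CUDA vector types (assumed layout, see above).
struct Float3 { float x, y, z; };
struct alignas(16) Float4 { float x, y, z, w; };

static_assert(sizeof(Float3) == 12, "three 32-bit values, no 16-byte alignment");
static_assert(sizeof(Float4) == 16, "one 128-bit value");
static_assert(alignof(Float4) == 16, "aligned so a single 128-bit load can fetch it");

// Bytes needed to store `count` vectors of each kind.
constexpr std::size_t bytesAsFloat3(std::size_t count) { return count * sizeof(Float3); }
constexpr std::size_t bytesAsFloat4(std::size_t count) { return count * sizeof(Float4); }
```

For one vector per body in the 27000-body scene, the padded layout costs 432000 bytes against 324000, in exchange for one wide load per vector instead of three narrow ones.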

by **Bird** » Wed Mar 23, 2022 2:45 pm

Yes, now I'm getting around 5ms with the latest version on Github.

by **Julio Jerez** » Wed Mar 23, 2022 2:55 pm

5, that's good.
You probably have the same bus speed as mine.

One of the things that the profiler says is that there is not enough load.
They want you to issue at least 10 blocks per multiprocessor.
My system only has 22 multiprocessors, and that scene only generates 106 blocks.
So that is about 55% occupancy.
They say to use more blocks, or make the block size smaller.
Basically they want to hide latency by sending enough blocks to a multiprocessor so that some can do operations while others are loading.

I tried making the block size 128, but then I get that the block is too small. It is not possible to win.

Essentially it says that to be optimal the scene has to be at a minimum twice as big.
But again, that is good for the cores but not for the memory copy.
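
The block arithmetic behind those numbers can be sketched as follows. The block size of 256 threads is an assumption: it is not stated in the thread, but it is the size for which 27000 bodies round up to the 106 blocks quoted above. The helper names are made up.

```cpp
// Number of thread blocks needed to cover `items` with `blockSize` threads each.
constexpr int blockCount(int items, int blockSize) {
    return (items + blockSize - 1) / blockSize;  // ceiling division
}

// Blocks asked for by the rule of thumb: `blocksPerSm` per multiprocessor.
constexpr int targetBlocks(int multiprocessors, int blocksPerSm) {
    return multiprocessors * blocksPerSm;
}

static_assert(blockCount(27000, 256) == 106, "matches the quoted block count");
static_assert(targetBlocks(22, 10) == 220, "10 blocks on each of 22 multiprocessors");
```

With 106 blocks against a target of 220, the scene supplies a little under half of what the rule of thumb asks for, which is consistent with the remark that the scene would need to be about twice as big. This simple ratio is not the profiler's own occupancy metric, so it does not reproduce the 55% figure exactly.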

The good thing is that after we get the final scene manager, it seems it can do around 50 or even 10 thousand bodies, and that is some serious scene.

by **JoeJ** » Wed Mar 23, 2022 3:24 pm

"The real way to make an impact is by using what they call streams. Not sure why NVIDIA calls it a stream when everyone else calls it a command queue, but essentially this is what makes it possible to code dependencies and also run async."

That's where I expect the problem, even if we only think about NV GPUs: we need to do this async transfer and execution stuff for graphics as well. But by using a second compute API, we lose the option to synchronize and control their concurrency. Differing API workloads will either wait on each other, or thrash each other's caches. At some point we need a port to gfx API compute to get this right.

Did you try to access main RAM from the GPU and work directly on that? I do not really think that's practical if you read the same data more than once, but for the current experiment of just integrating bodies it might work? I mean those new 'Smart Bar' features, or whatever they call it. I think GTX1600 already has it.
If i'm right and discrete GPUs become obsolete within the decade, this would enable a lot of flexibility, and physics would benefit the most.
