Cuda Solver

A place to discuss everything related to Newton Dynamics.

Moderators: Sascha Willems, walaber

Re: Cuda Solver

Postby Julio Jerez » Mon Apr 18, 2022 1:14 pm

400k? Over 50 fps.
Holy smoke, leave the gun, take the cannoli. :shock: :D :mrgreen: 8) :shock:

Bird wrote:I was able to get 400,000 turtles spinning. Although, occasionally they would stop spinning and then start up again after a short amount of time.


Yes, that's what I was talking about.
Without a way to issue kernels from kernels, the only way I can do the update is to save the state of the last update and then read it back.
You would think that would be at most one frame of delay, but in reality there is about 100 ms of latency between the GPU and the CPU.
So each time a buffer gets resized, it throws away about 4 frame steps until the buffer is resized by the CPU.
Maybe I am doing something wrong, but I find it a real pain to manage resizable buffers on the GPU.
One way to do it is to just assume a fixed memory size and make all buffers fixed size, but that would waste a lot of memory.

I will just put that on hold until we get more of the full picture. Maybe we can find a better way to resize buffers.
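The pattern looks roughly like this (a sketch with made-up names, just to show the round trip): the last kernel of the step writes the sizes it needed into a small status struct, and the CPU reads it back asynchronously, which is why the value it acts on is several frames stale.

```cpp
// CPU read-back of GPU-produced buffer sizes, the source of the multi-frame delay.
#include <cuda_runtime.h>

struct SolverStatus
{
    int requiredPairCapacity;   // written by the last kernel of the update
};

void EndFrameReadback(const SolverStatus* d_status, SolverStatus* h_pinnedStatus,
                      cudaStream_t stream)
{
    // queued behind all of this frame's work; the CPU only sees the value
    // a few frames later, once the stream has caught up
    cudaMemcpyAsync(h_pinnedStatus, d_status, sizeof(SolverStatus),
                    cudaMemcpyDeviceToHost, stream);
}
```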

It may also be that the driver is inserting more of these huge silent gaps.
You can see in the profile there is a huge blank of almost half a second.

Remember, we cannot take over the entire GPU; my guess is that the graphics driver is the one preempting the GPU, since graphics is the primary concern.
Anyway, we still have a lot of unknowns to resolve as we keep going.
Those gaps are starting to be a very serious concern. That's a big monkey wrench Nvidia put there.

Re: Cuda Solver

Postby JoeJ » Tue Apr 19, 2022 11:32 am

Julio Jerez wrote:Launching a kernel from a kernel is in fact a necessary feature.
There are kernels for which the number of items is not known in advance. Instead, it is determined by the result of a previous kernel.

We can solve this efficiently now with indirect dispatch, but only at a coarse level.
Indirect dispatch means you set the workload size of a later dispatch indirectly, using some GPU memory to store the work count, so there is no need to read back to the CPU just to set the work size of another dispatch. OpenGL already has this, while OpenCL 1.x does not.
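A minimal OpenGL 4.3 sketch of the idea (illustrative only, with hypothetical program handles and a GLEW-style loader assumed): a producer pass writes the consumer's group counts into a small GPU buffer, and the consumer is launched with glDispatchComputeIndirect.

```cpp
#include <GL/glew.h>

void DispatchIndirectExample(GLuint producerProgram, GLuint consumerProgram)
{
    // buffer holding the indirect args: 3 GLuints {numGroupsX, numGroupsY, numGroupsZ}
    GLuint indirectArgs = 0;
    glGenBuffers(1, &indirectArgs);
    glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, indirectArgs);
    glBufferData(GL_DISPATCH_INDIRECT_BUFFER, 3 * sizeof(GLuint), nullptr, GL_DYNAMIC_DRAW);

    // producer pass: the same buffer is bound as an SSBO so the shader can
    // write uvec3(numGroupsX, 1, 1) into it
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, indirectArgs);
    glUseProgram(producerProgram);
    glDispatchCompute(1, 1, 1);
    glMemoryBarrier(GL_COMMAND_BARRIER_BIT);   // make the counts visible to the indirect read

    // consumer pass: its work size comes from GPU memory, the CPU never sees it
    glUseProgram(consumerProgram);
    glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, indirectArgs);
    glDispatchComputeIndirect(0);              // byte offset 0 into the args buffer
}
```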

What's still missing is the option to launch a kernel directly, bypassing any predefined command queues on the CPU side.
OpenCL 2.0 has this in the form of device-side enqueue. NV still does not support it, so I assume it's a hardware limitation on their side.
But the most interesting example is mesh shaders, which can be launched by amplification shaders. Here, all function parameters remain in on-chip memory. It would be great if we could see a generalization of this to compute shaders. Maybe the main obstacle is that mesh shaders only support the machine's native thread block as the workgroup size; mesh shaders cannot bundle multiple such blocks to form a larger workgroup.

Julio Jerez wrote:And yes, all of a sudden a port to DirectX 12 compute does not seem that complex after we get this plugin.

Yeah, I surely think so. I did the same: keep developing using OpenCL, because it had profiling tools and was much easier to use. Debugging is another such argument.
But then, porting the OpenCL code to Vulkan compute was just a matter of adjusting some syntax issues.

Julio Jerez wrote:So that leaves us with DX12 and Vulkan. Maybe SYCL, if Intel gets their act together.

I don't have much hope that SYCL could become a well-integrated option for game devs. At best it could compete with CUDA: vendor locked, and separated from the gfx and compute stuff we already do anyway.

But VK and DX really aren't bad. Personally I do not require the C++ language and I'm fine with C. But I hope we'll see some progress on getting away from those 'dispatch small kernels over huge work sizes, but leave control flow to the CPU' restrictions before I die :/

Re: Cuda Solver

Postby JoshKlint » Tue Apr 19, 2022 12:40 pm

I have been watching this thread.

Please, someone post some videos. :D

Re: Cuda Solver

Postby Julio Jerez » Tue Apr 19, 2022 6:46 pm

JoeJ wrote:OpenCL 2.0 has this in the form of device-side enqueue. NV still does not support it, so I assume it's a hardware limitation on their side.


That does not sound right, Joe. How can an API support a feature that the hardware does not?
Plus, they do support it; they just call it by a different name: "Dynamic Parallelism".

Basically, at some point they added interrupt capability to the GPU, so essentially it uses that preemption capability to force one or more cores to act as if they were a single general-purpose CPU.

The problem I have is the way it is set up in Visual Studio: you need to invoke the Nvidia linker to resolve the address of the kernel. And since in Newton the plugin is a static library, I can't invoke that linker, and linking the NV library with the other libs does not work since the addresses of the child kernels are GPU addresses.
I will just try doing it with plain vanilla kernel calls. Maybe after everything is working we can add a configuration to support it.

but that feature could be quite powerful.
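For reference, this is roughly what a dynamic parallelism launch looks like (a minimal sketch with illustrative names, not Newton's code). It needs compute capability 3.5+ and relocatable device code with device linking (nvcc -rdc=true), which is exactly the linker step that is the problem for a static library.

```cpp
// Dynamic parallelism sketch: the parent kernel launches a child grid whose size
// is only known on the device, so no CPU round trip is needed to size the launch.
#include <cuda_runtime.h>

__global__ void childKernel(float* data, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
    {
        data[i] *= 2.0f;    // placeholder work
    }
}

__global__ void parentKernel(float* data, const int* countOnDevice)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        int count = *countOnDevice;             // work size produced by a previous pass
        int blocks = (count + 255) / 256;
        childKernel<<<blocks, 256>>>(data, count);
        // under the CUDA 11-era rules, the child grid completes before the
        // parent grid is considered finished, so no explicit sync is needed here
    }
}
```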

Re: Cuda Solver

Postby JoshKlint » Thu Apr 21, 2022 4:31 am

How does CUDA handle allocation of GPU resources when physics and rendering are both active? If physics is running at 60 Hz and rendering is running at 120 Hz, how does CUDA prevent the physics routine from hogging the GPU? I was recently dealing with some work in Vulkan related to this idea, and I found there isn't really a way to run different routines at different frequencies without stalling out the faster one:
https://community.khronos.org/t/schedul ... eue/108531

Re: Cuda Solver

Postby Julio Jerez » Thu Apr 21, 2022 7:17 am

I do not know how Vulkan does it.
But from what I understand of newer Nvidia GPUs, what VK calls device queues Nvidia calls streams.

Since a GPU has so many compute units, a stream is a way to split a GPU into virtual devices.

But that can only be done as long as the shader does not occupy the entire GPU.

If the compute shader uses all of the compute units, then every time it runs, the only way to yield is if the driver preempts the GPU after some time. I believe DX12 on Windows does that as an extreme measure.

But if your compute shader uses fewer compute units, then you can get it to run concurrently.

Again, that's the way I understand it from the CUDA docs, and I see it working with a memcopy and a compute shader.
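For example, something like this (a sketch with assumed names, not the engine's actual code): the copy and the kernel are queued in two different streams and can overlap, as long as the kernel leaves some compute units free and the host buffer is pinned.

```cpp
// Overlapping a host-to-device copy with a compute kernel using two CUDA streams.
#include <cuda_runtime.h>

// hypothetical solver kernel standing in for the real physics update
__global__ void solverKernel(float* bodies, int bodyCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < bodyCount)
    {
        bodies[i] += 1.0f;  // placeholder work
    }
}

void StepWithOverlap(float* d_currentFrame, float* d_nextFrame,
                     const float* h_pinnedNextFrame, size_t bytes, int bodyCount)
{
    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    // upload the next frame's data while this frame is being solved
    cudaMemcpyAsync(d_nextFrame, h_pinnedNextFrame, bytes,
                    cudaMemcpyHostToDevice, copyStream);

    // the solver works on the current frame in another stream
    int blocks = (bodyCount + 255) / 256;
    solverKernel<<<blocks, 256, 0, computeStream>>>(d_currentFrame, bodyCount);

    // both must be done before the buffers are swapped for the next step
    cudaStreamSynchronize(copyStream);
    cudaStreamSynchronize(computeStream);
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
}
```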

I only use one stream for the physics compute shaders so far; later I will use another for particle effects and surface generation, so I will test that. But I have no idea how that is coordinated across different APIs.
It seems the GPU drivers are doing a hell of a lot of work masquerading as mini operating systems, but they do not do a very good job yet, and that may be because of the limitations of GPUs at multitasking.
Preempting a GPU is not a cheap thing. A GPU has several hundred megabytes of data in flight: megabytes of registers, megabytes of local memory, megabytes of cache and so on.
So preemption is something that could take several ms on a GPU.
So I do not think priorities are handled using that feature.
It is rather there to force the GPU to stop a very long shader when its time slice expires.
In the past, for hardware that did not have that option, the OS simply killed the app running that shader. But now, with newer GPUs, it just preempts the shader, which is better but not great.

For my part, my plan is to place a cap on the size of the shaders.

In truth, I do not really know, but it is something that does worry me, because I understand physics can't occupy the entire GPU.
The only thing I have seen is those mysterious gaps in the timeline where the time just seems to disappear, as if the GPU stopped working.
So I guess that's when graphics is doing work but the profiler does not show it.

Re: Cuda Solver

Postby JoshKlint » Thu Apr 21, 2022 8:15 am

Yeah, in Vulkan there does not appear to be any way to specify the amount of resources each "thread" uses at all. I'm very surprised by this. Maybe there are some vendor extensions that offer more control.

Re: Cuda Solver

Postby Julio Jerez » Thu Apr 21, 2022 1:48 pm

That's part of my beef with these tech companies and all the misleading information. But we are living in a world where objectivity has become taboo, and people even throw around the word "hyperbole" as if it were a good thing.

The idea that you can have a stream or command queue running concurrently with graphics is just pie in the sky at the moment, and it will be for a really long time.

With the rise of high-resolution displays, 2K and 4K, just rendering one polygon that covers the screen can consume all the shader cores of a GPU. You may have some control at the vertex and geometry shader stage, but once the polygon is rasterized it is cut into so many tiles that it will occupy even the highest-end GPUs.
The highest-end GPUs come with about 64 compute units, and the next generation with 128.
A compute unit is the granularity at which the driver can break down tasks.
But at 2K resolution we are talking about 4 million pixels, which is more than enough to keep a GPU at full occupancy.

My plan is to target a limited amount of GPU resources per tick, so that it does not take too much time away from the rendering pipeline, which is why I am placing so much emphasis on optimized shaders.
We are running tests with thousands of bodies, but that's just a stress test.

I do not expect high-end physics simulation to take over GPU rendering any time soon in real-time, high-end game engines. The graphics are just too demanding. In fact, so much so that even at 4K a next-generation GPU can't keep up and needs to do upscaling.

What I expect is to use the GPU for higher quality simulation, probably 5 to 10 times what the CPU can handle now.
But for people who don't do real-time games, that's a different story; there we can talk about animations with hundreds of thousands of objects at interactive rates.

Re: Cuda Solver

Postby JoshKlint » Fri Apr 22, 2022 2:03 am

I have found that memory bandwidth tends to become a bottleneck before computation does. See page 3:
https://www.xcdsystem.com/iitsec/procee ... AbID=96809

You may see faster overall speeds at scale if you store each body's 4x4 matrix as a quaternion + position and convert that into a 4x4 matrix in the shader, rather than transferring a full 4x4 matrix to and from the GPU every update.
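Something like this (just a sketch with made-up names, not Newton's actual data layout): 7 floats per body go over the bus instead of 16, and the transform is rebuilt on the device.

```cpp
#include <cuda_runtime.h>

struct BodyPose             // 28 bytes instead of 64 for a full float 4x4 matrix
{
    float qx, qy, qz, qw;   // unit rotation quaternion
    float px, py, pz;       // position
};

// Rebuild a column-major 4x4 transform (translation in m[12..14]) on the device.
__device__ void PoseToMatrix(const BodyPose& p, float m[16])
{
    const float x = p.qx, y = p.qy, z = p.qz, w = p.qw;
    m[0]  = 1.0f - 2.0f * (y * y + z * z);
    m[1]  = 2.0f * (x * y + z * w);
    m[2]  = 2.0f * (x * z - y * w);
    m[3]  = 0.0f;
    m[4]  = 2.0f * (x * y - z * w);
    m[5]  = 1.0f - 2.0f * (x * x + z * z);
    m[6]  = 2.0f * (y * z + x * w);
    m[7]  = 0.0f;
    m[8]  = 2.0f * (x * z + y * w);
    m[9]  = 2.0f * (y * z - x * w);
    m[10] = 1.0f - 2.0f * (x * x + y * y);
    m[11] = 0.0f;
    m[12] = p.px;
    m[13] = p.py;
    m[14] = p.pz;
    m[15] = 1.0f;
}
```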

Re: Cuda Solver

Postby JoeJ » Sat Apr 23, 2022 6:50 am

JoshKlint wrote:You may see faster overall speeds at scale if you store each body's 4x4 matrix as a quaternion + position and convert that into a 4x4 matrix in the shader, rather than transferring a full 4x4 matrix to and from the GPU every update.

If your matrices also carry non-uniform scale (or shear), such compression isn't possible anymore.


I looked up a bit of NV's Dynamic Parallelism. Seems pretty nice. :)
But I'm not sure what the detailed differences from OpenCL's device-side enqueue are.

Doesn't matter anyway. What matters is that all those folks work on some standard to bring these features to graphics compute APIs.

Re: Cuda Solver

Postby Julio Jerez » Mon Apr 25, 2022 11:18 am

The only thing I regret is that programming GPUs is a case of "We are not in Kansas anymore, Toto."

You have to have a lot of patience, time and determination.

The languages are quite limited and the tools are not great.

Dynamic parallelism is a great feature, but Nvidia made it so that for it to work you have to invoke the CUDA linker.
It seems that after compilation, the CUDA linker generates some CPU-side functions that the general linker then links.

So that rules out putting it in a static library that you can invoke from a project that is not a CUDA solution.

So the only way to use it is by making the exe a CUDA solution, or by making the library a DLL.
Neither of those options is feasible for Newton now.

It may seem a small problem, since we only need to make the exe a CUDA solution.
But imagine now adding another solver, like Intel's, which also requires a DPC++ solution.
The PC ecosystem is a nightmare.

Re: Cuda Solver

Postby JoeJ » Mon Apr 25, 2022 2:05 pm

Neither of those options is feasible for Newton now.

Well, I think it's actually good that you cannot use it.
Because then the later port to compute is possible without a need to work around missing features.

The PC ecosystem is a nightmare.

The tech industry as a whole is a nightmare.
What are their future offers? Metaverse? NFTs? Web3? $10k gaming PCs? That's a lot of bullshit nobody really wants or needs.

We'll see how C++ evolves. I expect we'll get easy GPU access at least for tools development and prototyping.
Not sure about client game engines. Likely cumbersome compute remains the fastest and preferred option for that.

Re: Cuda Solver

Postby Julio Jerez » Wed Apr 27, 2022 3:22 pm

But it is a big problem, Joe.
The idea that we can only make a single level of function calls makes it really annoying to program.
Forget about recursion; you can only call little functions that in effect reduce to inline macros.

I continue anyway. But the fact that I can't call functions from the CUDA standard library unless the solution itself is a CUDA project is not a good thing.

Anyway, I have been working on the sorting routine, and for almost a month I have been debugging a random crash, until I finally found it this morning.
I think we now have a very, very sophisticated sort that beats by far all of the CUDA library sorts from their Thrust lib.

The bug was that the driver plays nasty tricks on you, since I was using double buffering to get the memory transfers and the compute to work concurrently.
The CPU results are not available until 4 frames after something is set on the GPU and copied into the CPU status struct.
I was assuming it was two frames, so when a buffer became larger than its capacity, the CPU resized it from data two frames old and signaled the GPU that it was ready to work on that buffer.
So the sorting was fine as long as the buffer was smaller than the initial size.

But the moment I tested with a larger buffer, it trashed memory.
It looked random because of the parallel nature.

I can fix that in two ways:
1. Make the buffers a very large size, but that's too naive.
2. Add a delay counter, so that the simulation is suspended for 4 frames when a buffer size changes.

But the best solution would be if a kernel could make calls to resize the buffer. And that's the part that is lacking.

I am wondering if a kernel can call malloc.
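From what I read in the CUDA docs, device code can call malloc/free from inside a kernel, drawing from a device heap whose size must be fixed up front on the host; it cannot resize an existing cudaMalloc'd buffer, so it would only help for scratch memory. A minimal sketch (illustrative names):

```cpp
#include <cuda_runtime.h>

__global__ void ScratchKernel(int count)
{
    // per-thread scratch allocation from the device heap
    int* scratch = static_cast<int*>(malloc(count * sizeof(int)));
    if (scratch)
    {
        for (int i = 0; i < count; ++i)
        {
            scratch[i] = i;                 // placeholder work
        }
        free(scratch);
    }
}

int main()
{
    // reserve 64 MB for device-side malloc before launching kernels that use it
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    ScratchKernel<<<1, 32>>>(256);
    cudaDeviceSynchronize();
    return 0;
}
```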

Re: Cuda Solver

Postby JoeJ » Wed Apr 27, 2022 5:53 pm

Forget about recursion; you can only call little functions that in effect reduce to inline macros.

The funny thing is: I did indeed forget about recursion after working on GPUs for some time. :)
I never used recursive functions again on the CPU either in the last 10 years. Instead I manage the stack myself, which is a good thing. So at least the shortcomings made me learn some stuff.

But the best solution would be if a kernel could make calls to resize the buffer. And that's the part that is lacking.

Yeah, that's a big one. I lack experience here, because I have not yet implemented open world and streaming.
But I'll go with fixed-size buffers, and the user is responsible for setting up buffers that are large enough.

Maybe that's an option for you too. After that, all you'd have to worry about is making sure it does not crash in case the buffers still end up too small; users just could not expect the simulation to give correct results in such cases.
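A sketch of that idea (illustrative names, not Newton's code): append with an atomic counter into a fixed-capacity buffer, drop anything past capacity, and raise a flag so the host can report the overflow instead of crashing.

```cpp
#include <cuda_runtime.h>

// Fixed-capacity append that survives overflow: excess items are dropped and a
// flag is set, so that step is wrong but the program does not crash.
__global__ void AppendPairs(const int2* candidates, int candidateCount,
                            int2* outPairs, int capacity,
                            int* outCount, int* overflowFlag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < candidateCount)
    {
        int slot = atomicAdd(outCount, 1);
        if (slot < capacity)
        {
            outPairs[slot] = candidates[i];   // normal path
        }
        else
        {
            *overflowFlag = 1;                // buffer too small: drop the item
        }
    }
}
```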

And maybe tiled resources (used mainly for virtual texturing) could be useful for something like dynamic memory allocation. But actually I doubt that, and it's still managed from the CPU side.

Re: Cuda Solver

Postby Julio Jerez » Wed May 18, 2022 1:26 pm

Ok guys, I found a few things that are disappointing in CUDA.
I have a GeForce 1660 Super at home, and that is what I develop the engine on.

A month ago I started the broad phase and added many kernels, but as I kept going I realized I had to restart several times, because of the limitations of debugging and all the latency.
But I thought that at least the code was working.

To my surprise, when I tested the same code on a different system with a GeForce 1060, I saw lots of malfunctions.
I cannot really debug on that system, so I started adding sanity checks in place to see if I could make sense of the errors.
I found a few errors, and I believe I fixed them.

Now I have the latest iteration of the code so far, and I stress tested it hard on the 1660, and it works fine every time.
But again, the same code on a 1060 does not get as far as malfunctioning, but it triggers the assert.

That's really bad news, because it means we would need different algorithms for different GPUs.

A GTX 1060 is a very common and popular GPU; we should not exclude that hardware, or else this will have very limited usage.

But anyway, I will probably have to get a 1060 so that I can install it and use it as the baseline.

But what I want to know now is which other GPUs are not working as expected.

So if anyone reading this can do a test by:
-syncing to the latest
-building with CUDA
-compiling the debug build
-running for a few seconds
-checking that it does not assert

Then, whether it passes the test or not, tell me the result and the GPU class.

The information is traced to the debug output window in Visual Studio like this:

gpu: NVIDIA GeForce GTX 1060 6GB
wavefront: 32
muliprocesors: 10
memory bus with: 192 bits
memory: (mbytes) 6143


That's another quirk I found that does not match reality.
If you read the specifications of the GTX 1060
https://www.techpowerup.com/gpu-specs/g ... 6-gb.c2862
it says it has 1280 CUDA cores.

But my calculation, assuming 32 cores per multiprocessor, comes to 40 multiprocessors.
Some specs say they have 64 cores per multiprocessor, so that would be about 20.

But when you read the actual properties from the device, it reports a wavefront of 32 and 10 multiprocessors. That comes to about a 320-core GPU, not the 1280 they list on the data sheet.

Either I do not understand their spec, or someone is lying big time in those specs.
The results I see are more in line with 320 cores than with 1280 cores.
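One possible reconciliation, from my reading of the CUDA docs (not something the device trace states directly): the "wavefront: 32" value is cudaDeviceProp::warpSize, which is the warp width, not the number of cores per multiprocessor; the cores-per-SM count is not exposed by the struct at all, and Pascal multiprocessors have 128 FP32 cores each, so 10 x 128 = 1280 would match the data sheet. A minimal sketch of querying those fields:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("gpu: %s\n", prop.name);
    printf("warp size: %d\n", prop.warpSize);                  // warp width, not cores per SM
    printf("multiprocessors: %d\n", prop.multiProcessorCount);
    printf("memory bus width: %d bits\n", prop.memoryBusWidth);
    printf("memory (mbytes): %zu\n", prop.totalGlobalMem / (1024 * 1024));
    return 0;
}
```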
