Julio Jerez wrote:My impression was that all three high performance computing APIs offer interoperability capability.
Not yet; neither VK nor DX12 has data sharing with other APIs.
Khronos announced this as a feature for the next version.
Julio Jerez wrote:as it stands now in OpenCL, last time I checked, a GPU can only be seen as a single device, which makes it hard to do things like collision detection.
take for example different pairs.
CL has multiple queues, VK and DX12 too.
This however does not guarantee that work runs in parallel, it just makes it possible.
AMD is the only hardware vendor with fine-grained async compute, but even AMD recommends using just one compute task while doing ALU-light rendering work (depth prepass, shadow maps).
I hope multiple parallel compute tasks can profit from async compute too if one task has too little work to saturate the GPU, but at the moment I have too many sequential dependencies to test this seriously. I'll let you know...
This means you need to try to generate large workloads of similar work, e.g.
Shader 1: Build potential collision pairs and write them to a large list.
Shader 2: Detect exact collision data and write it to another list.
Shader 3: Resolve all collisions.
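To make the batching idea concrete, here is a minimal CPU sketch of that three-stage structure; on the GPU each stage would be one large dispatch over the whole list. All the names (Box, Pair, Contact, buildPairs, buildContacts) are my own illustrations, not Newton's API:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical axis-aligned box, just enough to sketch the pipeline.
struct Box { float min[3], max[3]; };
struct Pair { int a, b; };
struct Contact { int a, b; float depth; };

// Stage 1: broad phase - write all potentially colliding pairs to one big list.
std::vector<Pair> buildPairs(const std::vector<Box>& boxes) {
    std::vector<Pair> pairs;
    for (int i = 0; i < (int)boxes.size(); ++i)
        for (int j = i + 1; j < (int)boxes.size(); ++j) {
            bool overlap = true;
            for (int k = 0; k < 3; ++k)
                overlap &= boxes[i].min[k] <= boxes[j].max[k] &&
                           boxes[j].min[k] <= boxes[i].max[k];
            if (overlap) pairs.push_back({i, j});
        }
    return pairs;
}

// Stage 2: narrow phase - exact test per pair, contacts go to another list.
std::vector<Contact> buildContacts(const std::vector<Box>& boxes,
                                   const std::vector<Pair>& pairs) {
    std::vector<Contact> contacts;
    for (const Pair& p : pairs) {
        float depth = 1e9f;
        for (int k = 0; k < 3; ++k) {
            float d = std::min(boxes[p.a].max[k] - boxes[p.b].min[k],
                               boxes[p.b].max[k] - boxes[p.a].min[k]);
            depth = std::min(depth, d);
        }
        if (depth > 0.0f) contacts.push_back({p.a, p.b, depth});
    }
    return contacts;
}
// Stage 3 (resolve) would consume the contact list in one more dispatch.
```

The point is that each stage only reads the full output list of the previous one, so work per stage stays uniform and large instead of branching per pair type.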
Julio Jerez wrote:say you have 1000 colliding pairs, 200 box/polygon, 300 box/box, 500 convex/box and so on
Yep, that's a bad example - 1000 is simply not enough to saturate a GPU.
It makes sense only for cloth / particles. And even there it is questionable, because you need to maintain a copy of the physical world on the GPU. (I still think that if someone wants e.g. cloth on characters, the graphics engine developer can and should implement this more efficiently than you could.)
Julio Jerez wrote:CUDA, the latest versions, has the capability of issuing kernels from within kernels. This is very useful, but I believe OpenCL can do that with the command queues.
OpenCL can do this only since 2.0 (device-side enqueue).
Going back to my collisions example,
with OpenCL 1.x you write a list from shader 1 and the list count to GPU memory.
Before you start shader 2 you need to read the list count back from GPU to CPU so you know how much work there is to do. And this readback is what kills performance.
VK / DX12 have indirect dispatch: you build the command buffer up front. The command buffer knows all 3 shaders in order, and it also knows the dispatch count will be a result of the previous shader.
At runtime you submit the command buffer with a single call per frame - no need to read the count back from the GPU; it handles it on its own.
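The pattern can be sketched on the CPU like this. The struct mirrors Vulkan's actual VkDispatchIndirectCommand layout (three uint32 workgroup counts read from a buffer at dispatch time); kGroupSize, shader1 and dispatchIndirect are my own simulation names, not API calls:

```cpp
#include <cstdint>

// Mirrors Vulkan's VkDispatchIndirectCommand: three group counts the GPU
// reads from a buffer at dispatch time, so the CPU never sees them.
struct DispatchIndirectCommand { uint32_t x, y, z; };

constexpr uint32_t kGroupSize = 64;  // threads per workgroup (assumption)

// "Shader 1": builds the pair list and writes the dispatch args for shader 2
// straight into GPU memory - this replaces the CPU readback of the count.
void shader1(uint32_t pairCount, DispatchIndirectCommand* args) {
    args->x = (pairCount + kGroupSize - 1) / kGroupSize;  // ceil division
    args->y = 1;
    args->z = 1;
}

// Stand-in for vkCmdDispatchIndirect: the workgroup count comes from the
// buffer the previous shader wrote, not from a CPU-side value.
uint32_t dispatchIndirect(const DispatchIndirectCommand* args) {
    return args->x * args->y * args->z;  // workgroups launched
}
```

So for 1000 pairs, shader 1 writes {16, 1, 1} into the buffer and the prerecorded command buffer launches exactly 16 groups of 64 without any CPU round trip.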
This is the ONLY reason VK is twice as fast as OpenCL for me, so it's very important.
OpenGL is in between: It has indirect dispatch, but no prerecorded command buffers.
So there is no need to read back the list count, but you need to invoke many shaders per frame instead of just one command buffer for everything.
(But 2 years back, even with the lack of indirect dispatch, CL was twice as fast as GL on Nvidia for me.)
Julio Jerez wrote:I know they do it at least on consoles, because when you look at the GPU debugger on a console you can see how each multiprocessor executes a different shader in parallel, so I do not know why this was not supported by early versions of OpenCL. Right now it can run only one kernel per GPU, and what it needs is one kernel per multiprocessor.
I expect this to be better with VK / DX12 than with OpenCL, but not as fine grained as you wish:
* Those debug graphs show mostly work from the graphics pipeline, not compute - that's a difference.
* It's AMD only (Pascal might have some improvement, but I assume it's still far behind; Intel has NO async compute)
godlike wrote:You could use HLSL for the shaders. Then Newton will have 2 backends, one for Vulkan and one for DX12. For the DX12 backend Newton will use the HLSL directly but the Vulkan one will use Khronos' glslang compiler to compile HLSL to SPIR-V*.
Agreed, that seems the most future-proof path to go (but a hard start).
IMHO both CUDA and OpenCL are dead now for game dev.
Edit:
In the meantime I have a multithreaded CPU implementation of my (insanely complex) Global Illumination algo.
FuryX is 30-100x faster than an i7-930:
100x on calculation-intense work like ray tracing,
30x on bandwidth-heavy / low-workload things.