But the problem is that most of the work happens at the beginning of the construction.
Well, it's not really the majority of the work but actually a small workload, and that leads to bad GPU utilization due to a high complexity / low workload ratio. At least as far as building a tree goes.
To me this is not that bad, because in such cases I can run some other, massive workload in parallel. So if we have such a workload handy, which we always do for a full game, the problem solves itself.
That's one reason why I say we should use a single API for both gfx and physics, so we can build an optimized async execution graph of all our workloads to hide those setup and low-workload costs.
But if you find another way to saturate the GPU on your own, that's easier for all of us, of course.
Still, parallel computing should not push us into a culture of 'first: saturate HW at any price, second: reduce work eventually', like NV usually proposes.
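To make the async-graph idea concrete, here is a minimal sketch (my own illustration, not any real graphics API): tasks with dependencies get grouped into "waves" of independent work, so a low-occupancy job like BVH setup lands in the same wave as a big shading job and can hide underneath it. The task names and the `schedule_waves` helper are hypothetical.

```python
def schedule_waves(tasks, deps):
    """tasks: dict of name -> occupancy hint; deps: dict of name -> set of prerequisites.
    Returns a list of waves; tasks within one wave are independent and may
    overlap on the GPU (e.g. on separate async queues)."""
    done, waves = set(), []
    while len(done) < len(tasks):
        # A task is ready once all of its prerequisites have completed.
        wave = sorted(t for t in tasks if t not in done and deps.get(t, set()) <= done)
        if not wave:
            raise ValueError("cycle in task graph")
        waves.append(wave)
        done |= set(wave)
    return waves

# Hypothetical frame: the small BVH setup overlaps the big g-buffer pass.
tasks = {"bvh_setup": "low", "shade_gbuffer": "high",
         "bvh_refit": "low", "lighting": "high"}
deps = {"bvh_refit": {"bvh_setup"}, "lighting": {"shade_gbuffer"}}
```

Here `bvh_setup` (low occupancy) ends up in the same wave as `shade_gbuffer` (high occupancy), which is exactly the pairing that hides the setup cost.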
As for whatever it is NVIDIA does, it seems they build their trees on the CPU, in the driver.
I assumed the same initially, but by now I'm sure they build the RTX BVH on the GPU. People show its cost in their GPU profiler outputs, and it's one of those tasks they usually hide under async execution, which works well.
But there is a big CPU and memory cost with RT too, so maybe the CPU is still involved as well, perhaps to build just the TLAS or the top levels.
What you say about a bottom-up tree sounds very interesting. I do this for my offline BVH, but I lack realtime experience.
You surely mean something different from the conventional way of using a second tree for agglomerative builds.
I'll try to follow this.
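For anyone following along, the conventional agglomerative (bottom-up) build mentioned above can be sketched naively: repeatedly merge the two clusters whose combined AABB has the smallest surface area. This is my own O(n^2)-per-step toy version for clarity; real builders approximate the pair search rather than doing it exhaustively.

```python
def union(a, b):
    """Union of two AABBs, each given as ((lox,loy,loz), (hix,hiy,hiz))."""
    (alo, ahi), (blo, bhi) = a, b
    return (tuple(min(x, y) for x, y in zip(alo, blo)),
            tuple(max(x, y) for x, y in zip(ahi, bhi)))

def area(box):
    """Surface area of an AABB, the usual greedy merge metric."""
    lo, hi = box
    dx, dy, dz = (h - l for h, l in zip(hi, lo))
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def build_agglomerative(leaf_boxes):
    """Bottom-up BVH: leaves are indices, internal nodes are (left, right) pairs.
    Returns the root cluster as (aabb, node)."""
    clusters = [(box, i) for i, box in enumerate(leaf_boxes)]
    while len(clusters) > 1:
        # Exhaustively find the pair whose merged box has minimal area.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a = area(union(clusters[i][0], clusters[j][0]))
                if best is None or a < best[0]:
                    best = (a, i, j)
        _, i, j = best
        (ba, na), (bb, nb) = clusters[i], clusters[j]
        merged = (union(ba, bb), (na, nb))  # new internal node
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]
```

With three leaves where two sit close together, the close pair merges first and the far one joins at the root, which is the behavior that makes agglomerative builds attractive for quality.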
But when I peeked at your grid for fluid particles, for example, I got no clear idea how it's intended to work at all.
Maybe because I know nothing about Sweep And Prune either.
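For reference, Sweep And Prune (sort and sweep) in its simplest 1D form looks like this: sort interval endpoints along one axis, sweep through them, and report every pair whose intervals are simultaneously open. This is a generic textbook sketch of the algorithm, not the grid scheme discussed above.

```python
def sweep_and_prune(intervals):
    """intervals: list of (lo, hi) extents along one axis.
    Returns the set of index pairs (i, j), i < j, whose intervals overlap."""
    events = []
    for i, (lo, hi) in enumerate(intervals):
        events.append((lo, 0, i))  # 0 = open; sorts before a close at the same x
        events.append((hi, 1, i))  # 1 = close
    events.sort()
    active, pairs = set(), set()
    for _, kind, i in events:
        if kind == 0:
            # Every currently open interval overlaps the one we just opened.
            for j in active:
                pairs.add((min(i, j), max(i, j)))
            active.add(i)
        else:
            active.remove(i)
    return pairs
```

For 3D broadphase you typically sweep one axis this way and confirm candidate pairs on the other two axes; exploiting frame-to-frame coherence (incrementally re-sorting a nearly sorted list) is what makes it fast in practice.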