I think what is becoming obsolete is the PCI bus.
Video game consoles do not have a PCI bus, and they run circles around PCs, which in general are about two to three times more powerful.
But anyway, I believe that somehow APIs like DX and OpenGL figure out how to split the device, so that different wavefronts run different kernels.
If you think about it, any GPU these days has at least 16 or more multiprocessors, and each one of those cores needs at least 64 x 10 = 640 threads to be efficient.
A block running just one thread has the same cost as one running 10 wavefronts.
The latest GPUs come with 40, or even 60, multiprocessors.
It is ridiculous to try to keep those busy by running one shader at a time.
That's where streams and dependency graphs become very useful; Vulkan and DX12 expose that with command buffers.
For legacy apps the driver has to infer that on its own, but that's very hard.
For a physics library, and for some of the new game engines like UE and Unity, the focus is on big data, so they can keep a GPU busy with a single shader.
For us, what it means is that there is a high number of objects that we can process, and below that there will be a dramatic cutoff in performance; but that number could be quite high as physics goes.
I can now see how an SPH fluid can handle millions of particles: it just capitalizes on the high core count and on hiding the huge memory latency.
As for all those memory modes, I do not think I will try many of them; from what I have read, they run quite a bit slower because of the cost of the hardware exception handling that triggers hidden memory copies behind the user's back.
The async memcpy is also something that has to be used with care, because it uses committed virtual pages, what Nvidia calls pinned memory. But that could be bad for the OS.