Yes, AMD can process different tasks even on the same CU. Nvidia still does not make such details public. Currently VK exposes at most 4 compute queues on AMD cards, but 16 on NV.
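If you want to check the queue counts on your own card, here is a rough sketch of the counting logic. To keep it self-contained, `QueueFamily` is a made-up stand-in for the two fields of `VkQueueFamilyProperties` that matter here; in a real program you would fill the list from `vkGetPhysicalDeviceQueueFamilyProperties`. Counting only families that have the compute bit but not the graphics bit is my assumption about what "compute queues" means in the numbers above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the queueFlags / queueCount fields of
// VkQueueFamilyProperties, so the sketch compiles without the Vulkan SDK.
struct QueueFamily {
    std::uint32_t queueFlags;
    std::uint32_t queueCount;
};

constexpr std::uint32_t kGraphicsBit = 0x1; // same value as VK_QUEUE_GRAPHICS_BIT
constexpr std::uint32_t kComputeBit  = 0x2; // same value as VK_QUEUE_COMPUTE_BIT

// Sums the queues of every family that supports compute but not graphics.
// Assumption: these are the queues one would use for async compute.
std::uint32_t asyncComputeQueueCount(const std::vector<QueueFamily>& families) {
    std::uint32_t total = 0;
    for (const QueueFamily& family : families) {
        const bool compute  = (family.queueFlags & kComputeBit) != 0;
        const bool graphics = (family.queueFlags & kGraphicsBit) != 0;
        if (compute && !graphics) {
            total += family.queueCount;
        }
    }
    return total;
}
```

The family layout you feed in depends entirely on the driver, so treat the result as whatever your current driver version exposes, not as a hardware limit.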
I tried using 3 queues simultaneously, and the timestamps of the dispatches overlapped, which proves it works, but it was slightly slower than using just one queue.
However, each of my dispatches was demanding and could saturate the GPU on its own, so that was a bad test case.
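The overlap check itself is simple once the timestamps are read back. A minimal sketch, assuming each dispatch is bracketed by two timestamp queries and that timestamps from different queues share one time base; `DispatchSpan` and both function names are made up for illustration, and `nsPerTick` is meant to be the device's `timestampPeriod` limit:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record for one dispatch: raw GPU timestamp ticks
// written just before and just after it.
struct DispatchSpan {
    std::uint64_t beginTicks;
    std::uint64_t endTicks;
};

// True if both dispatches were in flight at the same time for at least one tick.
bool overlaps(const DispatchSpan& a, const DispatchSpan& b) {
    assert(a.beginTicks <= a.endTicks && b.beginTicks <= b.endTicks);
    return a.beginTicks < b.endTicks && b.beginTicks < a.endTicks;
}

// Length of the shared interval in nanoseconds (0 if there is none).
// nsPerTick: nanoseconds per timestamp tick, as reported by the device.
double overlapNs(const DispatchSpan& a, const DispatchSpan& b, double nsPerTick) {
    if (!overlaps(a, b)) {
        return 0.0;
    }
    const std::uint64_t begin = a.beginTicks > b.beginTicks ? a.beginTicks : b.beginTicks;
    const std::uint64_t end   = a.endTicks < b.endTicks ? a.endTicks : b.endTicks;
    return static_cast<double>(end - begin) * nsPerTick;
}
```

Overlapping timestamps only show that the dispatches were in flight together; they say nothing about whether total time went down, which is exactly what I saw.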
They say we should pair bandwidth-heavy with ALU-heavy tasks, but like you I'm hoping more for a solution for small workloads. E.g. I have shaders that need to process a tree with one dispatch per level, so I have tiny workloads near the root.
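To put rough numbers on the tree case, here is a small sketch that computes the workgroup count per level for a complete tree with one dispatch per level. The complete-tree shape, one thread per node, and all names and parameters are assumptions for illustration, not my actual shader setup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Workgroups needed for each level of a complete tree, root first,
// assuming one thread per node and one dispatch per level.
std::vector<std::uint64_t> workgroupsPerLevel(std::uint32_t branching,
                                              std::uint32_t levels,
                                              std::uint32_t threadsPerGroup) {
    assert(branching > 0 && threadsPerGroup > 0);
    std::vector<std::uint64_t> groups;
    std::uint64_t nodes = 1; // the root level has a single node
    for (std::uint32_t level = 0; level < levels; ++level) {
        // Round up so a partially filled workgroup still counts.
        groups.push_back((nodes + threadsPerGroup - 1) / threadsPerGroup);
        nodes *= branching;
    }
    return groups;
}

// Number of levels with fewer workgroups than compute units, i.e.
// dispatches that cannot even give every CU one workgroup.
std::size_t underfilledLevels(const std::vector<std::uint64_t>& groups,
                              std::uint32_t computeUnits) {
    std::size_t count = 0;
    for (std::uint64_t g : groups) {
        if (g < computeUnits) {
            ++count;
        }
    }
    return count;
}
```

With a branching factor of 8, 6 levels, and 64 threads per group, the levels need 1, 1, 1, 8, 64, and 512 workgroups, so on a hypothetical 36-CU GPU four of the six dispatches leave most of the chip idle. Those are the dispatches I would like to run alongside something else.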
Early next year I should know more, but from what I hear everywhere we should not expect too much. Let's hope I'm wrong...
The API is good (but really cumbersome to use, otherwise I'd create a test case to get a quick answer).
Maybe there is some work left on the drivers, but I doubt it. Actually, the Mantle driver handles both VK and DX12 on AMD.