Oh, please check it again. I have now moved the Transform update to the scene manager,
and I am getting 6 ms.
That can probably be reduced to 4 ms, since it is still copying the full body.
After that, it should come down to probably 1 ms using a double buffer with streams, but that requires more planning. These are just run-of-the-mill optimizations, but they are the ones that provide the biggest gains, in my opinion. This is how it looks now:
[Image: Untitled.png]
As you can see, the memcpy happens after the engine update, and that is where some game logic will be applied, because after the memory copy comes an equally long segment that updates the transforms of the CPU bodies. So with NVIDIA streams, which can do a cudaMemcpyAsync into a double buffer, those CPU and GPU sections can run in parallel.
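A rough sketch of the double buffer idea, only to show the data structure. The names `Transform`, `TransformDoubleBuffer`, `ReceiveFromDevice`, `Front` and `Swap` are made up for illustration and are not from the engine. In a real CUDA version the copy into the back buffer would be a cudaMemcpyAsync on its own stream, and the swap would happen after that stream is synchronized; here a plain std::copy stands in for the transfer so the sketch runs without a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-body transform data copied back from the GPU.
struct Transform
{
    float posit[4];
    float rotation[4];
};

// Two host-side buffers: the transfer fills the back buffer while the
// CPU reads transforms from the front buffer.
class TransformDoubleBuffer
{
public:
    explicit TransformDoubleBuffer(std::size_t bodyCount)
        : m_front(bodyCount), m_back(bodyCount)
    {
    }

    // Stand-in for the device-to-host transfer. Assumption: in the CUDA
    // version this is a cudaMemcpyAsync issued on a separate stream.
    void ReceiveFromDevice(const std::vector<Transform>& deviceResults)
    {
        assert(deviceResults.size() == m_back.size());
        std::copy(deviceResults.begin(), deviceResults.end(), m_back.begin());
    }

    // Data the CPU can safely read while the next transfer is in flight.
    const std::vector<Transform>& Front() const { return m_front; }

    // Call only after the transfer has finished (after a stream
    // synchronize in the CUDA version).
    void Swap() { std::swap(m_front, m_back); }

private:
    std::vector<Transform> m_front;
    std::vector<Transform> m_back;
};
```

The CPU transform update would read `Front()` for frame N while the copy for frame N+1 lands in the back buffer, so the two segments overlap instead of running one after the other.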
I have to say that GeForce hardware packs a serious floating-point punch.
That scene on the CPU is about 10 times slower, and we haven't even scratched the surface of what is possible.
The Nsight profiler keeps telling me that the shader *: it is either poor occupancy, poor float throughput, poor memory bandwidth, and so on.
I am starting to believe that it is just a strategy of never admitting a shader is adequate, and never taking responsibility when something is not right.
I am not pursuing shader optimization anymore; if we get a factor of 10x I will be more than satisfied.
One shader capture told me that the float throughput was a ridiculous 3.x%.
Yes, that whole thing is almost not measurable.
The one thing I got from that is that the native types are important. I made the classes using float, and the shader came up with several dozen 32-bit loads.
After changing them to use float4,
the same shader came up with just load128 instructions, but the register count increased.
I guess that makes sense, since a float4 uses a third more storage than a float3 (16 bytes versus 12).
So it seems we have to weigh every strategy: a float4 increases memory by a third and uses more registers, but somehow the cores like it better in terms of loads and stores.
I first tried using float3, but it seems a float3 resolves to three single loads, while a float4 is one load128.
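A small host-side sketch of why that probably happens. `Vec3` and `Vec4` are hypothetical mirrors of CUDA's float3 and float4, and `IsAligned16` and `ExtraMemoryPercent` are helper names made up here. CUDA's float3 is only 4-byte aligned and 12 bytes wide, while float4 is 16 bytes wide and 16-byte aligned, so only the float4 elements of an array always start on a boundary that a single 128-bit load can use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side mirrors of CUDA's float3 and float4.
struct Vec3 { float x, y, z; };                 // 12 bytes, 4-byte aligned
struct alignas(16) Vec4 { float x, y, z, w; };  // 16 bytes, 16-byte aligned

static_assert(sizeof(Vec3) == 12 && alignof(Vec3) == 4, "Vec3 packs like float3");
static_assert(sizeof(Vec4) == 16 && alignof(Vec4) == 16, "Vec4 packs like float4");

// Extra memory an array of Vec4 needs compared to Vec3, in whole percent.
constexpr std::size_t ExtraMemoryPercent()
{
    return (sizeof(Vec4) - sizeof(Vec3)) * 100 / sizeof(Vec3);
}

// True when the address can be the target of one 16-byte (128-bit) access.
inline bool IsAligned16(const void* address)
{
    return reinterpret_cast<std::uintptr_t>(address) % 16 == 0;
}
```

In an array of `Vec3`, the second element sits at byte offset 12, so it cannot be fetched with one aligned 128-bit access, while every `Vec4` element can. The price is the extra third of memory reported by `ExtraMemoryPercent()`.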
And since loads and stores are several hundred times more expensive than everything else, it all comes down to loading the biggest native type at the beginning, doing the calculations, and then storing the results.
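The load, compute, store pattern could look roughly like this. It is written as plain C++ so it runs anywhere; in a kernel the loop body would be the per-thread work. `Vec4` and `IntegratePositions` are hypothetical names, and the position update is just a placeholder calculation, not the engine's actual integrator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 16-byte vector type, laid out like CUDA's float4.
struct alignas(16) Vec4 { float x, y, z, w; };

// Placeholder calculation: advance each position by velocity * timestep.
void IntegratePositions(std::vector<Vec4>& posit,
                        const std::vector<Vec4>& veloc,
                        float timestep)
{
    assert(posit.size() == veloc.size());
    for (std::size_t i = 0; i < posit.size(); ++i)
    {
        // One wide load per operand, up front.
        Vec4 p = posit[i];
        const Vec4 v = veloc[i];

        // All arithmetic works on the local copies (registers on the GPU).
        p.x += v.x * timestep;
        p.y += v.y * timestep;
        p.z += v.z * timestep;

        // One wide store at the end.
        posit[i] = p;
    }
}
```

The point is that memory is touched once on the way in and once on the way out, instead of reading and writing individual floats throughout the calculation.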