Cuda Solver

A place to discuss everything related to Newton Dynamics.

Moderators: Sascha Willems, walaber

Re: Cuda Solver

Postby Julio Jerez » Fri Mar 25, 2022 7:33 pm

I have now committed the new algorithm.
but as I said before, the new method translated into some gain in performance at 2000 bodies.
it seems the cache makes the difference between an n * log(n) and a linear time complexity algorithm.

but there are more benefits than just the linear time: with all the procedures in the solver being linear, they yield better performance when switching to multicore.
this link explains the reason: https://en.wikipedia.org/wiki/Amdahl%27s_law
as you can see, even with as few as 4 cores the gain is not that much if, say, only 50% of the code is parallel. so imagine having 100's of cores.

the more parallel parts we have in the engine, the better the multicore scaling, and the solver is now, I would say, about 95% parallel. and we need that for CUDA.
Julio Jerez
Moderator
 
Posts: 12452
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby JoeJ » Fri Mar 25, 2022 7:39 pm

If I had to bet, I'd say that's a sure thing, since doubles are important for scientific apps.

This already exists in the form of AMD Instinct CDNA GPUs. They have no rasterization, but fp64 at full rate.
I'm quite doubtful we'll see that on consumer hardware.
But I may be wrong. We got tensor cores too, although we never requested those ;D
JoeJ
 
Posts: 1494
Joined: Tue Dec 21, 2010 6:18 pm

Re: Cuda Solver

Postby Julio Jerez » Fri Mar 25, 2022 8:06 pm

JoeJ wrote:We got tensor cores too, although we never requested this ;D

do not get too excited about tensor cores, at least not yet.
from what I read, they are matrix operators, but they operate on a very limited set of floats and ints.
essentially they do a lot of 8-bit add and multiply, and float16 multiply, and nothing higher than that.

they are designed for the intermediate hidden layers of deep neural nets, for executing a large number of convolutional kernels. the results there can be very noisy, so do not expect to be able to use that in any meaningful way any time soon. even the ones Intel touted are, at highest precision, 16-bit floats.
for us it is all about those shader cores.

what would be interesting is to see if nvidia and amd expose their ray tracers / ray casters in compute shaders; maybe they could be used for collision on the GPU, which is still a huge challenge.
Julio Jerez
Moderator
 
Posts: 12452
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Julio Jerez » Fri Mar 25, 2022 9:24 pm

JoeJ wrote:I don't know about your proposed exponent bits trickery either, but sounds interesting.
Will look at your code to get the idea...


that's not a trick Joe, that's scientific notation in powers of two.
imagine a number system that only deals with fractions that are powers of 10:
then you can add and multiply, and all the fractions stay integers in the mantissa.
so what you do is subtract the tile origin from the absolute address, and the mantissa can only shift to the right by a fixed number of bits that you allow yourself to lose. say you allow 4 or 5 bits; that gives you 16 or 32 consecutive tiles, and you lose at most 5 bits of precision per operation.

moving from one tile to another only requires changing one double value, not changing any of the elements. what you get is that the elements have aliased addresses, which can easily be determined from the tile number.

for example, say you have a box at location 1.32 in a terrain, and you move it very smoothly, like 1.33, 1.34, .... that will work fine.
now if the same box was at location 3000.0 in the same terrain, you will see it moving very jittery.

now with a tile system, the 1.32 will be at some tile, say tile zero, if your tile size was say 128 (which would be huge, but keeps it simple).
in tiles, the location 2000.33 will be stored at tile = 2000 / 128 = 15,
and the fraction will be 2000.33 - 15 * 128 = 80.33,
and that number you can move very smoothly.

if you are rendering, what you do is figure out all the visible tiles in grid coordinates, so nothing is lost, and you place the camera at the aliased location and render from that point.
no movement needed.
Julio Jerez
Moderator
 
Posts: 12452
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby JoeJ » Sat Mar 26, 2022 5:35 am

Julio Jerez wrote:do not get too excited about tensor cores, at least not yet. from what I read, they are matrix operators, but they operate on a very limited set of floats and ints. essentially they do a lot of 8-bit add and multiply, and float16 multiply, and nothing higher than that.

I think there is one thing to learn: use built-in data types for matrices, so in case there are HW instructions the compilers can use them easily.
Until now I mostly implemented matrix multiply myself using dot products, which was mostly faster than built-in types due to using fewer registers.
I plan to port all my GI stuff to fp16. But matrix multiply is super rare there, so it won't benefit from tensors.

Julio Jerez wrote:what would be interesting is to see if nvidia and amd expose their ray tracers / ray casters in compute shaders; maybe they could be used for collision on the GPU, which is still a huge challenge.

That's already a reality with DXR 1.1. You can trace from compute ('inline tracing').
It has potential downsides, as it disables advanced grouping or reordering HW, which NV maybe has to some degree, but AMD surely has not yet. It is currently advised to profile and use the faster option for your current application and GPU.
Intel's RT will be very advanced. They already confirmed traversal shader support, so programmable traversal. This can be used to implement stochastic LOD by sending random rays to a lower-detailed version of your models. They did a paper about this idea. Other applications could maybe be connecting adjacent spaces, as we just discussed. But ofc. traversal shaders will add some extra cost, and it's not clear yet if they get API support. MS listed it as a potential future feature after Intel's paper (2 years ago), so I guess we'll get it, and other vendors will add support too.
Intel also talked about grouping ray hits per material, and they don't believe so much in the inline tracing idea. But it will be supported, and with physics models being low poly, it surely remains a valid option in the future.
JoeJ
 
Posts: 1494
Joined: Tue Dec 21, 2010 6:18 pm

Re: Cuda Solver

Postby Julio Jerez » Sat Mar 26, 2022 12:19 pm

If you were to compare tech companies to geopolitical warfare,
where the tech companies are the super powers and the applications are the battlefield:
Intel would be the United States.
Amd would be the USSR.
Apple, IBM, Nec, Motorola, Arm, etc. would be China.
Google, Microsoft and some of the other less relevant manufacturers would be like Germany, Britain or France.
Nvidia would be the equivalent of North Korea, or a terrorist organization like al-Qaeda.

Nvidia is the player that is never honest, uses forbidden weapons, does not take prisoners, and wages asymmetrical war against civilians.
Not that the other players do not cheat or bend the rules, but at Nvidia bending the rules is in the standard operating procedure playbook.

These tensor cores have zero application for anything other than a special type of hidden layer of neural net, called a deep convolutional neural net, that one dude figured out around 2000 as I remember.

Starting from the misleading name, Tensor, instead of what they actually do, which is vector multiply and accumulate.
In linear algebra, for more than 50 years that family of operations has been known as fvaxp.

That is x = a + sum(b[i] * c[i]) for an array of n elements.
Note that this operation is called a reduction in linear algebra, and it could be done in shaders if they added the ability to add registers across different lanes of a compute unit.

But what Nvidia does is what the old intel x87 float units did.

They have an intermediate register that is 80 bits wide to accumulate the partial result, and the value gets truncated when written to memory.
Intel did it right for 32 and 64 bit floats with the x87.
they also do it right for the avx2 fmuladd.

Now enter Nvidia.
It is a known fact that in deep learning tens of thousands of floats are near-zero values, but that is not a law of neural nets, not even for nets that process images. It only applies to the very specific features of those hidden layers that calculate the convolution kernels of a 4x4 or 8x8 block of quantized pixels.
If you look at the intermediate images of these hidden layers, you will see that they recognize features, so essentially it is a bunch of blocks of mostly zeros, and the pixels that represent a feature are saturated to a max value.
If you do these operations with normal floats it is really bad, because the near-zero values go denormal and the nonzero values overflow, which requires more hardware to handle.

So what they do is encode the input as a vector of 16 bit floats and apply the convolution, which is just a bunch of multiplies and adds.
But they add an internal 32 bit register to hold the partial result, and the output is a vector of 32 bit floats. so the special hardware is just a larger register.
People have been doing that for years for extended precision arithmetic. Even I do that for some high precision operations; check out the Googol class in newton.

There is no other application that I know of for that kind of arithmetic; it only applies to the way Nvidia implements their convolution kernels. But the dishonesty comes from the fact that they promoted it as if it were a general purpose feature when it isn't.

Until they give that functionality the full range of precision,
say full 32 or 64 bit floats, this has no application in the scientific or the video game world.
And I really question the quality of nvidia's results, since we know those layers introduce a substantial amount of quantization noise into the calculations.

But Nvidia knows that if they gave full precision to those tensor units, they would become just regular shader units. so it is a distinction without a difference.
Julio Jerez
Moderator
 
Posts: 12452
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby JoeJ » Sat Mar 26, 2022 12:53 pm

Haha, I like your attitude, even if it does not do you any favors...
... in the short run :mrgreen:

Idk, but to me it looks like the tech bubble is going to pop soon.
Looking at what the most valued companies are, it's all US tech companies. Many of them, like Facebook, do not even have an actual product, just services, plus a vision of a future that completely ignores humanity.
It also seems the main application of ML is just optimized marketing.
It's all worthless, and it does not look that good for the great western ideology.
The games industry is not really convincing either at the moment. The music industry is dead, and movies are just shallow shadows of what we already had in the past.
The young generation needs rock'n'roll, inspiration and goals. Nobody serves this need right now. Even the internet lost its freedom of speech to hypocritical, sycophancy-driven mind control, which isn't any better than censorship.
It does not feel safe, but more like sitting on top of a volcano, with some unknown revolution cooking below.
JoeJ
 
Posts: 1494
Joined: Tue Dec 21, 2010 6:18 pm

Re: Cuda Solver

Postby Julio Jerez » Sat Mar 26, 2022 2:22 pm

this is not just me. if you watch some review videos after all the fuss of an Nvidia announcement, you will find that almost each and every claim Nvidia has made over the years has turned out to be false or an exaggeration. just type Tensor core into a search and you get things like this:
https://www.youtube.com/watch?v=AssKiQvaSQ8&t=156s

that does not apply to just graphics; it applies to everything NVidia does. Cheating and heavy-handed tactics against partners and competitors seem to be the way they operate.
Nvidia is the Fox News of the tech world. Misleading information is a huge part of their success.

that's not to say they don't have some stuff that is good after many, many iterations, but at the leading edge they are not good faith actors.
Julio Jerez
Moderator
 
Posts: 12452
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby JoeJ » Sat Mar 26, 2022 2:50 pm

Well, no need to tell me. I've been enough of a forum warrior pointing out that tensor cores are currently useless for games, and that NV makes their customers pay the bill for baseless innovation claims. They should spend the die area otherwise.
It looks like an attempt to dictate progress: 'ML in rendering is the future, get it now already!' Well, maybe. But usually we first start to use stuff, and only after it is established, proven useful and widely used do we consider HW acceleration, as a thing we all agree on and request.
Not so with tensor cores, which after 3 years are still just cooling pads in any game.
Oh, sorry. I forgot DLSS 'enables' raytracing, because it is just too slow without upscaling. What an argument.
It's just that we can do this without ML as well. UE5 upscaling is great, and likely AMD's next version will compete with DLSS in quality too, since it's also temporal.

But what will happen? I'm sure it's this: journalists will measure that DLSS is still faster, so tensors are justified, and the future we all need. Those journalists will not factor in that using the tensor die area for compute would make the alternative upscaling, plus ANY other workload, faster as well. They are too stupid to get this. They do not point out Intel's marketing lies either, claiming their CPUs now beat Apple's M1 in power efficiency. They do not realize that the M1 power draw already includes RAM, while Intel's does not.
Show them some graphs with numbers and they believe it. It's that easy.

Actually I think NV has been fooling the game industry for decades. Their papers about inefficient solutions are made up to sell big GPUs, and the devs do not even try to come up with something better. They trust NV research, and do not see the conflict of interest with selling hardware or software.

Just like the experts at MS and Khronos, who did not realize how the raytracing API prevents any serious solution to the LOD problem by blackboxing the BVH. Sacrifice real progress to ease up hardware. Well done, amateurs.
JoeJ
 
Posts: 1494
Joined: Tue Dec 21, 2010 6:18 pm

Re: Cuda Solver

Postby Julio Jerez » Sat Mar 26, 2022 5:15 pm

on a lighter note.
I was planning to write the grid-based sweep and prune broad phase, but before that I went over the current method, and I found some very big mistakes whose fixes translate into a big performance gain.

the biggest mistake was a legacy from 3.xx.

the engine uses a flag called m_equilibrium to determine how a body needs to be updated.
over the years that flag has evolved into a complex state machine, but for the most part what the flag means is that a body does not need to be updated if it is in static equilibrium.
this flag is controlled by the solver.

the mistake I made is that the broad phase was also using the same flag to see if it needs to scan the scene for potential new collision contact joints.

here is the problem: from the solver's point of view a body can be in static equilibrium, meaning it is not moving, and yet the equilibrium flag can be false.
there are many reasons for this; the most obvious one is a matrix teleport, but the sleeping code can also clear the flag because a neighbor was moving very slowly, and there are more reasons still.

but here is the really big problem: imagine a spinning sphere. it will never be in equilibrium, yet the aabb of the sphere in the broad phase never changes. using that flag to trigger the scan forces the broad phase to run the scan with no change in the scene.
in fact, from the broad phase's point of view the items are the aabbs of the bodies, and for a spinning sphere that aabb never changes.

what this means is that the broad phase needs its own flag, set when the aabb of a body no longer fits inside the aabb of its proxy in the broad phase.

I just made that change, and the results are just beautiful. now almost all the time is attributed to the solver doing the calculation of forces, even for scenes that move a lot or that do not move at all.

with those changes it takes about 15 ms for the stacking scene,
and the gpu spinning cubes scene goes from 60 ms to 18 on the cpu.
even the cuda got better, at 3 ms, but in cuda that is a fixed cost.

here is how the profile look now
suspension.png (86.91 KiB) Viewed 6479 times


the time is spent in the solver, where it really counts.
we are now ready to go full steam ahead with the GPU.
Julio Jerez
Moderator
 
Posts: 12452
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Julio Jerez » Sun Mar 27, 2022 12:51 am

ok, I have now added the next cuda kernel, which extracts the transforms from the proxy bodies that live in the GPU.

so essentially it uses a kernel as a smart memcpy function; that way it only transfers the data it is interested in. that change makes the big box scene run at 3.+ ms,
and we can still reduce it to under 3 after I add the double buffer.
for reference, the box stacking scene runs in less than 0.5 ms.

now it is just a matter of adding more kernels until we cover the complete engine.
guys, this could be really, really good.
Julio Jerez
Moderator
 
Posts: 12452
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Julio Jerez » Sun Mar 27, 2022 10:07 am

now under 3 ms (2.7).
Untitled.png (50.35 KiB) Viewed 6461 times


the part that reads back and waits for the data can be hidden when I add the double-buffered streams, so it will be under 2 ms for that scene.

the cool thing is that the timing seems to be a lot more predictable.
also, another cool part is that about 90%, maybe more, of that time is just updating the body transforms on the CPU. that provides hope for all the stuff that is yet to come.
Julio Jerez
Moderator
 
Posts: 12452
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Julio Jerez » Sun Mar 27, 2022 8:44 pm

ok, I now have the double buffer scheme implemented, and it works like a charm.
the update is just a tad over 2 ms, but it can actually be better.
I found a big surprise that was unexpected, but first the cpu profile:
Untitled.png (45.68 KiB) Viewed 6434 times


as you can see, the only accountable time is the code that gets the transforms from the GPU; everything else is concurrent.
on the GPU side, I found that if I do not issue a synchronization between the two streams, the GPU accumulates the work and runs it until all the streams are used again, and at that point it auto-syncs.

it seems this is because the scene does not use all the GPU resources, so it can issue a few streams simultaneously. in the image you can see that streams 22 and 23 run concurrently and are doing the same thing, so that is not going to work; what we need is one stream on one frame and the other on the next frame.
suspension.png (49.96 KiB) Viewed 6434 times


but in any case, since the GPU only has one DMA channel in each direction, it serializes the memory copies, so it still takes twice as much time.

one way to serialize the streams is using the function
cudaDeviceSynchronize();

but that is quite terrible; it blocks everything, and that's not what we want.
anyway, this is quite an impressive piece of hardware.
Julio Jerez
Moderator
 
Posts: 12452
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Julio Jerez » Sun Mar 27, 2022 11:04 pm

and now I broke the entire engine really badly; it is trashing memory at random.
not sure where, but I made so many changes.
anyway, I now have to see where I went wrong.
Julio Jerez
Moderator
 
Posts: 12452
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Cuda Solver

Postby Bird » Mon Mar 28, 2022 7:00 am

Looks like you fixed it. :)

The 27k box scene now runs around 1.5 ms on my machine!
Bird
 
Posts: 636
Joined: Tue Nov 22, 2011 1:27 am
