400 Series Performance?

Postby KKlouzal » Mon Dec 06, 2021 8:13 pm

Hello, thank you for all the great work that has been put into newton over the years, I feel like you deserve a lot more credit and recognition than has been afforded thus far.

I would like to switch my current physics engine over to using Newton Dynamics 400 series. I see some things I really like about the library compared to others out there.

The library, at least in its latest incarnation, appears to have been built from the ground up with multithreading in mind. I read a post where you mention it is lock free, which is simply amazing.
Also built to take advantage of processing on the GPU, again wonderful!

These two points have me looking strongly at Newton Dynamics. I have a concern though. Realistically, how many rigid bodies can we simulate at any given time? My graphical backend is using Vulkan, very fast, and I need my physical backend to be just as fast.

Currently, using another library, I can have roughly 3500 rigid bodies actively interacting and keep 60fps.

Can newton do the same? Better?

I'm going to create a new branch of my project and switch to Newton so I can compare the two but would like to hear from the community. What is performance like? What things can be done to maximize performance?

Thank you!
KKlouzal
 
Posts: 16
Joined: Tue Jan 23, 2018 11:59 am

Re: 400 Series Performance?

Postby KKlouzal » Tue Dec 07, 2021 12:07 am

After playing around with the demo sandbox, newton looks to be more than capable. I might go so far as to say you have created the best physics library with this 400 series.

Are there any skeleton tutorials available to get a basic simulation up and running?
"Test" application might be everything needed to get up and running..

Anything specific needed to build with OpenCL solver?
It's failing to build with that enabled.
KKlouzal
 
Posts: 16
Joined: Tue Jan 23, 2018 11:59 am

Re: 400 Series Performance?

Postby Julio Jerez » Thu Dec 09, 2021 6:31 pm

I am going through some refactoring of the solver in order to make it portable to OpenCL.
I do not want the GPU and CPU versions to be too different, so some of the branching code, virtual functions, and atomic locks that make it easy to program for the CPU are hard on the GPU, and they have to be predicated.

It is taking me longer than I expected because of other obligations, but I will try to get it back before the year's end.
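
As a rough illustration of what "predicated" means here, the sketch below writes the same clamp twice: once with branches, and once as straight-line min/max and mask arithmetic that maps more naturally to GPU and SIMD code. The function names are hypothetical and are not taken from the Newton source; this only shows the general technique.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Branching version: natural on a CPU, but threads that take different
// paths diverge on a GPU.
inline float ClampBranchy(float x, float low, float high)
{
    assert(low <= high);
    if (x < low) { return low; }
    if (x > high) { return high; }
    return x;
}

// Predicated version: no control flow, every lane runs the same instructions.
inline float ClampPredicated(float x, float low, float high)
{
    assert(low <= high);
    return std::fmax(low, std::fmin(x, high));
}

// Mask-based select: returns a when cond is true, b otherwise, without a branch.
inline std::int32_t SelectInt(bool cond, std::int32_t a, std::int32_t b)
{
    const std::int32_t mask = -static_cast<std::int32_t>(cond); // all ones or all zeros
    return (a & mask) | (b & ~mask);
}
```

Both clamp versions return the same value for ordinary inputs; the predicated one simply always does the same amount of work.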
Julio Jerez
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: 400 Series Performance?

Postby KKlouzal » Thu Dec 09, 2021 7:36 pm

No problem! You have done AMAZING work on the 400 series. I tried the 300 series a few years ago and it was good, but what you have now is a work of art!

I was able to get a simple simulation worked into my engine last night: 10,000 rigid bodies on the CPU, 16 threads, 20-30 fps. I'm only targeting a maximum of 5,000 bodies, so your library more than fits my needs, far better than any other library available today!

I will post a video later tonight showcasing.
KKlouzal
 
Posts: 16
Joined: Tue Jan 23, 2018 11:59 am

Re: 400 Series Performance?

Postby Julio Jerez » Thu Dec 09, 2021 8:40 pm

Yes, that's the idea.
The latest changes that removed all of the atomics make it much faster, but more importantly they make it possible for multicore to yield much better performance. Before, with atomic read-modify-write, once you got three or more cores running in parallel the code became memory limited.

The other point is that the SIMD solvers are also much faster. They are commented out, but I will soon re-enable them. We are shooting for 5 to 6 thousand bodies on the CPU at a playable frame rate.

We are also going to add a special solver mode named progressive sleep, which will be less realistic but will allow for many thousands of bodies, even on the CPU.

Please make the video, if you can. Maybe when I re-enable the AVX solver you could get much better performance. The solver you are trying now is the Generic Template that is used as the starting point for the SIMD and GPU solvers.
Julio Jerez
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: 400 Series Performance?

Postby KKlouzal » Thu Dec 09, 2021 9:48 pm

Will definitely post later tonight.

AVX solver would be nice to have, could enable it on hardware that supports AVX.
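
For what it's worth, picking a solver at runtime based on CPU support could look roughly like the sketch below. It assumes GCC or Clang on x86 (where `__builtin_cpu_supports` is available) and falls back to a generic path everywhere else. `SolverKind`, `PickSolver`, and `SolverName` are hypothetical names, not part of the Newton API, and checking for "avx2" and "sse4.2" specifically is my assumption.

```cpp
#include <cassert>

// Hypothetical solver tags, for illustration only.
enum class SolverKind { Generic, Sse, Avx2 };

// Ask the CPU what it supports and pick the widest solver available.
inline SolverKind PickSolver()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) { return SolverKind::Avx2; }
    if (__builtin_cpu_supports("sse4.2")) { return SolverKind::Sse; }
#endif
    // Other compilers or architectures: stay on the portable path.
    return SolverKind::Generic;
}

inline const char* SolverName(SolverKind kind)
{
    switch (kind) {
        case SolverKind::Generic: return "generic";
        case SolverKind::Sse:     return "sse";
        case SolverKind::Avx2:    return "avx2";
    }
    assert(false && "unknown SolverKind");
    return "unknown";
}
```

The application would call `PickSolver()` once at startup and hand the result to whatever mechanism the physics library exposes for choosing a solver.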

The end goal (for me) is to have physics processing offloaded to the integrated GPU and leave graphics processing on the dedicated GPU. Users can opt to have 2 dedicated GPUs, maybe a high-performance one for graphics and a cheaper, lower-performance one for physics processing. This leaves CPU resources free for other things like networking.

With PCIe 5.0 hitting the market, then soon 6.0 and after that CXL will replace PCIe, memory bandwidth bottlenecks should be much less of an issue, especially with DDR5, GDDR6X and even HBM3 memory about to become mainstream...computational processing is about to get very interesting :D

It's a wonderful time to be a multi-threaded applications developer!
KKlouzal
 
Posts: 16
Joined: Tue Jan 23, 2018 11:59 am

Re: 400 Series Performance?

Postby Julio Jerez » Fri Dec 10, 2021 4:25 pm

KKlouzal wrote:The library, at least in its latest incarnation, appears to have been built from the ground up with multithreading in mind. I read a post where you mention it is lock free which is simply amazing.
Also built to take advantage of processing on the GPU, again wonderful!


Yes, the solver is not only lock free, it is also atomic free and for the most part branch free.
That is what makes it possible to get a more linear performance gain with multiple cores.

Using locks and atomic read-modify-write is fine with two or three cores, but once you go past that the cores all get blocked in bus contention, and even if the code runs on multiple cores it does not really gain much since the bus serializes the accesses. This starts to become a problem even when using just two cores with AVX. The AVX solver can solve up to 4 or 8 joints per call, but the result has to be written to a body buffer, and there can be collisions there where two cores write to or read from the same entry.
So solving, say, 8 joints you have up to 16 bodies to write to, and if another core also has 16 bodies, the chances of clashes are as high as having 8 or more threads each solving one joint.
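
One common way to avoid both locks and atomic read-modify-write when several threads update a shared body buffer is to let each thread accumulate into a private buffer, then merge in a second pass where every body has exactly one writer. The sketch below illustrates that idea only. It is an assumption about the general approach, not Newton's actual solver, and `Joint`, `RunParallel`, and `AccumulateNoAtomics` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical joint: applies +impulse to body0 and -impulse to body1.
struct Joint { int body0; int body1; float impulse; };

// Run fn(threadIndex) on threadCount threads and wait for all of them.
template <typename Fn>
void RunParallel(int threadCount, const Fn& fn)
{
    std::vector<std::thread> pool;
    for (int t = 0; t < threadCount; ++t) {
        pool.emplace_back([&fn, t]() { fn(t); });
    }
    for (std::thread& worker : pool) { worker.join(); }
}

inline std::vector<float> AccumulateNoAtomics(const std::vector<Joint>& joints,
                                              int bodyCount, int threadCount)
{
    assert(threadCount > 0 && bodyCount >= 0);
    // One private accumulator per thread, so pass 1 never shares a write target.
    std::vector<std::vector<float>> partial(threadCount, std::vector<float>(bodyCount, 0.0f));
    std::vector<float> result(bodyCount, 0.0f);

    // Pass 1: joints are dealt out round-robin; thread t writes only partial[t].
    RunParallel(threadCount, [&](int t) {
        for (std::size_t j = static_cast<std::size_t>(t); j < joints.size();
             j += static_cast<std::size_t>(threadCount)) {
            partial[t][joints[j].body0] += joints[j].impulse;
            partial[t][joints[j].body1] -= joints[j].impulse;
        }
    });

    // Pass 2: bodies are split into contiguous ranges; each result[b] has one writer.
    RunParallel(threadCount, [&](int t) {
        const int begin = bodyCount * t / threadCount;
        const int end = bodyCount * (t + 1) / threadCount;
        for (int b = begin; b < end; ++b) {
            float sum = 0.0f;
            for (int k = 0; k < threadCount; ++k) { sum += partial[k][b]; }
            result[b] = sum;
        }
    });
    return result;
}
```

The trade-off is extra memory (one buffer per thread) and a second pass, in exchange for never having two cores contend on the same entry.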

This was quite challenging to get working, but the results are quite impressive.
Anyway, I will see if I can re-enable all the solvers over this weekend.
Julio Jerez
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: 400 Series Performance?

Postby KKlouzal » Fri Dec 10, 2021 8:35 pm

Personally I'm not too much interested in the AVX solver. More importantly being able to run on the GPU.
As things are now with CPU solver I am happy, can move to GPU solver in the future, but not a high priority at this time.

What I am most concerned with are the issues I've been opening up on GitHub. I keep running into some random asserts. There have been some others that I still need to create issues for.
KKlouzal
 
Posts: 16
Joined: Tue Jan 23, 2018 11:59 am

Re: 400 Series Performance?

Postby Julio Jerez » Sat Dec 11, 2021 1:02 pm

As I suspected, that was caused by some debug code I had in the old solver so that I could compare its solution with the new solver. After commenting out that code, the bug goes away.
If you sync, the bug should go away.

You will have to do some renaming, like dVector to ndVector.
This is necessary because some people are using 3.14 and 4.00 together, and some classes in the core library had the same name, which causes clashes at the high level.

This could be resolved using a namespace, but I am using a name prefix.
Maybe the next version will use a namespace, but not now.
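
To make the two options concrete, here is a small sketch with stand-in types. The structs below are placeholders I made up for illustration, not the real Newton classes, and the `newton3` / `newton4` namespaces are hypothetical.

```cpp
#include <cassert>

// Prefix approach: the 3.14-style and 4.00-style names can share one scope.
struct dVector  { float x, y, z, w; };  // stand-in for a 3.14 class
struct ndVector { float x, y, z, w; };  // stand-in for the renamed 4.00 class

// Namespace alternative (hypothetical): same short name, separate scopes.
namespace newton3 { struct Vector { float x, y, z, w; }; }
namespace newton4 { struct Vector { float x, y, z, w; }; }

// Code that mixes both versions compiles either way, as long as the
// names are distinguished by prefix or by namespace qualification.
inline float SumXPrefix(const dVector& a, const ndVector& b)
{
    return a.x + b.x;
}

inline float SumXNamespace(const newton3::Vector& a, const newton4::Vector& b)
{
    return a.x + b.x;
}
```

With the prefix, user code only needs a search and replace; with namespaces, a stray `using namespace` in a shared header could bring the clash back.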

I still have not renamed the native types like dInt32 and dFloat32.
This will be done in a second round of commits.
Julio Jerez
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: 400 Series Performance?

Postby Julio Jerez » Fri Dec 17, 2021 2:33 pm

This is a follow-up to that question or post.
KKlouzal wrote:Personally I'm not too much interested in the AVX solver. More importantly being able to run on the GPU.
As things are now with CPU solver, I am happy, can move to GPU solver in the future, but not a high priority at this time.
What I am most concerned with are the issues I've been opening up on GitHub. I keep running into some random asserts. There have been some others that I still need to create issues for.


Happy to say that with the lock-free algorithm, for the first time I see AVX2 beating the SSE solver by a substantial margin.

For comparison, a 30 x 30 pyramid running on an Intel Core i7 7700 with four threads:
the default solver takes 11 to 12 ms per frame to bring to rest.
the SSE solver takes 7 to 8 ms per frame to bring to rest.
the AVX2 solver takes 4 to 5 ms per frame to bring to rest.

AVX2 is almost twice as fast, so Intel was right: AVX2 really does yield double or more floats per clock, but it is really hard to get that kind of performance.
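
As a quick sanity check on those timings, the sketch below takes the midpoint of each quoted range and computes the ratios. Using midpoints is my own simplification, and `MsRange`, `Mid`, and `Speedup` are hypothetical helper names.

```cpp
#include <cassert>

// A quoted timing range in milliseconds per frame.
struct MsRange { double low; double high; };

inline double Mid(MsRange r)
{
    assert(r.low <= r.high);
    return 0.5 * (r.low + r.high);
}

// How many times faster `fast` is than `base`, using range midpoints.
inline double Speedup(MsRange base, MsRange fast)
{
    return Mid(base) / Mid(fast);
}

// Figures quoted in the post above.
const MsRange kDefaultSolver{11.0, 12.0};
const MsRange kSseSolver{7.0, 8.0};
const MsRange kAvx2Solver{4.0, 5.0};
```

With these midpoints, `Speedup(kSseSolver, kAvx2Solver)` comes out at about 1.67 and `Speedup(kDefaultSolver, kAvx2Solver)` at about 2.56, which is consistent with "almost twice as fast" relative to SSE.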

Just try syncing again and test the AVX solver; you will see a big difference.

I will now do some final tuning and make the first release, so that people always have a baseline to fall back on.
Julio Jerez
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

