A place to discuss everything related to Newton Dynamics.
by Julio Jerez » Fri Nov 06, 2020 2:18 pm
The first bit of bad news is that I added the very first thing that is necessary for SPH, which is finding the clusters of particles that are inside the radius of influence of a given particle.
I am using a grid block of 20 x 20 x 20 particles (8,000).
Turning that into a grid of clusters generates 64,000 potential pairs that need to be sorted by grid id.
Using the normal qsort costs 16 ms on my system.
That is already too slow, and since quick sort does not parallelize well, I will have to write a radix sort and see how that goes.
Radix sort is a mixed bag: in some cases it is faster than qsort, but in most cases it is actually slower. On the other hand, it is the only sorting algorithm I know of that parallelizes well.
I committed the test; again, if people can try it and let me know how it does on their systems, that will be some help.
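The idea is something along these lines (just a sketch with hypothetical names, not the engine's actual code): each particle emits a (cell id, particle index) pair, and sorting by cell id makes all particles that share a cell contiguous.

#include <cstdint>
#include <cstdlib>
#include <vector>

struct GridPair
{
    uint64_t m_cellId;  // id of the grid cell the particle falls in
    int32_t m_particle; // index of the particle
};

static int ComparePairs(const void* a, const void* b)
{
    const GridPair* const p0 = (const GridPair*)a;
    const GridPair* const p1 = (const GridPair*)b;
    if (p0->m_cellId < p1->m_cellId) return -1;
    if (p0->m_cellId > p1->m_cellId) return 1;
    return 0;
}

// after this sort all particles sharing a cell are contiguous,
// so the neighbor candidates can be read off in one linear scan
void SortByCellId(std::vector<GridPair>& pairs)
{
    qsort(pairs.data(), pairs.size(), sizeof(GridPair), ComparePairs);
}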
by Bird » Fri Nov 06, 2020 4:39 pm
I get a physics time of around 13 - 15 ms on my system with the latest GitHub commit.
Jan Bender has an open source project for parallel computation of nearest neighbors that maybe you could use or learn from. I've played a little with his SPlisHSPlasH and it seems to handle around 8,000 particles in real time.
https://github.com/InteractiveComputerGraphics/CompactNSearch
by Dave Gravel » Fri Nov 06, 2020 5:08 pm
Config: Ryzen 2700X + 16 GB of memory + a little NVIDIA GTX 1060 3 GB.
Here it is stable at 13 ms.
I get around 2500-3000 fps.
Sometimes if I move and don't see all the boxes, the fps drops to 1500.
If I place the camera so I can see all the boxes, it runs at around 2500-3000 fps.
The ms stays stable at 13 here.
My water cooling fans run a bit faster with this test; it surely demands more of the CPU.
by Julio Jerez » Fri Nov 06, 2020 6:01 pm
Oh no, I refrain from using any of the stuff from those people.
There is nothing in that paper that I did not already know.
It is all just rehashed literature from people who did it first, presented as new stuff.
I believe the way I am hashing is far, far superior anyway.
The reason the test was slow is that I just wanted to try a regular sort of the buckets, to see how it would compare to generating the buckets in place.
I have now added a hash that lets me compute the bucket ID directly, so I can generate the buckets in place.
In fact, the buckets being all clustered and evenly distributed suggests that they can be generated in place, with no need for sorting.
Try again: I added the hash to get the key, and now it is under 2 ms.
This is the most expensive part of the simulation. If we can get it under 2 ms, then we are heading toward having it all in real time for a set of up to 8 to 10 thousand particles with one core, and probably 32 thousand with four cores, assuming each core yields a 50% gain, which is conservative.
Please try again.
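Roughly, generating the buckets in place looks like this sketch (the hash constants and names are my assumptions, not the committed code): hash the quantized cell coordinates straight to a bucket index and append the particle there, with no sorting pass.

#include <cstdint>
#include <vector>

struct Particle
{
    float m_x, m_y, m_z;
};

// hash three quantized cell coordinates to a bucket index; the
// multipliers are well-known spatial-hash primes, not Newton's constants
inline uint32_t CellHash(int32_t ix, int32_t iy, int32_t iz, uint32_t bucketCount)
{
    const uint32_t h = uint32_t(ix) * 73856093u
                     ^ uint32_t(iy) * 19349663u
                     ^ uint32_t(iz) * 83492791u;
    return h % bucketCount; // a power-of-two bucketCount would allow a mask
}

// append each particle to its bucket directly, no sorting pass needed
void BuildBucketsInPlace(const std::vector<Particle>& particles,
    float invRadius, std::vector<std::vector<int32_t>>& buckets)
{
    for (int32_t i = 0; i < int32_t(particles.size()); ++i)
    {
        const Particle& p = particles[i];
        const int32_t ix = int32_t(p.m_x * invRadius);
        const int32_t iy = int32_t(p.m_y * invRadius);
        const int32_t iz = int32_t(p.m_z * invRadius);
        buckets[CellHash(ix, iy, iz, uint32_t(buckets.size()))].push_back(i);
    }
}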
by Dave Gravel » Fri Nov 06, 2020 6:24 pm
I get 1.36 ms, pretty stable, here now.
2900-3200 fps.
The CPU doesn't hit the high peaks now, and my fans don't spin up like in the previous test.
It looks better to me.
by Bird » Fri Nov 06, 2020 7:41 pm
Excellent! I'm getting a physics time of 1.2 - 1.3 ms and around 4000 fps.
by Julio Jerez » Sat Nov 07, 2020 1:15 pm
Yes, but that was not building the complete grid map, only the grid with the particles that have their origin in that cell; calculating the neighbors is when it gets slow.
The code now calculates the full map.
Anyway, I made an improvement to the existing code.
I now have two algorithms, both of which can be improved substantially.
The first one builds the grid in place: for each particle, it builds a cluster containing all the particles whose origin falls in the cell formed by its quantized AABB, with a diameter equal to the radius of the smoothing filter.
This is my preferred method because there is no post-processing, but unfortunately it is the slowest one.
I have now made a change that makes the grid hash a 64-bit integer (built from 20 mantissa bits of each component of the vector that represents the position of the particle).
The result is that the 32-bit build got slower, because 64-bit integer operations in 32-bit mode are much slower than float32 using SIMD.
It got somewhat faster in 64-bit mode, going from 16 ms to about 12 ms on this machine.
I will have to abandon this method, for two reasons:
1 - It is slower, because the calculation of the grid ID goes through a log(n) red-black tree; this could probably be improved by using a hash map.
2 - And this is the worst part: it does not scale to multiple cores easily, since the construction requires high-level containers that do not parallelize.
The second method is as follows: build the entire set of all the hashes and neighbor particles in a flat array.
Then, in another pass, sort the array by grid id using some parallelizable algorithm.
This is the method I am trying now, and I am still using quick sort.
As before, it is slow in 32-bit mode, for the same reason: 64-bit integer operations are slow in 32-bit mode.
It is reasonably fast in 64-bit mode (4 ms on my system).
This can now be improved by switching to a radix or bucket sort, which can be parallelized.
I will try radix first, since bucket sort would require a mapping from grid (x, y, z) to a cell id.
Plus, since radix sort works on digits, it is not affected by the cost of 64-bit integer operations in 32-bit mode.
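The 64-bit key can be packed along these lines (a sketch of the idea; the exact bit layout and quantization in the engine may differ):

#include <cstdint>

// pack three 20-bit quantized coordinates into one 64-bit grid key
// (3 x 20 = 60 bits used); assumes the coordinates have already been
// biased to be non-negative
inline uint64_t MakeGridKey(float x, float y, float z, float invCellSize)
{
    const uint64_t ix = uint64_t(int64_t(x * invCellSize)) & 0xfffff;
    const uint64_t iy = uint64_t(int64_t(y * invCellSize)) & 0xfffff;
    const uint64_t iz = uint64_t(int64_t(z * invCellSize)) & 0xfffff;
    return (iz << 40) | (iy << 20) | ix;
}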
If anyone wants to test it, you can switch between the modes by commenting out the define:
//#define D_USE_IN_PLACE_BUCKETS
in file: ...\newton-4.00\sdk\dNewton\ndBodySphFluid.h
by Bird » Sat Nov 07, 2020 5:14 pm
In the very latest GitHub version:
with D_USE_IN_PLACE_BUCKETS defined I'm seeing
physics time = 1 ms and about 5500 fps
with D_USE_IN_PLACE_BUCKETS not defined I'm seeing
physics time = 0.3 ms and about 5500 fps
GPU: NVIDIA Geforce RTX 2070 Super 8GB
CPU: 8 core Intel i7-9700K@3.60GHz 32GB
by Julio Jerez » Sat Nov 07, 2020 5:27 pm
OK, I have now switched to a completely dedicated radix sort, and it is about twice as fast.
On my system it went from 4 ms to under 2 ms.
This is important because it now just gives me the full list of all the pairs. The only thing left is the actual physics simulation, which I suspect is about 2 or 3 ms, so we can expect that a batch of fluid particles will take about 4 to 5 ms using a single core on a mainstream CPU like my Core i7-7700 at 3.6 GHz.
The cool thing is that these algorithms parallelize well.
If you run it, just tell me what timings you get.
On a side note, I have always disliked radix sort because it is not generic enough: you have to do a lot of preconditioning of the data to get it to work. But it is undeniable that for large sets it is faster than qsort.
After all this, I will replace some places where I am using qsort with specialized versions of radix sort; the engine could benefit from that.
Anyway, we are now very close to getting the simulation going.
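The replacement is a textbook least-significant-digit radix sort over the 64-bit cell keys, one byte per pass; this sketch reuses the hypothetical GridPair struct from the earlier example and is not the engine's exact code.

#include <cstdint>
#include <utility>
#include <vector>

void RadixSortByCellId(std::vector<GridPair>& pairs)
{
    std::vector<GridPair> temp(pairs.size());
    GridPair* src = pairs.data();
    GridPair* dst = temp.data();
    for (int32_t shift = 0; shift < 64; shift += 8)
    {
        // histogram of this 8-bit digit
        int32_t count[256] = {};
        for (size_t i = 0; i < pairs.size(); ++i)
        {
            count[(src[i].m_cellId >> shift) & 0xff]++;
        }
        // exclusive prefix sum turns counts into start offsets
        int32_t offset = 0;
        for (int32_t j = 0; j < 256; ++j)
        {
            const int32_t c = count[j];
            count[j] = offset;
            offset += c;
        }
        // stable scatter into the destination buffer
        for (size_t i = 0; i < pairs.size(); ++i)
        {
            dst[count[(src[i].m_cellId >> shift) & 0xff]++] = src[i];
        }
        std::swap(src, dst);
    }
    // after an even number of passes (8) the sorted data is back in pairs
}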
by Bird » Sat Nov 07, 2020 5:37 pm
Now I'm seeing
with D_USE_IN_PLACE_BUCKETS defined
physics time = 13 ms and about 3900 fps
with D_USE_IN_PLACE_BUCKETS not defined
physics time = 1.6 ms and about 3900 fps
by Julio Jerez » Sat Nov 07, 2020 5:42 pm
Yes, the 1.3 ms is the new figure.
This is the complete sort for 8,000 close particles.
Single core.
It is about 7 times faster than using the standard sort, and that is before any parallelization.
It is such a blowout compared to the in-place buckets that I will just delete the in-place method.
by Julio Jerez » Sun Nov 08, 2020 12:26 pm
OK, I added part one of the optimization.
Yes, I know it is early, but this phase is critical for deciding whether this will be a practical feature or just a gimmick that sits there as a checkbox no one can really use.
I parallelized the pair generation, and the results are very, very promising.
I increased the number of particles to 32k (32 x 32 x 32).
This produces about 120k pairs and takes 2.5+ ms to generate on my system.
The parallel version, even with the overhead, is only about 3 to 5% slower on a single thread, which means we can keep just one algorithm. Here is the good news: the performance is linear with core count. With 4 cores it is 0.9 ms.
Even better, when I increase the thread count to 8, it is 0.7 ms.
That 0.2 ms is actually significant, because this is a 4-core hyperthreaded CPU, so more threads usually only make things more responsive, not faster. In fact, most high-performance threaded job systems tend to run slower when the thread count exceeds the hardware core count, because of the cost of task switching.
Anyway, we are now at under 1 ms for 32k particles using 4 cores on a mainstream system.
I am now ready for the second phase: sorting in parallel, which was around 2+ ms single-core.
It is all committed to GitHub.
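The parallel pair generation is conceptually like this sketch: each thread scans its own slice of the particles into a thread-local array, so no locking is needed. Plain std::thread stands in for the engine's job system, and EmitPairsForParticle is a hypothetical helper.

#include <cstdint>
#include <thread>
#include <vector>

// hypothetical: appends the (cellId, particle) pairs for one particle
void EmitPairsForParticle(int32_t particle, std::vector<GridPair>& out);

void BuildPairsParallel(int32_t particleCount, int32_t threadCount,
    std::vector<std::vector<GridPair>>& perThread)
{
    perThread.resize(threadCount);
    std::vector<std::thread> workers;
    for (int32_t t = 0; t < threadCount; ++t)
    {
        workers.emplace_back([=, &perThread]()
        {
            // each worker owns a disjoint slice of the particle range
            const int32_t start = particleCount * t / threadCount;
            const int32_t end = particleCount * (t + 1) / threadCount;
            for (int32_t i = start; i < end; ++i)
            {
                EmitPairsForParticle(i, perThread[t]);
            }
        });
    }
    for (std::thread& w : workers)
    {
        w.join();
    }
    // the per-thread arrays can now be concatenated, or sorted as runs
}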
by Leadwerks » Sun Nov 08, 2020 2:05 pm
I have Newton isolated entirely on a separate thread, so it will get a full 16.667 milliseconds to execute each update. It may be a while before I get to test this out since I am working hard to get our new engine out, but I can tell you this stuff will be used by NASA and others.
by Julio Jerez » Sun Nov 08, 2020 5:40 pm
Actually, I am doing this for myself; I do not really care what "NASA" or anyone else thinks about it.
Anyway, I now have the parallel radix sort, which creates sorted runs over separate batches that can then be combined with a merge.
The results seem quite good: for the 32,000 particles, in a single thread I get 6.5+ ms,
with 4 threads in parallel, 2.2 ms,
and with 8 threads, 1.5 ms.
More threads yield no further improvement, but like I said, this is a four-core hyperthreaded mainstream CPU.
It all points to the final simulation being under 4 or 5 ms for 32,000 particles,
which I think is usable.
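Conceptually the parallel sort works like this sketch: each thread radix sorts its own batch, then the sorted runs are merged. std::thread and std::inplace_merge stand in for the engine's job system and merge, and RadixSortRange is assumed to be the single-threaded radix sort from before applied to a sub-range.

#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// hypothetical: the single-threaded radix sort applied to [begin, end)
void RadixSortRange(GridPair* begin, GridPair* end);

void ParallelSortByCellId(std::vector<GridPair>& pairs, int32_t threadCount)
{
    // phase 1: radix sort each batch on its own thread
    std::vector<size_t> bounds(threadCount + 1);
    for (int32_t t = 0; t <= threadCount; ++t)
    {
        bounds[t] = pairs.size() * t / threadCount;
    }
    std::vector<std::thread> workers;
    for (int32_t t = 0; t < threadCount; ++t)
    {
        workers.emplace_back([&pairs, &bounds, t]()
        {
            RadixSortRange(pairs.data() + bounds[t], pairs.data() + bounds[t + 1]);
        });
    }
    for (std::thread& w : workers)
    {
        w.join();
    }
    // phase 2: merge the sorted runs pairwise until one run remains
    // (the merges at each level could themselves run in parallel)
    for (int32_t step = 1; step < threadCount; step *= 2)
    {
        for (int32_t t = 0; t + step < threadCount; t += 2 * step)
        {
            std::inplace_merge(
                pairs.begin() + bounds[t],
                pairs.begin() + bounds[t + step],
                pairs.begin() + bounds[std::min(t + 2 * step, threadCount)],
                [](const GridPair& a, const GridPair& b)
                {
                    return a.m_cellId < b.m_cellId;
                });
        }
    }
}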
by Bird » Mon Nov 09, 2020 6:28 am
Wow! I'm getting a little under 1 ms with 32k particles using 8 threads with the latest version.