OK, my experiment of combining jobs by implementing a sync function that spins the processor until the other threads reach a sync point did make the code faster in general, but very unstable.
In the 2001 pile of bodies, the parallel solver went from about 20 million clocks to about 9 million,
but it was a black art: setting more threads made it go to hundreds of millions of clocks.
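
For anyone who wants to picture the kind of sync function described above, here is a minimal sketch in standard C++11. It is only an illustration under stated assumptions, not the engine's actual code: the names SpinBarrier, sync and runSpinDemo are hypothetical, and the barrier is a plain generation-counting spin barrier.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical spin barrier (illustration only, not the engine's code).
class SpinBarrier {
public:
    explicit SpinBarrier(int threadCount)
        : m_count(threadCount), m_waiting(0), m_generation(0) {
        assert(threadCount > 0);
    }

    // Busy-waits until all m_count threads have called sync().
    void sync() {
        const uint32_t gen = m_generation.load(std::memory_order_acquire);
        if (m_waiting.fetch_add(1, std::memory_order_acq_rel) + 1 == m_count) {
            // Last thread to arrive: reset the counter, then release everyone.
            m_waiting.store(0, std::memory_order_relaxed);
            m_generation.fetch_add(1, std::memory_order_release);
        } else {
            // Pure spin: burns the core and never tells the scheduler
            // that this thread is waiting for somebody else.
            while (m_generation.load(std::memory_order_acquire) == gen) {
            }
        }
    }

private:
    const int m_count;
    std::atomic<int> m_waiting;
    std::atomic<uint32_t> m_generation;
};

// Runs `rounds` two-phase jobs on `threadCount` threads and returns how many
// times a thread read a stale value after the sync point (0 if the barrier works).
int runSpinDemo(int threadCount, int rounds) {
    SpinBarrier barrier(threadCount);
    std::vector<int> slots(threadCount, 0);
    std::atomic<int> errors(0);
    std::vector<std::thread> workers;
    for (int id = 0; id < threadCount; ++id) {
        workers.emplace_back([&, id] {
            for (int r = 1; r <= rounds; ++r) {
                slots[id] = r;         // phase 1: each thread does its own job
                barrier.sync();        // wait until every job of this phase is done
                for (int v : slots) {  // phase 2: read what the others wrote
                    if (v != r) errors.fetch_add(1);
                }
                barrier.sync();        // nobody starts the next phase early
            }
        });
    }
    for (std::thread& t : workers) t.join();
    return errors.load();
}
```

Calling runSpinDemo(2, 50) should return 0. If you create more threads than there are free cores, the waiting threads burn their whole time slice in the while loop, which is the kind of blow-up described above.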
That was not all. Second-guessing semaphores or mutexes is a do-at-your-own-risk thing, at least in Windows, because if you get all your cores on a spin and it happens that another process starts a high-priority thread, then you get into a spin lock that you can never get out of.
To solve this problem, semaphores have a way of dealing with what is called priority inversion: the scheduler temporarily gives the waiting thread a much higher priority so that it gets scheduled. With a raw spin you will never get that if this or other cores are running threads of higher priority than yours.
This makes implementing a sync function without semaphores a no-no.
I am not going to deal with priority inversion BS, therefore I abandoned that approach.
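
For comparison, a common middle ground is to spin only for a bounded number of iterations and then fall back to a real blocking wait, so the scheduler knows the thread is waiting. The sketch below assumes standard C++11 std::mutex and std::condition_variable; HybridBarrier, spinLimit and runHybridDemo are hypothetical names, and this is an illustration of the idea rather than a tuned implementation.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical spin-then-block barrier (illustration only).
class HybridBarrier {
public:
    HybridBarrier(int threadCount, int spinLimit)
        : m_count(threadCount), m_spinLimit(spinLimit), m_waiting(0), m_generation(0) {
        assert(threadCount > 0);
    }

    void sync() {
        std::unique_lock<std::mutex> lock(m_mutex);
        const uint32_t gen = m_generation.load();
        if (++m_waiting == m_count) {
            // Last thread to arrive: open the barrier and wake any sleepers.
            m_waiting = 0;
            m_generation.store(gen + 1);
            lock.unlock();
            m_cond.notify_all();
            return;
        }
        lock.unlock();
        // Optimistic part: spin for at most m_spinLimit iterations.
        for (int i = 0; i < m_spinLimit; ++i) {
            if (m_generation.load(std::memory_order_acquire) != gen) return;
        }
        // Fallback: block in the OS so the scheduler can run other threads.
        lock.lock();
        m_cond.wait(lock, [&] { return m_generation.load() != gen; });
    }

private:
    const int m_count;
    const int m_spinLimit;
    int m_waiting;  // protected by m_mutex
    std::atomic<uint32_t> m_generation;
    std::mutex m_mutex;
    std::condition_variable m_cond;
};

// Same two-phase test as a plain spin barrier would get; returns the number
// of stale reads seen after a sync point (0 if the barrier works).
int runHybridDemo(int threadCount, int rounds, int spinLimit) {
    HybridBarrier barrier(threadCount, spinLimit);
    std::vector<int> slots(threadCount, 0);
    std::atomic<int> errors(0);
    std::vector<std::thread> workers;
    for (int id = 0; id < threadCount; ++id) {
        workers.emplace_back([&, id] {
            for (int r = 1; r <= rounds; ++r) {
                slots[id] = r;
                barrier.sync();
                for (int v : slots) {
                    if (v != r) errors.fetch_add(1);
                }
                barrier.sync();
            }
        });
    }
    for (std::thread& t : workers) t.join();
    return errors.load();
}
```

With spinLimit set to 0 this degenerates into a plain blocking barrier, and with a very large spinLimit it behaves like the pure spin version, so the value is a tuning knob. It does not remove the task switch cost; it only bounds how long a core can be wasted.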
That still leaves the high cost of task switching, so I was researching how other people have dealt with this problem. Surely there has to be a solution; it does not make sense that running multiple threads can potentially make your app slower.
Many people talk about Intel's Threading Building Blocks template library, and guess what: in their docs they say that a task switch can cost up to one million clock cycles. That's far worse than what I measured, which is up to around 256.
Anyway, it seems that in Windows 10, 64-bit mode, there is a way to make threads very lightweight by calling a function that enters the user-mode scheduler, which apparently puts the scheduling responsibility on the app.
I just found that and have not read much about it.
Does anyone have experience with this?