All right, there is a misunderstanding: what is committed is not an AVX build.
It is the solver that can solve a single island in parallel without having to break the mass matrix into separate islands of non-intersecting joints.
To avoid confusion, I committed the SDK with the DLL loader commented out for now.
The test is simply this:
-run the demo and you will see a single-island 40 x 40 pyramid
-set it to run on multiple cores, 2 or 4 or more depending on your system
-you should see the physics time scale somewhat linearly with the number of threads
-then turn the option "Solve Large island in parallel" off and repeat the test
You should see that the sequential solver is much slower than the parallel solver.
Then, to my surprise, I also saw that on at least two systems that I tested, the parallel solver seems to be faster than the sequential solver even when it has a substantial amount of overhead.
I thought that this was a sign of a bug, which is why I said I was not sure what was going on, so I reviewed the code and indeed there was a bug,
but after I fixed the bug, for some reason it got even better.
That is a really welcome, unexpected result, because the parallel solver has a much poorer convergence rate than the sequential one, but if it is in fact faster, this means we can add some extra iterations to improve convergence until the single-thread performance is at least equal in both solvers.
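A minimal sketch of that convergence trade-off. It assumes the parallel solver behaves like a Jacobi-style iteration (every row is updated from the previous iterate, so rows are independent) and the sequential one like Gauss-Seidel (each row sees the rows already updated in the same sweep); the post does not confirm this, so treat it as an illustration only. The tridiagonal test system and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical test system (not engine data): tridiagonal matrix with 4 on the
// diagonal and -1 next to it, right-hand side of all ones.
// Returns b_i minus the off-diagonal terms of row i.
static double RowRhs(const std::vector<double>& x, int i)
{
    const int n = int(x.size());
    double r = 1.0;
    if (i > 0) r += x[i - 1];
    if (i < n - 1) r += x[i + 1];
    return r;
}

// Jacobi-style sweep: each row reads only the previous iterate, so all rows
// could be updated by different threads at the same time.
int JacobiIterations(int n, double tol, int maxIters)
{
    assert(n > 0);
    std::vector<double> x(n, 0.0), next(n, 0.0);
    for (int iter = 1; iter <= maxIters; ++iter) {
        double change = 0.0;
        for (int i = 0; i < n; ++i) {
            next[i] = RowRhs(x, i) / 4.0;
            change = std::max(change, std::fabs(next[i] - x[i]));
        }
        x.swap(next);
        if (change < tol) return iter;
    }
    return maxIters;
}

// Gauss-Seidel sweep: each row uses the values already updated in this sweep,
// which makes the sweep sequential but reduces the error faster per iteration.
int GaussSeidelIterations(int n, double tol, int maxIters)
{
    assert(n > 0);
    std::vector<double> x(n, 0.0);
    for (int iter = 1; iter <= maxIters; ++iter) {
        double change = 0.0;
        for (int i = 0; i < n; ++i) {
            const double updated = RowRhs(x, i) / 4.0;
            change = std::max(change, std::fabs(updated - x[i]));
            x[i] = updated;
        }
        if (change < tol) return iter;
    }
    return maxIters;
}
```

Calling both with the same size and tolerance, for example `JacobiIterations(40, 1e-9, 1000)` and `GaussSeidelIterations(40, 1e-9, 1000)`, shows the Jacobi-style sweep needing more iterations on this system. If each of its iterations is cheaper in wall-clock time, there is room to spend some of the difference on extra iterations.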
I ran that test with the code that is committed now, and this is my result.
This is what I am getting on my system with 1 core (see the attached screenshot); could you guys check it again?
[Attachment: parallelSolver.png]
The other thing that I was saying is that with the parallel solver we can now implement the structure-of-arrays version, which will use each lane of a SIMD register as a joint. This way the solver will resolve multiple joints per call, as opposed to what it is doing now, which is one joint per call.
I am now working on the SSE version, which will do 8 joints per call. The reason for 8 is to give the compiler the chance to schedule the code with multiple floats per instruction.
Anyway, that's more tweaking stuff, but the point is that the parallel solver can do multiple joints per call even on a single core.
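A minimal sketch of the difference between one joint per call and 8 joints per call in a structure-of-arrays layout. The types, field names, and the clamped row update are hypothetical and much simpler than a real joint solver; they only show how lane i of every array can belong to joint i.

```cpp
#include <algorithm>

// Hypothetical layout, not the engine's actual data structures.
// 8 floats span two 4-wide SSE registers.
constexpr int kLanes = 8;

// Array-of-structures: one joint row per object, solved one joint per call.
struct JointRow {
    float rhs;      // right-hand side of the row
    float invDiag;  // inverse of the diagonal term
    float low;      // lower force limit
    float high;     // upper force limit
    float force;    // accumulated force
};

void SolveOne(JointRow& row)
{
    const float f = row.force + row.rhs * row.invDiag;
    row.force = std::min(std::max(f, row.low), row.high);
}

// Structure-of-arrays: lane i of every array belongs to joint i.
struct JointRowBatch {
    float rhs[kLanes];
    float invDiag[kLanes];
    float low[kLanes];
    float high[kLanes];
    float force[kLanes];
};

void SolveBatch(JointRowBatch& batch)
{
    // Fixed trip count and no dependency between lanes, so a compiler is free
    // to map this loop onto SIMD multiply, add, min and max instructions.
    for (int i = 0; i < kLanes; ++i) {
        const float f = batch.force[i] + batch.rhs[i] * batch.invDiag[i];
        batch.force[i] = std::min(std::max(f, batch.low[i]), batch.high[i]);
    }
}
```

One call to `SolveBatch` does the same arithmetic as eight calls to `SolveOne`. This only works when the joints in a batch do not depend on each other's result within the call.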