alright, I have now made the very first stable release, which will be the baseline for all future versions.
this way I can keep adding functionality, and we have that fallback.
It has double and single precision.
one curious thing that I noticed is that, for the first time in more than 30 years of programming for x86,
I see a significant difference between double and single precision.
on paper, single precision should be literally twice as fast as double, for the single reason that the packed single-precision instructions offer twice the throughput: each SIMD register holds twice as many floats. However, this is just a theoretical limit.
once you add memory bandwidth, branching, and atomic operations in the case of multicore code, these issues dominate the code and the performance.
It takes a real effort to get close to that theoretical limit.
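to make the on-paper claim concrete, here is a minimal sketch (my own, not from the engine; the function names are just for illustration) of the lane-count difference with AVX2: one 256-bit register holds eight floats but only four doubles, so each packed instruction does twice the work in single precision.
- Code:
#include <immintrin.h>

// with AVX2, one 256-bit register holds 8 floats but only 4 doubles,
// so each packed multiply does twice the work in single precision
void Scale8f(float* out, const float* in, float s)
{
	__m256 v = _mm256_loadu_ps(in);     // 8 x f32 per instruction
	_mm256_storeu_ps(out, _mm256_mul_ps(v, _mm256_set1_ps(s)));
}

void Scale4d(double* out, const double* in, double s)
{
	__m256d v = _mm256_loadu_pd(in);    // 4 x f64 per instruction
	_mm256_storeu_pd(out, _mm256_mul_pd(v, _mm256_set1_pd(s)));
}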
this is one of the reasons why libraries like Linpack or Mathematica, professional video games made by established developers, and libraries made by hardware makers like AMD, Intel, Nvidia, Apple, and others
use tricks that on the surface do not seem to make sense, but that translate to huge differences in performance.
take, for example, this code.
- Code:
template <class T> inline T dMax(T A, T B)
{
return (A > B) ? A : B;
}
you would expect that a compiler will use a conditional move, but you would be very wrong; for some reason many compilers will issue a compare and a jump, and the claim seems to be that a conditional move is slower than the compare, test, and jump.
In my experience this is simply not true, but there is nothing that can be done if we are going to use that function, be it the standard one or a user-defined one.
I have seen cases with compilers like Clang where, if you use std::max, it will issue the conditional move, but if you write your own version, it will not. That blew my mind for a long time and I thought it was a mistake I had made, but after testing and testing, no, somehow the compiler treats the standard library specially.
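for reference, this is the kind of call where I have seen Clang emit the conditional move (a hedged sketch with my own function name; check the assembly your own compiler and flags actually produce, since this varies):
- Code:
#include <algorithm>

// with Clang, std::max has been observed to compile to a conditional
// move (cmov), while a hand-written max like dMax above may get a
// compare and jump instead
inline float MaxViaStd(float A, float B)
{
	return std::max(A, B);
}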
anyway, imagine you have a function like that in a tight loop, where which of A and B is smaller is random. At best you get a 50% probability of a branch misprediction,
and on today's CPUs, a branch misprediction is one of the worst performance offenders.
so, to remove that compiler uncertainty, for the case of integer compares,
I am doing it like this
- Code:
ndInt32 dMax(ndInt32 A, ndInt32 B)
{
	// mask is all ones when A < B, all zeros when A >= B
	// (relies on an arithmetic right shift and on A - B not overflowing)
	ndInt32 mask = (A - B) >> 31;
	return (mask & B) | (~mask & A);
}
now that code uses instructions that can be issued in parallel on a superscalar core, and it is branchless 100% of the time.
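to show where the win comes from, here is a minimal sketch (my own, not from the engine) of the kind of tight loop I mean: a max reduction over an array, where every iteration costs the same no matter how the data is ordered.
- Code:
#include <cstdint>

typedef std::int32_t ndInt32;

// the branchless integer max from above
inline ndInt32 dMax(ndInt32 A, ndInt32 B)
{
	ndInt32 mask = (A - B) >> 31;
	return (mask & B) | (~mask & A);
}

// reduce an array to its maximum (assumes count >= 1); with the
// branchless dMax there is no branch to mispredict, whatever the data
ndInt32 ArrayMax(const ndInt32* data, ndInt32 count)
{
	ndInt32 best = data[0];
	for (ndInt32 i = 1; i < count; ++i)
	{
		best = dMax(best, data[i]);
	}
	return best;
}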
the result of these optimizations is that, indeed, the loop is not only significantly faster, but we can also see the real difference between f32 and f64.
now f32, when using avx2, is more than twice as fast as double, because the loops get closer to that theoretical limit; before, the difference was marginal.
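for the f32 path, the same reduction looks roughly like this with AVX2 (a sketch under my own naming, using the maxps family, which is already branchless at the instruction level; the f64 version would be identical but with __m256d and only 4 lanes):
- Code:
#include <immintrin.h>

// max reduction over an array of floats, 8 lanes at a time with AVX2
// (assumes count >= 8, count a multiple of 8, data 32-byte aligned)
float ArrayMaxF32(const float* data, int count)
{
	__m256 best = _mm256_load_ps(data);
	for (int i = 8; i < count; i += 8)
	{
		best = _mm256_max_ps(best, _mm256_load_ps(data + i));
	}
	// horizontal reduction of the 8 lanes down to one float
	__m128 lo = _mm256_castps256_ps128(best);
	__m128 hi = _mm256_extractf128_ps(best, 1);
	__m128 m = _mm_max_ps(lo, hi);
	m = _mm_max_ps(m, _mm_movehl_ps(m, m));
	m = _mm_max_ss(m, _mm_shuffle_ps(m, m, 1));
	return _mm_cvtss_f32(m);
}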
It took almost three months to go over the solver, making it lock-free, atomic-free, and branchless as much as possible, with many, many bugs along the way, but the results make a difference for the first time.