Yes, I am very optimistic about this. It could come close to a mid-range GPU, provided the code approaches the theoretical throughput of operations per clock, but that is hard to achieve with sequential programming.
The trick is to have tight loops that preload data into the L1 cache and do a significant number of multiplies and adds, none of that shuffling or special instructions. This is what the plugin solver is designed to do.
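A minimal sketch of what such a kernel looks like (the function name and shape are illustrative, not the actual plugin code): the loop body is nothing but a load, a multiply and an add per element, over a block small enough to stay in L1, which is exactly the pattern auto-vectorizers handle well.

```c
#include <stddef.h>

/* Illustrative inner kernel (hypothetical name): y[i] += a * x[i] over a
 * block that fits in L1.  One multiply + one add per element, no shuffles,
 * no special instructions -- the compiler can turn this into packed SSE/AVX. */
static void axpy_block(float *restrict y, const float *restrict x,
                       float a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = y[i] + a * x[i];
}
```

Compiled with something like `-O2 -march=native`, a loop of this shape should vectorize without hand-written intrinsics.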
One thing that baffled me is that I was using the fmuladd operation, which I thought was standard for all AVX-ready CPUs, but it turns out that some do not support it. It appears that on processors that do support it, the throughput is not just double, it is four times: an add done as fmuladd with a constant of 1.0 is twice as fast as a simple add, plus it does two operations.
But for the first version I will stick to simple add and simple mul operations.
I was searching to see whether it was true that some Core i7 chips do not support fused multiply-add, and it seems that Sandy Bridge does not.
Intel Core 2 and Nehalem:
4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
Intel Sandy Bridge/Ivy Bridge:
8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication
Intel Haswell/Broadwell/Skylake/Kaby Lake:
16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
AMD K10:
4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores):
8 DP FLOPs/cycle: 4-wide FMA
16 SP FLOPs/cycle: 8-wide FMA
AMD Ryzen:
8 DP FLOPs/cycle: 4-wide FMA
16 SP FLOPs/cycle: 8-wide FMA
Intel Atom (Bonnell/45nm, Saltwell/32nm, Silvermont/22nm):
1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
6 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication every other cycle
AMD Bobcat:
1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
4 SP FLOPs/cycle: 4-wide SSE addition every other cycle + 4-wide SSE multiplication every other cycle
AMD Jaguar:
3 DP FLOPs/cycle: 4-wide AVX addition every other cycle + 4-wide AVX multiplication in four cycles
8 SP FLOPs/cycle: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle
What this means is that, since this is an external plugin, we can ship different versions that check for hardware support.
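On GCC and Clang the check itself is a one-liner; a hedged dispatch sketch (the kernel names are placeholders), where `__builtin_cpu_supports` queries CPUID once and caches the result, so the plugin loader can choose a build at startup:

```c
/* Runtime dispatch sketch (GCC/Clang builtin; not portable to MSVC).
 * Returns which kernel flavor the loader should pick on this CPU. */
static const char *pick_kernel(void)
{
    if (__builtin_cpu_supports("fma"))
        return "fma";       /* Haswell+, Ryzen, Piledriver+ */
    return "muladd";        /* Sandy Bridge and older: separate mul + add */
}
```

On MSVC the equivalent check would go through `__cpuid`/`__cpuidex` instead.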
It seems that the trick to getting close to theoretical performance is to have enough instructions in the loop body, so that the compiler can schedule the muls and adds in parallel and keep the units busy.
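One common way to get enough independent work into the loop body is multiple accumulators; a sketch (illustrative, not the plugin's code): a single-accumulator sum is limited by the latency of one dependent add per iteration, while four independent accumulators let the scheduler keep several adds in flight.

```c
#include <stddef.h>

/* Summing with four independent accumulators breaks the add-latency
 * dependency chain: each s0..s3 chain is independent, so the FP units
 * can work on them in parallel instead of waiting on one running total. */
static float sum4(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)      /* leftover tail elements */
        s += x[i];
    return s;
}
```

Note this changes the order of the floating-point additions, so strict-IEEE compilers will not do this transformation on their own without `-ffast-math` or similar.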
This appears to be true: the profiler shows the same code is not just faster, it seems an order of magnitude faster so far. This could turn out very well; I am very optimistic about this now.