Yes, that's why I was so surprised when I saw an almost 50% speedup.
But I did not expect a regression on AMD.
Here is how I can explain it. Essentially both AMD and Intel are 64-bit processors with 128-bit internal buses.
But AMD decided to keep the internal buses at 128 bits, whereas Intel went ahead and made the internal buses 256 bits wide to support AVX. I believe this is the case for CPUs after Sandy Bridge.
AMD went for more cores and more integer performance, while Intel went for wider float and integer units. So what AMD did is support the AVX instruction set without changing the microarchitecture beyond widening the register set; the execution units stayed the same width.
Instead AMD put all the silicon on the APUs.
Here are two images that compare the Bulldozer and Ivy Bridge microarchitectures; as you can see, the main difference is the internal memory.

- amdvsintel.png
For us, what this means is that if we want to take full advantage of the architecture we need to make three plugins:
sse4.2, avx, avx2
The SSE4.2 one will take advantage of the mul-add instructions that are available on AMD; that will cut the number of instructions almost in half while keeping the float unit busy.
The AVX one will take advantage of the higher bandwidth, but cannot use mul-add, because most older AMD chips do not support AVX2.
The AVX2 one will take advantage of both the bandwidth and the mul-add instructions that are available on the Intel Core i7.
Therefore the workhorse for the Core i7 and Bulldozer should be the SSE4 solver.
The AVX one is for newer generations like AMD Zen and the Core i9.
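A rough sketch of how the choice between the three plugins could be made at startup. Everything here is hypothetical: the `CpuFeatures` struct, the `SolverPlugin` enum and `selectPlugin` are made-up names, and the `wide256` flag (whether the float units are really 256 bits wide) is an assumption the caller has to supply, for example from a table of known CPU families, because the instruction-set flags alone do not tell you that. It is one possible policy, not the only one.

```cpp
#include <cassert>

// Hypothetical description of the host CPU. In a real build the first four
// flags would come from CPUID; wide256 is an assumption supplied by the caller.
struct CpuFeatures {
    bool sse42;    // SSE4.2 instruction set reported
    bool fma;      // fused mul-add reported
    bool avx;      // AVX instruction set reported
    bool avx2;     // AVX2 instruction set reported
    bool wide256;  // float execution units are actually 256 bits wide
};

enum class SolverPlugin { Scalar, Sse42, Avx, Avx2 };

// Pick the widest plugin that the hardware can actually execute at full width.
SolverPlugin selectPlugin(const CpuFeatures& cpu) {
    // Sanity check: a CPU that reports AVX2 also reports AVX.
    assert(!cpu.avx2 || cpu.avx);

    if (cpu.avx2 && cpu.fma && cpu.wide256) {
        return SolverPlugin::Avx2;   // bandwidth and mul-add
    }
    if (cpu.avx && cpu.wide256) {
        return SolverPlugin::Avx;    // bandwidth only
    }
    if (cpu.sse42) {
        return SolverPlugin::Sse42;  // the 128-bit workhorse
    }
    return SolverPlugin::Scalar;     // plain fallback
}
```

With this policy a CPU that reports AVX but keeps 128-bit units (the AMD case described above) ends up on the SSE4.2 plugin instead of the AVX one.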
Here is what I do not understand: why does AMD keep making these mistakes? For the Ryzen architecture, instead of making the internal bus 256 bits wide, they kept it at 128 bits and added two more 128-bit float units.
Intel went a different route: they made a 256-bit bus and two full 256-bit float units that are fully orthogonal. AMD is always one step behind Intel in its vision of the future. They bet that Ryzen will beat Intel because there are more apps using SSE than AVX, since Ryzen can issue four SSE instructions on the fly while Intel can only issue two using the AVX float units. But what they don't count on is that in Intel the SSE unit and the AVX float unit are different.
This is the reason Intel consistently beats AMD in core-for-core benchmarks, while we may see some Ryzen benchmarks do better than Intel in multithreaded apps just because they have more cores.
In my opinion AMD made a huge mistake keeping the 128-bit bus and float units, but I am always wrong on predictions.
For us, it reduces to this: we should make an SSE4.2 plugin so that we can use fmadd instructions.
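To illustrate why mul-add cuts the float instruction count roughly in half, here is a minimal sketch of the same update written both ways. It uses the portable `std::fma` as a stand-in for the SIMD instruction; in an actual 128-bit plugin the equivalent would presumably be the `_mm_fmadd_ps` intrinsic from `<immintrin.h>`, which needs FMA enabled for the compiler target. The function names are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// y[i] = a * x[i] + y[i] written as a separate multiply and add:
// two float operations per element.
void axpyMulAdd(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float product = a * x[i];
        y[i] = product + y[i];
    }
}

// The same update written as a single fused multiply-add per element.
void axpyFused(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::fma(a, x[i], y[i]);
    }
}
```

The fused version rounds once instead of twice, so on inputs that are not exactly representable the two functions can differ in the last bit.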