A remarkable difference for a small change.

A place to discuss everything related to Newton Dynamics.

Moderators: Sascha Willems, walaber

A remarkable difference for a small change.

Postby Julio Jerez » Thu Jun 20, 2019 3:02 pm

for some tiem I have try to figure out hwo the inner loop of the solver can be write with out register spills, the follow two lines can the loop to degenerates into a large set of instruction
Code: Select all
a = a.AndNot((f > upperFrictionForce) | (f < lowerFrictionForce));
f = f.GetMax(lowerFrictionForce).GetMin(upperFrictionForce);

the sse code does no have a single slect instruction so select has to be done manually.
the I read in a web side that the problem is the instruction andnot whi does no commute and cuase a register allocator to clap registers. if the selct was implenetd usingh xor the whne used with more instruction the the regisl can resuse resister, so I adde funtion
Code: Select all
   DG_INLINE dgVector Select(const dgVector& data, const dgVector& mask) const
   {
      // (((b ^ a) & mask)^a)
      //return  _mm_or_ps (_mm_and_ps (mask.m_type, data.m_type), _mm_andnot_ps(mask.m_type, m_type));
      return  _mm_xor_ps(m_type, _mm_and_ps (mask.m_type, _mm_xor_ps(m_type, data.m_type)));
   }


and what do you know, its true, the code is uses with fewe intrunton and with only 6 registers.
while the original uses 8 registers and still spill to memory.

here is the sequences of instructions.
Old slect instruction. 8 registers spill from memory
Code: Select all
a = a.Select(zero, f > upperFrictionForce).Select(zero, f < lowerFrictionForce);
f = f.GetMax(lowerFrictionForce).GetMin(upperFrictionForce);

00B7F82A  cmpltps     xmm1,xmm7 
00B7F82E  movaps      xmm2,xmm7 
00B7F831  cmpltps     xmm2,xmm5 
00B7F835  maxps       xmm7,xmm5 
00B7F838  movaps      xmm0,xmm1 
00B7F83B  movaps      xmm3,xmm2 
00B7F83E  andps       xmm1,xmmword ptr [zero] 
00B7F845  andnps      xmm0,xmm6 
00B7F848  andps       xmm2,xmmword ptr [zero] 
00B7F84F  orps        xmm0,xmm1 
00B7F852  minps       xmm7,xmm4 
00B7F855  andnps      xmm3,xmm0 


new select instruction: 7 registers no spill
Code: Select all
a = a.Select(zero, f > upperFrictionForce).Select(zero, f < lowerFrictionForce);
f = f.GetMax(lowerFrictionForce).GetMin(upperFrictionForce);

001AF82C  cmpltps     xmm1,xmm5 
001AF830  andps       xmm1,xmm0 
001AF833  xorps       xmm1,xmm2 
001AF836  movaps      xmm2,xmm5 
001AF839  cmpltps     xmm2,xmm4 
001AF83D  movaps      xmm0,xmm1 
001AF840  maxps       xmm5,xmm4 
001AF843  xorps       xmm0,xmm7 
001AF846  andps       xmm2,xmm0 
001AF849  xorps       xmm2,xmm1 
001AF84F  minps       xmm5,xmm3 
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: A remarkable difference for a small change.

Postby MeltingPlastic » Thu Jun 20, 2019 3:11 pm

How much slower can a register spill be? I've read about these kind of optimizations but haven't tried them myself yet.
MeltingPlastic
 
Posts: 237
Joined: Fri Feb 07, 2014 11:30 pm

Re: A remarkable difference for a small change.

Postby Julio Jerez » Thu Jun 20, 2019 3:20 pm

if you have a complex tight loop it can be dreadful.
however on intel cpus is not as bad as the literature say because the intel cpu with its complex addressing modes that can do operations with target address from memory is not as bad as any risc processor.
on the other hand risc processors have more registers, so spilling to memory is no that common.

the thing is that the solver loop even in x64 that has 16 registers still spills to memory. and other that writing the loop in assembly is has been real hard to come up with a vector class that yield decent results. the solver loop is quite different in x64 than it is in x86, but the x64 still spills.
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: A remarkable difference for a small change.

Postby Julio Jerez » Thu Jun 20, 2019 6:41 pm

um even better this reduces to even less instruction and uses 6 registers, this si the best I have so far
and more important does not changed the code.

Code: Select all
         //a = a.AndNot((f > upperFrictionForce) | (f < lowerFrictionForce));
         dgVector mask((f > upperFrictionForce) | (f < lowerFrictionForce));
         a = a.AndNot(mask);
         f = f.GetMax(lowerFrictionForce).GetMin(upperFrictionForce);
003AF7EA  cmpltps     xmm2,xmm5 
003AF7EE  movaps      xmm0,xmm5 
003AF7F1  cmpltps     xmm0,xmm3 
003AF7F5  maxps       xmm5,xmm3 
003AF7F8  orps        xmm2,xmm0 
003AF7FB  andnps      xmm2,xmm4 
003AF801  minps       xmm5,xmm1 
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: A remarkable difference for a small change.

Postby JernejL » Fri Jun 21, 2019 1:33 am

This is awesome, and a proof that manual optimization can always do better than a general all purpose compiler :)
Help improving the Newton Game Dynamics WIKI
User avatar
JernejL
 
Posts: 1578
Joined: Mon Dec 06, 2004 2:00 pm
Location: Slovenia

Re: A remarkable difference for a small change.

Postby Julio Jerez » Fri Jun 21, 2019 7:17 am

it seems to be the case, but the compiler seem to do very poorly, you can clear see that the code could be improved liek this.

Code: Select all
         //a = a.AndNot((f > upperFrictionForce) | (f < lowerFrictionForce));
         dgVector mask((f > upperFrictionForce) | (f < lowerFrictionForce));
         a = a.AndNot(mask);
         f = f.GetMax(lowerFrictionForce).GetMin(upperFrictionForce);
003AF7EA  cmpltps     xmm2,xmm5 
003AF7F1  cmpgtps     xmm0,xmm3 
003AF7F8  orps          xmm2,xmm0 
003AF7FB  andnps      xmm2,xmm4 
003AF7F5  maxps       xmm5,xmm3 
003AF801  minps       xmm5,xmm1

there is not need to the move instruction, in fact that the first thing a good register allocator will do, eliminate all unnecessary moves instruction, but I have never seem a MS compiler doing that in more than 20 years.

if you use avx2 that allowes for threa address instructions is does, but not with avx, SSE or any compile flags.
Code: Select all
0FC13676  vcmpgt_oqps ymm1,ymm4,ymm2 
0FC1367B  vcmplt_oqps ymm0,ymm4,ymm3 
0FC13680  vorps       ymm0,ymm1,ymm0 
0FC13684  vandnps     ymm1,ymm0,ymm5 
0FC13690  vmaxps      ymm0,ymm4,ymm3 
0FC13694  vminps      ymm2,ymm0,ymm2 


however even in avx2 the compiler generate poor code. that any assembly programmer can spot but sloppy code. In fact I have write compiler that does far better.

for example the avx code could be done with 5 register and maybe even 4, by MS use 6.
and that cause oeth parts to spill rather badly.
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: A remarkable difference for a small change.

Postby Julio Jerez » Fri Jun 21, 2019 7:43 am

oh I just realized that teh problem is the instruction andnot if I change the boolean expresion
Code: Select all
   a = a.AndNot((f > upperFrictionForce) | (f < lowerFrictionForce));

to
Code: Select all
   a = a & ((f <= upperFrictionForce) & (f >= lowerFrictionForce));


them it generates this code sequence
Code: Select all
00EE3F26  movaps      xmm2,xmm5 
00EE3F29  cmpleps     xmm2,xmm1 
00EE3F2D  cmpleps     xmm0,xmm5 
00EE3F31  maxps       xmm5,xmm3 
00EE3F34  andps       xmm2,xmm0 
00EE3F37  andps       xmm2,xmm4 
00EE3F3A  minps       xmm5,xmm1 


but even that can be chnge so that the first movaps xmm2,xmm5 is eliminated only if the compiler use the cmpgeps instead of cmpleps, plus there are two and instruction that can't be issue to different logic units since they both have the same register dependencies.
Ms bank too much on the claims of out of order execution to issue seriously midiocre code for years.

I think what happen is that is appear all visual studio compilers do not generate code sequence thet use the form
if (a > b) exp

and translate to teh form
if (!(a <= b)) exp

this is why is always has a move to swap the registers order.
I will late remove the instance of anpnot, becaus ethe function is not commutative and it seem to make makes the compiler use one extra register.
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: A remarkable difference for a small change.

Postby Julio Jerez » Fri Jun 21, 2019 7:54 am

With the last optimization now that I am not using andnot instruction, I can remove the parentices from exp:
a = a & ((f <= upperFrictionForce) & (f >= lowerFrictionForce));
and the is the final result is
Code: Select all
a = a & (f <= upperFrictionForce) & (f >= lowerFrictionForce);
00F83F26  movaps      xmm2,xmm5 
00F83F29  cmpleps     xmm2,xmm1 
00F83F2D  cmpleps     xmm0,xmm5 
00F83F31  maxps       xmm5,xmm3 
00F83F34  andps       xmm2,xmm4 
00F83F37  minps       xmm5,xmm1 
00F83F3A  andps       xmm2,xmm0 


still one unnecessary move, its quite clear that xmm2 in the first instrcuction is a dead register, why the compiler dies eliminated it is baflling.

but as you can see the instruction schedule can interleave the two and instructions with a conflict destination. so each one can be issue in paraller.
I guess this is so far the best that can be expected from that, but still bother me that the compiler does not kill dead registers, that code can be issue with fewer registers.

This weekend I will go and replace instruction andnot with the select and explicit logic expression that can commute.
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: A remarkable difference for a small change.

Postby Julio Jerez » Fri Jun 21, 2019 2:59 pm

here is one of the danger of writing code using instincts or assembly
if you have an expression.
a = a0 * x0 + a1 * x1 + a2 * x2 + a3 * x3 + a4 * x4 + .....

a compile will brake that into a series of mulladd, like

a = a0 * x0;
a = a + a1 * x1;
a = a + a2 * x2;
....

but the compile know well that a cpu has more than one float unit and that adobe code and a write after write dependence so only one float unit will be schedule
the compiler will bra the code like this

a0 = a0 * x0;
a1 = a1 * x1;
a0 = a0 + a2 * x2;
a1 = a1 + a3 * x3;
....
a = a0 + a1;

but if the float muladd is intrinsics it is forbidden to do so because of the float order.

this is the case wit the solve, if has this sequence of add mult
Code: Select all
         a = a.MulSub(row->m_JMinv.m_jacobianM0.m_linear.m_y, forceM0.m_linear.m_y);
         a = a.MulSub(row->m_JMinv.m_jacobianM0.m_linear.m_z, forceM0.m_linear.m_z);
         a = a.MulSub(row->m_JMinv.m_jacobianM0.m_angular.m_x, forceM0.m_angular.m_x);
         a = a.MulSub(row->m_JMinv.m_jacobianM0.m_angular.m_y, forceM0.m_angular.m_y);
         a = a.MulSub(row->m_JMinv.m_jacobianM0.m_angular.m_z, forceM0.m_angular.m_z);

         a = a.MulSub(row->m_JMinv.m_jacobianM1.m_linear.m_x, forceM1.m_linear.m_x);
         a = a.MulSub(row->m_JMinv.m_jacobianM1.m_linear.m_y, forceM1.m_linear.m_y);
         a = a.MulSub(row->m_JMinv.m_jacobianM1.m_linear.m_z, forceM1.m_linear.m_z);
         a = a.MulSub(row->m_JMinv.m_jacobianM1.m_angular.m_x, forceM1.m_angular.m_x);
         a = a.MulSub(row->m_JMinv.m_jacobianM1.m_angular.m_y, forceM1.m_angular.m_y);
         a = a.MulSub(row->m_JMinv.m_jacobianM1.m_angular.m_z, forceM1.m_angular.m_z);
         a = a.MulSub(row->m_force, row->m_diagDamp);


which translate to the horrendous code sequence

Code: Select all
00007FFE4B4A25E0  vmovups     ymm0,ymmword ptr [rax] 
00007FFE4B4A25E4  vfnmadd213ps ymm0,ymm14,ymmword ptr [rax+1E0h] 
00007FFE4B4A25ED  vfnmadd231ps ymm0,ymm15,ymmword ptr [rax+20h] 
00007FFE4B4A25F3  vfnmadd231ps ymm0,ymm11,ymmword ptr [rax+40h] 
00007FFE4B4A25F9  vfnmadd231ps ymm0,ymm10,ymmword ptr [rax+60h] 
00007FFE4B4A25FF  vfnmadd231ps ymm0,ymm5,ymmword ptr [rax+80h] 
00007FFE4B4A2608  vfnmadd231ps ymm0,ymm6,ymmword ptr [rax+0A0h] 
00007FFE4B4A2611  lea         rax,[rax+3E0h] 
00007FFE4B4A2618  vmovups     ymm6,ymm0 
00007FFE4B4A261C  vfnmadd231ps ymm6,ymm7,ymmword ptr [rax-320h] 
00007FFE4B4A2625  vfnmadd231ps ymm6,ymm2,ymmword ptr [rax-300h] 
00007FFE4B4A262E  vmovups     ymm0,ymmword ptr [forceM1] 
00007FFE4B4A2633  vfnmadd231ps ymm6,ymm0,ymmword ptr [rax-2E0h] 
00007FFE4B4A263C  vmovups     ymm0,ymmword ptr [forceM1] 
00007FFE4B4A2641  vmovups     ymm2,ymmword ptr [forceM1] 
00007FFE4B4A2646  vfnmadd231ps ymm6,ymm2,ymmword ptr [rax-2C0h] 
00007FFE4B4A264F  vfnmadd231ps ymm6,ymm0,ymmword ptr [rax-2A0h] 
00007FFE4B4A2658  vfnmadd231ps ymm6,ymm4,ymmword ptr [rax-280h] 
00007FFE4B4A2661  vfnmadd231ps ymm6,ymm8,ymmword ptr [rax-240h] 


as you can see the compiler try to insert some move and integer instruction to increase the instruction per clock count, by since there aren't too many, the code has extreme register dependencies and practically runs in one float unit when each cores has two.
no out of order execution can improve that.
basically the target registers are xmm0 and xmm6 but the are used sequentially since a compiler is prohibit to reorganize float expressions. I will try to re arrange that
Julio Jerez
Moderator
Moderator
 
Posts: 12249
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles


Return to General Discussion

Who is online

Users browsing this forum: No registered users and 18 guests