Trying to find bottleneck in NewtonUpdate

Postby andylhansen » Mon Jun 06, 2011 10:38 pm

I've implemented Newton Physics into my game engine, and I'm trying to figure out how to time the simulation. I'm running into a strange bottleneck, and I'm not sure where it's coming from.

Right now I have a stack of boxes that I can knock over by tossing more boxes at it.

My laptop has a 1.6GHz dual-core processor, and I get 180 FPS in game, with the simulation updating at 60 FPS. I also have a netbook with a 1.6GHz dual-core processor and the same amount of RAM as the laptop. On the netbook, the application also runs at around 180 FPS as long as the stack of boxes is idle. However, when I toss a box at the stack, the netbook slows to around 10 FPS during the collision, whereas the laptop only slows down a little.

Would this be caused simply by the smaller CPU cache in my netbook? I don't know what else might be slowing it down.

This is the way I'm doing the update:
Code:
void CPhysicsInterface::setCalculationsPerSecond(float calcPerSec){
    updateFPS=calcPerSec;
    sleepTime=1000.0f/updateFPS;
    updateTime=sleepTime/1000.0f;
}

void CPhysicsInterface::Update(float deltaTime){ //delta time is in seconds
    accumlativeTimeSlice+=deltaTime*1000.0f;
    while(accumlativeTimeSlice>sleepTime){
        NewtonUpdate(pWorld,updateTime);
        accumlativeTimeSlice-=sleepTime;
    }
}


My application calls setCalculationsPerSecond with a value of 60 when it first runs. It calls Update every frame with the elapsed time since the last frame. Is there something I'm doing wrong here that might be causing the bottleneck on my netbook? Is there a better way I can do it?

Also, when I call setCalculationsPerSecond() with a smaller value, the simulation will run in slow motion. With the way the code is written, shouldn't NewtonUpdate() be passed a larger value, which will compensate?
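
One thing worth ruling out with an accumulator loop like the one above is the catch-up effect: if a single physics step takes longer than the step length, the accumulator keeps growing and each frame runs more steps than the last. Below is a minimal sketch of the same accumulator with a cap on steps per frame. `FixedStepper`, `maxStepsPerFrame` and the `step` callback are hypothetical names for illustration; the callback stands in for the `NewtonUpdate(pWorld, updateTime)` call.

```cpp
#include <functional>

// Hypothetical helper: fixed-timestep accumulator with a cap on catch-up steps.
struct FixedStepper {
    float stepSeconds;       // length of one simulation step, e.g. 1/60 s
    int maxStepsPerFrame;    // assumption: cap so one slow frame cannot snowball
    float accumulated = 0.0f;

    FixedStepper(float updatesPerSecond, int maxSteps)
        : stepSeconds(1.0f / updatesPerSecond), maxStepsPerFrame(maxSteps) {}

    // Runs step(stepSeconds) zero or more times and returns how many steps ran.
    int update(float deltaSeconds, const std::function<void(float)>& step) {
        accumulated += deltaSeconds;
        int steps = 0;
        while (accumulated >= stepSeconds && steps < maxStepsPerFrame) {
            step(stepSeconds);
            accumulated -= stepSeconds;
            ++steps;
        }
        // Drop any backlog beyond one step instead of carrying it forward.
        if (accumulated > stepSeconds) {
            accumulated = stepSeconds;
        }
        return steps;
    }
};
```

With a cap in place the simulation falls behind real time on a slow machine instead of stalling the whole frame, which at least makes it easier to tell whether the slowdown comes from one expensive step or from many steps per frame.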
andylhansen
 
Posts: 21
Joined: Sat Jun 04, 2011 6:32 am

Re: Trying to find bottleneck in NewtonUpdate

Postby Julio Jerez » Mon Jun 06, 2011 10:57 pm

You say both CPUs are 1.6 GHz.
Are they the same type of CPU?

Also, how many objects are we talking about?
Julio Jerez
Moderator
 
Posts: 12452
Joined: Sun Sep 14, 2003 2:18 pm
Location: Los Angeles

Re: Trying to find bottleneck in NewtonUpdate

Postby andylhansen » Mon Jun 06, 2011 11:17 pm

My laptop has an Intel Core 2 CPU, 1.6GHz dual core.
My netbook has an AMD E-350 processor, 1.6GHz dual core.

It's about 30 boxes stacked in a triangle shape.

Thanks for the quick reply!

Re: Trying to find bottleneck in NewtonUpdate

Postby Julio Jerez » Tue Jun 07, 2011 8:22 am

My understanding is that AMD CPUs are not as good as the latest Intel dual cores, but that difference is too much.

On occasion I have seen nonsense like that, where one build is like 10 times faster than the PC one, when building a Linux or a Mac 64-bit build, and I still cannot explain why.
Anyway, 30 boxes is not enough to slow down the engine that much. The first thing is to make sure you are not using solver mode 0, which is the default.
My impression is that one CPU is using solver mode 0, and the other defaults to solver mode 1 (iterative solver).
The second thing is to see whether the AMD supports SIMD, and make sure the test recognizes it.

Also, to make everything equal, at least for a test, try not using fixed time slicing; instead update once per frame at 60 or 100 fps.
That way both systems will run the same workload. Just for the test.
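
To compare the two machines under the same workload, it can help to measure the wall-clock cost of each physics step directly rather than reading it off the frame rate. A minimal sketch using `std::chrono` follows; `timeCallMs` is a hypothetical helper name, and the callable passed in stands for whatever wraps the `NewtonUpdate` call.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs fn once and returns the elapsed wall-clock time in milliseconds.
template <typename Fn>
double timeCallMs(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For example, `double ms = timeCallMs([&] { NewtonUpdate(pWorld, 1.0f / 60.0f); });` logged once per frame on both machines would show whether a single step during the collision really costs on the order of 100 ms on the netbook.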

Re: Trying to find bottleneck in NewtonUpdate

Postby andylhansen » Tue Jun 07, 2011 11:51 am

I am using solver mode 1. I commented out the time-slicing loop and just call NewtonUpdate once every frame. The slowdown still occurred. I looked up my AMD CPU. It has the MMX(+), SSE(1,2,3,3S,4A), x86_64, and AMD-V instruction sets, so I'm pretty sure it has SIMD support. Here is a page showing all the specs for it: http://www.guru3d.com/article/amd-brazo ... u-review/4

Re: Trying to find bottleneck in NewtonUpdate

Postby Julio Jerez » Tue Jun 07, 2011 3:39 pm

I do not know what it could be. I have an older Athlon laptop that does not even have SIMD, and even there it is not as slow as 10 fps with 30 boxes.
The engine does not make any special consideration for CPU or cache sizes. It is just plain C++ code.

Re: Trying to find bottleneck in NewtonUpdate

Postby andylhansen » Tue Jun 07, 2011 4:00 pm

I did some poking around in the code. It looks like the world update code chooses whether or not to update with SIMD. Maybe my CPU supports SIMD, but it is really slow or something. What would be the best way to build the source with SIMD disabled? Do I just find where the preprocessor flag is defined and comment it out?

Re: Trying to find bottleneck in NewtonUpdate

Postby andylhansen » Tue Jun 07, 2011 4:19 pm

I discovered that if I open the project on my netbook, the preprocessor highlighting automatically shows SIMD support as being disabled. So I think if I just build Newton on my netbook then that may fix the problem. Do I build the project found in the "Packages" folder? Or the one in the "corelibrary_200" folder? I tried to build the one in core packages, and I got
\newton-dynamics-2.33\corelibrary_200\source\core\dgTypes.h(631): error C2084: function 'dgInt32 dgFastInt(dgFloat64)' already has a body
newton-dynamics-2.33\corelibrary_200\source\core\dgTypes.h(621) : see previous definition of 'dgFastInt'

It looks like the typedefs that are getting passed into the functions are both "double" so it won't compile.

Edit:
It looks like it should already be disabled. I'm calling NewtonSetPlatformArchitecture (pWorld, 0); Should that not disable SIMD?

Edit2:
I tried compiling my project with NewtonSetPlatformArchitecture(pWorld,2) and it actually helped with the slowdown. Instead of reducing the application to 10FPS, it only lowered it to 60-80 FPS with 30 boxes. Still seems pretty slow, but it's not unusable. When the bodies are at rest the application runs at about 180FPS.

Re: Trying to find bottleneck in NewtonUpdate

Postby Julio Jerez » Tue Jun 07, 2011 5:24 pm

I think it is NewtonSetPlatformArchitecture(pWorld, 3).
I think it makes no difference, but that is what I always use.

The slowdown is still a lot. There is a small profiler I use named Very Sleepy that is handy for profiling the code.
It is free and non-intrusive. Maybe that can help to find out where the problem is.

Can you sync to SVN and get the latest?

Re: Trying to find bottleneck in NewtonUpdate

Postby andylhansen » Tue Jun 07, 2011 9:39 pm

I ran the profiler on my application. This function seems to be where most of the overhead is. I copied the profiler output for that function here; the time at the start of a line is what the profiler attributed to that line. I don't know if this gives any clues. I will try downloading the latest version from SVN.

Code:
void dgJacobianMemory::CalculateForcesGameModeSimd (dgInt32 iterations, dgFloat32 maxAccNorm) const
   {
   #ifdef DG_BUILD_SIMD_CODE
      dgFloat32* const force = m_force;
      const dgJacobianPair* const Jt = m_Jt;
      const dgJacobianPair* const JMinv = m_JMinv;
      const dgFloat32* const diagDamp = m_diagDamp;
      const dgFloat32* const invDJMinvJt = m_invDJMinvJt;
      const dgBodyInfo* bodyArray = m_bodyArray;
      dgFloat32* const penetration = m_penetration;
      const dgFloat32* const externAccel = m_deltaAccel;
      const dgFloat32* const restitution = m_restitution;
      dgFloat32* const coordenateAccel = m_coordenateAccel;
      dgJacobian* const internalVeloc = m_internalVeloc;
      dgJacobian* const internalForces = m_internalForces;
      dgFloat32** const jointForceFeeback = m_jointFeebackForce;
      const dgInt32* const normalForceIndex = m_normalForceIndex;
      const dgInt32* const accelIsMortor = m_accelIsMotor;
      const dgJointInfo* const constraintArray = m_constraintArray;
      const dgFloat32* const penetrationStiffness = m_penetrationStiffness;
      const dgFloat32* const lowerFrictionCoef = m_lowerBoundFrictionCoefficent;
      const dgFloat32* const upperFrictionCoef = m_upperBoundFrictionCoefficent;
   
      dgFloat32 invStep = (dgFloat32 (1.0f) / dgFloat32 (LINEAR_SOLVER_SUB_STEPS));
      dgFloat32 timeStep =  m_timeStep * invStep;
      dgFloat32 invTimeStep = m_invTimeStep * dgFloat32 (LINEAR_SOLVER_SUB_STEPS);
   
      dgFloatSign tmpIndex;
      tmpIndex.m_integer.m_iVal = 0x7fffffff;
      simd_type signMask = simd_set1(tmpIndex.m_fVal);
   
      simd_type zero = simd_set1 (dgFloat32 (0.0f));
      for (dgInt32 i = 1; i < m_bodyCount; i ++) {
         dgBody* const body = m_bodyArray[i].m_body;
         (simd_type&) internalVeloc[i].m_linear = zero;
         (simd_type&) internalVeloc[i].m_angular = zero;
         (simd_type&) internalForces[i].m_linear = zero;
         (simd_type&) internalForces[i].m_angular = zero;
         (simd_type&) body->m_netForce = (simd_type&) body->m_veloc;
0.00s         (simd_type&) body->m_netTorque = (simd_type&) body->m_omega;
      }
      (simd_type&) internalVeloc[0].m_linear = zero;
      (simd_type&) internalVeloc[0].m_angular = zero;
      (simd_type&) internalForces[0].m_linear = zero;
      (simd_type&) internalForces[0].m_angular = zero;
   
   
      for (dgInt32 i = 0; i < m_jointCount; i ++) {
         dgInt32 first = constraintArray[i].m_autoPairstart;
         dgInt32 count = constraintArray[i].m_autoPairActiveCount;
         dgInt32 m0 = constraintArray[i].m_m0;
         dgInt32 m1 = constraintArray[i].m_m1;
         //dgJacobian y0 (internalForces[k0]);
         //dgJacobian y1 (internalForces[k1]);
         simd_type y0_linear = zero;
         simd_type y0_angular = zero;
         simd_type y1_linear = zero;
         simd_type y1_angular = zero;
0.00s         for (dgInt32 j = 0; j < count; j ++) {
            dgInt32 index = j + first;
            //val = force[index];
0.00s            simd_type tmp0 = simd_set1(force[index]);
            //y0.m_linear += Jt[index].m_jacobian_IM0.m_linear.Scale (val);
            //y0.m_angular += Jt[index].m_jacobian_IM0.m_angular.Scale (val);
            //y1.m_linear += Jt[index].m_jacobian_IM1.m_linear.Scale (val);
            //y1.m_angular += Jt[index].m_jacobian_IM1.m_angular.Scale (val);
0.01s            y0_linear  = simd_mul_add_v (y0_linear, (simd_type&) Jt[index].m_jacobian_IM0.m_linear, tmp0);
0.00s            y0_angular = simd_mul_add_v (y0_angular,(simd_type&) Jt[index].m_jacobian_IM0.m_angular, tmp0);
0.01s            y1_linear  = simd_mul_add_v (y1_linear, (simd_type&) Jt[index].m_jacobian_IM1.m_linear, tmp0);
0.01s            y1_angular = simd_mul_add_v (y1_angular,(simd_type&) Jt[index].m_jacobian_IM1.m_angular, tmp0);
         }
         //internalForces[k0] = y0;
         //internalForces[k1] = y1;
         (simd_type&) internalForces[m0].m_linear = simd_add_v ((simd_type&) internalForces[m0].m_linear, y0_linear);
         (simd_type&) internalForces[m0].m_angular = simd_add_v ((simd_type&) internalForces[m0].m_angular, y0_angular);
0.00s         (simd_type&) internalForces[m1].m_linear = simd_add_v ((simd_type&) internalForces[m1].m_linear, y1_linear);
         (simd_type&) internalForces[m1].m_angular = simd_add_v ((simd_type&)internalForces[m1].m_angular, y1_angular);
      }
   
      simd_type timeStepSimd = simd_set1 (timeStep);
      dgFloat32 firstPassCoef = dgFloat32 (0.0f);
      dgInt32 maxPasses = iterations + DG_BASE_ITERATION_COUNT;
      for (dgInt32 step = 0; step < LINEAR_SOLVER_SUB_STEPS; step ++) {
         for (dgInt32 curJoint = 0; curJoint < m_jointCount; curJoint ++) {
            dgJointAccelerationDecriptor joindDesc;
   
            dgInt32 index = constraintArray[curJoint].m_autoPairstart;
            joindDesc.m_rowsCount = constraintArray[curJoint].m_autoPaircount;
   
            joindDesc.m_timeStep = timeStep;
            joindDesc.m_invTimeStep = invTimeStep;
            joindDesc.m_firstPassCoefFlag = firstPassCoef;
   
            joindDesc.m_Jt = &Jt[index];
   
            joindDesc.m_penetration = &penetration[index];
            joindDesc.m_restitution = &restitution[index];
            joindDesc.m_accelIsMotor = &accelIsMortor[index];
            joindDesc.m_externAccelaration = &externAccel[index];
            joindDesc.m_coordenateAccel = &coordenateAccel[index];
            joindDesc.m_normalForceIndex = &normalForceIndex[index];
0.00s            joindDesc.m_penetrationStiffness = &penetrationStiffness[index];
0.00s            constraintArray[curJoint].m_joint->JointAccelerationsSimd (joindDesc);
         }
         firstPassCoef = dgFloat32 (1.0f);
   
         dgFloat32 accNorm;
         accNorm = maxAccNorm * dgFloat32 (2.0f);
         for (dgInt32 passes = 0; (passes < maxPasses) && (accNorm > maxAccNorm); passes ++) {
            simd_type accNormSimd = zero;
0.02s            for (dgInt32 curJoint = 0; curJoint < m_jointCount; curJoint ++) {
0.01s               dgInt32 index = constraintArray[curJoint].m_autoPairstart;
0.00s               dgInt32 rowsCount = constraintArray[curJoint].m_autoPaircount;
               dgInt32 m0 = constraintArray[curJoint].m_m0;
0.00s               dgInt32 m1 = constraintArray[curJoint].m_m1;
   
0.02s               simd_type linearM0  = (simd_type&)internalForces[m0].m_linear;
               simd_type angularM0 = (simd_type&)internalForces[m0].m_angular;
0.01s               simd_type linearM1  = (simd_type&)internalForces[m1].m_linear;
               simd_type angularM1 = (simd_type&)internalForces[m1].m_angular;
0.02s               for (dgInt32 k = 0; k < rowsCount; k ++) {
      //            dgVector acc (m_JMinv[index].m_jacobian_IM0.m_linear.CompProduct(linearM0));
      //            acc += m_JMinv[index].m_jacobian_IM0.m_angular.CompProduct (angularM0);
      //            acc += m_JMinv[index].m_jacobian_IM1.m_linear.CompProduct (linearM1);
      //            acc += m_JMinv[index].m_jacobian_IM1.m_angular.CompProduct (angularM1);
0.09s                  simd_type a = simd_mul_v (       (simd_type&)JMinv[index].m_jacobian_IM0.m_linear, linearM0);
0.15s                  a = simd_mul_add_v (a, (simd_type&)JMinv[index].m_jacobian_IM0.m_angular, angularM0);
0.12s                  a = simd_mul_add_v (a, (simd_type&)JMinv[index].m_jacobian_IM1.m_linear, linearM1);
0.16s                  a = simd_mul_add_v (a, (simd_type&)JMinv[index].m_jacobian_IM1.m_angular, angularM1);
   
                  //a = coordenateAccel[index] - acc.m_x - acc.m_y - acc.m_z - force[index] * diagDamp[index];
0.39s                  a = simd_add_v (a, simd_move_hl_v(a, a));
0.29s                  a = simd_add_s (a, simd_permut_v (a, a, PURMUT_MASK(3, 3, 3, 1)));
0.30s                  a = simd_sub_s(simd_load_s(coordenateAccel[index]), simd_mul_add_s(a, simd_load_s(force[index]), simd_load_s(diagDamp[index])));
   
   
                  //f = force[index] + invDJMinvJt[index] * a;
0.26s                  simd_type f = simd_mul_add_s (simd_load_s(force[index]), simd_load_s(invDJMinvJt[index]), a);
   
0.01s                  dgInt32 frictionIndex = m_normalForceIndex[index];
0.05s                  _ASSERTE (((frictionIndex < 0) && (force[frictionIndex] == dgFloat32 (1.0f))) || ((frictionIndex >= 0) && (force[frictionIndex] >= dgFloat32 (0.0f))));
   
                  //frictionNormal = force[frictionIndex];
                  //lowerFrictionForce = frictionNormal * lowerFrictionCoef[index];
                  //upperFrictionForce = frictionNormal * upperFrictionCoef[index];
0.08s                  simd_type frictionNormal = simd_load_s(force[frictionIndex]);
0.11s                  simd_type lowerFrictionForce = simd_mul_s (frictionNormal, simd_load_s(lowerFrictionCoef[index]));
0.13s                  simd_type upperFrictionForce = simd_mul_s (frictionNormal, simd_load_s(upperFrictionCoef[index]));
   
   
                  //if (f > upperFrictionForce) {
                  //   a = dgFloat32 (0.0f);
                  //   f = upperFrictionForce;
                  //} else if (f < lowerFrictionForce) {
                  //   a = dgFloat32 (0.0f);
                  //   f = lowerFrictionForce;
                  //}
0.14s                  f = simd_min_s (simd_max_s (f, lowerFrictionForce), upperFrictionForce);
0.42s                  a = simd_andnot_v (a, simd_or_v (simd_cmplt_s (f, lowerFrictionForce), simd_cmpgt_s (f, upperFrictionForce)));
0.22s                  accNormSimd = simd_max_s (accNormSimd, simd_and_v (a, signMask));
   
                  //prevValue = f - force[index]);
0.11s                  a = simd_sub_s (f, simd_load_s (force[index]));
0.07s                  a = simd_permut_v (a, a, PURMUT_MASK(0, 0, 0, 0));
   
                  //force[index] = f;
0.02s                  simd_store_s (f, &force[index]);
   
0.10s                  linearM0 = simd_mul_add_v (linearM0, (simd_type&) Jt[index].m_jacobian_IM0.m_linear, a);
0.06s                  angularM0 = simd_mul_add_v (angularM0,(simd_type&) Jt[index].m_jacobian_IM0.m_angular, a);
0.24s                  linearM1 = simd_mul_add_v (linearM1, (simd_type&) Jt[index].m_jacobian_IM1.m_linear, a);
0.21s                  angularM1 = simd_mul_add_v (angularM1,(simd_type&) Jt[index].m_jacobian_IM1.m_angular, a);
0.01s                  index ++;
               }
   
               //internalForces[prevM0].m_linear += Jt[prevIndex].m_jacobian_IM0.m_linear.Scale (prevValue);
               //internalForces[prevM0].m_angular += Jt[prevIndex].m_jacobian_IM0.m_angular.Scale (prevValue);
               //internalForces[prevM1].m_linear += Jt[prevIndex].m_jacobian_IM1.m_linear.Scale (prevValue);
               //internalForces[prevM1].m_angular += Jt[prevIndex].m_jacobian_IM1.m_angular.Scale (prevValue);
0.00s               (simd_type&) internalForces[m0].m_linear = linearM0;
0.02s               (simd_type&) internalForces[m0].m_angular = angularM0;
0.00s               (simd_type&) internalForces[m1].m_linear = linearM1;
               (simd_type&) internalForces[m1].m_angular = angularM1;
0.01s            }
            simd_store_s (accNormSimd, &accNorm);
         }
   
   
         for (dgInt32 i = 1; i < m_bodyCount; i ++) {
            dgBody* const body = bodyArray[i].m_body;
            //dgVector force (body->m_accel + internalForces[i].m_linear);
            //dgVector torque (body->m_alpha + internalForces[i].m_angular);
   
0.00s            simd_type force = simd_add_v ((simd_type&) body->m_accel, (simd_type&)internalForces[i].m_linear);
0.03s            simd_type torque = simd_add_v ((simd_type&) body->m_alpha, (simd_type&)internalForces[i].m_angular);
   
            //dgVector accel (force.Scale (body->m_invMass.m_w));
0.00s            simd_type accel = simd_mul_v (force, simd_set1 (body->m_invMass.m_w));
   
            //dgVector alpha (body->m_invWorldInertiaMatrix.RotateVector (torque));
            simd_type alpha = simd_mul_add_v (simd_mul_add_v (simd_mul_v ((simd_type&)body->m_invWorldInertiaMatrix[0], simd_permut_v (torque, torque, PURMUT_MASK(0, 0, 0, 0))),
                                                   (simd_type&)body->m_invWorldInertiaMatrix[1], simd_permut_v (torque, torque, PURMUT_MASK(1, 1, 1, 1))),
0.00s                                                   (simd_type&)body->m_invWorldInertiaMatrix[2], simd_permut_v (torque, torque, PURMUT_MASK(2, 2, 2, 2)));
   
            //body->m_veloc += accel.Scale(timeStep);
0.00s            (simd_type&) body->m_veloc = simd_mul_add_v ((simd_type&) body->m_veloc, accel, timeStepSimd);
            //body->m_omega += alpha.Scale(timeStep);
0.01s            (simd_type&) body->m_omega = simd_mul_add_v ((simd_type&) body->m_omega, alpha, timeStepSimd);
   
            //body->m_netForce += body->m_veloc;
0.00s            (simd_type&)internalVeloc[i].m_linear = simd_add_v ((simd_type&)internalVeloc[i].m_linear, (simd_type&) body->m_veloc);
            //body->m_netTorque += body->m_omega;
            (simd_type&)internalVeloc[i].m_angular = simd_add_v ((simd_type&)internalVeloc[i].m_angular, (simd_type&) body->m_omega);
         }
      }
   
      dgInt32 hasJointFeeback = 0;
      for (dgInt32 i = 0; i < m_jointCount; i ++) {
   //      maxForce = dgFloat32 (0.0f);
         dgInt32 first = constraintArray[i].m_autoPairstart;
         dgInt32 count = constraintArray[i].m_autoPaircount;
         for (dgInt32 j = 0; j < count; j ++) {
0.00s            dgInt32 index = j + first;
            dgFloat32 val = force[index];
            _ASSERTE (dgCheckFloat(val));
   //         maxForce = GetMax (dgAbsf (val), maxForce);
            jointForceFeeback[index][0] = val;
         }
   //      if (constraintArray[i].m_joint->GetId() == dgContactConstraintId) {
   //         m_world->AddToBreakQueue ((dgContact*)constraintArray[i].m_joint, maxForce);
   //      }
   
   //      hasJointFeeback |= dgUnsigned32 (constraintArray[i].m_joint->m_updaFeedbackCallback);
0.00s         hasJointFeeback |= (constraintArray[i].m_joint->m_updaFeedbackCallback ? 1 : 0);
   //      if (constraintArray[i].m_joint->m_updaFeedbackCallback) {
   //         constraintArray[i].m_joint->m_updaFeedbackCallback (*constraintArray[i].m_joint, m_timeStep, m_threadIndex);
   //      }
      }
   
   //   simd_type invStepSimd;
   //   signMask = simd_set1 (invStep);   
      simd_type invStepSimd = simd_set1 (invStep);   
      simd_type invTimeStepSimd = simd_set1 (m_invTimeStep);   
      simd_type accelerationTolerance = simd_set1 (maxAccNorm);   
      accelerationTolerance = simd_mul_s (accelerationTolerance, accelerationTolerance);
      for (dgInt32 i = 1; i < m_bodyCount; i ++) {
         dgBody* const body = bodyArray[i].m_body;
   
         //body->m_veloc = internalVeloc[i].m_linear.Scale(invStep);
         //body->m_omega = internalVeloc[i].m_angular.Scale(invStep);
         (simd_type&) body->m_veloc = simd_mul_v ((simd_type&) internalVeloc[i].m_linear, invStepSimd);
         (simd_type&) body->m_omega = simd_mul_v ((simd_type&) internalVeloc[i].m_angular, invStepSimd);
   
   
         //dgVector accel = (body->m_veloc - body->m_netForce).Scale (m_invTimeStep);
         //dgVector alpha = (body->m_omega - body->m_netTorque).Scale (m_invTimeStep);
         simd_type accel = simd_mul_v (simd_sub_v ((simd_type&) body->m_veloc, (simd_type&) body->m_netForce), invTimeStepSimd);
0.00s         simd_type alpha = simd_mul_v (simd_sub_v ((simd_type&) body->m_omega, (simd_type&) body->m_netTorque), invTimeStepSimd);
   
         //if ((accel % accel) < maxAccNorm2) {
         //   accel = zero;
         //}
         //body->m_accel = accel;
         //body->m_netForce = accel.Scale (body->m_mass[3]);
         simd_type tmp = simd_mul_v (accel, accel);
         tmp = simd_add_v (tmp, simd_move_hl_v (tmp, tmp));
         tmp = simd_add_s (tmp, simd_permut_v (tmp, tmp, PURMUT_MASK(0, 0, 0, 1)));
         tmp = simd_cmplt_s (tmp, accelerationTolerance);
         (simd_type&)body->m_accel = simd_andnot_v (accel, simd_permut_v (tmp, tmp, PURMUT_MASK(0, 0, 0, 0)));
         (simd_type&)body->m_netForce = simd_mul_v ((simd_type&)body->m_accel, simd_set1 (body->m_mass[3]));
   
         //if ((alpha % alpha) < maxAccNorm2) {
         //   alpha = zero;
         //}
         //body->m_alpha = alpha;
         tmp = simd_mul_v (alpha, alpha);
         tmp = simd_add_v (tmp, simd_move_hl_v (tmp, tmp));
         tmp = simd_add_s (tmp, simd_permut_v (tmp, tmp, PURMUT_MASK(0, 0, 0, 1)));
         tmp = simd_cmplt_s (tmp, accelerationTolerance);
0.00s         (simd_type&)body->m_alpha = simd_andnot_v (alpha, simd_permut_v (tmp, tmp, PURMUT_MASK(0, 0, 0, 0)));
   
   
         //alpha = body->m_matrix.UnrotateVector(alpha);
         alpha = simd_mul_v ((simd_type&)body->m_matrix[0], (simd_type&)body->m_alpha);
         alpha = simd_add_v (alpha, simd_move_hl_v (alpha, alpha));
         alpha = simd_add_s (alpha, simd_permut_v (alpha, alpha, PURMUT_MASK(0, 0, 0, 1)));
   
         tmp = simd_mul_v ((simd_type&)body->m_matrix[1], (simd_type&)body->m_alpha);
         tmp = simd_add_v (tmp, simd_move_hl_v (tmp, tmp));
         tmp = simd_add_s (tmp, simd_permut_v (tmp, tmp, PURMUT_MASK(0, 0, 0, 1)));
         alpha = simd_pack_lo_v (alpha, tmp);
   
         tmp = simd_mul_v ((simd_type&)body->m_matrix[2], (simd_type&)body->m_alpha);
         tmp = simd_add_v (tmp, simd_move_hl_v (tmp, tmp));
0.00s         tmp = simd_add_s (tmp, simd_permut_v (tmp, tmp, PURMUT_MASK(0, 0, 0, 1)));
         alpha = simd_permut_v (alpha, tmp, PURMUT_MASK(3, 0, 1, 0));
   
         //body->m_netTorque = body->m_matrix.RotateVector (alpha.CompProduct(body->m_mass));
         alpha = simd_mul_v (alpha, (simd_type&)body->m_mass);
         (simd_type&)body->m_netTorque = simd_mul_add_v (simd_mul_add_v (simd_mul_v ((simd_type&)body->m_matrix[0], simd_permut_v (alpha, alpha, PURMUT_MASK(0, 0, 0, 0))),
                                                                  (simd_type&)body->m_matrix[1], simd_permut_v (alpha, alpha, PURMUT_MASK(1, 1, 1, 1))),
                                                                  (simd_type&)body->m_matrix[2], simd_permut_v (alpha, alpha, PURMUT_MASK(2, 2, 2, 2)));
      }
   
   
      if (hasJointFeeback) {
         for (dgInt32 i = 0; i < m_jointCount; i ++) {
            if (constraintArray[i].m_joint->m_updaFeedbackCallback) {
               constraintArray[i].m_joint->m_updaFeedbackCallback (*constraintArray[i].m_joint, m_timeStep, m_threadIndex);
            }
         }
      }
   #endif
   }
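
For readers following the listing, the commented-out scalar lines in the hot inner loop describe what each row update does: compute a residual acceleration, propose a new force, clamp it between the lower and upper friction bounds, and track the largest unclamped residual. The sketch below is a scalar reading of those comments only, not the engine's actual code path. `RowResult` and `relaxRow` are hypothetical names, and `jmInvDotForces` stands for the `acc.m_x + acc.m_y + acc.m_z` sum in the listing.

```cpp
#include <cmath>

// Result of relaxing one constraint row.
struct RowResult {
    float force;     // new, clamped force for the row
    float delta;     // change in force, to be scattered back into the body forces
    float residual;  // |a| for the convergence test (0 when the force was clamped)
};

// Hypothetical scalar form of one row update, following the commented lines in the listing.
inline RowResult relaxRow(float coordAccel, float jmInvDotForces, float force,
                          float diagDamp, float invDJMinvJt,
                          float lowerBound, float upperBound) {
    // a = coordenateAccel[index] - acc.m_x - acc.m_y - acc.m_z - force[index] * diagDamp[index]
    float a = coordAccel - jmInvDotForces - force * diagDamp;
    // f = force[index] + invDJMinvJt[index] * a
    float f = force + invDJMinvJt * a;
    // Clamp to the friction bounds; a clamped row contributes no residual.
    if (f > upperBound) {
        a = 0.0f;
        f = upperBound;
    } else if (f < lowerBound) {
        a = 0.0f;
        f = lowerBound;
    }
    return RowResult{f, f - force, std::fabs(a)};
}
```

The profiler times cluster on exactly these steps (the horizontal add, the clamp and the mask), which fits an inner loop that runs once per row, per joint, per pass, per sub-step.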

Re: Trying to find bottleneck in NewtonUpdate

Postby Julio Jerez » Tue Jun 07, 2011 9:59 pm

CalculateForcesGameModeSimd is the correct function, and it is where most of the time will be spent.
That function is the core of the Newton iterative solver.
I do not know what to say. I did not expect 10 fps with 30 bodies, not even using the exact mode.

You are not running in double precision, are you?

Re: Trying to find bottleneck in NewtonUpdate

Postby Julio Jerez » Tue Jun 07, 2011 10:15 pm

There is one detail that may explain why the difference is so high.
Your AMD only has 512 KB of level 2 cache, while the Intel Core 2 has 2 MB.

I am not sure if that can explain the large difference. The only way to know is to run a cache benchmark on both machines to see if the cache size makes that kind of difference.
30 bodies is a small number of bodies, but I believe the solver still uses more than 512 KB of memory.
One thing you can do is reduce the body count and see if you find where performance breaks down.
If it does, then it means the penalty on the AMD CPU for a cache miss is very, very high, but that should be common to other applications as well.
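
A rough way to run that kind of cache test without extra tools is a pointer-chasing loop over working sets of different sizes. The sketch below is an assumption-laden micro-benchmark, not a calibrated tool: `makeCycle` and `nsPerHop` are hypothetical names, the random cycle is there to defeat hardware prefetching, and the numbers are only meaningful when compared between machines built with the same compiler settings.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Builds a random permutation that forms one single cycle (Sattolo's algorithm),
// so following next[i] repeatedly visits every slot before returning to the start.
std::vector<std::size_t> makeCycle(std::size_t n) {
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    if (n < 2) return next;
    std::mt19937 rng(12345);  // fixed seed so runs are repeatable
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Average nanoseconds per dependent load when chasing pointers through
// a working set of roughly `bytes` bytes.
double nsPerHop(std::size_t bytes, std::size_t hops) {
    const std::size_t n = bytes / sizeof(std::size_t);
    if (n < 2 || hops == 0) return 0.0;
    const std::vector<std::size_t> next = makeCycle(n);
    std::size_t idx = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t h = 0; h < hops; ++h) {
        idx = next[idx];  // each load depends on the previous one
    }
    const auto stop = std::chrono::steady_clock::now();
    volatile std::size_t sink = idx;  // keeps the chase from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count()
           / static_cast<double>(hops);
}
```

Calling `nsPerHop` with sizes such as 128 KB, 512 KB, 1 MB and 4 MB on both machines should show a jump in cost per hop near each machine's L2 size; a much steeper jump on the netbook would support the cache-miss theory.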

Laptop manufacturers are notorious for using very inexpensive CPUs because profit margins are very small on laptops. I too have a laptop like that.

Re: Trying to find bottleneck in NewtonUpdate

Postby andylhansen » Tue Jun 07, 2011 10:20 pm

Yeah, I'm thinking it's probably just the cache. I'm going to download the AMD profiler to benchmark the cache use.

What do you mean by running in double precision?

Edit: I found the option for double precision. I've got it turned off.

Edit2: I kept adding boxes until it started breaking down. At about 55 boxes, it would drop to 10FPS and then stay at around 30FPS once the objects settled.

Re: Trying to find bottleneck in NewtonUpdate

Postby andylhansen » Wed Jun 08, 2011 1:48 am

I ran the application demo exe and tested out some different demos. I tried the one with all the stacked boxes, and it ran at a consistent 60FPS when the option to run physics in a separate thread is enabled. It runs at 25FPS when physics runs in the same thread. It doesn't seem to have the problem that my application has, where it slows to a crawl.

In my application, the second I hit 55 dynamic bodies, my application slows to a crawl. I manually made it so that NewtonUpdate is only called every 20 frames. This seemed to help, until I hit 55 bodies, at which point it slowed to a crawl again. Maybe something in the rest of my code is conflicting with Newton and causing the bottleneck.

Re: Trying to find bottleneck in NewtonUpdate

Postby Julio Jerez » Wed Jun 08, 2011 7:25 am

andylhansen wrote:
I ran the application demo exe and tested out some different demos.
I tried the one with all the stacked boxes, and it ran at a consistent 60FPS when the option to run physics in a separate thread is enabled.
It will run at 25FPS when it is run in the same thread. Doesn't seem to have the problem that my application has where it slows to a crawl.

If this is the SDK demos, which demo are you running? Make sure you use one that is representative of your demo type, I mean the same number of boxes in the same configuration.
Maybe you can change one to run a set of stacked objects identical to what you are doing in your game. That way you will be sure that you are testing the exact same conditions.
In file "\applications\newtonDemos\sdkDemos\NewtonDemos.cpp",
if you set DEFAULT_SCENE 2
then it will load a scene from a file:

//#define DEFAULT_SCENE 0 // friction test
//#define DEFAULT_SCENE 1 // closest distance
#define DEFAULT_SCENE 2 // Box stacks
...

Then you can open file "applications\newtonDemos\sdkDemos\demos\BasicStacking.cpp"

and try loading one of the scenes, like this:

char fileName[2048];
//GetWorkingFileName ("boxStacks_1.ngd", fileName);
//GetWorkingFileName ("boxStacks_3.ngd", fileName);
//GetWorkingFileName ("boxStacks.ngd", fileName);
//GetWorkingFileName ("pyramid40x40.ngd", fileName);
GetWorkingFileName ("pyramid20x20.ngd", fileName);

scene->LoadScene (fileName);

That file has about 200 bodies. You can edit the file, or better yet you can link to the scene library in your project and save your scene to a file; that way the same scene can be tested.

The thing is that you are using tens of objects, and that slowdown should not even happen with hundreds of objects.
Running in a separate thread does not make it faster; it just runs the physics in a different thread so you see the graphics updating faster. You should test running in the same thread.
