Well, in my case I will see if I can get anything going in OpenCL, and from there I will see if I can add other solvers.
After I get one solver working, there is a lot of learning that we can use to improve it and make other solutions.
Nothing stops us from making a special CUDA or Vulkan version, or both, if we have to.
Anyway, in the refactoring of the solver I am making the decision that the solver is going to take a hit in the single-threaded version to favor the multithreaded one.
This is because, as I mentioned before, some loops in the solver fall into the category of reductions. Basically these are loops that in my time we used to call systems with memory,
for example a loop of the type
- Code: Select all
for (i = 1; i < count; i++)
    a[i] = a[i-1] + a[i];
This kind of loop can be parallelized, but in my experience the overhead of doing so on a CPU results in a net loss if count is not large enough. In Newton, count is on the order of 2 to 4 thousand.
So as a result there are many algorithms in Newton like this that are implemented single threaded.
Once count becomes 16 thousand or more, the parallel version starts to make sense and we see some marginal gains. However, even if we start to see gains, the final performance is so slow that it is not worth making a parallel version.
This changes for GPU. On a GPU the loop above is a killer, so we need a parallel version, and there we start to see the proliferation of multiple algorithms that do the same thing.
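To make the trade-off concrete, here is a minimal sketch of what a parallel version of that loop can look like on a CPU: a three-pass blocked scan using plain std::thread. The name ParallelPrefixSum and everything in it are made up for illustration; this is not code from Newton. Each thread scans its own block, a short serial pass turns the block totals into offsets, and a second parallel pass adds the offsets.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not part of Newton: in-place inclusive prefix sum,
// the parallel equivalent of a[i] = a[i-1] + a[i].
void ParallelPrefixSum(std::vector<int>& a, std::size_t threadCount)
{
    const std::size_t count = a.size();
    if (count == 0 || threadCount == 0) return;
    threadCount = std::min(threadCount, count);
    const std::size_t blockSize = (count + threadCount - 1) / threadCount;
    std::vector<int> blockSum(threadCount, 0);

    // Runs job(t, begin, end) on one thread per block and waits for all of them.
    auto forEachBlock = [&](auto job)
    {
        std::vector<std::thread> workers;
        for (std::size_t t = 0; t < threadCount; ++t)
        {
            const std::size_t begin = std::min(t * blockSize, count);
            const std::size_t end = std::min(begin + blockSize, count);
            workers.emplace_back(job, t, begin, end);
        }
        for (std::thread& w : workers) w.join();
    };

    // Pass 1: every thread runs the serial loop on its own block only.
    forEachBlock([&](std::size_t t, std::size_t begin, std::size_t end)
    {
        for (std::size_t i = begin + 1; i < end; ++i) a[i] += a[i - 1];
        if (end > begin) blockSum[t] = a[end - 1];
    });

    // Pass 2: serial exclusive scan over the few block totals.
    int running = 0;
    for (std::size_t t = 0; t < threadCount; ++t)
    {
        const int total = blockSum[t];
        blockSum[t] = running;
        running += total;
    }

    // Pass 3: every thread adds the sum of all previous blocks to its block.
    forEachBlock([&](std::size_t t, std::size_t begin, std::size_t end)
    {
        for (std::size_t i = begin; i < end; ++i) a[i] += blockSum[t];
    });
}
```

Note that the array is touched twice and the threads are synchronized twice, which is where the overhead comes from when count is only a few thousand.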
What I am doing now is going with the parallel version even if the single-threaded version is going to take a hit. Here is an example from the engine.
Take for example this operation, which takes a joint and determines which bodies are at rest:
- Code: Select all
ndConstraint* const joint = jointArray[i];
ndBodyKinematic* const body0 = joint->GetBody0();
ndBodyKinematic* const body1 = joint->GetBody1();
body0->m_bodyIsConstrained = 1;
body0->m_resting = body0->m_resting & resting;
body1->m_bodyIsConstrained = 1;
body1->m_resting = body1->m_resting & resting;
This is in a loop where a few joints can share a body, so as a result it cannot be done multithreaded, because we run the risk of a race condition.
The thread-safe version would be this
- Code: Select all
ndConstraint* const joint = jointArray[i];
ndBodyKinematic* const body0 = joint->GetBody0();
ndBodyKinematic* const body1 = joint->GetBody1();
ndSetData& data0 = body0->m_setData;
ndSetData& data1 = body1->m_setData;
const bool resting = body0->m_equilibrium & body1->m_equilibrium;
data0.SetConstrainedAndRestingState(true, resting);
data1.SetConstrainedAndRestingState(true, resting);
where SetConstrainedAndRestingState is this function
- Code: Select all
inline void ndSetData::SetConstrainedAndRestingState(bool constrained, bool resting)
{
    ndSetData oldData(m_transaction);
    ndSetData newData(m_transaction);
    oldData.m_resting = 1;
    oldData.m_bodyIsConstrained = 0;
    newData.m_resting = resting;
    newData.m_bodyIsConstrained = constrained;
    m_transaction.compare_exchange_weak(oldData.m_value, newData.m_value);
}
As you can see it is quite a bit more expensive in single threaded: a lot more instructions, and it uses a CAS operation.
The CAS is not really as bad as it looks, since the write to memory is conditional, so it only writes to memory when there are real changes. So that part is not too bad, but it does still write at least once for each item.
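For anyone who wants to experiment with the idea outside the engine, here is a self-contained sketch of the same pattern. It assumes the two flags are packed into one atomic byte; BodyState, its bit layout, and the retry loop are made up for illustration and are not the engine's ndSetData. It keeps the semantics of the original loop (resting is ANDed in, constrained is set) and skips the write when nothing would change.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the per-body flags; not Newton's ndSetData.
struct BodyState
{
    enum : std::uint8_t { kResting = 1, kConstrained = 2 };

    // A body starts out resting and unconstrained.
    std::atomic<std::uint8_t> m_bits{kResting};

    void SetConstrainedAndRestingState(bool constrained, bool resting)
    {
        std::uint8_t oldBits = m_bits.load(std::memory_order_relaxed);
        for (;;)
        {
            std::uint8_t newBits = oldBits;
            if (constrained) newBits |= kConstrained;
            if (!resting) newBits &= static_cast<std::uint8_t>(~kResting);

            // Nothing to change: return without writing to memory.
            if (newBits == oldBits) return;

            // On failure oldBits is refreshed with the current value,
            // so the next iteration merges into what another thread wrote.
            if (m_bits.compare_exchange_weak(oldBits, newBits, std::memory_order_relaxed)) return;
        }
    }

    bool IsResting() const { return (m_bits.load(std::memory_order_relaxed) & kResting) != 0; }
    bool IsConstrained() const { return (m_bits.load(std::memory_order_relaxed) & kConstrained) != 0; }
};
```

Because both flags only ever move in one direction during the pass (resting can be cleared, constrained can be set), the order in which the joints reach a shared body does not change the final result.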
So these are the things I have to do first to get the engine conditioned for GPU. What I am not doing is making it use indices. To me that is just ridiculous, to the point that if I can't use pointers, then so be it, there will not be a GPU version.