writing more efficient spin locks

by **Julio Jerez** » Fri Apr 02, 2021 11:26 am

I always wrote spins locks this way

Code: Select all: void Lock() { while (m_lock.exchange(1)) { std::this_thread::yield(); } #endif }

apparently this secund method is way more efficient

Code: Select all: void Lock() { dUnsigned32 test = 0; while (!m_lock.compare_exchange_weak(test, 1)) { std::this_thread::yield(); test = 0; } #endif }

the reasoning for this is that in a system where there are not too many thread contentions, (more threads than cores hit that lock) the thread that acquire the look write to memory a 1 and that has to go to system memory.

the threads that go to spin the they read memory from cache and see there the value is equal to expected so they to not write to memory and the memory channel for the memory location doesn't block other threads.

while the first method using m_lock.exchange(1) always writes to system memory therefore blocking other thread over the entire system.

has any one experimented with this?

by **Julio Jerez** » Fri Apr 02, 2021 2:13 pm

here is a case whe this kind of thing seem important
take for example thsi function.

Code: Select all: void dDisjointSets::SetSleepState0(dUnsigned32 id0, dUnsigned32 id1, dUnsigned32 state0, dUnsigned32 state1) { dData data0(m_data[id0]); dData data1(m_data[id1]); dData newData0(data0); dData newData1(data1); newData0.m_solverSleep0 = state0; newData1.m_solverSleep1 = state1; m_data[id0].compare_exchange_weak(data0.m_value, newData0.m_value); m_data[id1].compare_exchange_weak(data1.m_value, newData1.m_value); }

where id0 and id1 are node in a graph, basically this funtion is painting a graph with a color

it is possible to write tshi in few different ways,
one would be to test is the stat was already set, and do it si teh test fail, this was how I was doing until now, by this causes lots of branch misprediction.

another wat is to simply white atomically or not the value to the node, since all write are atomic at least on x86, this will cause lot of tread contention when id0 or id1 is a node shared by many links.

the better way is using compare_and swap.
this method white the new value onel when teh current is equal to teh expected but after the firs write, is only compares then and ne write again, so each node is written only once.

I think I start to see the usefulness of these these compare_and_swap function as oppose to using the unconditional write or exchange the data.

in theory this is one of the reasons of poor performance when using multiple threads. too much access to memory is no good.

by **Bird** » Thu Apr 08, 2021 9:23 pm

Nick at CoffeeBeforeArch has a series of videos that I found helpful for understanding spinlocks.

https://www.youtube.com/watch?v=AN6XHy2znzc&list=PLxNPSjHT5qvsZyes3ATYvhPhwkoY33L2v&ab_channel=CoffeeBeforeArch

by **Bird** » Fri Apr 09, 2021 10:49 am

I use OIIO for imaging and it has a Spin Lock that performs a bit better than Newton's. I used the same benchmark test that Nick uses in his video

Here's the link to the OIIO Spin Lock( spin_mutex )
https://github.com/OpenImageIO/oiio/blob/master/src/include/OpenImageIO/thread.h

by **Julio Jerez** » Fri Apr 09, 2021 12:22 pm

oh yes that's was what I was saying teh spin lock in newton 3.14 and even 4.00 are really bad.

until I start using compare_exchange_weak which does does a conditional write.
I think out version is now more elegan, becauce it use CAS instead of exchange and load but just nick peaking.
I added the locals delay instead for the thread_yields, that should make even better now.

The only problem I have with using mm_pause instead of thread yield, is that mm_pause is platform specific. and defeat teh purpose of portability. you wouel think that thread_yile will use that,
but I will no assume for now, maybe later I will add a check for platfom.

this is who our version looks now

Code: Select all: /// Simple spin lock for synchronizing threads for very short period of time. class dSpinLock { public: dSpinLock() #ifndef D_USE_THREAD_EMULATION :m_lock(0) #endif { } void Lock() { #ifndef D_USE_THREAD_EMULATION //while (m_lock.exchange(1)) //{ // std::this_thread::yield(); //} dInt32 exp = 1; for (dUnsigned32 test = 0; !m_lock.compare_exchange_weak(test, 1); test = 0) { Delay(exp); } #endif } void Unlock() { #ifndef D_USE_THREAD_EMULATION m_lock.exchange(0); #endif } #ifndef D_USE_THREAD_EMULATION private: D_CORE_API void Delay(dInt32& exp); dAtomic<dUnsigned32> m_lock; #endif }; void dSpinLock::Delay(dInt32& exp) { // adding exponential pause delay for (dInt32 i = 0; i < exp; i++) { for (volatile dInt32 j = 0; j < 4; j++) { _mm_pause(); } } exp = dMin(exp * 2, 64); }

by **Bird** » Fri Apr 09, 2021 12:58 pm

Yes, the new version definitely a bit faster, although it's a little hard too judge since the results vary on each run. But for 8 threads the new version time was 11.0ms as opposed to 19.8ms in the version I tried first.

by **Julio Jerez** » Fri Apr 09, 2021 2:05 pm

alright

from 19.8 to 11.0 ms, I would say that's a good optimization considering we are using plain standard and portable c11 constructs. beyond that, we will have to use more esoteric platform specific knowledge just to get at those 2 more ms.

also the other two points is that, before I start refactoring some of the algorithms to prepare them for GPU, there where very little usage of Spinlocks, but that was at the expense of more sequential algorithms that do not scale well with very large array of items.

once you pass 1000 to 2000 items in a loop, an or O(k n * log n) cost start to become a dominates factor.
for example say we use a quick sort, to sort bodies of joint by some criteria.
quick sort can be paralyze by the extra over head yield a very small marginal gains.
an alternative is counting sort in some cases or radix source for more generic.

in general when sorting under 1000 items, quick sort beats radix source, but once you pass that limit, that O (n log (n)) become too steep to beat. so is better to go with the algorithm with the linear time complexity even if it is harder to implements and required usage of Locks.

this is what prompted me to revisit the SpinLock, so it is all good for now.

thank for running that test for me :mrgreen:

writing more efficient spin locks

writing more efficient spin locks

Re: writing more efficient spin locks

Re: writing more efficient spin locks

Re: writing more efficient spin locks

Re: writing more efficient spin locks

Re: writing more efficient spin locks

Re: writing more efficient spin locks

Who is online