Channel: GameDev.net

Why InterlockedAdd is much faster than InterlockedCompareStore under high contention


Hey guys,

 

I have a compute shader with a threadgroup size of 8*8*8 that tries to update a single 32-bit uint (a UAV, cleared to 0 before the CS runs).

 // Update the brick structure. u3BlockIdx is the same for every thread in one threadgroup.
    if (true || fSDF < vParam.fTruncDist) { // for testing, I made this always true so every thread does the update
        //tex_uavFlagVol[u3BlockIdx] = 1;                              // case 1: fastest; no atomicity, only tells us whether any thread updated it
        //InterlockedAdd(tex_uavFlagVol[u3BlockIdx], 1);               // case 2: ~2x the time of case 1; tells us how many threads actually updated it
        //InterlockedCompareStore(tex_uavFlagVol[u3BlockIdx], 0, 1);   // case 3: almost 2x the time of case 2; not very useful, just testing for fun
    }

The result looks straightforward. But my expectation was that under such contention (512 threads updating the same address), case 3 might be the fastest, since it does 512 serialized reads and compares but only 1 write. Case 1 does no atomic write, but it actually performs 512 writes, and with bank conflicts it shouldn't be much faster than case 3. Case 2 is what really confuses me: it does 512 serialized writes, so why is it twice as fast as case 3, where there is only 1 write?

 

My understanding is that a serialized read should be no slower than a serialized write, and the extra 512 compares should be negligible, so case 3 really shouldn't be that slow compared to case 2.

 

But apparently I'm getting something wrong. Thanks to anyone who can enlighten me on this.

 

 

 

