Hey guys,
I have a compute shader with a threadgroup size of 8*8*8 in which every thread tries to update a single 32-bit uint (a UAV that is cleared to 0 before the CS runs).
// Update brick structure. u3BlockIdx is the same for every thread in one threadgroup.
if (true || fSDF < vParam.fTruncDist) { // for testing purposes I made this always true, so every thread does the update
    //tex_uavFlagVol[u3BlockIdx] = 1; // case 1: fastest, no atomicity; only useful when we just want to know whether any thread updated it
    //InterlockedAdd(tex_uavFlagVol[u3BlockIdx], 1); // case 2: takes twice as long as case 1; useful when we need to know how many threads actually updated it
    //InterlockedCompareStore(tex_uavFlagVol[u3BlockIdx], 0, 1); // case 3: takes almost twice as long as case 2; not very useful, just tested for fun
}
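For anyone who wants to reproduce the comparison, a minimal stand-alone sketch of such a test kernel might look like the following. The register binding, the entry point name `main`, the `TEST_CASE` macro, and the use of `SV_GroupID` as `u3BlockIdx` are assumptions made for illustration and are not taken from the original shader.

```hlsl
// Hypothetical stand-alone test kernel; bindings and names are assumed.
// Select a variant at compile time, e.g. by defining TEST_CASE as 1, 2 or 3.
#ifndef TEST_CASE
#define TEST_CASE 1
#endif

// One 32-bit flag per brick, cleared to 0 before the dispatch.
RWTexture3D<uint> tex_uavFlagVol : register(u0);

[numthreads(8, 8, 8)]
void main(uint3 u3GroupID : SV_GroupID)
{
    // Assumption: one threadgroup per brick, so all 512 threads share this index.
    uint3 u3BlockIdx = u3GroupID;

#if TEST_CASE == 1
    // Case 1: plain (non-atomic) store.
    tex_uavFlagVol[u3BlockIdx] = 1;
#elif TEST_CASE == 2
    // Case 2: atomic increment; the flag ends up holding the number of writers.
    InterlockedAdd(tex_uavFlagVol[u3BlockIdx], 1);
#elif TEST_CASE == 3
    // Case 3: atomic compare-and-store; writes 1 only if the flag is still 0.
    InterlockedCompareStore(tex_uavFlagVol[u3BlockIdx], 0, 1);
#endif
}
```

Compiling the same file three times with a different `TEST_CASE` value keeps everything except the update itself identical, so timing differences should come from the update alone.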
At first glance the results look straightforward. But my expectation was that under this much contention (512 threads trying to update the same value) case 3 might be the fastest, since it does 512 serialized reads and compares but only 1 write. Case 1 does no atomic write, but it still does 512 writes, and with bank conflicts it should not be much faster than case 3. Case 2 is what really confuses me: it does 512 serialized writes, so why is it twice as fast as case 3, where there is only 1 write?
My understanding is that a serialized read should be no slower than a serialized write, and the time for the extra 512 compares should be negligible, so case 3 really shouldn't be that slow compared to case 2.
But apparently I am getting something wrong. Thanks to anyone who can enlighten me on this.