Hey Guys,
From what I know, the more warps/wavefronts are scheduled on one CU (higher occupancy), the better the performance when your shader does a lot of memory reads: when one warp stalls waiting on memory, the CU switches to another warp to keep itself busy, so more warps on a CU means less chance that the CU sits idle waiting for a memory read.
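To make that concrete, here is a toy model of how I picture latency hiding. It is only a sketch: the function name `cu_utilization` and all the cycle counts are made up for illustration, and it assumes every warp alternates a fixed chunk of ALU work with one memory read, with free switching between warps.

```python
def cu_utilization(num_warps, compute_cycles, mem_latency_cycles):
    """Fraction of cycles in which the CU has a warp ready to run (toy model).

    Assumption: each warp does compute_cycles of ALU work, then waits
    mem_latency_cycles on a read, and the CU switches warps at no cost.
    """
    period = compute_cycles + mem_latency_cycles
    # While one warp waits, the others can fill the gap, up to 100% busy.
    return min(1.0, num_warps * compute_cycles / period)


# Hypothetical numbers: 20 cycles of math per 300-cycle memory read.
for warps in (1, 2, 4, 8, 16):
    print(warps, cu_utilization(warps, compute_cycles=20, mem_latency_cycles=300))
```

In this model the CU only stops idling once there are enough warps to cover the whole read latency, which is why I assume occupancy matters so much for read-heavy shaders.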
But what about the case where your shader only has memory writes? In theory a memory write shouldn't stall the warp, since no later instruction depends on the write. But since memory writes only have limited bandwidth, they will probably affect warp execution somehow... but how? And if warps aren't stalled by memory writes, does that mean you don't have to work so hard to fit more warps into a CU's budget (since no stalls means there is no need to switch to another warp to keep the CU busy)?
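Here is a rough sketch of what I imagine might happen, in case it helps frame the question. It is pure guesswork on my part: I am assuming stores go into some finite queue that memory drains at a fixed rate, and that a warp can only be held up when that queue is full. The name `simulate_stores` and every number are hypothetical, not taken from any real GPU.

```python
def simulate_stores(cycles, stores_per_cycle, drain_per_cycle, queue_capacity):
    """Count cycles in which some store could not issue because the queue was full."""
    queued = 0
    stalled_cycles = 0
    for _ in range(cycles):
        # Memory retires up to drain_per_cycle queued stores each cycle.
        queued = max(0, queued - drain_per_cycle)
        # Warps try to issue stores_per_cycle stores into the remaining space.
        free = queue_capacity - queued
        issued = min(stores_per_cycle, free)
        if issued < stores_per_cycle:
            stalled_cycles += 1  # back-pressure: at least one store had to wait
        queued += issued
    return stalled_cycles


# Stores issued no faster than memory drains them: no stalls in this model.
print(simulate_stores(100, stores_per_cycle=1, drain_per_cycle=1, queue_capacity=8))
# Stores issued faster than they drain: the queue fills and issue gets throttled.
print(simulate_stores(100, stores_per_cycle=2, drain_per_cycle=1, queue_capacity=8))
```

If real hardware behaves anything like this, then adding more warps wouldn't help once write bandwidth is the limit, which is really what I am trying to confirm.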
Thanks