Hey guys,
Recently I've been having a lot of fun playing with GPU atomics, compute shaders, and DispatchIndirect, and I've run into a tricky situation:
================================Background====================================
In my project I'm doing volume rendering, and to speed it up I partitioned the volume into blocks (8^3 voxels each) and keep a GPU buffer containing the indices of the blocks that are not empty. So I have a large buffer of non-empty block coordinates, which I call occupiedBlocksBuf.
The scene in the volume is dynamic, so each frame a compute shader updates the whole volume (this is fast, since it only touches voxels that actually changed). As a result, some previously empty blocks become non-empty, and some previously non-empty blocks become empty.
For blocks that become empty, I maintain a buffer called freedSlotsBuf: when a block is freed, the compute shader finds its index in occupiedBlocksBuf, writes FREED_FLAG into that location, and then appends that index to freedSlotsBuf.
For blocks that become non-empty, the compute shader first tries to grab an available slot from freedSlotsBuf and writes the new block's coordinate into that slot of occupiedBlocksBuf, so freed slots in occupiedBlocksBuf get filled first; if there are no freed slots left, it appends the block's coordinate to the end of occupiedBlocksBuf.
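To make sure I'm describing the scheme right, here is a minimal CPU-side sketch of the free/allocate bookkeeping (FREED_FLAG is from my description above; the struct and method names are just illustrative, and on the GPU the pushes/pops would of course be atomic counter ops):

```cpp
#include <cstdint>
#include <vector>

// CPU-side model of the GPU scheme: occupiedBlocks holds packed block
// coordinates, freedSlots holds indices of entries marked FREED_FLAG.
constexpr uint32_t FREED_FLAG = 0xFFFFFFFFu;

struct BlockList {
    std::vector<uint32_t> occupiedBlocks; // block coords (never shrinks)
    std::vector<uint32_t> freedSlots;     // indices of freed entries

    // Block becomes empty: mark its slot and remember it for reuse.
    void Free(uint32_t slotIdx) {
        occupiedBlocks[slotIdx] = FREED_FLAG;
        freedSlots.push_back(slotIdx);
    }

    // Block becomes non-empty: reuse a freed slot if any, else append.
    void Allocate(uint32_t blockCoord) {
        if (!freedSlots.empty()) {
            uint32_t slot = freedSlots.back();
            freedSlots.pop_back();
            occupiedBlocks[slot] = blockCoord;
        } else {
            occupiedBlocks.push_back(blockCoord);
        }
    }
};
```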
So that's the basic idea. As you may notice, the 'size' of occupiedBlocksBuf never decreases, and as the program runs the buffer can in some cases become fragmented (lots of slots get freed), which is bad...
===============================Problem=======================================
I then wrote a defragmentation shader (freedSlotsBuf tells me how many freed slots I've got and where they are, so I have everything I need) and use DispatchIndirect to defragment occupiedBlocksBuf. The indirect params are written by a compute shader based on the size of freedSlotsBuf: when that size is below a threshold, the params are (0, 1, 1), which launches zero thread groups.
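For clarity, the indirect-arg logic I mean is something like the following (shown CPU-side; on the GPU it's a one-thread compute shader writing the same three uints into the indirect args buffer; THREAD_GROUP_SIZE and DEFRAG_THRESHOLD are just illustrative names/values):

```cpp
#include <cstdint>

// Illustrative constants; the real shader has its own values.
constexpr uint32_t THREAD_GROUP_SIZE = 64;
constexpr uint32_t DEFRAG_THRESHOLD  = 128;

struct DispatchArgs { uint32_t x, y, z; };

// Mirrors the compute shader that fills the indirect args buffer:
// below the threshold it writes (0, 1, 1) so DispatchIndirect launches
// zero thread groups; otherwise one thread per freed slot, rounded up
// to whole thread groups.
DispatchArgs BuildDefragArgs(uint32_t freedSlotCount) {
    DispatchArgs args{0, 1, 1};
    if (freedSlotCount >= DEFRAG_THRESHOLD) {
        args.x = (freedSlotCount + THREAD_GROUP_SIZE - 1) / THREAD_GROUP_SIZE;
    }
    return args;
}
```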
However, with defragmentation done this way, I have to issue the following code every frame on the CPU side, even though I know that 99% of the time it maps to an empty GPU dispatch:
void TSDFVolume::OnDefragment(ComputeContext& cptCtx)
{
    GPU_PROFILE(cptCtx, L"Defragment");
    cptCtx.SetPipelineState(_cptBlockQDefragment);
    cptCtx.SetRootSignature(_rootsig);
    cptCtx.TransitionResource(_occupiedBlocksBuf, UAV);
    cptCtx.TransitionResource(_freedFuseBlocksBuf, csSRV);
    cptCtx.TransitionResource(_jobParamBuf, csSRV);
    cptCtx.TransitionResource(_indirectParams, IARG);
    Bind(cptCtx, 2, 0, 1, &_occupiedBlocksBuf.GetUAV());
    Bind(cptCtx, 2, 1, 1, &_freedFuseBlocksBuf.GetCounterUAV(cptCtx));
    Bind(cptCtx, 3, 0, 1, &_freedFuseBlocksBuf.GetSRV());
    Bind(cptCtx, 3, 1, 1, &_jobParamBuf.GetSRV());
    cptCtx.DispatchIndirect(_indirectParams, 48);
}
There are UAV transitions and PSO changes involved, so this definitely has non-zero CPU/GPU cost, which looks sub-optimal...
To avoid that, the CPU could decide whether to call OnDefragment at all, but that requires reading back the freedSlotsBuf size from the GPU at some frequency, which may have an even worse perf impact...
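(The readback I'm imagining would be latency-hidden rather than a blocking Map: copy the counter into a small ring of readback buffers each frame and have the CPU read the value from a few frames ago, so the decision is slightly stale but never stalls. A CPU-side sketch of just the ring bookkeeping, with the actual GPU copy/fence omitted and kLatency as an assumed frames-in-flight count:)

```cpp
#include <array>
#include <cstdint>

// Ring of per-frame staged counter values. Slot (frame % kLatency) is
// overwritten this frame and read back kLatency frames later, by which
// time the corresponding GPU copy would have completed.
constexpr uint32_t kLatency = 3; // e.g. triple-buffered frames in flight

struct CounterReadback {
    std::array<uint32_t, kLatency> staged{};
    uint64_t frame = 0;

    // Called once per frame with the value the GPU copied this frame.
    // Returns the counter from kLatency frames ago (0 until warmed up).
    uint32_t Update(uint32_t gpuCounterThisFrame) {
        uint32_t slot = static_cast<uint32_t>(frame % kLatency);
        uint32_t stale = staged[slot]; // written kLatency frames ago
        staged[slot] = gpuCounterThisFrame;
        ++frame;
        return stale;
    }
};
```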
So, any suggestions? Or are there existing, better ways to do this kind of GPU buffer defragmentation?
Thanks in advance.
P.S. When does a UAV barrier actually have non-zero GPU cost? I feel like if there are no reads/writes between two UAV barriers on the same resource, the second barrier should take no GPU time, right? (Please correct me if I got that wrong.)