Hey Guys,
I need some advice and suggestions on doing a tricky GPU-CPU task efficiently; here comes the background first:
=====================Background (optional)===========================
In my recent project I am trying to align two point clouds. Think of a depth camera taking pictures of a target (each pixel is a real depth value, like a depth buffer holding actual depth rather than some 1/z) from two slightly different views, i.e. with two different mView matrices, camera pose1 and pose2. Reprojecting those two 'depth buffers' gives you two point clouds. Now the job is to find the matrix M that aligns the two point clouds (the matrix that transforms pose1 to pose2). There are algorithms to do this; in my case I use FastICP (fast iterative closest point). As the name suggests, it's an iterative method, so the routine looks like the following:
=============================Detail================================
Texture2D<float4> depth_and_normalmap1; // 512x424 pixels
Texture2D<float4> depth_and_normalmap2; // 512x424 pixels
StructuredBuffer<float4> workingBuf[7]; // 512x424 elements (float4) each
float reprojection_error = FLT_MAX;
int iterations = 0;
matrix m = IdentityMatrix; // 4x4 matrix
float4 result[7] = {};
do {
    m = CPU_ICPSolver( result );        // nothing to do with the GPU inside
    GPU_PrepareWorkingBuffer(
        depth_and_normalmap1,           // input as SRV
        depth_and_normalmap2,           // input as SRV
        m,                              // input as CBV
        workingBuf );                   // output as UAV (all 7 buffers)
    for (int i = 0; i < 7; ++i) {
        GPU_Reduction::Process( workingBuf[i] ); // reduction to 1 float4 value inside the GPU, but not copied to the ReadBack buffer yet
    }
    GPU_Reduction::Readback( result );  // read the reduction results: copy from default heap to readback heap, has to wait for the GPU inside
    reprojection_error = GetReprojectionError( result );
    ++iterations;
} while (iterations < 20 && reprojection_error > threshold);
Above is how the workflow looks. Right now I have tested and profiled the 1-iteration case on my GTX 680;
this part alone:
for (int i = 0; i < 7; ++i) {
    GPU_Reduction::Process( workingBuf[i] ); // reduction to 1 float4 value inside the GPU, but not copied to the ReadBack buffer yet
}
GPU_Reduction::Readback( result ); // read the reduction results: copy from default heap to readback heap, has to wait for the GPU inside
took 0.65 ms (does that seem reasonable, or is it incredibly slow? please let me know, thanks), so if I add GPU_PrepareWorkingBuffer and do 20 iterations I will probably end up with around 16 ms... which seems too much...
The reduction shader I wrote is a very standard one guided by this post, so not a naive one (but there are some tricky things, which I will cover later...).
============================Questions==============================
So for a standard GPU reduction, as this post describes, the result dimension is equal to the number of threadgroups (for example, Dispatch(threadgroupX, threadgroupY, 1) means your reduction result will be of size (threadgroupX, threadgroupY)).
In my case I need to reduce each buffer to only one value, so ideally I should dispatch only one threadgroup, but one threadgroup can have at most 1024 threads, so even if each thread does 16 fetches, one threadgroup can only handle 1024*16 = 16384 elements, which is not enough for my 512x424 = 217088.
There are other ways to reduce to 1 element even with multiple threadgroups, e.g. by using atomic adds, however InterlockedAdd only works on 32-bit integers, so I can't use that method directly (mine is a sum over float4) (or is there some efficient workaround? thanks).
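(The only workaround I can think of is fixed-point accumulation: scale each float into an integer range, InterlockedAdd the integers, then divide by the scale on the CPU. A rough sketch below, with made-up names, and assuming the value range is known so the 32-bit accumulator doesn't overflow; in practice you would still do a groupshared reduction first and only issue one atomic per group. Is that what people actually do?)

RWStructuredBuffer<int>  accum;   // 4 ints (one per float4 component), cleared to 0 before the pass
StructuredBuffer<float4> src;     // 512x424 elements
static const float SCALE = 65536.0f; // fixed-point scale; pick so the total sum * SCALE stays below 2^31

[numthreads(256, 1, 1)]
void CS_AtomicSum( uint3 dtid : SV_DispatchThreadID )
{
    if (dtid.x >= 512 * 424)
        return;
    float4 v = src[dtid.x];
    InterlockedAdd( accum[0], (int)round(v.x * SCALE) );
    InterlockedAdd( accum[1], (int)round(v.y * SCALE) );
    InterlockedAdd( accum[2], (int)round(v.z * SCALE) );
    InterlockedAdd( accum[3], (int)round(v.w * SCALE) );
}
// Dispatch(ceil(512*424 / 256.0), 1, 1); the CPU then reads back accum[i] / SCALE.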
Going back to the standard multi-pass approach: to sum 512x424 down to 1 value I need at least 2 passes, but with that configuration the second pass ends up summing over 1024*16 slots while only 14 of them are actually needed. So the ideal configuration would be thread_per_group * fetches_per_thread = sqrt(512*424) ~ 466, which means with fetches per thread set to 8 I should have about 64 threads per threadgroup... isn't that too small compared to the 1024 maximum? And should I not worry about 'summing 1024*16 elements with only 14 actually needed' at all?
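For concreteness, each pass of my reduction looks roughly like this (a simplified sketch of the standard groupshared pattern, not my exact shader; THREADS_PER_GROUP and FETCHES_PER_THREAD are exactly the knobs I'm unsure about):

#define THREADS_PER_GROUP  1024
#define FETCHES_PER_THREAD 16   // each group consumes THREADS_PER_GROUP * FETCHES_PER_THREAD elements

StructuredBuffer<float4>   input;    // pass 1: the 512x424 elements, pass 2: the partial sums
RWStructuredBuffer<float4> output;   // one float4 per threadgroup
cbuffer ReduceCB { uint elementCount; };

groupshared float4 sdata[THREADS_PER_GROUP];

[numthreads(THREADS_PER_GROUP, 1, 1)]
void CS_Reduce( uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID )
{
    // each thread first sums FETCHES_PER_THREAD strided elements sequentially
    uint base = gid.x * THREADS_PER_GROUP * FETCHES_PER_THREAD + gtid.x;
    float4 sum = 0;
    for (uint i = 0; i < FETCHES_PER_THREAD; ++i)
    {
        uint idx = base + i * THREADS_PER_GROUP;
        if (idx < elementCount)
            sum += input[idx];
    }
    sdata[gtid.x] = sum;
    GroupMemoryBarrierWithGroupSync();

    // then the usual log2(THREADS_PER_GROUP) tree reduction in groupshared memory
    for (uint s = THREADS_PER_GROUP / 2; s > 0; s >>= 1)
    {
        if (gtid.x < s)
            sdata[gtid.x] += sdata[gtid.x + s];
        GroupMemoryBarrierWithGroupSync();
    }

    if (gtid.x == 0)
        output[gid.x] = sdata[0];
}
// Pass 1: Dispatch(14, 1, 1) since ceil(512*424 / (1024*16)) = 14, giving 14 partial sums.
// Pass 2: same shader on the 14 partial sums with a single group (this is where most threads sit idle).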
The other question I have is whether I should use a compute shader at all. I could change my input format to Texture2D instead of StructuredBuffer, and use a fullscreen-quad pixel shader to do log2(n) passes, each one fetching with a linear sampler and multiplying by 4 (the linear sampler averages the 4 neighboring pixels, right?) to get the sum over 4 neighboring pixels (and I can bind 8 RTs in one pass, right?). I had such a pixel-shader reduction implemented way back in DX9, but I didn't have GPU_Profiler implemented back then, so I don't know its exact cost (and I hope to avoid spending time figuring that out), so if someone knows how CS vs. PS reduction performance compares, please let me know, thanks.
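(What I mean by the pixel shader version is just the usual 2x2 downsample chain, roughly the sketch below, assuming a bilinear clamp sampler and half-resolution render targets each pass; with MRT the 7 buffers would be sampled and written side by side.)

Texture2D<float4> srcTex;        // previous level
SamplerState      linearClamp;   // bilinear filtering, clamp addressing

float4 PS_Downsample2x2( float4 pos : SV_Position, float2 uv : TEXCOORD0 ) : SV_Target
{
    // with uv centered between a 2x2 block of source texels, one bilinear fetch
    // returns their average; multiply by 4 to turn the average back into a sum
    return 4.0f * srcTex.SampleLevel( linearClamp, uv, 0 );
}
// repeat log2(n) passes into half-sized RTs down to 1x1; the non-power-of-two
// size (512x424) needs some care at the edges so nothing is counted twice or dropped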
Question 3: with 512x424 elements it seems the CPU could handle the reduction pretty well in terms of performance (even though every element is a float4 and I have 7 such buffers). Has anyone compared GPU vs. CPU reduction? I haven't tried a CPU reduction here out of fear that copying those 7 StructuredBuffer<float4> from GPU to CPU may cost much more than the reduction itself (am I wrong?).
Question 4: as you may have noticed, every iteration has a CPU-GPU data dependency (the CPU waits for the reduction result to compute the matrix, and then the GPU waits for the matrix to do its work...), so any suggestions about that? The reason for the round trip is that I need to solve a small linear system Ax = b (A is 6x6), which seems GPU-solvable... Does anyone know how to solve a small linear system like that on the GPU?
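(The kind of thing I'm imagining is running the tiny solve in a single-thread compute shader so the matrix never has to come back to the CPU. A rough sketch of a 6x6 Cholesky solve below, with made-up buffer layouts, assuming A is symmetric positive definite as the ICP normal equations usually are. Is that a reasonable approach, or is there a better way?)

StructuredBuffer<float>   ABuf;   // 36 floats, A stored row-major (hypothetical layout)
StructuredBuffer<float>   bBuf;   // 6 floats
RWStructuredBuffer<float> xBuf;   // 6 floats, receives the solution x

[numthreads(1, 1, 1)]
void CS_Solve6x6()
{
    float A[6][6];
    float b[6];
    for (int r = 0; r < 6; ++r)
    {
        b[r] = bBuf[r];
        for (int c = 0; c < 6; ++c)
            A[r][c] = ABuf[r * 6 + c];
    }

    // Cholesky factorization A = L * L^T (only the lower triangle of L is written and read)
    float L[6][6];
    for (int j = 0; j < 6; ++j)
    {
        float d = A[j][j];
        for (int k = 0; k < j; ++k)
            d -= L[j][k] * L[j][k];
        L[j][j] = sqrt(d);
        for (int i = j + 1; i < 6; ++i)
        {
            float s = A[i][j];
            for (int k = 0; k < j; ++k)
                s -= L[i][k] * L[j][k];
            L[i][j] = s / L[j][j];
        }
    }

    // forward substitution: L * y = b
    float y[6];
    for (int i = 0; i < 6; ++i)
    {
        float s = b[i];
        for (int k = 0; k < i; ++k)
            s -= L[i][k] * y[k];
        y[i] = s / L[i][i];
    }

    // back substitution: L^T * x = y
    float x[6];
    for (int i = 5; i >= 0; --i)
    {
        float s = y[i];
        for (int k = i + 1; k < 6; ++k)
            s -= L[k][i] * x[k];
        x[i] = s / L[i][i];
    }

    for (int i = 0; i < 6; ++i)
        xBuf[i] = x[i];
}
// Dispatch(1, 1, 1); one thread is obviously wasteful, but the point is only to keep
// the dependency chain on the GPU instead of syncing back to the CPU every iteration.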
Any suggestions, comments, and advice are super welcome and appreciated.
Thanks in advance~