Hey Guys,
I need some advice and suggestions on doing a tricky GPU-CPU task efficiently; here comes the background first:
=====================Background (optional)===========================
In my recent project I am trying to align two point clouds. Think of a depth camera taking pictures of a target (each pixel is a real depth value, like a depth buffer holding actual depth rather than some 1/z) from two slightly different views, i.e. with two different mView matrices, camera pose1 and pose2. Reprojecting those two 'depth buffers' gives you two point clouds. Now the job is to find the matrix M that aligns the two point clouds (the matrix that transforms pose1 to pose2). There are algorithms to do this; in my case I use FastICP (fast iterative closest point). As the name suggests, it's an iterative method, so the routine looks like the following:
=============================Detail================================
Texture2D<float4> depth_and_normalmap1; // 512x424 pixels
Texture2D<float4> depth_and_normalmap2; // 512x424 pixels
StructuredBuffer<float4> workingBuf[7]; // 512x424 elements (float4) each
float reprojection_error = FLT_MAX;
int iterations = 0;
matrix m = IdentityMatrix; // 4x4 matrix
float4 result[7] = {};
do {
    m = CPU_ICPSolver( result );        // nothing to do with the GPU inside
    GPU_PrepareWorkingBuffer(
        depth_and_normalmap1,           // input as SRV
        depth_and_normalmap2,           // input as SRV
        m,                              // input as CBV
        workingBuf );                   // output as UAV (all 7 buffers)
    for (int i = 0; i < 7; ++i) {
        GPU_Reduction::Process( workingBuf[i] ); // reduction to 1 float4 value inside the GPU, but not copied to the ReadBack buffer yet
    }
    GPU_Reduction::Readback( result );  // read the reduction results: copy from default heap to readback heap, has to wait for the GPU inside
    reprojection_error = GetReprojectionError( result );
    ++iterations;
} while (iterations < 20 && reprojection_error > threshold);
Above is how the workflow looks. Right now I have tested and profiled the 1-iteration case on my GTX 680;
this part alone:
for (int i = 0; i < 7; ++i) {
    GPU_Reduction::Process( workingBuf[i] ); // reduction to 1 float4 value inside the GPU, but not copied to the ReadBack buffer yet
}
GPU_Reduction::Readback( result ); // read the reduction results: copy from default heap to readback heap, has to wait for the GPU inside
took 0.65 ms (does that seem reasonable, or is it incredibly slow? please let me know, thanks), so if I add GPU_PrepareWorkingBuffer and do 20 iterations I will probably end up with around 16 ms... which seems too much...
The reduction shader I wrote is a very standard one guided by this post, so not a naive one (but there are some tricky things, which I will cover later...).
============================Questions==============================
So for a standard GPU reduction, as this post describes, the result dimension is equal to the number of threadgroups (for example, Dispatch(threadgroupX, threadgroupY, 1) means your reduction result will be of size (threadgroupX, threadgroupY)).
In my case I need to reduce each buffer to only one value, so ideally I should dispatch only one threadgroup, but one threadgroup can have at most 1024 threads, so even if each thread does 16 fetches, one threadgroup can only handle 1024*16 = 16384 elements, which is not enough for my 512x424 = 217088.
There are other ways to reduce to 1 element even with multiple threadgroups, e.g. by using atomic adds, however InterlockedAdd only works on 32-bit integers, so I can't use that method directly (mine is a sum over float4) (or is there some efficient workaround? thanks).
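(The only workaround I can think of is fixed-point accumulation: scale each float into an integer range, InterlockedAdd the integers, then divide by the scale on the CPU. A rough sketch below, with made-up names, and assuming the value range is known so the 32-bit accumulator doesn't overflow; in practice you would still do a groupshared reduction first and only issue one atomic per group. Is that what people actually do?)

RWStructuredBuffer<int>  accum;   // 4 ints (one per float4 component), cleared to 0 before the pass
StructuredBuffer<float4> src;     // 512x424 elements
static const float SCALE = 65536.0f; // fixed-point scale; pick so the total sum * SCALE stays below 2^31

[numthreads(256, 1, 1)]
void CS_AtomicSum( uint3 dtid : SV_DispatchThreadID )
{
    if (dtid.x >= 512 * 424)
        return;
    float4 v = src[dtid.x];
    InterlockedAdd( accum[0], (int)round(v.x * SCALE) );
    InterlockedAdd( accum[1], (int)round(v.y * SCALE) );
    InterlockedAdd( accum[2], (int)round(v.z * SCALE) );
    InterlockedAdd( accum[3], (int)round(v.w * SCALE) );
}
// Dispatch(ceil(512*424 / 256.0), 1, 1); the CPU then reads back accum[i] / SCALE.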
Going back to the standard multi-pass approach: to sum 512x424 down to 1 value I need at least 2 passes, but with that configuration the second pass ends up summing over 1024*16 slots while only 14 of them are actually needed. So the ideal configuration would be thread_per_group * fetches_per_thread = sqrt(512*424) ~ 466, which means with fetches per thread set to 8 I should have about 64 threads per threadgroup... isn't that too small compared to the 1024 maximum? And should I not worry about 'summing 1024*16 elements with only 14 actually needed' at all?
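For concreteness, each pass of my reduction looks roughly like this (a simplified sketch of the standard groupshared pattern, not my exact shader; THREADS_PER_GROUP and FETCHES_PER_THREAD are exactly the knobs I'm unsure about):

#define THREADS_PER_GROUP  1024
#define FETCHES_PER_THREAD 16   // each group consumes THREADS_PER_GROUP * FETCHES_PER_THREAD elements

StructuredBuffer<float4>   input;    // pass 1: the 512x424 elements, pass 2: the partial sums
RWStructuredBuffer<float4> output;   // one float4 per threadgroup
cbuffer ReduceCB { uint elementCount; };

groupshared float4 sdata[THREADS_PER_GROUP];

[numthreads(THREADS_PER_GROUP, 1, 1)]
void CS_Reduce( uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID )
{
    // each thread first sums FETCHES_PER_THREAD strided elements sequentially
    uint base = gid.x * THREADS_PER_GROUP * FETCHES_PER_THREAD + gtid.x;
    float4 sum = 0;
    for (uint i = 0; i < FETCHES_PER_THREAD; ++i)
    {
        uint idx = base + i * THREADS_PER_GROUP;
        if (idx < elementCount)
            sum += input[idx];
    }
    sdata[gtid.x] = sum;
    GroupMemoryBarrierWithGroupSync();

    // then the usual log2(THREADS_PER_GROUP) tree reduction in groupshared memory
    for (uint s = THREADS_PER_GROUP / 2; s > 0; s >>= 1)
    {
        if (gtid.x < s)
            sdata[gtid.x] += sdata[gtid.x + s];
        GroupMemoryBarrierWithGroupSync();
    }

    if (gtid.x == 0)
        output[gid.x] = sdata[0];
}
// Pass 1: Dispatch(14, 1, 1) since ceil(512*424 / (1024*16)) = 14, giving 14 partial sums.
// Pass 2: same shader on the 14 partial sums with a single group (this is where most threads sit idle).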
The other question I have is whether I should use a compute shader at all. I could change my input format to Texture2D instead of StructuredBuffer, and use a fullscreen-quad pixel shader to do log2(n) passes, each one fetching with a linear sampler and multiplying by 4 (the linear sampler averages the 4 neighboring pixels, right?) to get the sum over 4 neighboring pixels (and I can bind 8 RTs in one pass, right?). I had such a pixel-shader reduction implemented way back in DX9, but I didn't have GPU_Profiler implemented back then, so I don't know its exact cost (and I hope to avoid spending time figuring that out), so if someone knows how CS vs. PS reduction performance compares, please let me know, thanks.
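(What I mean by the pixel shader version is just the usual 2x2 downsample chain, roughly the sketch below, assuming a bilinear clamp sampler and half-resolution render targets each pass; with MRT the 7 buffers would be sampled and written side by side.)

Texture2D<float4> srcTex;        // previous level
SamplerState      linearClamp;   // bilinear filtering, clamp addressing

float4 PS_Downsample2x2( float4 pos : SV_Position, float2 uv : TEXCOORD0 ) : SV_Target
{
    // with uv centered between a 2x2 block of source texels, one bilinear fetch
    // returns their average; multiply by 4 to turn the average back into a sum
    return 4.0f * srcTex.SampleLevel( linearClamp, uv, 0 );
}
// repeat log2(n) passes into half-sized RTs down to 1x1; the non-power-of-two
// size (512x424) needs some care at the edges so nothing is counted twice or dropped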
Question 3: with 512x424 elements it seems the CPU could handle the reduction pretty well in terms of performance (even though every element is a float4 and I have 7 such buffers). Has anyone compared GPU vs. CPU reduction? I haven't tried a CPU reduction here out of fear that copying those 7 StructuredBuffer<float4> from GPU to CPU may cost much more than the reduction itself (am I wrong?).
Question 4: as you may have noticed, every iteration has a CPU-GPU data dependency (the CPU waits for the reduction result to compute the matrix, and then the GPU waits for the matrix to do its work...), so any suggestions about that? The reason for the round trip is that I need to solve a small linear system Ax = b (A is 6x6), which seems GPU-solvable... Does anyone know how to solve a small linear system like that on the GPU?
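(The kind of thing I'm imagining is running the tiny solve in a single-thread compute shader so the matrix never has to come back to the CPU. A rough sketch of a 6x6 Cholesky solve below, with made-up buffer layouts, assuming A is symmetric positive definite as the ICP normal equations usually are. Is that a reasonable approach, or is there a better way?)

StructuredBuffer<float>   ABuf;   // 36 floats, A stored row-major (hypothetical layout)
StructuredBuffer<float>   bBuf;   // 6 floats
RWStructuredBuffer<float> xBuf;   // 6 floats, receives the solution x

[numthreads(1, 1, 1)]
void CS_Solve6x6()
{
    float A[6][6];
    float b[6];
    for (int r = 0; r < 6; ++r)
    {
        b[r] = bBuf[r];
        for (int c = 0; c < 6; ++c)
            A[r][c] = ABuf[r * 6 + c];
    }

    // Cholesky factorization A = L * L^T (only the lower triangle of L is written and read)
    float L[6][6];
    for (int j = 0; j < 6; ++j)
    {
        float d = A[j][j];
        for (int k = 0; k < j; ++k)
            d -= L[j][k] * L[j][k];
        L[j][j] = sqrt(d);
        for (int i = j + 1; i < 6; ++i)
        {
            float s = A[i][j];
            for (int k = 0; k < j; ++k)
                s -= L[i][k] * L[j][k];
            L[i][j] = s / L[j][j];
        }
    }

    // forward substitution: L * y = b
    float y[6];
    for (int i = 0; i < 6; ++i)
    {
        float s = b[i];
        for (int k = 0; k < i; ++k)
            s -= L[i][k] * y[k];
        y[i] = s / L[i][i];
    }

    // back substitution: L^T * x = y
    float x[6];
    for (int i = 5; i >= 0; --i)
    {
        float s = y[i];
        for (int k = i + 1; k < 6; ++k)
            s -= L[k][i] * x[k];
        x[i] = s / L[i][i];
    }

    for (int i = 0; i < 6; ++i)
        xBuf[i] = x[i];
}
// Dispatch(1, 1, 1); one thread is obviously wasteful, but the point is only to keep
// the dependency chain on the GPU instead of syncing back to the CPU every iteration.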
Any suggestions, comments, and advice are super welcome and appreciated.
Thanks in advance~