-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove broken and unused code path from L2 normalization #166
Conversation
bd4d957
to
94a2c64
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a minor comment
float norm; | ||
|
||
if( threadIdx.x == 0 ) { | ||
norm = normf( 128, src_desc ); | ||
} | ||
__syncthreads(); | ||
norm = popsift::shuffle( norm, 0 ); | ||
|
||
descr.x = min( descr.x, 0.2f*norm ); | ||
descr.y = min( descr.y, 0.2f*norm ); | ||
descr.z = min( descr.z, 0.2f*norm ); | ||
descr.w = min( descr.w, 0.2f*norm ); | ||
|
||
// 32 threads compute 4 squares each, then shuffle to performing a addition by | ||
// reduction for the sum of 128 squares, result in thread 0 | ||
norm = descr.x * descr.x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
float norm = descr.x * descr.x ...
and remove the previous declaration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that I can do it because the first assignment to norm is inside the if (line 61). Thread 0 is initialized inside the if (lines 60-62), and the shuffle in line 64 initializes the other 31 threads.
The thing that actually happens underneath the C-like syntax is that the result of normf() is stored in the lowest 32 bits of a 1024-bit SIMD register (line 61). Shuffle is a single SIMD instruction that copies the lowest 32 bits of the SIMD register into every other set of 32 bits of the same register.
Technically, I could write
float norm = ( threadIdx.x == 0 ) ? normf( 128, src_desc ) : 0;
That looks nicer, but if would actually waste time.
Description
Remove an optional and apparently unused code path that relies on normf() to perform L2 normalization of the descriptors.
First, that code is buggy. Second, that code would be slower than the more "old-fashioned" code.
Features list
Implementation remarks
There was a bug in src/popsift/s_desc_norm_l2.h in an ifdef'd code path where the L2 norm incorrectly is computed.
It was actually a double bug. First, normf(array) is computed, which is sqrt(sum(array[n]²)) instead of rnormf(array), which would be 1/sqrt(sum(array[n]²)). After that, the min value is multiplied with it instead of the descriptor value. Really stupid bug that would be easily fixed.
However, nobody ever reported the bug because the code wasn't active by default. normf() is only used in PopSift if it is manually selected in CMake, and if the CC is >=7.5.
After learning more and more about CUDA, we know also that the code path would have removed parallelism. It was written under the belief the normf() uses several CUDA warps in the background, forming the equivalent of the reduce() operation that is performed with shuffle_down in the main code path. However, this is not the case.