SimHash returning 32-bit results, not 64-bit #21

tfmorris · 2016-03-15T22:53:23Z

Although the code and paper suggest that 64-bit hashes are being used, Java's Object.hashCode() method returns only 32 bits. The good news is that the bug in #19 has no effect, since the upper 16 bits are always 0 (or perhaps all 1s, depending on sign-extension effects).
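To make the sign-extension point concrete, here is a minimal sketch of what happens when a 32-bit `hashCode()` value is widened to a `long`. The class and method names (`HashWidthDemo`, `widen`, `upper32`) are made up for illustration and are not taken from the project's code; the sketch assumes the hash is widened by a plain `int`-to-`long` conversion.

```java
// Hypothetical demo class, not part of the project under discussion.
final class HashWidthDemo {
    // Implicit int -> long widening, which sign-extends the value.
    static long widen(int hash) {
        return hash;
    }

    // The upper 32 bits of the widened value, as an unsigned quantity.
    static long upper32(int hash) {
        return widen(hash) >>> 32;
    }

    public static void main(String[] args) {
        int h = "some document text".hashCode(); // hashCode() always returns an int
        // The upper half carries no information: it is all 0s for a
        // non-negative hash and all 1s for a negative one.
        System.out.println(Long.toHexString(upper32(h)));
        System.out.println(Long.toHexString(upper32(5)));   // non-negative input
        System.out.println(Long.toHexString(upper32(-5)));  // negative input
    }
}
```

Under that assumption, the upper half of the 64-bit value only ever takes two distinct values, so it cannot help separate documents.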

The bad news is that because bits 32-47 are all zero (or perhaps evenly split between all zeros and all ones), I suspect all (or at least half) of the documents will end up clustered together, making for a very expensive O(n^2) comparison.

You can probably ignore PR #20 for now. It'll get subsumed into the larger rework necessary.

tfmorris · 2016-03-15T23:02:42Z

Oops, ignore the part about word 2 being all zeros/ones. It will actually be the same as word 0, because the 32-bit hash code gets shifted through twice to test "all 64" bits, so the upper 32 bits are duplicates of the lower 32 bits.
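The duplication described above follows from how Java shifts an `int`: only the low 5 bits of the shift distance are used, so shifting by `i` and by `i + 32` selects the same bit. The sketch below is a hypothetical loop, not the project's actual SimHash code (a real SimHash would also accumulate per-bit votes across features); `ShiftWrapDemo` and `fingerprintFromIntHash` are assumed names used only to illustrate the effect.

```java
// Hypothetical demo class, not part of the project under discussion.
final class ShiftWrapDemo {
    // Builds a 64-bit value by testing "all 64" bit positions of a 32-bit hash.
    static long fingerprintFromIntHash(int hash) {
        long fingerprint = 0L;
        for (int i = 0; i < 64; i++) {
            // For an int operand, Java masks the shift distance to 5 bits,
            // so (hash >> i) and (hash >> (i + 32)) read the same bit.
            if (((hash >> i) & 1) == 1) {
                fingerprint |= 1L << i;
            }
        }
        return fingerprint;
    }

    public static void main(String[] args) {
        long fp = fingerprintFromIntHash("some document text".hashCode());
        int lower = (int) fp;          // bits 0-31
        int upper = (int) (fp >>> 32); // bits 32-63
        System.out.println(Long.toHexString(fp));
        System.out.println(lower == upper); // the two halves are identical
    }
}
```

If the real code follows this pattern, using a genuinely 64-bit hash function (or declaring the hash as a `long` before shifting) would be needed for the upper half to carry independent bits.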
