SimHash returning 32-bit results, not 64-bits #21

Open
tfmorris opened this issue Mar 15, 2016 · 1 comment

Comments

@tfmorris
Contributor

Although the code and paper suggest that 64-bit hashes are being used, Java's Object.hashCode() method only returns a 32-bit int. The good news is that the bug in #19 has no effect, since the upper 16 bits are always 0 (or perhaps all 1s, depending on sign-extension effects).
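To make the sign-extension point concrete, here is a minimal sketch of what happens when a 32-bit hashCode() result is stored in a long. The class and method names (`HashWidth`, `widen`, `widenUnsigned`) are made up for illustration and are not taken from this project's code.

```java
// Hypothetical illustration, not code from this repository.
class HashWidth {
    // Implicit int -> long widening sign-extends: the upper 32 bits
    // become copies of the int's sign bit (all 0s or all 1s).
    static long widen(int h) {
        return h;
    }

    // Masking keeps only the low 32 bits, so the upper 32 bits are always 0.
    static long widenUnsigned(int h) {
        return h & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        int h = "example shingle".hashCode(); // hashCode() is declared to return int
        System.out.println(Long.toHexString(widen(h)));
        System.out.println(Long.toHexString(widenUnsigned(h)));
    }
}
```

Either way, the upper half of the long carries no information beyond the sign bit of the original 32-bit value.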

The bad news is that because bits 32-47 are either all zero (or perhaps evenly divided between all zero and all one), I suspect all (or at least half) of the documents will end up clustered together, making for a very expensive O(n^2) comparison.

You can probably ignore PR #20 for now. It'll get subsumed into the larger rework necessary.

@tfmorris
Contributor Author

Oops, ignore the part about word 2 being all zero/one. It'll actually be the same as word 0 because the 32-bit hashcode gets shifted through twice to test "all 64" bits, so the upper 32 bits will be duplicates of the lower 32 bits.
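A small sketch of the duplication effect, assuming the bit test is written as a shift of the int by a loop index running from 0 to 63 (I have not reproduced the project's actual loop here; `ShiftWrap` and `bitsSeen` are hypothetical names). For an int left operand, Java uses only the low 5 bits of the shift count, so shifting by 32..63 is the same as shifting by 0..31.

```java
// Hypothetical illustration, not code from this repository.
class ShiftWrap {
    // Records which of the "64" tested bit positions appear set
    // when the value being tested is really a 32-bit int.
    static long bitsSeen(int h) {
        long out = 0L;
        for (int i = 0; i < 64; i++) {
            // h >>> i is evaluated as h >>> (i & 31) because h is an int,
            // so positions 32..63 re-read bits 0..31.
            if (((h >>> i) & 1) == 1) {
                out |= 1L << i; // 1L makes this a genuine 64-bit shift
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(bitsSeen(0x12345678))); // prints 1234567812345678
    }
}
```

Under that assumption, the upper 32 bits of the resulting fingerprint add no information, so only 32 bits are effectively available for distinguishing documents.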
