Skip to content

Tiny example project to test bitwise vector cosine similarity

License

Notifications You must be signed in to change notification settings

bigattichouse/bitvector_research

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TL;DR: Clone the repo. run make .. it'll build and run both tests and you'll see the timing comparison


My write-up if you'd like to read more: https://bigattichouse.medium.com/dreamcoat-cosine-similarity-using-bit-values-part-1-3c64cc4b9caa


Years ago I created VSDB, a vectorspace database for document similarity search and clustering. I'm slowly working on reorganizing it into a RAG tool for LLMs, which includes rethinking some things about it operates, and deciding if I want to continue to work on it in C, or switch to a newer language. The beast was built organinically as I learned about term vectorspaces, so it's a bit of a hot mess for anyone. I mostly use it for research.

VSBD allows vectors in the queries to have any number of dimensions, which are represented by a hash id. This allows the hashes to be meaningful (ex: MD5("House")=1 ) or even represent object properties ( ex: MD5("BOOK_METADATA_PAGES")=235 ). I use indicies to speed up comparisons, and it's generally very fast.

Recently I read about LLMs using 1.58bit quantization (-1,0,1), and a small mention of 0.68bit quantization (which, using the log ruil I believe represents 0,1), so I decided to pull out the relevant parts of VSDB into two programs, write something up and see if it works.

It does work, but has some caveats.

Here's the basis of where I landed. Just take the signs for given dimensions and convert to 1s and zeros. You'll notice that this is pretty much XNOR operations.

- * - = +   00 > 1
+ * - = -   10 > 0
- * + = -   01 > 0
+ * + = +   11 > 1

Here's a fairly simple representation of how we build similarity:

01011111
01101100
++--++-- = 50% similarity

+ = match, - = opposite

so XNOR, count the 1's, divide by number of places. WAY faster than the cosine.

Let's call it '"bit similarity"

I tried various algos

Population count. [https://en.wikipedia.org/wiki/Hamming_weight]

Magnitudes are as simple as counting the number of 1s in the integer.. which is a surprisingly difficult thing to do. I found a rather elegant method shown below.

 int BitCount(unsigned int u)
    {
         unsigned int uCount;
         uCount = u - ((u >> 1) & 033333333333) - ((u >> 2) & 011111111111);
         return ((uCount + (uCount >> 3)) & 030707070707) % 63;
    }

From: [https://web.archive.org/web/20151229003112/http://blogs.msdn.com/b/jeuge/archive/2005/06/08/hakmem-bit-count.aspx] [Explained: https://tekpool.wordpress.com/category/bit-count/]

Expected Output:

Testing Float Vectors:

MyDogHasFleas*MyCatHasFleas cosine=0.750000
MyDogHasFleas*MyNotDogHasNotFleas cosine=0.000000
MyDogHasFleas*MyHamsterHasHiccups cosine=0.500000
Computation of 30000000 cosines took about 1.50692 seconds


./bitvector

Testing Bit Vectors:

MyDogHasFleas*MyCatHasFleas cosine=0.750000
MyDogHasFleas*MyHamsterHasHiccups cosine=0.500000
We lose the ability to have negative dimensions.. but that's ok..
'MyDogHasFleas*MyNotDogHasNotFleas cosine=0.750000
Computation of 30000000 cosines took about 0.58750 seconds


Future work: I've got some ideas for a CNN based on this idea and kinda merging with Conway's Game of Life. Reddit u/StepFunction pointed me to https://arxiv.org/abs/1603.05279 which is very relevant.

Variably sized parameters If you think about it, you could just expand the bits for items you feel are more important. sorta variable bit length. Maybe some important items could expand to 4 bits in the stack, instead of only one. I'll have to mull that over more. Obvsiously would need to plan that from the start, as changing sizes would be a problem.

About

Tiny example project to test bitwise vector cosine similarity

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published