-
@MXueguang and I have been discussing this on Slack... Why don't we just use the NPY format? https://numpy.org/devdocs/reference/generated/numpy.lib.format.html E.g., here's a start: https://github.com/dreamolight/JavaNpy?tab=readme-ov-file
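To give a feel for what a non-NumPy reader (e.g., on the Java side) would have to parse, here is a minimal stdlib-only Python sketch of the version 1.0 NPY layout: a magic string, two version bytes, a little-endian uint16 header length, a padded Python-dict header, then the raw array data. It assumes a 2-D little-endian float32 array in C order; `write_npy` and `read_npy` are hypothetical helper names, not part of NumPy or JavaNpy.

```python
import ast
import io
import struct

MAGIC = b"\x93NUMPY"  # fixed 6-byte magic string at the start of every NPY file


def write_npy(out, rows):
    """Write equal-length rows of floats as a version 1.0 NPY file (float32, C order)."""
    dim = len(rows[0])
    header = "{'descr': '<f4', 'fortran_order': False, 'shape': (%d, %d), }" % (len(rows), dim)
    # Pad with spaces and terminate with a newline so the data starts on a 64-byte boundary.
    prefix_len = len(MAGIC) + 2 + 2  # magic + version bytes + uint16 header length
    pad = -(prefix_len + len(header) + 1) % 64
    header_bytes = (header + " " * pad + "\n").encode("latin1")
    out.write(MAGIC + bytes([1, 0]) + struct.pack("<H", len(header_bytes)) + header_bytes)
    for row in rows:
        out.write(struct.pack("<%df" % dim, *row))


def read_npy(inp):
    """Read back a 2-D float32 NPY file written in the layout above."""
    if inp.read(6) != MAGIC:
        raise ValueError("not an NPY file")
    major, _minor = inp.read(2)
    if major != 1:
        raise ValueError("this sketch only handles version 1.x headers")
    (header_len,) = struct.unpack("<H", inp.read(2))
    meta = ast.literal_eval(inp.read(header_len).decode("latin1").strip())
    if meta["descr"] != "<f4" or meta["fortran_order"]:
        raise ValueError("this sketch only handles C-order little-endian float32")
    n, dim = meta["shape"]
    return [list(struct.unpack("<%df" % dim, inp.read(4 * dim))) for _ in range(n)]


buf = io.BytesIO()
write_npy(buf, [[0.5, 1.0], [-2.0, 0.25]])
vectors = read_npy(io.BytesIO(buf.getvalue()))
```

A Java reader would go through the same three steps: check the magic, parse the header dict, and read little-endian floats.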
-
Another one to consider: safetensors. It's quite close to what Jimmy proposed, and I'd say it's already a de facto standard because of its adoption by Hugging Face.
-
@arjenpdevries good call! Since we want Java/Python compatibility, we'll have to look into the feasibility. Based on a quick skim: https://github.com/huggingface/safetensors
As a bonus, we can stuff the docids into the header.
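A rough stdlib-only sketch of how docids could ride along in the header, written from the layout described in the safetensors README (an 8-byte little-endian header length, a JSON header, then a flat byte buffer) and not checked against the official library. The tensor name `vectors`, the metadata key `docids`, and both function names are assumptions made for illustration. Since values under `__metadata__` are plain strings, the docid list is JSON-encoded into one string here.

```python
import io
import json
import struct


def write_vectors(out, docids, vectors):
    """Write docids into the JSON header and float32 vectors into the byte buffer."""
    dim = len(vectors[0])
    payload = b"".join(struct.pack("<%df" % dim, *v) for v in vectors)
    header = {
        # Free-form string-to-string metadata; the key "docids" is hypothetical.
        "__metadata__": {"docids": json.dumps(docids)},
        # One tensor entry; offsets are relative to the start of the byte buffer.
        "vectors": {
            "dtype": "F32",
            "shape": [len(vectors), dim],
            "data_offsets": [0, len(payload)],
        },
    }
    header_bytes = json.dumps(header).encode("utf-8")
    out.write(struct.pack("<Q", len(header_bytes)) + header_bytes + payload)


def read_vectors(inp):
    """Read the header, recover the docids, then slice the vectors out of the buffer."""
    (header_len,) = struct.unpack("<Q", inp.read(8))
    header = json.loads(inp.read(header_len).decode("utf-8"))
    docids = json.loads(header["__metadata__"]["docids"])
    info = header["vectors"]
    rows, dim = info["shape"]
    begin, end = info["data_offsets"]
    data = inp.read()[begin:end]
    vectors = [
        list(struct.unpack_from("<%df" % dim, data, 4 * dim * i)) for i in range(rows)
    ]
    return docids, vectors


buf = io.BytesIO()
write_vectors(buf, ["doc1", "doc2"], [[0.5, 1.0, -2.0], [0.25, 0.0, 4.0]])
docids, vectors = read_vectors(io.BytesIO(buf.getvalue()))
```

Because the header is plain JSON, a Java reader only needs a JSON parser and a little-endian byte buffer, which is the compatibility question raised above.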
-
This is defunct as we've moved to Parquet: #2582
-
As we explore indexing dense vectors in Lucene, we'll need an efficient exchange format for storing the vectors. I'll call this the advf format, for "Anserini Dense Vector Format". The general idea is that we'll extract vectors out of Faiss and write as advf, and Anserini will index this format.
Here's my initial proposal. advf will be a binary file (possibly compressed) comprising the following:
And then, repeated for every document:
Note that the format is designed without any explicit delimiters. Also, I have not included any magic SYNC tokens or encoded any redundant metadata for consistency checks. (Although both might be good ideas...)
So, the reader loop will be something like this:
Then, repeat until EOF:
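As a rough illustration of a delimiter-free reader loop of this kind, here is a minimal Python sketch. The layout it uses is purely an assumption and not the proposed field list: a big-endian int32 dimension as the header, then for every document a uint16 length, that many bytes of UTF-8 docid, and `dim` big-endian float32 values. `write_advf` and `read_advf` are hypothetical names.

```python
import io
import struct


def write_advf(out, dim, docs):
    """Write the ASSUMED layout: int32 dim, then (uint16 len, docid bytes, dim floats) per doc."""
    out.write(struct.pack(">i", dim))
    for docid, vec in docs:
        raw = docid.encode("utf-8")
        out.write(struct.pack(">H", len(raw)) + raw)
        out.write(struct.pack(">%df" % dim, *vec))


def read_advf(inp):
    """Read the header once, then yield (docid, vector) pairs until EOF."""
    (dim,) = struct.unpack(">i", inp.read(4))
    while True:
        prefix = inp.read(2)
        if not prefix:
            return  # clean EOF on a record boundary
        if len(prefix) < 2:
            raise EOFError("truncated docid length")
        (n,) = struct.unpack(">H", prefix)
        raw_id = inp.read(n)
        if len(raw_id) != n:
            raise EOFError("truncated docid")
        raw_vec = inp.read(4 * dim)
        if len(raw_vec) != 4 * dim:
            raise EOFError("truncated vector")
        yield raw_id.decode("utf-8"), list(struct.unpack(">%df" % dim, raw_vec))


buf = io.BytesIO()
write_advf(buf, 2, [("doc1", [0.5, 1.0]), ("doc2", [-2.0, 0.25])])
records = list(read_advf(io.BytesIO(buf.getvalue())))
```

With no delimiters or sync tokens, the only consistency check available to the reader is running out of bytes mid-record, which is why the sketch raises on a short read instead of silently stopping.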
Thoughts, comments?