-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Auto Hashing ID for VectorDB Classes #4746
Conversation
@microsoft-github-policy-service agree |
@microsoft-github-policy-service agree |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## 0.2 #4746 +/- ##
==========================================
- Coverage 29.30% 28.86% -0.44%
==========================================
Files 117 117
Lines 13013 13049 +36
Branches 2469 2474 +5
==========================================
- Hits 3813 3767 -46
- Misses 8854 8935 +81
- Partials 346 347 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add or update the tests? https://github.com/microsoft/autogen/blob/f9295c4c39faf8bd3c1cef13f7ff235b2e081239/test/agentchat/contrib/vectordb/test_mongodb.py
Tests have been ammended. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for the PR, @mattbeardey ! I've left some comments. However, my major concern is that with this PR, MongoDBAtlasVectorDB is no more compatible with other vectordbs in autogen if a user created db w/o giving document ids.
I'd prefer set default id for Document
with something like
autogen/autogen/agentchat/contrib/retrieve_user_proxy_agent.py
Lines 375 to 379 in adbd46e
chunk_ids = ( | |
[hashlib.blake2b(chunk.encode("utf-8")).hexdigest()[:HASH_LENGTH] for chunk in chunks] | |
if not self._vector_db.type == "qdrant" | |
else [str(uuid.UUID(hex=hashlib.md5(chunk.encode("utf-8")).hexdigest())) for chunk in chunks] | |
) |
if users just don't want to set document ids manually. Is this your case, or you need to use mongodb auto generated ids for some reason?
WDYT? Thanks.
This commit looks good to me. Thank you, @mattbeardey . |
Scaled back the PR to focus on ID only, will look to embeddings as a separate topic, as I believe there may be a need for definition on what a "Embedding Function" is, and what it returns (i.e. errors with the "to_list" method not guaranteed to be present. Will handle separately. Please find latest pull. Thanks. Please review when possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @mattbeardey , focusing on ID is a good idea!
I've just one suggestion, others look good to me!
@@ -123,7 +123,11 @@ def _wait_for_document(self, collection: Collection, index_name: str, doc: Docum | |||
if query_result and query_result[0][0]["_id"] == doc["id"]: | |||
return | |||
sleep(_DELAY) | |||
|
|||
if query_result and float(query_result[0][1]) == 1.0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to have documents with exactly the same content but different metadata. Like comments from different users. I'd suggest log a warning message instead of raising an error. And it's not a TimeoutError if it's an error. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusted now with a logger.warning, and check for metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you @mattbeardey !
@ekzhu , do we still support openai tests for 0.2 in the CI pipeline? |
It's still supported but the branch must be on the Microsoft repo. We need to create a branch, create a new PR from there, and have the author submit PR to that branch. Then we can run it. |
Why are these changes needed?
These changes allow for users to use the MongoDBAtlas class with more modifiability. This includes enabling users to pass in embeddings, without need to create in the class (note this functionality can be extended to other VectorDB), and make use of MongoDB auto ID assignment, removing the need to insert ID at upload.
Related issue number
Closes #4728
Checks