feat: addition of Gliner and Keybert link extraction components #5416

Open: wants to merge 2 commits into main

pedrocassalpacheco
Contributor

This pull request introduces two new link extraction components to facilitate the creation of content graphs from text.

Components

  1. GLiNER Link Extraction

GLiNER link extraction takes unstructured text, optionally splits it into paragraphs, and uses a GLiNER model to recognize entities, which it turns into links consumable by the graph vector store.

  2. KeyBERT Link Extraction

KeyBERT link extraction takes unstructured text, optionally splits it into paragraphs, and uses KeyBERT to extract keywords, which it turns into links consumable by the graph vector store.
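Both components follow the same shape: optionally split the text into paragraphs, run an extractor over each chunk, and wrap the results as links. A minimal sketch of that flow follows; it is not the code in this PR. The `Link` dataclass, `split_paragraphs`, `extract_links`, and `capitalized_words` are all hypothetical names, and the capitalized-word extractor is only a toy stand-in for a GLiNER or KeyBERT call so that the example runs without downloading a model.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Link:
    """Hypothetical stand-in for the graph vector store's link type."""

    kind: str  # category of the link, e.g. "entity" or "kw"
    tag: str   # the extracted entity or keyword text


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


# An extractor maps one chunk of text to (kind, tag) pairs.
Extractor = Callable[[str], Iterable[tuple[str, str]]]


def extract_links(text: str, extractor: Extractor, split: bool = True) -> list[set[Link]]:
    """Return one set of links per chunk (per paragraph when split=True)."""
    chunks = split_paragraphs(text) if split else [text]
    return [{Link(kind=k, tag=t) for k, t in extractor(chunk)} for chunk in chunks]


def capitalized_words(chunk: str) -> list[tuple[str, str]]:
    """Toy extractor (assumption): treats capitalized words as entities.

    A real component would call a GLiNER model or KeyBERT here instead.
    """
    return [("entity", w) for w in re.findall(r"\b[A-Z][a-z]+\b", chunk)]


text = "Alice met Bob.\n\nThey visited Paris."
links = extract_links(text, capitalized_words)
unsplit = extract_links(text, capitalized_words, split=False)
```

Keeping the extractor as a plain callable means the paragraph splitting and link construction stay the same whichever model produces the tags; only the callable changes between the entity-based and keyword-based variants.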

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Dec 24, 2024
@cbornet
Collaborator

cbornet commented Dec 24, 2024

This was already discussed in #3866 and #3867.
Also, it seems these extractors should extend LCDocumentTransformerComponent.

@pedrocassalpacheco
Contributor Author

pedrocassalpacheco commented Dec 24, 2024

@cbornet - by "already discussed," do you mean it has already been done? If so, why wasn't it merged? Not all extractors were built using the same pattern, and using LCDocumentTransformerComponent as a base class requires modifications to the extractor classes, which is currently out of scope. Langchain won't allow further commits to the graph vector, so a simpler approach of using Component as a base class seemed like a good solution. I am happy to close this out if it is redundant. Cheers ...

PS: @cbornet - I see the objection raised by @ogabrielluiz. If this is truly a problem, we should wait for the link strategy to be redesigned.

Labels
enhancement New feature or request size:L This PR changes 100-499 lines, ignoring generated files.

2 participants