Change repo cloning, file splitting and general refine, second attempt #3
Hello @gcapuzzi,
this is a second attempt after PR #2
I changed how the repo is cloned: it no longer uses the GitHub API and no longer needs a GitHub PAT. It just runs a plain git clone as a bash command in a subprocess.
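For reference, here is a minimal sketch of the cloning step, not the exact code from the notebook. The function names, the shallow-clone depth, and the reuse of an existing folder are my assumptions for illustration. It passes the command as an argument list instead of a shell string, which avoids quoting problems with unusual URLs or paths.

```python
import subprocess
from pathlib import Path


def build_clone_command(repo_url: str, dest: Path, depth: int = 1) -> list[str]:
    """Build the git command as an argument list (hypothetical helper)."""
    # --depth 1 fetches only the latest commit; this is an assumption,
    # a full clone works the same way without the two depth arguments.
    return ["git", "clone", "--depth", str(depth), repo_url, str(dest)]


def clone_repo(repo_url: str, dest: str, depth: int = 1) -> Path:
    """Clone repo_url into dest, skipping the clone if dest is already populated."""
    dest_path = Path(dest)
    if dest_path.is_dir() and any(dest_path.iterdir()):
        # Assumption: a non-empty folder means the repo was cloned earlier.
        return dest_path
    # check=True raises CalledProcessError if git exits with a non-zero status.
    subprocess.run(
        build_clone_command(repo_url, dest_path, depth),
        check=True,
        capture_output=True,
        text=True,
    )
    return dest_path
```

A public repo needs no token at all with this approach; a private one would rely on whatever credentials git is already configured with on the machine.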
The files are filtered by a list of allowed extensions called ext_whitelist, so you can pick up md, mdx, and other file types in a single pass.
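The filtering idea can be sketched roughly like this. Only the name ext_whitelist comes from the PR; the helper names, the leading dot in each entry, the case-insensitive match, and skipping the .git folder are assumptions I made for the example.

```python
from pathlib import Path

# Assumed format: extensions with a leading dot, compared case-insensitively.
ext_whitelist = [".md", ".mdx"]


def is_allowed(path, whitelist=ext_whitelist) -> bool:
    """Return True if the file extension is in the whitelist (hypothetical helper)."""
    allowed = {ext.lower() for ext in whitelist}
    return Path(path).suffix.lower() in allowed


def list_allowed_files(root, whitelist=ext_whitelist) -> list[Path]:
    """Walk the cloned repo and keep only files with a whitelisted extension."""
    root = Path(root)
    return sorted(
        p
        for p in root.rglob("*")
        # Skip git's own metadata folder and anything that is not a regular file.
        if p.is_file()
        and ".git" not in p.relative_to(root).parts
        and is_allowed(p, whitelist)
    )
```

Adding another file type is then a one-line change to the list, for example appending ".js".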
The files that pass the filter are split using a markdown-specific splitter that creates metadata from the titles in the file itself. This is good for the quality and, I think, the speed of searches in the vector db. The problem is that it only works with markdown, so if your ext_whitelist includes .js files, those should really use a different splitter. The splitter should be selected based on the file extension, with more splitters plugged in dynamically.
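One way the extension-based selection could look is a small lookup table from extension to splitter function. This is only a sketch of the idea and is not implemented in this PR: split_markdown and split_plain are simplified, hypothetical stand-ins written in plain Python, not the splitters the notebook actually uses, and SPLITTERS and select_splitter are names I made up.

```python
import re
from pathlib import Path

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown(text: str) -> list[dict]:
    """Split on '#' headings and store the nearest heading as metadata.

    Simplified stand-in: it does not ignore '#' lines inside fenced code blocks.
    """
    chunks, title, buffer = [], None, []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({"text": body, "metadata": {"title": title}})

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that belongs to the previous heading
            title, buffer = match.group(2).strip(), []
        else:
            buffer.append(line)
    flush()
    return chunks


def split_plain(text: str, size: int = 1000) -> list[dict]:
    """Fallback for non-markdown files: fixed-size character windows, no metadata."""
    return [{"text": text[i:i + size], "metadata": {}} for i in range(0, len(text), size)]


# Registry mapping extension to splitter; extend it as new file types are whitelisted.
SPLITTERS = {".md": split_markdown, ".mdx": split_markdown}


def select_splitter(path):
    """Pick a splitter from the file extension, falling back to split_plain."""
    return SPLITTERS.get(Path(path).suffix.lower(), split_plain)
```

With this shape, supporting .js files properly would mean writing (or importing) a code-aware splitter and registering it under ".js", without touching the rest of the pipeline.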
I took a stab at using LangChain LCEL, but I couldn't implement memory the way I wanted, so that code is commented out.
The rest is just general refinement, nothing important.
P.S. Keeping the code in a notebook is pretty inconvenient: in this PR you cannot really see what I changed unless you go to my project and try it, and the file has no usable change history either.