Change repo cloning, file splitting and general refine, second attempt #3

Open
wants to merge 2 commits into
base: main
from

Conversation

AlessandroAnnini

Hello @gcapuzzi,
this is a second attempt after PR #2

I changed how the repo is cloned: it no longer uses the GitHub API and no longer needs a GitHub PAT. It just runs a plain git clone as a shell command in a subprocess.
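For readers who can't open the notebook, here is a minimal sketch of what cloning through a subprocess can look like. The names build_clone_command and clone_repo, and the shallow --depth default, are assumptions for illustration and not necessarily what the notebook does. The sketch also passes an argument list instead of a single shell string, which is a choice on my side to avoid quoting problems.

```python
import subprocess
from pathlib import Path


def build_clone_command(repo_url: str, dest: Path, depth: int = 1) -> list[str]:
    """Build the git command as an argument list (hypothetical helper)."""
    # --depth limits the history that is fetched; assumed here to keep the clone fast.
    return ["git", "clone", "--depth", str(depth), repo_url, str(dest)]


def clone_repo(repo_url: str, dest: Path, depth: int = 1) -> Path:
    """Clone repo_url into dest using the local git binary; no API token involved."""
    cmd = build_clone_command(repo_url, dest, depth)
    # check=True raises CalledProcessError if git exits with a non-zero status.
    subprocess.run(cmd, check=True)
    return dest
```

This only works for repositories that the local git can already reach, for example public ones.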

The files are filtered by a list of allowed extensions called ext_whitelist, so you can pick up md, mdx, and other files together.
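A rough sketch of the extension filter, assuming ext_whitelist holds lowercase suffixes with a leading dot. The helper names is_allowed and list_allowed_files are hypothetical, and skipping the .git directory is my own addition.

```python
from pathlib import Path

# Example value only; the real list lives in the notebook.
ext_whitelist = {".md", ".mdx"}


def is_allowed(path: Path, ext_whitelist: set[str]) -> bool:
    """True if the file suffix (compared case-insensitively) is in the whitelist."""
    return path.suffix.lower() in ext_whitelist


def list_allowed_files(root: Path, ext_whitelist: set[str]) -> list[Path]:
    """Walk the cloned repo and keep only whitelisted files outside .git."""
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file()
        and ".git" not in p.relative_to(root).parts
        and is_allowed(p, ext_whitelist)
    )
```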

The files that pass the filter are split with a Markdown-specific splitter that attaches metadata about the headings in each file. This is good for the quality, and I think the speed, of searches in the vector DB. The problem is that it only works with Markdown, so if your ext_whitelist includes .js files, those should really use a different splitter. The splitter should be selected dynamically based on the file extension, with more splitters added as needed.
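The extension-based selection could be sketched as a small registry. Both splitters below are naive standard-library stand-ins, not the splitters used in the notebook, and every name here (SPLITTERS, split_markdown, split_fixed, split_file) is hypothetical. In the real code the registry values would be the actual Markdown and code splitters.

```python
from pathlib import Path
from typing import Callable

Chunk = dict  # {"text": str, "metadata": dict}


def split_markdown(text: str) -> list[Chunk]:
    """Stand-in: start a new chunk at each heading line and keep the heading as metadata."""
    sections: list[tuple] = [(None, [])]
    for line in text.splitlines():
        if line.startswith("#"):
            # Naive heading detection; does not handle "#" lines inside code fences.
            sections.append((line.lstrip("#").strip(), []))
        else:
            sections[-1][1].append(line)
    chunks = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"text": body, "metadata": {"heading": heading}})
    return chunks


def split_fixed(text: str, size: int = 200) -> list[Chunk]:
    """Stand-in fallback: fixed-size character chunks with no metadata."""
    return [{"text": text[i:i + size], "metadata": {}} for i in range(0, len(text), size)]


# Registry: file extension -> splitter. Add more entries as more file types are supported.
SPLITTERS: dict[str, Callable[[str], list[Chunk]]] = {
    ".md": split_markdown,
    ".mdx": split_markdown,
}


def split_file(path: Path, text: str) -> list[Chunk]:
    """Pick the splitter from the extension, falling back to the generic one."""
    splitter = SPLITTERS.get(path.suffix.lower(), split_fixed)
    return splitter(text)
```

With this shape, supporting .js files would mean registering one more entry in SPLITTERS instead of touching the loading code.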

I took a stab at using LangChain LCEL, but I couldn't implement memory the way I wanted, so that code is commented out.

The rest is just general refinement, nothing important.

P.S. Keeping the code in a notebook is pretty inconvenient: in this PR you can't really see what I changed unless you go to my project and try it, and the file's change history is effectively non-existent too.
