warc-gpt unable to find information #3
Comments
I looked at the terminal when I asked "do the data talk about privacy?". I see
Hi @peterchanws
Very interesting!
This is thankfully unrelated. This warning comes from the library we use to talk to LLM APIs. In this case, it warns us that it doesn't know the maximum context length for the current model and therefore cannot trim your message if that becomes necessary.
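To illustrate the behaviour described above, here is a minimal sketch of how a client might trim a message only when the model's context length is known. It is not the code of the library in question: `KNOWN_CONTEXT_LENGTHS`, `trim_message`, and the model name are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
import warnings

# Hypothetical lookup table; a real client library ships its own model metadata.
KNOWN_CONTEXT_LENGTHS = {"example-model": 8}


def trim_message(message: str, model: str) -> str:
    """Trim `message` to the model's context length, if that length is known."""
    max_tokens = KNOWN_CONTEXT_LENGTHS.get(model)
    if max_tokens is None:
        # Same situation as the warning in the terminal: without a known
        # limit there is nothing to trim against, so the message passes through.
        warnings.warn(f"Unknown context length for {model!r}; message not trimmed.")
        return message
    tokens = message.split()  # crude whitespace "tokens", for illustration only
    return " ".join(tokens[:max_tokens])
```

With an unknown model the message is returned unchanged, which is why the warning by itself does not affect what the model is asked.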
I suspect the issue I have is caused by the handling of WARC files in Mailbagit. So, I ingested a few websites archived using ArchiveWeb.page. This time, the answer to my question is correct; however, the source quoted is wrong.

Question: Yes, there is information on working with email archives in the files you provided. Specifically, the Palladium Project has published guides for handling email archives as part of their work to preserve and make accessible digital collections. The project's blog post from February 2023 discusses their progress in this area and provides links to resources related to email archives.

Context: "This project is a stepping stone in our work with email archives, and there is still a great deal to do, but we are pleased to have made significant progress and taken some really positive steps."

Source: https://blog.jgc.org/2023/02/palladium-project-guides-for-email-archives/

The correct source should be

Somehow, the system merged in another source I ingested

Is this an issue with the LLM?
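One way to narrow down whether the wrong source comes from retrieval or from the LLM is to check which ingested documents actually contain the quoted context. The sketch below assumes the retrieved chunks are available as dictionaries with `source` and `text` keys; that shape and the function name `sources_containing` are hypothetical, not WARC-GPT's actual data model.

```python
def sources_containing(quote: str, chunks: list[dict]) -> list[str]:
    """Return the source URL of every chunk whose text contains `quote`.

    Whitespace and letter case are normalised so that line wrapping in the
    extracted text does not hide a match.
    """
    needle = " ".join(quote.split()).lower()
    matches = []
    for chunk in chunks:
        haystack = " ".join(chunk["text"].split()).lower()
        if needle in haystack:
            matches.append(chunk["source"])
    return matches
```

If the quote only appears in chunks from the expected page while the answer cites a different URL, the mix-up most likely happened when the model composed its answer rather than at ingestion time.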
Environment:
Apple M1 Pro, macOS 14.3.1, Chrome
I initially uploaded 41 WARC files into WARCgpt. Among these files was an email containing titles and links to several papers related to AI. When I queried WARCgpt about the email's content regarding AI, the system responded that the email did not directly mention AI. Instead, it referenced links to product pages on B&H Photo Video's website for various computer components, such as processors and memory, with encoded parameters that specify the products linked. It was unclear how these components were connected to AI. Although I have some WARC files containing emails from B&H Photo Video, they pertain to cameras, video equipment, etc.
Later, I crawled the page https://arxiv.org/html/2303.08774v5 and ingested it into WARCgpt. When I asked the system what the email said regarding transformers, it accurately responded that the emails discussed several transformer models, such as "gpt-j-6b," "gpt-neo," "bloom," and "opt," describing them as large-scale autoregressive language models. Some emails covered aspects like training, deployment, alignment, and human data collection for these models, in addition to contributions to datasets. The emails were authored by individuals ranging from researchers and engineers to product managers at companies including Microsoft, Meta, and Google. The response also cited the correct sources.
I'm puzzled as to why the initial 41 WARC files didn't yield the correct response.
Here's the link to the files I used:
https://drive.google.com/drive/folders/1eMilyDZ9Bc3HrTuu429DtMM7oqPnbOnH
warc_old.zip contains 41 WARC files
Archive.zip contains WARC files I crawled using Browsertrix Crawler