Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

warc-gpt unable to find information #3

Open
peterchanws opened this issue Mar 6, 2024 · 3 comments
Open

warc-gpt unable to find information #3

peterchanws opened this issue Mar 6, 2024 · 3 comments

Comments

@peterchanws
Copy link

peterchanws commented Mar 6, 2024

Environment:
Apple M1 Pro, macOS 14.3.1, Chrome

I initially uploaded 41 WARC files into WARCgpt. Among these files was an email containing titles and links to several papers related to AI. When I queried WARCgpt about the email's content regarding AI, the system responded that the email did not directly mention AI. Instead, it referenced links to product pages on B&H Photo Video's website for various computer components, such as processors and memory, with encoded parameters that specify the products linked. It was unclear how these components were connected to AI. Although I have some WARC files containing emails from B&H Photo Video, they pertain to cameras, video equipment, etc.

Later, I crawled the page https://arxiv.org/html/2303.08774v5 and ingested it into WARCgpt. Asking the system about what the email said regarding transformers, it accurately responded that the emails discussed several transformer models, such as "gpt-j-6b," "gpt-neo," "bloom," and "opt," describing them as large-scale autoregressive language models. Some emails covered aspects like training, deployment, alignment, and human data collection for these models, in addition to contributions to datasets. The emails were authored by individuals ranging from researchers and engineers to product managers at companies including Microsoft, Meta, and Google, providing the correct sources.

I'm puzzled as to why the initial 41 WARC files didn't yield the correct response.

Here's the link to the files I used:
https://drive.google.com/drive/folders/1eMilyDZ9Bc3HrTuu429DtMM7oqPnbOnH
warc_old.zip contains 41 warc files
Archive.zip contain warc files I crawled using Browsertrix crawler

Screenshot 2024-03-06 at 10 36 35
.google.com/drive/folders/1eMilyDZ9Bc3HrTuu429DtMM7oqPnbOnH
Screenshot 2024-03-06 at 10 43 12

@peterchanws peterchanws changed the title warc-gpt unable to find information in warc-gpt unable to find information Mar 6, 2024
@peterchanws
Copy link
Author

I looked at the terminal when I ask "do the data talk about privacy?". I see
[2024-03-06 12:12:02,661] WARNING in api: litellm could not trim messages for ollama/mistral:latest

@matteocargnelutti
Copy link
Collaborator

Hi @peterchanws

I'm puzzled as to why the initial 41 WARC files didn't yield the correct response.

Very interesting!
There are a few things that you could look into here:

  • Was the system able to pull relevant sources for the question you asked?
    • If the response is incorrect, but what you see in the "Show sources" makes sense, maybe the text-generation model wasn't able to make sense of the sources that were pulled
    • Conversely, if what you see in "Show sources" appears to be irrelevant (which I think is what you've identified), then there might a problem with how the documents are ingested
  • It could be interesting to use WARC-GPT's visualize command to see:

I looked at the terminal when I ask "do the data talk about privacy?". I see
[2024-03-06 12:12:02,661] WARNING in api: litellm could not trim messages for ollama/mistral:latest

This is thankfully unrelated. This warning comes from the library we use to talk to LLM APIs. In that case it warns us that it doesn't know what the maximum context length for the current model is, and therefore cannot trim your message if that were needed.

@peterchanws
Copy link
Author

I suspect the issue I have is caused by the handling warc files in Mailbagit. So, I ingested a few websites archived using Archive Web.Page. This time, the answer to my question is correct, however, the source quoted is wrong.

Question:
is there information on working with email archives in the files?
ollama/mistral:latest

Yes, there is information on working with email archives in the files you provided. Specifically, the Palladium Project has published guides for handling email archives as part of their work to preserve and make accessible digital collections. The project's blog post from February 2023 discusses their progress in this area and provides links to resources related to email archives.

Context: "This project is a stepping stone in our work with email archives, and there is still a great deal to do, but we are pleased to have made significant progress and taken some really positive steps."

Source: https://blog.jgc.org/2023/02/palladium-project-guides-for-email-archives/

The correct source should be
https://rylandscollections.com/2022/02/18/palladium-project-guides-for-email-archives/

Somehow, the system merge another source I ingested
https://blog.jgc.org/2024/02/the-original-www-proposal-is-word-for.html

Is this an issue with the LLM?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants