
Stop relying on archive.org Sites.xml #322

Open
benoit74 opened this issue Oct 22, 2024 · 5 comments

Comments

@benoit74 (Collaborator)

Now that archive.org is down, it becomes obvious that the scraper is down as well, despite the fact that we have mirrored all dumps. The reason is a bit sad: we did not mirror Sites.xml, and we are fetching it from an online source ... which is not online anymore.

I think we must mirror Sites.xml as well, and use the copy from our S3 bucket.
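For illustration, here is a minimal sketch of what sourcing Sites.xml from our bucket first (with archive.org as a fallback) could look like; the mirror URL and function name are hypothetical, not the scraper's actual code:

```python
import requests

ARCHIVE_ORG_URL = "https://archive.org/download/stackexchange/Sites.xml"
MIRROR_URL = "https://mirror.example.org/stackexchange/Sites.xml"  # hypothetical S3 mirror

def fetch_sites_xml() -> bytes:
    """Fetch Sites.xml, preferring our mirror over archive.org."""
    last_error = None
    for url in (MIRROR_URL, ARCHIVE_ORG_URL):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException as exc:
            last_error = exc  # remember the failure and try the next source
    raise RuntimeError("Sites.xml is unavailable from all sources") from last_error
```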

@benoit74 (Collaborator, Author)

Note: we also rely on stackoverflow being online in order to retrieve (at least):

  • the favicons (both normal + apple touch)
  • the primary and secondary CSS

While all of these are obviously specific to each domain, I wonder if we should not mirror them as well (and source them from S3 in the scraper), so that we can run the scraper even if StackExchange / archive.org are down.

@kelson42 (Contributor)

kelson42 commented Oct 30, 2024

I'm OK with just relying on both websites being online. IMHO, it's not worth the effort.

@benoit74 (Collaborator, Author)

benoit74 commented Nov 1, 2024

The Sites.xml file is no longer provided or updated in recent dumps, see openzim/zimfarm#1041

@benoit74 (Collaborator, Author)

benoit74 commented Nov 1, 2024

Also, the URL of the dumps keeps moving, so we can no longer always use https://archive.org/download/stackexchange/Sites.xml as the scraper currently does.

benoit74 changed the title from "Stop relying on online archive.org Sites.xml" to "Stop relying on archive.org Sites.xml" on Dec 16, 2024
@benoit74 (Collaborator, Author)

The file is not provided or updated anymore anyway; see https://archive.org/services/search/beta/page_production/?user_query=subject:%22Stack%20Exchange%20Data%20Dump%22%20creator:%22Stack%20Exchange,%20Inc.%22&hits_per_page=1&page=1&sort=date:desc&aggregations=false&client_url=https://archive.org/search?query=subject%3A%22Stack+Exchange+Data+Dump%22+creator%3A%22Stack+Exchange%2C+Inc.%22

We also tried to reach our contacts at StackExchange, without any success in getting an answer to this question.

We hence have to adapt the scraper to stop relying on this file. I propose the following strategy (a rough sketch follows the list):

  • add a CLI parameter for every attribute which used to come from Sites.xml (CSS URL, icon location, ...)
  • compute the number of tags, posts, ... at scraper startup (this forces us to parse the XML twice; it should hopefully do more good than harm, and we will have to monitor memory usage anyway)
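As promised, a rough sketch of both points, assuming the dumps' one-`<row>`-per-record layout; the option names (`--css-url`, `--icon-url`) and helper names are illustrative only, not the scraper's actual interface:

```python
import argparse
import xml.etree.ElementTree as ET

# Hypothetical CLI options replacing the attributes that used to come
# from Sites.xml; actual option names would be decided during implementation.
parser = argparse.ArgumentParser()
parser.add_argument("--css-url", required=True, help="primary CSS URL of the site")
parser.add_argument("--icon-url", required=True, help="favicon URL of the site")
args = parser.parse_args()

def count_rows(xml_path: str) -> int:
    """Startup pass: count <row> records while streaming, so memory
    stays bounded even though the XML ends up being parsed twice."""
    count = 0
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "row":
            count += 1
        elem.clear()  # release each element as soon as it is counted
    return count

# e.g. totals computed before the real scraping pass starts
nb_posts = count_rows("Posts.xml")
nb_tags = count_rows("Tags.xml")
```

Using a streaming parser (`iterparse` here) for the counting pass should keep the memory cost of the extra pass negligible, which is the main risk flagged above.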
