
Stop relying on archive.org Sites.xml #322

Open
benoit74 opened this issue Oct 22, 2024 · 5 comments

Comments

@benoit74 (Collaborator)

Now that archive.org is down, it becomes obvious that the scraper is down as well, despite the fact that we have mirrored all dumps. The reason is a bit sad: we did not mirror Sites.xml, and we are fetching it from an online source ... which is not online anymore.

I think we must mirror Sites.xml as well, and use the copy from our S3 bucket.
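For illustration, here is a minimal sketch of what sourcing Sites.xml from our bucket first (with archive.org as a fallback) could look like; the mirror URL and function name are hypothetical, not the scraper's actual code:

```python
import requests

ARCHIVE_ORG_URL = "https://archive.org/download/stackexchange/Sites.xml"
MIRROR_URL = "https://mirror.example.org/stackexchange/Sites.xml"  # hypothetical S3 mirror

def fetch_sites_xml() -> bytes:
    """Fetch Sites.xml, preferring our mirror over archive.org."""
    last_error = None
    for url in (MIRROR_URL, ARCHIVE_ORG_URL):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException as exc:
            last_error = exc  # remember the failure and try the next source
    raise RuntimeError("Sites.xml is unavailable from all sources") from last_error
```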

@benoit74 (Collaborator, Author)

Note: we also rely on stackoverflow being online in order to retrieve (at least):

  • the favicons (both normal + apple touch)
  • the primary and secondary CSS

While all of these are obviously specific to each domain, I wonder if we should not mirror them as well (and source them from S3 in the scraper), so that we can run the scraper even if StackExchange / archive.org are down.

@kelson42 (Contributor)

kelson42 commented Oct 30, 2024

I'm OK with just relying on both websites being online. IMHO, it's not worth the effort.

@benoit74 (Collaborator, Author)

benoit74 commented Nov 1, 2024

The Sites.xml file is no longer provided or updated in recent dumps, see openzim/zimfarm#1041

@benoit74 (Collaborator, Author)

benoit74 commented Nov 1, 2024

Also, the URL of the dumps keeps moving, so we can no longer always use https://archive.org/download/stackexchange/Sites.xml as the scraper currently does.

benoit74 changed the title from "Stop relying on online archive.org Sites.xml" to "Stop relying on archive.org Sites.xml" on Dec 16, 2024
@benoit74 (Collaborator, Author)

The file is not provided or updated anymore anyway; see https://archive.org/services/search/beta/page_production/?user_query=subject:%22Stack%20Exchange%20Data%20Dump%22%20creator:%22Stack%20Exchange,%20Inc.%22&hits_per_page=1&page=1&sort=date:desc&aggregations=false&client_url=https://archive.org/search?query=subject%3A%22Stack+Exchange+Data+Dump%22+creator%3A%22Stack+Exchange%2C+Inc.%22

We also tried to reach our contacts at StackExchange, without any success in getting an answer to this question.

We hence have to adapt the scraper to stop relying on this file. I propose the following strategy (a rough sketch follows the list):

  • add a CLI parameter for every attribute which used to come from Sites.xml (CSS URL, icon location, ...)
  • compute the number of tags, posts, ... at scraper startup (this forces us to parse the XML twice; it should hopefully do more good than harm, and we will have to monitor memory usage anyway)
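As promised, a rough sketch of both points, assuming the dumps' one-`<row>`-per-record layout; the option names (`--css-url`, `--icon-url`) and helper names are illustrative only, not the scraper's actual interface:

```python
import argparse
import xml.etree.ElementTree as ET

# Hypothetical CLI options replacing the attributes that used to come
# from Sites.xml; actual option names would be decided during implementation.
parser = argparse.ArgumentParser()
parser.add_argument("--css-url", required=True, help="primary CSS URL of the site")
parser.add_argument("--icon-url", required=True, help="favicon URL of the site")
args = parser.parse_args()

def count_rows(xml_path: str) -> int:
    """Startup pass: count <row> records while streaming, so memory
    stays bounded even though the XML ends up being parsed twice."""
    count = 0
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "row":
            count += 1
        elem.clear()  # release each element as soon as it is counted
    return count

# e.g. totals computed before the real scraping pass starts
nb_posts = count_rows("Posts.xml")
nb_tags = count_rows("Tags.xml")
```

Using a streaming parser (`iterparse` here) for the counting pass should keep the memory cost of the extra pass negligible, which is the main risk flagged above.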
