Stop relying on archive.org Sites.xml #322
Note: we also rely on stackoverflow being online in order to retrieve (at least):
While all of these are obviously specific to every domain, I wonder if we should not mirror them as well (and source them from S3 in the scraper), so that we can run the scraper again even if StackExchange / archive.org are down.
I'm OK with just relying on both websites being online. IMHO, it's not worth the effort.
Also, the URL to the dumps is moving, so we can no longer always use https://archive.org/download/stackexchange/Sites.xml as we currently do in the scraper.
The file is not provided/updated anymore anyway; see https://archive.org/services/search/beta/page_production/?user_query=subject:%22Stack%20Exchange%20Data%20Dump%22%20creator:%22Stack%20Exchange,%20Inc.%22&hits_per_page=1&page=1&sort=date:desc&aggregations=false&client_url=https://archive.org/search?query=subject%3A%22Stack+Exchange+Data+Dump%22+creator%3A%22Stack+Exchange%2C+Inc.%22 We also tried to reach our contacts at StackExchange, without any success in getting an answer to this question. We therefore have to adapt the scraper so that it no longer relies on this file. I propose that the strategy is to:
Now that archive.org is down, it has become obvious that the scraper is down as well, despite the fact that we have mirrored all the dumps. The reason is a bit sad: we did not mirror Sites.xml, and we are fetching it from online ... which is not online anymore. I think we must mirror Sites.xml as well, and use the Sites.xml from our S3 bucket.
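A minimal sketch of what that fetch could look like: prefer the mirrored copy on S3 and fall back to archive.org only when the mirror is unreachable. The mirror URL, function name, and injectable `opener` parameter are all assumptions for illustration, not the project's actual code or bucket layout.

```python
import urllib.error
import urllib.request

# Hypothetical mirror location; the real S3 bucket path would differ.
MIRROR_URL = "https://s3.example.org/stackexchange/Sites.xml"
UPSTREAM_URL = "https://archive.org/download/stackexchange/Sites.xml"


def fetch_sites_xml(urls=(MIRROR_URL, UPSTREAM_URL),
                    opener=urllib.request.urlopen) -> bytes:
    """Return the raw Sites.xml bytes, trying each URL in order.

    The mirror comes first so the scraper keeps working even when
    archive.org is down; `opener` is injectable to ease testing.
    """
    for url in urls:
        try:
            with opener(url, timeout=30) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError):
            continue  # this source is down; try the next one
    raise RuntimeError("Sites.xml unavailable from both mirror and upstream")
```

Keeping archive.org as a fallback (rather than dropping it entirely) means the scraper degrades gracefully if the mirror is ever stale or misconfigured.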