Problem while scraping home page of zh website #109

Open
benoit74 opened this issue Dec 20, 2024 · 0 comments
Labels: bug (Something isn't working)

Comments

@benoit74
Collaborator

See openzim/zim-requests#1234

All API calls before that (e.g. to retrieve the list of guides) succeed. See e.g. https://farm.openzim.org/pipeline/cbec972f-b77a-44aa-8705-d3fd8ab0f39d

[MainThread::2024-12-20 01:24:02,319] INFO:Scraping home items (1 items remaining)
[MainThread::2024-12-20 01:24:02,319] INFO:  Scraping home 1 (0 items remaining)
[MainThread::2024-12-20 01:24:02,416] WARNING:Error while processing home 1
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/ifixit2zim/scraper_generic.py", line 148, in scrape_items
    self.scrape_one_item(item_key, item_data)
  File "/usr/local/lib/python3.12/site-packages/ifixit2zim/scraper_generic.py", line 114, in scrape_one_item
    item_content = self.get_one_item_content(item_key, item_data)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ifixit2zim/scraper_homepage.py", line 26, in get_one_item_content
    soup, _ = self.utils.get_soup("/Guide")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ifixit2zim/utils.py", line 101, in get_soup
    content, paths = self.fetch(path, **params)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ifixit2zim/utils.py", line 79, in fetch
    resp.raise_for_status()
  File "/usr/local/lib/python3.12/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://zh.ifixit.com/Guide
[MainThread::2024-12-20 01:24:02,426] WARNING:Not supposed to add a redirect for a home item
[MainThread::2024-12-20 01:24:02,427] ERROR:Interrupting process due to error
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/ifixit2zim/scraper.py", line 362, in run
    scraper.scrape_items()
  File "/usr/local/lib/python3.12/site-packages/ifixit2zim/scraper_generic.py", line 173, in scrape_items
    raise FinalScrapingFailureError(
ifixit2zim.exceptions.FinalScrapingFailureError: Too many homes failed to be processed: 1
[MainThread::2024-12-20 01:24:02,428] DEBUG:shutting down executor IMG-T- with wait=False
[MainThread::2024-12-20 01:24:02,428] INFO:Cleaning up
[MainThread::2024-12-20 01:24:02,428] DEBUG:Removing /output/ifixit_zh_6q4k5g7_

To be tested locally. I suspect we retry too fast, but I don't understand why it fails on the first attempt. I'm not sure the delays available in the configuration can help; to be confirmed.
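For local testing of the "retry too fast" suspicion, here is a minimal sketch of retrying on HTTP 429 with a growing delay. It is not the ifixit2zim code: `retry_delay`, `fetch_with_backoff` and all their parameters are hypothetical names, and it is an assumption that the server sends a `Retry-After` header at all (only the integer-seconds form is handled; anything else falls back to exponential backoff).

```python
import time
from typing import Any, Callable, Mapping


def retry_delay(
    headers: Mapping[str, str],
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> float:
    """Seconds to wait after failed attempt number `attempt` (0-based)."""
    retry_after = headers.get("Retry-After", "").strip()
    if retry_after.isdigit():
        # Server told us how long to wait (integer seconds form only).
        return min(float(retry_after), max_delay)
    # Assumption: no usable hint from the server, so back off exponentially.
    return min(base_delay * 2**attempt, max_delay)


def fetch_with_backoff(
    get: Callable[[str], Any],
    url: str,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `get(url)` until the response is not a 429 or attempts run out.

    `get` must return an object exposing `status_code` and `headers`,
    which is the case for `requests.get`.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        resp = get(url)
        if resp.status_code != 429:
            return resp
        if attempt < max_attempts - 1:
            sleep(retry_delay(resp.headers, attempt))
    # Still rate limited after the last attempt: let the caller decide.
    return resp
```

With `requests` installed, `fetch_with_backoff(requests.get, "https://zh.ifixit.com/Guide")` followed by `resp.raise_for_status()` would exercise the failing call from the traceback. Passing `sleep` as a parameter keeps the delays observable in a test without actually waiting.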

@benoit74 added the bug label on Dec 20, 2024