Add locking mechanism around scraping to support distributed setups #495

Open · pushrbx opened this issue Feb 4, 2024 · 0 comments

pushrbx (Collaborator) commented Feb 4, 2024

Currently, the Jikan API works as follows:
[Diagram: request flow of the current Jikan API setup]

If you have multiple workers/instances of the Jikan API running behind a load balancer and the requested anime/manga is not yet in MongoDB, there is a race condition: two instances of the Jikan API may scrape the same datasheet from MAL, resulting in duplicated items in MongoDB.

This behaviour has been reported several times: #262 #333

The solution I can think of is some form of pessimistic locking: a lock ticket stored in a separate MongoDB collection. If an entry already exists there for an anime/manga item, a second worker trying to scrape that item would not insert its scraped copy. However, this still carries the risk of an IP ban, because while Worker A is scraping and inserting into MongoDB, Worker B might receive requests for the same item and would still have to scrape it from MAL.

With pessimistic locking, a worker that finds an existing lock ticket should return a 423 (Locked) status code.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/423
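To make the idea concrete, here is a minimal sketch of the lock ticket and the 423 response, assuming the official mongodb/mongodb PHP library. The `scrape_locks` collection, the `jikan` database name, and `tryAcquireScrapeLock()` are hypothetical placeholders, not existing Jikan code; the unique `_id` makes `insertOne()` behave as an atomic "acquire" across workers.

```php
<?php
// Minimal sketch of the lock-ticket idea. "scrape_locks", "jikan" and
// tryAcquireScrapeLock() are hypothetical names, not existing Jikan code.

use MongoDB\Client;
use MongoDB\Driver\Exception\BulkWriteException;

function tryAcquireScrapeLock(Client $mongo, string $type, int $malId): bool
{
    $locks = $mongo->selectDatabase('jikan')->selectCollection('scrape_locks');

    try {
        // _id is unique, so only one worker can insert the ticket for a given item.
        $locks->insertOne([
            '_id'       => "{$type}:{$malId}",
            'locked_at' => new \MongoDB\BSON\UTCDateTime(),
        ]);
        return true;   // we own the lock: scrape MAL, persist, then delete the ticket
    } catch (BulkWriteException $e) {
        // A real implementation should check for the duplicate-key error code (11000).
        return false;  // another worker already holds the ticket
    }
}

// Controller-side usage (hypothetical glue code):
//
//     if (! tryAcquireScrapeLock($mongo, 'anime', $malId)) {
//         return response()->json(['error' => 'Item is currently being scraped'], 423);
//     }
```

A TTL index on `locked_at` could make stale tickets expire automatically if a worker dies mid-scrape.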

Edit 2024.07.08:
If we upgrade the project to full Laravel instead of Lumen, the Laravel job queue and Laravel Horizon could be used here. If we separate scraping the data from MAL from persisting it in the DB, we can put the persistence step on its own job queue (a rough sketch follows the list below):

  • If the cache is empty, scraping starts, the retrieved data is put on the queue to be saved, and the data is returned to the client. -- the only problem with this is that the size of the payload we can put on the queue is limited.
  • If the cache is not empty but stale, the stale data is returned and a job is put on the queue to refresh the cache.
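As a rough sketch of that split, assuming the project is upgraded to full Laravel: a persistence-only job class could carry the scraped payload and run on its own queue. `PersistScrapedItem`, the `persistence` queue name, and the upsert call are assumptions, not existing Jikan code.

```php
<?php
// Rough sketch of the scrape/persist split, assuming an upgrade to full Laravel.
// The class name, queue name and upsert call are assumptions, not Jikan code.

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Support\Facades\DB;

class PersistScrapedItem implements ShouldQueue
{
    use Dispatchable, Queueable;

    public function __construct(
        public string $type,   // e.g. "anime" or "manga"
        public int $malId,
        public array $data     // scraped payload -- queue payload size is limited
    ) {}

    public function handle(): void
    {
        // Upsert so two jobs for the same item never create duplicate documents.
        // (Shown with the laravel-mongodb query builder as an assumption.)
        DB::connection('mongodb')
            ->collection($this->type)
            ->where('mal_id', $this->malId)
            ->update($this->data, ['upsert' => true]);
    }
}
```

On the controller side the two bullet points would then look like `PersistScrapedItem::dispatch('anime', $malId, $scraped)->onQueue('persistence')` when the cache is empty, and a similar refresh job when the cache is stale.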

Edit 2024.11.13:
The above queue idea is solid. Another refinement came to mind: if we only store the mal_id and type of the items that require updating on the queue, a throttled consumer could do the scraping in the background, e.g. a worker that runs every 3-4 seconds. Laravel also supports job deduplication, so there would be at most one job per mal_id/type pair on the queue.
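A minimal sketch of that refinement, again assuming full Laravel: the job only carries mal_id and type, implements Laravel's `ShouldBeUnique` contract for the deduplication mentioned above, and is throttled via the `RateLimited` job middleware. The `RefreshItem` class name and the `mal-scrape` limiter are assumptions, not existing Jikan code.

```php
<?php
// Sketch of the "store only mal_id + type on the queue" refinement.
// ShouldBeUnique is Laravel's built-in job deduplication; the class name
// and the "mal-scrape" limiter are assumptions, not Jikan code.

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldBeUnique;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\Middleware\RateLimited;

class RefreshItem implements ShouldQueue, ShouldBeUnique
{
    use Dispatchable, Queueable;

    public function __construct(public string $type, public int $malId) {}

    // One pending job per mal_id/type pair; duplicates are dropped while
    // an identical job is already on the queue.
    public function uniqueId(): string
    {
        return "{$this->type}:{$this->malId}";
    }

    // Throttle the consumer so MAL is hit at most every few seconds
    // (the "mal-scrape" limiter would be defined in a service provider).
    public function middleware(): array
    {
        return [new RateLimited('mal-scrape')];
    }

    public function handle(): void
    {
        // Scrape the item from MAL and upsert it into MongoDB here.
    }
}
```

The limiter would be registered with something like `RateLimiter::for('mal-scrape', fn () => Limit::perMinute(15))`, which approximates one scrape every few seconds, and dispatching stays as simple as `RefreshItem::dispatch('anime', $malId)`.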

Projects
Status: Planned