Excessive refetching of review_comments and pull_requests after rate-limiting error #43
I'm making use of this integration via StitchData.
I've set up an integration with a lot of different repositories:
silverstripe/silverstripe-admin
silverstripe/silverstripe-reports
silverstripe/silverstripe-upgrader
silverstripe/silverstripe-siteconfig
silverstripe/silverstripe-installer
silverstripe/silverstripe-graphql
silverstripe/silverstripe-asset-admin
silverstripe/silverstripe-behat-extension
silverstripe/silverstripe-testsession
silverstripe/recipe-cms
silverstripe/silverstripe-versioned
silverstripe/silverstripe-errorpage
silverstripe/silverstripe-campaign-admin
silverstripe/recipe-core
silverstripe/silverstripe-assets
silverstripe/silverstripe-config
silverstripe/silverstripe-versioned-admin
silverstripe/silverstripe-cms
silverstripe/silverstripe-framework
Unsurprisingly, the rate limit is hit. My assumption was that the sync would continue from where it left off on the next run.
However, based on the number of review_comment entries in my database vs. how many records StitchData says it has fetched (8,554 vs. 139,866), it seems to me that it's getting stuck in a loop and re-fetching many of the same records over and over again.
It would be highly beneficial if the full in-progress state were preserved.
review_comments and pull_requests seem to be affected in this way, but comments, commits, and issues are not, even though the latter still have large volumes of records imported.
Comments
Related question: do you know if v1.3.2 is running on StitchData? That could be the issue here – the logs don't provide the version number.
Here's the end of the log, in case that helps: …
It looks like there might be 2 problems here:
1. The in-progress state isn't preserved, so after a failed sync the next run re-fetches records it has already extracted.
2. The tap doesn't back off / sleep when GitHub returns a rate-limiting error.
We'll need to do a bit more research for the first problem. I think it might be okay to solve 2 via a backoff / sleep – GitHub appears to send headers indicating when the rate limit resets, though it feels very awkward to write code that sleeps for up to an hour.
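For reference, a minimal sketch of that header-based sleep, relying on GitHub's documented X-RateLimit-Remaining / X-RateLimit-Reset response headers (get_with_rate_limit_backoff is a hypothetical helper name, not part of the tap):

```python
import time

import requests

def get_with_rate_limit_backoff(url, headers=None):
    # Hypothetical helper, not part of tap-github: on a 403 caused by
    # rate limiting, sleep until the Unix timestamp advertised in
    # X-RateLimit-Reset (potentially up to an hour), then retry once.
    resp = requests.get(url, headers=headers)
    if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
        reset_at = int(resp.headers.get("X-RateLimit-Reset", "0"))
        time.sleep(max(reset_at - time.time(), 0) + 1)
        resp = requests.get(url, headers=headers)
    return resp
```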
I've noticed that the mechanism for fetching issue comments works like so: …
If you've got 1000s of issues, that can be... problematic. It's not clear if "fetch all issues since yyyy-mm-dd" will include issues that haven't been modified themselves but have had a comment added, but that would be good to test. UPDATE: I'm full of crap, this is wrong.
FYI - this issue appears to have been bypassed when I disabled the fetching of comments and review_comments, which suggests that the root cause relates to the fetching of these two datasets.
OK, I've done a bit of investigation. The underlying issue is that we fetch the review comments for every pull request on every sync. However, we don't need to get review comments for every one of them. In this loop - https://github.com/singer-io/tap-github/blob/master/tap_github.py#L198 - we should be able to compare the "updated_at" value to our bookmark data, and if the record hasn't been updated since the bookmark value, then don't bother looking for more review comments and don't bother outputting the information to the Singer target. We would still need to paginate through all the pull requests, but that might be okay. If not, fetching the pull requests with …
I believe this is correct. We get all the PRs (and paginate them) every sync, but only dive deeper if a PR's updated_at is greater than the bookmark. A simple change adding something in the sense of the following (a rough sketch; record and bookmark are assumed names from the sync loop):
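```python
# Hypothetical names: `record` is the parsed pull request dict and
# `bookmark` is the stored replication-key value. Both are ISO-8601
# strings in the same format, so plain string comparison is safe.
if record["updated_at"] <= bookmark:
    continue  # unchanged since last sync; skip review comments and output
```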
would be a good addition here.
GitHub’s /pulls endpoint doesn’t have a ?since filter, so instead we check the updated_at value of each record imported. Since we have sorted descending by this value, we can break out of the loop once we have found the first updated_at value that is less than the bookmark time. Fixes singer-io#43
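In code, the pattern that commit message describes is roughly the following (a sketch under assumed names, not the actual patch; singer.write_record is the standard singer-python helper):

```python
import singer

def sync_pull_requests(response, bookmark):
    # `response` is assumed to be a requests.Response for
    # GET /repos/:owner/:repo/pulls?sort=updated&direction=desc&state=all,
    # so records arrive newest-first; `bookmark` is an ISO-8601 string.
    for pr in response.json():
        if pr["updated_at"] < bookmark:
            break  # sorted descending: every remaining record is older
        singer.write_record("pull_requests", pr)
```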
OK, I've had a go at a PR for this – my Python is a bit rusty, but this appears to work locally.
I am still encountering this problem with Stitch trying to do an initial data sync, due to the number of rows in pr_commits. I am trying to address this by doing a 5-minute sleep when receiving a rate limit error from the server in …
I see this discussion is happening in #63, so I will take this comment there.