
Excessive refetching of review_comments and pull_requests after rate-limiting error #43

Closed
sminnee opened this issue Jan 8, 2019 · 10 comments

Comments

@sminnee
Contributor

sminnee commented Jan 8, 2019

I'm making use of this integration via StitchData.

I've set up an integration with a lot of different repositories: silverstripe/silverstripe-admin silverstripe/silverstripe-reports silverstripe/silverstripe-upgrader silverstripe/silverstripe-siteconfig silverstripe/silverstripe-installer silverstripe/silverstripe-graphql silverstripe/silverstripe-asset-admin silverstripe/silverstripe-behat-extension silverstripe/silverstripe-testsession silverstripe/recipe-cms silverstripe/silverstripe-versioned silverstripe/silverstripe-errorpage silverstripe/silverstripe-campaign-admin silverstripe/recipe-core silverstripe/silverstripe-assets silverstripe/silverstripe-config silverstripe/silverstripe-versioned-admin silverstripe/silverstripe-cms silverstripe/silverstripe-framework

Unsurprisingly, the rate limit is hit. My assumption is that it would continue from where it left off after it was run.

However, based on the number of review_comment entries in my database versus how many records StitchData says it has fetched (8,554 vs 139,866), it seems to me that it's getting stuck in a loop and re-fetching many of the same records over and over again.

It would be highly beneficial if the full in-progress state was preserved.

review_comments and pull_requests seem to be affected in this way, but comments, commits, and issues do not, even though the latter also have large volumes of records imported.

@sminnee
Contributor Author

sminnee commented Jan 8, 2019

Related question: do you know if v1.3.2 is running on StitchData? That could be the issue here – the logs don't provide the version number.

@sminnee
Contributor Author

sminnee commented Jan 8, 2019

Here's the end of the log, in case that helps:

2019-01-08 04:57:00,961Z    tap - INFO HTTP request to "comments" endpoint took 0.078s, returned status code 200
2019-01-08 04:57:00,962Z    tap - WARNING Removed 38 paths during transforms:
2019-01-08 04:57:00,962Z    tap - 	_links
2019-01-08 04:57:00,963Z    tap - 	assignee
2019-01-08 04:57:00,963Z    tap - 	assignees
2019-01-08 04:57:00,963Z    tap - 	author_association
2019-01-08 04:57:00,963Z    tap - 	base
2019-01-08 04:57:00,963Z    tap - 	comments_url
2019-01-08 04:57:00,963Z    tap - 	commits_url
2019-01-08 04:57:00,963Z    tap - 	diff_url
2019-01-08 04:57:00,963Z    tap - 	head
2019-01-08 04:57:00,963Z    tap - 	html_url
2019-01-08 04:57:00,963Z    tap - 	issue_url
2019-01-08 04:57:00,963Z    tap - 	labels
2019-01-08 04:57:00,964Z    tap - 	locked
2019-01-08 04:57:00,964Z    tap - 	merge_commit_sha
2019-01-08 04:57:00,964Z    tap - 	milestone
2019-01-08 04:57:00,964Z    tap - 	node_id
2019-01-08 04:57:00,964Z    tap - 	patch_url
2019-01-08 04:57:00,964Z    tap - 	requested_reviewers
2019-01-08 04:57:00,964Z    tap - 	requested_teams
2019-01-08 04:57:00,964Z    tap - 	review_comment_url
2019-01-08 04:57:00,965Z    tap - 	review_comments_url
2019-01-08 04:57:00,965Z    tap - 	statuses_url
2019-01-08 04:57:00,965Z    tap - 	user.avatar_url
2019-01-08 04:57:00,965Z    tap - 	user.events_url
2019-01-08 04:57:00,965Z    tap - 	user.followers_url
2019-01-08 04:57:00,965Z    tap - 	user.following_url
2019-01-08 04:57:00,965Z    tap - 	user.gists_url
2019-01-08 04:57:00,965Z    tap - 	user.gravatar_id
2019-01-08 04:57:00,965Z    tap - 	user.html_url
2019-01-08 04:57:00,965Z    tap - 	user.node_id
2019-01-08 04:57:00,965Z    tap - 	user.organizations_url
2019-01-08 04:57:00,965Z    tap - 	user.received_events_url
2019-01-08 04:57:00,965Z    tap - 	user.repos_url
2019-01-08 04:57:00,965Z    tap - 	user.site_admin
2019-01-08 04:57:00,965Z    tap - 	user.starred_url
2019-01-08 04:57:00,965Z    tap - 	user.subscriptions_url
2019-01-08 04:57:00,965Z    tap - 	user.type
2019-01-08 04:57:00,966Z    tap - 	user.url
2019-01-08 04:57:00,966Z    tap - WARNING Removed paths list: ['_links', 'assignee', 'assignees', 'author_association', 'base', 'comments_url', 'commits_url', 'diff_url', 'head', 'html_url', 'issue_url', 'labels', 'locked', 'merge_commit_sha', 'milestone', 'node_id', 'patch_url', 'requested_reviewers', 'requested_teams', 'review_comment_url', 'review_comments_url', 'statuses_url', 'user.avatar_url', 'user.events_url', 'user.followers_url', 'user.following_url', 'user.gists_url', 'user.gravatar_id', 'user.html_url', 'user.node_id', 'user.organizations_url', 'user.received_events_url', 'user.repos_url', 'user.site_admin', 'user.starred_url', 'user.subscriptions_url', 'user.type', 'user.url']
2019-01-08 04:57:01,039Z    tap - INFO METRIC Point(metric_type='timer', metric='http_request_duration', value=0.07489776611328125, tags={'endpoint': 'comments', 'status': 'failed'})
2019-01-08 04:57:01,039Z    tap - INFO replicated 0 records from "reviews" endpoint
2019-01-08 04:57:01,039Z    tap - INFO replicated 488 records from "pull_requests" endpoint
2019-01-08 04:57:01,039Z    tap - CRITICAL {"message":"API rate limit exceeded for user ID 59968.","documentation_url":"https://developer.github.com/v3/#rate-limiting"}
2019-01-08 04:57:01,040Z    tap - Traceback (most recent call last):
2019-01-08 04:57:01,040Z    tap -   File "tap-env/bin/tap-github", line 11, in <module>
2019-01-08 04:57:01,040Z    tap -     sys.exit(main())
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/singer/utils.py", line 225, in wrapped
2019-01-08 04:57:01,040Z    tap -     return fnc(*args, **kwargs)
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_github.py", line 515, in main
2019-01-08 04:57:01,040Z    tap -     do_sync(args.config, args.state, catalog)
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_github.py", line 503, in do_sync
2019-01-08 04:57:01,040Z    tap -     state = sync_func(stream_schemas, repo, state, mdata)
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_github.py", line 219, in get_all_pull_requests
2019-01-08 04:57:01,040Z    tap -     for review_comment_rec in get_review_comments_for_pr(pr_num, schemas['review_comments'], repo_path, state, mdata):
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_github.py", line 244, in get_review_comments_for_pr
2019-01-08 04:57:01,040Z    tap -     'https://api.github.com/repos/{}/pulls/{}/comments'.format(repo_path,pr_number)
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_github.py", line 100, in authed_get_all_pages
2019-01-08 04:57:01,040Z    tap -     r = authed_get(source, url, headers)
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_github.py", line 91, in authed_get
2019-01-08 04:57:01,040Z    tap -     raise AuthException(resp.text)
2019-01-08 04:57:01,041Z    tap - tap_github.AuthException: {"message":"API rate limit exceeded for user ID 59968.","documentation_url":"https://developer.github.com/v3/#rate-limiting"}
2019-01-08 04:57:01,061Z target - INFO Sending batch with 2 messages for table pull_requests to https://api.stitchdata.com/v2/import/batch
2019-01-08 04:57:01,087Z target - INFO Exiting normally
2019-01-08 04:57:01,117Z   main - INFO Target exited normally with status 0
2019-01-08 04:57:01,120Z   main - INFO Exit status is: Discovery succeeded. Tap failed with code 1 and error message: "{"message":"API rate limit exceeded for user ID 59968.","documentation_url":"https://developer.github.com/v3/#rate-limiting"}". Target succeeded.

@KAllan357
Contributor

It looks like there might be two problems here:

  1. PRs, reviews, and review_comments do not honor a bookmark (though they track one), probably because of limitations in the UI.
  2. GitHub enforces a 5,000-requests-per-hour rate limit (per user, so multiple auth tokens won't help).

We'll need to do a bit more research on the first problem.

I think problem 2 could be solved with a backoff/sleep: GitHub appears to send headers indicating when the rate limit resets, though it feels awkward to write code that sleeps for up to an hour.
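The reset-header approach could be sketched with a small helper. This is a sketch only: `seconds_until_reset` is a hypothetical name, not part of the tap. GitHub's REST API does return an `X-RateLimit-Reset` header carrying the Unix epoch time at which the quota refills.

```python
import time

# Sketch only: "seconds_until_reset" is a hypothetical helper, not the
# tap's code. GitHub's X-RateLimit-Reset header carries the Unix epoch
# time (UTC) at which the request quota refills.
def seconds_until_reset(headers, now=None):
    now = time.time() if now is None else now
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    # Clamp to zero in case the reset time is already in the past.
    return max(0, reset_at - now)

# E.g. after a rate-limited response, sleep until the quota refills:
# time.sleep(seconds_until_reset(resp.headers) + 1)
```

Sleeping for exactly the advertised reset interval, rather than a fixed hour, avoids most of the awkwardness of a hard-coded long sleep.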

@sminnee
Contributor Author

sminnee commented Jan 9, 2019

I've noticed that the mechanism for fetching issue comments works like so:

  • Fetch all issues
  • For each issue, fetch comments since the last fetch date.

If you've got thousands of issues, that can be... problematic. It's not clear whether "fetch all issues since yyyy-mm-dd" will include issues that haven't been modified themselves but have had comments added; that would be good to test.

UPDATE: I'm full of crap, this is wrong.

@sminnee
Contributor Author

sminnee commented Jan 9, 2019

FYI - this issue appears to have been bypassed when I disabled the fetching of comments and review_comments, which suggests that the root cause relates to the fetching of these 2 datasets.

@sminnee
Contributor Author

sminnee commented Jan 9, 2019

OK, I've done a bit of investigation. The underlying issue is that /pulls has no ?since= filter.

However, we don't need to fetch the review comments of every pull request. In this loop - https://github.com/singer-io/tap-github/blob/master/tap_github.py#L198 - we should be able to compare the "updated_at" value to our bookmark, and if the PR hasn't been updated since the bookmark value, skip looking for its review comments and skip outputting the record to the Singer target.

We would still need to paginate through all the pull requests, but that might be okay. If not, fetching the pull requests with pulls?state=all&sort=updated&direction=desc, and then paginating only until updated_at drops below our bookmark value, would reduce the number of requests.
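The early-stop pagination idea can be sketched as follows. This is a sketch, not the tap's actual code: `prs_updated_since` is a hypothetical helper, and it assumes pages arrive sorted by updated_at descending (i.e. pulls?state=all&sort=updated&direction=desc) with ISO-8601 UTC timestamps, which compare correctly as plain strings.

```python
# Sketch only: "prs_updated_since" is a hypothetical helper. It assumes
# pages are sorted by updated_at descending and that timestamps are
# ISO-8601 UTC strings, which compare correctly lexicographically.
def prs_updated_since(pages, bookmark):
    for page in pages:          # each page is one API response body
        for pr in page:
            if pr["updated_at"] < bookmark:
                return          # everything after this is older; stop paginating
            yield pr

pages = [
    [{"number": 3, "updated_at": "2019-01-08T00:00:00Z"},
     {"number": 2, "updated_at": "2019-01-05T00:00:00Z"}],
    [{"number": 1, "updated_at": "2018-12-01T00:00:00Z"}],
]
fresh = [pr["number"] for pr in prs_updated_since(pages, "2019-01-01T00:00:00Z")]
# fresh == [3, 2]: the pre-bookmark PR is never examined
```

The break-on-first-stale-record trick is what turns the full pagination into a bounded one, so the request count scales with the number of recently updated PRs rather than the total.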

@KAllan357
Contributor

I believe this is correct. We'd still get all the PRs (and paginate through them) every sync, but only dive deeper when a PR's updated_at is greater than the bookmark.

A simple change along the lines of:

for pr in pull_requests:
    if pr['updated_at'] < bookmark_value:
        continue

    # ... rest of the sync logic ...

would be a good addition here.

sminnee pushed a commit to sminnee/tap-github that referenced this issue Jan 10, 2019
GitHub’s /pulls endpoint doesn’t have a ?since filter, so instead we
check the updated_at values of each record imported. Since we have
sorted descending by this value, we can break out of the loop once we
have found the first updated_at value that is less than the bookmark
time.

Fixes singer-io#43
@sminnee
Contributor Author

sminnee commented Jan 10, 2019

OK, I've had a go at a PR for this – my Python is a bit rusty, but this appears to work locally.

KAllan357 pushed a commit that referenced this issue Jan 10, 2019
GitHub’s /pulls endpoint doesn’t have a ?since filter, so instead we
check the updated_at values of each record imported. Since we have
sorted descending by this value, we can break out of the loop once we
have found the first updated_at value that is less than the bookmark
time.

Fixes #43
@KBorders01

I am still encountering this problem with Stitch trying to do an initial data sync, due to the number of rows in pr_commits. I am trying to address it by sleeping for 5 minutes when receiving a rate-limit error from the server in authed_get. @KAllan357, is there any reason not to make this change to the code, even though it feels awkward as you say? Sleeping for 5 minutes at a time rather than a full hour would use the time a bit more efficiently.
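The 5-minute-sleep idea could be sketched as a retry wrapper around the request call. This is a hypothetical helper, not the tap's authed_get; it also assumes a rate-limited GitHub v3 response surfaces as HTTP 403, which matches the error in the log above.

```python
import time

RATE_LIMIT_SLEEP = 300  # 5 minutes, per the suggestion above

# Sketch only: "with_rate_limit_retry" is a hypothetical wrapper, not the
# tap's authed_get. A rate-limited response is assumed to be HTTP 403.
def with_rate_limit_retry(request_fn, max_retries=12, sleep_fn=time.sleep):
    for attempt in range(max_retries + 1):
        resp = request_fn()
        if resp.status_code != 403:
            return resp                     # success (or a non-rate-limit error)
        if attempt < max_retries:
            sleep_fn(RATE_LIMIT_SLEEP)      # wait 5 minutes, then try again
    raise RuntimeError("rate limit still exceeded after retries")
```

With max_retries=12 this gives up after roughly an hour of waiting, which bounds the worst case at the same cost as sleeping through a full reset window.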

@KBorders01

I see this discussion is happening in #63, so I will take this comment there.

AJWurts pushed a commit to villagelabsco/tap-github that referenced this issue Oct 24, 2024

Fixes singer-io#43