
Excessive refetching of review_comments and pull_requests after rate-limiting error #43

Closed
sminnee opened this issue Jan 8, 2019 · 10 comments

Comments

@sminnee
Contributor

sminnee commented Jan 8, 2019

I'm making use of this integration via StitchData.

I've set up an integration with a lot of different repositories: silverstripe/silverstripe-admin silverstripe/silverstripe-reports silverstripe/silverstripe-upgrader silverstripe/silverstripe-siteconfig silverstripe/silverstripe-installer silverstripe/silverstripe-graphql silverstripe/silverstripe-asset-admin silverstripe/silverstripe-behat-extension silverstripe/silverstripe-testsession silverstripe/recipe-cms silverstripe/silverstripe-versioned silverstripe/silverstripe-errorpage silverstripe/silverstripe-campaign-admin silverstripe/recipe-core silverstripe/silverstripe-assets silverstripe/silverstripe-config silverstripe/silverstripe-versioned-admin silverstripe/silverstripe-cms silverstripe/silverstripe-framework

Unsurprisingly, the rate limit is hit. My assumption is that it would continue from where it left off after it was run.

However, based on the number of review_comment entries in my database versus how many records StitchData says it has fetched (8,554 vs 139,866), it seems to me that it's getting stuck in a loop and re-fetching many of the same records over and over again.

It would be highly beneficial if the full in-progress state was preserved.

review_comments and pull_requests seem to be affected in this way, but comments, commits, and issues do not, even though the latter also have large volumes of records imported.

@sminnee
Contributor Author

sminnee commented Jan 8, 2019

Related question: do you know if v1.3.2 is running on StitchData? That could be the issue here – the logs don't provide the version number.

@sminnee
Contributor Author

sminnee commented Jan 8, 2019

Here's the end of the log, in case that helps:

2019-01-08 04:57:00,961Z    tap - INFO HTTP request to "comments" endpoint took 0.078s, returned status code 200
2019-01-08 04:57:00,962Z    tap - WARNING Removed 38 paths during transforms:
2019-01-08 04:57:00,962Z    tap - 	_links
2019-01-08 04:57:00,963Z    tap - 	assignee
2019-01-08 04:57:00,963Z    tap - 	assignees
2019-01-08 04:57:00,963Z    tap - 	author_association
2019-01-08 04:57:00,963Z    tap - 	base
2019-01-08 04:57:00,963Z    tap - 	comments_url
2019-01-08 04:57:00,963Z    tap - 	commits_url
2019-01-08 04:57:00,963Z    tap - 	diff_url
2019-01-08 04:57:00,963Z    tap - 	head
2019-01-08 04:57:00,963Z    tap - 	html_url
2019-01-08 04:57:00,963Z    tap - 	issue_url
2019-01-08 04:57:00,963Z    tap - 	labels
2019-01-08 04:57:00,964Z    tap - 	locked
2019-01-08 04:57:00,964Z    tap - 	merge_commit_sha
2019-01-08 04:57:00,964Z    tap - 	milestone
2019-01-08 04:57:00,964Z    tap - 	node_id
2019-01-08 04:57:00,964Z    tap - 	patch_url
2019-01-08 04:57:00,964Z    tap - 	requested_reviewers
2019-01-08 04:57:00,964Z    tap - 	requested_teams
2019-01-08 04:57:00,964Z    tap - 	review_comment_url
2019-01-08 04:57:00,965Z    tap - 	review_comments_url
2019-01-08 04:57:00,965Z    tap - 	statuses_url
2019-01-08 04:57:00,965Z    tap - 	user.avatar_url
2019-01-08 04:57:00,965Z    tap - 	user.events_url
2019-01-08 04:57:00,965Z    tap - 	user.followers_url
2019-01-08 04:57:00,965Z    tap - 	user.following_url
2019-01-08 04:57:00,965Z    tap - 	user.gists_url
2019-01-08 04:57:00,965Z    tap - 	user.gravatar_id
2019-01-08 04:57:00,965Z    tap - 	user.html_url
2019-01-08 04:57:00,965Z    tap - 	user.node_id
2019-01-08 04:57:00,965Z    tap - 	user.organizations_url
2019-01-08 04:57:00,965Z    tap - 	user.received_events_url
2019-01-08 04:57:00,965Z    tap - 	user.repos_url
2019-01-08 04:57:00,965Z    tap - 	user.site_admin
2019-01-08 04:57:00,965Z    tap - 	user.starred_url
2019-01-08 04:57:00,965Z    tap - 	user.subscriptions_url
2019-01-08 04:57:00,965Z    tap - 	user.type
2019-01-08 04:57:00,966Z    tap - 	user.url
2019-01-08 04:57:00,966Z    tap - WARNING Removed paths list: ['_links', 'assignee', 'assignees', 'author_association', 'base', 'comments_url', 'commits_url', 'diff_url', 'head', 'html_url', 'issue_url', 'labels', 'locked', 'merge_commit_sha', 'milestone', 'node_id', 'patch_url', 'requested_reviewers', 'requested_teams', 'review_comment_url', 'review_comments_url', 'statuses_url', 'user.avatar_url', 'user.events_url', 'user.followers_url', 'user.following_url', 'user.gists_url', 'user.gravatar_id', 'user.html_url', 'user.node_id', 'user.organizations_url', 'user.received_events_url', 'user.repos_url', 'user.site_admin', 'user.starred_url', 'user.subscriptions_url', 'user.type', 'user.url']
2019-01-08 04:57:01,039Z    tap - INFO METRIC Point(metric_type='timer', metric='http_request_duration', value=0.07489776611328125, tags={'endpoint': 'comments', 'status': 'failed'})
2019-01-08 04:57:01,039Z    tap - INFO replicated 0 records from "reviews" endpoint
2019-01-08 04:57:01,039Z    tap - INFO replicated 488 records from "pull_requests" endpoint
2019-01-08 04:57:01,039Z    tap - CRITICAL {"message":"API rate limit exceeded for user ID 59968.","documentation_url":"https://developer.github.com/v3/#rate-limiting"}
2019-01-08 04:57:01,040Z    tap - Traceback (most recent call last):
2019-01-08 04:57:01,040Z    tap -   File "tap-env/bin/tap-github", line 11, in <module>
2019-01-08 04:57:01,040Z    tap -     sys.exit(main())
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/singer/utils.py", line 225, in wrapped
2019-01-08 04:57:01,040Z    tap -     return fnc(*args, **kwargs)
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_github.py", line 515, in main
2019-01-08 04:57:01,040Z    tap -     do_sync(args.config, args.state, catalog)
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_github.py", line 503, in do_sync
2019-01-08 04:57:01,040Z    tap -     state = sync_func(stream_schemas, repo, state, mdata)
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_github.py", line 219, in get_all_pull_requests
2019-01-08 04:57:01,040Z    tap -     for review_comment_rec in get_review_comments_for_pr(pr_num, schemas['review_comments'], repo_path, state, mdata):
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_github.py", line 244, in get_review_comments_for_pr
2019-01-08 04:57:01,040Z    tap -     'https://api.github.com/repos/{}/pulls/{}/comments'.format(repo_path,pr_number)
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_github.py", line 100, in authed_get_all_pages
2019-01-08 04:57:01,040Z    tap -     r = authed_get(source, url, headers)
2019-01-08 04:57:01,040Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_github.py", line 91, in authed_get
2019-01-08 04:57:01,040Z    tap -     raise AuthException(resp.text)
2019-01-08 04:57:01,041Z    tap - tap_github.AuthException: {"message":"API rate limit exceeded for user ID 59968.","documentation_url":"https://developer.github.com/v3/#rate-limiting"}
2019-01-08 04:57:01,061Z target - INFO Sending batch with 2 messages for table pull_requests to https://api.stitchdata.com/v2/import/batch
2019-01-08 04:57:01,087Z target - INFO Exiting normally
2019-01-08 04:57:01,117Z   main - INFO Target exited normally with status 0
2019-01-08 04:57:01,120Z   main - INFO Exit status is: Discovery succeeded. Tap failed with code 1 and error message: "{"message":"API rate limit exceeded for user ID 59968.","documentation_url":"https://developer.github.com/v3/#rate-limiting"}". Target succeeded.

@KAllan357
Contributor

It looks like there might be two problems here:

  1. PRs, reviews, and review_comments do not honor a bookmark (though they track one), probably because of limitations in the UI.
  2. GitHub enforces a 5,000-requests-per-hour rate limit (per user, so multiple auth tokens won't help).

We'll need to do a bit more research on the first problem.

I think problem 2 could be solved with a backoff/sleep: GitHub appears to send headers indicating when the rate limit resets, though it feels awkward to write code that sleeps for up to an hour.
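The reset-header approach could be sketched with a small helper. This is a sketch only: `seconds_until_reset` is a hypothetical name, not part of the tap. GitHub's REST API does return an `X-RateLimit-Reset` header carrying the Unix epoch time at which the quota refills.

```python
import time

# Sketch only: "seconds_until_reset" is a hypothetical helper, not the
# tap's code. GitHub's X-RateLimit-Reset header carries the Unix epoch
# time (UTC) at which the request quota refills.
def seconds_until_reset(headers, now=None):
    now = time.time() if now is None else now
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    # Clamp to zero in case the reset time is already in the past.
    return max(0, reset_at - now)

# E.g. after a rate-limited response, sleep until the quota refills:
# time.sleep(seconds_until_reset(resp.headers) + 1)
```

Sleeping for exactly the advertised reset interval, rather than a fixed hour, avoids most of the awkwardness of a hard-coded long sleep.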

@sminnee
Contributor Author

sminnee commented Jan 9, 2019

I've noticed that the mechanism for fetching issue comments works like so:

  • Fetch all issues
  • For each issue, fetch comments since the last fetch date.

If you've got thousands of issues, that can be... problematic. It's not clear whether "fetch all issues since yyyy-mm-dd" will include issues that haven't been modified themselves but have had comments added; that would be good to test.

UPDATE: I'm full of crap, this is wrong.

@sminnee
Contributor Author

sminnee commented Jan 9, 2019

FYI - this issue appears to have been bypassed when I disabled the fetching of comments and review_comments, which suggests that the root cause relates to the fetching of these 2 datasets.

@sminnee
Contributor Author

sminnee commented Jan 9, 2019

OK, I've done a bit of investigation. The underlying issue is that /pulls has no ?since= filter.

However, we don't need to fetch the review comments of every pull request. In this loop - https://github.com/singer-io/tap-github/blob/master/tap_github.py#L198 - we should be able to compare the "updated_at" value to our bookmark, and if the PR hasn't been updated since the bookmark value, skip looking for its review comments and skip outputting the record to the Singer target.

We would still need to paginate through all the pull requests, but that might be okay. If not, fetching the pull requests with pulls?state=all&sort=updated&direction=desc, and then paginating only until updated_at drops below our bookmark value, would reduce the number of requests.
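The early-stop pagination idea can be sketched as follows. This is a sketch, not the tap's actual code: `prs_updated_since` is a hypothetical helper, and it assumes pages arrive sorted by updated_at descending (i.e. pulls?state=all&sort=updated&direction=desc) with ISO-8601 UTC timestamps, which compare correctly as plain strings.

```python
# Sketch only: "prs_updated_since" is a hypothetical helper. It assumes
# pages are sorted by updated_at descending and that timestamps are
# ISO-8601 UTC strings, which compare correctly lexicographically.
def prs_updated_since(pages, bookmark):
    for page in pages:          # each page is one API response body
        for pr in page:
            if pr["updated_at"] < bookmark:
                return          # everything after this is older; stop paginating
            yield pr

pages = [
    [{"number": 3, "updated_at": "2019-01-08T00:00:00Z"},
     {"number": 2, "updated_at": "2019-01-05T00:00:00Z"}],
    [{"number": 1, "updated_at": "2018-12-01T00:00:00Z"}],
]
fresh = [pr["number"] for pr in prs_updated_since(pages, "2019-01-01T00:00:00Z")]
# fresh == [3, 2]: the pre-bookmark PR is never examined
```

The break-on-first-stale-record trick is what turns the full pagination into a bounded one, so the request count scales with the number of recently updated PRs rather than the total.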

@KAllan357
Contributor

I believe this is correct. We'd still get all the PRs (and paginate through them) every sync, but only dive deeper when a PR's updated_at is greater than the bookmark.

A simple change along the lines of:

for pr in pull_requests:
    if pr['updated_at'] < bookmark_value:
        continue

    # ... rest of the sync logic ...

would be a good addition here.

sminnee pushed a commit to sminnee/tap-github that referenced this issue Jan 10, 2019
GitHub’s /pulls endpoint doesn’t have a ?since filter, so instead we
check the updated_at values of each record imported. Since we have
sorted descending by this value, we can break out of the loop once we
have found the first updated_at value that is less than the bookmark
time.

Fixes singer-io#43
@sminnee
Contributor Author

sminnee commented Jan 10, 2019

OK, I've had a go at a PR for this – my Python is a bit rusty, but this appears to work locally.

KAllan357 pushed a commit that referenced this issue Jan 10, 2019
GitHub’s /pulls endpoint doesn’t have a ?since filter, so instead we
check the updated_at values of each record imported. Since we have
sorted descending by this value, we can break out of the loop once we
have found the first updated_at value that is less than the bookmark
time.

Fixes #43
@KBorders01

I am still encountering this problem with Stitch trying to do an initial data sync, due to the number of rows in pr_commits. I am trying to address it by sleeping for 5 minutes when receiving a rate-limit error from the server in authed_get. @KAllan357, is there any reason not to make this change to the code, even though it feels awkward as you say? Sleeping for 5 minutes at a time rather than a full hour would use the time a bit more efficiently.
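The 5-minute-sleep idea could be sketched as a retry wrapper around the request call. This is a hypothetical helper, not the tap's authed_get; it also assumes a rate-limited GitHub v3 response surfaces as HTTP 403, which matches the error in the log above.

```python
import time

RATE_LIMIT_SLEEP = 300  # 5 minutes, per the suggestion above

# Sketch only: "with_rate_limit_retry" is a hypothetical wrapper, not the
# tap's authed_get. A rate-limited response is assumed to be HTTP 403.
def with_rate_limit_retry(request_fn, max_retries=12, sleep_fn=time.sleep):
    for attempt in range(max_retries + 1):
        resp = request_fn()
        if resp.status_code != 403:
            return resp                     # success (or a non-rate-limit error)
        if attempt < max_retries:
            sleep_fn(RATE_LIMIT_SLEEP)      # wait 5 minutes, then try again
    raise RuntimeError("rate limit still exceeded after retries")
```

With max_retries=12 this gives up after roughly an hour of waiting, which bounds the worst case at the same cost as sleeping through a full reset window.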

@KBorders01

I see this discussion is happening in #63, so I will take this comment there.

AJWurts pushed a commit to villagelabsco/tap-github that referenced this issue Oct 24, 2024

Fixes singer-io#43