Skip to content

Commit

Permalink
Fix unnecessary re-fetching of pull request data (singer-io#49)
Browse files Browse the repository at this point in the history
GitHub’s /pulls endpoint doesn’t have a ?since filter, so instead we
check the updated_at values of each record imported. Since we have
sorted descending by this value, we can break out of the loop once we
have found the first updated_at value that is less than the bookmark
time.

Fixes singer-io#43
  • Loading branch information
Sam Minnée authored and KAllan357 committed Jan 10, 2019
1 parent 5f1b09a commit a428856
Showing 1 changed file with 16 additions and 1 deletion.
17 changes: 16 additions & 1 deletion tap_github.py
Original file line number Diff line number Diff line change
Expand Up @@ -187,15 +187,30 @@ def get_all_pull_requests(schemas, repo_path, state, mdata):
'''
https://developer.github.com/v3/pulls/#list-pull-requests
'''

bookmark_value = get_bookmark(state, repo_path, "pull_requests", "since");
if bookmark_value:
bookmark_time = singer.utils.strptime_to_utc(bookmark_value)
else:
bookmark_time = 0

with metrics.record_counter('pull_requests') as counter:
with metrics.record_counter('reviews') as reviews_counter:
for response in authed_get_all_pages(
'pull_requests',
'https://api.github.com/repos/{}/pulls?state=all'.format(repo_path)
'https://api.github.com/repos/{}/pulls?state=all&sort=updated&direction=desc'.format(repo_path)
):
pull_requests = response.json()
extraction_time = singer.utils.now()
for pr in pull_requests:

# skip records that haven't been updated since the last run
# the GitHub API doesn't currently allow a ?since param for pulls
# once we find the first piece of old data we can return, thanks to
# the sorting
if bookmark_time and singer.utils.strptime_to_utc(pr.get('updated_at')) < bookmark_time:
return state

pr_num = pr.get('number')
pr['_sdc_repository'] = repo_path

Expand Down

0 comments on commit a428856

Please sign in to comment.