
Safe pulls: Implement pull by year instead of one shot #7

Open
davidclarance opened this issue Sep 1, 2019 · 2 comments
Labels
enhancement New feature or request

Comments

@davidclarance
Owner

There's a suggestion to make the default date range the entire period available. Implementing this in the current functions would substantially increase server load, which isn't healthy for the server or the package. We therefore need to break requests up by year, rather than pulling everything in one shot, and combine the results inside the function.
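A minimal sketch of the year-splitting idea in base R. The helper name `split_by_year` is hypothetical and not part of the package; it only shows how a date range could be chopped into calendar-year chunks before any requests are made:

```r
# Hypothetical helper: split an inclusive date range into calendar-year
# chunks, each of which can become one smaller server request.
split_by_year <- function(start_date, end_date) {
  start <- as.Date(start_date)
  end <- as.Date(end_date)
  years <- seq(as.integer(format(start, "%Y")), as.integer(format(end, "%Y")))
  lapply(years, function(y) {
    list(
      start = max(start, as.Date(sprintf("%d-01-01", y))),
      end = min(end, as.Date(sprintf("%d-12-31", y)))
    )
  })
}

# Example: 2007-06-15 to 2009-02-01 yields three chunks:
# 2007-06-15..2007-12-31, 2008-01-01..2008-12-31, 2009-01-01..2009-02-01.
```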

@davidclarance davidclarance added the enhancement New feature or request label Sep 1, 2019
@bluehill

bluehill commented Sep 2, 2019

Internal data breaks are a good idea, but the extract_species function currently stays within safe data-call parameters. The most data-intensive species call, all available records of the Cape Turtle Dove for South Africa, returns about 150,000 records, so the function is still well within the margins of good etiquette (<250,000). Reproducible example:
ctdove_raw_records <- extract_species(
  species_ids = 316,
  start_date = '2007-01-01',
  end_date = '2019-09-01',
  region_type = 'country',
  region_id = 'southafrica'
)

Happy coding :)

@davidclarance
Owner Author

Great point and thanks for the example. I think I'll still do it for three reasons:

  • The operative point is that this works now. We want to avoid accumulating technical debt, so if something is easy to implement, why not do it now, especially as the extract functions are the bedrock of all future functions.
  • Even though 250K is the limit, that doesn't mean multiple queries operating at 250K are healthy. Think of it as many users drawing on one shared CPU: you don't want everyone maxing out their limits.
  • The larger the query, the longer a rollback takes after a timeout or server error, and the longer competing queries are blocked.

I think the addition is simple, can be used across all the extract functions and provides a safety net.
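One way that safety net could look, again as a sketch in base R under my own assumptions: the wrapper name `pull_by_year`, the idea of passing the extract function in as an argument, and the assumption that every extract function returns a data frame are all mine, not the package's:

```r
# Hypothetical wrapper: call any extract function one calendar year at a
# time and row-bind the pieces, so no single request spans the full range.
pull_by_year <- function(extract_fn, start_date, end_date, ...) {
  start <- as.Date(start_date)
  end <- as.Date(end_date)
  years <- seq(as.integer(format(start, "%Y")), as.integer(format(end, "%Y")))
  results <- lapply(years, function(y) {
    extract_fn(
      ...,
      start_date = max(start, as.Date(sprintf("%d-01-01", y))),
      end_date = min(end, as.Date(sprintf("%d-12-31", y)))
    )
  })
  do.call(rbind, results)
}

# Usage, mirroring the reproducible example above:
# ctdove_raw_records <- pull_by_year(
#   extract_species,
#   start_date = '2007-01-01',
#   end_date = '2019-09-01',
#   species_ids = 316,
#   region_type = 'country',
#   region_id = 'southafrica'
# )
```

Because the wrapper only touches `start_date` and `end_date`, the same pattern could sit in front of all the extract functions.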
