ETL in background jobs

Long-running data collection and aggregation tasks (commonly referred to as ETL) sometimes need to be done in R. These tasks can be run as background jobs to keep the current R session free. This is especially helpful when the background job saves a small portion of the data first and then keeps running to collect more. The user can then experiment with the initial data and build a first analysis while the rest of the dataset is collected in the background.
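
As a rough illustration of starting such a job from code, the sketch below wraps rstudioapi::jobRunScript, which starts a script as a local background job when called from inside RStudio. The helper name run_in_background and its arguments are hypothetical and not part of this example's files; outside RStudio the helper only prints a message.

```r
# Hypothetical helper: start `script_path` as an RStudio local background job.
# Returns TRUE if a job was started, FALSE when not running inside RStudio.
run_in_background <- function(script_path, job_name = basename(script_path)) {
  in_rstudio <- requireNamespace("rstudioapi", quietly = TRUE) &&
    rstudioapi::isAvailable()
  if (!in_rstudio) {
    message("Not running inside RStudio; run the script with Rscript instead.")
    return(invisible(FALSE))
  }
  # Run the script in a separate R process so the current session stays free.
  rstudioapi::jobRunScript(
    path = script_path,
    name = job_name,
    workingDir = getwd()
  )
  invisible(TRUE)
}

# Usage (assumes a script named etl.R in the working directory):
# run_in_background("etl.R")
```

The same job can also be started by hand from the RStudio Jobs pane; the helper is only a convenience for scripted workflows.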

This example uses the rtweet package to collect data from Twitter. By default, tweet search queries return at most 18,000 results within a 15-minute window. The search_tweets function includes a retryonratelimit parameter that can be used to collect more than 18,000 tweets for a given search. However, this can take a long time to run, because it waits for the 15-minute window to pass before submitting another query. Instead of collecting all the data at once, etl.R collects just a few tweets, saves the results, and then collects more tweets. Rather than waiting for all the data to be collected, a user can send etl.R to a background job (either local or Launcher), then use the preliminary results to start designing an analysis while the remaining results are collected. Since etl.R saves data as .rds files, there's no need to copy job results into the global environment.
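
The collect-save-repeat pattern described above can be sketched as follows. This is not the contents of etl.R, only a minimal illustration under stated assumptions: the helper names collect_in_batches and read_collected, the data directory, and the tweets_NNN.rds file naming are all hypothetical, and the fetching function is passed in as an argument so the sketch runs without Twitter credentials. It also assumes every batch has the same columns.

```r
# Hypothetical helper: collect `n_batches` batches and save each one to
# `out_dir` as soon as it arrives. `fetch_batch` is any function that takes
# the batch index and returns a data frame.
collect_in_batches <- function(fetch_batch, n_batches, out_dir = "data") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (i in seq_len(n_batches)) {
    batch <- fetch_batch(i)
    final_path <- file.path(out_dir, sprintf("tweets_%03d.rds", i))
    # Save under a temporary name and rename afterwards, so a reader in
    # another session does not pick up a partially written file.
    tmp_path <- paste0(final_path, ".tmp")
    saveRDS(batch, tmp_path)
    file.rename(tmp_path, final_path)
  }
  invisible(out_dir)
}

# Hypothetical helper: load whatever batches have been saved so far.
# Returns NULL if nothing has been written yet.
read_collected <- function(out_dir = "data") {
  files <- list.files(out_dir, pattern = "\\.rds$", full.names = TRUE)
  if (length(files) == 0) {
    return(NULL)
  }
  do.call(rbind, lapply(files, readRDS))
}

# Usage with rtweet (requires Twitter credentials; "#rstats" is a placeholder
# query, and paging to older tweets between batches is omitted here):
# fetch <- function(i) rtweet::search_tweets("#rstats", n = 100)
# collect_in_batches(fetch, n_batches = 10)
```

While the job runs, calling read_collected() in the interactive session returns the batches written so far, which is what makes the preliminary analysis possible without importing anything from the job's environment.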