-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added download_file()
; fixed missing word in write_disk()
docs
#599
Closed
Closed
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,155 @@ | ||
possibly <- function(.f, otherwise, quiet = TRUE) { | ||
force(otherwise) | ||
function(...) { | ||
tryCatch( | ||
.f(...), | ||
error = function(e) { | ||
if (!quiet) | ||
message("Error: ", e$message) | ||
otherwise | ||
}, | ||
interrupt = function(e) { | ||
stop("Terminated by user", call. = FALSE) | ||
} | ||
) | ||
} | ||
} | ||
|
||
safe_GET <- possibly(GET, NULL, quiet = TRUE) | ||
|
||
#' Download file from the Internet (cache-aware) | ||
#' | ||
#' This is an alternative to [utils::download.file()] and a convenience wrapper for | ||
#' [GET()] + [httr::write_disk()] to perform file downloads. | ||
#' | ||
#' Since this function uses [GET()], callers can pass in `httr` configuration | ||
#' options to customize the behaviour of the download process (e.g. specify a `User-Agent` via | ||
#' [user_agent()], set proxy config via [use_proxy()], etc.). | ||
#' | ||
#' The function is also "cache-aware" in the sense that you deliberately have to specify | ||
#' `overwrite = TRUE` to force a re-download. This has the potential to save bandwidth | ||
#' of both the caller and the site hosting files for download. | ||
#' | ||
#' @note While this function supports specifying multiple URLs and download paths it | ||
#' does not perform concurrent downloads. | ||
#' @param url the url(s) of the file to retrieve. If multiple URLs are provided then the same | ||
#' number of `path`s must also be provided. | ||
#' @param path Path(s) to save content to. If more than one `path` is specified then the same | ||
#' number of `url`s must also be provided. THis parameter will be [path.expand()]ed. | ||
#' @param overwrite Will only overwrite existing path if `TRUE`. | ||
#' @param ... passed on to [GET()] | ||
#' @return a data frame containing the `url`(s), `path`(s), cache status, and HTTP status code(s). | ||
#' If there was an error downloading a file the path, status code, and HTTP status | ||
#' columns will be `NA`. If the file was now re-downloaded the status code will be 399 | ||
#' @seealso [GET()]; [write_disk()] | ||
#' @export | ||
#' @examples | ||
#' tmp1 <- tempfile() | ||
#' tmp2 <- tempfile() | ||
#' tmp3 <- tempfile() | ||
#' | ||
#' download_file("https://google.com", tmp1) # downloads fine | ||
#' download_file("https://google.com", tmp1) # doesn't re-download since it's cached | ||
#' download_file("https://google.com", tmp1, overwrite = TRUE) # re-downloads (overwrites file) | ||
#' download_file("https://google.com", tmp2) # re-downloads (new file) | ||
#' download_file("badurl", tmp3) # handles major errors gracefully | ||
#' | ||
#' # multi-file example with no-caching | ||
#' download_file( | ||
#' c(rep("https://google.com", 2), "badurl"), | ||
#' c(tmp1, tmp2, tmp3), | ||
#' overwrite = TRUE | ||
#' ) | ||
#' | ||
#' # multi-file example with caching | ||
#' download_file( | ||
#' c(rep("https://google.com", 2), "badurl"), | ||
#' c(tmp1, tmp2, tmp3), | ||
#' overwrite = FALSE | ||
#' ) | ||
download_file <- function(url, path, overwrite = FALSE, ...) { | ||
|
||
url <- as.character(url) | ||
path <- as.character(path) | ||
|
||
if (length(url) != length(path)) { | ||
stop("The lengths of the 'url' and 'path' parameters must be equal.", call.=FALSE) | ||
} | ||
|
||
path <- path.expand(path) | ||
|
||
overwrite <- as.logical(overwrite) | ||
stopifnot(length(overwrite) == 1) | ||
|
||
out <- vector("list", length = length(url)) | ||
|
||
for (idx in seq_along(url)) { | ||
|
||
u <- url[[idx]] | ||
p <- path[[idx]] | ||
|
||
if (file.exists(p)) { | ||
|
||
if (overwrite) { # file exists but caller wants to re-download | ||
res <- safe_GET(u, write_disk(p, overwrite = TRUE), ...) | ||
if (is.null(res)) { | ||
p <- NA_character_ | ||
cache_used = FALSE | ||
status <- NA_integer_ | ||
} else { | ||
cache_used <- FALSE | ||
status <- status_code(res) | ||
} | ||
} else { # file exists but caller does not want to re-download | ||
if (is.null(parse_url(u)[["hostname"]])) { # quick non-network test for invalid URL | ||
p <- NA_character_ | ||
cache_used = FALSE | ||
status <- NA_integer_ | ||
} else { | ||
cache_used <- TRUE | ||
status <- 399L | ||
} | ||
} | ||
|
||
} else { # file does not exist, so do the thing | ||
|
||
res <- safe_GET(u, write_disk(p, overwrite = overwrite), ...) | ||
|
||
if (is.null(res)) { | ||
p <- NA_character_ | ||
cache_used <- FALSE | ||
status <- NA_integer_ | ||
} else { | ||
status <- status_code(res) | ||
cache_used <- FALSE | ||
} | ||
|
||
} | ||
|
||
out[[idx]] <- data.frame( | ||
url = u, path = p, | ||
status_code = status, | ||
cache_used = cache_used, | ||
stringsAsFactors = FALSE | ||
) | ||
|
||
} | ||
|
||
out <- do.call(rbind.data.frame, out) | ||
class(out) <- c("tbl_df", "tbl", "data.frame") | ||
|
||
invisible(out) | ||
|
||
} | ||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Oops, something went wrong.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Oops, something went wrong.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
context("test-download-file") | ||
|
||
test_that("download_file length 1 ops behave as expected", { | ||
|
||
tmp1 <- tempfile() | ||
tmp2 <- tempfile() | ||
tmp3 <- tempfile() | ||
|
||
res <- download_file("https://httpbin.org", tmp1) | ||
expect_equal(res$status_code[[1]], 200L) | ||
expect_false(res$cache_used[[1]]) | ||
|
||
res <- download_file("https://httpbin.org", tmp1) | ||
expect_equal(res$status_code[[1]], 399L) | ||
expect_true(res$cache_used[[1]]) | ||
|
||
res <- download_file("https://httpbin.org", tmp1, overwrite = TRUE) | ||
expect_equal(res$status_code[[1]], 200L) | ||
expect_false(res$cache_used[[1]]) | ||
|
||
res <- download_file("badurl", tmp3) | ||
expect_true(is.na(res$status_code[[1]])) | ||
expect_false(res$cache_used[[1]]) | ||
|
||
}) | ||
|
||
test_that("download_file multi-file ops behave as expected", { | ||
|
||
tmp1 <- tempfile() | ||
tmp2 <- tempfile() | ||
tmp3 <- tempfile() | ||
|
||
res <- download_file( | ||
c(rep("https://google.com", 2), "badurl"), | ||
c(tmp1, tmp2, tmp3), | ||
overwrite = TRUE | ||
) | ||
|
||
expect_identical(res$status_code, c(200L, 200L, NA)) | ||
expect_identical(res$cache_used, c(FALSE, FALSE, FALSE)) | ||
|
||
res <- download_file( | ||
c(rep("https://google.com", 2), "badurl"), | ||
c(tmp1, tmp2, tmp3), | ||
overwrite = FALSE | ||
) | ||
|
||
expect_identical(res$status_code, c(399L, 399L, NA)) | ||
expect_identical(res$cache_used, c(TRUE, TRUE, FALSE)) | ||
|
||
}) |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes me think of something I do in
usethis::use_course()
and recently decided to not present as a PR into curl (jeroen/curl#187): trying to determine local file name from the Content-Disposition header (and falling back to base name of the URL).@hrbrmstr Do you have any thoughts on that? Maybe the curl issue is the best place to put them btw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a super gd question. I have a similar idiom setup for some light crawls we do but hadn't thought to make it generic somewhere.
I wonder if adding a "helper" to/for
write_disk()
might be a way to go (not surecurl
is the right place for it).Something like (seriously off the top of my head)
write_disk <- function(path, infer = FALSE, overwrite = FALSE)
where if
infer
wereTRUE
it'd have the behaviour you posited:Content-Disposition
and use that if found (potentially sanitizing it since it's theoretically untrusted content)/
AND store it in the directory pointed to in
path
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you describe it basically exactly what I do in usethis (Content-Disposition > last bit of URL > random name + sanitization):
https://github.com/r-lib/usethis/blob/master/R/course.R
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 It'd be pretty straightforward for me to add this to
write_disk()
(on holiday in 2 days so wldn't be until after that unless it seriously downpours in Acadia :-) if folks deem that a worthy add.