Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added download_file(); fixed missing word in write_disk() docs #599

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,7 @@ export(content_type_json)
export(content_type_xml)
export(cookies)
export(curl_docs)
export(download_file)
export(get_callback)
export(guess_media)
export(handle)
Expand Down
3 changes: 3 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,9 @@
* `RETRY()` now throws the correct error message if an error occurs during the
request (@austin3dickey, #581).

* New `download_file()` function to support cache-aware single and multi-URL
file downloads. (@hrbrmstr)

# httr 1.4.0

## OAuth
Expand Down
155 changes: 155 additions & 0 deletions R/download-file.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,155 @@
possibly <- function(.f, otherwise, quiet = TRUE) {
force(otherwise)
function(...) {
tryCatch(
.f(...),
error = function(e) {
if (!quiet)
message("Error: ", e$message)
otherwise
},
interrupt = function(e) {
stop("Terminated by user", call. = FALSE)
}
)
}
}

safe_GET <- possibly(GET, NULL, quiet = TRUE)

#' Download file from the Internet (cache-aware)
#'
#' This is an alternative to [utils::download.file()] and a convenience wrapper for
#' [GET()] + [httr::write_disk()] to perform file downloads.
#'
#' Since this function uses [GET()], callers can pass in `httr` configuration
#' options to customize the behaviour of the download process (e.g. specify a `User-Agent` via
#' [user_agent()], set proxy config via [use_proxy()], etc.).
#'
#' The function is also "cache-aware" in the sense that you deliberately have to specify
#' `overwrite = TRUE` to force a re-download. This has the potential to save bandwidth
#' of both the caller and the site hosting files for download.
#'
#' @note While this function supports specifying multiple URLs and download paths it
#' does not perform concurrent downloads.
#' @param url the url(s) of the file to retrieve. If multiple URLs are provided then the same
#' number of `path`s must also be provided.
#' @param path Path(s) to save content to. If more than one `path` is specified then the same
#' number of `url`s must also be provided. THis parameter will be [path.expand()]ed.
#' @param overwrite Will only overwrite existing path if `TRUE`.
#' @param ... passed on to [GET()]
#' @return a data frame containing the `url`(s), `path`(s), cache status, and HTTP status code(s).
#' If there was an error downloading a file the path, status code, and HTTP status
#' columns will be `NA`. If the file was now re-downloaded the status code will be 399
#' @seealso [GET()]; [write_disk()]
#' @export
#' @examples
#' tmp1 <- tempfile()
#' tmp2 <- tempfile()
#' tmp3 <- tempfile()
#'
#' download_file("https://google.com", tmp1) # downloads fine
#' download_file("https://google.com", tmp1) # doesn't re-download since it's cached
#' download_file("https://google.com", tmp1, overwrite = TRUE) # re-downloads (overwrites file)
#' download_file("https://google.com", tmp2) # re-downloads (new file)
#' download_file("badurl", tmp3) # handles major errors gracefully
#'
#' # multi-file example with no-caching
#' download_file(
#' c(rep("https://google.com", 2), "badurl"),
#' c(tmp1, tmp2, tmp3),
#' overwrite = TRUE
#' )
#'
#' # multi-file example with caching
#' download_file(
#' c(rep("https://google.com", 2), "badurl"),
#' c(tmp1, tmp2, tmp3),
#' overwrite = FALSE
#' )
download_file <- function(url, path, overwrite = FALSE, ...) {

url <- as.character(url)
path <- as.character(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me think of something I do in usethis::use_course() and recently decided to not present as a PR into curl (jeroen/curl#187): trying to determine local file name from the Content-Disposition header (and falling back to base name of the URL).

@hrbrmstr Do you have any thoughts on that? Maybe the curl issue is the best place to put them btw.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a super gd question. I have a similar idiom setup for some light crawls we do but hadn't thought to make it generic somewhere.

I wonder if adding a "helper" to/for write_disk() might be a way to go (not sure curl is the right place for it).

Something like (seriously off the top of my head) write_disk <- function(path, infer = FALSE, overwrite = FALSE)

where if infer were TRUE it'd have the behaviour you posited:

  • look for Content-Disposition and use that if found (potentially sanitizing it since it's theoretically untrusted content)
  • use last part of URL if ^^ does not exist, BUT
  • use a generated UUID-ish name if ^^ is a /

AND store it in the directory pointed to in path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you describe it basically exactly what I do in usethis (Content-Disposition > last bit of URL > random name + sanitization):

https://github.com/r-lib/usethis/blob/master/R/course.R

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 It'd be pretty straightforward for me to add this to write_disk() (on holiday in 2 days so wldn't be until after that unless it seriously downpours in Acadia :-) if folks deem that a worthy add.


if (length(url) != length(path)) {
stop("The lengths of the 'url' and 'path' parameters must be equal.", call.=FALSE)
}

path <- path.expand(path)

overwrite <- as.logical(overwrite)
stopifnot(length(overwrite) == 1)

out <- vector("list", length = length(url))

for (idx in seq_along(url)) {

u <- url[[idx]]
p <- path[[idx]]

if (file.exists(p)) {

if (overwrite) { # file exists but caller wants to re-download
res <- safe_GET(u, write_disk(p, overwrite = TRUE), ...)
if (is.null(res)) {
p <- NA_character_
cache_used = FALSE
status <- NA_integer_
} else {
cache_used <- FALSE
status <- status_code(res)
}
} else { # file exists but caller does not want to re-download
if (is.null(parse_url(u)[["hostname"]])) { # quick non-network test for invalid URL
p <- NA_character_
cache_used = FALSE
status <- NA_integer_
} else {
cache_used <- TRUE
status <- 399L
}
}

} else { # file does not exist, so do the thing

res <- safe_GET(u, write_disk(p, overwrite = overwrite), ...)

if (is.null(res)) {
p <- NA_character_
cache_used <- FALSE
status <- NA_integer_
} else {
status <- status_code(res)
cache_used <- FALSE
}

}

out[[idx]] <- data.frame(
url = u, path = p,
status_code = status,
cache_used = cache_used,
stringsAsFactors = FALSE
)

}

out <- do.call(rbind.data.frame, out)
class(out) <- c("tbl_df", "tbl", "data.frame")

invisible(out)

}












6 changes: 3 additions & 3 deletions R/write-function.R
Original file line number Diff line number Diff line change
Expand Up @@ -17,17 +17,17 @@ write_function <- function(subclass, ...) {
#' it avoids a round-trip to disk. If you want to save a file that's bigger
#' than memory, use `write_disk()` to save it to a known path.
#'
#' @param path Path to content to.
#' @param path Path to save content to.
#' @param overwrite Will only overwrite existing `path` if TRUE.
#' @export
#' @examples
#' tmp <- tempfile()
#' r1 <- GET("https://www.google.com", write_disk(tmp))
#' readLines(tmp)
#'
#'
#' # The default
#' r2 <- GET("https://www.google.com", write_memory())
#'
#'
#' # Save a very large file
#' \dontrun{
#' GET(
Expand Down
69 changes: 69 additions & 0 deletions man/download_file.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions man/httr-package.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/write_disk.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

51 changes: 51 additions & 0 deletions tests/testthat/test-download-file.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
context("test-download-file")

test_that("download_file length 1 ops behave as expected", {

tmp1 <- tempfile()
tmp2 <- tempfile()
tmp3 <- tempfile()

res <- download_file("https://httpbin.org", tmp1)
expect_equal(res$status_code[[1]], 200L)
expect_false(res$cache_used[[1]])

res <- download_file("https://httpbin.org", tmp1)
expect_equal(res$status_code[[1]], 399L)
expect_true(res$cache_used[[1]])

res <- download_file("https://httpbin.org", tmp1, overwrite = TRUE)
expect_equal(res$status_code[[1]], 200L)
expect_false(res$cache_used[[1]])

res <- download_file("badurl", tmp3)
expect_true(is.na(res$status_code[[1]]))
expect_false(res$cache_used[[1]])

})

test_that("download_file multi-file ops behave as expected", {

tmp1 <- tempfile()
tmp2 <- tempfile()
tmp3 <- tempfile()

res <- download_file(
c(rep("https://google.com", 2), "badurl"),
c(tmp1, tmp2, tmp3),
overwrite = TRUE
)

expect_identical(res$status_code, c(200L, 200L, NA))
expect_identical(res$cache_used, c(FALSE, FALSE, FALSE))

res <- download_file(
c(rep("https://google.com", 2), "badurl"),
c(tmp1, tmp2, tmp3),
overwrite = FALSE
)

expect_identical(res$status_code, c(399L, 399L, NA))
expect_identical(res$cache_used, c(TRUE, TRUE, FALSE))

})