Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added download_file(); fixed missing word in write_disk() docs #599

Closed
wants to merge 3 commits into from

Conversation

hrbrmstr
Copy link
Contributor

GET + write_disk() does a fine job dealing with downloads but is somewhat verbose to type.

Base download.file() works fine but isn't cached-file-aware like write_disk() is.

GET + write_disk() is not multi-URL/file-aware like base download.file() is.

Base download.file() doesn't truly have all of the nice browser-like HTTP operations like GET does.

Abusing bandwidth (i.e. needlessly re-downloading) even in these modern times is not cool.

Thus begat download_file().

The function:

  • supports single URL/path and multiple URLs/paths for multi-file downloading (tho does not perform these ops in parallel like download.file() can in some circumstances)
  • is cached-file aware but also validates that a given URL is a valid URL (from a URL structure perspective, not "reach out and touch someone" perspective) when returning a response.
  • returns a data frame (classed as a tibble) with a summary of the operations that includes the requested URL, file path, status code and whether a cached file was used. A 399 status code is used to identify whether a URL resource was already in a cached state since 304 Not Modified does not make sense (since the function does not test for existence) and 399 is not likely to be used in our lifetime as a "standard" 300-level response

Example(s):

library(httr)
library(tibble)

tmp1 <- tempfile()
tmp2 <- tempfile()
tmp3 <- tempfile()

(download_file("https://google.com", tmp1)) # downloads fine
## # A tibble: 1 x 4
##   url          path                                  status_code cache_used
##   <chr>        <chr>                                       <int> <lgl>     
## 1 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         200 FALSE

(download_file("https://google.com", tmp1)) # doesn't re-download since it's cached
## # A tibble: 1 x 4
##   url          path                                  status_code cache_used
##   <chr>        <chr>                                       <int> <lgl>     
## 1 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         399 TRUE

(download_file("https://google.com", tmp1, overwrite = TRUE)) # re-downloads (overwrites file)
## # A tibble: 1 x 4
##   url          path                                  status_code cache_used
##   <chr>        <chr>                                       <int> <lgl>     
## 1 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         200 FALSE

(download_file("https://google.com", tmp2)) # re-downloads (new file)
## # A tibble: 1 x 4
##   url          path                                  status_code cache_used
##   <chr>        <chr>                                       <int> <lgl>     
## 1 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         200 FALSE

(download_file("badurl", tmp3)) # handles major errors gracefully
## # A tibble: 1 x 4
##   url    path  status_code cache_used
##   <chr>  <chr>       <int> <lgl>     
## 1 badurl <NA>           NA FALSE

# multi-file example with no-caching
(download_file(
  c(rep("https://google.com", 2), "badurl"),
  c(tmp1, tmp2, tmp3),
  overwrite = TRUE
))
## # A tibble: 3 x 4
##   url          path                                  status_code cache_used
##   <chr>        <chr>                                       <int> <lgl>     
## 1 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         200 FALSE     
## 2 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         200 FALSE     
## 3 badurl       <NA>                                           NA FALSE

# multi-file example with caching
(download_file(
  c(rep("https://google.com", 2), "badurl"),
  c(tmp1, tmp2, tmp3),
  overwrite = FALSE
))
## # A tibble: 3 x 4
##   url          path                                  status_code cache_used
##   <chr>        <chr>                                       <int> <lgl>     
## 1 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         399 TRUE      
## 2 https://goo… /var/folders/1w/2d82v7ts3gs98tc6v772…         399 TRUE      
## 3 badurl       <NA>                                           NA FALSE

There are tests for single & multi-file downloads.

I also added a missing word to the path parameter in write_disk() docs.

devtools::check() passed.

Copy link
Member

@jennybc jennybc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of this convenience function. It's not my call but I wonder if this sort of layer is something we would book against httr2 (https://github.com/r-lib/httr2/issues) or perhaps there is no reason to not just get on with it.

download_file <- function(url, path, overwrite = FALSE, ...) {

url <- as.character(url)
path <- as.character(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me think of something I do in usethis::use_course() and recently decided to not present as a PR into curl (jeroen/curl#187): trying to determine local file name from the Content-Disposition header (and falling back to base name of the URL).

@hrbrmstr Do you have any thoughts on that? Maybe the curl issue is the best place to put them btw.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a super gd question. I have a similar idiom setup for some light crawls we do but hadn't thought to make it generic somewhere.

I wonder if adding a "helper" to/for write_disk() might be a way to go (not sure curl is the right place for it).

Something like (seriously off the top of my head) write_disk <- function(path, infer = FALSE, overwrite = FALSE)

where if infer were TRUE it'd have the behaviour you posited:

  • look for Content-Disposition and use that if found (potentially sanitizing it since it's theoretically untrusted content)
  • use last part of URL if ^^ does not exist, BUT
  • use a generated UUID-ish name if ^^ is a /

AND store it in the directory pointed to in path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you describe it basically exactly what I do in usethis (Content-Disposition > last bit of URL > random name + sanitization):

https://github.com/r-lib/usethis/blob/master/R/course.R

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 It'd be pretty straightforward for me to add this to write_disk() (on holiday in 2 days so wldn't be until after that unless it seriously downpours in Acadia :-) if folks deem that a worthy add.

@hadley
Copy link
Member

hadley commented Oct 31, 2023

httr has been superseded in favour of httr2, so is no longer under active development. Thanks for using httr, thanks for contributing, and my apologies that your contribution never made it into the package.

@hadley hadley closed this Oct 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants