---
output: github_document
---
# htmldf <img src="man/figures/hex.png" align="right" width="150" />
![build](https://github.com/alastairrushworth/htmldf/workflows/R-CMD-check/badge.svg)
[![codecov](https://codecov.io/gh/alastairrushworth/htmldf/branch/master/graph/badge.svg)](https://app.codecov.io/gh/alastairrushworth/htmldf)
[![CRAN status](https://www.r-pkg.org/badges/version/htmldf)](https://CRAN.R-project.org/package=htmldf)
[![](https://cranlogs.r-pkg.org/badges/htmldf)](https://CRAN.R-project.org/package=htmldf)
[![cran checks](https://cranchecks.info/badges/summary/htmldf)](https://cran.r-project.org/web/checks/check_results_htmldf.html)
Overview
---
The package `htmldf` contains a single function `html_df()` which accepts a vector of URLs and attempts to download each page, then extract and parse its html. The result is returned as a `tibble` where each row corresponds to a document, and the columns contain page attributes and metadata extracted from the html, including:
+ page title
+ inferred language (uses Google's compact language detector)
+ RSS feeds
+ tables coerced to tibbles, where possible
+ hyperlinks
+ image links
+ social media profiles
+ the inferred programming language of any text with code tags
+ page size, generator and server
+ page accessed date
+ page published or last updated dates
+ HTTP status code
+ full page source html
Installation
---
To install the CRAN version of the package:
```{r, eval=FALSE}
install.packages('htmldf')
```
To install the development version of the package:
```{r, eval=FALSE}
remotes::install_github('alastairrushworth/htmldf')
```
Usage
---
First define a vector of URLs you want to gather information from. The function `html_df()` returns a `tibble` where each row corresponds to a webpage, and each column corresponds to an attribute of that webpage:
```{r, message=FALSE, warning=FALSE}
library(htmldf)
library(dplyr)
# An example vector of URLs to fetch data for
urlx <- c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
"https://medium.com/dair-ai/pytorch-1-2-introduction-guide-f6fa9bb7597c",
"https://www.tensorflow.org/tutorials/images/cnn",
"https://www.analyticsvidhya.com/blog/2019/09/introduction-to-pytorch-from-scratch/")
# use html_df() to gather data
z <- html_df(urlx, show_progress = FALSE)
# have a quick look at the first page
glimpse(z[1, ])
```
To see the page titles, look at the `title` column.
```{r}
z %>% select(title, url2)
```
Where tables are embedded in a page's `<table>` tags, these will be gathered into the list column `tables`. `html_df()` will attempt to coerce each table to a `tibble`; where that isn't possible, the raw html is returned instead.
```{r}
z$tables
```
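Because pages that could not be fully parsed fall back to raw html, it can be handy to keep only the entries that really are tibbles. The snippet below is a minimal sketch of one way to do this, assuming each element of `tables` is a list containing a mix of tibbles and raw html strings:

```{r, eval=FALSE}
library(purrr)
# keep only the tables that were successfully coerced to data frames,
# dropping any raw html fallbacks (structure of `tables` assumed as above)
parsed_tables <- map(z$tables, ~ keep(.x, is.data.frame))
```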
`html_df()` does its best to find RSS feeds embedded in the page:
```{r}
z$rss
```
`html_df()` will try to parse out any social media profiles embedded or mentioned on the page. Currently, this includes profiles for the following sites:
`bitbucket`, `dev.to`, `discord`, `facebook`, `github`, `gitlab`, `instagram`, `kakao`, `keybase`, `linkedin`, `mastodon`, `medium`, `orcid`, `patreon`, `researchgate`, `stackoverflow`, `reddit`, `telegram`, `twitter` and `youtube`
```{r}
z$social
```
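If you just want a flat list of the profile links found across all pages, a quick sketch is to unlist the column; this assumes each element of `social` can be flattened to a character vector of URLs:

```{r, eval=FALSE}
# flatten the social list column into a single vector of profile links
# (assumes each element of `social` unlists to character URLs)
unique(unlist(z$social))
```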
Code language is inferred from `<code>` chunks using a predictive model. The `code_lang` column contains a numeric score, where values near 1 indicate mostly R code and values near -1 indicate mostly Python code:
```{r}
z %>% select(code_lang, url2)
```
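As an illustrative sketch (not part of the package), the score can be thresholded to label each page; the cut-off at 0 is an arbitrary choice here:

```{r, eval=FALSE}
# label each page as mostly R or mostly Python by the sign of code_lang;
# the 0 cut-off is an illustrative choice, not defined by the package
z %>%
  mutate(detected_lang = ifelse(code_lang > 0, "R", "Python")) %>%
  select(url2, code_lang, detected_lang)
```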
Page publication or last updated dates, where found, are returned in the `published` column:
```{r}
z %>% select(published, url2)
```
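For example, the same columns can be sorted to view the pages in order of publication; this is just standard `dplyr` on the returned tibble:

```{r, eval=FALSE}
# order pages from oldest to newest by publication date
z %>%
  select(published, url2) %>%
  arrange(published)
```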
Comments? Suggestions? Issues?
---
Any feedback is welcome! Feel free to open a GitHub issue or send me a message on [twitter](https://twitter.com/rushworth_a).