-
Notifications
You must be signed in to change notification settings - Fork 20
/
README.Rmd
342 lines (259 loc) · 12.9 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
# tidyxl
<!-- badges: start -->
[![R-CMD-check](https://github.com/nacnudus/tidyxl/workflows/R-CMD-check/badge.svg)](https://github.com/nacnudus/tidyxl/actions)
[![Cran Status](http://www.r-pkg.org/badges/version/tidyxl)](https://cran.r-project.org/package=tidyxl)
[![Cran Downloads](https://cranlogs.r-pkg.org/badges/tidyxl)](https://www.r-pkg.org/pkg/tidyxl)
[![codecov](https://app.codecov.io/gh/nacnudus/tidyxl/coverage.svg?branch=master)](https://app.codecov.io/gh/nacnudus/tidyxl)
[![R-CMD-check](https://github.com/nacnudus/tidyxl/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/nacnudus/tidyxl/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
[tidyxl](https://github.com/nacnudus/tidyxl) imports non-tabular data from Excel
files into R. It exposes cell content, position, formatting and comments in a
tidy structure for further manipulation, especially by the
[unpivotr](https://github.com/nacnudus/unpivotr) package. It supports the
xml-based file formats '.xlsx' and '.xlsm' via the embedded
[RapidXML](https://rapidxml.sourceforge.net) C++ library. It does not support
the binary file formats '.xlsb' or '.xls'.
It also provides a function `xlex()` for tokenizing formulas. See the
[vignette](https://nacnudus.github.io/tidyxl/articles/smells.html) for
details. It is useful for detecting 'spreadsheet smells' (poor practice such as
embedding constants in formulas, or using deep levels of nesting), and for
understanding the dependency structures within spreadsheets.
## Make cells tidy
Tidyxl represents each cell as one row in a dataframe. Like this.
```{r, echo = FALSE, fig.alt = "Gif of tidyxl converting cells into a tidy representation of one row per cell"}
knitr::include_graphics("man/figures/tidy_xlsx.gif")
```
What can you do with tidy cells? The best places to start are:
* [Spreadsheet Munging
Strategies](https://nacnudus.github.io/spreadsheet-munging-strategies/), a
free, online cookbook using [tidyxl](https://github.com/nacnudus/tidyxl) and
[unpivotr](https://github.com/nacnudus/unpivotr)
* [Screencasts](https://www.youtube.com/watch?v=1sinC7wsS5U) on YouTube.
* [Worked examples](https://github.com/nacnudus/ukfarm) on GitHub.
Otherwise see examples below.
***Breaking changes***
The version 1.0.0 release has some breaking changes. See `NEWS.md` for details.
The previous version can be installed as follows.
```r
devtools::install_version("tidyxl", version = "0.2.3", repos = "http://cran.us.r-project.org")
```
## Installation
```{r, echo = TRUE, eval = FALSE}
devtools::install_github("nacnudus/tidyxl")
```
## Examples
The package includes a spreadsheet, 'titanic.xlsx', which contains the following
pivot table:
```{r, echo = TRUE}
ftable(Titanic, row.vars = 1:2)
```
The multi-row column headers make this difficult to import. A popular package
for importing spreadsheets coerces the pivot table into a dataframe. It treats
the second header row as though it were observations.
```{r, echo = TRUE}
titanic <- system.file("extdata/titanic.xlsx", package = "tidyxl")
readxl::read_excel(titanic)
```
[tidyxl](https://github.com/nacnudus/tidyxl) doesn't coerce the pivot table into
a data frame. Instead, it represents each cell in its own row, where it
describes the cell's address, value and other properties.
```{r, echo = TRUE}
library(tidyxl)
x <- xlsx_cells(titanic)
dplyr::glimpse(x)
```
In this structure, the cells can be found by filtering.
```{r, echo = TRUE}
x[x$data_type == "character", c("address", "character")]
x[x$row == 4, c("address", "character", "numeric")]
```
Specific sheets can be requested using `xlsx_cells(file, sheet)`, and the names
of all sheets in a file are given by `xlsx_sheet_names()`.
### Formatting
The original spreadsheet has formatting applied to the cells. This can also be
retrieved using [tidyxl](https://github.com/nacnudus/tidyxl), with the
`xlsx_formats()` function.
```{r, echo = FALSE, fig.alt = "Screenshot of a spreadsheet with meaningful formatting"}
knitr::include_graphics("vignettes/titanic-screenshot.png")
```
Formatting is available by using the columns `local_format_id` and
`style_format` as indexes into a separate list-of-lists structure. 'Local'
formatting is the most common kind, applied to individual cells. 'Style'
formatting is usually applied to blocks of cells, and defines several formats at
once. Here is a screenshot of the styles buttons in Excel.
```{r, echo = FALSE, fig.alt = "Screenshot of a formatting options available in Excel"}
knitr::include_graphics("vignettes/styles-screenshot.png")
```
Formatting can be looked up as follows.
```{r, echo = TRUE}
# Bold
formats <- xlsx_formats(titanic)
formats$local$font$bold
x[x$local_format_id %in% which(formats$local$font$bold),
c("address", "character")]
# Yellow fill
formats$local$fill$patternFill$fgColor$rgb
x[x$local_format_id %in%
which(formats$local$fill$patternFill$fgColor$rgb == "FFFFFF00"),
c("address", "numeric")]
# Styles by name
formats$style$font$name["Normal"]
head(x[x$style_format == "Normal", c("address", "character")])
# In-cell formatting is available in the `character_formatted` column as a data
# frame, one row per substring.
examples <- system.file("/extdata/examples.xlsx", package = "tidyxl")
xlsx_cells(examples)$character_formatted[77]
```
To see all the available kinds of formats, use `str(formats)`.
### Comments
Comments are available alongside cell values.
```{r, echo = TRUE}
x[!is.na(x$comment), c("address", "comment")]
```
### Formulas
Formulas are available, but with a few quirks.
```{r, echo = TRUE}
options(width = 120)
y <- xlsx_cells(examples, "Sheet1")
y[!is.na(y$formula),
c("address", "formula", "is_array", "formula_ref", "formula_group",
"error", "logical", "numeric", "date", "character")]
```
The top five cells show that the results of formulas are available as usual in
the columns `error`, `logical`, `numeric`, `date`, and `character`.
#### Shared formulas
Cells `A20` and `A21` illustrate how formulas are normalised before being
written to file, using the `formula_ref` and `formula_group` columns. When
there is a group of cells whose formulas only differ by cell reference (e.g.
"=A1+1", "=A2+1", "=A3+1", etc.), only one formula in each group is written to
the file, so `tidyxl` infers what the formulas in the other cells in the group
must be, from their relative positions.
#### Array formulas
There are two kinds of array formulas: ones that compute over arrays, and ones
whose output is an array (of cells).
Both kinds are distinguished in
spreadsheet programs by curly braces, e.g. `{SUM(A19:A21*B19:B21)}`. In
`tidyxl`, the curly braces are ommitted (as they are from the file itself), and
instead the `is_array` column has the value `TRUE`.
The first kind (those that compute over arrays) is illustrated by cell `A22`.
The second kind (those whose value is spread across an array of cells) is
illustrated by cells `A23` and `A24`. The formula is only given in the top-left
cell (`A23`), which is also the only cell that describes the range of cells
containing the result, in the `formula-ref` column. The results themselves are
stored in all relevant cells (`A23` and `A24`). Unlike shared formulas, there
is no `formula_group` to associate the cells of an array formula's result. If
you need to do identify those cells, use the
[cellranger](https://github.com/rsheets/cellranger) package and the
`formula_ref` column.
#### Formulas referring to other files
Cell `A25` contains a formula that refers to another file. The `[1]` is an
index into a table of files. The roadmap for
[tidyxl](https://github.com/nacnudus/tidyxl) includes de-referencing such
numbers.
#### Tokenizing formulas
The function `xlex()` separates formulas into tokens of different types, and
gives their depth within a nested formula. Its name is a bad pun on 'Excel' and
'lexer'. Try the [online demo](https://duncan-garmonsway.shinyapps.io/xlex/),
or install the more experimental [lexl](https://nacnudus.github.io/lexl/)
package to run `demo_lexl()` locally.
It is useful for detecting spreadsheet smells, which are poor practices in
spreadsheet design, such as deep nests of functions, or embedding
constants in formulas.
```{r, fig.width = 7, fig.height = 5}
x <- xlex("MIN(3,MAX(2,A1))")
x
```
See the [vignette](https://nacnudus.github.io/tidyxl/articles/smells.html) for
more examples and details.
### Named ranges
Names are imported with `xlsx_names()`. Also called 'named formulas' and 'defined
names', these are usually used to name particular cells or ranges, making
formulas that refer to them more readable. Ones that *are* ranges are identifed
by the `is_range` column (using `is_range()`), making it easier to match the
names to the cells returned by `xlsx_cells()` -- e.g. by using the
[`cellranger`](https://github.com/rsheets/cellranger) package.
When the scope of the name is within a particular sheet, rather than global, the
sheet name is given.
```{r}
xlsx_names(examples)
```
### Data validation rules
Data validation rules are imported with `xlsx_validation()`. These rules
control what values may be entered into a cell, and are often used to create a
drop-down list in a cell. Read the
[vignette](https://nacnudus.github.io/tidyxl/articles/data-validation-rules.html)
for details.
```{r}
xlsx_validation(examples)
```
## Philosophy
Information in in many spreadsheets cannot be easily imported into R. Why?
Most R packages that import spreadsheets have difficulty unless the layout of
the spreadsheet conforms to a strict definition of a 'table', e.g.:
* observations in rows
* variables in columns
* a single header row
* all information represented by characters, whether textual, logical, or
numeric
These rules are designed to eliminate ambiguity in the interpretation of the
information. But most spreadsheeting software relaxes these rules in a trade of
ambiguity for expression via other media:
* proximity (other than headers, i.e. other than being the first value at the
top of a column)
* formatting (colours and borders)
Humans can usually resolve the ambiguities with contextual knowledge, but
computers are limited by their ignorance. Programmers are hampered by:
* their language's expressiveness
* loss of information in transfer from spreadsheet to programming library
Information is lost when software discards it in order to force the data into
tabular form. Sometimes date formatting is retained, but mostly formatting
is lost, and position has to be inferred again.
[tidyxl](https://github.com/nacnudus/tidyxl) addresses the programmer's problems
by not discarding information. It imports the content, position and formatting
of cells, leaving it up to the user to associate the different forms of
information, and to re-encode them in tabular form without loss. The
[unpivotr](https://github.com/nacnudus/unpivotr) package has been developed to
assist with that step.
## Similar projects
[tidyxl](https://github.com/nacnudus/tidyxl) was originally derived from
[readxl](https://github.com/tidyverse/readxl) and still contains some of the same
code, hence it inherits the GPL-3 licence.
[readxl](https://github.com/tidyverse/readxl) is intended for importing tabular
data with a single row of column headers, whereas
[tidyxl](https://github.com/nacnudus/tidyxl) is more general, and less magic.
The [rsheets](https://github.com/rsheets) project of several R packages is in
the early stages of importing spreadsheet information from Excel and Google
Sheets into R, manipulating it, and potentially parsing and processing formulas
and writing out to spreadsheet files. In particular,
[jailbreaker](https://github.com/rsheets/jailbreakr) attempts to extract
non-tabular data from spreadsheets into tabular structures automatically via
some clever algorithms.
[tidyxl](https://github.com/nacnudus/tidyxl) differs from
[rsheets](https://github.com/rsheets) in scope
([tidyxl](https://github.com/nacnudus/tidyxl) will never import charts, for
example), and implementation ([tidyxl](https://github.com/nacnudus/tidyxl) is
implemented mainly in C++ and is quite fast, only a little slower than
[readxl](https://github.com/tidyverse/readxl)).
[unpivotr](https://github.com/nacnudus/unpivotr) is a package related to
[tidyxl](https://github.com/nacnudus/tidyxl) that provides tools for unpivoting
complex and non-tabular data layouts using I not AI (intelligence, not
artificial intelligence). In this way it corresponds to
[jailbreaker](https://github.com/rsheets/jailbreakr), but with a different
philosophy.
## Mailing list
For bugs and/or issues, create a new issue on
[GitHub](https://github.com/nacnudus/tidyxl/issues) For other questions or
comments, please subscribe to the [tidyxl-devel mailing
list](https://groups.google.com/forum/#!forum/tidyxl-devel). You must be a
member to post messages, but anyone can read the archived discussions.