forked from hadley/adv-r
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Package-basics.rmd
285 lines (193 loc) · 14.6 KB
/
Package-basics.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
---
title: Package basics
layout: default
---
# Package basics
An R package is the basic unit of reusable code. You need to master the art of making R packages if you want others to use your code. At their heart, packages are quite simple and only have two essential components:
* a `DESCRIPTION` file that describes the package, what it does, who's allowed
to use it (the license) and who to contact if you need help
* an `R` directory that contains your R code.
If you want to distribute R code to someone else, there's no excuse not to use a simple package: it's a standard structure, and you can easily build it out into a complete package by adding documentation, data and tests.
This document explains how to get started, with a description of package structure, tips for naming your package, and more details about the `DESCRIPTION` file.
The most accurate resource for up-to-date details on package development is always the official [writing R extensions][r-ext] guide. However, it's rather hard to read and follow if you're not already familiar with the basics of packages. It's also exhaustive, covering every possible package component, rather than focussing on the most common and useful components as this package does. Once you are familiar with the content here, you should find R extensions a little easier to read.
## Package essentials
As mentioned above, there are only two elements that you must have:
* the `DESCRIPTION` file, which provides metadata about the package, and is
described in the following section.
* the `R/` directory where your R code lives (in `.R` or `.r` files).
If you don't want to create this by hand, you can use `devtools::create` which initialises the directory structure and includes a few other files that most packages have.
## Optional components
Almost all R packages also have:
* the `man/` directory where your [[function documentation|documenting-functions]]. In
the style of package development described in this book, you'll never
personally touch the files in this directory. Instead, they will be
automatically generated from comments in your source code using the
`roxygen2` package
After the code and function documentation, the most important optional components of an R package help your users learn how to use your package. The following files and directories are described in more detail in [[documenting packages]].
* the `NEWS` file describes the changes in each version of the package. Using
the standard R format will allow you to take advantage of many automated
tools for displaying changes between versions.
* the `README` file gives a general overview of your package, including why
it's important. This text should be included in any package announcement, to
help others understand why they might want to use your package.
* the `inst/CITATION` file describes how to cite your package. If you have
published a peer reviewed article which you'd like people to cite when they
use your software, this is the place to put it.
* the `demo/` directory contains larger scale demos, that use many
features of the package.
* the `inst/doc/` directory is used for larger scale documentation, like
vignettes, long-form documents which show how to combine multiple parts
of your package to solve problems.
Other optional files and directories are part of good development practice:
* a `NAMESPACE` file describes which functions are part of the formal API of
the package and are available for others to use. See [[namespaces]] for more
details.
* `tests/` and `inst/tests/` contains [[unit tests|testing]] which ensure that
your package is operating as designed. In this book, we'll focus on the
`testthat` package for writing tests.
* the `data/` directory contains `.rdata` files, used to include sample
datasets (or other R objects) with your package.
There are other directories that we won't cover. You might see these in other packages you download from CRAN, but these topics are outside the scope of this book.
* `src/`: C, C++ and fortran source code
* `exec/`: executable scripts
* `po/`: translation files
## Getting started
When creating a package the first thing (and sometimes the most difficult) is to come up with a name for it. There's only one formal requirement:
* The package name can only consist of letters and numbers, and must start
with a letter.
But I have a few additional recommendations:
* Make the package name googleable, so that if you google the name you can
easily find it. This makes it easy for potential users to find your package,
and it's also useful for you, because it makes it easier to find out who is
using it.
* Avoid using both upper and lower case letters: they make the package name
hard to type and hard to remember. For example, I can never remember if it's
`Rgtk2` or `RGTK2` or `RGtk2`.
Some strategies I've used in the past to create packages names:
* Use abbreviations: `lvplot` (letter value plots), `meifly` (models explored
interactively)
* Add an extra R: `stringr` (string processing), `tourr` (grand tours), `httr`
(HTTP requests), `helpr` (alternative documentation view)
* Find a name evocative of the problem and modify it so that it's unique:
`plyr` (generalisation of apply tools), `lubridate` (makes dates and times
easier), `mutatr` (mutable objects), `classifly` (high-dimensional views of
classification)
Once you have a name, create a directory with that name, and inside that create an `R` subdirectory and a `DESCRIPTION` file (note that's there's no extension, and the file name must be all upper case).
## The `R/` directory
The `R/` directory contains all your R code, so copy in any existing code.
It's up to you how you arrange your functions into files. There are two
possible extremes: all functions in one file, and each function in its own
file. I think these are both too extreme, and I suggest grouping related
functions into a single file. My rule of thumb is that if I can't remember
which file a function lives in, I probably need to split them up into more
files: having only one function in a file is perfectly reasonable,
particularly if the functions are large or have a lot of documentation. As
you'll see in the next chapter, often the code for the function is small
compared to its documentation (it's much easier to do something than it is to
explain to someone else how to do it.)
The next step is to create a `DESCRIPTION` file that defines package metadata.
## A minimal `DESCRIPTION` file
A minimal description file (this one is taken from an early version of plyr) looks like this:
Package: plyr
Title: Tools for splitting, applying and combining data
Description:
Version: 0.1
Author: Hadley Wickham <[email protected]>
Maintainer: Hadley Wickham <[email protected]>
License: MIT
This is the critical subset of package metadata: what it's called (`Package`), what it does (`Title`, `Description`), who's allowed to use and distribute it (`License`), who wrote it (`Author`), and who to contact if you have problems (`Maintainer`). Here I've left the `Description` blank to illustrate that if you haven't decided what the correct value is yet, it's ok to leave it blank.
Again, the six required elements are:
* `Package`: name of the package. Should be the same as the directory name.
* `Title`: a one line description of the package.
* `Description`: a more detailed paragraph-length description.
* `Version`: the version number, which should be of the the form
`major.minor.patchlevel`. See `?package_version` for more details on the
package version formats. I recommended following the principles of [semantic
versioning](http://semver.org/).
* `Maintainer`: a single name and email address for the person responsible for
package maintenance.
* `License`: a standard abbreviation for an open source license, like `GPL-2`
or `BSD`. A complete list of possibilities can be found by running
`file.show(file.path(R.home(), "share/licenses/license.db"))`. If you are
using a non-standard license, put `file LICENSE` and then include the full
text of the license in a `LICENSE`.
## Other `DESCRIPTION` components
A more complete `DESCRIPTION` (this one from a more recent version of `plyr`) looks like this:
Package: plyr
Title: Tools for splitting, applying and combining data
Description: plyr is a set of tools that solves a common set of
problems: you need to break a big problem down into manageable
pieces, operate on each pieces and then put all the pieces back
together. For example, you might want to fit a model to each
spatial location or time point in your study, summarise data by
panels or collapse high-dimensional arrays to simpler summary
statistics. The development of plyr has been generously supported
by BD (Becton Dickinson).
URL: http://had.co.nz/plyr
Version: 1.3
Maintainer: Hadley Wickham <[email protected]>
Author: Hadley Wickham <[email protected]>
Depends: R (>= 2.11.0)
Suggests: abind, testthat (>= 0.2), tcltk, foreach
Imports: itertools, iterators
License: MIT
This `DESCRIPTION` includes other components that are optional, but still
important:
* `Depends`, `Suggests`, `Imports` and `Enhances` describe which packages
this package needs. They are described in more detail in [[namespaces]].
* `URL`: a url to the package website. Multiple urls can be separated with a
comma or whitespace.
Instead of `Maintainer` and `Author`, you can `Authors@R`, which takes a vector of `person()` elements. Each person object specifies the name of the person and their role in creating the package:
* `aut`: full authors who have contributed much to the package
* `ctb`: people who have made smaller contributions, like patches.
* `cre`: the package creator/maintainer, the person you should bother if you
have problems
Other roles are listed in the help for person. Using `Authors@R` is useful when your package gets bigger and you have multiple contributors that you want to acknowledge appropriately. The equivalent `Authors@R` syntax for plyr would be:
Authors@R: person("Hadley", "Wickham", role = c("aut", "cre"))
There are a number of other less commonly used fields like `BugReports`, `KeepSource`, `OS_type` and `Language`. A complete list of the `DESCRIPTION` fields that R understands can be found in the [R extensions manual][description].
## Source, binary and bundled packages
So far we've just described the structure of a source package: the development version of a package that lives on your computer. There are also two other types of package: bundled packages and binary packages.
A package __bundle__ is a compressed version of a package in a single file. By convention, package bundles in R use the extension `.tar.gz`. This is linux convention indicating multiple files have been collapsed into a single file (`.tar`) and then compressed using gzip (`.gz`). The package bundle is useful if you want to manually distribute your package to another R package developer. It is not OS specific. You can use `devtools::build()` to make a package bundle.
If you want to distribute your package to another R user (i.e. someone who doesn't necessarily have the development tools installed) you need to make a __binary__ package. Like a package bundle, a binary package is a single file, but if you uncompress it, you'll see that the internal structure is a little different to a source package:
* a `Meta/` directory contains a number of `Rds` files. These contain cached
metadata about the package, like what topics the help files cover and
parsed versions of the `DESCRIPTION` files. (If you want to look at what's
in these files you can use `readRDS`)
* a `html/` directory contains some files needed for the html help.
* there are no `.R` files in the `R/` directory - instead there are three
files that store the parsed functions in an efficient format. This is
basically the result of loading all the R code and then saving the
functions with `save`, but with a little extra metadata to make things as
fast as possible.
* If you had any code in the `src/` directory there will now be a `libs/`
directory that contains the results of compiling that code for 32 bit
(`i386/`) and 64 bit (`x64`)
Binary packages are platform specific: you can't install a windows binary package on a mac or vice versa. You can use `devtools::build(binary = TRUE)` to make a package bundle.
An __installed__ package is just a binary package that's been uncompressed into a package library, described next.
## Package libraries
A library is a collection of installed packages. You can have multiple libraries on your computer and most people have at least two: one for the recommended packages that come with a base R install (like `base`, `stats` etc), and one library where the packages you've installed live. The default is to make that directory dependent on which version of R you have installed - that's why you normally lose all your packages when you reinstall R. If you want to avoid this behaviour, you can manually set the `R_LIBS` environmental variable to point somewhere else. `.libPaths()` tells you where your current libraries live.
When you use `library(pkg)` to load a package, R looks through each path in `.libPaths()` to see if a directory called `pkg` exists.
## Installing packages
Package installation is the process whereby a source package gets converted into a binary package and then installed into your local package library. There are a number of tools that automate this process:
* `install.packages()` installs a package from CRAN. Here CRAN takes care of
making the binary package and so installation from CRAN basically is
equivalent to downloading the binary package value and unzipping it in
`.libPaths()[1]` (but you should never do this by hand because the process
also does other checks)
* `devtools::install()` installs a source package from a directory on your
computer.
* `devtools::install_github()` installs a package that someone has published
on their [github](http://github) account. There are a number of similar
functions that make it easy to install packages from other internet
locations: `install_url`, `install_gitorious`, `install_bitbucket`, and so
on.
## Exercises
(to be integrated throughout the chapter)
* Go to CRAN and download the source and binary for XXX. Unzip and compare.
How do they differ?
* Download the __source__ packages for XXX, YYY, ZZZ. What directories do they
contain?
* Where is your default library? What happens if you install a new package
from CRAN?
[r-ext]:http://cran.r-project.org/doc/manuals/R-exts.html#Creating-R-packages
[description]: http://cran.r-project.org/doc/manuals/R-exts.html#The-DESCRIPTION-file