---
title: "Intro to R with RStudio"
author: "Abner Heredia Bustos, CSCAR"
format:
  revealjs:
    incremental: true
project:
  type: default
  output-dir: docs/
---
```{r}
source("load_clean_flower_df.R")
```
## About this workshop
For novice or inexperienced coders who want to use **R**. We will use RStudio to learn:
+ How to use and write basic functions.
+ How R stores and handles different types of data.
+ Basic ways to create, manipulate, import, clean, and summarize data.
+ But *NOT* statistical modeling.
## Workshop format
+ From 1 to 4:45 pm.
+ Breaks every 90 minutes.
+ A few slides for context and extra information.
+ A lot of hands-on coding and live demonstrations.
+ All materials will be available after the workshop ends.
::: {.notes}
This workshop runs from 1 to 4:45 pm, with breaks every 90 minutes. Right now I will use a few slides for context and extra information. But this is a WORKshop, so there will be plenty of hands-on coding and live demonstrations. All materials will be available after the workshop ends, so don't worry about copying these slides.
:::
## Tips for this workshop
+ Coding along with me is the best way to learn.
+ Ask questions at any time.
+ During exercises, interact with your peers.
::: {.notes}
Coding along with me is the best way to learn. If you just watch, you will not remember anything after you leave today. Feel free to ask questions at any time. Code can be confusing and mistakes are easy to make, but I'm here to help, so don't be afraid to interrupt me. During exercises, interact with your peers. We all struggle with computers, but it's easier if we can help each other.
:::
## What is CSCAR?
+ Full name: Consulting for Statistics, Computing and Analytics Research.
+ A unit of the Office of the Vice President of Research (OVPR).
+ Guides and trains researchers in data collection, management, and analysis.
+ Also helps researchers to use technical software and advanced computing.
::: {.notes}
Before we get to the coding part, let me tell you a little bit about the people behind this workshop: CSCAR.
:::
## CSCAR is here to help you
+ Free, one-hour consultations with graduate-level statisticians.
+ GSRAs are available for walk-in consultations Monday through Friday, between 9am and 5pm (we close on Tuesdays between noon and 1pm).
+ All of our scheduled appointments can be either remote or in-person.
::: {.notes}
All of our scheduled appointments can be either remote or in-person. So, if you live out of town, work on a different campus, or just don't want to deal with bad weather, you can still ask CSCAR for help.
:::
## Contact CSCAR
+ To request a consultation: email <[email protected]>, fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSei-twcjFkoobUrVwSQTmSxdKKEc1Ub8w5LHmeIZUmTV1wmIg/viewform?pli=1), or visit [cscar.research.umich.edu](https://cscar.research.umich.edu).
+ Self-schedule a consultation with a GSRA using [this link](https://calendar.google.com/calendar/u/0/selfsched?sstoken=UUMyTFpCR1RXbmhYfGRlZmF1bHR8ZWNjNGJlMWZlYTA4ZWE5NzYzNmNkNzgyZjUyZDYxNDg).
+ Address: The University of Michigan, 3560 Rackham, 915 E. Washington St., Ann Arbor, MI 48109-1070.
::: {.notes}
There are several ways to contact CSCAR. You can request a consultation by email or by filling out this form. You can also self-schedule a consultation with a GSRA using this link. Our office is at 3560 Rackham, 915 E. Washington St., Ann Arbor, Michigan.
:::
## Who am I?
+ Abner Heredia Bustos, a data science consultant at CSCAR.
+ I want to make coding as simple and effortless as possible...
+ ...which means learning it well from the beginning.
::: {.notes}
My name is Abner Heredia Bustos. I am a data science consultant at CSCAR. Apart from this, all you need to know is that, for me, coding is just a means to an end. So, I will try hard to make coding as simple and effortless as possible for you; but to achieve this you will need to put some effort into learning the basics.
:::
# Why do you want to learn R?
## R is cheap and powerful
+ **R** is gratis ($0) and it runs on Windows, macOS, and several Unix platforms.
+ You can start with this:
---
```{r loading flower data}
#| echo: false
head(flower_df, 5)
```
---
and, in 8 lines of code or fewer, make this:
```{r height by nitrogen boxplots}
#| echo: false
#| fig-width: 6.2
#| fig-height: 4.5
#| fig-align: center
boxplot(
  height ~ nitrogen,
  data = flower_df,
  col = c("yellow", "blue", "pink"),
  main = "No clear association between height and nitrogen level",
  xlab = "Nitrogen",
  ylab = "Height"
)
```
::: {.notes}
You can change the colors, the order of the boxes, the names, and much more. Doing all of this will be straightforward once you are familiar with R's syntax.
:::
## R is an *environment*, not a package
+ A package is a fixed set of tools.
+ An environment is for combining, modifying, and creating tools.
::: {.notes}
R is very powerful because it is an environment, not a package. A package is a fixed set of tools---what you see is what you get and that's it. An environment is for combining, modifying, and creating tools. So, even if a tool is not readily available in an environment, chances are there is a way to make it.
:::
## R has plenty of statistical tools and models
+ Generalized linear models (including linear regression).
+ Survival analysis.
+ Time series analysis.
+ Multilevel models.
+ Classification and clustering.
+ Sample size and power calculations.
+ Multivariate analysis (e.g., factor analysis, PCA, and SEM).
::: {.notes}
Luckily for us, other people have already built tools and models to do a lot of statistics. We have...
Better yet, people add more tools every day.
:::
## Even more tools and models
+ Users constantly publish their own code packages: more than 13 thousand in the Comprehensive **R** Archive Network (CRAN) as of March 2019.
+ Many complex statistical routines are not (and may never be) available in other statistical software.
::: {.notes}
As of March 2019, users like you and me have published more than 13 thousand packages in CRAN. Many of these packages implement complex statistical routines that are not (and may never be) available in other statistical software.
:::
## Why Isn't Everyone a Use**R**?
+ Some people only use the software they learned first, which is not always **R**.
+ Each package in R has its own rules to learn.
+ Help pages and error messages may be hard to understand.
::: {.notes}
But if R is so good, why isn't everyone a user? Some people only use whatever they learned first. They took a course in statistics years ago that used SPSS or Stata, and that has been enough for them. Also, each package in R has its own rules to learn. You can find a lot of good help for popular packages written by professional developers, but not so much for smaller packages written by ordinary users. Worst of all, some of the error messages in R are uninformative, so fixing problems can be difficult. Still, I think the advantages are well worth the effort.
:::
## Suggestions for Learning **R**
+ Learn interactively.
+ Don't worry about getting errors.
+ Ask other **R** users for help.
::: {.notes}
Learn interactively. Retype, experiment, go crazy with sample code. Today I will show you many examples that you can use, and you can find many more online. Also, don't worry about making mistakes. Even professional coders make errors all the time, and you can learn a lot from error messages. Besides, you can always ask other users for help. Take advantage of R's popularity to tap into our collective knowledge. It's also a good excuse to get up from your desk every once in a while.
:::
## Some useful links
::: {.nonincremental}
- <https://www.r-project.org>: Here you will find links for downloading **R**,
downloading additional packages for **R**, and more.
- <https://cran.r-project.org/web/views/>: Summaries of important packages by subject field or analysis type.
- <https://journal.r-project.org>: The **R** Journal.
- <https://stats.stackexchange.com>: Cross-Validated.
- <https://www.r-bloggers.com>
:::
## More useful links
::: {.nonincremental}
- <https://stats.idre.ucla.edu/r/>: Institute for Digital Research and
Education at UCLA.
- <https://socialsciences.mcmaster.ca/jfox/>: John Fox's home page.
- <https://sas-and-r.blogspot.com/>: Examples of code to perform the same task
in SAS and R.
:::
# Let's start coding
## Practice your arithmetic
Think of an integer, double it, add six, divide it in half, subtract the number you started with, and then square it. If your code is correct, the answer should be nine.
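One possible way to code this trick is sketched below. The names `start` and `result` and the starting value are arbitrary choices for illustration; any integer should lead to the same answer.

```r
# Hypothetical starting value; try replacing it with any other integer
start <- 7

# Double it, add six, divide in half, subtract the starting number, then square
result <- ((start * 2 + 6) / 2 - start)^2
result

# The same arithmetic works on a whole vector of starting values at once
many_starts <- -5:5
many_results <- ((many_starts * 2 + 6) / 2 - many_starts)^2
all(many_results == 9)
```

The parentheses matter: without them, R would apply its usual operator precedence and divide only the six by two.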
## Object names have rules
+ Names can be a combination of letters, digits, periods `.` and underscores
`_`.
+ Names can *not* include white spaces.
+ If a name starts with a period `.`, it can *not* be followed by a digit.
+ Names can *not* start with a number or an underscore `_`.
## Object names have rules
+ Names are case-sensitive (`age`, `Age` and `AGE` are three different objects).
+ Reserved words (`TRUE`, `FALSE`, `NULL`, `if`, ...) can *not* be used as names.
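If you want to check these rules without memorizing them, here is a minimal sketch. It relies on base R's `make.names()`, which rewrites any string into a syntactically valid name, so a string that comes back unchanged was already valid. The helper `is_valid_name()` and the candidate names are made up for this illustration.

```r
# Hypothetical helper: a name is valid if make.names() leaves it unchanged
is_valid_name <- function(name) {
  make.names(name) == name
}

# Candidate names, some valid and some breaking the rules above
candidates <- c(
  "my_object", "my.object", ".hidden", # valid
  "my object",                         # contains a white space
  "2fast", "_temp",                    # starts with a digit or an underscore
  ".2way",                             # period followed by a digit
  "TRUE", "if"                         # reserved words
)

data.frame(name = candidates, valid = is_valid_name(candidates))
```

Only the first three candidates should be reported as valid.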
## Tips for naming objects
+ Avoid giving your object the same name as a built-in function.
+ To separate words, use an underscore (`my_object`) or a dot (`my.object`), or capitalize the different words (`MyObject`). Choose your favorite way, but *be consistent* with it.
+ Use names that illustrate what you want to do with the objects.
## Exercise
Write a function that can simulate the roll of two six-sided dice, one red and one blue, an arbitrary number of times. This function should return a vector with the values of the red die that were strictly larger than the corresponding values of the blue die.
## Exercise step by step (part 1)
+ Step 1: define a function that takes one argument, `num_rolls`, representing
the number of times to roll the dice.
+ Step 2: create two objects called `red` and `blue` to store the results from
the dice rolls.
+ Step 3: simulate the dice rolls using function `sample()` (read its help page
if you need to).
## Exercise step by step (part 2)
+ Step 4: create a vector of indices that identifies the values in the red die that were larger than the values in the blue die.
+ Step 5: use this vector of indices to extract the values from the red die.
+ Step 6: make sure that your function returns the values you extracted in step 5.
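The six steps could be put together as in the sketch below. This is only one possible solution, not the only correct one; the function name `roll_red_wins` and the seed are arbitrary choices for illustration.

```r
# One possible solution; roll_red_wins is a made-up name
roll_red_wins <- function(num_rolls) {
  # Steps 2 and 3: roll each die num_rolls times, sampling with replacement
  red <- sample(1:6, size = num_rolls, replace = TRUE)
  blue <- sample(1:6, size = num_rolls, replace = TRUE)

  # Step 4: positions where the red die beat the blue die
  red_is_larger <- which(red > blue)

  # Steps 5 and 6: extract those red values and return them
  red[red_is_larger]
}

# Setting a seed (any number works) makes the simulation reproducible
set.seed(123)
roll_red_wins(10)
```

Note that `replace = TRUE` is needed because, by default, `sample()` draws without replacement and would fail when asked for more than six rolls.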
## Coercion
When adding different data types to the same atomic vector, **R** follows specific rules to coerce everything to be of the same type.
+ If a character string is present in an atomic vector, **R** will convert all other values to character strings.
+ If a vector only contains logicals and numbers, **R** will convert the logicals to numbers; every `TRUE` becomes a `1`, and every `FALSE` becomes a `0`.
+ `NA`s stay missing: **R** adjusts their type to match the rest of the vector, but never turns them into a value such as `0` or `"NA"`.
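A short sketch of these rules in action, using `c()` to build the vectors and `typeof()` to inspect the result. The object names are made up for illustration.

```r
# A character string forces every other value to become a string
mixed_with_text <- c(1, TRUE, "three")
typeof(mixed_with_text) # "character"
mixed_with_text         # "1" "TRUE" "three"

# Logicals mixed with numbers become 1s and 0s
logicals_and_numbers <- c(TRUE, FALSE, 5)
typeof(logicals_and_numbers) # "double"
logicals_and_numbers         # 1 0 5

# This is why you can count TRUEs by summing a logical vector
sum(c(TRUE, TRUE, FALSE)) # 2

# A missing value stays missing, whatever the type of the vector
with_missing <- c("a", NA)
is.na(with_missing) # FALSE TRUE
```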
## Make a histogram
Use the example I showed you to make a histogram for the variable `height`. Bonus: can you color the bars?
Your histogram should resemble this one:
```{r histogram of height}
#| echo: false
#| fig-width: 6.5
#| fig-height: 5
#| fig-align: center
hist(
  flower_df$height,
  breaks = 12,
  xlim = c(0, 20),
  xlab = "Height",
  main = "Distribution of heights",
  col = "green"
)
```
---
Here is the code for the histogram:
```{r}
#| echo: true
#| eval: false
hist(
  flower_df$height,
  breaks = 12,
  xlim = c(0, 20),
  xlab = "Height",
  main = "Distribution of heights",
  col = "green"
)
```
## Make a boxplot
Use the example I showed you to make a boxplot for the variable `leaf_area`. Bonus: can you color the box?
```{r boxplot for leaf_area}
#| echo: false
#| fig-width: 4
#| fig-height: 5
#| fig-align: center
boxplot(
  flower_df$leaf_area,
  ylab = "Leaf area",
  col = "blue",
  main = "Most leaf areas are between 11 and 18"
)
```
---
Here is the code for the boxplot.
```{r}
#| echo: true
#| eval: false
boxplot(
  flower_df$leaf_area,
  ylab = "Leaf area",
  col = "blue",
  main = "Most leaf areas are between 11 and 18"
)
```
## Make a scatterplot
Use the example I showed you to make a scatter plot with `height` and `weight`, coloring the points by nitrogen level. Remember to add a legend to the plot. Your scatter plot should resemble this one:
---
```{r}
#| echo: false
#| fig-width: 6
#| fig-height: 4.5
#| fig-align: center
plot(
  x = flower_df$weight,
  y = flower_df$height,
  col = flower_df$nitrogen,
  pch = 16, # filled circles, matching the symbols in the legend
  main = "No clear association between height and weight",
  xlab = "Weight",
  ylab = "Height"
)
# Add a legend to the plot
legend(
  x = "bottomright",
  legend = levels(flower_df$nitrogen),
  col = 1:length(levels(flower_df$nitrogen)),
  pch = 16
)
```
---
Here is the code for the scatter plot.
```{r}
#| echo: true
#| eval: false
plot(
  x = flower_df$weight,
  y = flower_df$height,
  col = flower_df$nitrogen,
  pch = 16, # filled circles, matching the symbols in the legend
  main = "No clear association between height and weight",
  xlab = "Weight",
  ylab = "Height"
)
# Add a legend to the plot
legend(
  x = "bottomright",
  legend = levels(flower_df$nitrogen),
  col = 1:length(levels(flower_df$nitrogen)),
  pch = 16
)
```
## Make a mosaic plot
Use the example I showed you to make a mosaic plot that visualizes how frequently the values of `nitrogen` and `treat` combine with each other, but only for flowers with a weight below 10. Your plot should resemble this:
---
```{r mosaic plot for nitrogen vs treat}
#| echo: false
#| fig-width: 5
#| fig-height: 4
#| fig-align: center
nitrogen_by_treat_table <- xtabs(
  formula = ~ nitrogen + treat,
  data = flower_df[which(flower_df$weight < 10), ]
)
mosaicplot(nitrogen_by_treat_table, main = "Nitrogen by treat, weight below 10")
```
---
Here is the code for the mosaic plot.
```{r}
#| echo: true
#| eval: false
nitrogen_by_treat_table <- xtabs(
  formula = ~ nitrogen + treat,
  data = flower_df[which(flower_df$weight < 10), ]
)
mosaicplot(nitrogen_by_treat_table, main = "Nitrogen by treat, weight below 10")
```