forked from btskinner/downloadipeds
---
image: apple-touch-icon.png
execute:
  warning: false
---
# `IPEDtaS` Tutorial
- This page will walk you through the basics of how to use `IPEDtaS` to automagically retrieve labelled IPEDS `.dta` files
## What `IPEDtaS` Does
- NCES provides all the information needed to create labelled IPEDS data
- However, it can be cumbersome to put all the pieces together manually
- If you've worked with IPEDS before, you've likely seen this screen
![Screenshot of IPEDS Data Center](IPEDS-data-center.jpg)
- When I first started working with IPEDS, I assumed the "Stata Data File" would be a nicely labelled data set, like you can get with other NCES data sets
- However, it's not
- Instead, it's just a plain .csv data set that is designed to be combined with the "STATA Program File" which applies the labels
- These program files are also full of issues
- Such as hard-coded file paths from the computer of the NCES worker who wrote the script, which you would have to update for each data file
![](hard-file-path.png)
- And error-causing line breaks which you'd have to fix before the script would run
![](broken-line.png)
- Between downloading two files, fixing the issues, and then running them together, it's a lot of work, so most people don't bother
- Plus, the program files only work in Stata, so R users without a Stata license are left out...
- The R and Stata `IPEDtaS` scripts do all this heavy lifting for you so you can use nicely labelled IPEDS data without the extra effort!
## Why Would I Want Labels Anyway?
You will see more of this in the applied example below, but, in short, they make data analysis much easier and reduce the amount you will have to look at the dictionary/codebook!
- Look at the difference from these simple data checks on the number of colleges in each region
```{r}
#| echo: false
library(tidyverse)
data_with_labels <- haven::read_dta("data/hd2021.dta")
data_without_labels <- haven::zap_labels(data_with_labels)
```
- Without labels, we just get the numeric code for each region which we'd have to look up
```{r}
data_without_labels |>
  count(obereg)
```
- With labels, we get a description of each region too
```{r}
data_with_labels |>
  count(obereg)
```
- First, this makes life easier by reducing how often we have to look back and forth to the code book
- Second, it makes it easier to spot accidental errors when checking our work, leading to more reliable analyses
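If you want to see what a value label is without downloading anything, the idea can be sketched on a toy column. This is only an illustration: the codes and the label text below are made up and are not taken from the IPEDS dictionary.

```{r}
library(haven)

# Toy data: three institutions with numeric region codes
# (codes and label text are invented for illustration)
region <- labelled(
  c(1, 2, 2),
  labels = c("Region A" = 1, "Region B" = 2),
  label = "Toy region variable"
)

# The original numeric codes are still stored in the column ...
codes <- as.numeric(zap_labels(region))

# ... and the labels can stand in for them whenever that is more readable
region_names <- as.character(as_factor(region))

table(region_names)
```

The point is that a labelled column carries both pieces of information, so nothing is lost by having the labels attached.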
## System Requirements
First things first, let's consider what you will need on your computer to get started
::: {.panel-tabset}
## {{< fa brands r-project >}} R
1. An up-to-date version of `R`
- If you're not sure how up-to-date your R is, download a new version of R from [https://cran.r-project.org](https://cran.r-project.org)
2. An up-to-date `tidyverse` package
- In your R console type `install.packages("tidyverse")` to get the latest version
3. If you want to download all of IPEDS, up to 12 GB of permanent space and 36 GB of temporary space
## ![](stata-icon.png){width=0.15in} Stata
1. An up-to-date and licensed copy of Stata version 16.0 or higher (BE or Basic is sufficient)
- You can upgrade/purchase/download Stata from [https://www.stata.com](https://www.stata.com)
2. An up-to-date installation of Python
- Don't fret, you don't have to use Python, but the .do file uses PyStata to clean up the scripts, so Python just needs to be on your machine
- You can see if you already have Python by typing `python search` into your Stata command box
- You can install a copy of Python from [https://www.python.org/downloads/](https://www.python.org/downloads/)
3. If you want to download all of IPEDS, up to 4 GB of permanent space and 12 GB of temporary space
:::
# Setting up Your Project Folder
This part is identical for both Stata and R users. The main points to note are:
1. Download either the Stata or R version of the script from the links at the top of this page
- Hint: If the file is opening in your browser, use "download linked file" on macOS, or "save link as" on Windows, to save the file (you can also just copy and paste)
2. The script is designed to treat wherever you place the `IPEDtaS.do` or `IPEDtaS.R` file as the "working directory"
- **Check the working directory is set to your current project before doing anything else**
- If you download the script, save it in your project folder, then open it, you will often get the correct working directory by default, but it's always best to check
- When the script runs it will store output in `./data` and `./dictionaries` folders
- **Caution**: Anything you have in folders with that name will be overwritten
- This also applies to `./zip-data`, `./zip-do-files`, `./zip-dictionaries`, `./unzip-data`, `./unzip-do-files`, `./unzip-dictionaries` which are folders used temporarily behind-the-scenes
3. Personally, I set up my projects with scripts in the top-level of the project folder (as in, not in a sub-folder), so that is how `IPEDtaS` was designed
- If you **need** everything in a sub-folder for sanity reasons either:
a. Place `IPEDtaS` in your data folder (e.g., `./data/IPEDtaS.do`), which will place the data in `./data/data/hd2022.dta`
b. Place `IPEDtaS` in your `./scripts` folder and go through adjusting all the relative paths by adding `../` to back out one level
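For R users, a quick pre-flight check along these lines can confirm the working directory and show which of the folders named above already exist. This is a minimal sketch, not part of `IPEDtaS` itself; `ipedtas_dirs` and `existing_ipedtas_dirs()` are hypothetical names used only for this example.

```{r}
# Folders that, per the notes above, the script writes to or uses temporarily
ipedtas_dirs <- c(
  "data", "dictionaries",
  "zip-data", "zip-do-files", "zip-dictionaries",
  "unzip-data", "unzip-do-files", "unzip-dictionaries"
)

# Hypothetical helper: return the folders from the list that already exist in `path`
existing_ipedtas_dirs <- function(path = getwd()) {
  ipedtas_dirs[dir.exists(file.path(path, ipedtas_dirs))]
}

getwd()                  # confirm this is your project folder
existing_ipedtas_dirs()  # anything listed here may be overwritten
```

If the second call returns anything you care about, move or rename those folders before running the script.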
# File Selection
The only real change you have to make to the scripts is selecting which files you want to download
- By default the scripts are set to download directory information (HD) and enrollment data (EFFY) for the 2023 reporting cycle
- To change these you need to update the list to the files you want
- **Note**: at the bottom of the script there is a list with every single IPEDS file in it. If you want the entire dataset, you can just copy and paste that longer list to the top of the script and edit as needed
- To edit the list you basically just need to follow the list rules for each language
::: {.panel-tabset}
## {{< fa brands r-project >}} R
- The only rule is that the `selected_files <- c()` must be a valid list of IPEDS file names
- Each line/entry **must end in a comma `,`** except the final one
Here are some short examples of file selection
1. A simple list
```{r}
#| eval: false
selected_files <- c("HD2021", "EFFY2022", "SFA2122")
## OR
selected_files <- c(
"HD2021",
"EFFY2022",
"SFA2122"
)
```
2. You can also comment out files you don't want from a longer list
```{r}
#| eval: false
selected_files <- c(
"HD2021",
# "IC2022",
# "IC2022_AY",
"EFFY2022",
# "EFFY2022_DIST",
"SFA2122"
)
```
That's about it, when you run the script, the files you put in `selected_files` will be downloaded
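Because `selected_files` is just a character vector, you could also build it programmatically when you want the same survey for several years. This is a sketch that assumes the file names follow the prefix-plus-year pattern seen in the examples above; check the names against the full list at the bottom of the script before relying on it, since some files (such as `SFA2122`) use a different year format.

```{r}
# Years to request (an arbitrary choice for illustration)
years <- 2019:2021

# Paste the survey prefix onto each year
selected_files <- c(
  paste0("HD", years),
  paste0("EFFY", years)
)

selected_files
```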
## ![](stata-icon.png){width=0.15in} Stata
- For the Stata version `local selected_files` needs to be a valid list of IPEDS file names
- Each line in the list **must end in `///`** except the final one
Here are some short examples of file selection
1. A simple list
```{stata}
*| eval: false
local selected_files ///
"HD2021" ///
"EFFY2022" ///
"SFA2122"
```
2. Use multi-line comments to comment out files you don't want
- Stata has both single-line comments (`//` or `*`) and multi-line comments, which start with `/*` and end with `*/`
- Because of the way the list is structured, we have to use multi-line comments here (even to comment a single line) which have 3 rules
1. To work in the list, the line before a multi-line comment must be `///` and nothing else
2. Below this start the first line of a multi-line comment with `/*`
3. To close out a multi-line comment somewhere else use `*/`
```{stata}
*| eval: false
local selected_files ///
"HD2021" ///
///
/*
"IC2022" ///
"IC2022_AY" ///
*/
"EFFY2022" ///
///
/*
"EFFY2022_DIST" ///
*/
"SFA2122"
```
That's about it, when you run the script, the files you put in `local selected_files` will be downloaded
:::
# Running the Script
Once you have the file selection set, simply save the script and hit run/do!
If you're using this tool as part of a reproducible research project, you might want to include running it as part of your analysis code
- However, you don't want to run it every time you run your code, only if the data isn't already downloaded
- The below code blocks will do exactly that if you include them at the start of your analysis code
- Just change `hd2021.dta` to a file you download
::: {.panel-tabset}
## {{< fa brands r-project >}} R
```{r}
#| eval: false
if(!file.exists("data/hd2021.dta")) { source("IPEDtaS.R") }
```
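If your analysis depends on more than one downloaded file, the same idea can be extended to check all of them. This is a minimal sketch; `missing_data_files()` is a hypothetical helper written for this example, not something `IPEDtaS` provides.

```{r}
# Hypothetical helper: return the expected files that are not on disk yet
missing_data_files <- function(files, dir = "data") {
  paths <- file.path(dir, files)
  paths[!file.exists(paths)]
}

# Usage at the top of an analysis script
# (commented out so that this sketch has no side effects):
# needed <- c("hd2021.dta", "effy2022.dta", "sfa2122.dta")
# if (length(missing_data_files(needed)) > 0) source("IPEDtaS.R")
```

Returning the missing paths, rather than just `TRUE` or `FALSE`, makes it easy to print which file triggered the download.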
## ![](stata-icon.png){width=0.15in} Stata
```{stata}
*| eval: false
if !fileexists("data/hd2021.dta") {
    do "IPEDtaS.do"
}
```
:::
# Applied Example Using Labelled IPEDS Data
Okay, now we have our labelled IPEDS data, let's walk through a simple descriptive analysis using
- HD2021 (institutional characteristics as of Fall 2021)
- EFFY2022 (enrollment for 2021-2022 school year)
- SFA2122 (financial aid for 2021-2022 school year)
Two things to note:
1. These examples are just meant to illustrate how the labels can help in your work; they are not meant to be ground-breaking, informative analyses
- If you're feeling adventurous, play around and swap out different variables as you follow along
2. To be able to understand some of the code, you'll need a decent understanding of R and the tidyverse, but again, the point is just to see how the labels can help
## 1. Running `IPEDtaS`, Reading Data, & Joining Data
::: {.panel-tabset}
## {{< fa brands r-project >}} R
**1**. Create a new folder on your computer, download a copy of `IPEDtaS.R`, and place it in the folder
**2**. Adjust `selected_files <- c()` in `IPEDtaS.R` to download the 3 files we want like below
```{r}
selected_files <- c(
"HD2021",
"EFFY2022",
"SFA2122"
)
```
**3**. Select the whole `IPEDtaS.R` script and hit "Run"
**4**. Start a new R script in that same folder
**5**. Load tidyverse, haven (part of tidyverse, but requires loading separately), labelled (what haven uses behind the scenes), and gtsummary (to easily create output tables)
```{r}
#| message: false
library(tidyverse)
library(haven)
library(labelled)
library(gtsummary)
```
**6**. Read our data in
```{r}
data_info <- read_dta("data/hd2021.dta")
data_enroll <- read_dta("data/effy2022.dta")
data_aid <- read_dta("data/sfa2122.dta")
```
Okay, now, take a look at the enrollment data we just read in (click on `data_enroll` in the environment in the top right)
![Screenshot of Enrollment Data Showing Variable Labels](var-labels.png)
Notice the descriptions under each variable name
- If you're familiar with IPEDS data, you won't be used to seeing those
- They're the variable labels we added, super useful for quick questions without having to open the code book!
**7**. Now we want to join our data together
```{r}
data <- left_join(data_info, data_enroll, by = "unitid") |>
  left_join(data_aid, by = "unitid")
```
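One thing worth knowing about `left_join()` is that a row in the first table is repeated once for every matching row in the second. The toy tables below are made up purely to show that behavior; the column names other than `unitid` are hypothetical.

```{r}
library(dplyr)

# Toy stand-ins for the real files; all values are invented
toy_info <- tibble(unitid = c(1, 2), name = c("College A", "College B"))
toy_enroll <- tibble(
  unitid   = c(1, 1, 2, 2),
  level    = c(1, 2, 1, 2),
  enrolled = c(100, 80, 50, 40)
)

# Each unitid appears twice in toy_enroll, so each college is repeated twice
toy_joined <- left_join(toy_info, toy_enroll, by = "unitid")

nrow(toy_info)
nrow(toy_joined)
```

Comparing `nrow()` before and after a join is a cheap way to notice when one of the files has several rows per institution.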
## ![](stata-icon.png){width=0.15in} Stata
```{r}
#| include: false
library(Statamarkdown)
```
**1**. Create a new folder on your computer, download a copy of `IPEDtaS.do`, and place it in the folder
**2**. Adjust `local selected_files` in `IPEDtaS.do` to download the 3 files we want like below
```{stata}
*| eval: false
local selected_files ///
"HD2021" ///
"EFFY2022" ///
"SFA2122"
```
**3**. Select the whole `IPEDtaS.do` script and hit "Run"
**4**. Start a new Stata do file in that same folder
**5**. Load our first data set, hd2021
```{stata}
*| collectcode: true
use "data/hd2021.dta", clear
```
Okay, now, take a look at the variables panel (by default in right hand panel)
- Each of the variables has a label that describes what the variable means
- If you're familiar with standard IPEDS data, you won't be used to seeing those
- They're the variable labels we added, super useful for quick questions without having to open the code book!
![Screenshot Showing Variable Labels](stata-var-labels.png){height=450}
**6**. Join in our other data sets in a "left join" style (i.e., all observations in the first data set are kept even if they don't have a match in the second)
```{stata}
*| collectcode: true
joinby unitid using "data/sfa2122.dta", unmatched(master) _merge(sfa)
joinby unitid using "data/effy2022.dta", unmatched(master) _merge(effy)
```
:::
## 2. Data Cleaning with Labels
Now that we have everything read in, the advantage of the labels will truly begin to show!
::: {.panel-tabset}
## {{< fa brands r-project >}} R
**8**. Some of you may have noticed that our data has become extremely "long"
- As in, our data now has many more observations than we originally had
- This means we probably have a little light data-wrangling to do
- Let's check how many observations our data set now contains
```{r}
nrow(data)
```
- I have a hunch that the data might be "long" by the variable `effylev`, so, let's look at how many observations we have for each value of `effylev`
```{r}
data |> count(effylev)
```
Once again, if you're used to IPEDS data, you wouldn't usually see the information in the `[square brackets]`
- These are our value labels, again, super useful for quick questions without having to open the code book!
- One thing I really like about using labels is you get the best of both worlds
- We still have the original values to check with the code book (which you don't get with some tools we will discuss later)
The labels quickly tell us what each value of `effylev` means. Since we are interested in undergraduate figures (for now), we want to keep the rows where `effylev == 2`
```{r}
data <- data |> filter(effylev == 2)
```
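If you would rather not rely on reading the printed output, the code/label pairs can also be pulled out directly, and rows can be selected by label text instead of by number. The sketch below uses a toy column; the codes and label text are illustrative and may not match the real `effylev` labels exactly.

```{r}
# Toy version of a labelled level variable (codes and labels are illustrative)
toy_level <- haven::labelled(
  c(1, 2, 4, 2),
  labels = c("All students" = 1, "Undergraduate" = 2, "Graduate" = 4)
)

# All code/label pairs, returned as a named vector
labelled::val_labels(toy_level)

# Select rows by label text instead of by numeric code
keep <- haven::as_factor(toy_level) == "Undergraduate"
sum(keep)
```

Filtering on the numeric code, as in the step above, is shorter; filtering on the label is more self-documenting but breaks if the label wording changes between years.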
## ![](stata-icon.png){width=0.15in} Stata
**7**. Some of you may have noticed that our data has become extremely "long"
- As in, our data now has many more observations than we originally had
- This means we probably have a little light data-wrangling to do
- Let's check how many observations our data set now contains
```{stata}
count
```
- I have a hunch that the data might be "long" by the variable `effylev`, so, let's look at how many observations we have for each value of `effylev`
```{stata}
tabulate effylev
```
Once again, if you're used to IPEDS data, you would usually see a bunch of numbers in the left-hand column, but now we see informative labels
- These are our value labels, again, super useful for quick questions without having to open the code book!
- If these are ever unclear, the data still contains the original values to check with the code book (which you don't get with some tools we will discuss later)
- You can use the command `labelbook` to check these
```{stata}
labelbook label_effylev
```
The labels quickly tell us what each value of `effylev` means. Since we are interested in undergraduate figures (for now), we want to keep the rows where `effylev == 2`
```{stata}
*| collectcode: true
keep if effylev == 2
```
:::
## 3. Tables with Labels
::: {.panel-tabset}
## {{< fa brands r-project >}} R
**9**. Now let's explore some trends in our data to show how labels can help. How does the percent of students paying out-of-state tuition vary by region?
```{r}
data |>
  group_by(obereg) |>
  summarize(median_perc_out_of_state = median(scfa13p, na.rm = TRUE))
```
Notice how again the labels make our analysis instantly more informative
- We know what obereg 7 means without going to the code book
- Now, if we want to use just the labels in the column, `haven` has a handy tool for that as well: `as_factor()`
- This converts a column with value labels to a factor using the label as the value
```{r}
data |>
  group_by(as_factor(obereg)) |>
  summarize(median_perc_out_of_state = median(scfa13p, na.rm = TRUE))
```
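One small refinement, shown here on made-up data, is to name the grouping column inside `group_by()` so the output column is not literally called `as_factor(obereg)`. The column names and values below are hypothetical.

```{r}
library(dplyr)
library(haven)

# Toy data with a labelled region code (labels and values are invented)
toy <- tibble(
  region = labelled(c(1, 1, 2), labels = c("Region A" = 1, "Region B" = 2)),
  pct    = c(10, 30, 50)
)

# Naming the expression gives the grouping column a clean name
toy_summary <- toy |>
  group_by(region = as_factor(region)) |>
  summarize(median_pct = median(pct))

toy_summary
```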
## ![](stata-icon.png){width=0.15in} Stata
**8**. Now let's explore some trends in our data to show how labels can help. How does the percent of students paying out-of-state tuition vary by region?
```{stata}
tabstat scfa13p, s(median) by(obereg)
```
Notice how again the labels make our analysis instantly more informative
- We know which region has 19% of students paying out-of-state tuition without going to the code book (it would previously just have said 7)
:::
## 4. Plots with Labels
::: {.panel-tabset}
## {{< fa brands r-project >}} R
**10**. What about the relationship between total enrollment and the percent paying out-of-state tuition? Are bigger schools relying more on out-of-state students? Does this trend vary by region?
- This wouldn't work in a table, so, let's look at a simple scatter plot
```{r}
#| label: fig-rplot-none
ggplot(data |> filter(efytotlt < 50000),
       aes(x = efytotlt,
           y = scfa13p)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 0.8)) +
  facet_wrap(~obereg)
```
Okay... But what do those variables mean?
- Without labels, the plot is hard to understand
So, let's add labels
- The first step is to change `facet_wrap(~obereg)` to `facet_wrap(~as_factor(obereg))`
- This is the same as we did in the table above, using a new version of the column that uses the value labels as the value
- The second step involves pulling out the variable labels to go on the x and y axis
- This is a little more manual, but we can set our x and y labels using the `labs()` function as normal
- But instead of putting something like `x = "my x axis label"`, we use `var_label()` from the `labelled` package
```{r}
#| label: fig-rplot-vals
ggplot(data |> filter(efytotlt < 50000),
       aes(x = efytotlt,
           y = scfa13p)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 0.8)) +
  labs(x = var_label(data$efytotlt),
       y = var_label(data$scfa13p)) +
  facet_wrap(~as_factor(obereg))
```
Well, that's more informative, but a little messy
- With a couple of tweaks to allow longer labels to wrap around, we now have a much better looking plot
- `y = str_wrap(var_label(data$scfa13p), 40)` says to make a new line every 40 characters on the y axis
- `labeller = label_wrap_gen(multi_line = TRUE)` inside our `facet_wrap()` allows the facet labels to wrap onto multiple lines
```{r}
#| label: fig-rplot-all
ggplot(data |> filter(efytotlt < 50000),
       aes(x = efytotlt,
           y = scfa13p)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 0.8)) +
  labs(x = var_label(data$efytotlt),
       y = str_wrap(var_label(data$scfa13p), 40)) +
  facet_wrap(~as_factor(obereg),
             labeller = label_wrap_gen(multi_line = TRUE))
```
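If you find yourself wrapping variable labels in several plots, the `str_wrap(var_label(...))` pattern can be bundled into a small helper. This is just a sketch; `wrapped_var_label()` is a hypothetical function name and the label text is made up.

```{r}
# Hypothetical helper: fetch a variable label and wrap it at `width` characters
wrapped_var_label <- function(x, width = 40) {
  stringr::str_wrap(labelled::var_label(x), width = width)
}

# Toy column with an invented label that is too long for one line
toy_pct <- haven::labelled(
  c(10, 30, 50),
  label = "Percent of students paying out-of-state tuition (toy label)"
)

cat(wrapped_var_label(toy_pct, 40))
```

Inside a plot you could then write something like `labs(y = wrapped_var_label(data$scfa13p))`.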
## ![](stata-icon.png){width=0.15in} Stata
**9**. What about the relationship between total enrollment and the percent paying out-of-state tuition? Are bigger schools relying more on out-of-state students? Does this trend vary by region?
- This wouldn't work in a table, so, let's look at a simple scatter plot
```{stata}
scatter scfa13p efytotlt if efytotlt < 50000, by(obereg, col(2))
quietly graph export scatter.svg, replace
```
![](scatter.svg)
```{stata}
*| include: false
*| eval: false
scatter scfa13p efytotlt if efytotlt < 50000, by(obereg, col(3) legend(off)) || lfit scfa13p efytotlt if efytotlt < 50000
quietly graph export scatter2.svg, replace
```
- See how by default the x, y, and by/facet labels use the labels and not the variable names/values?
- This instantly makes your plots more intuitive
- I don't typically use Stata for plotting, so I'm not sure how to get the longer labels to wrap, but I'm sure there's a way
:::
## 5. Models with Labels
::: {.panel-tabset}
## {{< fa brands r-project >}} R
**11**. Lastly, let's look at how labels can show up in modeling. Let's see if the percentage of students paying out-of-state tuition changes by the level of the institution (4 year, 2 year, Less than 2 year)
```{r}
model <- lm(scfa13p ~ factor(iclevel),
data = data)
tbl_regression(model)
```
Without using labels, the regression output needs the code book to interpret
- What is iclevel 2?
Remember from above, using `as_factor()` rather than `factor()` tells R to use the labels as the levels
```{r}
model <- lm(scfa13p ~ as_factor(iclevel),
data = data)
tbl_regression(model)
```
Okay, now it is much clearer what is going on!
- `as_factor(iclevel)` is still a bit messy though
- Similarly to the plot above, using variable labels is a little more tricky, but, we can do it using the `var_label()` function again alongside the `label =` argument in `tbl_regression`
```{r}
tbl_regression(model,
label = list(`as_factor(iclevel)` = var_label(data$iclevel)))
```
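The same effect is visible without `gtsummary`, because `as_factor()` changes the factor levels that `lm()` uses to name its coefficients. The sketch below uses invented data and labels to show this.

```{r}
library(haven)

# Toy outcome and a labelled level variable (all values are invented)
toy <- data.frame(y = c(1, 2, 3, 4, 5, 7))
toy$level <- labelled(
  c(1, 1, 2, 2, 3, 3),
  labels = c("Level A" = 1, "Level B" = 2, "Level C" = 3)
)

# The labels become the factor levels, so they appear in the coefficient names
toy_model <- lm(y ~ as_factor(level), data = toy)
names(coef(toy_model))
```

The first label ("Level A" here) is the reference category, which is why it does not get a coefficient of its own.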
## ![](stata-icon.png){width=0.15in} Stata
**10**. Lastly, let's look at how labels can show up in modeling. Let's see if the percentage of students paying out-of-state tuition changes by the level of the institution (4 year, 2 year, Less than 2 year)
```{stata}
regress scfa13p i.iclevel
```
As you can see, the value labels automatically show up in our regression output
- Before, you would have seen 2 and 3 in the iclevel column, but now you get informative labels
- With the labels, the output is easier to interpret, which makes your work easier to read and makes it less likely you will get mixed up and report the wrong value!
:::
# Removing Value Labels
It's rare, but, there are occasions where you might need to remove value labels from your data
- For instance, in R certain advanced analysis packages get confused when you have value labels
- You may also spot an error in labelling and need to get rid of it
- Luckily, it's pretty easy!
::: {.panel-tabset}
## {{< fa brands r-project >}} R
In R, just use the `zap_labels()` function from `haven` to create an unlabelled version of your data
```{r}
data |>
count(iclevel)
data_unlabelled <- zap_labels(data)
data_unlabelled |>
count(iclevel)
```
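If only one column is causing trouble, you can remove the labels from just that column and leave the rest intact. This is a sketch on made-up data with hypothetical column names.

```{r}
library(dplyr)
library(haven)

# Toy data with two labelled columns (labels are invented)
toy <- tibble(
  level  = labelled(c(1, 2), labels = c("Level A" = 1, "Level B" = 2)),
  region = labelled(c(1, 2), labels = c("Region A" = 1, "Region B" = 2))
)

# Remove the labels from `level` only
toy_partial <- toy |>
  mutate(level = zap_labels(level))

is.labelled(toy_partial$level)
is.labelled(toy_partial$region)
```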
## ![](stata-icon.png){width=0.15in} Stata
In Stata, simply type `label drop _all`
```{stata}
tabulate iclevel
label drop _all
tabulate iclevel
```
:::
# Getting Capitalized Variable Names
- By default, `IPEDtaS` gives you lower-case variable names (which is the default for Stata-style data)
- Usually, this is going to be easier to work with
- However, sometimes you might need to keep the original upper-case variable names, such as if you're adding this to an existing project that already uses upper-case variable names
- To do this, you just need to add a single line near the end of the `IPEDtaS` script
::: {.panel-tabset}
## {{< fa brands r-project >}} R
Add this line
```{r}
#| eval: false
data_file <- data_file |> dplyr::rename_all(stringr::str_to_upper)
```
directly above (near end of script)
```{r}
#| eval: false
haven::write_dta(data_file, dta_name)
```
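To see what the renaming does before editing the script, you can try it on a toy data frame. The sketch below uses `rename_with()`, which supersedes `rename_all()` in current versions of `dplyr`; either should give the same result here. The column values are made up.

```{r}
library(dplyr)

# Toy data frame with lower-case names
toy <- tibble(unitid = 1, instnm = "College A")

# Apply toupper() to every column name
toy_upper <- toy |>
  rename_with(toupper)

names(toy_upper)
```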
## ![](stata-icon.png){width=0.15in} Stata
Add this line
```{stata}
*| eval: false
rename *, upper
```
directly above (near end of script)
```{stata}
*| eval: false
save ../data/`dta_name'
```
:::