Skip to content

Latest commit

 

History

History
158 lines (124 loc) · 5.62 KB

README.md

File metadata and controls

158 lines (124 loc) · 5.62 KB

folk

R-CMD-check codecov Lifecycle: experimental

folk provides easy access to datasets that can be used to benchmark machine learning algorithms. The goal of folk is to facilitate and encourage work on fair machine learning among R users.

The folk package has three key features:

Feature Description
get_() The get_() functions provide easy access to data. Currently, there is only one get_() function, get_acs(), which provides access to the US Census Bureau’s American Community Survey (ACS) Public Use Microdata Sample.
set_task() The set_task() function preprocesses data for pre-defined prediction tasks. Pre-defined tasks can be viewed with show_tasks().
new_task() The new_task() function allows users to create custom tasks. A custom task created via new_task() returns an object consistent with that returned by set_task().

Installation

Install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("george-wood/folk")

Usage

library(folk)
  • Easy access to data via folk’s API: get_acs(), …
devtools::load_all()
# optionally, set a path to write to
delaware <- get_acs(state = "de", year = 2014, period = 1, survey = "person")
  • Show pre-defined prediction tasks for data accessed through the API: show_tasks()
show_tasks(delaware)

#> $income
#> function(
#>     features = c("AGEP",
#>                  "COW",
#>                  "SCHL",
#>                  "MAR",
#>                  "OCCP",
#>                  "POBP",
#>                  "RELP",
#>                  "WKHP",
#>                  "SEX",
#>                  "RAC1P"),
#>     target = "PINCP",
#>     group = "RAC1P",
#>     filter = filter_adult,
#>     target_transform = function(y) binary_target_(y > 50000),
#>     group_transform = NULL,
#>     preprocess = NULL,
#>     postprocess = function(x) replace_na_(x, value = -1L)
#> ) {
#>   invisible(FALSE)
#> }
#> 
#> ...
  • Set a pre-defined prediction task: set_task()
delaware_income <- set_task(delaware, task = "income")
#> ℹ Setting income prediction task. See `folk::show_definition()()` for details.
head(delaware_income)
#>   PINCP RAC1P AGEP COW SCHL MAR OCCP POBP RELP WKHP SEX
#> 1     0     1   25   1   16   5 5400   17   16   40   2
#> 2     0     1   37   1   21   1 3255   34    0   40   2
#> 3     0     1   36   2   19   5  110   40    0   40   1
#> 4     0     1   59   2   20   1 5120   54    0   40   2
#> 5     0     1   21   1   19   5 5240   10    2   36   2
#> 6     1     1   51   1   16   3 7150   24    0   40   1

Example

library(tidymodels)

delaware <- get_acs(state = "de", year = 2014, period = 1, survey = "person")
delaware_income <- set_task(delaware, task = "income")
#> ℹ Setting income prediction task. See `folk::task_income()` for details.

set.seed(0)
split <- initial_split(delaware_income, prop = 0.8)
train <- training(split)
test  <- testing(split)

income_recipe <-
  recipe(PINCP ~ ., data = train) |>
  step_normalize()

income_model <-
  logistic_reg(mode = "classification", engine = "glm")

income_flow <-
  workflow() |>
  add_recipe(income_recipe) |>
  add_model(income_model)

yhat <- 
  fit(income_flow, data = train) |>
  predict(new_data = test, type = "class")
yhat <- as.numeric(as.character(yhat$.pred_class))
black_tpr <- mean(yhat[test$PINCP == 1 & test$RAC1P == 2])
black_fpr <- mean(yhat[test$PINCP == 0 & test$RAC1P == 2])
white_tpr <- mean(yhat[test$PINCP == 1 & test$RAC1P == 1])
white_fpr <- mean(yhat[test$PINCP == 0 & test$RAC1P == 1])

black_tpr
#> [1] 0.3414634
black_fpr
#> [1] 0.1025641

white_tpr
#> [1] 0.5992063
white_fpr
#> [1] 0.1648352

# equalized odds difference:
max(abs(black_tpr - white_tpr), abs(black_fpr - white_fpr))
#> [1] 0.2577429

Acknowledgements

The folk package is inspired by the folktables Python package. For more information on folktables see Ding, Hardt, Miller, and Schmidt (2022), Retiring Adult: New Datasets for Fair Machine Learning. The pre-defined prediction tasks for the American Community Survey data are implementations of the tasks introduced in this paper.