diff --git a/.nojekyll b/.nojekyll
index f7c8040..4f7ec7b 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-334f3c1f
\ No newline at end of file
+4d8fcb43
\ No newline at end of file
diff --git a/assets/permits_animation.gif b/assets/permits_animation.gif
index 23c5f41..58a1d0f 100644
Binary files a/assets/permits_animation.gif and b/assets/permits_animation.gif differ
diff --git a/final.html b/final.html
index 84be0d7..ddf0339 100644
--- a/final.html
+++ b/final.html
@@ -9172,19 +9172,19 @@
This study leverages open data sources, including permit counts, council district boundaries, racial mix, median income, and housing cost burden, to holistically understand what drives development pressure. Generally, data are collected at the block group or parcel level and aggregated up to the council district to capture both local and citywide trends.
@@ -9324,9 +9324,10 @@
-
3.1 Permits
-
Firstly, 10 years of permit data from 2012 to 2023 from the Philadelphia Department of Licenses and Inspections are critical to the study. This study filters only for new construction permits granted for residential projects. In the future, filtering for full and substantial renovations could add more nuance to what constitutes as development pressure.
+
+
3.1 Construction Permits
+
Permit data from 2013 through 2023, collected from the Philadelphia Department of Licenses & Inspections, are the basis of our model. We consider only new construction permits granted for residential projects, but in the future, filtering for data on “full” or “substantial” renovations could add nuance to the complexities of development pressure. Given the granular spatial scale of our analysis, and the need to aggregate Census data to our unit of analysis, we chose to aggregate these permit data to the block group level.
We note a significant uptick in new construction permits as we approach 2021, followed by a sharp decline. It is generally acknowledged that this trend was due to the expiration of a tax abatement program for developers.
When assessing new construction permit count by Council Districts, a few districts issued the bulk of new permits during that 2021 peak. Hover over the lines to see more about the volume of permits and who granted them.
New construction exhibits sizable spatial and temporal autocorrelation. In other words, there is a strong relationship between the number of permits in a given block group and the number of permits in neighboring block groups; as well as between the number of permits issued in a block group in a given year and the number of permits issued in that same block group in the previous year. To account for these relationships, we engineer new features, including both space and time lags. We note that all of these engineered features have strong correlation coefficients with our dependent variable, permits_count, and p-values indicating that these relationships are statistically significant.
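The space- and time-lag features described above can be sketched roughly as follows. The objects `permits_bg`, `geoid10`, and `permits_count` appear elsewhere in our code, but the lag column names and the queen-contiguity neighbor specification shown here are illustrative assumptions, not our exact implementation.

```r
# Hedged sketch of the engineered lag features (column names are illustrative).
library(dplyr)
library(sf)
library(spdep)

# Spatial lag: mean permit count among queen-contiguity neighbors,
# computed within a single year of data.
one_year <- filter(permits_bg, year == 2022)
nb       <- poly2nb(one_year, queen = TRUE)
wts      <- nb2listw(nb, style = "W", zero.policy = TRUE)
one_year$space_lag <- lag.listw(wts, one_year$permits_count, zero.policy = TRUE)

# Time lag: the same block group's permit count in the previous year.
permits_bg <- permits_bg %>%
  group_by(geoid10) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(time_lag = lag(permits_count)) %>%
  ungroup()
```

Repeating the spatial-lag step per year, rather than pooling years, keeps each observation's neighbors contemporaneous with it.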
Racial Mix (white vs non-white), median income, and housing cost burden are socioeconomic factors that often play an outsized role in affordability in cities like Philadelphia, with a pervasive and persistent history of housing discrimination and systemic disinvestment. This data is all pulled from the US Census Bureau’s American Community Survey 5-Year survey.
+
+
3.3 Socioeconomic Features
+
Socioeconomic factors such as race, income, and housing cost burden play an outsized role in affordability in cities like Philadelphia, which are marred by a pervasive and persistent history of housing discrimination and systemic disinvestment in poor and minority neighborhoods. To account for these issues, we incorporate various data from the US Census Bureau’s American Community Survey 5-Year survey. Later, we also consider our model’s generalizability across different racial and economic contexts to ensure that it will not inadvertently reinforce structural inequity.
Spatially, it is clear that non-white communities earn lower median incomes and experience higher rates of extreme rent burden (a household spending more than 35% of its income on gross rent).
Considering the strong spatial relationship between socioeconomics and certain areas of Philadelphia, we will be sure to investigate our model’s generalizability across race and income.
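ACS variables like those described above can be pulled with the tidycensus package; this is a minimal sketch, and the variable codes shown are examples (median household income and total population), not necessarily the full set used in this report.

```r
# Hedged sketch of an ACS 5-Year pull at the block group level.
# Requires a Census API key (see tidycensus::census_api_key()).
library(tidycensus)

acs_bg <- get_acs(
  geography = "block group",
  state     = "PA",
  county    = "Philadelphia",
  year      = 2022,
  survey    = "acs5",
  variables = c(med_inc = "B19013_001", total_pop = "B01003_001"),
  output    = "wide"
)
```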
-
-
4 Build Predictive Models
+
+
4 Model Building
“All the complaints about City zoning regulations really boil down to the fact that City Council has suppressed infill housing or restricted multi-family uses, which has served to push average housing costs higher.” - Jon Geeting, Philly 3.0 Engagement Director
SmartZoning® seeks to predict where permits are most likely to be filed as a means of predicting urban growth. As discussed, predicting growth is fraught because it is influenced by political forces rather than by plans published by the city’s Planning Commission. Comprehensive plans, typically set on ten-year timelines, tend to become mere suggestions, ultimately subject to the prerogatives of city council members rather than serving as steadfast guides for smart growth. With these dynamics in mind, SmartZoning’s prediction model accounts for socioeconomics, council district, and time-space lag.
-
-
4.1 Tests for Correlation
-
The goal is to select variables that most significantly correlate to permit count to include in the predictive model. Correlation is a type of association test. For example, are permit counts more closely associated to population or to median income? Or, do racial mix and rent burden offer redundant insight? These are the types of subtle but important distinctions we aim to seek out.
+
+
4.1 Tests for Correlation and Collinearity
4.1.1 Correlation Coefficients
+
In building our model, we aim to select variables that correlate significantly with permits_count. Using a correlation matrix, we can assess whether our predictors are, in fact, meaningfully associated with our dependent variable. As it turns out, the socioeconomic variables are not (we exclude from the matrix the variables we have previously established to be significant), but we retain them for the later analysis of generalizability.
To ensure that our predictive model does not suffer from multicollinearity, or multiple variables telling the same story about permit counts, we use the VIF test. The table below lists each variable’s VIF score. Variables with a score over 5 are considered to potentially have some multicollinearity, and those over 10 certainly need to be flagged. Generally, the council district and zoning overlays such as historic districts may be conflicting.
We also aim to minimize or eliminate multicollinearity in our model. For this purpose, we evaluate the variance inflation factor (VIF) of a given predictor. The table below lists the VIF of all of our predictors; we exclude any with a VIF of 5 or more from our final model, including district, which is council district, and several historic district and planning overlays.
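The VIF screening can be outlined with `car::vif()` on a linear fit; the predictor names below are placeholders standing in for our actual feature set.

```r
# Illustrative VIF screening; predictor names are placeholders.
library(car)
library(sf)

vif_fit <- lm(permits_count ~ space_lag + time_lag + med_inc + pct_nonwhite,
              data = st_drop_geometry(permits_bg))

vif_scores <- vif(vif_fit)        # one score per predictor
names(which(vif_scores >= 5))     # candidates to drop from the final model
```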
ggplot(permits_bg %>% st_drop_geometry %>%filter(!year %in%c(2024)), aes(x = permits_count)) +geom_histogram(fill = palette[1], color =NA, alpha =0.7) +labs(title ="Permits per Block Group per Year",
@@ -9983,9 +9982,9 @@
-
4.2 Examine Spatial Patterns
-
To identify spatial clusters, or hotspots, in geographic data, we performed a Local Moran’s I test. It assesses the degree of spatial autocorrelation: the extent to which permit counts in a block group tend to be similar to those in neighboring block groups. We used a p-value of 0.1 as our hotspot threshold.
+
+
4.2 Spatial Patterns
+
In addition to correlation between non-spatial variables, our dependent variable, permits_count, displays a high degree of spatial autocorrelation. That is, the number of permits at a given location is closely related to the number of permits at neighboring locations. We’ve accounted for this in our model by factoring in spatial lag, and we explore it here by evaluating the local Moran’s I values, a measure of how concentrated high or low values are at a given location. Here, we identify hotspots for new construction in 2023 by looking at statistically significant concentrations of new building permits.
Show the code
@@ -10010,50 +10009,61 @@
morans_i <-tmap_theme(tm_shape(lisa) +tm_polygons(col ="ii", border.alpha =0, style ="jenks", palette = mono_5_green, title ="Moran's I"),
-"Local Moran's I")
+"Local Moran's I (2023)")
p_value <-tmap_theme(tm_shape(lisa) +tm_polygons(col ="p_ii", border.alpha =0, style ="jenks", palette = mono_5_green, title ="P-Value"),
-"Moran's I P-Value")
+"Moran's I P-Value (2023)")
sig_hotspots <-tmap_theme(tm_shape(lisa) +tm_polygons(col ="hotspot", border.alpha =0, style ="cat", palette =c(mono_5_green[1], mono_5_green[5]), textNA ="Not a Hotspot", title ="Hotspot?"),
-"New Construction Hotspots")
+"Construction Hotspots (2023)")
tmap_arrange(morans_i, p_value, sig_hotspots, ncol =3)
# # Prepare the data
+# permits_data <- permits_bg %>%
+# select(permits_count, geoid10, year) %>%
+# na.omit()
+#
+# # Check for infinite values or other anomalies
+# if(any(is.infinite(permits_data$permits_count), na.rm = TRUE)) {
+# stop("Infinite values found in permits_count")
+# }
+#
+# # Create spacetime object
+# stc <- as_spacetime(permits_data,
+# .loc_col = "geoid10",
+# .time_col = "year")
+#
+# # Run emerging hotspot analysis
+# ehsa <- emerging_hotspot_analysis(
+# x = stc,
+# .var = "permits_count",
+# k = 1,
+# nsim = 3
+# )
+#
+# # Analyze the result
+# count(ehsa, classification)
4.3 Compare Models
-
Make sure to note that we train, test, and then validate. So these first models are based on 2022 data, and then we run another on 2023 (and then predict 2024 at the end).
-
There are various regression models available, each with its assumptions, strengths, and weaknesses. We compared Ordinary Least Square, Poisson, and Random Forest. This comparative study allowed us to consider the model’s accuracy, if it overfit, its generalizability, as well as compuationl efficiency.
-
The Poisson model was unviable because it overvalued outliers and therefore is not detailed below.
+
To actually build our model, we have a range of approaches to choose from. OLS, or ordinary least squares regression, is among the most common; here, we use it as the basis for comparison with a random forest model, which is somewhat more sophisticated. We also considered a Poisson model, but found that it drastically overpredicted for outliers, and we therefore discarded it. As a point of comparison, we built both OLS and random forest models, trained them on data from 2013 through 2021, tested them on 2022 data, and compared the results for accuracy, overfitting, and generalizability.
4.3.1 OLS
-
OLS (Ordinary least squares) is a method to explore relationships between a dependent variable and one or more explanatory variables. It considers the strength and direction of these relationships and the goodness of model fit. Our model incorporates three engineered groups of features: space lag, time lag, and distance to 2022. We include this last variable because of the Philadelphia tax abatement policy that led to a significant increase in residential development in the years immediately before 2022 discussed earlier. We used this as a baseline model to compare to Poisson and Random Forest. Given how tightly aligned the observed and predicted prices are we performed dozens of variable combinations to rule out over fitting. We are confident that our variables are generalizable and do not over-fit.
+
OLS (ordinary least squares) is a method to explore relationships between a dependent variable and one or more explanatory variables. It considers the strength and direction of these relationships and the goodness of model fit. Our model incorporates three engineered groups of features: space lag, time lag, and distance to 2022. We include this last variable because of the Philadelphia tax abatement policy, discussed earlier, that led to a significant increase in residential development in the years immediately before 2022. We used this as a baseline model to compare to Poisson and Random Forest.
+
Overall, we found that our basic OLS model performed quite well; with a mean absolute error (MAE) of 2.68, it is fairly accurate in predicting future development. We also note that it overpredicts in most cases, which, given our goal of anticipating and preparing for high demand for future development, is preferable to underpredicting. That said, it still produces a handful of outliers that deviate substantially from the predicted value. As a result, we considered a random forest model to see if it would handle these outliers better.
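In outline, the OLS baseline and its MAE come together as below; `permits_bg`, `permits_count`, and the 2013–2021 train / 2022 test split are described in the text, while `dist_to_2022` and the other feature names are placeholder assumptions.

```r
# Sketch of the OLS baseline and its test-set MAE (feature names assumed).
library(dplyr)
library(sf)

train <- st_drop_geometry(filter(permits_bg, year <= 2021))
test  <- st_drop_geometry(filter(permits_bg, year == 2022))

ols_fit <- lm(permits_count ~ space_lag + time_lag + dist_to_2022, data = train)

test$prediction <- predict(ols_fit, newdata = test)
test$abs_error  <- abs(test$permits_count - test$prediction)

mean(test$abs_error)                        # MAE; the report cites about 2.68
mean(test$prediction > test$permits_count)  # share of block groups overpredicted
```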
Show the code
@@ -10072,71 +10082,112 @@
+
+
Show the code
-
ggplot(ols_preds, aes(x = abs_error)) +
-geom_histogram(fill = palette[3], color =NA, alpha =0.7) +
-labs(title ="Distribution of Absolute Error per Block Group",
-subtitle ="OLS, 2022") +
-theme_minimal()
Random forest models are superior to OLS in their ability to capture non-linear patterns, outliers, and so forth. They also tend to be less sensitive to multicollinearity. Thus, we considered whether a random forest model would improve on some of the weaknesses of the OLS model. Although the random forest model yielded a slightly higher MAE of 2.91, the range of absolute error was sizably reduced, with outliers exerting less of an impact on the model.
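A comparable random forest fit looks roughly like this; we assume the randomForest package here (the report may use a different engine), and the feature names are the same placeholders as in the OLS sketch.

```r
# Hedged sketch of the random forest counterpart to the OLS baseline.
library(randomForest)
library(dplyr)
library(sf)

train <- st_drop_geometry(filter(permits_bg, year <= 2021))
test  <- st_drop_geometry(filter(permits_bg, year == 2022))

set.seed(1234)  # random forests are stochastic; fix the seed for repeatability
rf_fit <- randomForest(permits_count ~ space_lag + time_lag + dist_to_2022,
                       data = train, ntree = 500)

test$rf_pred   <- predict(rf_fit, test)
test$abs_error <- abs(test$permits_count - test$rf_pred)
mean(test$abs_error)   # the report cites an MAE of about 2.91 on 2022 data
```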
suppressMessages(
+ggplot(rf_test_preds, aes(x = permits_count, y = rf_test_preds)) +
+geom_point() +
+labs(title ="Predicted vs. Actual Permits: RF",
+subtitle ="2022 Data",
+x ="Actual Permits",
+y ="Predicted Permits") +
+geom_abline() +
+geom_smooth(method ="lm", se =FALSE, color = palette[3]) +
+theme_minimal()
+)
+
+
+
-
Our OLS model exhibits a Mean Absolute Error (MAE) of 2.66, a decent performance for a model of its simplicity. However, its efficacy is notably diminished in critical domains where optimization is imperative. Consequently, we intend to enhance the predictive capacity by incorporating more pertinent variables and employing a more sophisticated modeling approach.
We find that our OLS model has an MAE of only 2.68, not bad for such a simple model! Still, it struggles most in the areas where we most need it to succeed, so we will try to introduce better variables and apply a more complex model to improve our predictions.
-
-
4.3.2 Random Forest
-
OLS and Random Forest represent different modeling paradigms. OLS is a linear regression model suitable for capturing linear relationships, while Random Forest is an ensemble method capable of capturing non-linear patterns and offering greater flexibility in handling various data scenarios. Considering, Random Forest is generally less sensitive to multicollinearity because it considers subsets of features in each tree and averages their predictions and because the effect of outliers tends to be mitigated, we decided it worth investigating Random Forest as an alternative model.
-
Compared to the OLS model, the relationship between predicted vs actual permits…
+
+
+
+
5 Model Testing
+
Model training, validation, and testing involved three steps. First, we partitioned our data into training, validation, and testing sets. We used data from 2013 through 2021 for initial model training. Next, we evaluated our models’ ability to accurately predict 2022 construction permits using our validation set, which consisted of all permits in 2022. We carried out additional feature engineering and model tuning, iterating based on the results of these training and testing splits. We sought both to minimize the mean absolute error (MAE) of our best model and to narrow the distribution of absolute error. Finally, when we were satisfied with the results of our best model, we evaluated it again by training it on all data from 2013 through 2022 and testing it on data from 2023 (all but the last two weeks, which we consider negligible for our purposes), which the model had never “seen” before. As Kuhn and Johnson write in Applied Predictive Modeling (2013), “Ideally, the model should be evaluated on samples that were not used to build or fine-tune the model, so that they provide an unbiased sense of model effectiveness.”
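The final hold-out step can be outlined as follows: refit on all 2013–2022 data, then score the 2023 permits the model has never seen. As elsewhere, the engine and feature names are illustrative assumptions.

```r
# Sketch of the final hold-out evaluation (names are illustrative).
library(randomForest)
library(dplyr)
library(sf)

train_full <- st_drop_geometry(filter(permits_bg, year <= 2022))
holdout    <- st_drop_geometry(filter(permits_bg, year == 2023))

rf_final <- randomForest(permits_count ~ space_lag + time_lag + dist_to_2022,
                         data = train_full, ntree = 500)

# MAE on data the model was never trained or tuned on
mae_2023 <- mean(abs(holdout$permits_count - predict(rf_final, holdout)))
```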
+
Again, testing confirms the strength of our model; based on 2023 data, our random forest model produces an MAE of 2.19. We note again that the range of model error is relatively narrow.
Show the code
-
ggplot(rf_test_preds, aes(x = abs_error)) +
-geom_histogram(fill = palette[3], alpha =0.7, color =NA) +
-labs(title ="Distribution of Absolute Error per Block Group",
-subtitle ="Random Forest, 2022") +
-theme_minimal()
Considering Random Forest’s favorable results and attributes for our study compared to OLS, we will train and test our predictive model using the random forest model.
-
We decided to split our training and testing data up to 2022 in an effort to balance permiting activity pre- and post- tax abatement policy.
-
[code block here]
-
We train and test up to 2022–we use this for model tuning and feature engineering.
-
Having settled on our model features and tuning, we now validate on 2023 data.
The boxplot, which categorizes observations by racial composition, indicates that the random forest model generalizes well: absolute errors are consistent and relatively low across both majority non-white and majority white block groups. The similarity in error distributions suggests that the model’s predictive performance remains robust across diverse racial compositions.
We find that error is not related to affordability and actually trends downward with percent nonwhite. (This is probably because there is less total development happening there in majority-minority neighborhoods to begin with, so the magnitude of error is less, even though proportionally it might be more.) Error increases slightly with total pop. This makes sense–more people –> more development.
Our analysis reveals that the error is not correlated with affordability and demonstrates a downward trend in conjunction with the percentage of the nonwhite population. This observed pattern may be attributed to the likelihood that majority-minority neighborhoods experience a comparatively lower volume of overall development, thereby diminishing the absolute magnitude of error, despite potential proportional increases. Additionally, there is a slight increase in error with the total population, aligning with the intuitive expectation that higher population figures correspond to more extensive development activities.
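The generalizability check described above reduces to a grouped error summary; `rf_test_preds` and `abs_error` appear in our code, while `pct_nonwhite` is an assumed column name here.

```r
# Sketch of the generalizability check: mean absolute error by racial context.
library(dplyr)

rf_test_preds %>%
  mutate(context = ifelse(pct_nonwhite > 0.5,
                          "Majority non-white", "Majority white")) %>%
  group_by(context) %>%
  summarise(mean_abs_error = mean(abs_error),
            n_block_groups = n())
```

A similar summary, grouped by income quantile instead of racial majority, supports the affordability comparison in the same paragraph.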
This study leverages open data sources including permit counts, council district boundaries, racial mix, median income, housing cost burden to holistically understand what drives development pressure. Generally, data is collected at the block group or parcel level and aggregated up to the council district to capture both local and more citywide trends.
@@ -9324,9 +9324,10 @@
-
3.1 Permits
-
Firstly, 10 years of permit data from 2012 to 2023 from the Philadelphia Department of Licenses and Inspections are critical to the study. This study filters only for new construction permits granted for residential projects. In the future, filtering for full and substantial renovations could add more nuance to what constitutes as development pressure.
+
+
3.1 Construction Permits
+
Permit data from 2013 through 2023, collected from the Philadelphia Department of Licenses & Inspections, are the basis of our model. We consider only new construction permits granted for residential projects, but in the future, filtering for data on “full” or “substantial” renovations could add nuance to the compelexities of development pressure. Given the granular spatial scale of our analysis, and the need to aggregate Census data to our unit of analysis, we chose to aggregate these permits data to the block group level.
We note a significant uptick in new construction permits as we approach 2021, followed by a sharp decline. It is generally acknowledged that this trend was due to the expiration of a tax abatement program for developers.
When assessing new construction permit count by Council Districts, a few districts issued the bulk of new permits during that 2021 peak. Hover over the lines to see more about the volume of permits and who granted them.
New construction exhibits sizable spatial and temporal autocorrelation. In other words, there is a strong relationship between the number of permits in a given block group and the number of permits in neighboring block groups; as well as between the number of permits issued in a block group in a given year and the number of permits issued in that same block group in the previous year. To account for these relationships, we engineer new features, including both space and time lags. We note that all of these engineered features have strong correlation coefficients with our dependent variable, permits_count, and p-values indicating that these relationships are statistically significant.
Racial Mix (white vs non-white), median income, and housing cost burden are socioeconomic factors that often play an outsized role in affordability in cities like Philadelphia, with a pervasive and persistent history of housing discrimination and systemic disinvestment. This data is all pulled from the US Census Bureau’s American Community Survey 5-Year survey.
+
+
3.3 Socioeconomic Features
+
Socioeconomic factors such as race, income, and housing cost burden play an outsized role in affordability in cities like Philadelphia, which are marred by a pervasive and persistent history housing discrimination and systemic disinvestment in poor and minority neighborhoods. To account for these issues, we incorporate various data from the US Census Bureau’s American Community Survey 5-Year survey. Later, we also consider our model’s generalizability across different racial and economic contexts to ensure that it will not inadvertently reinforce structural inequity.
Spatially, is clear that non-white communities earn lower median incomes and experience higher rates of extreme rent burden (household spends more than 35% of income on gross rent).
Considering the strong spatial relationship between socioeconomics and certain areas of Philadelphia, we will be sure to investigate our model’s generalizability against race and income.
-
-
4 Build Predictive Models
+
+
4 Model Building
“All the complaints about City zoning regulations really boil down to the fact that City Council has suppressed infill housing or restricted multi-family uses, which has served to push average housing costs higher.” - Jon Geeting, Philly 3.0 Engagement Director
SmartZoning® seeks to predict where permits are most likely to be filed as a measure to predict urban growth. As discussed, predicting growth is fraught because growth is influenced by political forces rather than by plans published by the city’s Planning Commission. Comprehensive plans, typically set on ten-year timelines, tend to become mere suggestions, ultimately subject to the prerogatives of city council members rather than serving as steadfast guides for smart growth. With these dynamics in mind, SmartZoning’s prediction model accounts for socioeconomics, council district, and time-space lag.
-
-
4.1 Tests for Correlation
-
The goal is to select variables that most significantly correlate to permit count to include in the predictive model. Correlation is a type of association test. For example, are permit counts more closely associated to population or to median income? Or, do racial mix and rent burden offer redundant insight? These are the types of subtle but important distinctions we aim to seek out.
+
+
4.1 Tests for Correlation and Collinearity
4.1.1 Correlation Coefficients
+
In building our model, we aim to select variables that correlate significantly with permit_count. Using a correlation matrix, we can assess whether our predictors are, in fact, meaningfully associated with our dependent variable. As it turns out, socioeconomic variables are not (we exclude the other variables, which we have previously established to be significant), but we retain them for the sake of later analysis.
To ensure that our predictive model does not have multicollinearity, or multiple values telling the same story about permit counts, we use the VIF test. The table below lists each variables’s VIF score. Variables that have over a 5 are considered to potentially have some multicollinearity, and those over 10 certainly need to be flagged. Generally the council district and zoning overlays such as historic districts may be conflicting.
We also aim to minimize or eliminate multicollinearity in our model. For this purpose, we evaluate the variance inflation factor (VIF) of a given predictor. The table below lists the VIF of all of our predictors; we exclude any with a VIF of 5 or more from our final model, including district, which is council district, and several historic district and planning overlays.
ggplot(permits_bg %>% st_drop_geometry %>%filter(!year %in%c(2024)), aes(x = permits_count)) +geom_histogram(fill = palette[1], color =NA, alpha =0.7) +labs(title ="Permits per Block Group per Year",
@@ -9983,9 +9982,9 @@
-
4.2 Examine Spatial Patterns
-
To to identify spatial clusters, or hotspots, in geographic data, we performed a Local Moran’s I test. It assesses the degree of spatial autocorrelation, which is the extent to which the permit counts in a block group tend to be similar to neighboring block group. We used a p-value of 0.1 as our hotspot threshold.
+
+
4.2 Spatial Patterns
+
In addition to correlation between non-spatial variables, our dependent variable, permits_count, displays a high degree of spatial autocorrelation. That is, the number of permits at a given location is closely related to the number of permits at neighboring locations. We’ve accounted for this in our model by factoring in spatial lag, and we explore it here by evaluating the local Moran’s I values, which is the measure of how concentrated high or low values are at a given location. Here, we identify hotspots for new construction in 2023 by looking at statistically signficant concentrations of new building permits.
Show the code
@@ -10010,50 +10009,61 @@
morans_i <-tmap_theme(tm_shape(lisa) +tm_polygons(col ="ii", border.alpha =0, style ="jenks", palette = mono_5_green, title ="Moran's I"),
-"Local Moran's I")
+"Local Moran's I (2023)")p_value <-tmap_theme(tm_shape(lisa) +tm_polygons(col ="p_ii", border.alpha =0, style ="jenks", palette = mono_5_green, title ="P-Value"),
-"Moran's I P-Value")
+"Moran's I P-Value (2023)")sig_hotspots <-tmap_theme(tm_shape(lisa) +tm_polygons(col ="hotspot", border.alpha =0, style ="cat", palette =c(mono_5_green[1], mono_5_green[5]), textNA ="Not a Hotspot", title ="Hotspot?"),
-"New Construction Hotspots")
+"Construction Hotspots (2023)")tmap_arrange(morans_i, p_value, sig_hotspots, ncol =3)
# # Prepare the data
+# permits_data <- permits_bg %>%
+# select(permits_count, geoid10, year) %>%
+# na.omit()
+#
+# # Check for infinite values or other anomalies
+# if(any(is.infinite(permits_data$permits_count), na.rm = TRUE)) {
+# stop("Infinite values found in permits_count")
+# }
+#
+# # Create spacetime object
+# stc <- as_spacetime(permits_data,
+# .loc_col = "geoid10",
+# .time_col = "year")
+#
+# # Run emerging hotspot analysis
+# ehsa <- emerging_hotspot_analysis(
+# x = stc,
+# .var = "permits_count",
+# k = 1,
+# nsim = 3
+# )
+#
+# # Analyze the result
+# count(ehsa, classification)
4.3 Compare Models
-
Make sure to note that we train, test, and then validate. So these first models are based on 2022 data, and then we run another on 2023 (and then predict 2024 at the end).
-
There are various regression models available, each with its assumptions, strengths, and weaknesses. We compared Ordinary Least Square, Poisson, and Random Forest. This comparative study allowed us to consider the model’s accuracy, if it overfit, its generalizability, as well as compuationl efficiency.
-
The Poisson model was unviable because it overvalued outliers and therefore is not detailed below.
+
To actually build our model, we have a range to choose from. OLS, or least squares regression, is among the most common; here, we use it as the basis for comparison with a random forest model, which is somewhat more sophisticated. We also considered a Poisson model, although found that it drastically overpredicted for outliers, and we therefore discarded it. As a point of comparison, we built both OLS and random forest models, trained them on data from 2013 through 2021, tested them on 2022 data, dn compared the results for accuracy, overfitting, and generalizability.
4.3.1 OLS
-
OLS (Ordinary least squares) is a method to explore relationships between a dependent variable and one or more explanatory variables. It considers the strength and direction of these relationships and the goodness of model fit. Our model incorporates three engineered groups of features: space lag, time lag, and distance to 2022. We include this last variable because of the Philadelphia tax abatement policy that led to a significant increase in residential development in the years immediately before 2022 discussed earlier. We used this as a baseline model to compare to Poisson and Random Forest. Given how tightly aligned the observed and predicted prices are we performed dozens of variable combinations to rule out over fitting. We are confident that our variables are generalizable and do not over-fit.
+
OLS (Ordinary least squares) is a method to explore relationships between a dependent variable and one or more explanatory variables. It considers the strength and direction of these relationships and the goodness of model fit. Our model incorporates three engineered groups of features: space lag, time lag, and distance to 2022. We include this last variable because of the Philadelphia tax abatement policy that led to a significant increase in residential development in the years immediately before 2022 discussed earlier. We used this as a baseline model to compare to Poisson and Random Forest.
+
Overall, we found that our basic OLS model performed quite well; with a mean absolute error (MAE) of 2.68, it is fairly accurate in prediciting future development. We also note that it overpredicts in most cases which, given our goal of anticipating and preparing for high demand for future development, is preferrable to underpredicting. That said, it still produces a handful of outliers that deviate substantially from the predicted value. As a result, we considered a random forest model to see if it would handle these outliers better.
Show the code
@@ -10072,71 +10082,112 @@
+
+
Show the code
-
ggplot(ols_preds, aes(x = abs_error)) +
-geom_histogram(fill = palette[3], color =NA, alpha =0.7) +
-labs(title ="Distribution of Absolute Error per Block Group",
-subtitle ="OLS, 2022") +
-theme_minimal()
Random forest models are superior to OLS in their ability to capture non-linear patterns, outliers, and so forth. They also tend to be less sensitive to multicolinearity. Thus, we considered whether a random forest model would improve on some of the weaknesses of the OLS model. We found that this was indeed the case; the random forest model yielded a MAE of 2.91. Furthermore, the range of absolute error in the model was sizably reduced, with outliers exerting less of an impact on the model.
suppressMessages(
+ggplot(rf_test_preds, aes(x = permits_count, y = rf_test_preds)) +
+geom_point() +
+labs(title ="Predicted vs. Actual Permits: RF",
+subtitle ="2022 Data",
+x ="Actual Permits",
+y ="Predicted Permits") +
+geom_abline() +
+geom_smooth(method ="lm", se =FALSE, color = palette[3]) +
+theme_minimal()
+)
+
+
+
-
We find that our OLS model has an MAE of 2.68, not bad for such a simple model. Still, it struggles most in the areas where we most need it to succeed, so we will introduce better variables and apply a more complex model to improve our predictions.
-
-
4.3.2 Random Forest
-
OLS and Random Forest represent different modeling paradigms. OLS is a linear regression model suitable for capturing linear relationships, while Random Forest is an ensemble method capable of capturing non-linear patterns and offering greater flexibility in handling various data scenarios. Considering, Random Forest is generally less sensitive to multicollinearity because it considers subsets of features in each tree and averages their predictions and because the effect of outliers tends to be mitigated, we decided it worth investigating Random Forest as an alternative model.
-
Compared to the OLS model, the relationship between predicted vs actual permits…
+
+
+
+
5 Model Testing
+
Model training, validation, and testing involved three steps. First, we partitioned our data into training, validation, and testing sets, using data from 2013 through 2021 for initial model training. Next, we evaluated our models’ ability to accurately predict 2022 construction permits against our validation set, which consisted of all permits issued in 2022. We carried out additional feature engineering and model tuning, iterating based on the results of these training and validation splits, seeking both to minimize the mean absolute error (MAE) of our best model and to narrow the distribution of absolute error. Finally, once we were satisfied with our best model, we evaluated it one more time by training it on all data from 2013 through 2022 and testing it on data from 2023 (all but the last two weeks, which we consider negligible for our purposes), which the model had never “seen” before. As Kuhn and Johnson write in Applied Predictive Modeling (2013), “Ideally, the model should be evaluated on samples that were not used to build or fine-tune the model, so that they provide an unbiased sense of model effectiveness.”
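The three-step temporal split described above can be sketched as follows (toy data frame; `year` and `permits_count` are illustrative column names, not necessarily the report's actual schema):

```r
set.seed(123)
# Toy permits data: one row per block group per year, 2013-2023
permits <- expand.grid(block_group = 1:5, year = 2013:2023)
permits$permits_count <- rpois(nrow(permits), lambda = 3)

train       <- subset(permits, year <= 2021)  # initial training: 2013-2021
validation  <- subset(permits, year == 2022)  # tuning and feature engineering: 2022
final_train <- subset(permits, year <= 2022)  # retrain best model on 2013-2022
holdout     <- subset(permits, year == 2023)  # unseen final evaluation: 2023

nrow(train); nrow(validation); nrow(holdout)
```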
+
Again, testing confirms the strength of our model; based on 2023 data, our random forest model produces an MAE of 2.19. We note again that the range of model error is relatively narrow.
-
ggplot(rf_test_preds, aes(x = abs_error)) +
-geom_histogram(fill = palette[3], alpha = 0.7, color = NA) +
-labs(title = "Distribution of Absolute Error per Block Group",
-subtitle = "Random Forest, 2022") +
-theme_minimal()
Given Random Forest’s favorable results compared to OLS and its attributes suited to our study, we will train and test our final predictive model using the random forest approach.
-
We decided to split our training and testing data up to 2022 in an effort to balance permitting activity pre- and post-tax-abatement policy.
-
[code block here]
-
We train and test up to 2022–we use this for model tuning and feature engineering.
-
Having settled on our model features and tuning, we now validate on 2023 data.
A boxplot of absolute error grouped by racial composition indicates that the random forest model generalizes well: absolute errors are consistently low in both majority non-white and majority white block groups. The similarity of the two error distributions suggests that the model’s predictive performance holds across diverse racial compositions.
We find that error is not related to affordability and actually trends downward with percent nonwhite. This is likely because there is less total development happening in majority-minority neighborhoods to begin with, so the magnitude of error is smaller, even if it may be larger proportionally. Error increases slightly with total population, which matches intuition: more people means more development.
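The grouping behind the boxplot can be sketched as follows (simulated data for illustration only; `abs_error` and `pct_nonwhite` are hypothetical column names):

```r
set.seed(1)
# Simulated block-group errors and demographics, standing in for the report's data
bg <- data.frame(
  abs_error    = abs(rnorm(100, mean = 2, sd = 1)),
  pct_nonwhite = runif(100)
)
bg$majority <- ifelse(bg$pct_nonwhite > 0.5, "majority non-white", "majority white")

# Compare the error distribution across the two groups (the basis of the boxplot)
tapply(bg$abs_error, bg$majority, median)
# boxplot(abs_error ~ majority, data = bg)  # visual version
```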