A troubling consequence of gentrification is the displacement of long-term residents who are "priced out" of their neighborhoods as new investment and attention flow into the area.
This case study uses new construction permits as a proxy for gentrification. While policy experts agree that there are many drivers of gentrification, new construction permits are one strong indicator of increasing investment and attention in a particular neighborhood of a city. In this study, we identify where new construction permits were filed in Philadelphia between 2010 and 2018 in order to predict where permits will be filed in 2019.
The model created below is intended to help fair housing agencies, which provide services and resources to individuals at risk of foreclosure so they can maintain ownership of their homes. Specifically, the model predicts in which neighborhoods more construction permits will be filed and, ultimately, which neighborhoods are at risk of gentrification and potential displacement. Fair housing agencies will be able to use the tool to help existing clients as well as to conduct outreach to new clients who may soon struggle financially due to increasing property taxes or other rising costs of living.
This model is reproducible for other cities. It could also be extended by adding other variables thought to drive gentrification.
First we load the appropriate libraries, define a couple of functions to help create clear visuals, and set our desired color palettes.
# load libraries
knitr::opts_chunk$set(warning=FALSE, message=FALSE, class.source = "fold-hide")
library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
options(scipen=999)
mapTheme <- function(base_size = 12) {
theme(
text = element_text(color = "black"),
plot.title = element_text(size = 14, color = "black"),
plot.subtitle = element_text(face = "italic"),
plot.caption = element_text(hjust = 0),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(color = "black", fill = NA, size = 2)
)
}
plotTheme <- function(base_size = 12){
theme(
text = element_text(color = "black"),
plot.title = element_text(size = 14, color = "black"),
plot.subtitle = element_text(face = "italic"),
plot.caption = element_text(hjust = 0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(color = "black", fill = NA, size = 2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
qBr <- function(df, variable, rnd) {
if (missing(rnd)) {
as.character(quantile(round(df[[variable]], 0), c(.01, .2, .4, .6, .8), na.rm = TRUE))
} else if (rnd == FALSE) {
as.character(formatC(quantile(df[[variable]], c(.01, .2, .4, .6, .8), na.rm = TRUE), digits = 3))
}
}
q5 <- function(variable) {as.factor(ntile(variable ,5))}
palette5 <- c("#d01c8b","#f1b6da","#b8e186","#4dac26","#f7f7f7")
palette4 <- c("#d01c8b","#f1b6da","#b8e186","#4dac26")
palette2 <- c("#a1d76a","#e9a3c9")
constructionpalette <- c("#090a07", "#383120", "#c7b897", "#de9f00", "#fdcc00")
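The q5() and qBr() helpers are typically used together when mapping a continuous variable in quintiles. The commented sketch below is illustrative only and assumes the permit_net fishnet created later in this document.
# illustrative sketch: q5() bins a variable into quintiles, qBr() supplies the break labels
# ggplot() +
#   geom_sf(data = permit_net, aes(fill = q5(countPermit)), color = NA) +
#   scale_fill_manual(values = palette5,
#                     labels = qBr(permit_net, "countPermit"),
#                     name = "Permit count\n(quintile breaks)") +
#   mapTheme()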
We then load the construction permit data from Open Data Philly using an SQL query.
We also load the Philadelphia city boundary, block groups, and neighborhood boundaries. Below is a map of Philadelphia neighborhoods.
# permit data, PHL boundary, PHL block groups, PHL nhoods, base fishnet
permits <- st_read("https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+permits+WHERE+permitissuedate+>=+'2010-01-01'+AND+permitissuedate+<+'2019-01-01'AND+typeofwork+=+'NEW+CONSTRUCTION'&filename=permits&format=geojson&skipfields=cartodb_id") %>%
st_transform('EPSG:3857')
phlBound <- st_read("http://data.phl.opendata.arcgis.com/datasets/063f5f85ef17468ebfebc1d2498b7daf_0.geojson") %>%
st_transform('EPSG:3857')
phlblockgroups <- st_read("http://data.phl.opendata.arcgis.com/datasets/2f982bada233478ea0100528227febce_0.geojson") %>%
st_transform('EPSG:3857') %>%
rename(GEOID = GEOID10)
phlblockgroups <- transform(phlblockgroups, GEOID = as.double(GEOID))
typeof(phlblockgroups$GEOID)
neighborhoods <- st_read("https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson") %>%
st_transform('EPSG:3857')
ggplot() +
geom_sf(data = neighborhoods) +
labs(title = "PHL Neighborhood", subtitle = "Figure 1") +
mapTheme()
In order to make accurate predictions, we work at a high spatial resolution to create a hyperlocalized model. Below, we create a fishnet grid over the Philadelphia boundary loaded above, then sum the count of permits per grid cell and join the counts to the fishnet.
# Create a fishnet grid and sum up the counts of permits per gridcell and join it to the fishnet
fishnet <- st_make_grid(phlBound, cellsize = 500, square = TRUE) %>%
.[phlBound] %>% st_sf() %>%
mutate(uniqueID = rownames(.))
permit_net <- dplyr::select(permits) %>% mutate(countPermit = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countPermit = replace_na(countPermit, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24),
size = nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = permit_net, aes(fill = countPermit)) +
scale_fill_viridis(option = "cividis") +
labs(title = "Count of Permits for the fishnet", subtitle = "Figure 2")
We then examine the distribution of permit counts across Philadelphia in order to choose an appropriate regression type for our model. Gentrification tends to occur in a select few neighborhoods at any one time, so most grid cells have few or no permits and the distribution is heavily skewed toward zero.
hist(permit_net$countPermit)
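As a quick numeric companion to the histogram (a sketch using the permit_net object created above), we can report the share of grid cells with zero permits and compare the mean and variance of the counts; a heavy concentration at zero motivates a count model such as the Poisson regression used later.
# share of fishnet cells with no permits, plus mean and variance of the counts
mean(permit_net$countPermit == 0)
c(mean = mean(permit_net$countPermit), variance = var(permit_net$countPermit))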
In this section, we describe each feature and explain whether or not it was included in the model and why.
Data from Matthew Desmond's Eviction Lab is loaded for Philadelphia; the full dataset contains over 200,000 observations. We filtered the data to a few select years to decrease its size and the run time. Some selection bias may be present in this filtering process: if different years were selected, the results of the model might change slightly.
blockgroups <- read.csv("../input/blockgroups/block-groups.csv") %>%
filter(year == 2010 | year == 2014 | year == 2016)
blockgroups <- blockgroups %>% filter(parent.location == "Philadelphia County, Pennsylvania")
evictions.geom <- left_join(blockgroups, phlblockgroups, by = "GEOID") %>%
st_as_sf() %>% # generate interpolation to grid cell
st_transform('EPSG:3857') %>%
mutate(Legend = "Evictions")
There are a few metrics we can use to evaluate the relationship between evictions and newly filed construction permits. The first is the average number of evictions per grid cell.
# Average number of evictions per grid cell
evictions_net <- evictions.geom %>%
st_sf() %>%
mutate(evictionsTotal = ifelse(is.na(evictions.geom$evictions), 0, evictions.geom$evictions)) %>%
dplyr::select(evictionsTotal) %>%
aggregate(., fishnet, mean) %>%
mutate(Legend = "evictionsTotal")
# mutate the variable you're after for tracts
tractJawn <- evictions.geom %>%
st_sf() %>%
mutate(evictionsTotal = ifelse(is.na(evictions.geom$evictions), 0, evictions.geom$evictions))
# Interpolate from tracts to fishnet. Look up the extensive parameter
interpolateJawn <- dplyr::select(tractJawn, evictionsTotal) %>%
st_interpolate_aw(., fishnet, extensive = TRUE)
# For all vars net later
evictions_net.long <- gather(evictions_net, Variable, value, -geometry, -Legend) %>%
dplyr::select(-Legend)
evictions_net.long <- st_join(evictions_net.long, fishnet, join = st_within)
Another metric calculated is the median eviction rate per grid cell. The map below shows the calculation by block group and by grid cell.
There is a block group in Northern Philadelphia with a high median eviction rate but a nonexistent resident population, as it contains the Northeast Philadelphia Airport. This creates a misleading signal, so the metric is not included in the model.
# Median eviction rate per grid cell
evictions_net2 <-
evictions.geom %>%
st_sf() %>%
mutate(MedEvictionRate=ifelse(is.na(evictions.geom$eviction.rate),0,evictions.geom$eviction.rate))%>%
dplyr::select(MedEvictionRate) %>%
aggregate(., fishnet, median)%>%
mutate(Legend="MedEvictionRate")
# mutate the variable you're after for tracts
tractJawn <-
evictions.geom %>%
st_sf() %>%
mutate(MedEvictionRate=ifelse(is.na(evictions.geom$eviction.rate),0,evictions.geom$eviction.rate))
#interpolate from tracts to fishnet. Look up the `extensive` parameter.
interpolateJawn <-
dplyr::select(tractJawn, MedEvictionRate) %>%
st_interpolate_aw(., fishnet, extensive = TRUE)
#for all vars net later
evictions_net2.long<-
gather(evictions_net2, Variable, value, -geometry,-Legend)%>%
dplyr::select(-Legend)
evictions_net2.long<-
st_join(evictions_net2.long, fishnet, join=st_within)
#map
grid.arrange(
ggplot() + geom_sf(data=tractJawn, aes(fill = MedEvictionRate)) + ggtitle("Tracts"),
ggplot() + geom_sf(data=interpolateJawn, aes(fill = MedEvictionRate), colour=NA) + ggtitle("Grid cells"),
ncol=2)
Instead, we calculate the average number of eviction filings per grid cell. This is the metric included in our model and, later, in the Correlation Test section.
evictions_net3 <-
evictions.geom %>%
st_sf() %>%
mutate(AvgEvictionFiling=ifelse(is.na(evictions.geom$eviction.filings),0,evictions.geom$eviction.filings)) %>%
dplyr::select(AvgEvictionFiling) %>%
aggregate(., fishnet, mean)%>%
mutate(Legend="AvgEvictionFiling")
# mutate the variable you're after for tracts
tractJawn <-
evictions.geom %>%
st_sf() %>%
mutate(AvgEvictionFiling=ifelse(is.na(evictions.geom$eviction.filings),0,evictions.geom$eviction.filings))
# Interpolate from tracts to fishnet. Look up the `extensive` parameter
interpolateJawn <- dplyr::select(tractJawn, AvgEvictionFiling) %>%
st_interpolate_aw(., fishnet, extensive = TRUE)
# for all vars net later
evictions_net3.long <- gather(evictions_net3, Variable, value, -geometry, -Legend) %>%
dplyr::select(-Legend)
evictions_net3.long <- st_join(evictions_net3.long, fishnet, join = st_within)
# map
grid.arrange(
ggplot() +
geom_sf(data = tractJawn, aes(fill = AvgEvictionFiling)) +
ggtitle("Tracts"), ggplot() +
geom_sf(data = interpolateJawn, aes(fill = AvgEvictionFiling), color=NA)+
ggtitle("Grid cells"),
ncol=2
)
We also calculated the median filing rate per grid cell, but it has the same issue as the median eviction rate per grid cell, so it is not included in the model.
# Median filing rate per grid cell; it has the same issue as the median eviction rate per grid cell
# Not included in the model
evictions_net4 <-
evictions.geom %>%
st_sf() %>%
mutate(MedFilingRate=ifelse(is.na(evictions.geom$eviction.filing.rate),0,evictions.geom$eviction.filing.rate)) %>%
dplyr::select(MedFilingRate) %>%
aggregate(., fishnet, median)%>%
mutate(Legend="MedFilingRate")
# mutate the variable you're after for tracts
tractJawn <-
evictions.geom %>%
st_sf() %>%
mutate(MedFilingRate=ifelse(is.na(evictions.geom$eviction.filing.rate),0,evictions.geom$eviction.filing.rate))
#interpolate from tracts to fishnet. Look up the `extensive` parameter.
interpolateJawn <-
dplyr::select(tractJawn, MedFilingRate) %>%
st_interpolate_aw(., fishnet, extensive = TRUE)
#for all vars net later
evictions_net4.long<-
gather(evictions_net4, Variable, value, -geometry, -Legend)%>%
dplyr::select(-Legend)
evictions_net4.long<-
st_join(evictions_net4.long, fishnet, join=st_within)
We read in the tax delinquency data using an SQL query that filters to Philadelphia. Increasing property costs are often a signal of gentrification, and we expect areas with large amounts of property tax owed to be strongly associated with areas where new construction permits are filed.
We start with the total amount of taxes owed per grid cell, because this data is available at a high spatial (point) resolution.
# total amount of taxes owed per grid cell
# Tax delinquency data
tax <- st_read("https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+real_estate_tax_delinquencies+WHERE+mailing_city+=+'PHILADELPHIA'&filename=real_estate_tax_delinquencies&format=geojson&skipfields=cartodb_id") %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326, agr = "constant") %>%
st_transform('EPSG:3857') %>%
mutate(Legend = "Tax")
tax_net <- st_join(tax, fishnet, join = st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(totalOwed = mean(total_due)) %>%
full_join(fishnet, by='uniqueID') %>%
spread(Legend, totalOwed, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
tax_net.long <-
gather(tax_net, Variable, value, -geometry, -uniqueID)
ggplot() +
geom_sf(data = tax_net.long, aes(fill=value), color=NA) +
scale_fill_viridis() +
labs(title = "Average total amount due in taxes by Fishnet", subtitle = "Philadelphia - last assessment : 2020") +
mapTheme()
The next feature engineered is the total number of bankruptcy occurrences per grid cell. This feature turned out not to be helpful in informing our model, as nearly every grid cell had zero or very few bankruptcies, so it is not currently included in the model.
# total number of bankruptcy occurrences per grid cell
# tax2 <- tax %>%
# mutate(Legend = "taxBankruptcy")
# tax_net2 <- st_join(tax2, fishnet, join = st_within) %>%
# st_drop_geometry() %>%
# group_by(uniqueID, Legend) %>%
# summarize(bankrupt = sum(bankruptcy == 'true')) %>%
# full_join(fishnet, by='uniqueID') %>%
# spread(Legend, bankrupt, fill=0) %>%
# st_sf() %>%
# dplyr::select(-`<NA>`) %>%
# na.omit() %>%
# ungroup()
# tax_net2.long <- gather(tax_net2, Variable, value, -geometry, -uniqueID)
# ggplot() +
# geom_sf(data = tax_net2.long, aes(fill=value), color=NA) +
# scale_fill_viridis() +
# labs(title = "Sum where Bankrupt is True by Fishnet", subtitle = "Philadelphia - last assessment : 2020") +
# mapTheme()
The final feature from our tax data examines whether the amount due is actionable or non-actionable. Actionable accounts are those in a payment agreement, in bankruptcy, or overdue but not yet delinquent.
# accounts considered to be "actionable" per grid cell
tax3 <- tax %>%
mutate(Legend = "taxActionable")
tax_net3 <- st_join(tax3, fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(actionable = sum(is_actionable == 'true')) %>%
full_join(fishnet, by="uniqueID") %>%
spread(Legend, actionable, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
tax_net3.long <-
gather(tax_net3, Variable, value, -geometry, -uniqueID)
ggplot() +
geom_sf(data = tax_net3.long, aes(fill=value), color=NA) +
scale_fill_viridis() +
labs(title = "Sum where is actionable is True by Fishnet", subtitle = "Philadelphia - last assessment : 2020") +
mapTheme()
Starbucks cafes are associated with roughly a 0.5% increase in nearby housing prices according to a Harvard study, and an increase in amenities like grocery stores, cafes, and bars is often associated with gentrification. (https://www.cnbc.com/2018/09/04/starbucks-openings-predcit-higher-home-prices-says-harvard-study.html)
Below we read in the Starbucks locations and calculate the total number per grid cell.
# starbucks locations
#Starbucks data <- kaggle datasets download -d starbucks/store-locations
starbucks <-read.csv("../input/store-locations/directory.csv")
starbucks <- filter(starbucks, City=="Philadelphia") %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
st_transform('EPSG:3857') %>%
mutate(Legend = "Starbucks")
starbucks_net <- st_join(starbucks, fishnet, join = st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
# ggplot() +
# geom_sf(data = starbucks_net, aes(fill=Starbucks), color=NA) +
# scale_fill_viridis() +
# labs(title = "Starbucks by Fishnet", subtitle = "Philadelphia 2000-2016") +
# mapTheme()
We apply the same process to the grocery store data.
# Grocery Store Access data
grocerystore <- st_read("http://data-phl.opendata.arcgis.com/datasets/53b8a1c653a74c92b2de23a5d7bf04a0_0.geojson") %>%
st_transform('EPSG:3857') %>%
rename(GEOID=GEOID10)
grocerystore1 <- filter(grocerystore, HPSS_ACCESS == 'Moderate or High Access') %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs=4326, agr = "constant") %>%
st_transform('EPSG:3857') %>%
mutate(Legend = "Grocery")
grocery_net <- st_join(grocerystore1, fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>% # we can also do vehicle availability
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
# ggplot() +
# geom_sf(data = grocery_net, aes(fill=Grocery), color=NA) +
# scale_color_manual(values = c("constructionpalette")) +
# labs(title = 'High Access Groceries by Fishnet', subtitle = "Philadelphia 2018")+
# mapTheme()
Another characteristic of a hipster neighborhood is the presence of nature and greenery. We read in tree canopy data.
# number of trees per grid cell
treeinventory <- st_read("../input/treecanopy/PPR_Tree_Canopy_Points_2015.geojson") %>%
st_transform('EPSG:3857') %>%
mutate(Legend = "Tree")
# Spatial join of points to fishnets
treeinventory_net <- st_join(treeinventory, fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
treeinventory_net$Tree = as.numeric(treeinventory_net$Tree)
treeinventory_net <- treeinventory_net %>%
mutate(Legend = "Tree")
# ggplot() +
# geom_sf(data = treeinventory_net, aes(fill=Tree)) +
# scale_alpha_manual(values="Tree")
# labs(title = "Tree Inventory by Fishnet", subtitle = "Philadelphia 2018") +
# mapTheme()
We perform some data wrangling so that we can plot the fishnets created above side by side.
# All Risk Factors by Fishnet
starbucks_short<-starbucks%>%
dplyr::select(geometry,Legend)
grocery_short<-grocerystore1%>%
dplyr::select(geometry,Legend)
vars_net <-
rbind(starbucks_short, grocery_short) %>%
st_join(., fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
vars_net.long <-
gather(vars_net, Variable, value, -geometry, -uniqueID)
tree_long <-
gather(treeinventory_net, Variable, value, -geometry, -uniqueID,-Legend)%>%
dplyr::select(-Legend)
allvars.long <-
rbind(vars_net.long,tax_net.long, tax_net3.long,evictions_net.long,evictions_net3.long,tree_long)
vars <- unique(allvars.long$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(allvars.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="", option="cividis") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol=3, top="All Risk Factors by Fishnet"))
We can see that the Grocery, Starbucks, and Tax features may not inform our model as strongly as the other features.
Below we define a nearest neighbor function that calculates, for each grid cell, the average distance from the cell centroid to its k nearest risk factor points (Starbucks, grocery stores, evictions, and trees).
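In other words, for grid cell $i$ with centroid $c_i$, the engineered feature is the mean distance to the $k$ nearest points of a given risk factor:

$$d_i^{(k)} = \frac{1}{k}\sum_{j=1}^{k} \lVert c_i - p_{(j)} \rVert$$

where $p_{(1)}, \dots, p_{(k)}$ are the $k$ risk factor points closest to $c_i$.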
# Join the engineered features into a single fishnet (vars_net2) for the correlation checks
evictions_net <-
evictions_net %>%
st_join(., fishnet, join=st_within)
vars_net2<-
vars_net%>%
st_drop_geometry()%>%
full_join(evictions_net, by = "uniqueID") %>%
st_sf()%>%
dplyr::select(-Legend)
evictions_net3 <-
evictions_net3 %>%
st_join(., fishnet, join=st_within)
vars_net2 <-
vars_net2%>%
st_drop_geometry()%>%
full_join(evictions_net3, by = "uniqueID") %>%
st_sf()%>%
dplyr::select(-Legend)
vars_net2<-
vars_net2%>%
st_drop_geometry()%>%
full_join(tax_net, by = "uniqueID") %>%
st_sf()
vars_net2<-
vars_net2%>%
st_drop_geometry()%>%
full_join(tax_net3, by = "uniqueID") %>%
st_sf()
vars_net2 <-
vars_net2%>%
st_drop_geometry()%>%
full_join(treeinventory_net, by = "uniqueID") %>%
st_sf()%>%
dplyr::select(-Legend)
nn_function <- function(measureFrom, measureTo, k) {
# convert both inputs to coordinate matrices for get.knnx()
measureFrom_Matrix <- as.matrix(measureFrom)
measureTo_Matrix <- as.matrix(measureTo)
nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
# average the distances to the k nearest points for each origin point
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
st_c <- st_coordinates
st_coid <- st_centroid
#nn function requires both input layers to be points
vars_net2 <-
vars_net2 %>%
mutate(
Starbucks.nn = nn_function(st_c(st_coid(vars_net2)), st_c(starbucks),1),
Grocery.nn = nn_function(st_c(st_coid(vars_net2)), st_c(st_coid(grocerystore1)),1),
Evictions.nn = nn_function(st_c(st_coid(vars_net2)), st_c(st_coid(evictions.geom)),3),
# Tax.nn= nn_function(st_c(st_coid(vars_net2)), st_c(tax),3),
Tree.nn= nn_function(st_c(st_coid(vars_net2)),st_c(treeinventory),3))
vars_net.long.nn <-
dplyr::select(vars_net2, ends_with(".nn")) %>%
gather(Variable, value, -geometry)
vars2 <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars2){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="", option="cividis") +
labs(title=i) +
mapTheme()}
#NOTE: for tax.nn and eviction.nn fishnets now show counts instead of average evictions or total due
do.call(grid.arrange,c(mapList, ncol = 2, top = "Nearest Neighbor Risk Factors by Fishnet"))
The nearest neighbor versions of the Starbucks and grocery store features are preferable to the raw counts because they carry more information per grid cell: most cells contain no Starbucks or grocery store, but every cell has a distance to the nearest one.
Our last predictor is distance to the city center. We can observe from the count of permits by fishnet that there is a large concentration of new construction permits around the city center.
It is likely that rising values in one neighborhood increase the value of surrounding neighborhoods. As new construction permits are filed in the city center, an increasing amount of investment is likely to occur in the surrounding neighborhoods.
# Distance to city center
cityhall <-
filter(neighborhoods, name == "CENTER_CITY") %>%
st_centroid()
vars_net2$CCDistance = st_distance(st_centroid(vars_net2), cityhall) %>%
as.numeric()
ggplot() +
geom_sf(data = vars_net2, aes(fill = CCDistance), color=NA) +
scale_fill_viridis(option="cividis") +
labs(title = "Distance to Center City", subtitle = "Figure 3") +
mapTheme()
We perform a final join so all of our features are in the same dataframe as our dependent variable.
# final join for all of our features
final_net <- left_join(permit_net, st_drop_geometry(vars_net2), by="uniqueID")
final_net <- st_centroid(final_net) %>%
st_join(dplyr::select(neighborhoods, name)) %>%
st_drop_geometry() %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
We use the final net created above to calculate a statistic known as Local Moran's I. The null hypothesis is that the permit count at a given location is randomly distributed relative to its neighbors; a high Local Moran's I indicates spatial clustering.
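For reference, the local statistic computed by spdep's localmoran() for grid cell $i$ takes the form

$$I_i = \frac{x_i - \bar{x}}{S_i^2}\sum_{j \neq i} w_{ij}\,(x_j - \bar{x}),$$

where $x_i$ is the permit count in cell $i$, $\bar{x}$ is the mean count, $w_{ij}$ are the queen contiguity weights defined below, and $S_i^2$ is a variance-like scaling term. Large positive values of $I_i$ flag cells whose counts resemble those of their neighbors, i.e. clusters of similarly high (or low) counts.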
# creating queen contiguity neighbors and weights for Moran's I
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
#Running Local Moran's I
final_net.localMorans <-
cbind(
as.data.frame(localmoran(final_net$countPermit, final_net.weights)),
as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(Permit_Count = countPermit,
Local_Morans_I = Ii,
P_Value = `Pr(z != E(Ii))`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.005, 1, 0)) %>% #change pvalue?
gather(Variable, Value, -geometry)
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i),
aes(fill = Value), colour=NA) +
scale_fill_viridis(name="", option="cividis") +
labs(title=i) +
mapTheme() + theme(legend.position="bottom")}
do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Construction Permits"))
There is significant clustering in Center City and the surrounding neighborhoods. This will matter when we assess the generalizability of our model: a good model should be able to predict in the hotspots as well as in the coldspots.
We then calculate the distance to the identified hotspots below.
# distance to significant permit hotspots
final_net <-
final_net %>%
mutate(Permit.isSig = ifelse(localmoran(final_net$countPermit, final_net.weights)[,5] <= 0.00001, 1, 0)) %>%
mutate(Permit.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
st_coordinates(st_centroid(filter(final_net, Permit.isSig == 1))), 1))
ggplot() +
geom_sf(data = final_net, aes(fill = Permit.isSig.dist), color=NA) +
scale_fill_viridis(name="NN Distance", option = "cividis") +
labs(title = "Distance to highly significant Permit hotspots", subtitle = "Figure 4") +
mapTheme()
We conduct a series of correlation tests to see whether our risk factors are highly associated with the count of newly filed construction permits.
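For reference, each panel in Figure 5 reports the Pearson correlation coefficient between a risk factor $x$ and the permit count $y$ across grid cells:

$$r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$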
# see if our risk factors are highly associated to the count of newly filed construction permits
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -cvID, -CCDistance, -name) %>%
gather(Variable, Value, -countPermit)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countPermit, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countPermit)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))), x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, color ="black") +
facet_wrap(~Variable, ncol = 4, scales = "free") +
labs(title = "Permit count as a function of risk factors", subtitle = "Figure 5") +
plotTheme()
Permit.isSig and Permit.isSig.dist are highly correlated with the permit count, which supports the statement that increasing values in one neighborhood affect the surrounding neighborhoods. The nearest neighbor features for Starbucks (Starbucks.nn) and evictions (Evictions.nn) are more strongly correlated with permit counts than the corresponding counts per grid cell. Tree canopy is not highly correlated.
We build our model using a Poisson regression. We run the regression twice: once with our highly correlated risk factors, and once adding the spatial process of our dependent variable.
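For reference, the Poisson specification models the natural log of the expected permit count in each grid cell as a linear function of the predictors, so fitted values returned with type = "response" are always non-negative counts:

$$\log\big(E[\text{countPermit}_i]\big) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi}$$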
# Poisson Regression because there are a lot of 0s in the distribution of DV
reg.vars <- c("Starbucks.nn", "Grocery.nn", "evictionsTotal", "AvgEvictionFiling", "taxActionable", "Tax", "CCDistance")
# Adding in the spatial component of the dependent variable
reg.ss.vars <- c("Starbucks.nn", "Grocery.nn", "evictionsTotal", "AvgEvictionFiling", "taxActionable", "Tax", "CCDistance", "Permit.isSig.dist")
We define the crossValidate function that will be used in both methods; for each fold, it splits the data into training and testing sets.
The first method is known as k-fold cross-validation, in which one random fold of data is left out of the training set each time until all folds have been tested. The second method is known as leave-one-group-out, or LOGO, cross-validation, in which one designated group of data is left out of the training set each time until all groups have been tested on.
We expect LOGO-CV to provide a more meaningful test of generalizability, since each hold-out group corresponds to an entire neighborhood.
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
regression <-
glm(countPermit ~ ., family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
reg.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countPermit",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, countPermit, Prediction, geometry)
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countPermit",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countPermit, Prediction, geometry)
reg.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countPermit",
indVariables = reg.vars) %>%
dplyr::select(cvID = name, countPermit, Prediction, geometry)
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countPermit",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countPermit, Prediction, geometry)
We compare the resulting predictions to the observed permit counts to calculate the errors.
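For reference, the error for a grid cell and the mean absolute error over a set of cells are defined as

$$\text{Error}_i = \hat{y}_i - y_i, \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert \hat{y}_i - y_i\rvert,$$

where $y_i$ is the observed permit count and $\hat{y}_i$ the prediction.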
reg.summary <-
rbind(
mutate(reg.cv, Error = Prediction - countPermit, Regression = "Random k-fold CV: Just Risk Factors"),
mutate(reg.ss.cv, Error = Prediction - countPermit, Regression ="Random k-fold CV: Spatial Process"),
mutate(reg.spatialCV, Error = Prediction - countPermit, Regression = "Spatial LOGO-CV: Just Risk Factors"),
mutate(reg.ss.spatialCV, Error = Prediction - countPermit, Regression = "Spatial LOGO-CV Spatial Process")) %>%
st_sf()
We then plot the mean absolute errors from each cross-validation.
# plot mean absolute errors from each cross-validation
error_by_reg_and_fold <-
reg.summary %>%
group_by(Regression, cvID) %>%
summarize(Mean_Error = mean(Prediction - countPermit, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T)) %>%
ungroup()
error_by_reg_and_fold %>%
ggplot(aes(MAE)) +
geom_histogram(bins = 30, color = "black", fill = '#de9f00') +
facet_wrap(~Regression) +
geom_vline(xintercept = 0) +
scale_x_continuous(breaks = seq(0, 8, by =1)) +
labs(title = "Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV", caption = "Figure 6", x="Mean Absolute Error", y="Count") +
theme(panel.background = element_rect(fill = "#090a07"))
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2)) %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "white", background = "#383120") %>%
row_spec(2, color = "black", background = "#de9f00") %>%
row_spec(3, color = "white", background = "#383120") %>%
row_spec(4, color = "black", background = "#de9f00")
| Regression | Mean_MAE | SD_MAE |
|---|---|---|
| Random k-fold CV: Just Risk Factors | 0.61 | 0.64 |
| Random k-fold CV: Spatial Process | 0.62 | 0.63 |
| Spatial LOGO-CV Spatial Process | 1.92 | 3.10 |
| Spatial LOGO-CV: Just Risk Factors | 1.90 | 3.06 |
The results indicate that the regressions with and without the spatial process features perform nearly identically, while the k-fold regressions have substantially lower errors than the LOGO-CV regressions.
To further analyze, we map the errors.
#Run with K fold
error_by_reg_and_fold %>%
filter(str_detect(Regression, "k-fold")) %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Regression) +
scale_fill_viridis(option="cividis") +
labs(title = "Permit errors by k-fold cv Regression",
subtitle = "Figure 7") +
mapTheme() + theme(legend.position="bottom")
#Run with LOGO-CV
error_by_reg_and_fold %>%
filter(str_detect(Regression, "LOGO")) %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Regression) +
scale_fill_viridis(option="cividis") +
labs(title = "Permit errors by LOGO-CV Regression",
subtitle = "Figure 8") +
mapTheme() + theme(legend.position="bottom")
We calculate a global Moran's I statistic on the model errors to examine how the model performs in hotspots versus coldspots.
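For reference, the global statistic returned by moran.mc() for the absolute errors $e_i$ is

$$I = \frac{n}{\sum_i\sum_j w_{ij}} \cdot \frac{\sum_i\sum_j w_{ij}\,(e_i-\bar{e})(e_j-\bar{e})}{\sum_i (e_i-\bar{e})^2},$$

with significance assessed by comparing the observed value to 999 random permutations of the errors across neighborhoods.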
neighborhood.weights <-
filter(error_by_reg_and_fold, Regression == "Spatial LOGO-CV Spatial Process") %>%
group_by(cvID) %>%
poly2nb(as_Spatial(.), queen=TRUE) %>%
nb2listw(., style="W", zero.policy=TRUE)
filter(error_by_reg_and_fold, str_detect(Regression, "LOGO")) %>%
st_drop_geometry() %>%
group_by(Regression) %>%
summarize(Morans_I = moran.mc(abs(Mean_Error), neighborhood.weights,
nsim = 999, zero.policy = TRUE,
na.action=na.omit)[[1]],
p_value = moran.mc(abs(Mean_Error), neighborhood.weights,
nsim = 999, zero.policy = TRUE,
na.action=na.omit)[[3]]) %>%
kbl(caption = "Moran's I and p-value of Spatial LOGO-CV") %>%
kable_styling() %>%
row_spec(1, color = "white", background = "#383120") %>%
row_spec(2, color = "black", background = "#de9f00")
| Regression | Morans_I | p_value |
|---|---|---|
| Spatial LOGO-CV Spatial Process | 0.1239204 | 0.012 |
| Spatial LOGO-CV: Just Risk Factors | 0.1200030 | 0.011 |
The small but statistically significant Moran's I values indicate that the errors are somewhat spatially clustered, and slightly more so when our spatial features are included. This suggests that our model may be over-compensating for hotspots; a next step to improve it would be to remove some of the nearest neighbor features.
We plot the predictions and observations below.
# plot predictions and observations
st_drop_geometry(reg.summary) %>%
group_by(Regression) %>%
mutate(Permit_Decile = ntile(countPermit, 10)) %>%
group_by(Regression, Permit_Decile) %>%
summarize(meanObserved = mean(countPermit, na.rm = T),
meanPrediction = mean(Prediction, na.rm = T)) %>%
gather(Variable, Value, -Regression, -Permit_Decile) %>%
ggplot(aes(Permit_Decile, Value, shape = Variable)) +
geom_point(size = 2) + geom_path(aes(group = Permit_Decile), colour = "black") +
scale_shape_manual(values = c(2, 17)) +
facet_wrap(~Regression) + xlim(0,10) +
labs(title = "Predicted and observed Permit by observed Permit decile",
subtitle = "Figure 9")
The results show that our model tends to under-predict in neighborhoods with high concentrations of construction permits and to over-predict in neighborhoods with low concentrations.
We create maps to analyze the accuracy and generalizability of our model across time. Below is a kernel density map of the permits used to build the model (2010 through 2018).
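As a reminder, the kernel density estimate produced by spatstat's density.ppp() with a bandwidth of 1,000 (in the projection's units) smooths the observed permit points $s_1, \dots, s_n$ into a continuous intensity surface, roughly

$$\hat{\lambda}(s) = \sum_{i=1}^{n} K_\sigma\!\big(\lVert s - s_i \rVert\big),$$

where $K_\sigma$ is an isotropic Gaussian kernel with standard deviation $\sigma$ equal to the bandwidth (edge corrections aside).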
permit_ppp <- as.ppp(st_coordinates(permits), W = st_bbox(final_net))
permit_KD <- density.ppp(permit_ppp, 1000)
as.data.frame(permit_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
ggplot() +
geom_sf(aes(fill=value)) +
geom_sf(data = sample_n(permits, 1500), size = .5) +
scale_fill_viridis(name = "Density", option="cividis") +
labs(title = "Kernel density of 2010-2018 permits",
subtitle = "Figure 10") +
mapTheme()
Load the data for new construction permits in 2019
permits19 <- st_read("https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+permits+WHERE+permitissuedate+>=+'2019-01-01'+AND+permitissuedate+<+'2020-01-01'AND+typeofwork+=+'NEW+CONSTRUCTION'&filename=permits&format=geojson&skipfields=cartodb_id")%>%
st_transform('EPSG:3857')%>%
distinct() %>%
.[fishnet,]
We repeat the kernel density calculation, classify the density values into five risk categories, and join the count of 2019 permits to each grid cell.
permit_ppp <- as.ppp(st_coordinates(permits), W = st_bbox(final_net))
permit_KD <- density.ppp(permit_ppp, 1000)
permit_KDE_sf <- as.data.frame(permit_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(permits19) %>% mutate(permitCount = 1), ., sum) %>%
mutate(permitCount = replace_na(permitCount, 0))) %>%
dplyr::select(label, Risk_Category, permitCount)
We use the predictions from LOGO-CV with the spatial features because this is the model we expected to perform the best.
permit_risk_sf <-
filter(reg.summary, Regression == "Spatial LOGO-CV Spatial Process") %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(permits19) %>% mutate(permitCount = 1), ., sum) %>%
mutate(permitCount = replace_na(permitCount, 0))) %>%
dplyr::select(label,Risk_Category, permitCount)
rbind(permit_KDE_sf, permit_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(permits19, 3000), size = .5, colour = "black") +
facet_wrap(~label, ) +
scale_fill_viridis(discrete = TRUE, option="cividis") +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="2018 permit risk predictions; 2019 permits",
caption = "Figure 11") +
mapTheme()
Figure 11 shows that the risk predictions produce a more localized map than the kernel density estimate. The importance of this localization cannot be overstated: risk can vary from block to block, and the resources needed in each place follow the prediction. For allocating resources at the block or household level, the risk prediction model is therefore more appropriate than the kernel density map.
rbind(permit_KDE_sf, permit_risk_sf) %>%
st_set_geometry(NULL) %>% na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countPermits = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_permits = countPermits / sum(countPermits)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_permits)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_manual ("legend", values = c("Kernel Density" = "#383120", "Risk Predictions" = "#de9f00")) +
labs(title = "Risk prediction vs. Kernel density, 2019 permits",
subtitle = "Figure 12") +
plotTheme() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
The bar chart above displays the accuracy of our predictions. In the riskiest category (90% to 100%), kernel density captures a larger share of the 2019 permits, while in the second riskiest category the risk prediction model does better; in the remaining categories the bars are roughly even. In future iterations, we would like to see the risk predictions do substantially better in order to prove their effectiveness over kernel density.