Why I called the Inquirer’s award-winning piece plagiarism

Last month, I saw a tweet from an Inquirer reporter celebrating the company’s win of the Newhouse School’s Toner Prize for Excellence in Local Reporting. I clicked through and saw the pieces for which it won. In a moment of frustration, I tweeted.

Enough people have asked me what this was about that I thought I’d write up a summary.

Clustering Philadelphia’s Elections

Back in 2019, I noticed that all of my maps of Philadelphia’s elections looked the same. Whether a map of turnout or the candidate that residents voted for, certain sections of the city moved together in each election. I ran a clustering algorithm on results from the prior eight Democratic Primaries, set to identify four clusters that I called Philadelphia’s “Voting Blocs,” and noted that these clusters were largely structured by race and class. I renamed them here, and used them to understand election results here, here, here, here, here, here, here, here, here, here, here, here, and here. In 2020, I measured how those clusters changed over time.

In February 2023, the Inquirer published an article running a clustering algorithm on the prior eight Democratic Primaries, set to identify six clusters. They noted that the clusters were largely structured by race and class. They later measured how those clusters changed over time.

The Inquirer’s work was well done, and their write-up significantly better than mine. The interactivity was neat. But the analytic strategy was the same, on the same data with the same takeaways. I reached out to the Inquirer asking that they cite my work in the article. I expected a short sentence saying “Jonathan Tannen has performed a similar analysis” buried at the bottom. Instead, they just said “no”.

The reasons for not citing my work that I received include, with my annotations in brackets:

  • They wrote all of the code themselves and used open data. [This is irrelevant to citation.]
  • The person who primarily did the analysis was not aware of my work when they began it. [I cannot disprove this. But the other coauthor admitted they knew about my work. And at least one person they interviewed told them to compare to my work. And they certainly knew about it when they said no to my request for citation.]
  • I shouldn’t worry about being cited, because they would quote me in future articles. [No comment.]
  • The work is different enough that citation wasn’t necessary.

Is the work different?

That last point is the only that would matter, if it were true. I admit that “different enough” can be a fuzzy distinction, and journalists (or bloggers) don’t follow the same citation practices as academics. Unfortunately, their analysis in no way passes even the most lenient definition of “different”.

Both pieces…

  • used K-Means clustering
  • on recent Democratic primaries in Philadelphia
  • that was crosswalked to present boundaries
  • to identify clusters that were largely structured by race and class.

Here are the clusters they produced, and then mine.

You will notice, if you can see past color choice, that they are basically the same. The Inquirer divided my blue cluster into their yellow and dark green clusters, and divided my red cluster into their salmon and light green clusters. Maybe their pink cluster extends a few blocks north past my green one. But that’s it.

It’s hard to be sure of the differences under the hood, because unlike my blog the Inquirer has not published their code. But as far as I can tell, the only differences between the pieces are that (1) they ran it on 2013-2020 instead of 2012-2019 and (2) they set num_clusters=6 instead of num_clusters=4. Are those changes substantial? On their github page, the authors acknowledge “it’s worth noting that three-, four-, and five-cluster maps yielded similar results, and sussed out race and class boundaries in similar ways.”

In closing

I don’t know the inner workings of the Inquirer, so am hesitant to point fingers at the reporters of the piece. The authors did cite my work in the github branch that has five bookmarks. Of course, there’s a reason that I want my work to be cited in the article, and it’s the same reason that the Inquirer does not.

It’s brutal to watch the company celebrate an award for work that it plagiarized from mine. I hope that if the Inquirer continues to pursue “original” research, they will practice the bare minimum of citation requirements.

Do Open Wards increase turnout?

I’m a committeeperson in the 24th Ward, which serves West Philly’s Mantua, Parkside, and Powelton neighborhoods. In May 2022, we voted to become an Open Ward. What does that mean? There’s no single definition, and each Open Ward establishes its own bylaws. But at a high level, being “Open” means that Committeepeople vote on which candidates to endorse and on other Ward operations, rather than simply following the Ward Leader. In addition to being more democratic, we also hope that by empowering committeepeople in this way, we can increase participation and Ward capacity for Get Out the Vote and other efforts.

Many of these benefits are fuzzy, but an increase in voter participation should be measurable. So, do we have any evidence that this works? Do Open Wards increase turnout? A number of wards have been open for years (some even for decades). Today, I’m doing some statistical analyses to measure if that’s true.

First, what wards are “Open”? As I said, there’s no single definition. But for this analysis, I’m using the set of Wards that have bylaws allowing committeepeople to vote on endorsements. Those I’m considering Open are:

Ward Year Opened
1 2018
2 2018
5 60s or 70s
8 60s or 70s
9 60s or 70s
15 2022
18 2018
24 2022
27 60s or 70s
30 1998
39a 2022

(If you want to contest these definitions or years, my email is open.)

View code
library(sf)
library(dplyr)
library(tidyr)
library(ggplot2)
source("../../admin_scripts/theme_sixtysix.R")
source("../../admin_scripts/util.R")
asnum <- function(x) as.numeric(as.character(x))

divs <- st_read("../../data/gis/warddivs/202011/Political_Divisions.shp")ward_from_div <- function(warddiv){
  ward <- substr(warddiv, 1, 2)
  div <- substr(warddiv, 4,5) 
  ward <- case_when(
    ward == "39" & asnum(div) >= 25 ~ "39a",
    ward == "39" & asnum(div) < 25 ~ "39b",
    TRUE ~ ward
  )
  return(ward)
}

divs <- divs |>
  mutate(
    warddiv = pretty_div(DIVISION_N),
    ward = ward_from_div(warddiv)
  )

wards <- divs |> group_by(ward) |> summarise() |> st_make_valid()

open_wards <- tribble(
  ~ward, ~year_opened,
  "01", "2018",
  "02", "2018",
  "05", "60s or 70s",
  "08", "60s or 70s",
  "09", "60s or 70s",
  "15", "2022",
  "18", "2018",
  "24", "2022",
  "27", "60s or 70s",
  "30", "1998",
  "39a", "2022"
)

wards <- wards |> mutate(
  is_open = ward %in% open_wards$ward,
  is_open_pre_22 = ward %in% (
    open_wards |> filter(year_opened != "2022") |> with(ward)
  )
)

ggplot(wards) + 
  geom_sf(aes(fill = is_open)) +
  scale_fill_manual(
    values=c(`TRUE` = light_blue, `FALSE` = light_grey),
    guide=FALSE
  ) +
  geom_text(
    data = wards |> filter(is_open) |> 
      mutate(
        x = st_coordinates(st_centroid(geometry))[,"X"],
        y = st_coordinates(st_centroid(geometry))[,"Y"]
      ),
    aes(
      x = x,
      y = y,
      label = ward
    ), 
    size = 3
) +
  theme_map_sixtysix() +
  labs(
    title="Philadelphia's Open Wards"
  )

The first thing you’ll notice is that these wards cluster around Center City. As such, they tend to be wealthier and whiter than the city as a whole; the same demographics that have seen a generalized surge in turnout since 2016. That makes the causality here difficult to measure: does the openness of a ward increase turnout, or did these wards become open thanks to local political engagement, and turnout there would have increased regardless?

There’s another complication in answering the question: how should we define turnout? I don’t love the usual measure, votes cast divided by registered voters; that number will be alarmingly low because of inactive voters still on the rolls, and the denominator moves for all sorts of reasons, including slow cleaning of the voter rolls or registration drives (I’ve called this the Turnout Funnel). And wards might have very different populations with different propensities to vote, or even be eligible to vote, which we want to control for.

I’ll use my favorite metric: turnout in an off-year election as a percentage of Presidential turnout. Off-year turnout is more malleable to local GOTV efforts: there’s more need just to remind people when Election Day is, and people get less communication about the election from other channels. A division’s Presidential turnout gives a ceiling on plausible off-year turnout. In this case, I’ll measure turnout in November 2022 divided by turnout in November 2020.

View code
df_major <- readRDS("../../data/processed_data/df_major_20230116.Rds")

turnout <- df_major |> filter(is_topline_office, ward != "99") |>
  group_by(warddiv, year, election_type) |>
  summarise(votes=sum(votes))

turnout <- turnout |> left_join(
  divs |> as.data.frame() |> select(warddiv, ward),
  by="warddiv"
) |>
  left_join(
    wards |> as.data.frame() |> select(ward, is_open, is_open_pre_22),
    by="ward"
  )

div_turnout <- turnout |>
  pivot_wider(
    names_from=c("election_type", "year"),
    values_from = "votes"
  )

ward_turnout <- turnout |> 
  group_by(ward, year, election_type, is_open, is_open_pre_22) |>
  summarise(votes=sum(votes)) |>
  pivot_wider(names_from=c("election_type", "year"), values_from="votes")

ward_turnout |>
  mutate(row = case_when(is_open ~ ward, TRUE ~ "Closed Wards")) |>
  group_by(is_open) |>
  summarise(votes_20 = sum(general_2020), votes_22 = sum(general_2022, na.rm=T)) |>
  mutate(ratio = votes_22 / votes_20)
View code
ggplot(wards |> left_join(ward_turnout, by="ward", suffix=c("", "_w"))) + 
  geom_sf(aes(fill = general_2022 / general_2020)) +
  geom_sf(aes(color=is_open), fill = NA, lwd=1) +
  scale_color_manual(guide=FALSE, values=c(`TRUE` = "white", `FALSE` = alpha("white", 0))) +
  scale_fill_viridis_c() +
  theme_map_sixtysix() %+replace%
  theme(legend.position = "right") +
  labs(
    title="Turnout in 2022 vs 2020",
    subtitle="Open Wards are outlined.",
    fill = "Votes cast \n2022 / 2020"
  )

Overall, the Open Wards (including those that only became Open later) cast 148,000 votes for President in 2020, and then 118,000 for Governor in 2022, an 80% rate. The other wards cast 594,000 and 380,000, a 64% rate. That’s a 16 percentage point difference!

But it might not be causal. We need to control for the ways Open Wards are systematically difference from Closed ones. We could do this by adding a bunch of control variables to a regression, but I prefer a stronger approach: using spatial boundary discontinuities. We can look at neighboring divisions on opposite sides of a ward boundary, one in a closed ward and one in an Open Ward) and compare the differences. By using only neighboring divisions, we create a useful apples-to-apples comparison; we assume that any systematic differences between these divisions is the effect of Open Wards. (I did a similar analysis in 2019, measuring the power of ward endorsements.)

For this analysis, I’ll limit to wards that were only Open before 2022 (sadly excluding my own 24). I filter down so each division is only in a single pair, and control for the proportions Black and Hispanic in the divisions (the pairing of divisions across boundaries should mean these proportions are not that different, but this will control for any lingering differences that do exist).

View code
block_data <- readr::read_csv(
  "../../data/census/decennial_2020_poprace_phila_blocks/DECENNIALPL2020.P4_data_with_overlays_2021-09-22T083940.csv",
  skip=1
)
block_shp <- sf::st_read(
  "../../data/gis/census/tl_2020_42101_tabblock/tl_2020_42_tabblock20.shp"
)block_shp <- st_transform(block_shp, st_crs(wards))
block_cents <- block_shp |> st_centroid()

st_contains_df <- function(shp1, shp2, id1, id2){
  raw <- st_contains(shp1, shp2)
  res <- list()
  for(i in 1:length(id1)){
    res[[i]] <- data.frame(
      id1 = id1[i],
      idx2 = raw[[i]],
      id2 = id2[raw[[i]]]
    )
  }
  return(bind_rows(res))
}

blocks_to_wards <- st_contains_df(wards, block_cents, wards$ward, block_cents$GEOID20) |>
  rename(ward = id1, block_i = idx2, geoid = id2)

res <- list()
for(ward in unique(blocks_to_wards$ward)){
  blocks_w <- block_cents[
    blocks_to_wards |> filter(ward == !!ward) |> with(block_i),
  ]
  divs_w <- divs |> filter(ward == !!ward)
  blocks_to_divs <- st_contains_df(divs_w, blocks_w, divs_w$warddiv, blocks_w$GEOID20) |>
    select(-idx2) |>
    rename(warddiv = id1,  geoid = id2)
  res[[ward]] <- blocks_to_divs
}
blocks_to_divs <- bind_rows(res)

div_pops <- blocks_to_divs |>
  inner_join(block_data |> mutate(geoid = substr(id, 10,24)), by="geoid") |>
  group_by(warddiv) |>
  summarise(
    pop = sum(`!!Total:`),
    hisp = sum(`!!Total:!!Hispanic or Latino`),
    nhw =  sum(`!!Total:!!Not Hispanic or Latino:!!Population of one race:!!White alone`),                                                                            
	nhb = sum(`!!Total:!!Not Hispanic or Latino:!!Population of one race:!!Black or African American alone`),  
    nha = sum(`!!Total:!!Not Hispanic or Latino:!!Population of one race:!!Asian alone`)
  )
neighboring_wards <- st_touches(wards)
neighboring_wards <- neighboring_wards |>
  lapply(as.data.frame) |>
  bind_rows(.id = "id0") |>
  mutate(id0 = asnum(id0)) |>
  rename(id1 = `X[[i]]`) |>
  filter(id0 < id1) |>
  mutate(ward0 = wards$ward[id0], ward1 = wards$ward[id1])

neighboring_divs <- vector(mode="list", length=nrow(neighboring_wards))
for(row in 1:nrow(neighboring_wards)){
  ward0 <- neighboring_wards$ward0[row]
  ward1 <- neighboring_wards$ward1[row]
  
  divs0 <- divs |> filter(ward == ward0)
  divs1 <- divs |> filter(ward == ward1)
  touches <- st_touches(divs0, divs1) |>
    lapply(as.data.frame) |>
    bind_rows(.id = "id0") |>
    mutate(id0 = asnum(id0)) |>
    rename(id1 = `X[[i]]`) |>
    mutate(div0 = divs0$warddiv[id0], div1 = divs1$warddiv[id1])
  neighboring_divs[[row]] <- touches |> select(div0, div1)
}
neighboring_divs <- bind_rows(neighboring_divs)

neighboring_divs <- neighboring_divs |>
  left_join(
    div_turnout, by=c("div0"="warddiv")
  ) |>
  left_join(
    div_turnout, by=c("div1"="warddiv"),
    suffix=c("_0", "_1")
  ) |>
  left_join(
    div_pops, by=c("div0" = "warddiv")
  )|>
  left_join(
    div_pops, by=c("div1" = "warddiv"),
    suffix=c("_0", "_1")
  )

neighboring_divs <- neighboring_divs |>
  mutate(
    prop_22_over_20_0 = general_2022_0 / general_2020_0,
    prop_22_over_20_1 = general_2022_1 / general_2020_1,
    signed_is_open = is_open_1 - is_open_0,
    signed_is_open_pre_22 = is_open_pre_22_1 - is_open_pre_22_0
  )

# Grab arbitrary unique pairs
set.seed(215)
neighboring_divs_lm <- neighboring_divs |>
  group_by(div0) |>
  filter(sample.int(n()) == 1) |>
  group_by(div1) |>
  filter(sample.int(n()) == 1)

form <- as.formula(
  "prop_22_over_20_1 - prop_22_over_20_0 ~ -1 + signed_is_open_pre_22 + I(hisp_1 / pop_1 - hisp_0 / pop_0) + I(nhb_1 / pop_1 - nhb_0 / pop_0)"
)
fit <- lm(form, data = neighboring_divs_lm) 
summary(fit)
## 
## Call:
## lm(formula = form, data = neighboring_divs_lm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31756 -0.05978 -0.00070  0.04634  0.32425 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## signed_is_open_pre_22           0.01738    0.01503   1.156    0.248    
## I(hisp_1/pop_1 - hisp_0/pop_0) -0.26663    0.06638  -4.017 7.38e-05 ***
## I(nhb_1/pop_1 - nhb_0/pop_0)   -0.24061    0.03274  -7.349 1.72e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09265 on 316 degrees of freedom
## Multiple R-squared:  0.1924, Adjusted R-squared:  0.1847 
## F-statistic:  25.1 on 3 and 316 DF,  p-value: 1.37e-14

The result: divisions in Open Wards had 1.7pp stronger turnout than divisions just across the boundary, but this result has a p-value of 0.25 and is not statistically significant (meaning we don’t have enough data to be confident it’s different from zero). This provides a confidence interval of (-1pp, 5pp).

A few notes about this analysis:

  • The result was much bigger (about 5pp) when I didn’t control for race! This means that even when comparing only division across boundaries, there are racial differences between the divisions in Open Wards and those not.
  • I’m only using one election comparison (2022 vs 2020), and we obviously have a lot more. I’ll think about ways to compellingly use more elections without introducing complications. Quick checks using 2018 vs 2016 did not substantively change the results.

So, what does this mean? Our best estimate for Open Wards’ change in turnout is a small, positive effect, but we don’t have enough data to be confident this is different from zero. Open Wards of course have other claims to benefits than just turnout: more democratic decision-making, for example. And a 2 percentage point increase in Philadelphia’s turnout, if it existed, wouldn’t be nothing: if scaled across the city, it would have meant 4,000 votes in 2019’s Superior Court election, which was decided by 17,000.