Before November's election, I published a prediction of the Pennsylvania General Assembly Lower House: Democrats would win on average 101.5 seats (half of the 203), and were slight favorites (53%) to win the majority. Then I found a bug, and published an updated prediction: Democrats would win 96 seats on average (95% confidence interval from 88 to 104), and had only a 13% chance of taking the house. This prediction, while still giving Republicans the majority, represented an average 14 seat pickup for Democrats from 2016.
That prediction ended up being correct: Democrats currently have the lead in 93 seats, right in the meat of the distribution.
But as I dug into the predictions, seat-by-seat, it looked like there were a number of seats that I got wildly wrong. And I've finally decided that the model had another bug; probably one that I introduced in the fix to the first one. In this post I'll outline what it was, and what I think the high-level modelling lessons are for next time.
Where the model did well, and where it didn't
The mistake ended up being the exact opposite of the fix I implemented: it hit candidates in races with no incumbents.
The State House has 203 seats. This year, there were 23 uncontested Republicans and 55 uncontested Democrats. I got all of them right 😎. The party imbalance in uncontested seats was actually unprecedented in at least the last two decades; they're usually about even. Among the 125 contested races, 21 had a Democratic incumbent, 76 a Republican incumbent, and 28 no incumbent. I was a little bit worried that this new imbalance meant the process of choosing which seats to contest had changed, and that past results in contested races would be different from this year's. Perhaps Democrats were contesting harder seats, and would win a lower percentage of them than in the past. That didn't prove true.
The aggregate relationship between my predictions and the results looks good. I got the number of incumbents that would lose, both Democratic and Republican, right. In races without incumbents, I predicted an even split of 14 Democrats and 14 Republicans, and the final result of 17 R - 11 D might seem... ok? Within the range of error, as we saw in the histogram above. But the scatterplot tells a different story.
Above is a scatterplot of my predicted outcome (the X axis) versus the actual outcome (the Y axis). Perfect predictions would fall along the 45 degree line. Points are colored by the party of the incumbent (if there was one). The blue dots and the red dots look fine; my model actually expected any given point to be centered around this line, plus/minus 20 points, so that distribution was exactly as expected. But the black dots look horribly wrong. Among those 28 races without incumbents, I predicted all but three to be close contests, and missed many of them wildly. (That top black dot, for example, was Philadelphia's 181st, where I predicted Malcolm Kenyatta (D) slightly losing to Milton Street (R). That was wrong, to put it mildly. Kenyatta won 95% of the vote.)
What happened? My new model specification, done in haste to fix the first bug, imposed faulty logic. It forced the past Presidential races to carry the same information about races without incumbents as about races with them, even though races with incumbents had other information available. I should have allowed the model to fall back on district partisanship when it didn't have incumbent results, but the equivalent of a modelling typo didn't allow that. Instead, all of these predictions ended up as bland, basically even races, because the model couldn't use the right information to differentiate them. My overall house prediction ended up looking good only because just 28 of the 203 districts were affected, and getting three too many wrong didn't make the topline look too bad. But it was a bug.
I'm new to this world of publishing predictive models based on limited datasets and with severe time constraints (I can't spend the months on a model that I would in grad school or at work). What are the lessons of how to build useful models under these constraints?
Lesson 1: Go through every single prediction. I never looked at District 181. If I had seen that prediction, I would have realized something was terribly wrong. Instead, I looked at the aggregate predictions (similar to the table above), and things looked okay enough. Next time, I'll force myself to go through every single prediction (or a large enough sample of predictions if there are too many). When I tried to hand-pick sanity checks based on my gut, I happened to not choose "a race with no incumbents, but which had an incumbent for decades, and which voted for Clinton at over 85%".
Lesson 2: Prefer clarity of the model's calculations over flexibility. I fell into the trap of trying to specify the full model in a single linear form. Through generous use of interactions, I thought I would give the model the flexibility to identify different relationships between historic presidential races and length of incumbency. This would have been correct, if I had implemented it bug-free. But I happened to leave out an important three-way interaction. If I had fit separate models for different classes of races--perhaps allowing the estimates to be correlated across models--I would have immediately noticed the differences.
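A toy illustration of that failure mode, with made-up data (the slopes, the class labels, and the sample size are all assumptions for this sketch, not anything from the real model): a single pooled regression that omits the needed interaction estimates one compromise slope that is wrong for every race, while separate per-class fits immediately expose the difference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: in races with an incumbent, the outcome tracks the
# presidential margin with slope 0.5; in open races, with slope 1.0.
pres = rng.uniform(-20, 20, size=200)      # presidential margin
is_open = rng.integers(0, 2, size=200)     # 1 = no incumbent
outcome = np.where(is_open == 1, 1.0 * pres, 0.5 * pres)

# One pooled model without the interaction: a single slope, wrong for both classes.
X_pooled = np.column_stack([np.ones_like(pres), pres])
beta_pooled, *_ = np.linalg.lstsq(X_pooled, outcome, rcond=None)

# Separate models per class: each recovers its own slope.
slopes = {}
for cls in (0, 1):
    m = is_open == cls
    X = np.column_stack([np.ones(m.sum()), pres[m]])
    beta, *_ = np.linalg.lstsq(X, outcome[m], rcond=None)
    slopes[cls] = beta[1]

print(slopes)           # ~0.5 for incumbent races, ~1.0 for open ones
print(beta_pooled[1])   # a blend of the two: wrong for everybody
```

The per-class fits make the discrepancy impossible to miss, which is exactly the diagnostic benefit of the separate-models approach.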
Lesson 2b: I actually learned this extension to Lesson 2 in the process of fitting, but the post-hoc assessment hammers it home: when you have good information, the model can be quite simple. In this case, the final predictions did well even with the bug because the aggregate result was really pretty easy to predict given three valuable pieces of information: (a) incumbency, (b) past state house results, and (c) FiveThirtyEight's Congressional predictions. The last was vital: it's a high-quality signal of the overall sentiment of this year's election, which is the biggest factor after each district's partisanship. A model that used only these three inputs and then correctly estimated districts' correlated errors around those trends would have gotten this election spot on.
Predictions will be back
For better or worse, I've convinced myself that this project is actually possible to do well, and I'm going to take another stab at it in upcoming elections. First up is May's Court of Common Pleas election. These judges are easy to predict: nobody knows anything about the candidates, so you can nail it with just structural factors. More on that later!
On November 6, 2018, we (i.e. all of you readers out there) collectively submitted 1,341 turnout data points to the Sixty-Six Wards Live Turnout Tracker. That made my modelling effort pretty easy: we missed the true turnout of 548,000 by only 7,000 votes! (This is almost too good… More thoughts on that in a later post.)
One novel feature of this dataset is that we can look at when through the day Philadelphians voted. I did this back in the Primary, and my surprising finding was that turnout was relatively flat. While there were before- and after-work surges, they weren’t nearly as pronounced as I had expected.
That was a midterm primary. How about this time, in the record-setting general? Two weeks ago we saw an unprecedented, energized electorate, and it looks like their time-of-day voting pattern was different too.
Estimating the overall pattern requires some statistics. Each data submission contains three pieces of information: precinct, time of day, and the cumulative votes at that precinct. Here, I fit something different from my live election-day model, since I have the true final results. I assume that between hours h-1 and h, a fraction f(h) of the city's final vote is cast, and that each precinct has a fraction that randomly deviates from that citywide schedule.
For division d, the fraction of the final votes that have been achieved by hour h is given by
v_d(0) = 0
v_d(h) = v_d(h-1) + f(h) + f_d(h)
where f(h) is the city-wide average hourly fraction (I use an improper, flat prior), and the division-level deviations f_d(h) are drawn from a normal distribution with an unknown standard deviation sigma_dev.
Because observations don’t occur only on the hour, I use linear interpolation in between, so a data submission x_i, which occurs in division d at time h + t with t between 0 and 1, is modeled as
x_i ~ student_t(1, mean = (1-t) v_d(h) + t v_d(h+1), sd = sd_error),
with sd_error unknown.
The result is that I estimate the fraction of the votes cast in a given hour, f(h), around which divisions randomly deviate. Notice that this isn’t exactly the citywide fraction of votes if those deviations from the mean are correlated with overall turnout (if, for example, high-turnout wards also are more likely to vote before 9am), but the coverage of submissions is heavily skewed towards high-turnout wards, so you should read these results as mainly pertaining to high-turnout wards anyway (see below for more). I also don't account for the fact that needing to add up to 100% of the vote induces correlations in the estimates but... um... this will be fine.
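The generative side of that model is simple enough to sketch. Everything numeric below (a flat hourly schedule f, sigma_dev = 0.01, 13 voting hours from 7am to 8pm) is an assumption for illustration, not a fitted value:

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.full(13, 1 / 13)    # citywide fraction of the final vote cast each hour
sigma_dev = 0.01           # spread of the division-level deviations (assumed)

def cumulative_fraction(f, dev):
    """v_d(h): fraction of the division's final vote cast by hour h (h=0 is 7am)."""
    return np.concatenate([[0.0], np.cumsum(f + dev)])

# Draw one division's hourly deviations; centering them makes v end at exactly 1.
dev = rng.normal(0, sigma_dev, size=13)
dev -= dev.mean()
v = cumulative_fraction(f, dev)

def interpolate(v, h, t):
    """Model mean for a submission at hour h + t, with 0 <= t <= 1."""
    return (1 - t) * v[h] + t * v[h + 1]

# Expected fraction of this division's final vote reported at 9:30am (h=2, t=0.5).
print(interpolate(v, 2, 0.5))
```

The actual fit runs this logic in reverse: given the observed cumulative fractions, it infers f(h) and sigma_dev under the Student-t error model above.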
When Philadelphia Votes
The results: In this election, we saw strong before- and after-work volume, with 26.9% of votes coming before 9am [95% CI: (25.8, 28.1)] and 27.4% of votes coming between 4 and 7pm [95% CI: (24.7, 30.1)].
The overall excitement of the election appears to have largely manifested in the morning, when reports of lines (in a midterm!) came from across the city. There was another surge from 4 to 7pm, as people left work, but once again voting from 7 to 8pm was quiet (without the thunderstorm, this time).
Do neighborhoods vote differently?
It seems likely that residents of different neighborhoods would vote at different times of the day. Unfortunately, participation in the Turnout Tracker, humbling as it was, came disproportionately from Whiter, wealthier wards, and I don’t have great data to identify differences between groups. But here’s a tentative stab at it.
I’ve manually divided wards into six groups based on their overall turnout rates, their correlated voting patterns across the last 16 elections, and racial demographics. (Any categorization is sure to upset people, so I’m preemptively sorry. But these wards appear to vote similarly, and I've chosen clear names over euphemistic ones.)
Below is a plot of the raw data submissions, divided by the eventual final votes, by ward group. The difference between groups is not especially pronounced, but it appears that Center City and Ring did vote disproportionately before 9am (along with Manayunk, which behaves like them). [Notice that in the raw submissions, some of the fractions are over 1, meaning that a submitted voter number was higher than the final vote count.]
To estimate the fraction of the vote at exactly 9am, I filter all of the data to between 8 and 10am, drop the obvious outliers, and fit a simple linear model of fraction on time for each group. The estimated fractions at 9am are below.
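For concreteness, here's a minimal sketch of that per-group fit (the submission times and fractions below are invented for illustration, not actual Tracker data):

```python
import numpy as np

# Hypothetical submissions for one ward group, filtered to 8-10am:
# time of day in hours, and cumulative votes divided by the final count.
times = np.array([8.1, 8.4, 8.7, 9.0, 9.3, 9.6, 9.9])
fracs = np.array([0.18, 0.20, 0.23, 0.25, 0.27, 0.30, 0.32])

# Simple linear model of fraction on time, evaluated at 9am.
slope, intercept = np.polyfit(times, fracs, deg=1)
frac_at_9am = slope * 9.0 + intercept
print(round(frac_at_9am, 3))  # → 0.25
```

Repeating this for each ward group gives the group-level 9am estimates below.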
Nearly 28% of the Center City and Ring Wards' votes came before 9am, which was statistically significantly higher than all other groups except for the Universities (which had large uncertainty). The other wards hovered around 24% by 9am, except for the very low-turnout North Philly and Lower Northeast wards which only had 17% of their final turnout by that hour.
Energy means early voting?
It certainly appears that in this highly energetic election, a disproportionate amount of votes came in the morning. This could mean that the population that gets energized is more likely to vote in the morning (perhaps workers, as opposed to retirees), or that in excited elections, voters want to ensure their vote in the morning.
In May 2019, the Live Election Tracker will be back. And at that point, we'll have three whole datasets to be able to evaluate time of day data! Then we'll really be able to say something. Before then, I've got a number of posts in the pipeline. Next up: evaluating my various models, some better than others. Stay tuned!
The vote count on Tuesday morning had me worried. When the Turnout Tracker crawled above 198,000 voters before 10am--half of 2014 turnout through just three hours of voting--I scrambled to my computer to double, triple check the calculations. But my code wasn't off, it was the voters that had completely changed. By 8pm that night, over 547,000 Philadelphians had voted, a 43% increase over four years ago.
I'll dig into the tracker in another post, but let's do a quick hit of the actual final turnout and some maps.
* Quick Note: In everything below, I use the votes cast for President/Governor, rather than actual turnout. This will be short by however many people leave that race blank, which appears to be < 1% of voters in past years. These results are also preliminary, representing only 99.5% of precincts.
Some 537,231 Philadelphians cast votes for Governor. This is a 158 thousand increase over the 379,046 of 2014. The absolute size of this increase is unprecedented since at least 2002, and would have been the largest proportional increase in that span if not for the *82%* increase to elect Krasner in 2017. As the plot makes jarringly clear, in the aftermath of 2016, something is different.
And Philadelphia somewhat outpaced the rest of the state, where votes grew by 41%. So not only did turnout surge, but Philadelphia eked out some more statewide clout, too.
Where the votes came from
The turnout boom was not evenly distributed. As we saw in 2017, it was driven by the wealthier, predominantly white wards, and especially those that have gentrified over the last twenty years.
Immediately in the morning, the Tracker made one thing clear: Ward 27 saw the biggest turnout growth. It ruined the color scale on the map.
The 27th, in University City, saw an increase in votes of 135%. That manages to dwarf even the 95% growth of Wards 31 and 18 (in Kensington/Fishtown), in second and third.
This is worth digging into more. How did a ward possibly see this much growth? It's all Penn. The division right along the river, containing most undergrad dorms, went from 116 votes in 2014 to 585 in 2018. That manages to make the second and third place divisions look a weaker green, even though they quadrupled(!!) their votes, from 88 to 363 and from 82 to 322, respectively.
This is mostly a problem of denominators, though. Turnout at Penn was tiny in 2014. It's not as if Ward 27 all of a sudden has the most votes in the city. Instead, the student-heavy ward just this year performed like the average center-city-ringing neighborhood.
As evidence, consider instead 2018 votes as a fraction of 2016. I like this comparison, because the Presidential election of 2016 probably represents the highest plausibly attainable turnout for a midterm; you're just never going to get someone who doesn't vote for President to vote for Governor.
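The metric itself is just a per-ward ratio, sketched here with hypothetical counts (none of these ward names or numbers are real):

```python
# 2018 midterm votes as a share of the same ward's 2016 presidential votes.
votes_2016 = {"Ward A": 10000, "Ward B": 8000}
votes_2018 = {"Ward A": 8800, "Ward B": 4100}

retention = {w: votes_2018[w] / votes_2016[w] for w in votes_2016}
print(retention)  # → {'Ward A': 0.88, 'Ward B': 0.5125}
```

A ratio near 1 means the ward turned out nearly its presidential-year electorate; low-growth wards can still score well here if they vote reliably in every election.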
By this comparison, Wards 9 and 22 (Chestnut Hill and Mount Airy) look great, but typically so, with 88% and 87% of the 2016 voters coming out again in 2018. In third place was Ward 31 in Kensington, and Ward 18 below it was sixth, with 87% and 85% of 2016 turnout, respectively. Notice that these two also saw the second and third most *growth* since 2014, and have quickly ascended the ranks of top voting wards in the city.
Meanwhile, many of the Black wards that always come out in midterms continued to do so, including Overbrook and Wynnefield in West Philly, and Cedar Brook and West Oak Lane in the Northwest. These wards didn't see their turnout grow a ton, mostly because they always, always vote (at least relative to the rest of the city).
Many of the divisions in Ward 31 to the North saw triple the 2014 turnout, whereas divisions in Ward 18 below it merely doubled.
[Edited: My map was incorrectly handling new division 18-18, so I’ve removed it until I have a fix].
The growth in these districts isn't always what a candidate will care about. Growth in a division doesn't matter all that much if it started at a very low point, or if nobody lives there. For candidates, what might be most useful is simply the density of votes, which is a function of population density and turnout. In this metric, Fairmount and West Center City glow, along with Ward 46 in University City. These are all divisions with high turnout and a ton of people.
Finally, here's turnout as a function of the population over 18. This has some benefits over the typically reported turnout of voters divided by registered voters, because it (a) doesn't rely on efficiently taking people off the books, which is notoriously slow in Philadelphia, and (b) bakes into the calculation any systematic differences in getting registered to vote, which I claim should be considered part of the voting process. Its main (large) downside is that it includes in the denominator non-citizens and other residents not eligible to register, which will make immigrant communities look particularly bad.
Grad Hospital, Fairmount, and Chestnut Hill shine by this metric, while North Philly, the lower Northeast, and even Penn despite all its growth have low percentages.
Sadly, one demographic has been under-represented in every single map in today's post: the Hispanic communities in North Philly. For example, Ward 7 is the darkest ward in the map of 2018 vs 2016, meaning that despite its 63% vote growth over 2014 (11th best growth in the city!), it still had the worst turnout relative to 2016: only 52% of the votes from 2016 came out for 2018. This community already has the lowest turnout in Presidential races, but it shrinks even further in non-Presidential elections.
Two straight elections of something different
Philadelphia's turnout since 2016 has been astounding. Across the city there were 43% more votes than four years ago, and every single Ward's turnout grew.
While voters always come out in Center City, Chestnut Hill, Overbrook, and Cedar Brook, we saw unprecedented booms in Fishtown, University City, and the rest of the neighborhoods ringing Center City.
These changes have now stuck around for two straight elections--2017 and 2018--and could presage a fundamental change in our city's political calculus for years to come.