
Bayesian state space modelling of the Australian 2019 election

source link: https://www.tuicool.com/articles/hit/uaa6Zvf

So I’ve been back in Australia for five months now. While things have been very busy in my new role at Nous Group, it’s not so busy that I’ve failed to notice there’s a Federal election due some time by November this year. I’m keen to apply some of the techniques I used in New Zealand to the richer data landscape (more polls, for one) and complex environment of Australian voting systems.

Polling data

The Australian Electoral Commission has wonderful, highly detailed data on actual results, which I’ll doubtless be coming to at some point. However, I thought for today I’d start with the currency and perpetual conversation-making power (at least in the media) of polling data.

There’s no convenient analysis-ready collection of Australian polling data that I’m aware of. I used methods similar to those behind my nzelect package to grab these survey results from Wikipedia, where they have been compiled by some anonymous benefactor from the time of the 2007 election campaign until today.

Thanks are owed to Emily Kothe who did a bunch of this scraping herself for 2010 and onwards and put the results on GitHub (and on the way motivated me to develop a date parser for the horror that is Wikipedia’s dates), but in the end I started from scratch so I had all my own code convenient for doing updates, as I’m sure I’ll be wanting.

All the code behind this post is in its own GitHub repository. It covers grabbing the data, fitting the model I’ll be talking about soon, and the graphics for this post. That repo is likely to grow as I do more things with Australian voting behaviour data.

Here’s how that polling data looks when you put it together:

[Figure: polled voting intention by party, from the 2007 election campaign to today]

Notes on the abbreviations of Australian political parties in that chart:

  • ONP ~ “Pauline Hanson’s One Nation” ~ nationalist, socially conservative, right-wing populism
  • Oth ~ Other parties
  • Grn ~ “Australian Greens” ~ left wing, environment and social justice focus
  • ALP ~ “Australian Labor Party” ~ historically the party of the working person, now the general party of the centre left
  • Lib/Nat ~ “Liberal Party” or “National Party” ~ centre and further right wing, long history of governing in coalition (and often conflated in opinion polling, hence the aggregation into one in this chart)

I’m a huge believer in looking at polling data in the longer term, not just focusing on the current term of government and certainly not just today’s survey release. The chart above tells some of the story of the last decade or so; even a casual observer of Australian politics will recognise some of the key events, and particularly the comings and goings of Prime Ministers, in this chart.

Prior to 2007 there’s polling data available in Simon Jackman’s pscl package which has functionality and data relating to political science, but it only covers the first preference of voters so I haven’t incorporated it into my cleaned up data. I need both the first preference and the estimated two-party-preference of voters.

(Note to non-Australian readers: Australia has a Westminster-based political system, with the government recommended to the Governor-General by whoever has the confidence of the lower house, the House of Representatives, which is electorate-based with a single-transferable-vote, aka “Australian vote”, system. And if the USA could just adopt something as sensible as some kind of preferential voting system, half my Twitter feed would probably go quiet.)

Two-party-preferred vote

For my introduction today to analysis with this polling data, I decided to focus on the simplest variable for which a forecast could credibly be seen as a forecast of the outcome on election day, whenever it is. I chose the two-party-preferred voting intention for the Australian Labor Party or ALP. We can see that this is pretty closely related to how many seats they win in Parliament:

[Figure: ALP two-party-preferred vote against share of seats won in the House of Representatives]

The vertical and horizontal blue lines mark 50% of the vote and of the seats respectively.

US-style gerrymanders generally don’t occur in Australia any more, because of the existence of an independent electoral commission that draws the boundaries. So winning on the two-party-preferred national vote generally means gaining a majority in the House of Representatives.

Of course there are no guarantees; and with an electoral preference that is generally balanced between the two main parties, even a few accidents of voter concentration in the key electorates can make a difference. This possibility is enhanced in recent years with a few more seats captured by smaller parties and independents:

[Figure: seats captured by smaller parties and independents in recent elections]

All very interesting context.

State space modelling

My preferred model of the two I used for the last New Zealand election was a Bayesian state space model. These are a standard tool in political science now, and I’ve written about them in both the Australian and New Zealand context.

To my knowledge, the seminal paper on state space modelling of voting intention based on an observed variety of polling data is Jackman’s “Pooling the Polls Over an Election Campaign”. I may be wrong; happy to be corrected. I’ve made a couple of blog posts out of replicating some of Jackman’s work with first preference intention for the ALP in the 2007 election. In fact, this was one of my self-imposed homework tasks in learning to use Stan, the wonderfully expressive statistical modelling and high-performance statistical computation tool and probability programming language.

My state space model of the New Zealand electorate was considerably more complex than I need today, because in New Zealand I needed to model (under proportional representation) the voting intention for multiple parties at once. Whereas today I can focus on just two-party-preferred vote for either of the main blocs. Obviously a better model is possible, but not today!

The essence of this modelling approach is that we theorise the existence of an unobserved latent voting intention, which is measured imperfectly and irregularly by opinion poll surveys. These surveys have sampling error and other sources of “total survey error”, including “house effects” or statistical tendencies to over- or under-estimate vote in particular ways. Every few years, the true voting intention manifests itself in an actual election.
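As a rough sketch only, that description can be written down as a generic local-level state space model. The notation here is my own assumption for illustration (it is in the spirit of Jackman's approach, and is not a transcription of the Stan program in the repository): \mu_t is the latent two-party-preferred vote on day t, y_i is the result of poll i taken on day t[i] by polling firm j[i], \delta_j is that firm's house effect, and s_i is the poll's standard error.

```latex
\begin{aligned}
\mu_t &= \mu_{t-1} + \epsilon_t, & \epsilon_t &\sim \mathcal{N}(0, \sigma^2)
  && \text{(latent voting intention drifts day to day)} \\
y_i &= \mu_{t[i]} + \delta_{j[i]} + \eta_i, & \eta_i &\sim \mathcal{N}(0, s_i^2)
  && \text{(a poll measures it with a house effect and noise)} \\
\mu_{t_e} &= \text{actual election result} & &
  && \text{(on each election day } t_e \text{ the latent value is observed)}
\end{aligned}
```

Under these assumptions the election results act as anchors, which is what lets the house effects \delta_j be separated from genuine movement in \mu_t.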

Using modern computational estimation methods we can estimate the daily latent voting intention of the public based on our imperfect observations, and also model the process of change in that voting intention over time and get a sense of the plausibility of different outcomes in the future. Here’s what it looks like for the 2019 election:

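[Figure: estimated latent two-party-preferred voting intention for the ALP, with polls and forecast out to the 2019 election]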

This all seems plausible and I’m pretty happy with the way the model works. The model specification written in Stan and the data management in R are both available on GitHub.

An important use for a statistical model in my opinion is to reinforce how uncertain we should be about the world. I like the representation above because it makes clear, in the final few months of modelled voting intention out to October or November 2019, how much change is plausible and consistent with past behaviour. So anyone who feels certain of the election outcome should have a look at the widening cone of uncertainty on this chart and have another think.

A particularly useful side effect of this type of model is statistical estimates of the over- or under-estimation of different survey types or sources. Because I’ve confronted the data with four successive elections we can get a real sense of what is going on here. This is nicely shown in this chart:

[Figure: estimated house effects by polling firm]

We see the tendency of Roy Morgan polls to overestimate the ALP vote by one or two percentage points, and of YouGov to underestimate it. These are interesting and important findings (not new to this blog post though). Simple aggregations of polls can’t incorporate feedback from election results in this way (although of course experienced people routinely make more ad hoc adjustments).
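To make the idea of a house effect concrete, here is a minimal simulation sketch in base R. Everything in it is invented for illustration: the pollster names, the sizes of the house effects, the sample size and the random-walk volatility are all assumptions, and none of it uses the real polling data or the Stan model from the repository.

```r
# Simulate a latent two-party-preferred (TPP) series and noisy polls of it.
# All numbers are made up for illustration; none come from the real data.
set.seed(123)

n_days <- 300
# Latent voting intention: a random walk starting at 50 percent,
# with an assumed daily innovation sd of 0.25 percentage points
mu <- 50 + cumsum(c(0, rnorm(n_days - 1, mean = 0, sd = 0.25)))

# Hypothetical house effects (percentage points) for three invented pollsters
house_effect <- c(PollsterA = 1.5, PollsterB = -1.0, PollsterC = 0)

n_polls <- 60
polls <- data.frame(
  day = sort(sample(seq_len(n_days), n_polls, replace = TRUE)),
  pollster = sample(names(house_effect), n_polls, replace = TRUE),
  stringsAsFactors = FALSE
)

# Sampling standard error of a proportion near 50%, in percentage points
sample_size <- 1000
se <- 100 * sqrt(0.5 * 0.5 / sample_size)

# Each poll = latent value on its day + its firm's house effect + sampling noise
polls$tpp <- mu[polls$day] + house_effect[polls$pollster] + rnorm(n_polls, 0, se)

# In a simulation we know the latent value, so the average error by pollster
# is a direct (if noisy) estimate of each house effect
polls$error <- polls$tpp - mu[polls$day]
estimated_house_effect <- tapply(polls$error, polls$pollster, mean)
print(round(estimated_house_effect, 2))
```

With real data the latent value is never observed between elections, so this shortcut is not available; that is why the state space model has to estimate the house effects jointly with the latent series, using the election results as the fixed points.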

A more sophisticated model would factor in change over time in polling firms’ methods and results, but again that would take me well beyond the scope of this blog post.

Looking forward to some more analysis of election issues, including of other data sources and of other aspects, over the next few months.

Here’s a list of the contributors to R that made today’s analysis possible:

thankr::shoulders() %>% knitr::kable() %>% clipr::write_clip()
| maintainer          | no_packages | packages |
|---------------------|-------------|----------|
| Hadley Wickham      | 16 | assertthat, dplyr, ellipsis, forcats, ggplot2, gtable, haven, httr, lazyeval, modelr, plyr, rvest, scales, stringr, tidyr, tidyverse |
| R Core Team         | 12 | base, compiler, datasets, graphics, grDevices, grid, methods, parallel, stats, stats4, tools, utils |
| Gábor Csárdi        | 6  | callr, cli, crayon, pkgconfig, processx, ps |
| Winston Chang       | 4  | extrafont, extrafontdb, R6, Rttf2pt1 |
| Yihui Xie           | 4  | evaluate, knitr, rmarkdown, xfun |
| Kirill Müller       | 4  | DBI, hms, pillar, tibble |
| Dirk Eddelbuettel   | 3  | digest, inline, Rcpp |
| Lionel Henry        | 3  | purrr, rlang, tidyselect |
| Jeroen Ooms         | 2  | curl, jsonlite |
| Jim Hester          | 2  | pkgbuild, readr |
| Ben Goodrich        | 2  | rstan, StanHeaders |
| Jim Hester          | 2  | glue, withr |
| Vitalie Spinu       | 1  | lubridate |
| Deepayan Sarkar     | 1  | lattice |
| Gabor Csardi        | 1  | prettyunits |
| Patrick O. Perry    | 1  | utf8 |
| Jennifer Bryan      | 1  | cellranger |
| Michel Lang         | 1  | backports |
| Simon Jackman       | 1  | pscl |
| Jennifer Bryan      | 1  | readxl |
| Kevin Ushey         | 1  | rstudioapi |
| Justin Talbot       | 1  | labeling |
| Simon Potter        | 1  | selectr |
| Jonah Gabry         | 1  | loo |
| Charlotte Wickham   | 1  | munsell |
| Alex Hayes          | 1  | broom |
| Joe Cheng           | 1  | htmltools |
| Baptiste Auguie     | 1  | gridExtra |
| Luke Tierney        | 1  | codetools |
| Henrik Bengtsson    | 1  | matrixStats |
| Peter Ellis         | 1  | frs |
| Simon Garnier       | 1  | viridisLite |
| Brodie Gaslam       | 1  | fansi |
| Brian Ripley        | 1  | MASS |
| R-core              | 1  | nlme |
| Stefan Milton Bache | 1  | magrittr |
| Marek Gagolewski    | 1  | stringi |
| James Hester        | 1  | xml2 |
| Max Kuhn            | 1  | generics |
| Simon Urbanek       | 1  | Cairo |
| Jeremy Stephens     | 1  | yaml |
| Achim Zeileis       | 1  | colorspace |
