Statistical Thinking in Sports Analytics
A brief overview of my recent experience teaching a sports analytics methods course
I am clearly very bad at managing a Substack newsletter, since the last post, from January, was about the Big Data Bowl! Of course, I have the excuse that I was very busy teaching during the spring semester, which left me little free time to write about sports analytics research papers (which I still intend to do, similar to my first post).
One of the reasons this spring semester was exceptionally busy was that I developed a brand new sports analytics methods course for our masters students and junior/senior undergraduates (who have taken our core regression course). In essence, the course was an introduction to multilevel modeling and applied Bayesian modeling through various problems in sports analytics. The first half of the semester walked through the basics of multilevel models (leaning heavily on the fantastic text Beyond Multiple Linear Regression) before the second half turned into a full applied Bayesian modeling course (leaning heavily on the also fantastic text Bayes Rules!).
I told students at the beginning of this course that the goals of the course were for the students to:
Become familiar with fundamental topics in sports analytics and the relevant statistical methods for tackling problems in this growing area
Build and interpret statistical models and quantify uncertainty
Be able to recognize a problem and develop an appropriate approach to modeling it, which includes formally writing down models (and all of the various assumptions) and knowing how to implement them
Develop a sports analytics project for their personal portfolio
It was fundamentally a model-building course around the context of sports analytics problems, and I had a blast teaching it! This included making extensive demos (that the students loved) and some pretty fun homework assignments (that the students may not have loved). Students also completed a few critiques of public sports analytics work they found in the wild, and worked on a group project. I was pretty impressed with the projects the students developed (which you can check out here) in a relatively short period of time (effectively a month). The projects had the following requirements:
Sports motivation: it had to be an interesting problem motivated by a sports question of interest, not a stupid textbook problem
Use easy-to-obtain data: I did not want students spending their time writing code to scrape data instead of building models
Build, evaluate, and interpret statistical models: they had to demonstrate the use of at least one modeling technique covered in the course, which included them formally writing out the model (which students struggle with…), justification for their modeling approach (such as comparison to alternative specifications), and then the relevant interpretation in the context of the topic
Quantify uncertainty: all results had to be reported with an appropriate measure of uncertainty! They were free to choose between traditional MLE standard errors, bootstrapping, or going full Bayesian - but they just had to justify why their choice was appropriate
Communicate with visualizations: the project had to include at least 2 relevant data visualizations (not including tables) since communication is a crucial component to sports analytics
I will likely do a more formal write-up of this course in the months ahead and have plans for distributing my materials (which includes creating modules for the SCORE Network). There were a number of topics I planned that I did not get to; for example, I never explicitly covered win probability! This was partly because I don't want to discuss it in a lazy way, since the structure of the problem is more challenging than meets the eye. But I also just did not have sufficient time, due to the amount of time I spent covering the methodology - which is relevant across all fields, not just sports analytics! I will be teaching this course again next year and have some plans for changes (including more tracking data), but here's a rough look at the course calendar (across a 14-week semester):
Expected value of game states
Expected goals - intro/review of generalized linear models (GLMs)
Went through basics of GLMs, model fitting, deviance, coefficient interpretation, etc.
Covered calibration and cross-validation in the context of sports data
Used NHL shot data (logistic regression) and soccer goal scoring rates (Poisson regression for homework)
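The GLM portion lends itself to a quick illustration. Below is a minimal sketch, in Python rather than the R used in class, of fitting an expected-goals-style logistic regression by gradient descent on simulated shot distances - the data, coefficients, and function names are all invented for illustration, not course materials:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(zs, ys, lr=0.5, epochs=2000):
    """Fit an intercept and slope by batch gradient descent on the log-loss."""
    b0 = b1 = 0.0
    n = len(zs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for z, y in zip(zs, ys):
            err = sigmoid(b0 + b1 * z) - y  # gradient of log-loss wrt linear predictor
            g0 += err
            g1 += err * z
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Simulated toy shots: goal probability decays with distance (feet)
random.seed(1)
dists = [random.uniform(5, 60) for _ in range(500)]
goals = [1 if random.random() < sigmoid(2.0 - 0.08 * d) else 0 for d in dists]

# Standardize distance so a fixed learning rate behaves well
zs = [(d - 30.0) / 10.0 for d in dists]
b0, b1 = fit_logistic(zs, goals)  # b1 should be negative: farther shots score less
```

In class this is of course a one-liner with `glm()`, but writing out the gradient once makes the deviance and coefficient discussions much more concrete.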
Expected Points in American Football - intro to multinomial logistic regression, building off the expected goals material
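A hypothetical sketch of the multinomial idea: once a model produces class probabilities over next-score states, expected points is just the probability-weighted sum of the point values. The states, point values, and linear predictors below are made up for illustration:

```python
import math

# Hypothetical next-score outcomes and their point values for the offense
values = {"touchdown": 7, "field_goal": 3, "no_score": 0, "opp_touchdown": -7}

def softmax(scores):
    """Convert linear predictors into class probabilities (numerically stable)."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def expected_points(scores):
    """Probability-weighted sum of point values over next-score states."""
    probs = softmax(scores)
    return sum(probs[k] * values[k] for k in values)

# Hypothetical linear predictors for some game state (e.g., 1st-and-10 at midfield)
scores = {"touchdown": 0.4, "field_goal": 0.1, "no_score": 0.0, "opp_touchdown": -0.6}
ep = expected_points(scores)
```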
Player/team evaluation with multilevel modeling
Began with discussion of popular residual allocation approaches used with Expected Points Added (EPA) and Completion Probability Over Expectation (CPOE) - this covered the challenges and difficulties in allocating residuals to credit players / teams, motivating the role of multilevel modeling
Walked through building multilevel models in the context of modeling completion probability
Motivated pooling through demonstration of naive model with coefficient for every QB
Discussed different levels in the data, for example:
Level One: individual pass attempts, with pass outcome as response, and variables describing individual pass attempts (e.g., air yards)
Level Two: Variables observed on larger observational units such as the QB involved and variables describing that QB
Walked through fixed effects vs random effects, varying intercepts, the intraclass correlation coefficient, varying slopes, adding in multiple levels, and nested vs crossed effects - all of this took several lectures
Walked through the Laird & Ware model formula, restricted maximum likelihood estimation (REML), and what exactly you're estimating (variances) versus the Empirical Bayes estimates that lme4 reports
Quantified uncertainty about random effects with traditional standard errors as well as bootstrapping, with discussion of the difficulty of bootstrapping sports data… that discussion of uncertainty led into Bayesian modeling
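The pooling idea can be sketched with a toy calculation (Python, invented numbers): each QB's raw completion rate gets shrunk toward the league mean, with fewer attempts meaning more shrinkage. The pseudo-attempt constant `k` here is a stand-in for the variance ratio a multilevel model would actually estimate:

```python
# Hypothetical per-QB completion data: (attempts, completions)
qbs = {"A": (600, 400), "B": (30, 24), "C": (120, 60)}

league_rate = sum(c for _, c in qbs.values()) / sum(n for n, _ in qbs.values())

def pooled_estimate(n, completions, k=100):
    """Shrink the raw rate toward the league mean; k acts like the ratio of
    within-QB to between-QB variance, i.e. how many 'pseudo-attempts' of
    league-average play we assume before seeing the data."""
    raw = completions / n
    w = n / (n + k)  # more attempts -> weight closer to 1 -> less shrinkage
    return w * raw + (1 - w) * league_rate

estimates = {qb: pooled_estimate(n, c) for qb, (n, c) in qbs.items()}
```

QB "B" looks great on 30 attempts, but the partially pooled estimate pulls that small sample most of the way back toward the league average - exactly the behavior that motivates fitting the real multilevel model.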
Intro to Bayesian Thinking
Reviewed methods of estimation and how we quantify uncertainty
Discussed interpretation of probability, e.g., does it make sense to think about the long-run frequency of two teams playing each other in the Super Bowl when we are actually only observing this game once?
Demonstrated Bayesian thinking and updating with a simple Binomial model, then the classic conjugate-prior Beta-Binomial model using Caitlin Clark's FG% across her college game log (this occurred right as she was breaking the scoring record)
This was the soft launch into Bayesian statistics that took us into the end of the first half of the semester
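The conjugate updating in that example boils down to simple counting. A sketch with illustrative numbers (not the actual game log):

```python
# Conjugate Beta-Binomial updating: with a Beta(a, b) prior on FG%, observing
# made/missed shots gives a Beta(a + makes, b + misses) posterior.
a, b = 10.0, 10.0  # prior: centered at 50% FG, worth about 20 shots of information

game_log = [(7, 15), (9, 20), (11, 18)]  # hypothetical (makes, attempts) per game

for makes, attempts in game_log:
    a += makes
    b += attempts - makes  # misses

posterior_mean = a / (a + b)
```

Updating game by game or all at once gives the same posterior, which is a nice first demonstration of why conjugacy makes sequential updating painless.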
Bayesian Modeling with Regularized Adjusted Plus-Minus (RAPM) - once you use Stan, you never go back
Returned from spring break with motivation of classic adjusted plus-minus modeling in basketball and moved into RAPM
I started by presenting ridge regression regularization the way most of these students had seen it in their machine learning courses: Loss + Penalty, tune the penalty, and shrink the coefficients
But then we turned to the Bayesian interpretation of ridge, which in turn led us to building our first fully Bayesian model in Stan
Before diving into Stan for RAPM, we reviewed different posterior approximation techniques (grid approximation, Laplace) and walked through MCMC with a simple coding exercise implementing the Metropolis-Hastings algorithm
Covered Gibbs sampling and then a high-level overview of Hamiltonian Monte Carlo before getting into Stan
Returned to the Caitlin Clark Beta-Binomial example for the Stan intro, walking through the components of writing Stan code, using rstan, and viewing different diagnostics
Implemented fully Bayesian RAPM in Stan, with various wrangling of the posterior distributions of player effects and variances, along with different ways of summarizing and viewing the posterior
Homework problem exploring the effect of a poorly chosen prior on ratings in the Bayesian RAPM model
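The Metropolis-Hastings exercise can be sketched in a few lines of Python. This toy version targets the posterior of a made-shot probability under a flat prior - the counts and tuning values are my own illustrative choices, not course code:

```python
import math
import random

def log_post(p, makes=37, misses=36):
    """Unnormalized log posterior for a Binomial success probability, flat prior."""
    if p <= 0.0 or p >= 1.0:
        return -math.inf  # outside the support -> always rejected
    return makes * math.log(p) + misses * math.log(1 - p)

def metropolis_hastings(n_iter=20000, step=0.05, seed=42):
    random.seed(seed)
    p = 0.5
    samples = []
    for _ in range(n_iter):
        proposal = p + random.gauss(0.0, step)  # symmetric random-walk proposal
        # Accept with probability min(1, posterior ratio); done on the log scale
        if math.log(random.random()) < log_post(proposal) - log_post(p):
            p = proposal
        samples.append(p)
    return samples[5000:]  # discard burn-in

draws = metropolis_hastings()
post_mean = sum(draws) / len(draws)
```

With a flat prior the exact posterior here is Beta(38, 37), so the chain's mean can be checked against the known answer - a useful sanity check before trusting the sampler on models without closed forms.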
Team ratings, static and dynamic
Discussed various RAPM design matrices and returned to homework dataset of soccer goal scoring rates to demonstrate rstanarm for Bayesian multilevel Poisson regression with offense and defense effects for every Premier League team
Worked through posterior prediction and the different levels of variability (this was fun)
Discussed prior and posterior predictive checks
Made connections to classic Bradley-Terry model
Introduced Elo ratings and walked through coding Elo ratings from scratch
Walked through classic Glickman & Stern state-space model for dynamic NFL team ratings, with demonstration of implementation in Stan
This led to one of my favorite visualizations of the semester
Discussed extensions and use-cases of state-space modeling in sports (it’s everywhere!)
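The from-scratch Elo coding can be sketched like so (Python; the K-factor and starting ratings are illustrative choices):

```python
def elo_expected(r_a, r_b):
    """Expected score for team A against team B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=20.0):
    """score_a is 1 for an A win, 0.5 for a tie, 0 for a loss.
    Both teams move by k times (actual - expected), so rating points
    are conserved: what A gains, B loses."""
    e_a = elo_expected(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Two evenly rated teams; A wins, so A gains k/2 = 10 points
r_a, r_b = elo_update(1500.0, 1500.0, 1.0)
```

Looping this update over a season's game results, in date order, is essentially the whole from-scratch implementation.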
Tracking Data (did not spend as much time as I wanted to here)
Big picture overview of tracking data: what the data are and how they're collected, plus the types of problems and methods people work on with tracking data, such as space ownership
Walked through continuous-time expected play value in American football, the number of different models required to make that work, and the importance of uncertainty quantification
Basic demo of NFL Big Data Bowl data wrangling, with example feature engineering and visualizations, including Voronoi tessellations
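A crude stand-in for the Voronoi idea can be sketched by assigning each cell of a toy field grid to its nearest player and tallying the share of space each player "owns". The positions and field dimensions below are invented, not Big Data Bowl data:

```python
def nearest_player(cell, players):
    """Return the id of the player closest (squared Euclidean) to the cell center."""
    cx, cy = cell
    return min(players, key=lambda pid: (players[pid][0] - cx) ** 2
                                        + (players[pid][1] - cy) ** 2)

def space_owned(players, width=20, height=10):
    """Fraction of unit grid cells 'owned' by each player - a discretized
    version of the areas a Voronoi tessellation would assign."""
    counts = {pid: 0 for pid in players}
    for x in range(width):
        for y in range(height):
            counts[nearest_player((x + 0.5, y + 0.5), players)] += 1
    total = width * height
    return {pid: n / total for pid, n in counts.items()}

# Two players mirrored across the field's midline should split the space evenly
players = {"offense_1": (5.0, 5.0), "defense_1": (15.0, 5.0)}
shares = space_owned(players)
```

The exact polygon version (e.g., `scipy.spatial.Voronoi`) is what you'd use in practice, but the grid version makes the "nearest player owns the space" definition transparent.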
The final lecture, after project presentations, was about sports betting, with a simple overview of break-even percentage, sportsbook probability, and hold/vig; I showed them Unabated, and told them to go home and rethink their life if they were betting…
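That betting arithmetic is compact enough to sketch (Python; the standard -110/-110 spread market as the example, with hold defined as the book's margin per dollar wagered):

```python
def implied_prob(american_odds):
    """Convert American odds to the sportsbook's implied (break-even) probability."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100.0)
    return 100.0 / (american_odds + 100.0)

# A standard -110 / -110 point spread market
p_home = implied_prob(-110)  # win rate needed to break even betting at -110
p_away = implied_prob(-110)

overround = p_home + p_away      # sums past 1.0 because the book charges vig
hold = (overround - 1.0) / overround  # book's expected margin per dollar wagered
```

At -110 the break-even rate is about 52.4%, so a bettor who is "right more often than not" at 51% is still losing money - which is roughly where the "go home and rethink your life" advice comes from.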
Thanks for reading!