Let me start with this: I have no idea who will be selected as finalists in this year’s Big Data Bowl. Every year, the number of interesting and high-quality submissions goes up. I’m not going to write about ‘how to win the Big Data Bowl’, because I quite frankly have no idea. If you’re interested in that, then you should:
Talk to Quang Nguyen, a back-to-back finalist whose STRAIN notebook has been referenced in other competitions as an example of how to format a submission.
Instead, similar to last year, I’ll provide a preview of Carnegie Mellon Sports Analytics Center (CMSAC) submissions and mention some notebooks I found interesting. Of course this is subject to the selection bias of my network, so there are many interesting notebooks that I am unaware of. You can and should check out the work people have shared on Kaggle. If you competed in this year’s Big Data Bowl and I do not mention your notebook, please share a link to your work in the comments.
But before I get to that preview, I wanted to share some thoughts I have based on what I’ve seen in several submissions this year. These thoughts echo what Adi Wyner recently discussed on the great Wharton Moneyball podcast:
Always, always, always report base rates! Your model’s test accuracy means nothing to me without reporting the base rate. For example, you may observe that your XGBoost model that accounts for all sorts of tracking data features achieves a test accuracy of 80% at predicting whether some event happens. You may think: ‘WOOHOO 80% YEAH WE’RE SMART!’ But if that predicted event happens 80% of the time (ignoring all of those features you conditioned on), then you can just always guess that event and obtain the same test accuracy… which means your model is useless (regardless of whatever post-hoc summaries you try to use to justify its existence).
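To make that concrete, here is a minimal sketch of the comparison in R (the `test_outcome` and `model_probs` objects are hypothetical placeholders for your held-out 0/1 labels and predicted probabilities):

```r
# Base rate: accuracy from always guessing the more common outcome,
# ignoring every feature you engineered
base_rate <- max(mean(test_outcome), 1 - mean(test_outcome))

# Your model's test accuracy only means something relative to that base rate
model_acc <- mean((model_probs > 0.5) == test_outcome)
c(base_rate = base_rate, model_accuracy = model_acc)
```

If those two numbers are basically the same, your features bought you nothing. On a related note…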
Compare to a simpler model! Let’s pretend you decide to use some type of deep learning architecture as your model and observe that its test accuracy is 90%. And you were wise enough to also look at the base rate, which was only 50%. You may think: ‘WOW 90% > 50% YEAH WE’RE REALLY SMART!’ Not so fast my friend: beyond referencing the base rate, you should also compare your approach to a simpler model that also accounts for tracking data features. For example, you could take a number of simple distance-based and velocity-derived features from the tracking data, toss that shit into a logistic regression with a lasso penalty using glmnet, and see how it compares pretty quickly (see the sketch below). An additive model with sparsity is the simplest reference point that still accounts for features. If this type of model results in 70% while your deep learning approach achieves 90%, then great! That’s a win for you and indicates the complexity of the problem you’re working on, motivating the use of a flexible approach. BUT if the simple lasso model achieves the same level of accuracy as your deep learning architecture, then Brad Pitt as Billy Beane said it best (pardon the language).
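As a rough sketch of what that simpler baseline could look like (the `train_features`, `test_features`, `train_outcome`, and `test_outcome` objects are hypothetical placeholders for your tracking-derived features and 0/1 labels):

```r
library(glmnet)

# Simple tracking-derived features (separation distances, speeds, etc.)
x_train <- as.matrix(train_features)
x_test  <- as.matrix(test_features)

# Lasso-penalized logistic regression, with lambda chosen by cross-validation
lasso_fit <- cv.glmnet(x_train, train_outcome, family = "binomial", alpha = 1)

# Test accuracy of the simple additive baseline
lasso_probs <- predict(lasso_fit, newx = x_test, s = "lambda.min", type = "response")
mean((lasso_probs > 0.5) == test_outcome)
```

If your deep learning architecture cannot clearly beat this number, the extra complexity is not buying you much.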
REPORT UNCERTAINTY! It should go without saying that you should always report some form of uncertainty with whatever you’re estimating. This includes test accuracy! You split your data 80% train, 20% test and observe a test accuracy of 85%: cool, now repeat that four more times on different test folds (i.e., 5-fold cross-validation), and report the average of the five test accuracy estimates ALONG WITH STANDARD ERRORS. You could observe an average test accuracy of 85% across your five test folds where all five folds yield similar results, or where all five folds yield vastly different results; those two situations are not equally trustworthy. I have no idea how reliable your results are if you do not report uncertainty with your estimates.
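For example, here is a minimal sketch of 5-fold cross-validation with standard errors in R (the `plays` data frame and its `event`, `sep_dist`, and `closing_speed` columns are hypothetical placeholders for your own data, 0/1 outcome, and features):

```r
set.seed(2025)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(plays)))

fold_acc <- sapply(1:k, function(i) {
  # Fit on the training folds, evaluate on the held-out fold
  fit <- glm(event ~ sep_dist + closing_speed, family = binomial,
             data = plays[folds != i, ])
  probs <- predict(fit, newdata = plays[folds == i, ], type = "response")
  mean((probs > 0.5) == plays$event[folds == i])
})

# Report the average test accuracy ALONG WITH its standard error across folds
c(mean_acc = mean(fold_acc), se = sd(fold_acc) / sqrt(k))
```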
Again, this is not aimed at any individual notebook, but rather at what I observed across several submissions[1]. I’m not sure if it was because this year’s theme was about going from pre-snap to post-snap that more people focused on prediction-type problems. It’s also easier than ever to implement sophisticated techniques. For example, the smart folks at SumerSports released to the public a transformer-based architecture with a tutorial for the Big Data Bowl. This is awesome![2] There are a large number of problems in sports analytics with tracking data where such approaches are going to dominate statistical models and tabular-based machine learning techniques. But the points I made above remain: you should start with base rates and simpler models to provide appropriate points of reference for comparison.
The last point (in what feels like a rant): the Big Data Bowl samples are limited amounts of data[3]. Despite the fact that the Big Data Bowl samples are quite large in memory size (especially this year), they cover only a fraction of a single season of non-independent observations. In other words, you do not have as much data as you think[4]. Not only does this limit the capabilities of highly flexible techniques, but it also makes your reported accuracies less interesting. Teams and vendors with access to the entire history of NFL tracking data have a much greater capability of building a model with better test accuracy than whatever you can build. What is interesting, from the perspective of the Big Data Bowl competition, is not the reported level of accuracy itself (after demonstrating improvement over base rates and simpler approaches) but rather what you do with the model output/predictions. In my opinion, this is what separates the best notebooks based on some prediction task from the rest of the pack. The prediction accuracy is just a minor point, while the novelty is in the downstream use of the model.
Now on to the actual point of this post…
CMSAC submissions
The following are highlights of and links to Big Data Bowl submissions by CMU students that I hope you find interesting. I already wrote about Quang Nguyen’s submission on modeling snap timing variability, so below I will focus on the other two submissions from students I advised.
CHASE: A New Metric for Receiver Spatial Impact
Authors: James Lauer, Larry Jiang, Nicco Jacimovic, Jason Andriopoulos
Submission notebook: https://www.kaggle.com/code/lauerjames/chase-a-new-metric-for-receiver-spatial-impact
In this submission, three CMU MADS students (James, Nicco, and Jason) and a CMU Neuroscience PhD student (Larry, who was a finalist last year!) introduced an approach to measure how NFL receivers impact defensive spacing using a convex hull.
They were motivated by the notion of “gravity” in other sports, and demonstrated how this measure of attention relates to teammate catch probability. They also connected the ‘at-throw’ spacing to pre-snap information. This group worked incredibly hard and has a number of other findings that they were unable to fit into their submission. So here’s hoping they get the chance to elaborate more if selected as a finalist!
Trench Chess: How Defensive Lines Create Opportunities by Changing The Picture
Author: Abhi Varadarajan
Submission notebook: https://www.kaggle.com/code/abhishekvaradarajan/trench-chess
CMU sophomore Abhi Varadarajan introduced a way to identify disguised pressures and stunts using the tracking data, combining his football knowledge and data skills to provide insight on how defensive lines change the picture from pre-snap to post-snap. He has a great thread about his work that you should check out.
Other submissions I found interesting (so far)…
Safety Entropy: A Measure of Safeties’ Predictability: Great example of hitting on the points I discussed above, along with a neat use of entropy. They also began with the flex that they talked to Josh Rosen…
TEndencIQ: Outsmarting the Offense Through Pre-Snap Defensive Intelligence: Undergraduate track submission that focused on the TE’s role in a play.
Exposing Coverage Tells in the Presnap: The digital whiteboard in this submission is awesome and exactly the kind of tool that teams want and vendors create. And this was made by undergrads!
Again, if you competed in this year’s Big Data Bowl and I did not mention your notebook, please share a link to your work in the comments!
Thanks for reading!
[1] Do not get me started on the pointless use of 3D graphics that was prevalent this year…
[2] I teach transformers in a master’s level Natural Language Processing course.
[3] Mike Lopez hates me.
[4] See Brill et al. here: https://arxiv.org/abs/2409.04889
When you bite off more than you can chew, it shows itself when you get to the write-up and don't have the word count or the figures to adequately cover the exact points you talk about in this blog post. That was the case with ours, but I think it was a potentially interesting pursuit nonetheless. Sometimes, just making it happen without fizzling out is a small win in itself.
https://www.kaggle.com/code/jacobmarkmiller/anticipating-play-outcomes
Great article!
I do think your philosophy and points are well taken, but I also somewhat miss the good old days of practitioners building creative new methodologies. I am biased of course, but there is a part of me that believes there is inherent value in experimenting with and showcasing unique/creative methodology even if the raw results themselves are somewhat meaningless (small training set, needs to be reproduced over larger samples).
There was a part of us that was frustrated with that lack of creativity among those choosing to use deep learning, and with the fact that the official NGS team still uses the Zoo architecture from 2020. To me, seeing a project like CAMO, where they go in and find meaningful interpretations of attention weights (a non-trivial task imo), is pretty cool in itself.
Also, to your point about 3D graphics... I remember Rishav put together a fantastic one for our PaVE submission three years ago. It got so much hype on Twitter, and I'd be lying if I said I didn't think it contributed to us getting an HM. I can see why people make the investment!