Georgia Tech Football: Advanced Stats Week 2022 - Data and its Environs

Let’s get meta — meta-data, that is.

COLLEGE FOOTBALL: NOV 27 Georgia at Georgia Tech Photo by David J. Griffin/Icon Sportswire via Getty Images

Unlike Robert’s work this week, my contributions will be less focused on specific data and more on the meta-narrative of that data. That’s really where the discussion on the current state of the college football analytics world begins and ends: what data we have, what type of data we have, and how we get access to it.

This feels like a very basic thing, right? Of course, data is important — no data means no nerds Posting charts on the Bird App™️, and we can’t have that. But there’s nuance in these details: the data guides insight. You can slice and dice a spreadsheet any way you want to make a tweetable graphic, but:

  1. well, you need the actual spreadsheet (the literal data), AND:
  2. you need to know what the spreadsheet has in it (what type of data you have).

Now, with that in mind, let’s get down to business.

How do we get data?

This is one of two hard parts when it comes to creating the charts you see on Twitter (the other being data cleaning). For college football specifically, there are a few different types of data you might want to grab:

  • Scores and schedules
  • Team statistics
  • Player statistics
  • Plays
  • Recruiting visits

In general, this stuff can be found from your usual suspects, Football Reference among them, and even subscription sites like Pro Football Focus (PFF) and Sports Info Solutions (SIS): all places from which you can manually copy numbers and text to build your spreadsheet at your leisure.

But what if you wanted to automate this? After all, it’s not necessarily an efficient process to copy and paste all of this stuff into Excel every single time you want to build a chart or a table.

Well, now you’re speaking my language: programming language, that is. Via a little bit of R or Python programming, one can build reproducible pipelines to access all of this data and build visualizations based on it. I’ll defer to the experts (for R and Python) on the specifics of this approach if you’re interested, but in short, these tools (after some layers of abstraction) rely on the public (but undocumented and mostly-hidden) data feed that ESPN uses to populate their site. The tools retrieve and organize the data into frames (IE: tables) that can be manipulated in code, and as part of their automation, these tools can clean and simplify the data so users can manipulate it more easily. This pre-processing also means that these tools can prepare complementary information to the basics from ESPN to better aid user analysis (EX: expected points added).
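To make the "retrieve, organize, clean, enrich" idea a little more concrete, here’s a minimal Python sketch. Big caveat: the ESPN feed is undocumented, so the payload shape, the field names, and every number below are invented for illustration, and this is not how any particular R or Python package does it. The expected-points values are made-up inputs too; in practice they would come from a model. The only thing I’ll stand behind is the last step: expected points added is the change in expected points from before a play to after it.

```python
import json

# Hypothetical payload: the real feed is undocumented, so these field names
# and values are invented purely for illustration.
RAW = """
{
  "drives": [
    {"team": "Team A", "plays": [
      {"text": "Rush for 4 yds", "yards": "4", "ep_before": 1.2, "ep_after": 1.5},
      {"text": "Pass incomplete", "yards": null, "ep_before": 1.5, "ep_after": 0.9}
    ]},
    {"team": "Team B", "plays": [
      {"text": "Pass complete for 12 yds", "yards": "12", "ep_before": 0.8, "ep_after": 2.1}
    ]}
  ]
}
"""


def build_play_table(raw_json):
    """Flatten nested drives/plays into one row (dict) per play."""
    rows = []
    for drive in json.loads(raw_json)["drives"]:
        for play in drive["plays"]:
            rows.append({
                "team": drive["team"],
                "text": play["text"],
                # Cleaning step: yardage arrives as text or null; store an int.
                "yards": int(play["yards"]) if play["yards"] is not None else 0,
                # Derived column: expected points added is the change in
                # expected points from before the play to after it.
                "epa": round(play["ep_after"] - play["ep_before"], 2),
            })
    return rows


plays = build_play_table(RAW)
for row in plays:
    print(row)
```

The point of the sketch is the shape of the work: nested feed in, flat table out, with the cleaning and the extra column handled once in code instead of by hand in Excel every week.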

What data are we getting?

Now that we’ve acquired data — whether via manual collection or programmed pipelines — what have we actually retrieved? Intuitively, we’ll focus first on the structure of the data: do we have numeric or boolean columns? Text columns? Maybe we have alphanumeric identifiers for teams or players that we need to match to full information records? These are all (relatively) easy questions to answer by just looking at what we have available to us.
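Those structural questions can be answered in a few lines of code. The sketch below is a rough illustration in plain Python: the rows, the column names, and the team identifiers are all hypothetical. It reports which value types show up in each column, then matches each play’s identifier to a full team record and flags any identifier that has no match.

```python
# Hypothetical rows and lookup table; the column names and IDs are invented.
plays = [
    {"team_id": "T01", "yards": 7, "first_down": False, "text": "Rush for 7 yds"},
    {"team_id": "T02", "yards": 12, "first_down": True, "text": "Pass complete for 12 yds"},
    {"team_id": "T99", "yards": -2, "first_down": False, "text": "Sacked for a loss of 2 yds"},
]
teams = {
    "T01": {"name": "Team A", "conference": "Conference X"},
    "T02": {"name": "Team B", "conference": "Conference Y"},
}


def column_types(rows):
    """Map each column name to the set of Python type names seen in it."""
    types = {}
    for row in rows:
        for column, value in row.items():
            types.setdefault(column, set()).add(type(value).__name__)
    return types


def attach_team_names(rows, lookup):
    """Keep every play; team_name is None when the ID has no team record."""
    joined = []
    for row in rows:
        team = lookup.get(row["team_id"])
        joined.append({**row, "team_name": team["name"] if team else None})
    return joined


schema = column_types(plays)
joined = attach_team_names(plays, teams)
unmatched = [row["team_id"] for row in joined if row["team_name"] is None]

print(schema)     # which columns are numeric, boolean, or text
print(unmatched)  # identifiers that did not match a team record
```

Keeping the unmatched rows (instead of silently dropping them) is a deliberate choice here: a missing identifier is usually a data-cleaning problem you want to see, not hide.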

But there’s a meta-way to tackle this question as well, and let’s walk through it with an example: take a look at this Jahmyr Gibbs catch from last year’s Kennesaw State game and jot down some mental notes of what happens:

Have those notes in mind? Well, let’s compare those to how ESPN describes this play in its play-by-play feed:

One sentence: that’s all we get. We only get the simplest explanation of the final state of this play. Other stat trackers might note the multiple broken tackles, but even given that, what we’re seeing here is the destination and not the journey, so to speak. We only get events and the two players involved in them (maybe a third if we’re lucky — the tackler), not the whole story. It’s like reading a book only via CliffsNotes that are half censored.
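You can see how thin this is by turning a play description into a structured record. The sketch below assumes a "<passer> pass complete to <receiver> for <N> yds" phrasing, which is my guess at a typical format and not a specification of ESPN’s feed; the sample sentence and player names are placeholders, not the actual text for the Gibbs play.

```python
import re

# Illustrative pattern only: the phrasing it expects is an assumption, not a
# documented format of any real play-by-play feed.
COMPLETION = re.compile(
    r"^(?P<passer>.+?) pass complete to (?P<receiver>.+?) for (?P<yards>-?\d+) yds?\b"
)


def parse_completion(text):
    """Return the event record for a completed pass, or None if no match."""
    match = COMPLETION.match(text)
    if match is None:
        return None
    return {
        "event": "completion",
        "passer": match["passer"],
        "receiver": match["receiver"],
        "yards": int(match["yards"]),
    }


# Placeholder sentence with made-up names.
record = parse_completion("Passer One pass complete to Receiver Two for 12 yds")
print(record)
print(parse_completion("Timeout Team A"))  # not a completion, so None
```

Four fields: an event type, two names, and a yardage number. Nothing in that record says anything about the broken tackles or the other twenty players on the field.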

This is where I’m going when I encourage you to interrogate the type of data that you’re working with: are you just getting events and outcomes, or are you getting the whole picture? There are twenty-two human beings on the field at any given instant during a play in a college football game: what are they doing? Where are they going on the field? To make a long series of rhetorical questions short: how are they contributing to that specific play’s success or failure for their team?

These are things you don’t get from your usual event data, and it’s a critical missing link in how we analyze the game. Now, you might offer that film study (like that done in college football programs with armies of analysts) covers this gap and for the most part, you’d be right. Film study reveals a lot of the process behind a play — the coverages, the formations, the routes — and provides the opportunity to explore alternative outcomes as a thought exercise (EX: “what if Gibbs cut back towards the inside across his defender earlier?”).

But here’s the rub — while film study reveals more semantic concepts about the game, it suffers from two major deficiencies compared to analysis of event data:

  1. It’s time-consuming to go through the same clip of the same play some X times and note down a different thing every time.
  2. Because it’s time-consuming, it’s not scalable.

Let me frame this: what’s the probability that Gibbs breaks those tackles based on the angles of attack of his would-be tacklers? Was there a different receiver that had more separation on his defender that Jordan Yates could have targeted on this play? What was that receiver’s catch probability based on said separation and his technique? Based on the defensive line’s speed and trajectory, how much time did Yates have to make this throw? Via film, we now understand how we’ve gotten from point A to B, but what was the chance of that outcome? What other points B (point Bs?) did we not go to and why? Bottom line: did Yates make the optimal throw here, or was there a better decision at the time he made his reads? There’s no way we can answer these questions by just looking at tape.
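The separation question, at least, becomes simple arithmetic once you have coordinates for every player. Here’s a minimal sketch under loud assumptions: the positions are invented (in yards, for a single frame at the moment of the throw), the player labels are placeholders, and "separation" is taken to mean the straight-line distance from a receiver to his nearest defender. Turning that number into a catch probability would require a trained model, which this does not attempt.

```python
import math

# Hypothetical single frame of tracking data: (x, y) in yards, all made up.
receivers = {"WR1": (30.0, 10.0), "WR2": (25.0, 40.0), "RB": (18.0, 26.0)}
defenders = {"CB1": (31.0, 11.0), "CB2": (29.0, 43.0), "LB": (21.0, 23.0)}


def separation(receivers, defenders):
    """For each receiver, return (nearest defender, distance to him)."""
    result = {}
    for name, position in receivers.items():
        nearest = min(defenders, key=lambda d: math.dist(position, defenders[d]))
        result[name] = (nearest, math.dist(position, defenders[nearest]))
    return result


sep = separation(receivers, defenders)
# The receiver whose nearest defender is farthest away.
most_open = max(sep, key=lambda r: sep[r][1])

for name, (defender, yards) in sep.items():
    print(f"{name}: nearest defender {defender}, {yards:.2f} yds away")
print("Most open:", most_open)
```

Run that over every frame of every play and you have something film study can’t give you at scale: a consistent, comparable measurement for every receiver on every snap.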

So, from play-by-play data, we know what happened, and from film, we know how it happened. Now, with improvements in computing and machine learning techniques, programs can blend these two spheres together: by employing player-tracking data, analysts can empower on-field coaches and staffers with information on player tendencies, coverages, and schematic tendencies that would take years to compile via traditional film study. Now, it’s data analysts, not just film junkies, who can analyze intermediate states of plays and evaluate counterfactual outcomes. A program that utilizes this data effectively could change the game (pun not intended) in how it prepares for its opponents and dissects its own performances: schematic and traditionally semantic conversations could now become more data-driven ones that key on process, rather than strictly outcomes.

In short: the game is about blocking and tackling, and a coach’s job is to figure out where they can find small edges to improve their team’s ability to do those things and prevent their opponents from doing those things. You take past data, you learn from it, and you adjust your scheme — this iterative process is essential to the success of a football team. Every percentage point, every fraction, every significant digit: all of these matter when 1) the talent margins are tight in a competitive league and 2) a ball bouncing the wrong way at the wrong time results in someone (or multiple someones) losing their job. Every scrap of data is vitally important to optimizing for team success. Now, companies have given* programs a virtual galleon-full of player-tracking data to comb through and find more ways to win.


* Well, if the program pays for it. You’ll notice I didn’t really touch much on pay-data services like StatsBomb, Pro Football Focus (PFF) or Sports Info Solutions (SIS) when discussing data sources to pull from. These firms are an important part of the data ecosystem, especially at enterprise (read: FBS program) scale, but based on what I’ve read and seen, PFF and SIS primarily provide charted data (read: what you can get from film study) and metrics that are built on top of that. It seems like only StatsBomb has figured out tracking data at the college level — more on them and their project (along with others) soon.