This is the latest post in our new series at From The Rumble Seat introducing our readers to sports analytics and what we can find out using data. You can view all other Intro to Football Analytics posts in the story stream.
One of the coolest applications of football analytics I have seen is the Advanced NFL Stats live Win Probability Graphs. At any point in the game they can tell you the likelihood of the home team winning, expressed as a win probability (a probability is the chance something happens, so .5 means a 50% chance it occurs). These graphs were one of the first things that drew me into sports analysis, and they are a great visual. And now, I have finally taken the time to produce them for college football.
Normally in these posts I explain the process and all the thought that went into the results. I'll still do that, but for the sake of time and content I'll be a little briefer than normal. Not only are the charts below super cool, but the process was rather complicated and I'm not sure I can explain everything to everyone without drowning out the cool parts.
My goal in developing a play-by-play win probability model is to take a description of the game situation during the offense's possession and produce a probability of that team winning the game. I took the play-by-play information from www.cfbstats.com for all games from the 2009 and 2012 seasons. The data already contained the spot, quarter, down, distance, and current score. I wanted this to be as granular as possible, so I had to calculate the time remaining in the quarter for each play with a function I wrote. The data contains the time left on the clock at the start of each possession, so I spread the time the possession took evenly across the plays in that drive.

After that I found out who won each game and added that information to each play. I added one more dummy variable that tells you whether the team with the ball is the home team, and that's it (for the simple version; I'm still doing research on new variables and interaction terms*). I fed those inputs into a logistic regression to develop the model and then used the model to predict the win probability for all plays in the 2013 season.

Some caveats: I ignored all overtime games because that seemed like a hassle I didn't want to deal with. The model doesn't consider the pre-game strength of the teams; all it cares about is who is home, and everything else is based on game data. That may come in the future but I'm not convinced it needs to. Argue with me in the comments if you like :).
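The clock-spreading step described above might look something like the sketch below. This is not the function used for the post; the name `spread_drive_clock` and its arguments are made up for illustration, and it assumes the drive stays inside one quarter and that the clock at the start of the next possession is known.

```python
def spread_drive_clock(start_seconds, end_seconds, n_plays):
    """Estimate the seconds left in the quarter at the start of each play.

    start_seconds: clock when the drive began (seconds left in the quarter).
    end_seconds:   clock when the next possession began (assumed known).
    n_plays:       number of plays in the drive.

    Assumption: every play in the drive takes the same amount of time.
    """
    if n_plays <= 0:
        return []
    seconds_per_play = (start_seconds - end_seconds) / n_plays
    return [start_seconds - i * seconds_per_play for i in range(n_plays)]


# A 4-play drive that starts at 10:00 and hands the ball over at 8:00.
for play_number, clock in enumerate(spread_drive_clock(600, 480, 4), start=1):
    print(f"play {play_number}: {int(clock // 60)}:{int(clock % 60):02d} left")
```

Even spacing is crude (an incomplete pass stops the clock, a run up the middle does not), but it gives every play a usable time value without any extra data.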
(*) I lied. I added some interaction terms after doing some of the analysis at the end, for instance lead and quarter: a 3-point lead in the 1st quarter and a 3-point lead in the 4th quarter aren't the same.
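To see what a lead-by-quarter interaction does inside a logistic model, here is a stripped-down sketch. The coefficients are invented purely for illustration and are not the fitted values from the model in this post; the real inputs (spot, down, distance, time remaining) are left out to keep it short.

```python
import math

# Hypothetical coefficients, chosen only to show the shape of the model.
# They are NOT the fitted values from the post.
COEF = {
    "intercept": 0.0,
    "home": 0.2,
    "lead": 0.05,
    "lead_x_quarter": 0.04,  # the interaction term
}


def win_probability(lead, quarter, is_home):
    """Logistic model with a lead-by-quarter interaction term."""
    z = (
        COEF["intercept"]
        + COEF["home"] * (1 if is_home else 0)
        + COEF["lead"] * lead
        + COEF["lead_x_quarter"] * lead * quarter
    )
    return 1.0 / (1.0 + math.exp(-z))


early = win_probability(lead=3, quarter=1, is_home=False)
late = win_probability(lead=3, quarter=4, is_home=False)
print(f"3-point lead, 1st quarter: {early:.3f}")
print(f"3-point lead, 4th quarter: {late:.3f}")
```

Without the `lead_x_quarter` term, a 3-point lead would move the win probability by the same amount no matter how much game is left. With it, the same lead counts for more as the quarter number goes up.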
I've got to bore you with some technical stuff before we get to the goods, but I think this is important for understanding prediction models and making sure they are working. One of my favorite ways to determine the accuracy of a prediction model is to test its predictions against what actually happened. It's also extremely important to do this on observations you did not build the model on, which lets you test the model on data it hasn't been tuned to predict. What I did was group all of the predicted win probabilities into bins of 5 percentage points: all the times I predicted a team had a 0 to 5 percent chance of winning go in one bin, 5 to 10 percent in another, and so on. I then looked at how often the teams in each bin actually won the game. A perfect model would show that teams predicted to win 5% of the time actually win 5% of the time, and so on. Here is how my model does:
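The binning check described above can be sketched in a few lines. The function name and the tiny sample data below are made up for illustration; the real check would run on every play from the held-out season.

```python
from collections import defaultdict


def calibration_table(predictions, outcomes, bin_pct=5):
    """Compare predicted win probabilities with what actually happened.

    predictions: predicted win probabilities between 0 and 1.
    outcomes:    1 if that team went on to win the game, 0 otherwise.
    Returns a list of (bin_low_pct, bin_high_pct, n_plays, actual_win_rate).
    """
    n_bins = 100 // bin_pct
    counts = defaultdict(int)
    wins = defaultdict(int)
    for p, won in zip(predictions, outcomes):
        # Round first so floating point noise cannot push a value across a
        # bin edge; a prediction of exactly 1.0 goes into the top bin.
        b = min(int(round(p * 100, 6) // bin_pct), n_bins - 1)
        counts[b] += 1
        wins[b] += won
    return [
        (b * bin_pct, (b + 1) * bin_pct, counts[b], wins[b] / counts[b])
        for b in sorted(counts)
    ]


# Made-up predictions and results, just to show the output format.
sample_predictions = [0.02, 0.04, 0.52, 0.53, 0.54, 0.97]
sample_outcomes = [0, 0, 1, 0, 1, 1]
for low, high, n, rate in calibration_table(sample_predictions, sample_outcomes):
    print(f"{low:3d}-{high:3d}%: {n} plays, actual win rate {rate:.0%}")
```

For a well-calibrated model, the actual win rate in each row should land inside (or close to) that row's bin.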
I'd say that is pretty good! One last test I want to show is called a Bad Capture Rate Plot. It shows how well the score separates the bads (plays where a team actually ended up losing) from the goods (plays where a team actually ended up winning). There are a couple of ways to interpret this plot, but don't worry too much about specifics. All I want to show is that the model does really well at separating the plays where teams end up losing from the plays where teams end up winning.
My favorite application of the win probability model is the live play-by-play graphs. For example, here is the chart from the national championship game this year (the color is who has the ball).
We can also find the biggest plays in the game according to the model. By subtracting the win probability before the play from the win probability after the play, you can get a sense of how that play affected your team's chances of winning. Using the National Championship Game again as an example, here are the 5 biggest plays from the game. The numbers are all from the perspective of FSU.
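One common way to build a capture curve like this (an assumption here, not necessarily how the plot in this post was produced) is to sort the plays from lowest to highest predicted win probability and track what share of the eventual losses has been picked up so far. The function name below is made up for illustration.

```python
def bad_capture_curve(predictions, outcomes):
    """Cumulative share of losing plays captured, walking up the score.

    predictions: predicted win probabilities between 0 and 1.
    outcomes:    1 if that team went on to win the game, 0 otherwise.
    Returns (share_of_all_plays, share_of_losses_captured) points.
    """
    ranked = sorted(zip(predictions, outcomes), key=lambda pair: pair[0])
    total_bads = sum(1 for _, won in ranked if won == 0)
    if total_bads == 0:
        return []
    curve = []
    bads_seen = 0
    for i, (_, won) in enumerate(ranked, start=1):
        if won == 0:
            bads_seen += 1
        curve.append((i / len(ranked), bads_seen / total_bads))
    return curve


# Made-up example where the score separates losers from winners perfectly.
for plays_share, bads_share in bad_capture_curve([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]):
    print(f"lowest {plays_share:.0%} of plays hold {bads_share:.0%} of the losses")
```

The faster the curve climbs toward 100%, the better the score is at pushing the losing plays to the bottom. A model with no information would capture losses at the same rate it moves through the plays.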
| Play Description | Quarter | FSU WP Before | FSU WP After | Change |
| --- | --- | --- | --- | --- |
| Tre Mason rush for 37 yards for a TOUCHDOWN. | 4th | 64.2% | 24.8% | (39.4%) |
| Jameis Winston pass complete to Chad Abram for 11 yards for a TOUCHDOWN. | 4th | 27.9% | 58.0% | 30.1% |
| Jameis Winston pass complete to Kelvin Benjamin for 2 yards for a TOUCHDOWN. | 4th | 45.1% | 78.8% | 33.7% |
| Cody Parkey kickoff for 65 yards returned by Levonte Whitfield for 100 yards for a TOUCHDOWN. | 4th | 37.1% | 68.9% | 31.8% |
| Nick Marshall pass complete to Tre Mason for 12 yards for a TOUCHDOWN. | 1st | 55.4% | 31.5% | (23.9%) |
The only play that seems weird to me is the pass to Tre Mason in the 1st quarter. That took Auburn from down by 3 to up by 4 with three whole quarters left, and it's not like it was a 50-yard pass or anything. Like I said, I'm still doing research on all of this.
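Ranking plays by their swing in win probability is easy to sketch. The snippet below uses the numbers from the table above, reading the two win probability columns as FSU's chances before and after each play (an assumption about the column order); the shortened play labels and the function name are made up for illustration.

```python
# (play, FSU win probability before the play, after the play), in percent.
plays = [
    ("Mason 37-yard TD rush", 64.2, 24.8),
    ("Winston to Abram 11-yard TD pass", 27.9, 58.0),
    ("Winston to Benjamin 2-yard TD pass", 45.1, 78.8),
    ("Whitfield 100-yard kickoff return TD", 37.1, 68.9),
    ("Marshall to Mason 12-yard TD pass", 55.4, 31.5),
]


def biggest_plays(plays, n=5):
    """Return the n plays with the largest absolute change in win probability."""
    swings = [(desc, round(after - before, 1)) for desc, before, after in plays]
    return sorted(swings, key=lambda item: abs(item[1]), reverse=True)[:n]


for desc, swing in biggest_plays(plays):
    print(f"{swing:+6.1f}  {desc}")
```

Sorting on the absolute value keeps plays that hurt FSU in the list alongside the ones that helped, which is why the Auburn touchdowns show up with negative numbers.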
I think that is all I will say for now, other than please post any comments, questions, or criticisms you have in the comments. Also please check the comments for all of the WP graphs from last year. These charts will be included in all game reviews at From The Rumble Seat this upcoming season, and if there is enough interest I might do some national ones as well. I also did some other games from this year with an earlier model; you can view those, along with all the ones in the comments, on this imgur album.
** Orientalnc had some good questions about the model, so I figured I would add my responses here so everyone can see them.
I am not sure how that helps me feel better or worse about the upcoming GT season, or, to be more specific, our chances of beating Wofford or Georgia Southern. Maybe one of you guys can help me understand.
If Tech plays flat, or GSU executes really well, we could suffer the same fate as Florida did last year. How does the model account for this?