Welcome to our first walkthrough post in the Intro to Football Analytics series, "Why Rate Stats are Better than Totals". A rate stat is any stat that measures performance on a per-unit basis: yards per play, third down conversion rate, points per drive, etc. Total stats are the pure game totals: yards gained, third downs converted, points scored. In general, rate stats give you a much better picture of the quality of a team's performance. Gaining 400 yards in a game is great, but doing it in 40 plays is much better than doing it in 80. If you convert 5 third downs in a game, it matters whether you had 7 or 15 attempts. When I say rate stats are "better", I mean they are more indicative of a team's true talent level and, in turn, a better predictor of future performance. This is a pretty well-accepted theory across many sports, and in business as well.
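To make the two examples above concrete, here is a minimal Python sketch (the analysis for this post was done in R; Python is used here only for illustration). The helper name `per_unit` is made up for this example.

```python
def per_unit(total, units):
    """Turn a total into a rate stat by dividing by the number of opportunities."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total / units


# Same 400 total yards, very different efficiency.
ypp_40_plays = per_unit(400, 40)  # 10.0 yards per play
ypp_80_plays = per_unit(400, 80)  # 5.0 yards per play

# Same 5 third-down conversions, very different conversion rates.
rate_7_attempts = per_unit(5, 7)
rate_15_attempts = per_unit(5, 15)

print(f"400 yards on 40 plays: {ypp_40_plays:.1f} yards per play")
print(f"400 yards on 80 plays: {ypp_80_plays:.1f} yards per play")
print(f"5 of 7 on third down:  {rate_7_attempts:.1%}")
print(f"5 of 15 on third down: {rate_15_attempts:.1%}")
```

The totals are identical within each pair; only the rate separates the two performances.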
One of the most popular and simplest rate stats is yards per play, one number that can sum up a team's performance in a game. It has its advantages and disadvantages, but its simplicity makes it a great stat for this example. I want to compare yards per play to total yards and see which one does a better job of predicting how many points you score. I chose points scored as the target because, well, isn't that the whole point of any football analysis, whether with your eyes or a spreadsheet? Which team is going to win, and how many points are they going to score? This won't necessarily prove all rate stats are better than all totals, and in general I'm not saying that's the case, but it will give us a good starting point as well as introduce some general analytical tools.
An easy comparison between yards per play and total yards gained is to check the correlation between each stat and points scored. The correlation between two variables measures the strength of the linear relationship between them. If two variables move "together", so that when one increases the other tends to increase as well, the correlation is higher. Correlation does not prove that one variable causes the other, or even that the two are meaningfully related; two variables can be highly correlated purely by coincidence. What correlation does give us is a starting point when determining if there might be a relationship. If you want to have some fun with things that are correlated but have nothing to do with each other then I recommend this site. Correlation is measured on a scale from -1 to 1, with 1 being a perfect positive relationship, 0 being no linear relationship at all, and -1 being a perfectly opposite relationship (when one increases, the other decreases).
I looked at all games for the 2009 to 2012 seasons and found the correlation between each team's yards per play in that game and the number of points it scored, as well as the correlation between each team's total yards and the number of points it scored.
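For readers who want to see what is being computed, here is a minimal Python sketch of the standard Pearson correlation coefficient (the post's own analysis was done in R). The function name `pearson_r` and every number below are made up for illustration; none of them come from the data set used in this post.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Toy samples showing the two ends of the scale.
rising = [1, 2, 3, 4]
also_rising = [2, 4, 6, 8]
falling = [8, 6, 4, 2]
r_perfect_positive = pearson_r(rising, also_rising)
r_perfect_negative = pearson_r(rising, falling)

# Hypothetical per-game numbers (NOT real data from this post).
total_yards = [250, 310, 380, 420, 500]
points = [10, 17, 24, 27, 41]
r_yards_points = pearson_r(total_yards, points)

print(r_perfect_positive, r_perfect_negative, round(r_yards_points, 3))
```

Running the same function once with yards per play and once with total yards, each against points scored, is all the comparison in this section amounts to.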
|Yards / Play||Total Yards Gained|
Total yards gained was actually better correlated with points scored than yards per play. I honestly wasn't expecting to see that, but there are a number of things wrong with this analysis. The data set includes games against FCS teams, doesn't filter out garbage time, counts all points scored (including defensive and special teams touchdowns), and doesn't account for the opponent. Also, computing correlations after the fact does not help us understand the driving force behind the relationship, just that there is one. We can do better than a correlation analysis by using another tool: regression.
Regression analysis is a way to measure the size of the relationship between two variables. It answers the question, "When one variable increases, by how much does the other increase or decrease on average?" It is a very powerful tool that you will hear thrown around in many offices in all types of industries. When we only look at how one variable affects another, it doesn't tell us much more than the correlation analysis. The real power of regression lies in adding multiple variables to the analysis. When we have multiple independent variables, like yards per play and total yards, and one dependent variable, points scored, the regression can tell us how much points scored will increase on average when one variable increases while the other is held constant. This can help us really identify the driving force between the variables.
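The "held constant" idea is easier to see with numbers. Below is a minimal Python sketch of ordinary least squares with two independent variables, solved through the normal equations (the post's own analysis used R; this is not that code). The helper names and the data are invented for illustration: points are constructed as exactly 2 + 3 × yards per play + 0.03 × total yards, so the fit should recover those three numbers.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_ols(rows, y):
    """Least-squares fit; returns [intercept, coef_1, coef_2, ...]."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 = intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical games (NOT real data): yards per play and total yards.
ypp = [4.5, 5.0, 5.5, 6.0, 6.5, 7.0]
yards = [300, 420, 350, 460, 390, 500]
points = [2 + 3 * p + 0.03 * t for p, t in zip(ypp, yards)]

intercept, b_ypp, b_yards = fit_ols(list(zip(ypp, yards)), points)
print(round(intercept, 4), round(b_ypp, 4), round(b_yards, 4))
```

Here `b_ypp` is the expected change in points for one extra yard per play with total yards held fixed, and `b_yards` is the change for one extra total yard with yards per play held fixed. Real data is noisy, so real coefficients come with standard errors; this toy data has none by construction.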
Since regression is a predictive tool, we want to use our data to predict something. I want to try to predict how many points you score in a game based on your performance in prior games, as well as your opponent's performance in prior games. I calculated each team's yards per play and yards per game (yes, technically a rate stat, but I think most people treat it as a total: "team XYZ rushed for 200 yards in that game") in all games before a given week in the 2nd half of the season. I also found the same stats for the defense of each team's opponent in that week, so that for each game in the 2nd half of a season from the 2010 to 2013 seasons (1781 games) I have a team's yards per play, yards per game, yards per play allowed, and yards per game allowed. I also filtered the data set to only include games between FBS teams. Now I can set up my regression to predict the points a team is going to score in a game based on the season-to-date yards per play and yards per game for both teams in the game. The regression output, which I will show below, tells us that the yards per play stats are just as statistically significant as the yards per game stats, and also that all four variables are highly statistically significant. Statistical significance, loosely speaking, measures how confident we can be that a variable's relationship with points scored is real rather than noise; for this post you can read it as a rough stand-in for importance. Honestly, I was expecting a much wider gap between the variables, but they are all at roughly the same level of importance.
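The key step in that setup is using only games played before the week being predicted. Here is a minimal Python sketch of that idea (the post's own code is in R and may differ). The function name, the field names, and the game log are all hypothetical, and computing yards per play as total yards divided by total plays, rather than averaging each game's rate, is an assumption of this sketch.

```python
def season_to_date(games, week):
    """Summarize a team's offense using only games played before `week`."""
    prior = [g for g in games if g["week"] < week]
    if not prior:
        return None  # nothing to go on before the team's first game
    total_yards = sum(g["yards"] for g in prior)
    total_plays = sum(g["plays"] for g in prior)
    return {
        "yards_per_play": total_yards / total_plays,
        "yards_per_game": total_yards / len(prior),
    }


# Hypothetical game log for one team (NOT real data).
game_log = [
    {"week": 1, "yards": 400, "plays": 70},
    {"week": 2, "yards": 350, "plays": 60},
    {"week": 3, "yards": 450, "plays": 70},
]

before_week_1 = season_to_date(game_log, 1)
before_week_3 = season_to_date(game_log, 3)
before_week_4 = season_to_date(game_log, 4)
print(before_week_3)
print(before_week_4)
```

The same function applied to yards and plays allowed would give the defensive versions of both stats for the opponent.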
** Author's Note: This is the regression output from the lm function in R. What is important are the coefficients. Estimate is the size of the coefficient for each variable, and it represents the change in your dependent variable for a one-unit change in that independent variable, keeping everything else equal. For example, Is.Home has a coefficient of 3.81; this means being the home team (Is.Home = 1) adds 3.81 points to your expected points scored. The Std. Error and t value are statistical details; if you don't know what they are then don't worry, they aren't important to this post. The Pr(>|t|) is the level of significance I referenced above; the smaller it is, the more significant the variable. **
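To illustrate how a coefficient is read, here is a small Python sketch. Only the 3.81 for Is.Home comes from the note above; the intercept and the yards per play coefficient are made-up placeholders, not values from the regression output, and the function name is invented.

```python
IS_HOME_COEF = 3.81  # the Is.Home estimate quoted in the note above

# Placeholder values for illustration only (NOT from the regression output).
INTERCEPT = 20.0
YPP_COEF = 2.0


def predicted_points(yards_per_play, is_home):
    """Linear prediction: intercept plus each coefficient times its variable."""
    return INTERCEPT + YPP_COEF * yards_per_play + IS_HOME_COEF * is_home


# Same team, same yards per play; only the home/away flag changes.
at_home = predicted_points(6.0, 1)
on_road = predicted_points(6.0, 0)
home_edge = at_home - on_road
print(round(home_edge, 2))
```

Whatever the other inputs are, flipping Is.Home from 0 to 1 moves the prediction by the Is.Home coefficient, which is what "keeping everything else equal" means.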
Another test we can do is to use yards per play and yards per game stats to predict point differential. I found the yards per play differential (the yards per play gained minus the yards per play allowed) and yards per game differential for each team prior to a game and, with the same information on the opponent, predicted the point differential for that game. I'll focus on how good a prediction this can be in a later post, but for now I just want to focus on yards per play vs total yards. When we compare the significance of the two types of stats we see a more expected result: yards per play is more significant than yards per game, but both are very important.
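The differential inputs described above can be sketched in a few lines of Python (again, the post's own code is in R). All names and numbers here are hypothetical; the sketch only shows how the four predictors for one game might be assembled from each team's season-to-date stats.

```python
def differentials(team):
    """Gained minus allowed, for both the rate stat and the per-game total."""
    return {
        "ypp_diff": team["ypp_gained"] - team["ypp_allowed"],
        "ypg_diff": team["ypg_gained"] - team["ypg_allowed"],
    }


def matchup_features(team, opponent):
    """Predictors for one game: both teams' differentials side by side."""
    own = differentials(team)
    opp = differentials(opponent)
    return {
        "team_ypp_diff": own["ypp_diff"],
        "team_ypg_diff": own["ypg_diff"],
        "opp_ypp_diff": opp["ypp_diff"],
        "opp_ypg_diff": opp["ypg_diff"],
    }


# Hypothetical season-to-date stats entering a game (NOT real data).
team = {"ypp_gained": 6.5, "ypp_allowed": 5.0, "ypg_gained": 450.0, "ypg_allowed": 350.0}
opponent = {"ypp_gained": 5.5, "ypp_allowed": 6.0, "ypg_gained": 380.0, "ypg_allowed": 420.0}

features = matchup_features(team, opponent)
print(features)
```

Each row of the regression data set would then pair a dictionary like `features` with the actual point differential of that game.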
I had two main goals when writing this. The first was to show with confidence that rate stats like yards per play, yards per dropback, and fumble percentage are better than just using the totals. I'm not sure I quite did that. The results were surprising in their ambiguity; I expected to find a much larger difference between the quality of yards per play and yards per game. I will have to think about this result, and hopefully will have another post on it in the future. The second goal was to introduce some analytical concepts and thought processes to be a resource for other people. I hope I did that, but again I'm not really sure how most of this came across.
- If you want to predict the margin of victory of a game, then the two teams' yards per play differentials are more important than their yards per game differentials, but yards per game still adds value to your prediction.
- When trying to predict the number of points a team will score in a game, yards per play and yards per game are relatively equal in their predictive power.
- Caveats: I restricted the stats to only include games between FBS teams but did not filter out plays that occurred in garbage time. I also did not adjust for opponent when considering the prior games' stats. Both of these would increase the accuracy of predicting points and scoring margins, but they may or may not change the significance of yards per play vs total yards. A topic for another post perhaps.
I would love some feedback from anyone who has read this far. Please put any comments, criticisms, or discussion questions in the comments. Also, when I was starting out with football data analysis I loved examples and tutorials, so here is the Dropbox link to the code I used to complete all the analysis in R, and the final data set is available here in CSV format so that you can play with it in any program.