# The Joel Sokol NCAAT Discussion: Part 3

This is the final part of our three-part interview with Professor Sokol of Georgia Tech. Professor Sokol explained his system in Part I and told us about the statistical differences between college hoops and college football in Part II. In today's segment, Professor Sokol goes into greater depth about his system and how it has changed since its inception. He also answers the age-old question about predicting Cinderellas.

FTRS: How accurate was your first model and what tweaks have you made since then? How accurate was it for the 2009 Tournament?

Sokol: Our initial model was (statistically) significantly better than competing rankings like the AP and ESPN/USA Today polls, the RPI, and Sagarin and Massey's ratings. We've made some tweaks since then, including rolling out a Bayesian version of LRMC this year (we now report both the original and Bayesian methods on our page). The Bayesian version seems to do a better job of valuing some of the teams whose results give less helpful information.

For 2009, our methods were actually less successful than our competitors'. In 2008, the opposite was true (in fact, in 2008 LRMC correctly predicted the whole Final Four, final two, and winner). Statistically, a single year's results are almost never enough to be significant -- but we now have 10 years of tracking data, so we can say that over the long run LRMC is statistically significantly better.

FTRS: Can you give us a layman's definition of the Bayesian model you referred to?

Sokol: Bayesian models are a different sort of statistical methodology than standard parameter estimation. In most basic statistical models, we assume there's some "true" value of something (for example, how good a basketball team is), and using the data we have, we try to find an accurate estimate of that unknown true value.

In a Bayesian model, we start with a pretty generic guess, and use the data we observe to update that guess.
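Editor's Note: For readers who want to see the "start with a generic guess, then update it" idea in action, here is a minimal sketch. It is not the Bayesian version of LRMC; it is a textbook Beta-Binomial update for a team's win rate, and the function names and the 8-2 record are made up for illustration.

```python
def update_beta(prior_wins, prior_losses, wins, losses):
    """Return the updated Beta parameters after observing some game results."""
    return prior_wins + wins, prior_losses + losses


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution: the current best guess at the win rate."""
    return a / (a + b)


# Generic starting guess: Beta(1, 1), meaning every win rate is equally plausible.
a, b = 1.0, 1.0
print(beta_mean(a, b))  # 0.5

# A hypothetical team then wins 8 of its first 10 games.
a, b = update_beta(a, b, wins=8, losses=2)
print(beta_mean(a, b))  # 0.75
```

Note that the updated guess (0.75) sits between the generic starting point (0.5) and the raw record (0.8). With more games, the data would pull the estimate further from the starting guess.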

(Editor's Note: Many probability and statistics students use Bayes' Theorem to deal with conditional probability. If you would like a more detailed explanation of Bayes' methods, click here.)

FTRS: You refer to statistical significance in the same answer. How many years of NCAA data are required for the results to be statistically significant?

Sokol: Actually, the amount of data needed depends on how big a difference there is between the methods we're comparing. For example, suppose you want to see which of two basketball teams is better. So, you have them play each other every day until you're sure you know the answer.

If it's Duke vs. NJIT, Duke will probably win the first 10 games (and they'd all probably be blowouts). It's likely that you'll just end the experiment there, and declare that Duke is a better team.

But if it's Duke vs. Kansas, the first 10 games might be split 5-5 or 6-4, so it's less clear who's really better. (Even 6-4 isn't convincing; if one close game went the other way it would've been a 5-5 split.) So you might make them play 10 more, and now it's 11-9, still pretty close. So you make them play more, etc. Eventually, you'll get to the point where you can say okay, it's now (say) 115-95, and that's convincing enough that the team with 115 wins is better (though not by so much) than the team with 95 wins.

The question is how many games you need, and that's easy to answer using standard statistical techniques.
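Editor's Note: One standard technique for the head-to-head experiment above is an exact binomial test. The sketch below assumes each game is an independent coin flip if the teams are truly equal, and asks how surprising the observed record would be. The function name is ours, not Professor Sokol's.

```python
from math import comb


def two_sided_p_value(wins_a, wins_b):
    """Exact two-sided binomial p-value, assuming evenly matched teams."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Chance that one particular team wins at least k of n games.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double it because either team could have been the one ahead.
    return min(1.0, 2 * tail)


print(two_sided_p_value(10, 0))  # about 0.002: a sweep is convincing
print(two_sided_p_value(6, 4))   # about 0.754: tells us almost nothing
print(two_sided_p_value(11, 9))
```

A small p-value means the record would be unlikely between equal teams. The closer the teams, the more games it takes before the p-value drops low enough to be convincing.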

In our case we're going the other way. We have a certain number of games (in our data set, we have 10 years, or 630 total tournament games), and we want to know whether it's enough to show statistical significance. Normally, 630 games would be enough -- but not necessarily here, because so many games give no information. For example, any reasonable prediction method will have the 1-seed beating the 16-seed in the first round, so those games don't tell us anything about which prediction method is better. In fact, any game where two methods predict the same winner doesn't give us any useful information about comparing the two methods.

So, we have to use something called McNemar's test, which compares two methods only on the games where they disagree on the predicted winner. With dissimilar methods, there could be a hundred or more disagreements over the 10 years, but with similar methods (like when we compare two versions of LRMC) there might be only 2-4 disagreements per year. So for some comparisons we don't yet have enough data to claim statistical significance. [In fact, we actually have two Bayesian improvements to LRMC, and not enough data to be sure which one is better than the other, but we do have enough data to show that they're both statistically significantly better than the original LRMC.]
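Editor's Note: Here is a rough sketch of the exact form of McNemar's test, which throws away every game where the two methods pick the same winner and runs a binomial test on what is left. The picks and results below are invented for illustration; they are not LRMC data, and we don't know which variant of the test Professor Sokol's group uses.

```python
from math import comb


def mcnemar_exact(picks_a, picks_b, winners):
    """Compare two prediction methods using only the games where they disagree.

    Returns (games only A got right, games only B got right, two-sided p-value).
    """
    a_only = b_only = 0
    for pick_a, pick_b, winner in zip(picks_a, picks_b, winners):
        if pick_a == pick_b:
            continue  # same pick: the game says nothing about which method is better
        if pick_a == winner:
            a_only += 1
        elif pick_b == winner:
            b_only += 1
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0
    k = max(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Hypothetical 12-game sample: the methods agree on 8 games and split the other 4.
winners = ["T1"] * 12
picks_a = ["T1"] * 8 + ["T1", "T1", "T1", "T2"]
picks_b = ["T1"] * 8 + ["T2", "T2", "T2", "T1"]
print(mcnemar_exact(picks_a, picks_b, winners))  # (3, 1, 0.625)
```

Even though method A went 11-1 and method B went 9-3 in this made-up sample, only four games carry any information, and a 3-1 split is nowhere near conclusive.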

FTRS: Once the brackets are announced, how long does it take for your model to run a simulation? Do you run multiple simulations? Can you tell our readers why it is a good idea to run more than one simulation?

Sokol: We actually don't run a simulation; instead, we just rank all of the teams and assume that the better team is our predicted winner in each round.

That's not always technically the best prediction though -- for example, consider a 4-team pod where A, B, and C are ranked in that order, but are very close, and D is much worse. A plays B in the first round, and C plays D. So A vs. B is almost 50/50, but C is very likely to advance. In the second round, C vs. A and C vs. B are both almost 50/50. So C is the most likely team to advance out of the 4-team pod, even though they're 3rd best of the 4. There's a web site called Poologic that takes this sort of thing into account -- we provide them with our raw LRMC ratings, and they do the calculation. [In fact, you can enter your pool's specific scoring system and it'll calculate your best expected-value bracket for that system.]
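Editor's Note: The 4-team pod example is easy to check by hand or in a few lines of code. The win probabilities below are our own made-up numbers chosen to match the description (A, B, and C nearly even, D a heavy underdog); they are not LRMC output, and this is not how Poologic is implemented.

```python
def full_table(upper):
    """Expand {(x, y): P(x beats y)} so the table also holds P(y beats x)."""
    table = {}
    for (x, y), prob in upper.items():
        table.setdefault(x, {})[y] = prob
        table.setdefault(y, {})[x] = 1.0 - prob
    return table


def pod_advance_probs(p):
    """Chance each team wins the pod when A plays B and C plays D in round one."""
    def advance(team, first_opp, other_side):
        o1, o2 = other_side
        # Win the first game, then beat whichever team comes out of the other game.
        second_round = p[o1][o2] * p[team][o1] + p[o2][o1] * p[team][o2]
        return p[team][first_opp] * second_round

    return {
        "A": advance("A", "B", ("C", "D")),
        "B": advance("B", "A", ("C", "D")),
        "C": advance("C", "D", ("A", "B")),
        "D": advance("D", "C", ("A", "B")),
    }


# Assumed probabilities: the top three are nearly even, and each beats D 90% of the time.
p = full_table({
    ("A", "B"): 0.52, ("A", "C"): 0.53, ("B", "C"): 0.51,
    ("A", "D"): 0.90, ("B", "D"): 0.90, ("C", "D"): 0.90,
})
for team, prob in pod_advance_probs(p).items():
    print(team, round(prob, 3))
```

With these numbers C advances about 43% of the time, against roughly 29% for A and 26% for B, even though C is rated third. A and B hurt each other by meeting in the first round.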

In a sense, though, the Markov chain part of LRMC is like having an infinite number of simulations; it calculates what the order of teams would be if they played each other over and over an infinite number of times. That's much more accurate than just having one simulation where a low-probability event might happen. In fact, that's just like the NCAA tournament, where there are always some upsets, but nobody can predict for sure which they'll be.
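Editor's Note: To give a feel for how a Markov chain can stand in for "an infinite number of simulations", here is a toy sketch that finds the long-run (steady-state) distribution of a small chain by repeated multiplication. The three-team transition table is invented and this is not the LRMC model itself; in particular, how LRMC builds its transition probabilities from game results is not shown here.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Long-run share of time spent in each state of a Markov chain.

    P[i][j] is the probability of moving from state i to state j.
    """
    n = len(P)
    pi = [1.0 / n] * n  # start with every state equally likely
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical chain over three teams; each row sums to 1.
P = [
    [0.80, 0.15, 0.05],
    [0.30, 0.60, 0.10],
    [0.40, 0.30, 0.30],
]
shares = stationary_distribution(P)
print([round(s, 3) for s in shares])  # [0.617, 0.296, 0.086]
```

The steady-state shares do not depend on any single lucky or unlucky run, which is the sense in which the calculation replaces repeated simulation. Ordering the teams by their share gives a ranking.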

FTRS: Does your model predict potential "Cinderellas" or is there still some margin of unquantifiable luck involved in winning the tournament?

Sokol: Both, actually. Our model often does show when a worse-seeded team is actually better than the better-seeded teams it plays in the first round or two (for example, Arizona last year). From a seeding point of view that's an "upset", but we don't consider it to be a real upset since we think the better team won each game.

Our model can also often show which teams are more likely to pull off an upset -- they might not be ranked as high as the team they're facing, but they're much closer than the seeds would suggest.

However, there's always a significant amount of luck in sporting events, so there's no way we can predict for sure who's going to win each game. If we could, we'd be pretty rich by now!

As an upper bound, consider that over the last five years the Las Vegas favorite has won 76-77% of the NCAA tournament games. So that means even to the pros in Las Vegas, about one out of every four NCAA tournament games is a true upset where the worse team wins (so, 15-16 upsets per year in the tournament).

With such a high upset rate, making perfectly accurate predictions is essentially impossible. Even our 2008 success included a good bit of luck -- there's no way we can expect LRMC to do that well. We got lots of "thanks for your rankings; I used them in my pool and won" email that year, and we told each one not to expect LRMC to do that well every year. :)

This is a bit off-topic, but a nice story: my favorite email that year was from someone who didn't know much about college basketball but won a large pool using our LRMC rankings -- and he asked me for the name of my favorite charity so he could make a donation!