Tag Archives: sports

Should a football team run or pass? A game theory and linear programming approach

Last week I visited Oberlin College to deliver the Fuzzy Vance Lecture in Mathematics (see post here). In addition, I gave two lectures to Bob Bosch’s undergraduate optimization course. My post about my lecture on ambulance location models is here.

My second lecture was about how to solve two-player zero-sum games using linear programming. The application came from sports analytics: should a football team run or pass? The purpose of the lecture was to introduce zero-sum games (a new topic for most students) and to show how to solve zero-sum games with two decision-makers using linear programming.
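As a rough illustration of the linear programming idea, here is a minimal sketch for a tiny run/pass game. The payoff numbers (expected yards for the offense) are made up for this example and are not from the lecture, and the function name is hypothetical. The offense chooses "run" with probability p and maximizes the game value v subject to p*a[0][j] + (1-p)*a[1][j] >= v for every defensive call j. With only two offensive plays, the linear program has one free variable, so the sketch checks the candidate vertices directly rather than calling a general LP solver.

```python
from fractions import Fraction
from itertools import combinations


def solve_two_row_game(payoff):
    """Solve max v s.t. p*a0[j] + (1-p)*a1[j] >= v for all j, 0 <= p <= 1.

    payoff[0] and payoff[1] are the offense's payoffs for its two plays
    against each defensive call. Returns (p, v) as exact fractions.
    """
    a0, a1 = payoff
    columns = range(len(a0))

    def guaranteed(p):
        # Worst case for the offense over all defensive calls.
        return min(p * a0[j] + (1 - p) * a1[j] for j in columns)

    # An optimum of this one-variable LP is at an endpoint of [0, 1] or
    # where two of the constraint lines cross.
    candidates = {Fraction(0), Fraction(1)}
    for j, k in combinations(columns, 2):
        denom = (a0[j] - a1[j]) - (a0[k] - a1[k])
        if denom != 0:
            p = Fraction(a1[k] - a1[j], denom)
            if 0 <= p <= 1:
                candidates.add(p)

    best_p = max(candidates, key=guaranteed)
    return best_p, guaranteed(best_p)


# Hypothetical payoffs: rows are offense (run, pass),
# columns are defense (run defense, pass defense).
example_payoff = [[1, 6], [8, 2]]
p_run, game_value = solve_two_row_game(example_payoff)
print("run with probability", p_run, "for a guaranteed value of", game_value)
```

For larger games with more than two plays per side, the same constraints would go to a general LP solver instead of this vertex check.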

This lecture tied into my Badger Bracketology work, but since I do not use optimization in my college football playoff forecasting model, I selected another football application.


the NFL football draft and the knapsack problem

In this week’s Advanced Football Analytics podcast, Brian Burke talked about the knapsack problem and the NFL draft [Link]. I enjoyed it. Brian has a blog post explaining the concept of the knapsack problem as it relates to the NFL draft here. The idea is that the draft is a capital budgeting problem for each team, where the team’s salary cap space is the knapsack budget, the potential players are the items, the players’ salaries against the cap are the item weights, and the players’ values (hard to estimate!) are the item rewards. Additional constraints are needed to ensure that all the positions are covered; otherwise, the optimal solution returned might be a team with only quarterbacks and running backs. Brian talks a bit about analytics and estimating value. I’ll let you listen to the podcast to get all the details.
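To make the knapsack mapping concrete, here is a minimal sketch of the classic 0/1 knapsack dynamic program. The player names, cap costs, and values are invented for illustration, and the position-coverage constraints mentioned above are deliberately left out, so this is only the basic budget problem and not Brian's model.

```python
def knapsack(items, budget):
    """0/1 knapsack by dynamic programming.

    items: list of (name, cost, value) with integer costs (cap dollars in $M).
    budget: integer cap space. Returns (best total value, chosen names).
    """
    n = len(items)
    # best[i][b] = best value using only the first i items with budget b.
    best = [[0] * (budget + 1) for _ in range(n + 1)]
    for i, (_, cost, value) in enumerate(items, start=1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # skip item i
            if cost <= b:  # or take it, if it fits
                best[i][b] = max(best[i][b], best[i - 1][b - cost] + value)

    # Walk backwards through the table to recover which items were taken.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            name, cost, _ = items[i - 1]
            chosen.append(name)
            b -= cost
    return best[n][budget], chosen[::-1]


# Hypothetical players: (name, cap cost in $M, estimated value).
players = [("QB A", 5, 10), ("RB B", 4, 7), ("WR C", 3, 6), ("OT D", 2, 3)]
total_value, roster = knapsack(players, budget=9)
print(total_value, roster)
```

With these made-up numbers the unconstrained optimum is a quarterback and a running back, which is exactly the kind of lopsided roster that the extra position constraints are meant to rule out.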

During the podcast, Brian gave OR a shout out and added a side note about how knapsack problems are useful for a bunch of real applications and can be very difficult to solve in the real world (thanks!). I appreciated this aside, since sometimes cute applications of OR on small problem instances give the impression that our tools are trivial and silly. The reality is that optimization algorithms are incredibly powerful and have allowed us to solve incredibly difficult optimization problems.

Optimization has gotten sub-optimal coverage in the press lately. My Wisconsin colleagues Michael Ferris and Stephen Wright wrote a defense of optimization in response to an obnoxious anti-optimization article in the New York Times Magazine (“A sucker is optimized every minute.” Really?). Bill Cook, Nathan Brixius, and JF Puget wrote nice blog posts in response to coverage of a TSP road trip application that failed to touch on the bigger picture (TSP is useful for routing and gene sequencing, not just planning imaginary road trips!!). I didn’t write my own defense of optimization since Bill, Nathan, and JF did such a good job, but needless to say, I am with them (and with optimization) all the way. It’s frustrating when our field misses opportunities to market what we do.

If you enjoy podcasts, football, and analytics, I recommend the Advanced Football Analytics podcast that featured Virgil Carter, who published his groundbreaking football analytics research in Operations Research [Link].


Some thoughts on the College Football Playoff

After a fun year of Badger Bracketology, I wanted to reflect upon the college football playoff.

Nate Silver reflects upon the playoff in an article on FiveThirtyEight, and he touches on the two most salient issues in the playoff:

  • False negatives: leaving teams with a credible case for being named the national champion out of the playoff.
  • False positives: “undeserving” teams in the playoff.

As the number of teams in the playoff increases, the number of false negatives decreases (good – this allows us to have a chance of selecting the “right” national champion) and the number of false positives increases (bad).

One of my concerns with the old Bowl Championship Series (BCS) system with a single national championship game was that exactly two teams were invited to the national championship game. This was a critical assumption in the old system that was rarely discussed. There were rarely exactly two teams that were “deserving.” Usually, deserving is equated with being undefeated and in a major conference. Out of 16 BCS tournaments, this situation occurred only four times (25% of championship games), leading to controversy in the remaining 75%. This is not a good batting average, with most of the 12 controversial years having too many false negatives and no false positives.

The new College Football Playoff (CFP) system has a new assumption: the number of “deserving” teams does not exceed four teams.

If you look at the BCS years, we see that this assumption was never violated: there were never more than four undefeated teams in a major conference nor a controversy surrounding more than three potential “deserving” teams. Controversy surrounded the third team that was left out, a team that would now be invited to the playoff. At face value, the four-team playoff seems about right.

But given the title of Nate Silver’s article (“Expand The College Football Playoff”) and the excited discussion of the idea of the eight team playoff in 2008 after a controversial national championship game, I can safely say that most people want more than four teams in the playoff. TCU’s dominance in a bowl game supports these arguments. The fact that we’ve had one controversial seeding in one CFP is a sign that maybe four isn’t the right playoff size. What is the upper bound on the number of deserving teams?

Answering this question is tricky, because there is a relationship between the number of teams in the playoff and our definition of “deserving.” There will always be teams on the bubble, but as the playoff becomes larger, this becomes less of an issue. Thoughts on this topic are welcome in blog comments.

It’s worth mentioning the impact on academics and injuries. As a professor of operations research, I believe that every decision requires balancing different tradeoffs. The tradeoffs in the college football playoffs should not only be about false positives, false negatives, fan enjoyment, and ad revenue. Maybe this is trivial: it’s an extra game for a mere eight teams, but I will be disappointed if the full impact on the student-athletes and their families, including academics and injuries, is not part of the conversation.


introducing Badger Bracketology, a tool for forecasting the NCAA football playoff

Today I am introducing Badger Bracketology:

I have long been interested in football analytics, and I enjoy crunching numbers while watching the games. This year is the first season for the NCAA football playoff, where four teams will play to determine the National Champion. It’s a small bracket, but it’s a start in the right direction.

The first step to becoming the national champion is to make the playoff. To do so, a team must be one of the top four ranked teams at the end of the season. A selection committee manually ranks the teams, and they are given a slew of information and other rankings to make their decisions.

I wanted to see if I could forecast the playoff ahead of time by simulating the rest of the season rather than waiting until all of the season’s games have been played. Plus, it’s a fun project that I can share with the undergraduate simulation course that I teach in the spring.

Here is how my simulation model works. The most critical part is the ranking method, which uses the completed game results to rate and then rank the teams so that I can forecast who the top 4 teams will be at the end of the season. I need to do this solely using math (no humans in the loop!) in each of 10,000 replications. I start with the outcomes of the games played so far, starting with at least 8 weeks of data. This is used to come up with a rating for each team that I then rank. The ranking methodology uses a connectivity matrix based on Google’s PageRank algorithm (similar to a Markov chain). So far, I’ve considered three variants of this model that take various bits of information into account, like who a team beats, who it loses to, and the additional value provided by home wins. I used data from the 2012 and 2013 seasons to tune the parameters needed for the models.

The ratings along with the impact of home field advantage are then used to determine a win probability for each game. From previous years, we found that the home team won 56.9% of games later in the season (week 9 or later), which accounts for an extra boost in win probability of ~6.9% for home teams. This is important since there are home/away games as well as games on neutral sites, and we need to take this into account. The simulation selects winners in the next week of games by essentially flipping a biased coin. Then, the teams are re-ranked after each week of simulated game outcomes. This is repeated until we get to the end of the season. Finally, I identify and simulate the conference championship games (these are the only games not scheduled in advance). And then we end up with a final ranking. Go here for more details.
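The coin-flipping step can be sketched in a few lines. Everything about how ratings turn into a win probability below is an assumption made for illustration (a simple ratio of ratings plus the 6.9% home boost quoted above), and the function names are hypothetical. The post does not spell out that formula, and the weekly re-ranking step is omitted here.

```python
import random

HOME_BOOST = 0.069  # extra home win probability quoted in the post


def home_win_probability(rating_home, rating_away, neutral_site=False):
    """Assumed conversion from ratings to a win probability (not the post's)."""
    base = rating_home / (rating_home + rating_away)
    boost = 0.0 if neutral_site else HOME_BOOST
    return min(1.0, max(0.0, base + boost))


def simulate_week(games, ratings, rng):
    """Flip one biased coin per game; games are (home, away, neutral_site)."""
    winners = []
    for home, away, neutral_site in games:
        p_home = home_win_probability(ratings[home], ratings[away], neutral_site)
        winners.append(home if rng.random() < p_home else away)
    return winners


def estimate_win_frequencies(games, ratings, replications=10_000, seed=0):
    """Repeat the simulated week and report how often each team wins."""
    rng = random.Random(seed)
    wins = {team: 0 for team in ratings}
    for _ in range(replications):
        for winner in simulate_week(games, ratings, rng):
            wins[winner] += 1
    return {team: count / replications for team, count in wins.items()}


# Hypothetical example: two equally rated teams, Team A at home.
ratings = {"Team A": 1.0, "Team B": 1.0}
games = [("Team A", "Team B", False)]
print(estimate_win_frequencies(games, ratings))
```

With equal ratings, the home team's simulated win frequency should land near 56.9%, the historical home win rate the boost was taken from.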

There are many methods for predicting the outcome of a game in advance. Most of the sophisticated methods use additional information that we could not expect to obtain weeks ahead of time (like the point spread, point outcomes, yards allowed, etc.). Additionally, some of the methods simply return win probabilities and cannot be used to identify the top four teams at the end of the season. My method is simple, but it gives us everything we need without being so complex that I would be suspicious of overfitting. The college football season is pretty short, so our matrix is really sparse. At present, teams have played 8 weeks of football in sum, but many teams have played just 6-7 games. Additional information could be used to help make better predictions, and I hope to further refine and improve the model in coming years. Suggestions for improving the model will be well-received.

Our results for our first week of predictions are here. Check back each week for more predictions.

Badger Bracketology: http://bracketology.engr.wisc.edu/

Our twitter handle is: @badgerbrackets

Your thoughts and feedback are welcome!


underpowered statistical tests and the myth of the myth of the hot hand

In grad school, I learned about the hot hand fallacy in basketball. The so-called “hot hand” is the person whose scoring success probability is temporarily increased and therefore should shoot the ball more often (in the basketball context). I thought the myth of the hot hand effect was an amazing result: there is no such thing as a hot hand in sports, it’s just that humans are not good at evaluating streaks of successes (hot hand) or failures (slumps).

Flash forward years later. I read a headline about how hand sanitizer doesn’t “work” in terms of preventing illness. I looked at the abstract and read off the numbers. The group that used hand sanitizer (in addition to hand washing) got sick 15-20% less than the control group that only washed hands. The 15-20% difference wasn’t statistically significant so it was impossible to conclude that hand sanitizing helped, but it represented a lot of illnesses averted. I wondered if this difference would have been statistically significant if the number of participants was just a bit larger.

It turns out that I was onto something.

The hot hand fallacy is like the hand sanitizer study: the study design was underpowered, meaning that the test is unlikely to reject the null hypothesis and draw the “correct” conclusion even when the hot hand effect or the hand sanitizer effect is real. In the case of the hand sanitizer, the number of participants needed to be large enough to detect a 15-20% improvement in the number of illnesses acquired. Undergraduates do this in probability and statistics courses where they estimate the sample size needed. But researchers sometimes forget to design an experiment in a way that can detect real differences.
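The textbook sample size calculation mentioned above looks roughly like this for comparing two proportions. The 30% baseline illness rate is a made-up number for illustration (the study's actual rates are not given here), and the formula is the standard normal-approximation one for a two-sided test, not anything taken from the hand sanitizer paper.

```python
import math
from statistics import NormalDist


def participants_per_group(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect p_control vs. p_treated.

    Uses the usual normal approximation for a two-sided test of two
    proportions with significance level alpha and the requested power.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    effect = p_control - p_treated
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical baseline: 30% of the control group gets sick, and the
# treated group gets sick 15% less often (0.30 * 0.85 = 0.255).
print(participants_per_group(0.30, 0.30 * 0.85))
```

The point of the exercise is that a modest relative improvement can require well over a thousand participants per group, so a smaller study can easily miss a real effect.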

My UW-Madison colleague Jordan Ellenberg has a fantastic article about the myth of the myth of the hot hand on Deadspin. He has more in his book How Not to Be Wrong, which I highly recommend. He introduced me to a research paper by Kevin Korb and Michael Stillwell that compared statistical tests used to test for the hot hand effect on simulated data that did indeed have a hot hand. The “hot” data alternated between streaks with success probabilities of 50% and 90%. They demonstrated that the serial correlation and runs tests used in the early “hot hand fallacy” paper were unable to identify a real hot hand, and therefore, these tests were underpowered and unable to reject the null hypothesis when it was indeed false. This is poor test design. If you want to answer a question using any kind of statistical test, it’s important to collect enough data and use the right tools so you can find the signal in the noise (if there is one) and reject the null hypothesis if it is false.

I learned that there appears to be no hot hand in sports where a defense can easily adapt to put greater defensive pressure on the “hot” player, like basketball and football. So the player may be hot but it doesn’t show up in the statistics only because the hot player is, say, double teamed. The hot hand is more apparent and measurable in sports where defenses are not flexible enough to put more pressure on the hot player, like in baseball and volleyball.



Major League Baseball scheduling at the German OR Society Conference

Mike Trick talked about his experience setting the Major League Baseball (MLB) schedule at the 2014 German OR Conference in Aachen, Germany. Mike’s plenary talk had two major themes:
1. Getting the job with the MLB
2. Keeping the job with the MLB

The getting the job section summarized advances in computing power and integer programming solvers that have made solving large-scale integer programming (IP) models a reality. Mike talked about how he used to generate cuts for his models, but now the solvers (like CPLEX or Gurobi) add a lot of the cuts automatically as part of pre-processing. Over time, Mike’s approach has become popping his models into CPLEX and then figuring out what the solver is doing so he can exploit the tools that already exist.

Side note: I am amazed at how good the integer programming solvers have become. I recently worked on a variation to the set covering model for which a greedy approximation algorithm exists. The time complexity of the greedy algorithm isn’t great in theory. In practice, the greedy algorithm is slower than the solver (Gurobi, I think) and doesn’t guarantee optimality. I can’t believe we’ve come this far.

Mike also stressed the importance of finding better ways to formulate the problem to create a better structure for the IP solver. Better formulations can be more complicated and less intuitive, but they can lead to markedly better linear programming bounds. Mike achieved this by replacing his model with binary variables that correspond to team-to-team games (does team i play team j on day t?) with another model whose variables correspond to series (a series is usually 3 games played between teams on consecutive days). Good bounds from the linear programming relaxations help the IP solver find an optimal solution much more quickly. Another innovation focused on improving the schedule by “throwing away” much of the schedule (usually about a month) after making needed changes and re-solving. Again, this is something that is possible due to advances in computing.

The keeping the job section addressed business analytics and its role in optimization. Mike defined business analytics as using data to make better decisions, something that OR has always done. What is new is using the power of data analytics and predictive modeling to guide prescriptive integer programming models in a meaningful way. The old way was to use point estimates in integer programming models; the new way uses more information (such as the output of a logistic regression) to guide optimization models. The application Mike used was estimating the value of scheduling home games at different times (day vs. night) and days of the week. When embedded in the optimization modeling framework, the end result was that creating a schedule using business analytics could add about $50M to MLB in revenue.

Mike summed up his talk by talking about how educating the marketing folks is part of the job now. Marketing likes to measure “success” as the number of games that sell out. Operations researchers recognize that sold out games are lost revenue, so the goal has become to schedule games such that games are almost sold out, and to make sure that marketing understands this approach.

Related post:

the craft of scheduling Major League Baseball games

Markov chains for ranking sports teams

My favorite talk at ISERC 2014 (the IIE conference) was “A new approach to ranking using dual-level decisions” by Baback Vaziri, Yuehwern Yih, Mark Lehto, and Tom Morin (Purdue University) [Link]. They used a Markov chain to rank Big Ten football teams in their ability to recruit prospective players. Players would accept one of several offers. The team that got the player was the “winner” and the other teams were losers.  We end up with a matrix P where element (i,j) in P is the number of times team j beats team i.

The Markov chain is then normalized so that each row sums to 1 and solved for the limiting distribution. The probability of being in team j in the limit was interpreted as meaning the proportion of time that team j is the best. Therefore, the limiting distribution can be used to rank teams from best to worst.
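Here is a minimal sketch of that recipe: build the matrix of losses, normalize each row, and find the limiting distribution by repeatedly multiplying by the transition matrix (power iteration). The three teams and their results are invented for illustration, and the function name is hypothetical. The paper may well solve for the limiting distribution differently, and a team with no losses needs special handling that this sketch does not attempt.

```python
def markov_ranking(teams, losses, tol=1e-12, max_iter=10_000):
    """Rank teams by the limiting distribution of a loss-based Markov chain.

    losses[i][j] is the number of times team j beat team i.
    Returns (teams sorted best to worst, {team: limiting probability}).
    """
    n = len(teams)
    transition = []
    for i, row in enumerate(losses):
        total = sum(row)
        if total == 0:
            # An undefeated team has an all-zero row; not handled here.
            raise ValueError(f"{teams[i]} has no losses; row cannot be normalized")
        transition.append([count / total for count in row])  # row sums to 1

    # Power iteration: start uniform and apply pi <- pi * P until it settles.
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new_pi = [sum(pi[i] * transition[i][j] for i in range(n)) for j in range(n)]
        converged = max(abs(a - b) for a, b in zip(new_pi, pi)) < tol
        pi = new_pi
        if converged:
            break

    prob = dict(zip(teams, pi))
    ranking = sorted(teams, key=lambda team: prob[team], reverse=True)
    return ranking, prob


# Hypothetical results: row i lists how often each other team beat team i.
teams = ["A", "B", "C"]
losses = [
    [0, 1, 1],  # A lost once to B and once to C
    [2, 0, 0],  # B lost twice to A
    [1, 2, 0],  # C lost once to A and twice to B
]
ranking, prob = markov_ranking(teams, losses)
print(ranking, prob)
```

In this toy example the chain spends 3/7 of its time at A, 5/14 at B, and 3/14 at C, so the ranking is A, B, C.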

They found that using this method with 2001 – 2012 data, Wisconsin was ranked fourth, which was much higher than it was ranked by experts and explains why they have been to 12 bowl games in a row. Illinois (my alma mater) was ranked second to last, only above lowly Indiana.

I applied this method to regular season 2014 Big Ten basketball wins and ended up with the following ranking. I also have the official ranking based on win-loss record for comparison. We see large discrepancies for only two teams: Michigan State (which is over-ranked according to its win-loss record) and Indiana (which is under-ranked according to its win-loss record). The Markov chain method ranks these two teams differently because Indiana had high quality wins despite not winning so frequently and because Michigan State lost to a few bad teams when they were down a few players due to injuries.


Rank  MC ranking      W-L record ranking
1     Michigan        Michigan
2     Wisconsin       Wisconsin
3     Indiana         Michigan State
4     Iowa            Nebraska
5     Nebraska        Ohio State
6     Ohio State      Iowa
7     Michigan State  Minnesota
8     Minnesota       Illinois
9     Illinois        Indiana
10    Penn State      Penn State
11    Northwestern    Northwestern
12    Purdue          Purdue

More sophisticated methods are a little more complex than this. Paul Kvam and Joel Sokol estimate conditional probabilities in the transition probability matrix for the logistic regression Markov chain (LRMC) model using logistic regression [Paper link here]. The logistic regression yields an estimate for the probability that a team with a margin of victory of x points at home is better than its opponent, and thus looks at margin of victory, not just wins and losses.


subjective scoring in Olympic sports drives me a little crazy

The Olympics are beginning. When I think of the Olympic sports, I think of a lot of sports that are scored subjectively: less about being stronger or faster or scoring more goals, and more about panels of judges picking winners amid controversy. I prefer number crunching and objective scoring. A New York Times article by John Branch [Link] gives an overview of the changes to the winter Olympic sports in the last two decades. In summary, the new sports are mostly those with subjective scoring (halfpipe, snowboard cross).

A good run early in the contest might receive an 80. A slightly better run might earn an 83. A brilliant run, one that seems unbeatable, might score 95. All of the others are slotted around them. It can frustrate athletes, who ask why their second-place score was 10 points below that of the winner. They struggle to understand that the value means nothing; what matters is how it ranks.

I’ve noticed this, too, and it’s frustrating. Some sports like figure skating and gymnastics have well-established rubrics for scoring, but they are not perfect. On the positive side, the judges do a fairly good job of recognizing the best performances.

Does subjective scoring bother you?


Look for more Olympics posts from me in the next couple of weeks.

I’ve been blogging for almost 7 years, so I have a few old posts about the Olympics. Here are a few that I recommend reading:

will the New York Times Fourth Down Robot change football?

The New York Times runs a Twitter account for a “Fourth Down Bot” (@NYT4thDownBot) that analyzes every 4th down call in NFL football games. The bot gives advice and sometimes a short report summarizing the probability of success associated with each of the choices.

The bot has a lot of personality!

Brian Burke provides the methodology, which is here. The recommendations are based on which action (going for it, punting, or going for a field goal) yields the most expected points. In the last 10 minutes of a game, the bot selects recommendations based on which action yields the highest win probability. These concepts are not equivalent – going for it may maximize your points, but if time is running out and you are down by two, it might be better to go for a field goal than try for a touchdown.
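The switch between the two criteria can be sketched as a simple rule. The expected points and win probabilities below are invented for illustration, and the function name and the exact 10-minute cutoff handling are assumptions; the bot's actual estimates come from Brian Burke's models, which are not reproduced here.

```python
def recommend(actions, minutes_remaining):
    """Pick the action with the best value under the applicable criterion.

    actions maps an action name to its estimated "expected_points" and
    "win_probability". Late in the game (last 10 minutes) the choice is
    driven by win probability; otherwise by expected points.
    """
    criterion = "win_probability" if minutes_remaining <= 10 else "expected_points"
    best_action = max(actions, key=lambda name: actions[name][criterion])
    return best_action, criterion


# Hypothetical estimates for a team trailing by two points on fourth down.
fourth_down = {
    "go for it": {"expected_points": 2.1, "win_probability": 0.38},
    "field goal": {"expected_points": 1.8, "win_probability": 0.55},
    "punt": {"expected_points": -0.4, "win_probability": 0.20},
}
print(recommend(fourth_down, minutes_remaining=30))
print(recommend(fourth_down, minutes_remaining=2))
```

With these made-up numbers, going for it has the most expected points early in the game, while the field goal has the higher win probability once time is running out.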

The bot is useful because there is such a huge difference between what the best strategy is and what coaches actually do. The picture below illustrates the difference. There are a number of explanations for the difference. One is that fans and owners only remember the times it doesn’t work – following the optimal policy may maximize the number of wins on average, but losing a game could mean losing your job. When the objective is to keep your job rather than to win games, everyone gets used to more conservative and suboptimal play calling.

Fourth Down Bot’s recommendations as compared to what most coaches do.

Sports nerds have known about this issue for a long time. I’ve even blogged about it before (here and here).

The Fourth Down Bot is so high profile that it has really raised awareness of this issue, possibly to the point that it may change how the game is played. If the fans know that it is better to go for it on fourth down and if the coaches and owners read the scathing fourth down reports questioning their decision-making, then maybe it will be unacceptable for coaches to cling to sub-optimal policies. Maybe I’m too optimistic about the Fourth Down Bot’s chance at improving scientific literacy to the point where the game changes. It’s possible that coaches and owners will be dismissive of math models and the nerds who make them, but I hope the Fourth Down Bot chips away at our society’s distrust of math.

It’s worth noting that the Fourth Down Bot is genderless and does not have a race. Until I blogged about the bot, all of the sports nerds and number crunchers I had read and blogged about were men. I can’t be the only woman interested in these issues. Please introduce me to other women and minority sports nerds – I am more than willing to promote sports number crunchers from underrepresented groups.

Has the Fourth Down Bot changed the way you think about football? Do you think the Fourth Down Bot has the potential to change the game?

decision quality and baseball strategy

Miss baseball? Love operations research and analytics? Watch Eric Bickel’s 46-minute webinar called “Play Ball! Decision Quality and Baseball Strategy” here: