underpowered statistical tests and the myth of the myth of the hot hand

In grad school, I learned about the hot hand fallacy in basketball. A player with a so-called “hot hand” is one whose probability of scoring is temporarily elevated and who should therefore take more of the shots. I thought the myth of the hot hand was an amazing result: there is no such thing as a hot hand in sports; it’s just that humans are not good at evaluating streaks of successes (hot hands) or failures (slumps).

Flash forward years later. I read a headline about how hand sanitizer doesn’t “work” in terms of preventing illness. I looked at the abstract and read off the numbers. The group that used hand sanitizer (in addition to hand washing) got sick 15-20% less often than the control group that only washed hands. The 15-20% difference wasn’t statistically significant, so the study could not conclude that hand sanitizing helped, but it represented a lot of illnesses averted. I wondered if this difference would have been statistically significant if the number of participants had been just a bit larger.

It turns out that I was onto something.

The original hot hand study is like the hand sanitizer study: the design was underpowered, meaning that the test had little chance of rejecting the null hypothesis even if the hot hand effect or the hand sanitizer effect was real. A non-significant result from an underpowered test cannot tell us whether the effect exists or not. In the case of the hand sanitizer, the number of participants needed to be large enough to detect a 15-20% reduction in the number of illnesses. Undergraduates do this kind of calculation in probability and statistics courses when they estimate the sample size needed. But researchers sometimes forget to design an experiment in a way that can detect real differences.
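
To make the sample size calculation concrete, here is a rough sketch of the textbook normal-approximation formula for comparing two proportions. The baseline illness rate of 50%, the 17.5% relative reduction (the midpoint of 15-20%), the 5% significance level, and the 80% power target are all assumptions I picked for illustration; none of them come from the hand sanitizer study itself.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_proportions(p_control, relative_reduction,
                                alpha=0.05, power=0.80):
    """Approximate participants needed PER GROUP to detect a relative
    reduction in an illness rate with a two-sided two-proportion z-test.

    All inputs are illustrative assumptions, not values from any study.
    """
    p_treat = p_control * (1 - relative_reduction)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treat) ** 2
    return ceil(n)


if __name__ == "__main__":
    # Hypothetical scenario: half the control group gets sick at least once.
    for reduction in (0.15, 0.175, 0.20):
        n = sample_size_two_proportions(0.50, reduction)
        print(f"relative reduction {reduction:.1%}: about {n} per group")
```

The required sample size grows quickly as the effect gets smaller or the baseline rate gets lower, which is why a study that looks reasonably large can still be underpowered.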

My UW-Madison colleague Jordan Ellenberg has a fantastic article about the myth of the myth of the hot hand on Deadspin. He has more in his book How Not to Be Wrong, which I highly recommend. He introduced me to a research paper by Kevin Korb and Michael Stillwell that compared statistical tests used to test for the hot hand effect on simulated data that did indeed have a hot hand. The “hot” data alternated between streaks with success probabilities of 50% and 90%. They demonstrated that the serial correlation and runs tests used in the early “hot hand fallacy” paper were unable to identify a real hot hand. In other words, these tests were underpowered: they could not reject the null hypothesis even when it was false. This is poor test design. If you want to answer a question using any kind of statistical test, it’s important to collect enough data and use the right tools so you can find the signal in the noise (if there is one) and reject the null hypothesis if it is false.
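
Here is a small simulation in the spirit of that idea, so you can estimate the power of a runs test yourself. It is only a sketch, not a reproduction of the Korb and Stillwell study: the 50% and 90% success probabilities come from the description above, but the fixed streak length, the number of shots, the number of trials, and the use of the normal-approximation Wald-Wolfowitz runs test are my own assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_shots(n_shots, streak_len, p_cold, p_hot, rng):
    """Shots that alternate between cold and hot streaks of fixed length
    (the fixed streak length is an assumption made for simplicity)."""
    shots = []
    for i in range(n_shots):
        hot = (i // streak_len) % 2 == 1
        shots.append(rng.random() < (p_hot if hot else p_cold))
    return shots


def runs_test_p_value(shots):
    """Two-sided Wald-Wolfowitz runs test using the normal approximation."""
    n = len(shots)
    n1 = sum(shots)          # makes
    n2 = n - n1              # misses
    if n1 == 0 or n2 == 0:
        return 1.0           # no information about streakiness
    runs = 1 + sum(a != b for a, b in zip(shots, shots[1:]))
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    if var <= 0:
        return 1.0
    z = (runs - mean) / sqrt(var)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def estimate_power(n_shots, streak_len, p_cold=0.5, p_hot=0.9,
                   alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated shot sequences in which the runs test
    rejects the null hypothesis of no streakiness."""
    rng = random.Random(seed)
    rejections = sum(
        runs_test_p_value(
            simulate_shots(n_shots, streak_len, p_cold, p_hot, rng)
        ) < alpha
        for _ in range(trials)
    )
    return rejections / trials


if __name__ == "__main__":
    for n_shots in (50, 100, 500, 1000):
        power = estimate_power(n_shots, streak_len=10)
        print(f"{n_shots} shots: estimated power {power:.2f}")
```

Setting p_cold equal to p_hot removes the hot hand entirely, so the rejection rate in that case should land near alpha, which is a handy sanity check. With a real hot hand built in, the rejection rate tells you how often the test would actually catch it for a given number of shots.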

I learned that there appears to be no hot hand in sports where a defense can easily adapt to put greater defensive pressure on the “hot” player, like basketball and football. So the player may be hot but it doesn’t show up in the statistics only because the hot player is, say, double teamed. The hot hand is more apparent and measurable in sports where defenses are not flexible enough to put more pressure on the hot player, like in baseball and volleyball.