In 2012, The New York Times published an article about an algorithm used by Target to identify shoppers who might be pregnant.
[A] man walked into a Target outside Minneapolis and demanded to see the manager. He was clutching coupons that had been sent to his daughter, and he was angry, according to an employee who participated in the conversation.
“My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”
The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.
On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”
Target uses purchase history to predict which shoppers are pregnant. The article in the New York Times that broke this story implied that Target is extremely accurate in these predictions. I want to explore this a little further. Target sends shoppers that meet their pregnancy criteria coupon books for maternity and baby products. This article in Forbes outlines some of the data used by Target as well as their approach in predictive analytics.
[Target’s statistician] identified 25 products that, when purchased together, indicate a woman is likely pregnant. The value of this information was that Target could send coupons to the pregnant woman at an expensive and habit-forming period of her life.
I’ve always been skeptical about the accuracy of Target’s algorithm, mainly because they have found me continually pregnant since 2010-11 (the last time I was actually pregnant). Target has sent me many coupon books and ads over the years. It’s not just Target. Sometimes I receive baby formula in the mail from a formula company. Babies are expensive, and it costs Target (and other companies) very little to send me baby coupons and ads. The upside is that if Target is correct, they stand to make a large profit. If they are wrong, it costs them only a little in advertising.
We can unpack a procedure for identifying pregnant customers. John Foreman includes a “pregnant shopper” model in his book Data Smart to introduce linear and logistic regression, and it illustrates this point. I’ve used this model in class and students really like it. Regression models are fit to data from fictitious shoppers. A logistic regression model produces a score for each shopper based on their purchases of many types of products (similar to how the real application works). The score can be mapped into a 0-1 prediction or decision for classification. Choosing a cutoff decides who gets the coupon books and who doesn’t: shoppers whose scores are above the cutoff get the coupon books. The lower the cutoff, the more false positives there will be. Different cutoffs lead to different values of the true positive and false positive rates (see the “ROC curve” image below from Data Smart).
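To make the score-and-cutoff idea concrete, here is a minimal sketch in Python. The product names, weights, and intercept are made up for illustration; they are not taken from Data Smart or from Target, and a real model would fit them to data.

```python
import math

# Hypothetical weights (log-odds contributions), invented for illustration.
# Positive values raise the score, negative values lower it.
WEIGHTS = {
    "prenatal_vitamins": 2.4,
    "maternity_clothes": 2.0,
    "unscented_lotion": 0.8,
    "wine": -1.5,
    "cigarettes": -1.3,
}
INTERCEPT = -2.0  # hypothetical baseline log-odds for a shopper with no signal


def pregnancy_score(purchases):
    """Return a logistic score in (0, 1) from a set of purchased product types."""
    log_odds = INTERCEPT + sum(WEIGHTS.get(item, 0.0) for item in purchases)
    return 1.0 / (1.0 + math.exp(-log_odds))


def gets_coupon_book(purchases, cutoff):
    """Map the score to a 0-1 decision: mail coupons when the score exceeds the cutoff."""
    return pregnancy_score(purchases) > cutoff
```

With these made-up numbers, a shopper who bought only unscented lotion scores about 0.23, so she gets no coupon book at a cutoff of 0.5 but does get one at a cutoff of 0.2. Lowering the cutoff sweeps in more shoppers like her.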
Two ways to measure accuracy include:
- Sensitivity: the true positive rate that measures the proportion of actual positives that are correctly identified.
- Specificity: the true negative rate (1 – the false positive rate) that measures the proportion of actual negatives that are correctly identified.
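Both measures can be computed directly from scores and known outcomes once a cutoff is fixed. The sketch below uses eight invented shoppers (four pregnant, four not); the function name and the data are assumptions for illustration only.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Return (sensitivity, specificity) when predicting positive for score > cutoff.

    Assumes labels contains at least one positive (1) and one negative (0).
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > cutoff
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif not predicted_positive and label:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate = 1 - false positive rate
    return sensitivity, specificity


# Made-up scores and outcomes (1 = actually pregnant, 0 = not pregnant).
scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(sensitivity_specificity(scores, labels, 0.50))  # (0.75, 0.75)
print(sensitivity_specificity(scores, labels, 0.25))  # (1.0, 0.5)
```

Dropping the cutoff from 0.50 to 0.25 catches every pregnant shopper in this toy data, but half of the non-pregnant shoppers now get coupons too. Each cutoff corresponds to one point on the ROC curve.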
What this means is that the algorithm is not necessarily accurate and that Target is not necessarily aiming for accuracy in terms of the model’s predictive ability. Instead, Target is choosing a point on this ROC curve by setting a cutoff that makes sense for their business model. If the cost of sending an ad to a non-pregnant shopper (the cost of a false positive) is low and the profit from a true positive is high, Target would select a cutoff that corresponds to a point on the curve with a high true positive rate as well as a high false positive rate (that is, low specificity). This is what I experience.
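One way to picture this trade-off is to pick the cutoff that maximizes profit on data with known outcomes. The sketch below is an assumption about how such a choice could be made, not a description of Target's actual procedure; the dollar figures, scores, and function names are invented.

```python
def profit_at_cutoff(scores, labels, cutoff, profit_per_tp, cost_per_fp):
    """Profit from mailing everyone with score > cutoff.

    Each true positive earns profit_per_tp; each false positive costs cost_per_fp.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y)
    fp = sum(1 for s, y in zip(scores, labels) if s > cutoff and not y)
    return tp * profit_per_tp - fp * cost_per_fp


def best_cutoff(scores, labels, cutoffs, profit_per_tp, cost_per_fp):
    """Return the candidate cutoff with the highest profit."""
    return max(
        cutoffs,
        key=lambda c: profit_at_cutoff(scores, labels, c, profit_per_tp, cost_per_fp),
    )


# Made-up scores and outcomes (1 = actually pregnant, 0 = not pregnant).
scores = [0.95, 0.80, 0.60, 0.15, 0.70, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
cutoffs = [0.05, 0.25, 0.50, 0.75]

# Cheap mailer (a coupon book): false positives barely matter.
print(best_cutoff(scores, labels, cutoffs, profit_per_tp=100, cost_per_fp=1))    # 0.05
# Expensive mailer: false positives dominate, so the cutoff moves up.
print(best_cutoff(scores, labels, cutoffs, profit_per_tp=100, cost_per_fp=150))  # 0.75
```

When a false positive costs almost nothing, the most profitable choice in this toy data is to mail nearly everyone. When a false positive is expensive, the best cutoff is high enough that only the surest predictions get mail.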
Other applications where the cost of a false positive is higher may lead to a different choice of cutoff, one with a lower false positive rate. I have rarely received baby formula in the mail, presumably because mailing formula comes at a much higher cost to these companies.
Another example that comes to mind is modeling sports injuries. The cost of a false positive could be high: resting a star player too much at the end of the season to stave off injury means the team could lose too many games and miss the playoffs. Not resting the player (risking a false negative) means the player could suffer a season-ending injury, which would mean the team could lose in the playoffs.