# using combinatorics and regression to steal social security numbers

It turns out that to guess someone’s social security number (SSN) for nefarious purposes, all you need is a little knowledge of combinatorics and regression.

According to a story reported all over the web today, Alessandro Acquisti and Ralph Gross of Carnegie Mellon University in Pittsburgh used Social Security Administration death records to fit a regression model of how SSNs are assigned.  They combined that model with birthday and birthplace data available elsewhere to narrow the feasible range of a person’s SSN, which makes the number easier to guess.  This works because SSNs have been assigned in a relatively deterministic way since 1988, making it much easier to correctly guess the first five digits of someone’s SSN (if they were born since 1988).  The Enumeration at Birth initiative created this issue, since it deterministically distributes available SSN prefixes to states.  And the smaller the state population, the fewer SSNs are “available,” making a correct guess more probable.
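To make the regression idea concrete, here is a minimal sketch on synthetic data. It is not the CMU model: the records are fake, the linear relationship between birth day and assigned sequence number is an assumption for illustration, and every name in it (`fit_line`, `make_synthetic_records`, `predict_window`) is hypothetical. It only shows how a fitted trend turns a known birthday into a narrow window of candidate numbers.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def make_synthetic_records(n, rate, noise, seed=0):
    """Fake (birth_day, sequence_number) pairs. Assumption: numbers are
    handed out roughly in birth order, `rate` per day, plus random delay."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        day = rng.uniform(0, 365)
        seq = rate * day + rng.gauss(0, noise)
        records.append((day, seq))
    return records

def predict_window(a, b, day, half_width):
    """Range of plausible sequence numbers for someone born on `day`."""
    center = a + b * day
    return center - half_width, center + half_width

# Stand-in for public records with known birthdays and known numbers.
records = make_synthetic_records(n=2000, rate=50.0, noise=100.0)
days = [d for d, _ in records]
seqs = [s for _, s in records]
a, b = fit_line(days, seqs)

# Residual spread tells us how wide the guessing window has to be.
residuals = [s - (a + b * d) for d, s in records]
sd = (sum(r * r for r in residuals) / (len(residuals) - 2)) ** 0.5

low, high = predict_window(a, b, day=200.0, half_width=2 * sd)
print(f"slope ~ {b:.1f} per day; window for day 200: [{low:.0f}, {high:.0f}]")
```

The point of the sketch is the last step: the less noise there is in the assignment process, the smaller `sd` gets, and the fewer candidates an attacker has to try.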

How easy is it to guess an SSN?  Well, it depends.

* For a person born before 1989, the CMU algorithm guesses correctly 0.08 percent of the time (in 100 tries per person).  For a person born since 1989, the CMU algorithm guesses correctly 0.9 percent of the time (in 100 tries per person), more than 10x better than before 1989.

* When they allowed themselves 1000 tries per person, there was a similar trend (0.8 percent before 1989 versus 8.5 percent since 1989).

* It’s easier to find just the first five digits.  In a single guess, they can identify the first five digits for 44 percent of people born since 1989 and for 7 percent of people born from 1973 to 1988.

* This is even more disturbing for small states.  The researchers guessed the first five digits of 2 percent of California SSNs with 1980 birthdays (the biggest state, pre-1989) and 90 percent of Vermont SSNs with 1995 birthdays (the smallest state, post-1988).
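For a sense of scale, the rates above can be put side by side with a little arithmetic. The percentages below are the ones quoted in this post; the naive baseline is my own assumption (distinct guesses drawn uniformly from all one billion nine-digit strings, ignoring which ones are actually valid SSNs), and the function names are made up for this sketch.

```python
# Success rates quoted above, in percent: (earlier cohort, later cohort).
reported = {
    "full SSN, 100 tries": (0.08, 0.9),
    "full SSN, 1000 tries": (0.8, 8.5),
    "first five digits, 1 try": (7.0, 44.0),
}

def improvement(before, after):
    """How many times better the later-cohort rate is."""
    return after / before

def naive_rate_percent(tries, space=10**9):
    """Assumed baseline: distinct uniform guesses over all 9-digit strings."""
    return 100.0 * tries / space

for label, (before, after) in reported.items():
    print(f"{label}: {improvement(before, after):.1f}x better for the later cohort")

baseline = naive_rate_percent(100)
print(f"naive baseline, 100 tries: {baseline:.5f} percent")
print(f"reported 0.9 percent is about {0.9 / baseline:,.0f}x the naive baseline")
```

Under that baseline, 100 blind guesses succeed about one time in ten million, so even the pre-1989 rate is a large improvement over guessing at random.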