how to use data mining and simulation to win at Jeopardy

The most lucrative single-day prize for any contestant in Jeopardy’s history was indirectly based on data mining and simulation. Roger Craig, who has a PhD in computer science, used data-mining algorithms to train himself on a database of training questions. His source data was The Jeopardy Archive, which has every question and answer from all of the Jeopardy episodes. He parsed the whole site to create one large data set composed of unstructured text data. He reverse engineered the game to identify which categories of questions to study based on how valuable these questions are in the game. He then randomly sampled from the set of training questions and tried to answer them. His results on these sampled questions were used to predict which questions he would get right and wrong and to identify which subjects to study.
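To make the sampling step concrete, here is a minimal sketch of estimating per-category accuracy from a random sample of practice questions. It is not Roger Craig's actual code: the function name estimate_accuracy, the question fields "category" and "value", and the toy data are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def estimate_accuracy(questions, answer_fn, sample_size, seed=0):
    """Sample practice questions and return the fraction answered
    correctly in each category that appears in the sample."""
    rng = random.Random(seed)
    sample = rng.sample(questions, min(sample_size, len(questions)))
    asked = defaultdict(int)
    correct = defaultdict(int)
    for q in sample:
        asked[q["category"]] += 1
        if answer_fn(q):
            correct[q["category"]] += 1
    return {cat: correct[cat] / asked[cat] for cat in asked}

# Made-up toy data (hypothetical, not taken from the archive).
questions = (
    [{"category": "food", "value": 200} for _ in range(50)]
    + [{"category": "art", "value": 1000} for _ in range(50)]
)

def toy_answer(q):
    # Stand-in for a real practice session: pretend every food clue
    # is answered correctly and every art clue is missed.
    return q["category"] == "food"

accuracy = estimate_accuracy(questions, toy_answer, sample_size=40)
print(accuracy)
```

Categories with low estimated accuracy and high clue values are the natural candidates for extra study.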

Roger Craig clustered the training questions based on their category and their value. For example, low-valued questions are often based on food whereas high-valued questions are often based on art. He constructed a nonlinear algorithm to identify the optimal “path” for beating the average contestant on Jeopardy. The algorithm used simulation to estimate the probability that his “predicted self” would get questions correct. He focused on improving his answers on the high-valued questions by studying the topics on which he was shaky.
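The details of his model are not public here, so the following is only a sketch of how a simulated “predicted self” could be scored. It assumes he attempts every clue, that a correct response adds the clue value and a miss subtracts it, and that one accuracy number per category is enough. The names simulate_score and expected_score, the accuracy figures, and the clue values are illustrative assumptions.

```python
import random

def simulate_score(accuracy, clue_values, n_trials=10_000, seed=0):
    """Average score over simulated boards in which each clue is answered
    correctly with its category's estimated probability."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trials):
        for category, p in accuracy.items():
            for value in clue_values:
                # A correct response adds the clue value; a miss subtracts it.
                total += value if rng.random() < p else -value
    return total / n_trials

def expected_score(accuracy, clue_values):
    """Closed-form expectation of the same simplified model."""
    return sum((2 * p - 1) * v for p in accuracy.values() for v in clue_values)

# Hypothetical accuracy estimates and clue values, for illustration only.
accuracy = {"food": 0.9, "art": 0.6}
clue_values = [200, 400, 600, 800, 1000]

print(simulate_score(accuracy, clue_values))
print(expected_score(accuracy, clue_values))
```

In this simplified model, raising accuracy in a category by a small amount dp raises the expected score by 2 * dp times the sum of that category's clue values, which is why shaky topics that show up in high-valued clues are worth the most study.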

Training for Jeopardy was also a knapsack problem: Roger Craig had a limited amount of time to study. One way to effectively use his studying time was to limit the amount of time spent on each question. This had the side benefit of preparing him to answer the questions as quickly as possible, improving his odds of being the first to buzz in with the correct question. Then, he used his algorithm to identify which topics would help improve his score the most.
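The knapsack view of studying can be sketched with the standard 0/1 knapsack dynamic program. This is an illustration under stated assumptions, not his implementation: the function plan_study, the topic names, the whole-hour study costs, and the estimated score gains are all hypothetical.

```python
def plan_study(topics, budget_hours):
    """0/1 knapsack: choose the topics that maximize total estimated score
    gain within a budget of study hours.

    topics is a list of (name, hours, gain) tuples with integer hours.
    Returns (best_total_gain, chosen_topic_names).
    """
    # best[h] = (gain, chosen names) achievable with at most h hours
    best = [(0, [])] * (budget_hours + 1)
    for name, hours, gain in topics:
        # Iterate downward so each topic is chosen at most once.
        for h in range(budget_hours, hours - 1, -1):
            candidate = best[h - hours][0] + gain
            if candidate > best[h][0]:
                best[h] = (candidate, best[h - hours][1] + [name])
    return best[budget_hours]

# Hypothetical topics: (name, study hours, estimated score gain).
topics = [
    ("opera", 3, 900),
    ("world capitals", 2, 500),
    ("cocktails", 1, 150),
    ("poetry", 4, 1000),
]

print(plan_study(topics, budget_hours=5))
```

With these made-up numbers and a five-hour budget, the plan picks opera and world capitals (total gain 1400) over poetry and cocktails (1150), even though poetry has the largest single gain.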

The video below shows how Roger Craig prepared for Jeopardy. If you have discovered more details about his specific model and implementation, please leave a comment.

2 responses to “how to use data mining and simulation to win at Jeopardy”

  • Laura McLay

    @MacDiva posted the data set used for training:

    and noted that Roger Craig is on twitter (@rogcraig)

  • Paul K

    Watching those episodes is pretty fun given his strategy (along with knowledge) differed so much from the “normal” or average strategy. For example, he used a strategy to hit the daily doubles before other contestants vs. just going down categories greedily (much to his advantage when combined with the knowledge edge!):

    The observed distribution of daily doubles is given below (culled from a jeopardy forum thread), which I am sure he was well aware of when picking questions:
    Position 1 (from top of board): 0.1%
    P2: 10%
    P3: 26%
    P4: 36%
    P5: 26%
