Tag Archives: probability

chocolate chip cookies are Poisson distributed

I asked for examples of things that are Poisson distributed in class. One student said the number of chocolate chips in a cookie is Poisson distributed. He’s right.

Here is the intuition for when you have a Poisson distribution. First, you should have a counting process where you are interested in the total number of events that occur by time t or in space s. If each of these events is independent of the others, then the result is a Poisson distribution.

Let’s consider the Poisson process properties of a chocolate chip cookie. Let N(t) denote the number of chocolate chips in a cookie of size t. N(t) is a Poisson process with rate y if all four of the following conditions hold:

1) The cookie has stationary increments, where the number of chocolate chips in a cookie is proportional to the size of the cookie. In other words, a cookie with twice as much dough should have twice as many chocolate chips (N(t) ~ Poisson (y*t)). That is a reasonable assumption.

2) The cookie has independent increments. The number of chocolate chips in a cookie does not affect the number of chocolate chips in the next cookie.

3) A cookie without any dough cannot have any chocolate chips (N(0)=0).

4) The probability of finding two or more chocolate chips in a cookie of size h is o(h). In other words, you will find at most one chocolate chip in a tiny amount of dough.

All of these assumptions appear to be true, at least in a probabilistic sense. Technically there may be some dependence between chips if we note that bags of chocolate chips have a finite population (whatever is in the bag). There is some dependence between the number of chocolate chips in one cookie and the next, since how many chips we have used thus far gives us additional knowledge about how many chips are left. This would violate the independent increments assumption. However, the independence assumption is approximately true, since the frequency of chocolate chips in the cookie you are eating is roughly independent of the frequency of chocolate chips in the cookies you have already eaten. As a result, I expect the Poisson to be an excellent approximation.
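To see how little the finite bag matters, here is a minimal simulation sketch. It assumes a hypothetical batch of 1000 equal-size cookies that share one bag of 5000 chips, with each chip landing in a uniformly random cookie. The batch size, chip count, and function names are illustrative assumptions, not measurements.

```python
import math
import random
from collections import Counter


def simulate_chip_counts(n_cookies, n_chips, seed=0):
    """Drop every chip from one bag into a uniformly random cookie."""
    rng = random.Random(seed)
    counts = [0] * n_cookies
    for _ in range(n_chips):
        counts[rng.randrange(n_cookies)] += 1
    return counts


def poisson_pmf(k, mean):
    """P(N = k) for N ~ Poisson(mean)."""
    return math.exp(-mean) * mean**k / math.factorial(k)


# Hypothetical batch: 1000 equal-size cookies sharing a bag of 5000 chips.
counts = simulate_chip_counts(n_cookies=1000, n_chips=5000)
mean = sum(counts) / len(counts)  # plays the role of y*t for one cookie
observed = Counter(counts)

for k in range(11):
    empirical = observed[k] / len(counts)
    print(f"{k:2d} chips: simulated {empirical:.3f}, Poisson {poisson_pmf(k, mean):.3f}")
```

Because the bag is finite, the simulated counts are technically multinomial rather than independent, yet the two columns should track each other closely, which is the sense in which the Poisson is a good approximation.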

Picture courtesy of Betty Crocker

 


my stochastic processes midterm: Wisconsin edition

Based on the curiosity over my cheese-related exam question on twitter, I have decided to post the midterm for my graduate level course on stochastic processes. My favorite questions are #1 and #2. I should note that #1 was inspired by actual leftover cheese that is packaged and sold at a discount at the Babcock Dairy Store on campus (picture is below). If there is enough extra leftover cheese, it is poured into a bag, leading to cheese that has layers like an onion. It is apparently not cost-effective to repress the leftover cheese into a smooth brick of cheese. As someone who didn’t grow up wearing a foam cheese hat, I find that cheese production, quality control, and inventory is the right avenue for me to learn about cheese.

https://twitter.com/lauramclay/status/392640686094155776

The Midterm

1. The dairy plant in Babcock Hall makes one batch of cheese six days per week. The amount of cheese (in pounds) left over after each batch is distributed according to an exponential distribution with parameter 1. Cheese production on each day is independent. The Babcock Dairy store will package and sell any leftover cheese in a batch (i.e., a day) that is more than 1 pound—a “cheese factory second.” Let Ei be the event that there is more than 1 pound of cheese available on day i, with i = 1, 2, 3, 4, 5, 6.

Therefore, Ei is a random variable:
Ei = 1 if there are cheese factory seconds for the type of cheese produced on day i and 0 otherwise (i.e., ignore how much cheese is leftover – we are instead interested in the binary outcome of whether cheese is leftover).

E = the total number of types of cheese factory seconds across the 6 day week.

a) Define E in terms of Ei.
b) What is the sample space for E?
c) What is the probability that there is at least a pound of leftover cheese on day 1?
d) What is the probability that four or more days in the week produce cheese factory seconds?

2. The Lightsaber Manufacturing Company (LMC) operates their manufacturing plant in a galaxy far, far away. They need to decide how many lightsabers to stock for the next Jedi Convention, where they will sell lightsabers to Jedi apprentices. Due to the expense associated with interstellar travel, LMC will discard the unsold lightsabers after the convention. Lightsabers cost 413 galactic credits to produce and are sold for 795 galactic credits (the unit of currency in the Empire). If the demand for lightsabers follows an exponential distribution with an average of 75 lightsabers, how many lightsabers should the LMC bring to the Jedi Convention to maximize its profit?

3. A student has a hard time figuring out how to get started on homework for Stochastic Modeling Techniques. The student randomly selects one of 3 potential places to start a homework problem with equal probability. The first approach is not fruitful; the student will return to the starting point after 1 hour of work. The second approach will return the student to the starting point after 3 hours. The third will lead to the solution in 15 minutes (1/4 hour). The student is confused, so he/she always chooses from all 3 available approaches each time. What is the expected amount of time it takes this diligent student to solve the homework problem?

4. A mysterious illness called “badgerpox” has affected the local badger population near Madison. The exposure level (X) largely determines whether a badger contracts the disease (D). The probability distribution for the exposure level and the conditional probability of disease given the exposure level are given in the table below.

Exposure level (Xi)    P(Xi)    P(D | Xi)
10                     0.7      0.001
100                    0.15     0.005
1000                   0.12     0.12
10000                  0.03     0.78

Find:

(a) The conditional distribution P(Xi | D) for each value of Xi
(b) The probability that a badger contracts the disease P(D)
(c) The expected exposure level for badgers that have contracted the disease.

5. A student likes to come to ISyE 624 on time, which is possible as long as the student can travel from left to right in the diagram below. There are two paths to class; the student can pass through if and only if all components along a path are open. Due to construction, component i is open with probability pi, i = 1,2,3,4. Assume components are independent. If there is not a path to class, the student will arrive to class late, and the professor will be sad. What is the probability the student gets to class on time?

midterm1_prob1

6. The Wisconsin-Minnesota football rivalry dates back to 1890. The teams play once per year for the trophy of “Paul Bunyan’s Axe,” which replaced the first trophy (the “Slab of Bacon,” 1930-1943)*. The teams are unevenly matched, with Wisconsin winning 16 of the last 20 games. Let’s say that Wisconsin wins each game independently with probability 0.8. The teams play next on 11/23/13.

(a) What is the expected number of games/years until Wisconsin loses next?
(b) What is the expected number of games/years until Wisconsin loses 2 games in a row?
(c) What is the probability that Wisconsin loses for the 3rd time in the 5th game in the series?

* I didn’t make that up!

20131024-104034.jpg


the conditional probability of being struck by lightning (Part 4)

The National Weather Service released a report on being struck by lightning [Story here] that may be of interest to those of you who read my previous posts on lightning (see below). The study confirms that men account for 82% of those struck by lightning, which I’ve blogged about before. More people are struck by lightning in July than in any other month.

Lightning fatalities by month


NOAA states that June, July and August have the most lightning in addition to having the most lightning fatalities. In fact, I was caught in a severe storm that rolled in quickly when I was running in a forest preserve in June. I suspect that the conditional probability of being struck by lightning doesn’t depend on seasonality as much as the picture above suggests, but I don’t have any proof.

The Washington Post Magazine ran a nice story about the ‘Spark Ranger’ Roy Sullivan who was struck by lightning seven times [Link]. A man who worked outside, Roy Sullivan was at an increased risk of getting struck by lightning. I’m not sure how that happened seven times.



the secretary problem is a useful model for selling a house

Realtors sometimes think that the optimal solution is to convince their clients to accept the first offer made on their house. The marginal increase in the realtor’s fee is tiny if the sellers wait to get a small increase in the selling price (a 3% commission on the extra $2000 that the sellers are holding out for is a measly $60). The realtor may invest more than $60 to better market the house while waiting for a slightly better offer to arrive. See the video from the Freakonomics documentary below for more on this subject.

We ended up using the optimal Secretary Problem policy to sell our house. It wasn’t our plan at the outset, but it’s what happened. An optimal policy to the Secretary Problem maximizes the probability of ending up with the best offer. The idea is to first estimate the number of offers you would expect to receive, at least in the timeframe that you have to sell a house; let’s call this n. Then you observe and reject the first n/e offers. After that, you accept the first offer that is the best you’ve seen so far. I thought n would be small (2-3), but there was a lot of traffic in our house and I had to increase my estimate of n to maybe 6.
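The n/e rule described above can be sketched in a few lines. This is a minimal illustration under textbook assumptions (offers arrive in random order, there is no negotiation, and n/e is rounded down); the function names are hypothetical.

```python
import math
import random


def exact_success_probability(n, r):
    """P(ending with the best of n offers) when the first r are rejected."""
    if r == 0:
        return 1.0 / n  # with no observation phase, the first offer is accepted
    return (r / n) * sum(1.0 / (i - 1) for i in range(r + 1, n + 1))


def secretary_trial(n, rng):
    """Simulate one sale; return True if the rule picks the best offer."""
    offers = [rng.random() for _ in range(n)]
    cutoff = int(n / math.e)  # assumption: round n/e down
    best_seen = max(offers[:cutoff], default=float("-inf"))
    for value in offers[cutoff:]:
        if value > best_seen:  # first offer that beats everything seen so far
            return value == max(offers)
    return False  # the best offer was among those rejected up front


def estimate_success(n, trials=100_000, seed=0):
    rng = random.Random(seed)
    return sum(secretary_trial(n, rng) for _ in range(trials)) / trials


n = 6
r = int(n / math.e)
print(f"n={n}: reject the first {r} offers")
print(f"exact success probability:     {exact_success_probability(n, r):.4f}")
print(f"simulated success probability: {estimate_success(n):.4f}")
```

With n = 6 the rule rejects the first two offers, and the chance of ending up with the single best offer is a little under 43%, which is already close to the 1/e limit for large n.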

The first person to look at our house made us an offer almost immediately. It was a good offer, but the buyer wanted to close a month earlier than we were ready, which would lead to some substantial costs on our end (not counting the stress of having to make immediate moving plans and find temporary housing). The net offer was good, but not good enough for us to sell our house. We let it go. At the time, this seemed like our n/e, which meant that we should accept the next offer that was better than our first one.

We had a second offer on our house that eventually worked its way up to match the net value of the first offer. It was too low, so we rejected it (and almost gave our realtor a heart attack in the process). We then had a third offer on the house that was not worth entertaining. Later that day, we received our fourth offer. After some negotiation, it became our best so far, and we accepted it.

It’s not often that I get to personally collect empirical evidence to validate an OR model. I’d like to say that it was fun, but it was mostly a stressful experience. I’m glad it worked out in the end. During the process, it was helpful to know that math backed us up on our offer rejections. We ended up with that extra $2000 (less $120 for both realtors).

Comment: When I say we “rejected” offers, I mean that we counter-offered with what we were willing to settle for and were turned down. Accepting/rejecting offers is a little more complicated when selling a house as compared to the secretary problem, where there is no negotiation.

Incidentally, the Wall Street Journal recommends using the Secretary Problem for finding a rental [Link].


The Birthday Problem with a mating season: A simulation approach

My last blog post illustrated how unlikely the equal birthday likelihood assumption is. I wrote a short simulation code to consider the impact of unequal birthdays. I modeled unequal birthdays as a mating season that results in three months (90 days) that are more likely birthdays than the remaining 365-90 days. This corresponds to July – September in my earlier post.

Let R = (Likelihood of being born in the “hot” 3 months) / (Likelihood of being born in the remaining 9 months).

The Birthday Problem assumes that R = 1. I consider 1 <= R <= 2. This post courtesy of Chris Rump indicates that R < 1.2, meaning that humans don’t have much of a mating season.

The simulations below show the average value of P(n), where P(n) = the probability that someone shares a birthday in a group of n people. The simulations are performed over 1M replications for each value of n. The probability of shared birthday goes up when people are more likely to be born in the birth months associated with “mating season.” But the effects are small, as can be seen by a fairly compressed y-scale. The simulations were performed in Matlab and the program is here.

The Birthday Problem probability for n=5.

The Birthday Problem probability for n=10.

The Birthday Problem probability for n=20.

The Birthday Problem probability for n=30.

The Birthday Problem probability for n=40.

The Birthday Problem probability for n=50.
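For readers who want to try this without Matlab, here is a small Python sketch of the same kind of experiment. It is not the program linked above. It assumes that R is a per-day ratio (each of the 90 "hot" days is R times as likely as each remaining day), and it uses far fewer replications than 1M so that it runs in a few seconds.

```python
import itertools
import random

DAYS = 365
HOT_DAYS = 90  # the "mating season" birthdays


def cumulative_day_weights(R):
    """Cumulative weights where each hot day is R times as likely as other days."""
    weights = [float(R)] * HOT_DAYS + [1.0] * (DAYS - HOT_DAYS)
    return list(itertools.accumulate(weights))


def simulate_shared_birthday(n, R, reps=20_000, seed=0):
    """Estimate P(n): the chance that at least two of n people share a birthday."""
    rng = random.Random(seed)
    cum_weights = cumulative_day_weights(R)
    days = range(DAYS)
    shared = 0
    for _ in range(reps):
        birthdays = rng.choices(days, cum_weights=cum_weights, k=n)
        if len(set(birthdays)) < n:  # a repeated day means a shared birthday
            shared += 1
    return shared / reps


for R in (1.0, 1.2, 2.0):
    print(f"n=20, R={R}: P(n) is about {simulate_shared_birthday(20, R):.3f}")
```

Setting R = 1 recovers the classic Birthday Problem, which gives a quick sanity check on the simulation before trying larger values of R.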

 


The Birthday Problem

Many of you have seen The Birthday Problem: Given a group of n people, what is the probability that someone shares a birthday?

Here, we are only concerned with birth day and month (not year). The solution assumes that a person is equally likely to be born on any of the 365 days in the year, thus ignoring leap years.

Let P(n) = the probability that someone shares a birthday in a group of n people and let Q(n) = the probability that everyone has unique birthdays. There are 365^n ways for n people to be born on any of the 365 days. Then

P(n) = 1 – Q(n) = 1 – (365*364*…*(365-n+1))/365^n.

P(n)

P(2) = 0.0027

P(5) = 0.0271

P(10) = 0.1169

P(20) = 0.4114

P(30) = 0.7063

P(40) = 0.8912

P(50) = 0.9704

P(60) = 0.9941 –> in a room with 60 people, you are almost certain to have at least two people that share a birthday!
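The values above follow directly from the formula for P(n). A short sketch, using a hypothetical function name, that reproduces them:

```python
def shared_birthday_probability(n, days=365):
    """P(n) = 1 - (365 * 364 * ... * (365 - n + 1)) / 365^n."""
    q = 1.0  # Q(n): probability that all n birthdays are distinct
    for i in range(n):
        q *= (days - i) / days
    return 1.0 - q


for n in (2, 5, 10, 20, 30, 40, 50, 60):
    print(f"P({n}) = {shared_birthday_probability(n):.4f}")
```

Multiplying the ratios one at a time avoids forming 365^n explicitly, which would be a very large number for n = 60.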

The key assumption is that all birth dates are equally likely. This NPR article shows that humans have a “mating season” that makes July – September birthdays more likely. I posted the image below.

This will, of course, change our answer above. The probabilities depend on who is in the room. Have you simulated the Birthday Problem with an unequal birthday distribution? If so, please shed light on realistic numbers for P(n).

On a side note, the image below suggests that babies are induced on December 27-30 for a tax break. I’m not sure how I feel about that.

How likely are people to be born on different birth dates?


bus accidents are a Poisson process

The fourth school bus accident in the Richmond, Virginia area occurred this morning. Everyone wants to know, what does this mean?!?

Here’s what I think it means: bus accidents can be modeled as a Poisson process. Equivalently, the time between bus accidents can be modeled using the exponential distribution. This modeling paradigm is appropriate if bus accidents “randomly” occur independently of one another, which is a reasonable assumption.

If the time between bus accidents is exponentially distributed, then we expect that sometimes bus accidents occur in groups of three or four. Example exponential probability distributions are below. The exponential distribution has parameter lambda, where 1/lambda is the average time between arrivals (bus accidents in this case). Most of the “meat” of the distribution is close to zero, even if the average time between arrivals is very large. This means that we would expect to sometimes observe small interarrival times and then go a long time before the next arrival.

Let’s put this in terms of bus accidents. If bus accidents occur as a result of chance or coincidence, then we would sometimes expect to observe four bus accidents in a week and then go months before the next bus accident. Four bus accidents in a week does not necessarily imply that something nefarious is going on.

This reasoning can also be used to explain why completely unrelated celebrity deaths sometimes occur in threes.

Example exponential distributions (probability density functions). The average time between arrivals is lambda^-1.

How rare are four bus accidents in a week? Let’s assume that bus accidents occur once every four weeks on average (lambda=1/4 per week). The probability of observing 4+ accidents in a week is 0.01%. Pretty rare. But that’s any one week. The school year is 36 weeks long, which means that we would have 36 chances to have 4+ accidents in a week. Using the Binomial distribution, we find that the probability of having at least one week with 4+ accidents is 0.5% (once every 200 years).

What about a slightly less extreme week? The probability of observing 3+ accidents in a week is 0.2%. Over the course of a year, the odds of having at least one week with 3+ accidents is 7.5% (once every 13 years).
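These numbers can be checked with a few lines of code. The sketch below assumes the weekly accident count is Poisson with mean 1/4 and that the 36 school weeks are independent; the function names are illustrative.

```python
import math


def poisson_tail(k, mean):
    """P(N >= k) for N ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean**j / math.factorial(j) for j in range(k))
    return 1.0 - below


def prob_some_week(k, mean, weeks=36):
    """P(at least one of `weeks` independent weeks has k or more accidents)."""
    p_week = poisson_tail(k, mean)
    return 1.0 - (1.0 - p_week) ** weeks


rate_per_week = 1 / 4  # one accident every four weeks on average
for k in (3, 4):
    p_week = poisson_tail(k, rate_per_week)
    p_year = prob_some_week(k, rate_per_week)
    print(f"{k}+ accidents: {p_week:.4%} in a given week, "
          f"{p_year:.2%} in a school year (about once every {1 / p_year:.0f} years)")
```

The school-year figure is one minus the Binomial probability of zero "bad" weeks out of 36.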



the license plate game: the raw numbers

My last post discussed how one might estimate how many state license plates one would expect to see on a road trip. I made a spreadsheet to compute the probability of seeing each state license plate.

Assumptions

  1. The probability of seeing a state license plate A in another state B depends on the distance between their state capitals. It is scaled by the number of licensed drivers in state A. (This indirectly means that the probability does not depend on how long we are in a state).
  2. Within a given state D, sightings of license plates from different states A, B, etc. are independent of one another.
  3. Sightings of a given state license plate A are independent across the states B, C,… that we drive through.
  4. We do not adjust for round trips.

The distance between state capitals was found here. The number of licensed drivers per state is here. I estimated the probability of seeing a license plate from state A in state B with this formula:

P = exp(-K * (Distance from A to B in miles) / # of licensed drivers)

with K = 7000 - 2000*Summer01 - 1000*ExpensiveGas01. Summer01 is 1 if it is summer break and 0 otherwise. ExpensiveGas01 is 1 if gas is “expensive” and AAA predicts that road trips will be down and 0 otherwise. I didn’t have time to properly identify a meaningful formula or calibrate the parameters. Suggestions here are welcome!
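A minimal sketch of how the formula can be turned into a trip-level prediction. The formula and K are taken from above; combining the per-state probabilities as 1 - prod(1 - P) is an assumption that follows from the independence assumptions and may not match the spreadsheet exactly. The distances and driver counts in the example are made-up placeholders.

```python
import math


def k_factor(summer, expensive_gas):
    """K = 7000 - 2000*Summer01 - 1000*ExpensiveGas01."""
    return 7000 - 2000 * int(summer) - 1000 * int(expensive_gas)


def sighting_probability(distance_miles, licensed_drivers,
                         summer=False, expensive_gas=False):
    """P = exp(-K * distance / number of licensed drivers in the plate's state)."""
    k = k_factor(summer, expensive_gas)
    return math.exp(-k * distance_miles / licensed_drivers)


def cumulative_probability(per_state_probs):
    """P(plate seen at least once), assuming independence across states driven."""
    miss = 1.0
    for p in per_state_probs:
        miss *= 1.0 - p
    return 1.0 - miss


# Hypothetical plate state with 5,000,000 licensed drivers whose capital is
# 400, 550, and 700 miles from the capitals of three states on the route.
probs = [sighting_probability(d, 5_000_000) for d in (400, 550, 700)]
print(f"per-state probabilities: {[round(p, 3) for p in probs]}")
print(f"cumulative probability:  {cumulative_probability(probs):.3f}")
```

By linearity of expectation, the predicted number of distinct plates for a trip is the sum of the cumulative probabilities over all plate states.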

Validation

  • We predicted 28.3 states for our summer trip from Richmond to Chicago. We saw ~35. Here, the discrepancy seemed to be the amount of time we spent in each state. We went through fewer states, but were in each state (especially Kentucky and Indiana) a relatively long time.
  • We predicted 26.8 license plates for our winter trip from Richmond to Vermont. We saw 26. Not bad!

The results make me conclude that the first assumption is probably not true: the probabilities do depend on how long we are in a state. When driving to Vermont, we went through many (8) little states. When driving to Chicago, we went through fewer (5) states but were in each state for longer.  Moreover, many of the Midwest states are not “destination” states. Take Indiana for instance. I love Hoosiers as much as the next person, but Indiana truly is the “Crossroads of America”–it’s a state that many people from other states drive through. It’s a better place to spot license plates than, say, Delaware. I didn’t take that into account.

Below is a detailed review of our winter trip numbers. It indicates the predicted probability of seeing each state license plate and whether we actually saw it. An asterisk (*) indicates that the model is “off”: either we (1) did not see a state with probability greater than 0.5 or (2) saw a state with a probability of 0.5 or lower.

A copy of my spreadsheet is here if you want to see how I computed the numbers.

State                 Cum. prob.  Saw?  Model off (*)
Alabama               0.671             *
Alaska                0           Yes   *
Arizona               0.065
Arkansas              0.060
California            0.961       Yes
Colorado              0.083       Yes   *
Connecticut           1           Yes
Delaware              1           Yes
District of Columbia  1           Yes
Florida               0.999       Yes
Georgia               0.971       Yes
Hawaii                0
Idaho                 0
Illinois              0.973       Yes
Indiana               0.950             *
Iowa                  0.056
Kansas                0.028
Kentucky              0.710             *
Louisiana             0.236
Maine                 0.565       Yes
Maryland              1           Yes
Massachusetts         1           Yes
Michigan              0.990             *
Minnesota             0.269
Mississippi           0.060       Yes   *
Missouri              0.563       Yes
Montana               0
Nebraska              0.001
Nevada                3.53E-06
New Hampshire         0.911       Yes
New Jersey            1           Yes
New Mexico            3.14E-05
New York              1           Yes
North Carolina        0.999       Yes
North Dakota          0
Ohio                  0.998       Yes
Oklahoma              0.032       Yes   *
Oregon                0.0006
Pennsylvania          0.999       Yes
Rhode Island          0.863             *
South Carolina        0.878       Yes
South Dakota          0
Tennessee             0.841             *
Texas                 0.983       Yes
Utah                  5.61E-05
Vermont               1           Yes
Virginia              1           Yes
Washington            0.037
West Virginia         0.416
Wisconsin             0.671       Yes
Wyoming               0

the license plate game

My family took a lot of road trips when I grew up. To combat boredom, we tried to see how many state license plates we would see on our trip. On a trip to see Mount Rushmore, we found almost all of the states.

As an adult and geek, the license plate game has (subtly?) changed. Now, I combat boredom by talking with my husband about how to come up with a probability distribution for how many state license plates we would expect to see on a road trip from point A to point B.

We took two road trips this year: one from Richmond, VA to Chicago, IL over the summer, the second from Richmond, VA to Burlington, VT over the winter break. We saw ~35 states in our first trip and ~25 states in our second trip.  My husband and I immediately noticed that we accrued license plates at a slower rate on our winter trip, which we suspect was from fewer people making road trips over the winter as compared to summer.

We wondered if one could estimate how many license plates you would expect to see in a road trip based on

  • the states you drive through,
  • the time of year (more people take road trips in the summer)

The state that you are in determines how likely you are to see other state license plates based on their relative distances as well as the number of licensed drivers in other states.

We simplified the problem to avoid looking at how long you drove through a state as well as interstate connectivity issues. That is, there is no difference between driving through West Virginia on I-70 and driving through Pennsylvania on I-80. Additionally, if you are on I-80 in Illinois, you are connected to neighbor states Iowa and Indiana but not neighbor states Missouri and Wisconsin, and therefore, one might expect to see Iowa and Indiana plates. We ignored this and just noted that you would be in Illinois, which gives the likelihood of seeing license plates from other states regardless of “route distance.”

My next post summarizes the model, the assumptions, and the results.

Have you tallied license plates on road trips? What do you think are the salient aspects of this problem to include in a probability model?


what are the odds of winning the lottery two times?

A Chicago area man won the lottery for the second time. The Chicago Tribune reports:

Scott Anetsberger duplicated his $1 million win of nine years ago in the same instant Merry Millionaire game, lottery spokesman Mike Lang said.

Despite long odds, Anetsberger isn’t the first two-time $1 million instant winner. Kimberly Pleticha of Villa Park won $1 million twice in the instant Cash Jackpot game–the first time in August 2010 and the second only six months later in February.

Lottery officials could not instantly compute the odds against multiple winners, but did note there have been a dozen or more two-time Little Lotto winners over the years.

What would the odds of winning the lottery twice be? Well, it depends on how frequently one plays the lottery.

Winning the Illinois Lottery requires picking six correct numbers, where the numbers range from 1 to 52. The odds of getting all six numbers correct are 1 in 20,358,520. It costs $0.50 to play the lottery, and there are three lotteries per week. Assuming that each lottery is independent (a reasonable assumption), one would have to play the lottery 20,358,520 times, on average, to win (using the geometric distribution). If one plays the lottery three times per week, then it would take 130,500 years to win the lottery once at a cost of more than $10M.

Winning the lottery twice can be modeled as a negative binomial random variable. Assuming that our lottery winner plays the lottery three times per week before and after winning the lottery, then it takes ~261,000 years, on average, to win twice.

Since it is only newsworthy to report additional wins by those who have already won the lottery, we are really only interested in the odds that a lottery winner would win the lottery again. This is a different question. Assuming that our lottery winner continues to play the lottery three times per week, the odds of winning again are the same as the odds of someone else winning the lottery for the first time: 1 in 20,358,520 per lottery. That is, it would take our lottery winner an additional 130,500 years, on average, to win the lottery again.

If someone plays the lottery more than three times per week, then the odds of winning go up.

Of course, many people play the lottery, so the odds that someone wins the lottery twice over their lifetime is much, much higher. I tell my students every semester, “Someone will win the lottery. Just not you.” If 130,500 people buy one lottery ticket per game, then there would be a two-time winner every 2 years, on average.

Little Lotto involves picking five correct numbers, where the numbers range from 1 to 39. It is easier to win, but it has a lower payout. The odds of winning are 1 in 575,757, which means that one is 35 times as likely to win the Little Lotto as the regular lottery. It would take 3691 years to win Little Lotto once (by playing three times per week) and 7382 years to win it twice.

Given that there have been 12 two-time winners in Little Lotto in its 23 years of existence, there is approximately one two-time winner every two years. Given my assumptions, this would suggest that ~3691 people buy a Little Lotto ticket every time. That seems a bit low to me. But I have a head cold and maybe it has temporarily impaired my mathematical abilities.
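The odds and waiting times quoted for both games can be reproduced with a short sketch. It assumes one ticket per drawing, three drawings per week, and 52 weeks per year; the function names are illustrative.

```python
import math

DRAWS_PER_YEAR = 3 * 52  # three drawings per week


def jackpot_odds(pool, picks):
    """Number of equally likely tickets: choose `picks` numbers from `pool`."""
    return math.comb(pool, picks)


def expected_years_to_win(pool, picks, wins=1):
    """Mean of a negative binomial: wins / p drawings, converted to years."""
    return wins * jackpot_odds(pool, picks) / DRAWS_PER_YEAR


for name, pool, picks in (("Lotto", 52, 6), ("Little Lotto", 39, 5)):
    print(f"{name}: 1 in {jackpot_odds(pool, picks):,}; "
          f"{expected_years_to_win(pool, picks):,.0f} years to win once, "
          f"{expected_years_to_win(pool, picks, wins=2):,.0f} years to win twice")

ratio = jackpot_odds(52, 6) / jackpot_odds(39, 5)
print(f"Little Lotto is about {ratio:.0f} times as likely to pay out per ticket")
print(f"Expected Lotto spend before one win: ${0.50 * jackpot_odds(52, 6):,.0f}")
```

With wins=1 this is the geometric distribution; with wins=2 it is the negative binomial case, which simply doubles the expected wait.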

A seven-time lottery winner’s advice for winning the lottery is to invest more (not less!) of one’s money into buying lottery tickets, as long as one can afford it. He also recommends treating the lottery as a job: the lottery is a skill, and one can improve at it after investing a lot of time. While skill plays a role in playing the lottery (identifying which numbers to pick and identifying which games have the best payoff), I’m pretty sure that this is bad advice. The expected payoff for the lottery is negative, meaning that on average, you are guaranteed to come out behind. The variance in earnings is large, meaning that over many attempts, it is possible that you can come out ahead. But given that one comes out ahead, it would be foolish to attribute one’s success to skill. But maybe I’m missing something.

For the record, I do not recommend gambling or routinely playing the lottery.

For more, read Mike Trick’s post on conditional probabilities and March Madness odds.
