Tag Archives: analytics

data science isn’t just data wrangling and data analysis: on understanding your data

I have often heard rules of thumb, such as data science/analytics is 80% data cleaning/wrangling and 20% data analysis. I’ve seen various versions of this rule of thumb that all have two elements in common:

1) They assign various percentages of time to data cleaning and analysis, with the time allocated to cleaning greatly outweighing the time allocated to analysis.

2) Time is always partitioned into these two categories: cleaning and analysis.

I want to introduce a third category: understanding the data. This is a critically important part of doing data science and analytics. I acknowledge that many data scientists understand their data quite well, so I am not criticizing the entire data science community. Instead, I want to point out and discuss the rich tradition of understanding the data that we have fostered in operations research and highlight the importance of this often overlooked aspect of working with data and building data-driven models. As an applied optimization researcher, I believe that data collection, understanding the data, problem identification, and model formulation are critical aspects of research. These are important training topics for students in my lab. To solve a data-driven model, we need to understand the data and their limitations.

Here are a few ways I have made efforts to understand the data:

  • I have asked questions about the data to subject matter experts (who shared the data with me).
  • I have done ride-alongs and observed service providers in action to see how they collect and record data as well as how they interpret data categories.
  • I have read manuals that describe data sets.
  • I have used summary statistics and other analytical tools to shed light on the distributions and processes that produce the data.
  • Disseminating research findings often results in good questions about the data sources from audience members, which has improved my understanding of the data.
  • I have read other papers related to my problem that describe how data are collected in other settings.

Understanding the data has helped me understand the data’s limitations and apply the data meaningfully in my research:

  • Data sets often have censored data. What is not included in the data set may be critically important. There is no way to know what is not in a data set unless I understand how it was collected.
  • Some data points are nonsensical or misrecorded (e.g., 8 hour ambulance service times). Others are outliers and are important to include in an analysis. Understanding how the data were recorded helps to ensure that the data are used in a meaningful way in the application at hand (see the sketch after this list).
  • Some data points are recorded automatically and others are recorded manually. Both kinds can be high quality or low quality, depending on the setting, employee training, and the tool for automatic collection.
  • Understanding the data is a first line of defense when it comes to data and algorithm bias. Most data sets are biased in that they are not fully representative of the target population or problem, and understanding these biases can help prevent building models that are discriminatory and/or not effective when it comes to the problem at hand.
  • Understanding what data are not included in a data set has resulted in me asking for additional sources of data for my research. Sometimes I have been able to get better data if I ask for it.
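To make the misrecorded-versus-outlier point concrete, here is a minimal sketch. The column names, the tiny example data, and the eight-hour cutoff are hypothetical, chosen only to illustrate separating clearly misrecorded ambulance service times from legitimate outliers before analysis:

```python
import pandas as pd

# Hypothetical EMS data: service time in minutes for each ambulance call.
calls = pd.DataFrame({
    "call_id": [1, 2, 3, 4, 5],
    "service_minutes": [38.0, 52.0, 480.0, 95.0, 41.0],
})

# Domain knowledge (from talking to the agency) suggests that service times
# of 8 hours or more are almost certainly misrecorded, while long-but-plausible
# times (e.g., 95 minutes) are real outliers that belong in the analysis.
MISRECORDED_CUTOFF = 8 * 60  # minutes; assumed threshold for this sketch

misrecorded = calls[calls["service_minutes"] >= MISRECORDED_CUTOFF]
usable = calls[calls["service_minutes"] < MISRECORDED_CUTOFF]

print(f"Dropped {len(misrecorded)} misrecorded record(s); kept {len(usable)}.")
```

The cutoff itself is exactly the kind of judgment that requires understanding how the data were collected; no purely statistical rule can tell you whether a 95-minute call is real.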

Without understanding the data, the analysis could be a matter of garbage in, garbage out. 

This post covers just one of many issues required for avoiding algorithm bias in applying data science/analytics. Colleagues and I shared some of these additional thoughts with Cait Gibbons, who wrote an excellent article about algorithm bias for the Badger Herald. Read her article for more.

 


advanced #analytics for supporting public policy, bracketology, and beyond!

On Monday I gave a keynote talk at the tech conference WiscNet Connections (formerly known as the Future Technologies Conference) in Madison, Wisconsin.

The title of my talk was “Advanced analytics for supporting public policy, bracketology, and beyond!” I talked about advanced analytics as well as my research in aviation security, emergency response, and bracketology. My slides are below.


analytics for governance and justice

In May 2016, the Office of the President released a report entitled “Big Data: A Report on Algorithmic Systems, Opportunity, and Civil Rights” that challenges the idea that data and algorithms are objective and fair. The report outlines President Obama’s plan for identifying and remedying discrimination in data and automated decisions by making data and processes more open and transparent.

This is part of the White House’s plan for data-driven governance. With better data, you can make better decisions. I love it.

President Obama said that “information maintained by the Federal Government is a national asset.” He started data.gov, which is a gateway to government agency data for researchers and the public.

Created as part of the President’s commitment to democratizing information, Data.gov makes economic, healthcare, environmental, and other government information available on a single website, allowing the public to access raw data and use it in innovative ways.

Data.gov began as a tool to reduce government waste, but it has since branched out to meet other goals, such as addressing the aforementioned social justice inequities. The White House created the position of “Chief Data Scientist” and hired DJ Patil to fill it. He has been working on breakthroughs for cancer treatment lately. The White House hosted an “Open Data Innovation Summit” in September 2016 to share best practices for opening up government data. While I applaud the trend toward open data, it is necessary but not sufficient for reducing inequities, informing decisions, and cutting government waste.

I am less familiar with the big wins that data-driven governance has had. Please let me know what they are in the comments. I have no doubt that there are big wins. With better data, we can make better-informed decisions.

Data is a huge topic, and there is a lot of data out there. Government investment in archiving and analyzing data is necessary for breakthroughs to happen. There are a lot of people involved in this effort. My colleague, Dr. Patti Brennan, now heads the National Library of Medicine. The National Library of Medicine houses data to support medical research, and I’m glad we have a Wisconsin ISYE Professor Emeritus and rockstar in charge.


I started this post before the election. I hope the project continues its momentum in the next administration to have an impact. Only time will tell.

 

 

Data topics at data.gov


tips for filling out a statistically sound bracket

Go Badgers!!

Here are a few things I do to fill out my bracket using analytics.

1. Let’s start with what not to do. I usually don’t put a whole lot of weight on a team’s record because strength of schedule matters. Likewise, I don’t put a whole lot of weight on bad ranking tools like RPI that do not do a good job of taking strength of schedule into account.

2. Instead of records, use sophisticated ranking tools. The seeding committee uses some of these ranking tools to select the seeds, so the seeds themselves reflect strength of schedule and implicitly rank teams. Here are a few ranking tools that use math modeling.

I like the LRMC (logistic regression Markov chain) method from some of my colleagues at Georgia Tech. Again: RPI bad, LRMC good.
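I won’t reproduce LRMC here (it pairs a logistic regression step with a Markov chain), but a minimal sketch of the general Markov-chain ranking idea, with made-up game results and a simplified transition rule that is not the actual LRMC model, looks something like this:

```python
import numpy as np

# Toy season: wins[i][j] = number of times team i beat team j (made-up data).
teams = ["A", "B", "C", "D"]
wins = np.array([
    [0, 2, 1, 1],
    [1, 0, 2, 1],
    [0, 1, 0, 2],
    [0, 0, 1, 0],
], dtype=float)

# Random-walk ranking: from team i, move to team j with probability
# proportional to how often j beat i (losses "send" the walker to the winner),
# plus a small damping term so the chain is irreducible.
losses = wins.T                        # losses[i][j] = times i lost to j
row_sums = losses.sum(axis=1, keepdims=True)
n = len(teams)
damping = 0.15
P = damping / n + (1 - damping) * np.divide(
    losses, row_sums, out=np.full_like(losses, 1.0 / n), where=row_sums > 0
)

# Stationary distribution via power iteration: more mass = stronger team.
pi = np.full(n, 1.0 / n)
for _ in range(1000):
    pi = pi @ P
pi /= pi.sum()

for team, score in sorted(zip(teams, pi), key=lambda t: -t[1]):
    print(f"{team}: {score:.3f}")
```

The stationary distribution rewards teams whose wins come against teams that themselves rarely lose, which is exactly the strength-of-schedule effect a raw win-loss record misses.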

3. Survival analysis quantifies how far each team is likely to make it in the tournament. This doesn’t give you insight into team-to-team matchups per se, but you can think of the probability of Wisconsin making it to the Final Four as reflecting a kind of average across the different teams it might play during the tournament.
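This isn’t the survival-analysis model itself, but a hedged sketch of how that averaging works: given pairwise win probabilities from some rating model (the ratings below are made up), a team’s chance of surviving a round is its chance of reaching it times a weighted average over the opponents it might face there.

```python
import numpy as np

# Toy single-elimination region with 8 teams in bracket positions 0..7
# (0 plays 1, 2 plays 3, ...). win_prob[i][j] is an assumed probability
# that team i beats team j, derived here from made-up ratings.
rng = np.random.default_rng(0)
n = 8
rating = rng.uniform(0.4, 0.9, n)
win_prob = rating[:, None] / (rating[:, None] + rating[None, :])

rounds = int(np.log2(n))
reach = np.ones((n, rounds + 1))   # reach[i][r] = P(team i wins its first r games)
for r in range(1, rounds + 1):
    for i in range(n):
        # Potential round-r opponents sit in the same block of 2**r positions,
        # but in the other half of that block.
        opponents = [j for j in range(n)
                     if j >> r == i >> r and j >> (r - 1) != i >> (r - 1)]
        reach[i, r] = reach[i, r - 1] * sum(
            reach[j, r - 1] * win_prob[i, j] for j in opponents)

for i in range(n):
    print(f"Team {i}: rating {rating[i]:.2f}, "
          f"P(win this region) = {reach[i, rounds]:.3f}")
```

The inner sum, weighted by how likely each possible opponent is to be there, is the “kind of average” referred to above.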

4. Look at the seeds. Only once did all four 1-seeds make the Final Four. It’s a tough road. Seeds matter a lot in the rounds of 64 and 32, not so much after that point. There will be upsets. Some seed matchups produce more upsets than others. The 7-10 and 5-12 matchups are usually good to keep an eye on.

5. Don’t ignore preseason rankings. The preseason rankings are educated guesses about who the best teams are before any games have been played. It may seem silly to consider preseason rankings at the end of the season after all games have been played (when we have much better information!), but the preseason rankings seem to reflect some of the intangibles that predict success in the tournament (a team’s raw talent or athleticism).

6. Math models are very useful, but they have their limits. Math models implicitly assume that the past is good for predicting the future. This is not usually a good assumption when a team has had any major changes, like injuries or suspensions. You can check out crowdsourcing data (who picked whom in a matchup), expert opinion, and things like injury reports to make the final decision.

For more reading:


eradicating polio through vaccination and with analytics

The most recent issue of Interfaces (Jan-Feb 2015, 45(1)) has an article about eradicating polio published by Kimberly M. Thompson, Radboud J. Duintjer Tebbens, Mark A. Pallansch, Steven G.F. Wassilak, and Stephen L. Cochi from Kid Risk, Inc., and the U.S. Centers for Disease Control and Prevention (CDC). This paper develops and applies a few analytics models to inform policy questions regarding the eradication of polioviruses (polio) [Link to paper].

The article is timely given that vaccination is in the news again. At least this time, the news is fueled by outrage over GOP Presidential contenders Chris Christie and Rand Paul’s belief that parents should have the choice to vaccinate their children [example here].

Polio has essentially been eradicated in the United States, but polio has not been eradicated in the developing world. The Global Polio Eradication Initiative (GPEI) helped to reduce the number of paralytic polio cases from 350,000 in 1988 to 2,000 in 2001. This enormous reduction has mainly been achieved through vaccination. There are two types of vaccines: the live oral poliovirus vaccine (OPV) and the inactivated poliovirus vaccine (IPV). Those who have been vaccinated have lifelong protection but can participate in polio transmission.

The paper summarizes a research collaboration that occurred over a decade and was driven by three questions asked by global policy leaders:

  • What vaccine (if any) should countries use after wild poliovirus (WPV) eradication, considering both health and economic outcomes?
  • What risks will need to be managed to achieve and maintain a world free of polio?
  • At the time of the 1988 commitment to polio eradication, most countries expected to stop polio vaccinations after WPV eradication, as had occurred for smallpox. Would world health leaders still want to do so after the successful eradication of WPVs?

The paper is written at a fairly high level, since it summarizes about a decade of research that has been published in several papers. They ended up using quite a few methodologies to answer quite a few questions, not just about routine immunization. Here is a snippet from the abstract (emphasis mine):

Over the last decade, the collaboration innovatively combined numerous operations research and management science tools, including simulation, decision and risk analysis, system dynamics, and optimization to help policy makers understand and quantify the implications of their choices. These integrated modeling efforts helped motivate faster responses to polio outbreaks, leading to a global resolution and significantly reduced response time and outbreak sizes. Insights from the models also underpinned a 192-country resolution to coordinate global cessation of the use of one of the two vaccines after wild poliovirus eradication (i.e., allowing continued use of the other vaccine as desired). Finally, the model results helped us to make the economic case for a continued commitment to polio eradication by quantifying the value of prevention and showing the health and economic outcomes associated with the alternatives. The work helped to raise the billions of dollars needed to support polio eradication.

The following figure from the paper summarizes some of the problems addressed by the research team. The problems involved everything from stockpiling vaccines to administering vaccines for routine immunization to containing outbreaks:

A decision tree showing the possible options for preventing and containing polio. From “Polio Eradicators Use Integrated Analytical Models to Make Better Decisions” by Kimberly M. Thompson, Radboud J. Duintjer Tebbens, Mark A. Pallansch, Steven G.F. Wassilak, and Stephen L. Cochi in Interfaces.

I wanted to include one of the research figures used in the paper that helped guide policy and obtain funding. The figure (see below) is pretty interesting. It shows the costs, in terms of dollars ($) and paralytic polio cases, associated with two strategies over a 20-year horizon: (1) intense vaccination until eradication or (2) intense vaccination only until it’s “cost effective” (routine immunization). The simulation results show that the cumulative costs (in dollars or lives affected) are much, much lower over a 20-year time horizon under the vaccination-until-eradication strategy. This helped to make a big splash. From the paper:

In a press release related to this analysis, Dr. Tachi Yamada, then president of the Bill & Melinda Gates Foundation’s Global Health Programs, stated: “This study presents a clear case for fully and immediately funding global polio eradication, and ensuring that children everywhere, rich and poor, are protected from this devastating disease.” In 2011, Bill and Melinda Gates made polio eradication the highest priority for their foundation.

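To see why that figure makes such a compelling case, here is a toy cumulative-cost comparison. The incidence, decline rate, and cost numbers below are invented for illustration and are not from the paper or its integrated dynamic models; the point is only the shape of the comparison: pay more up front and cases eventually stop, or pay less each year and case costs accumulate indefinitely.

```python
# Toy illustration (made-up numbers, not the authors' model): compare the
# 20-year cumulative cost of two policies when one of them eventually
# drives incidence toward zero.
YEARS = 20
COST_PER_CASE = 10_000          # assumed treatment / societal cost per case

def cumulative_cost(annual_vaccination_cost, initial_cases, annual_decline):
    """Accumulate vaccination spending plus case costs year by year."""
    total, cases = 0.0, float(initial_cases)
    for _ in range(YEARS):
        total += annual_vaccination_cost + cases * COST_PER_CASE
        cases = max(0.0, cases * (1 - annual_decline))
    return total

# Intense vaccination: expensive every year, but incidence collapses quickly.
eradication = cumulative_cost(annual_vaccination_cost=5e6,
                              initial_cases=2000, annual_decline=0.6)
# Routine immunization only: cheaper per year, but cases persist.
routine = cumulative_cost(annual_vaccination_cost=2e6,
                          initial_cases=2000, annual_decline=0.0)

print(f"Eradication policy, 20-yr cost:  ${eradication:,.0f}")
print(f"Routine-only policy, 20-yr cost: ${routine:,.0f}")
```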

In full disclosure, I’m a big fan of immunization. All of my children are fully vaccinated. My grandmother was born in 1906 and used to tell stories about relatives, so many of whom ultimately died of infectious diseases (Grandma lived until she was ~102!). I’m glad my kids don’t have to worry about getting many of these diseases. I’m also proud to contribute to herd immunity. We come in contact with people who have compromised immune systems or could not get immunized, and I’m glad we’re playing our part in keeping everyone else healthy. Part of the reason why vaccination is challenging is that social networks play a critical role in disease transmission. Even if enough people have been vaccinated in aggregate to obtain herd immunity in theory, it may not be enough if there are hot spots of unvaccinated children who can cause outbreaks. There are hot spots in some areas in California and other states that have generous exemption policies.

Related posts:

 

 


analytics vs. operations research

Former INFORMS President Anne Robinson recently talked about operations research and analytics in a YouTube video. As President of INFORMS, she did a lot of work to promote analytics in the OR/MS community and to understand perceptions of analytics vs. operations research. I really appreciate what Anne has done for our field. Her research efforts found that people perceived operations research as a toolbox, whereas analytics was perceived as an end-to-end process for data discovery, problem formulation, implementation, execution, and value delivery. This is an interesting finding.

This is Anne’s answer to the question: What is the role of OR in analytics?

“Operations research is on the top of the food chain when it comes to analytic capabilities and potential game changing results.”

I love this.

Anne’s challenge is for us to make decision-makers understand that OR is as vital and necessary as analytics. Evangelize early and often. Given the popularity of analytics, we should be able to make some inroads in educating our peers in STEM about OR, but in the long run this may be tough to do. Analytics is the new kid on the block, and it already seems to have reached widespread adoption, whereas operations research, while being at the top of the food chain, is still somewhat of a mystery to those outside of our fairly small field. Operations research has had an identity crisis for a long time, and I don’t see that coming to an end.

I applaud INFORMS’ decision to “own” analytics via Analytics Certification and the Conference on Business Analytics (formerly the INFORMS Practice Conference) rather than trying to solely market “operations research” to the growing analytics crowd.

What do you think?

 


operations research at Disney

Kristine Theiler, Vice President of Planning and Operations Support for Walt Disney Parks & Resorts, gave a talk in the ISyE department at the University of Wisconsin-Madison when she was awarded the Distinguished Achievement Award. Ms. Theiler has a BS degree in ISyE from UW-Madison.

She leads an internal consulting team that provides decision support for leadership worldwide. She gave a wonderful talk to the students about industrial engineering at Disney. Her team includes more than just industrial engineers and is increasingly focusing on operations research. Her team has worked on the following issues:

  • food and beverage (beer optimization!)
  • park operations: attraction performance, operation at capacity, efficiency programs
  • hotel optimization: front desk queuing, laundry facility optimization
  • project development: theme park development, new products and services, property expansion
  • operations: cleaning the rides and park, horticulture planning
  • operations research: forecasting, simulation

Ms. Theiler showed us her “magic band,” a bracelet that links together the services that a park-goer (a guest) has purchased, as well as her room key and possibly her credit card (with a security code), to optimize efficiency. Guests can choose one of seven bracelet colors. This may facilitate personalization, aka Minority Report. The magic band is in production.

She also noted that guests at Disney Tokyo are willing to wait longer than guests at any other Disney park. Interesting.

Disney works on four key competencies that mesh well with tools in the OR toolbox:

  1. Capacity/demand analysis
  2. Measuring the impact (guest flow, wait times, transaction times)
  3. Process design and improvement
  4. Advanced analytics

The planning for Shanghai Disneyland is underway. Some of the relevant project planning includes decisions such as where to locate the park. Once a site is selected, the IEs will plan train lines between locations; how many ticket booths, turnstiles, and strollers will be needed; how to select the mix of attractions and lay them out; how many tables and chairs are needed; what is the right mix of indoor and outdoor tables; how much merchandise space to set aside; how to route parades; how to handle the “dumps” that happen when a show lets out; how to locate your favorite Disney characters (played by actors) for photo ops; how to plan backstage areas to coordinate complex shows; and how to locate and run hotel services.
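Many of those sizing questions (how many ticket booths, for instance) come down to classic queueing calculations. Here is a minimal sketch using the standard M/M/c (Erlang C) formulas with assumed arrival and service rates, not actual Disney data:

```python
from math import factorial

def erlang_c_wait(arrival_rate, service_rate, servers):
    """Expected wait in queue (same time units as the rates) for an M/M/c queue."""
    a = arrival_rate / service_rate          # offered load
    rho = a / servers                        # utilization; must be < 1 for a stable queue
    if rho >= 1:
        return float("inf")
    # Erlang C: probability an arriving guest has to wait.
    p_wait = (a**servers / factorial(servers)) / (
        (1 - rho) * sum(a**k / factorial(k) for k in range(servers))
        + a**servers / factorial(servers))
    return p_wait / (servers * service_rate - arrival_rate)

# Assumed numbers for illustration: 600 guests/hour, 80 guests/hour per booth.
arrival, service = 600.0, 80.0
for booths in range(8, 13):
    wq_minutes = erlang_c_wait(arrival, service, booths) * 60
    print(f"{booths} booths: average wait of about {wq_minutes:.1f} minutes")
```

Sweeping the number of booths shows how quickly waiting falls once utilization drops comfortably below one, which is the kind of trade-off curve a planning team would present.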

Training scheduling optimization for the cruise lines was one of the more technical projects. There are many side constraints and stochastic issues for the 1500 people that may need to be trained at any given time. These include precedence constraints (fire class 1 must be taken before fire class 2), time windows (fire drills can only be run on Tuesdays from 9-11), and attendance randomness (employees and class leaders get sick), so contingency plans are a must.
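To give a flavor of the constraints involved, here is a toy sketch with made-up classes, prerequisites, and time windows that greedily assigns each class to the earliest feasible slot; the real project would be an optimization model that also handles attendance randomness and contingency plans.

```python
# Toy illustration of precedence constraints and time windows in training
# scheduling. Slots are (week, label); classes may only run in their listed
# slots, and a class cannot be scheduled until after its prerequisite.
classes = ["fire 1", "fire 2", "first aid"]
prereq = {"fire 2": "fire 1"}                       # fire 1 before fire 2
allowed_slots = {
    "fire 1":    [(1, "Tue 9-11"), (3, "Tue 9-11")],
    "fire 2":    [(2, "Tue 9-11"), (4, "Tue 9-11")],
    "first aid": [(1, "Thu 13-15"), (2, "Thu 13-15")],
}

schedule = {}
for c in classes:                                   # assumes prerequisites appear first
    earliest_week = 0
    if c in prereq:
        earliest_week = schedule[prereq[c]][0]      # must come strictly after the prereq's week
    feasible = [s for s in allowed_slots[c] if s[0] > earliest_week]
    if not feasible:
        raise ValueError(f"No feasible slot for {c}")
    schedule[c] = min(feasible)                     # earliest feasible slot

for c, (week, label) in schedule.items():
    print(f"{c}: week {week}, {label}")
```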

Operations research and industrial engineering are obviously valuable at Disney. One of the main benefits of using advanced analytical methods is that they bring an unbiased perspective. It’s much easier to bring up a difficult issue when you discuss it from a numbers perspective rather than first stating your opinions. Analytics also provides a way to “connect the dots” between services: more people attending a show may lead to an increased need for merchandise space near the show’s exits.

Shanghai Disney