I have often heard rules of thumb such as: data science/analytics is 80% data cleaning/wrangling and 20% data analysis. I’ve seen various versions of this rule of thumb that all have two elements in common:
1) They assign various percentages of time to data cleaning and analysis, with the time allocated to cleaning greatly outweighing the time allocated to analysis.
2) Time is always partitioned into these two categories: cleaning and analysis.
I want to introduce a third category: understanding the data. This is a critically important part of doing data science and analytics. I acknowledge that many data scientists understand their data quite well, so I am not criticizing the entire data science community. Instead, I want to highlight the rich tradition of understanding the data that we have fostered in operations research and to discuss the importance of this often overlooked aspect of working with data and building data-driven models. As an applied optimization researcher, I believe that data collection, understanding the data, problem identification, and model formulation are critical aspects of research, and they are important training topics for students in my lab. To solve a data-driven model, we need to understand the data and their limitations.
Here are a few ways I have made efforts to understand the data:
- I have asked questions about the data to subject matter experts (who shared the data with me).
- I have done ride-alongs and observed service providers in action to see how they collect and record data as well as how they interpret data categories.
- I have read manuals that describe data sets.
- I have computed summary statistics and used other analytical tools to shed light on the distributions and processes that produce the data.
- Disseminating research findings often prompts good questions about the data sources from audience members, which has improved my understanding of the data.
- I have read other papers related to my problem that describe how data are collected in other settings.
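As a minimal sketch of the summary-statistics step above, here is how a quick comparison of mean and median can flag suspect records. The service times below are synthetic values I made up for illustration; they are not data from the post.

```python
import statistics

# Hypothetical ambulance service times in minutes (synthetic data for illustration).
# One record (480 minutes, i.e., 8 hours) is implausibly long.
service_times = [12.5, 15.0, 14.2, 13.8, 480.0, 16.1, 12.9, 15.5, 14.0, 13.3]

mean = statistics.mean(service_times)
median = statistics.median(service_times)

# A large gap between the mean and the median hints at skew or misrecorded
# values, prompting questions about how the data were collected.
print(f"mean={mean:.1f} min, median={median:.1f} min")
```

Summary statistics like these do not explain the data on their own, but they point to the questions worth asking subject matter experts.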
Understanding the data has helped me understand the data’s limitations and apply the data meaningfully in my research:
- Data sets often have censored data. What is not included in the data set may be critically important. There is no way to know what is not in a data set unless I understand how it was collected.
- Some data points are nonsensical or misrecorded (e.g., 8-hour ambulance service times). Others are outliers that are important to include in an analysis. Understanding how the data were recorded helps to ensure that the data are used in a meaningful way in the application at hand.
- Some data points are recorded automatically and others are recorded manually. Both kinds can be high quality or low quality, depending on the setting, employee training, and the tool for automatic collection.
- Understanding the data is a first line of defense against data and algorithm bias. Most data sets are biased in that they are not fully representative of the target population or problem, and understanding these biases can help prevent building models that are discriminatory and/or ineffective for the problem at hand.
- Understanding what data are not included in a data set has led me to ask for additional sources of data for my research. Sometimes I have been able to get better data simply by asking for it.
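The point about misrecorded values versus genuine outliers can be sketched in code. Assuming a hypothetical cutoff of six hours for plausible service times (my assumption, not a rule from the post), suspect records are flagged for review rather than silently dropped:

```python
# Hypothetical sanity check for misrecorded service times; the 6-hour cutoff
# is an assumed threshold for illustration, not a rule from the post.
MAX_PLAUSIBLE_MINUTES = 6 * 60

# Synthetic records made up for this example.
records = [
    {"id": 1, "service_minutes": 14.2},
    {"id": 2, "service_minutes": 480.0},  # likely misrecorded (8 hours)
    {"id": 3, "service_minutes": 15.5},
]

def flag_suspect(records, max_minutes=MAX_PLAUSIBLE_MINUTES):
    """Split records into plausible and suspect instead of silently dropping any."""
    plausible = [r for r in records if r["service_minutes"] <= max_minutes]
    suspect = [r for r in records if r["service_minutes"] > max_minutes]
    return plausible, suspect

plausible, suspect = flag_suspect(records)
print(f"{len(suspect)} suspect record(s) flagged for review")
```

Flagging rather than deleting matters: only someone who understands how the data were collected can decide whether a flagged record is an error or a genuine, important outlier.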
Without understanding the data, the analysis could be a matter of garbage in, garbage out.
This post covers just one of many issues required for avoiding algorithm bias in applying data science/analytics. Colleagues and I shared some of these additional thoughts with Cait Gibbons, who wrote an excellent article about algorithm bias for the Badger Herald. Read her article for more.