Mind The Gap
- Deandra Cutajar
- Aug 4, 2022
- 7 min read
Updated: May 3, 2023
Data scientists aim to build a fantastic model that is reliable, accurate, correct and trusted. Such confidence can only be achieved if the data used by the model is equally reliable, accurate, correct and trusted. In the real world, this is hardly ever (to my knowledge) the case, and most of a data scientist's work revolves around cleaning the data to improve its quality.
A lot of effort goes into improving data quality, and the problems arise from many sources. While a data scientist dreams of having impeccable data, the truth remains that we usually start working with raw data and work our way up. Some companies have even developed pipelines and software applications seeking to alleviate the nuisance of data cleaning so that the data scientist can focus on data science.
Whilst some standard data cleaning methods exist and work wonderfully on specific datasets, one must be mindful of which technique suits the data at hand. In this article, I will demonstrate a few reasons why, and some of the ways I work around it.

Illustration by Geoff Sence
Missing data describes the case in which no observation is recorded in the dataset. It is not erroneous, anomalous or even abnormal data.
We can infer nothing about missing data except by understanding how and why it is missing.
Once missing data is identified, a data scientist tries to understand whether there is a pattern. For example, suppose I have a dataset of people's expenses, such as the following:
| Date | Time | Person A: Money Spent (€) | Person B: Money Spent (€) |
| --- | --- | --- | --- |
| 18 July | 12:00 | 10.00 | |
| 19 July | 12:00 | | 15.00 |
| 20 July | 13:00 | | 300.00 |
| 21 July | 19:00 | 30.00 | |
| 23 July | 11:00 | 120.00 | 120.00 |
Table 1: A dataset about two individuals' expenses.
The above table shows how two people spent different amounts of money across five days. Person A spent € 10 on July 18th, whilst Person B did not. Conversely, Person B spent € 300 on July 20th, whereas Person A did not. The empty cells are understood to be missing data, but in this case, missing data means no data. I'll explain why later.
In the second part of Accurate vs Correct, I used the Titanic dataset, which had missing information across different columns.

Figure 1: Titanic data showing missing data.
NaN is how most programming languages represent missing information. It means Not-A-Number, and with this representation, it is easy for those using the data to spot these rows. However, caution must be taken since some datasets represent missing numbers with Inf or with a large sentinel value such as 999 or -999.
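As a concrete illustration, the snippet below converts such placeholder values into proper NaNs so that pandas recognises them as missing. The column name and the sentinel values are hypothetical; check which conventions your own dataset uses before replacing anything.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which missing ages were encoded as -999
df = pd.DataFrame({"age": [22.0, 38.0, -999.0, 26.0, -999.0, 35.0]})

# Turn sentinel values and infinities into NaN so that isna() can see them
df["age"] = df["age"].replace([999.0, -999.0, np.inf, -np.inf], np.nan)

print(df["age"].isna().sum())  # 2 values are now flagged as missing
```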
Figure 1 shows that survived (the target variable in the article) has missing data too. Without knowing whether a passenger survived, that data point is useless. Depending on the % of missing data in the target variable, a data scientist has some options:
- If the percentage of missing data is low, the rows with missing data can be dropped from the dataset. The threshold is arbitrary, but I apply 5% as a rule of thumb (see the sketch after this list).
- Try to find the missing information using other datasets. This may take time, but it is sometimes possible to find other data sources describing the same information with better coverage.
- Fill in the gaps using data imputation methods, such as those I share below.
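For the first option, a minimal pandas sketch could look like the one below. The 5% threshold and the column name survived come from the discussion above; the function name is mine and purely illustrative.

```python
import pandas as pd

def drop_missing_target(df: pd.DataFrame, target: str, threshold: float = 0.05) -> pd.DataFrame:
    """Drop rows with a missing target only if they are a small fraction of the data."""
    missing_fraction = df[target].isna().mean()  # share of rows with no target value
    if missing_fraction <= threshold:
        return df.dropna(subset=[target])
    return df  # above the threshold, keep the rows and consider imputation instead

# Example usage (titanic_df is assumed to be loaded elsewhere):
# titanic_df = drop_missing_target(titanic_df, target="survived")
```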
Data Imputation
Data imputation is a technique to replace non-existent or inconsistent information with estimates that better represent the underlying truth.
In Table 1, I noted that the missing data means no data. Since a missing entry is synonymous with no spending, it can be replaced with € 0: no record for a day is equivalent to saying that the person spent € 0 on that day.
Replacing missing data with 0 is a common practice but can be tricky.
Remember that by replacing missing data with 0, a data scientist is shifting the statistical distribution.
Figure 2: Histogram charts showing the distribution of variable x before filling the missing data with 0 (blue) and after replacing missing values with 0 (red).
The distribution statistics changed. The average x went from 29 to 23 due to the imputation. This may seem subtle and insignificant, but there are times when the change in the statistics can be dramatic, and the underlying information from the raw data is either changed or lost.
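The effect is easy to reproduce on a toy series. The numbers below are made up rather than the article's data, but they show the same kind of shift.

```python
import numpy as np
import pandas as pd

x = pd.Series([25, 31, 40, np.nan, 28, np.nan, 33, 17, np.nan, 36])

filled = x.fillna(0)  # replace every missing value with 0

print(round(x.mean(), 1), round(x.std(), 1))            # 30.0, 7.6  (observed values only)
print(round(filled.mean(), 1), round(filled.std(), 1))  # 21.0, 15.8 (mean drops, spread widens)
```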
Conserving the distribution of the raw data is essential since a data scientist wants the model to learn the correct information. Changing the data unknowingly* through miscalculated imputation may lead to great results that describe scenarios which differ from the real world.
*NOTE: changing the data to make your model work is neither accurate nor correct. It's false.
The information harboured by the raw data needs to be conserved as much as possible.
One way to ensure that the underlying statistics do not change too much is to fill the missing information with the average of a particular variable, or with some other statistic that the data scientist deems appropriate for the problem. However, one must understand what that specific choice means for the data.
Referring to the Titanic dataset (Figure 1), the variable age describes how old a passenger was when boarding the ship. Thus, replacing any missing information about age with 0, as I suggested with the payments, is incorrect. Firstly, an age of 0 means that someone is a newborn, so one would expect them to be travelling with a parent, i.e. parch should not be 0 if age is 0. This is not the case. Secondly, a passenger whose title is Mrs is undoubtedly older than 18.
Instead, I could have replaced the missing data with the average age of 29 years. Whilst this would conserve the average, it would still not be correct.
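For completeness, mean imputation is a one-liner in pandas. The series below is an illustrative stand-in for the age column, not the actual Titanic data.

```python
import numpy as np
import pandas as pd

age = pd.Series([22, 38, np.nan, 26, 35, np.nan, 54, 2, 27, np.nan])

mean_age = age.mean()               # computed from the observed values only
age_filled = age.fillna(mean_age)   # every gap receives the same constant

print(round(mean_age, 1), round(age_filled.mean(), 1))  # 29.1, 29.1 — the average is preserved
print(round(age.std(), 1), round(age_filled.std(), 1))  # 16.0, 13.1 — but the spread shrinks
```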
I replaced the missing data in age by researching what the title in the name meant in 1912. From the research, I found the following:
In 1910, the average marriage age was 25 years for a man and 21 years for a woman.
'Master' was commonly used for boys and young men under 18.
Using the above, together with how many siblings and spouses (sibsp) or parents and children (parch) they travelled with, I could identify which passengers should have ages ranging from 18 onwards or under 18. For each of these cases, I then chose a random number from the ranges I set according to the above logic to fill in the missing data. This is far from perfect but is more accurate than generalising.
For example, a passenger whose title is Master may be some months old, but another passenger whose title is Mr, travelling with his spouse (sibsp = 1) and his three children (parch = 3) is older than that. Whilst the average of 29 years would suffice for the second example, it would be incorrect for the former.
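Below is a rough sketch of that logic in pandas. The exact age ranges and the uniform random draws are illustrative assumptions on my part; the article only specifies that Master implies under 18 and that married adults in 1910 were roughly 21 to 25 or older.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def impute_age(row):
    """Fill a missing age from the passenger's title and family information."""
    if not pd.isna(row["age"]):
        return row["age"]
    name = row["name"]
    if "Master" in name:
        return rng.uniform(0, 18)    # 'Master' was used for boys under 18
    if "Mrs" in name:                # check 'Mrs' before 'Mr' (substring overlap)
        return rng.uniform(21, 60)   # married women, around 21 or older in 1910
    if "Mr" in name and (row["sibsp"] > 0 or row["parch"] > 0):
        return rng.uniform(25, 60)   # men travelling with family, around 25 or older
    return rng.uniform(18, 60)       # otherwise assume an adult (illustrative fallback)

# Example usage (df assumed to contain 'name', 'age', 'sibsp' and 'parch' columns):
# df["age"] = df.apply(impute_age, axis=1)
```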
Missing data, or data quality in general, is a pressing issue in data-driven professions and, more importantly, in businesses and industries. Without data, every decision is based on gut feeling or limited learning from experience. Ensuring a reliable dataset and finding the best technique to enrich it, without changing the information it harbours in its raw form, requires patience and an understanding of the limitations of the machine learning algorithms data scientists intend to use. Ultimately, a data scientist wants to listen to the data and not have it repeat what we want it to say. After all, Data Speaks.
Data Imputation - Advanced
The above methods are simple. A more advanced way of imputing the data involves learning from the data itself using probability distributions or even machine learning itself.
Suppose a data scientist works for a clothes retailer and has data for every purchase a client makes. The data includes:
- Name (optional)
- Age (optional)
- Postcode (optional)
- Email (optional)
- Shop Outlet
- Date of Purchase
- Size of Purchase
- Item
- Amount (€)
The Name, Age and Postcode are optional, allowing the customer to choose whether to share that information. Email may also be optional, but most customers fill it in to receive the receipt digitally instead of on paper, to receive offers, or to collect rewards on every purchase in the hope of a later discount.
Upon seeing the data, the retail manager or the company decides to understand their typical customers. In data science, this is referred to as profiling, and it would tell a story similar to this:
Customers aged between 28 and 30 are likely to visit on weekends, purchase sizes between Medium and Large, and spend about € 300 a month.
The above can be a profile across all outlets or for one particular shop, depending on the location and the lifestyle in that area. However, to be able to tell the average age or distinct ages of the clientele, the data must contain enough information about the age of the customers. Not all customers fill in their age, so missing data would be a concern.
One advanced way to fill in missing data is to describe the histogram of age with a mathematical formula usually used in probability. A histogram like the one in Figure 2 could be approximated by a probability distribution. Once the parameters of that distribution are estimated, a data scientist can replace each missing value for age by randomly sampling from it. This technique's advantage is that it preserves the overall shape of the actual distribution, since the missing data is filled with different quantities sampled from the distribution rather than with a constant value.
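One way to sketch this in code is to fit a candidate distribution to the observed ages with scipy and sample from it to fill the gaps. The normal distribution and the toy ages below are purely illustrative; in practice one would first check which distribution approximates the histogram best.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical customer ages with gaps
age = pd.Series([23, 29, 31, np.nan, 27, 35, np.nan, 30, 28, np.nan])

observed = age.dropna()
mu, sigma = stats.norm.fit(observed)          # estimate the distribution's parameters

n_missing = age.isna().sum()
samples = stats.norm.rvs(loc=mu, scale=sigma, size=n_missing, random_state=0)

age.loc[age.isna()] = samples                 # each gap gets a different sampled value
```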
Another technique involves predicting the missing values with a machine learning method by comparing the similarity of one data point with others. For example, if two rows are similar, one can assume the customers are of a similar age. However, this would assign the same age to all customers who bought the same size, from the same outlet, and spent the same amount. Nonetheless, it is more reliable than filling missing values with a constant.
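scikit-learn's KNNImputer is one off-the-shelf way to perform this kind of similarity-based filling. The features below are a made-up stand-in for the retail data, with purchase size already encoded as a number.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "size_encoded": [1, 2, 2, 3, 1, 2],            # S=1, M=2, L=3 (hypothetical encoding)
    "amount":       [50, 120, 115, 300, 55, 118],  # spend per purchase in €
    "age":          [24, 31, np.nan, 45, np.nan, 30],
})

# Each missing age is estimated from the two most similar rows
imputer = KNNImputer(n_neighbors=2)
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```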
Truthfully, there exist numerous ways to intelligently fill in the missing information. Different methods may sometimes give a similar performance; in that case, simple techniques are preferred over complex ones. However, the data has a story to tell, and whatever way a data scientist chooses to listen, no information should be lost in "translation". A good model is as good as the data it learns from.



