The Horrors of Bad Techniques
- Deandra Cutajar
- Jan 13
- 6 min read

When I started my data science career, I quickly learned that the eight years I had spent understanding the techniques, theorems, lemmas, corollaries and laws of mathematics and physics were well spent. Nonetheless, as I began adapting my skills to the industry's data science realm, I started having "nightmares", metaphorically speaking.
I had no industrial experience when I started my first job in data, so I took online courses to catch up on Python (I did my PhD in MATLAB). The theory was straightforward for me. I could build machine learning models from scratch, but what I couldn't do was optimise the algorithms and exploit the laptop's or PC's computational power for performance, which I often refer to as engineering. So, I set out to learn that.
I also began realising that there are some standard practices that, to make matters worse, are dubbed 'best practices', and I understood that my role would not be that of a classic data scientist, a hybrid between engineer and software developer. My role would be that of a pure scientist who adapts the theory to the problem and chooses techniques accordingly.
In this article, I will mention some of the "horrors" that I encountered in data science in the hopes that online courses and aspiring data scientists will learn that
what is popular or trendy is not always correct and accurate.
Let's start with numerical techniques and some of their horrors.
Case 1: Imputation
Imputation describes a technique whereby data scientists try to fill in missing data with approximate values. I have already written an article, Mind the Gap, explaining how standard techniques can lead to skewed data, bias and, in some cases, data that no longer represents the actual truth.
Most of these standard techniques are used because they are easy to implement and explain without much thought. But this poses a problem: without considering the particular use case, any technique applied will be superficial. You get it right if you are lucky, but most cases require further adaptation.
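As a minimal sketch of what that adaptation can look like, assume a hypothetical dataset where income is bimodal across two segments and the gaps sit mostly in one of them. A blanket mean-fill (the easy-to-explain "best practice") behaves very differently from a segment-aware fill:

```python
import numpy as np
import pandas as pd

# Hypothetical data: income differs strongly by segment, and the missing
# values happen to be concentrated in the "junior" segment only.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "segment": ["junior"] * 500 + ["senior"] * 500,
    "income": np.concatenate([
        rng.normal(25_000, 3_000, 500),   # juniors
        rng.normal(80_000, 10_000, 500),  # seniors
    ]),
})
mask = (df["segment"] == "junior") & (rng.random(len(df)) < 0.3)
df.loc[mask, "income"] = np.nan

# Naive fill: every gap gets the overall mean, an income (~57k here)
# well above anything a junior in the data actually earns.
naive = df["income"].fillna(df["income"].mean())

# A use-case-aware alternative: impute within the segment the row belongs to.
grouped = df.groupby("segment")["income"].transform(lambda s: s.fillna(s.median()))

print(round(naive[mask].mean()), round(grouped[mask].mean()))  # ~57k vs ~25k
```

The point is not that group-wise medians are the answer; it is that the right fill depends on why the data is missing and what it will be used for.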
Case 2: Normalisation
In general, models or insights are better understood when transformed into a score between 0-100, such as a percentage. Converting insights into a business-understood metric is a sought-after practice and makes the conversations between data and domain experts flow smoothly. Moreover, converting a metric to a percentage can help with auditing, automated flows and data quality monitoring. For example, a message saying "2% of the data is missing" is more informative than "15,000 values are missing", where the latter loses the context of the overall volume.
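As a small illustration, a hypothetical helper for such a data quality message might report the percentage alongside the raw count so that neither piece of context is lost:

```python
import pandas as pd

def missing_message(series: pd.Series) -> str:
    """Build a data-quality message that keeps both the percentage and the volume."""
    n_missing = int(series.isna().sum())
    pct = 100 * n_missing / len(series)
    return f"{pct:.1f}% of '{series.name}' is missing ({n_missing:,} values)"

# e.g. missing_message(df["amount"]) -> "2.0% of 'amount' is missing (15,000 values)"
```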
The equation below has gained popularity for its simplistic form:

$$ y_{\text{norm}} = \frac{y - \min(y)}{\max(y) - \min(y)} $$

where $y$ is the value being normalised, and $\min(y)$ and $\max(y)$ are the minimum and maximum of the dataset.
In its mathematical form, the purpose of the above equation is to provide a normalisation technique that converts a dataset onto a 0-1 scale (which can then be multiplied by 100 to get a percentage). However, there needs to be a deeper understanding of the data before applying this technique and letting it run automatically.
If the dataset you wish to normalise will always have a fixed minimum and maximum, then the above equation will serve its purpose; moreover, it will be consistent over time and therefore reliable for automatic processes, flows and triggers. On the other hand, if either the minimum or the maximum fluctuates, to the point that the variance is significant, then the above equation will cause more data quality issues than your non-normalised data.
Anyone can try this out: generate some data points from a minimum to a maximum and, to see the effect clearly, keep the relationship between y and x linear. Apply the above equation to y. Then generate another set of data points, but this time over a wider range so that the minimum and maximum change. Graphs 1 (a) and (b) show data simulated in this way, six months apart; the range of x varied considerably, and this is reflected in the respective y.
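Here is a minimal sketch of that experiment. The ranges below are made up, so the exact crossover points will differ from the graphs that follow, but the effect is the same:

```python
import numpy as np

def min_max(y: np.ndarray) -> np.ndarray:
    """Min-max normalisation onto a 0-1 scale."""
    return (y - y.min()) / (y.max() - y.min())

# "Today": x spans roughly -2 to 2, with y linear in x.
x_a = np.linspace(-2, 2, 200)
y_a = 3 * x_a + 1

# "Six months later": the same linear relationship, but x now spans -7 to 7.
x_b = np.linspace(-7, 7, 200)
y_b = 3 * x_b + 1

norm_a, norm_b = min_max(y_a), min_max(y_b)

# The same normalised value of 0.8 (i.e. 80%) now sits at very different raw x.
print(x_a[np.argmin(np.abs(norm_a - 0.8))])  # ~1.2 with this range
print(x_b[np.argmin(np.abs(norm_b - 0.8))])  # ~4.2 with this range
```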
Graph 1: (a) A linear relationship between some input data x and y. (b) The same relationship on different input data x simulated to have occurred six months later.
The datasets in Graph 1 were simulated to represent a time gap of six months. Both the minimum and the maximum of the dataset changed, although symmetrically around zero (that is the drawback of synthetic data: it is unrealistic). Nevertheless, Graph 2 clearly demonstrates that the meaning of 80% differs: the left graph shows 80% at x ~ 1.8, while the right relates 80% to x ~ 6.1.
Graph 2: (a) Normalised y. (b) Normalised y for the data simulated six months later.
In an automated flow, or in a report where the percentage is visualised rather than the raw data, this change in context will confuse the business user. They would be filtering on the percentage as they have done many times before, only now the data doesn't make sense. It is rather unusual to say this, but it's not always the data that's at fault; sometimes it is the process. A data scientist, however, ought to be able to detect this horror.
Case 3: Correlation vs Causation
We see this quoted in multiple statistical memes, and I hope that those who share them understand the point. Most data scientists will type df.corr() during EDA in the hopes of finding correlated variables for their analysis, or uncorrelated variables for their models.
But correlated variables are not necessarily linked, nor does one necessarily influence the other. Any correlation found in the data could be a causal correlation, an indirect correlation, or an outright coincidence arising from the data range selected for your analysis.
It is for this reason that, whilst the correlation value itself is factual, the relationship it suggests may not be, and the source or reason behind it ought to be investigated.
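As a hedged illustration (the column names below are made up), two series that merely share a trend over the sampled window will look strongly correlated to df.corr(), and nothing in the output says why:

```python
import numpy as np
import pandas as pd

# Two quantities that have nothing to do with each other, except that both
# happen to trend upwards over the window that was sampled.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "ice_cream_sales": np.linspace(100, 200, n) + rng.normal(0, 5, n),
    "cloud_storage_cost": np.linspace(40, 90, n) + rng.normal(0, 3, n),
})

# df.corr() reports a near-perfect correlation between the two columns,
# but nothing in that number says whether the link is causal, indirect
# (a shared driver - here, simply time) or a coincidence of the chosen range.
print(df.corr().round(2))
```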
Note that this usually requires domain expertise; it is not a conclusion that should be expected from the data scientist alone.
Case 4: String Cleaning
Another horrific data science technique relates to string cleaning efforts. Many courses identify the same common techniques (a sketch of this checklist follows the list below):
lower case your characters
remove leading/trailing white spaces
remove punctuation
remove emojis
remove stop words
run a spell checker
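For illustration, here is a rough sketch of what that checklist often looks like when applied blindly (stop-word removal and spell checking are omitted for brevity), and what it silently throws away:

```python
import re
import string

def naive_clean(text: str) -> str:
    """The 'standard' checklist applied blindly: lower-case, strip, drop punctuation and emojis."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"[^\w\s]", "", text)  # crude removal of emojis and other symbols
    return " ".join(text.split())        # collapse leftover whitespace

# The emoji carried the sentiment and the exclamation marks carried the emphasis;
# both are gone.
print(naive_clean("Absolutely loved it!!! 😍"))  # -> "absolutely loved it"
```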
In theory, all of the above are correct techniques, but assuming that they are enough, or that they always apply, poses a data quality issue. Sometimes emojis can be translated into sentiment, so you would not want to remove them. Punctuation can carry grammatical signals that language models make use of.
When cleaning a text dataset, you first must understand what the data represents and which data quality issues will be present that might skew the analysis or model. Yes, it means getting into the head of the user or the individual inputting that data. It is equally crucial to know where the cleaned data will be fed, because that informs the cleaning process.
For example, a PostCode column will only require the second element, and the characters should probably be turned to upper case rather than lower, especially if fed into a geocoding algorithm. Names and surnames need neither stop-word removal nor a spell checker. Sure, you can apply the techniques for nothing, but why waste resources?
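A hypothetical sketch of that kind of column-specific cleaning, where the rules follow from the column and the system it feeds (the function names and rules here are illustrative, not a prescription):

```python
def clean_postcode(raw: str) -> str:
    """Geocoders typically expect tidy, upper-case postcodes, not lower-case ones."""
    return " ".join(raw.strip().upper().split())

def clean_name(raw: str) -> str:
    """Names only need whitespace tidying - no stop-word removal, no spell checker."""
    return " ".join(raw.strip().split())

print(clean_postcode("  sw1a 1aa "))   # -> "SW1A 1AA"
print(clean_name("  Mary   Smith "))   # -> "Mary Smith"
```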
Stop words, and common words in general, should definitely be left in the text when using embedding techniques for language-model tasks such as taxonomy or categorisation. Removing them would lead to a loss of context, and two words whose meaning differs depending on context would be treated as one and the same.
Case 5: Fuzzy Matching
The equation for fuzzy matching is

$$ \text{score}(a, b) = 1 - \frac{\text{Lev}(a, b)}{\max(|a|, |b|)} $$

where $\text{Lev}(a, b)$ is the Levenshtein (edit) distance between the strings $a$ and $b$, and $|a|$ and $|b|$ are their lengths.
Therefore, the fuzzy score depends on the length of both strings. This poses an issue, because a fuzzy score of 80% can mean a different degree of similarity depending on the length of the string. Remember Case 2 above? After all, 80% is 4 out of 5, which means 1 character is out of place. It could also mean that 2 out of 10 are out of place, or 4 out of 20, and so on. They would all give the same percentage, yet one of them might warrant a closer look.
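A small sketch of that effect, using a plain edit-distance implementation and the normalisation above (the example strings are made up):

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert/delete/substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_score(a: str, b: str) -> float:
    """Length-normalised similarity: 1 - distance / max length."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))

# Both pairs score 80%, but one hides a single typo and the other hides two.
print(fuzzy_score("apple", "appke"))            # 1 edit across 5 characters  -> 0.8
print(fuzzy_score("grapefruit", "gripefruut"))  # 2 edits across 10 characters -> 0.8
```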
So what are the options?
There are several, starting with using the Levenshtein distance as a 'raw' score and normalising it with a more consistent equation.
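One hedged sketch of that first option, reusing the levenshtein helper from the previous snippet: keep the raw edit distance and set an absolute tolerance per use case, so that one typo and two typos are never collapsed into the same 80%.

```python
def is_probable_match(a: str, b: str, max_edits: int = 1) -> bool:
    """Accept a match only within a fixed number of edits, whatever the string length."""
    return levenshtein(a, b) <= max_edits  # levenshtein() as defined in the sketch above

print(is_probable_match("apple", "appke"))            # True  (1 edit)
print(is_probable_match("grapefruit", "gripefruut"))  # False (2 edits)
```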
I could add more, and I probably should, for the skill and talent of a data scientist was never their ability to code, but rather their ability to translate a business problem into a solution that can be broken down into logic.
Remember, data value comes from understanding the data first and then searching for a solution.



