  • Deandra Cutajar

You are biased, and how could you not be?

Bias in the data is one of the many issues surrounding data science and AI. I believe the community has accepted it as one of the items to check when assessing data quality. But what is it, and what gives rise to a bias? More importantly, do we always need to eliminate it?

In this article, I intend to explain bias and share an unpopular opinion that:

To get a good model, sometimes you need to keep that bias!

The reason I say this begins with the mathematical definition of bias:

Bias(θ') = E[θ'] - θ

where θ' is a statistic used to estimate θ. The difference between the expectation of θ' (in practice, often approximated by the average of θ') and the true θ is a measure of bias. Essentially, it describes a systematic tendency to land on one side of θ rather than the other: a preference.

Bias does not always relate to sex, gender, race, ethnicity and religion. It does not necessarily imply discrimination.
Statistically, bias measures a tendency to prefer one outcome over another.

The classic example of bias is tossing a coin. If the coin is fair, then upon tossing it a large number of times, the number of heads should be roughly equal to the number of tails. If the coin is unfair, say because of some defect during manufacture, then the number of heads will consistently differ from the number of tails. This inequality also represents a bias because the coin prefers one side over the other due to its weight distribution.
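The coin example can be simulated in a few lines. This is only a minimal sketch: the function name `estimate_heads_bias`, the 0.6 probability for the unfair coin and the number of tosses are assumptions made for illustration. It measures the coin's preference as the observed share of heads minus the fair reference value of 0.5.

```python
import random


def estimate_heads_bias(p_heads, n_tosses, seed=0):
    """Toss a coin n_tosses times; return observed heads share minus 0.5."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < p_heads for _ in range(n_tosses))
    return heads / n_tosses - 0.5


# A fair coin and a hypothetical defective coin that lands heads 60% of the time
fair_bias = estimate_heads_bias(p_heads=0.5, n_tosses=100_000)
unfair_bias = estimate_heads_bias(p_heads=0.6, n_tosses=100_000)

print(f"fair coin:   {fair_bias:+.3f}")
print(f"unfair coin: {unfair_bias:+.3f}")
```

With enough tosses, the fair coin's value settles close to zero, while the defective coin's value settles close to +0.1, its built-in preference for heads.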

Such bias is not harmful, but another type can lead to discrimination. In 2019, an article showed that women applying for credit cards were rejected based on gender. No matter their demography or credit score, as long as they put 'Female', they were automatically rejected.

Due to numerous other examples of unfair and discriminatory outcomes, bias is understood to be a bad characteristic in the data that needs to be dealt with to have a just and fair model.

But that is not entirely true.

For any algorithm to be able to simulate real-world scenarios, it needs to know about these biases. When these biases result from or may result in discrimination, data experts can mitigate this without changing the real-world data and ensure fairer and more balanced future data.

Living in an unbiased world means an equal outcome for every prediction we try to calculate. A bank can be fully unbiased if they approve loans and credit cards randomly, without profiling the customer. Insurance companies can be fair by charging the same premium to everyone regardless of the risk exposure. To remain unbiased, we would no longer need complex mathematical solutions to identify patterns for ideal customers, target audiences, and risky profiles. We would give each customer the same probability throughout and cross our fingers.

Does that mean that to use knowledge and experience, we ought to accept all types of biases, including discriminatory bias?


As I emphasised in my articles, exploratory data analysis (EDA) is essential before training a model or AI. All biases, or most, must be identified and assessed during this stage, but not necessarily removed. Why? Without these biases and preferences, companies will fail to understand which products resonate with their customers and how to attract new customers. We may disagree on this, but each of us has our own way of living, and companies need to learn about these biases to address each one on its merit.

Consider company A, which promotes various products. For the sake of this article, I will not specify what the products are or what the company does, to avoid any indirect bias arising from my background or the reader's. The company wants to know which products are most successful and what the customer profile is for each product. The stakeholders also want to predict the probability of upselling and cross-selling products to current customers, and to learn how to reach new customers. Thus, they hire a data scientist or contact a consultancy company.

The data experts look at the data and find that whilst the same customer profile prefers some products, other products are purchased by a completely different profile. I will not say what makes these profiles different, but to give a broader understanding, a customer profile can be determined from personal non-identifying information such as location, age, and gender. It can also be inferred from the customer's behaviour on the platform, such as time of purchase (which indirectly gives information about whether the person works a day job or shifts).

I want to share that while sensitive information such as gender, race, and ethnicity can be used as part of a profile, these attributes are usually flagged as potential sources of bias! The reason they are kept in some models could be due to the nature of the business.

Back to our company A. The business stakeholders learned what products their clients liked and purchased, and they will use that information to recommend other products. Recommendation Engines work with the bias/preference so that if a customer wants product A and is similar in demography to a customer who liked product B, the engine will most likely promote product B to the first customer. That is how upselling and cross-selling work, and if I may be direct, so does influencing: promoting a product with an audience that resonates with you.
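To make the idea concrete, here is a toy sketch of a profile-based recommendation. Everything in it is hypothetical: the customer names, the profile attributes, the `profile_similarity` and `recommend` helpers, and the rule of copying from the single most similar customer. Real recommendation engines are far more elaborate; this only illustrates the "similar profile, similar products" logic described above.

```python
# Hypothetical customers: profile attributes plus the products each one liked
customers = {
    "ana": {"profile": {"age_band": "25-34", "location": "north"}, "liked": {"A", "B"}},
    "ben": {"profile": {"age_band": "25-34", "location": "north"}, "liked": {"A"}},
    "cara": {"profile": {"age_band": "55-64", "location": "south"}, "liked": {"C"}},
}


def profile_similarity(p, q):
    """Fraction of profile attributes on which two customers match."""
    return sum(p[key] == q[key] for key in p) / len(p)


def recommend(name):
    """Suggest products liked by the most similar other customer."""
    target = customers[name]
    others = [other for other in customers if other != name]
    nearest = max(
        others,
        key=lambda other: profile_similarity(
            target["profile"], customers[other]["profile"]
        ),
    )
    # Only suggest products the target customer has not already liked
    return sorted(customers[nearest]["liked"] - target["liked"])


print(recommend("ben"))  # ['B']
```

Ben shares a profile with Ana, who also liked product B, so B is what the sketch promotes to Ben. Take away the profile, and there is nothing left to base the suggestion on.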

Have you ever scrolled through social media? Which posts/reels do you read, and which do you scroll over? Ahh! There's your bias, and the algorithm knows.

Check your recommendations next time you're on your streaming provider or shopping online. These are unlikely to promote products they cannot link to your profile or preferred products. Without that bias, that categorisation, recommendation engines would recommend at random and lead to annoyed customers. That is what personalised recommendation does. It identifies your profile and then connects your profile with other similar customers.

So when is bias bad, and how do we identify it?

I believe there is no rule on this; instead, we should understand the bias in the context of the problem. If we consider the case of the credit card, an analysis of which variables the model used to predict the outcome would have flagged this issue. The feature "gender" would have been at the top of the feature importance analysis, which should have raised concerns.

During the EDA, the data scientist could have identified that the variable gender would be a "strong predictor", particularly if most females lie in one category. The analysis should measure how many risky customers are females and how that portion compares to the overall female presence, so that the numbers tell the full story. Maybe only 10% of females were risky, but they make up 95% of the risky customers. Saying "95% of our risky customers are female" differs from saying "10% of females are risky customers, which comprise 95% of the total risky profiles". The former gives rise to prejudice, whereas the latter provides deeper context.
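The two statements can be computed side by side. The counts below are hypothetical, chosen only so that they reproduce the 10% and 95% figures used above; the male counts and the helper names `share_of_risky` and `risk_rate` are assumptions for illustration.

```python
# Hypothetical customer counts, chosen to reproduce the percentages in the text
counts = {
    ("female", "risky"): 190,
    ("female", "not_risky"): 1710,
    ("male", "risky"): 10,
    ("male", "not_risky"): 90,
}


def share_of_risky(gender):
    """Of all risky customers, what fraction has this gender?"""
    total_risky = sum(n for (_, label), n in counts.items() if label == "risky")
    return counts[(gender, "risky")] / total_risky


def risk_rate(gender):
    """Of all customers with this gender, what fraction is risky?"""
    total_gender = sum(n for (g, _), n in counts.items() if g == gender)
    return counts[(gender, "risky")] / total_gender


print(f"Share of risky customers who are female: {share_of_risky('female'):.0%}")
print(f"Share of females who are risky:          {risk_rate('female'):.0%}")
print(f"Share of males who are risky:            {risk_rate('male'):.0%}")
```

In this made-up data set, females make up 95% of the risky customers simply because they make up most of the customer base. The risk rate is 10% for both genders, which is the context the first statement hides.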

If these metrics remain uninformative, there is something else to look at: the interrelationships between variables. A variable may be deemed a strong predictor only because it is correlated with, or indirectly related to, the true predictors. However, the machine learning algorithm or AI doesn't care about choosing the variables that make more sense. Instead, the algorithm picks the variables that optimise the results quickly.

*This is also true for tree-based algorithms. More will be explained in another article.

So, in the case of gender, it could be that the variable gender had a relationship with other variables that were the true predictors, but the algorithm picked gender. To check this, one can remove the variable from the inputs and see how the model performs. If the model's performance drops dramatically, data experts should mitigate the imbalance while ensuring the information remains intact. Remember that in cases such as gender discrimination, we want to remove the social prejudices inherent in the data, not valid information.
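The removal check can be sketched on a toy data set. All of it is assumed for illustration: the rows, the `income_band` variable standing in for a "true predictor", and the very simple majority-vote model that replaces a real machine learning algorithm. For brevity, the model is scored on the same rows it was fitted on, which a real analysis would avoid.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (gender, income_band, risky). Risk depends only on
# income_band, but gender is correlated with income_band.
rows = (
    [("F", "low", 1)] * 40
    + [("F", "high", 0)] * 10
    + [("M", "low", 1)] * 10
    + [("M", "high", 0)] * 40
)
FEATURES = {"gender": 0, "income_band": 1}


def accuracy_with(feature_names):
    """Fit a majority-vote lookup on the chosen features; score it on the same rows."""
    idx = [FEATURES[name] for name in feature_names]
    votes = defaultdict(Counter)
    for row in rows:
        votes[tuple(row[i] for i in idx)][row[2]] += 1
    correct = sum(
        votes[tuple(row[i] for i in idx)].most_common(1)[0][0] == row[2]
        for row in rows
    )
    return correct / len(rows)


print(accuracy_with(["gender"]))                 # 0.8
print(accuracy_with(["gender", "income_band"]))  # 1.0
print(accuracy_with(["income_band"]))            # 1.0
```

On its own, gender looks like a strong predictor here. Yet removing it costs nothing once the income band is available, which suggests that gender was only standing in for the variable carrying the real information.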

To explain this, I shall use an example with skincare. You tried two skin care products on different parts of your face. One of them worked well, but you had a reaction from the other one. You left a review so others know of your experience and hopefully do not have the same reaction. We all do it! That negative review will bias people against that product, but it is informative and thus should not be removed for the sake of unbiased data. Instead, any new customer reading the reviews will assess what you wrote and compare it with what others who loved the product wrote.

Any bias inherent in the data due to societal prejudices must be assessed to ensure no information is lost without allowing unfair decisions that would impact people's lives. It is essential to identify such biases as they most likely arise from outdated procedures. Once the processes change, the data imbalance can be monitored to assert that these new procedures work towards a fair system. More importantly, when societal prejudices are present, no automated decision should be allowed, and the algorithm needs to be supervised in every case.

Before ending this article, I want to mention silent biases. These biases are HIDDEN from the data but are also a consequence of experience or discrimination. Silent biases result from processes that occur before the data is recorded. For example, suppose you have a car company, but due to a preference or prejudice, a particular brand rarely makes it into your ledger. On the rare occasion when it does, the cars are heavily examined with compliance checks before approval. Thus, a data expert analysing your data can never say, "Cars of brand A are riskier for your business than other brands". The data expert will arrive at the opposite conclusion because of the extra checks: the data will show that cars of brand A are always compliant. Nonetheless, that prejudice (regardless of whether it is founded) led to a silent bias that remained undetected. The only way to detect these biases is by going through the processes with a business expert, or during a presentation of an analysis or of a model's outcome and explainability.


We are all biased, whether we accept it or not. As data experts, we are responsible for educating companies about their biases, which can result from previous experiences that led to more severe procedures for one category than for others. Once these biases are identified, and the impact on these categories is discussed, data experts and business experts can determine the safe route to ensure that the algorithm still represents the business model but gives way to a fairer assessment when applicable.

Eliminating the bias is not easy and, at times, is not the best route. There is a difference between bias and discrimination. Data experts can guide businesses to distinguish between the two.

