By Deandra Cutajar

Quality over Quantity over Quality

Updated: Mar 5

The title alone may have your head spinning because we are more familiar with the notion of

Quality over Quantity.

Image designed by Geoff Sence.


Of course, data science is no exception to that notion, although, to a certain extent, it may be.


The quality of the data is imperative for machine learning models and AI. As I shared in previous articles, data scientists spend most of their time looking at the dataset and identifying some of the most common obstacles, such as:

  • missing data

  • outliers

  • correlations

and more (a quick first pass for these checks is sketched in code a little further below). These concerns are universally acknowledged. One of the most "social" topics shared amongst data scientists is data quality: people complain about traffic, weather, prices, or taxes, and data scientists have their rounds of complaining about data. For this reason, we ask for more data, hoping that sheer volume will overcome its low quality. Data scientists have become so accustomed to seeing low-quality data that they immediately accept

Quantity over Quality.

In this article, I will explain how quality is always more desirable than quantity, but realistically, quantity is the only way towards quality models.
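
Before diving in, here is a minimal, generic sketch of what that first pass over a new dataset might look like in pandas. The file name dataset.csv, the numeric columns, and the thresholds are illustrative assumptions, not taken from any particular project.

```python
# A generic first pass over a new dataset: missing data, outliers, correlations.
# "dataset.csv" is a hypothetical file; adapt the checks to your own data.
import numpy as np
import pandas as pd

df = pd.read_csv("dataset.csv")

# Missing data: fraction of missing values per column.
print(df.isna().mean().sort_values(ascending=False))

# Outliers: crude z-score screen on numeric columns (count of |z| > 3 per column).
numeric = df.select_dtypes(include=np.number)
z_scores = (numeric - numeric.mean()) / numeric.std()
print((z_scores.abs() > 3).sum())

# Correlations: pairwise correlation matrix of the numeric columns.
print(numeric.corr())
```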


Focusing on the first part of the title, "Quality over Quantity", may seem trivial. For machine learning algorithms and AI in general, a prejudice has formed that a small dataset, i.e. a table with around 10,000 rows, is not good enough. Some go even further and claim that, with such a small dataset, a model is not possible.

That is NOT true.

A model can still be trained on a small dataset, as long as the data quality is good. A data scientist can build a model with only a handful of data points, provided those points contain complete and accurate information.


I will convince you by using three data points describing a relationship between variables y and x. A data scientist would think of a model, either regression or a classifier, to determine how changes in x reflect changes in y.


Figure 1: Three data points relating x with y. The blue line shows a straight line passing through all points, thus defining a linear relationship between x and y.


When fitting a regression model to the three data points, we get a straight line passing through the origin, as shown in Figure 1. The linear relationship was recovered because the data quality was high, and so, with just three points, I modelled the relationship with high accuracy. The same applies to other models, such as classifiers. Moreover, if I add more data of the same quality, the new points fill the gaps between the existing ones and extend beyond them, but the underlying model remains unchanged. To see this, press 'PLAY' in Figure 2.


Figure 2: An animation showing how the model remains unchanged with new high-quality data.
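
The same idea can be reproduced in a few lines of code. The sketch below is illustrative rather than the exact data behind the figures: it assumes NumPy and an underlying line y = 2x through the origin, and shows that three noise-free points recover the line exactly, while adding many more points of the same quality leaves the fit unchanged.

```python
# Three noise-free points are enough to recover the underlying linear model,
# and adding more data of the same quality leaves the fitted line unchanged.
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

# Three high-quality (noise-free) points on the assumed line y = 2x.
x_small = np.array([1.0, 2.0, 3.0])
print(fit_line(x_small, 2.0 * x_small))   # ~ (2.0, 0.0)

# Two hundred points of the same quality: the model does not change.
x_large = np.linspace(0.0, 10.0, 200)
print(fit_line(x_large, 2.0 * x_large))   # still ~ (2.0, 0.0)
```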


Admittedly, three data points is an exaggeration, and linear relationships are even rarer than good data quality. Still, I hope to have shown that when the quality is good, we require significantly fewer data points to train models. Consequently, the data team requires less processing power to run analyses, build models, and train AI.


Training a model can become costly. With large amounts of data to train on, costs keep rising for storage and processing power, not to mention the licences and hardware needed for the data team to run analyses and build models efficiently in support of the business.


There are ways to reduce data storage costs, but the more data we collect, the more storage we need. An ecosystem designed for archiving, retaining, and eventually deleting data can help manage that, and rethinking the storage structure improves both data access and storage management. Even so, any cost saved today will simply be spent on storing tomorrow's data.
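
As a rough illustration of such a lifecycle, the hypothetical rule below archives records after a year and deletes archives after five. The thresholds and the function name are assumptions for the sake of the example, not a reference to any particular platform.

```python
# A hypothetical data-lifecycle rule: keep, then archive, then delete.
from datetime import datetime, timedelta, timezone

ARCHIVE_AFTER = timedelta(days=365)       # assumed retention thresholds
DELETE_AFTER = timedelta(days=5 * 365)

def lifecycle_action(record_timestamp: datetime, now: datetime) -> str:
    """Decide what to do with a record based on its age."""
    age = now - record_timestamp
    if age > DELETE_AFTER:
        return "delete"
    if age > ARCHIVE_AFTER:
        return "archive"
    return "keep"

now = datetime(2025, 3, 5, tzinfo=timezone.utc)  # fixed date for the example
print(lifecycle_action(datetime(2019, 1, 1, tzinfo=timezone.utc), now))  # "delete"
print(lifecycle_action(datetime(2023, 6, 1, tzinfo=timezone.utc), now))  # "archive"
print(lifecycle_action(datetime(2025, 1, 1, tzinfo=timezone.utc), now))  # "keep"
```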


Optimising data costs means ensuring that every data point stored brings value to the business. Data scientists spend a lot of time cleaning data; imagine a considerable number of rows or columns being dropped from the analysis every time because of their low quality. Either save costs by not storing those rows at all, or free up the space for good-quality data.
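
A sketch of that idea: a small, hypothetical "quality gate" that drops rows we would discard during analysis anyway, before paying to store them. The column names and bounds here are purely illustrative.

```python
# A hypothetical quality gate: reject rows that fail basic checks before storage.
import pandas as pd

def quality_gate(df: pd.DataFrame, required: list) -> pd.DataFrame:
    """Keep only rows whose required columns are present and within sane bounds."""
    clean = df.dropna(subset=required)                    # no missing key fields
    clean = clean[clean["amount"].between(0, 1_000_000)]  # crude bound on an assumed column
    return clean

raw = pd.DataFrame({
    "customer_id": [1, 2, None, 4],
    "amount": [120.0, -5.0, 300.0, 2_000_000.0],
})
print(quality_gate(raw, required=["customer_id", "amount"]))
# Only the first row survives: complete and within the assumed bounds.
```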


So why don't we do that?

It seems that we have given up on data quality and moved over to quantity, trying to absorb errors in the data in the hope that they will cancel out. This brings me to the second relationship, "Quantity over Quality".


The reason quantity so often supersedes quality is, ironically, low data quality. By gathering more data, we hope that the errors cancel out. Statistically, this only happens if the errors are random with mean zero, for example normally distributed around 0, and that is rarely the case. Instead, we end up with noise bias, to be distinguished from discriminatory bias. This bias arises from random or systematic errors at the time the data is recorded and, without delving into the details, the two are different: a random error has a good chance of cancelling out over a large number of records, whereas a systematic error will most likely lead to a bias.
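
A quick simulation makes the distinction concrete (the noise level and offset are assumed values for illustration): averaging more data shrinks random, zero-mean error, while a constant systematic offset remains as bias no matter how much data we collect.

```python
# Random vs systematic error: more data cancels the former, not the latter.
import numpy as np

rng = np.random.default_rng(seed=0)
true_value = 10.0
systematic_offset = 0.5            # assumed recording bias, constant for every measurement

for n in (100, 10_000, 1_000_000):
    random_noise = rng.normal(loc=0.0, scale=2.0, size=n)   # zero-mean random error
    measurements = true_value + random_noise + systematic_offset
    print(f"n={n:>9,}: mean error = {measurements.mean() - true_value:+.4f}")
# The mean error converges towards +0.5 (the systematic bias), not towards 0.
```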


Having more low-quality data doesn't solve the problem. It only creates more challenges for data experts, who must find ways to compensate for the missing quality. These workarounds are not future-proof, which is why models, including large language models (LLMs), need regular re-training and re-tuning.


Any model, whether statistical, machine learning or AI, depends on the data. To build a future-proof ecosystem around data and AI, we need to ensure that the data is of good quality and in good shape, which optimises storage costs and access and ensures that any processing power dedicated to innovation and AI is well spent.


Quality will always be preferred. When that fails, data professionals hope that quantity will shed some light.

