
Accurate vs Correct - Part 1

  • Writer: Deandra Cutajar
  • Feb 11, 2022
  • 10 min read

Have you ever spoken with a data scientist who said:

When I see an accuracy of 80% or higher, I know something might be wrong.

I bet your first reaction is confusion because the above statement defies any logic, at least theoretically. It is normal to ask:

Why? Isn't higher accuracy better?

Theoretically, YES! In practice, higher accuracy may not be better. Most data science courses promote high accuracy, or show statistics that translate to 100% accuracy or something equally remarkable. All of this aims to help the student understand whether the model they built performs well or not. There is nothing fundamentally wrong with learning in near-perfect conditions. However, as most people often remark, the ideal world (with ideal data) doesn't exist (yet). I'm happy to be proven wrong, and I highly encourage feedback with directions on where to find this ideal world.


The reason that real-world data science models rarely (if ever) reach such high accuracy is simple.

Data is not perfect.

If one attends an online course with perfect data, then of course one can get high accuracy and ideal metrics that conform with theory. Real data, however, is messy, full of mysteries and information that we may not yet understand. It may contain information that the model (a mathematical machine) cannot make sense of and therefore cannot use properly. The most catastrophic scenario is when the model attempts to use that information anyway to reach the desired accuracy, and throws all sense and understanding out of the window. I am not saying that the model picks up useful information that the scientist or businessperson didn't know about. On the contrary, I am saying that the model picks up something incorrectly and tries to use it just to reach that wonderful accuracy.


This also happens outside Data Science, in what people refer to as rumours, speculation, conspiracy theories or otherwise: the acceptance or justification of whatever information agrees with current beliefs. But to what end?


In scientific terms, accuracy measures how well the model predicted whatever it was meant to predict. In other words, accuracy determines how much the model agrees with the result we are trying to understand. Thus, 100% accuracy would indicate that the model predicted exactly what the data scientist wanted it to predict. Regardless of how weird, abnormal or irregular the data points are, if a data scientist forces the model enough, I can assure you the machine learning algorithm will come up with a way to satisfy them. But is that smart?

Is that correct?

The Accurate vs Correct series of articles aims to provide a visual understanding of the difference between high accuracy and correct modelling. As in previous articles, I shall keep to simple models and data that is easy to play around with, but at the same time easy to keep in mind when assessing or building more complex models. It is my goal that, once you have read this article, you start to understand the advantages, and the limitations, of machine learning models.


When looking at some information, you need to do two things:

- ensure you captured the information well,

- ensure you captured all the information.


Ideally, one captures all the information well, but the focus should stay on the goal. This means that a model needs to identify which data will strengthen, and which data will weaken, its predictive power. Not all the information is useful for a task, and therefore a model must select the data from which to learn. This is not easy to comprehend in data science terms, but I will try to visualise it and show you why the model ought to ignore some information and pursue other information.


Case 1: Great Data


This rarely happens, but let's say we have the data below, where as x increases, y increases at the same rate, with some slight deviations here and there to make it somewhat real. Admittedly, this is dreamy data for a data scientist because, just by looking at the graph, it is obvious that there is a strong linear relationship between the two variables. Do you agree?


Figure 1: Intrinsic raw data for Case 1.


The linear relationship is clear, and thus it makes sense to try and draw a line - the simplest model - through the data. Indeed, after drawing a straight line using a mathematical equation, the final result is the graph below.


Figure 2: A line drawn on the raw data for Case 1.


The process I carried out, drawing a line to capture the overall trend of the points, is called model fitting, and because I used a line, the model is called Linear Regression. Linear because it is a straight line, and Regression because the variables x and y are continuous: their values start from 0 and increase continuously up to 100.


Note that the minimum and maximum values of the variables do not determine their continuity. A continuous variable can range between any two numbers.
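
For readers who prefer code to pictures, below is a minimal sketch of this fitting step in Python. The article's dataset is not published, so the data here is simulated to mimic it (y roughly equal to x, with slight deviations); the seed, sample size and noise level are assumptions made purely for illustration.

```python
import numpy as np

# Simulated stand-in for the Case 1 data: y follows x with slight deviations.
rng = np.random.default_rng(seed=0)
x = np.linspace(0, 100, 50)
y = x + rng.normal(0, 2, size=x.size)

# "Drawing a line" = fitting y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
```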


Much like going to a tailor to make those adjustments to the waist, shoulder width or sleeve length, in science we fit mathematical equations to help understand the data points and then predict observations. The tailor aims to ensure that the dress looks good on you and feels comfortable, without being too tight or too loose in different areas. Similarly, the data science model aims to capture the behaviour of the data without overfitting (too tight) or underfitting (too loose).


Ensuring that the data science model works well means that it needs to be tested. In the tailor shop, you would wear the dress, move around and, if you know that you might be dancing in it, try out the moves and see if you're comfortable. More importantly, no accidents should be foreseeable if the dress turns out too tight or too long.


A data scientist tests the model by trying to predict the value of y for some randomly chosen value of x. In the graph below, this was done by drawing a vertical (dashed) line from that value up until it hits the linear model (solid line), and then continuing horizontally to check where the dashed line hits the y-axis. Upon repeating this step several times, if the predictions agree with expectations, the data scientist concludes that indeed the model makes sense.


Figure 3: Testing the prediction for x = 45.


Unfortunately, the data above was purposely generated to represent great data, and in the real data science world this is rarely the case. For example, the true value of y when x = 45 should be 45, which remarkably is close to the value of y on the fitted line.
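
As a rough sketch of that test in code (again on simulated stand-in data, since the article's points are not available), one can read the prediction at x = 45 off the fitted line and compare it with the expected value of about 45.

```python
import numpy as np

# Refit the line on the simulated Case 1 data, then "test" it at x = 45.
rng = np.random.default_rng(seed=0)
x = np.linspace(0, 100, 50)
y = x + rng.normal(0, 2, size=x.size)
slope, intercept = np.polyfit(x, y, deg=1)

x_test = 45
y_pred = slope * x_test + intercept
print(f"predicted y at x = {x_test}: {y_pred:.1f}")  # should land close to 45
```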


Real data will have easy points, but also points that are difficult for the model to understand. A dataset can have some points that misbehave, referred to as outliers, or the entire dataset can be all over the place, which is usually called noisy. In this article, I will give a visual representation of both, but a deeper dive into each type of data will follow in separate articles.


Case 2: Good Data


Sometimes the data is good but not great. In other words, most of the data follows a nice trend, but some points fall out of this pattern. In the second “The Scientific Method” article, I explained how Exploratory Data Analysis is crucial for a data scientist.


When given a set of data, my first action is to figure out a way to visualise it. Note that here I am only using two variables, so this is easy. If a data scientist knows what the output is, then a good start is to see how each variable behaves against the result, such as x (input/variable) against y (output/result). For more experienced readers, one can also plot distributions, but I will talk about this in another article.
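
As a quick illustration of that first exploratory step, the sketch below scatter-plots a simulated stand-in for the Case 2 data (mostly linear, with two points deliberately pushed off the trend), since the article's actual dataset is not available.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated stand-in for Case 2: a clear linear trend plus two hand-made outliers.
rng = np.random.default_rng(seed=1)
x = np.linspace(0, 100, 60)
y = x + rng.normal(0, 3, size=x.size)
y[10] += 40   # push one point well above the trend
y[45] -= 35   # and one well below it

plt.scatter(x, y, s=15)
plt.xlabel("x (input)")
plt.ylabel("y (output)")
plt.title("Exploratory scatter plot: can you spot the points that misbehave?")
plt.show()
```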


Looking at the below graph, can you spot the data points that misbehave?


Figure 4: Good data for Case 2.


When I look at the data I identify the outliers marked in the graph below.

Figure 5: Identify outliers for Case 2.


Since most of the data looks normal, these points are referred to as outliers because, put simply, they lie outside the normal region of the chart. They do not follow the same underlying behaviour as the rest of the data. So, how do these points influence the thought process of a data scientist when assessing and building a model, machine learning or otherwise?


Suppose that at first glance, the data scientist thinks that these two points will probably be ignored by any model and proceeds to model-fitting. Can you guess what would happen?


The beauty of mathematics is in its universal flexibility. A paper (or a code editor) is the playground, and equations are the different rides you can go on. Being quite excited by equations myself, I used 3 different models in an attempt to see which would follow the trend of the data best. In the graph below, I invite you to check how each model describes the data by clicking or tapping each model on or off in the top-right legend. Then, before reading on, write down your thoughts on each model based on the previous example and the article mentioned above.


Figure 6: Case 2 Model - Fitting.


My summary of the above graph would be that Model - Fit 2 is more or less the same as Model - Fit 1, but Model - Fit 3 looks like it tried to create an explanation for the outliers. Model - Fit 3 passed through more points than the other models. Moreover, if we pull some accuracy numbers, Model - Fit 3 has the highest value of the accuracy metric chosen for the analysis. A data scientist who chooses a model based on this metric alone, without visualising the graph, might well choose Model - Fit 3. Do you agree with that data scientist? Is Model - Fit 3 the best? Should the data scientist go to the business and say that this model has the highest accuracy and therefore it is ready to be used?
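
To make that comparison concrete, here is a hedged sketch of the fitting exercise. The article's three models are not specified, so polynomials of increasing degree stand in for Model - Fit 1, 2 and 3, and R-squared on the training data stands in for the accuracy metric; the data is the same simulated stand-in as above.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Simulated stand-in for the Case 2 data (linear trend plus two outliers).
rng = np.random.default_rng(seed=1)
x = np.linspace(0, 100, 60)
y = x + rng.normal(0, 3, size=x.size)
y[10] += 40
y[45] -= 35

def r_squared(y_true, y_hat):
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Three models of increasing flexibility, playing the role of Model - Fit 1, 2, 3.
for degree in (1, 3, 9):
    model = Polynomial.fit(x, y, deg=degree)
    print(f"degree {degree}: training R^2 = {r_squared(y, model(x)):.3f}")

# The most flexible model scores highest on the training data, which is exactly
# why a single accuracy number, read without the graph, can be misleading.
```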


In the third article of The Scientific Method, I wrote about training and testing. By analogy with that process, the fitting of a model is the training part. Thus, before jumping to conclusions about what a data scientist may choose to do or not do, a test needs to be conducted for better judgement.


Similarly to Case 1, pick some random values on the x-axis and, for each model, find what the corresponding y would be.


I took the liberty of choosing 2 values, and for each model (same colour but dashed line), I conducted the same tests as in the previous example.


Figure 7: Testing each model for Case 2 data.


The two points that I chose are x = 41 and x = 59. Of course, I chose these points to explain something, but in reality, test data ought to be random. One must not selectively choose the test data.


Remember that you can zoom in on any area of the graph, using a cursor or otherwise.


When x = 59, all 3 models predict more or less the same value. Zooming in, Model - Fit 2 is the closest to the true value, but the rest are well within the accepted range. On the other hand, when x = 41, the 3 models predict 3 very distinct values. I encourage you to zoom in, but the following are the approximate predictions:


- Model - Fit 1: 37

- Model - Fit 2: 27

- Model - Fit 3: 13


This is peculiar, isn't it? Without conducting the tests, one would have thought that Model - Fit 1 ignored most of the information and did the bare minimum, whilst Model - Fit 3 went above and beyond to ensure that every piece of information was captured. Yet, when a new data point is used to test the models, it is obvious that Model - Fit 3 was distracted by all the information and forgot to learn the bigger picture. Although the predictions are not perfect, Model - Fit 1 captured the necessary information using the data it judged relevant to keep. Model - Fit 1 ignored the outliers because, in general, those points lie outside the area of learning.
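
A hedged continuation of the same sketch shows this effect: evaluating the three stand-in models at values of x that were not used for fitting, in the spirit of the x = 41 and x = 59 test above (the exact numbers will differ, since the data here is simulated).

```python
import numpy as np
from numpy.polynomial import Polynomial

# Same simulated Case 2 stand-in as before.
rng = np.random.default_rng(seed=1)
x = np.linspace(0, 100, 60)
y = x + rng.normal(0, 3, size=x.size)
y[10] += 40
y[45] -= 35

models = {deg: Polynomial.fit(x, y, deg=deg) for deg in (1, 3, 9)}

# "Test" each model at a couple of x values, as in the article's Figure 7.
for x_test in (41, 59):
    preds = {deg: round(float(m(x_test)), 1) for deg, m in models.items()}
    print(f"x = {x_test} (true y is about {x_test}): {preds}")

# The flexible model tends to agree with the others where the data is well behaved,
# but can drift where it chased the outliers, mirroring the behaviour described above.
```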


One then asks: why aren't the predictions perfect? The answer is that the data points were shifted up and down by some value to represent a human or instrumental error. The data scientist would IDEALLY want the model to correct for this, but the model doesn't know the magnitude of the error, so it cannot accurately correct for it. Instead, the model does its best to capture the general information, which will inevitably inherit some of these errors.


The data used for the study in Case 2 was also simulated for this article, but it doesn't fall far from the truth. It is very common in Data Science to look for these outliers and then decide how to treat them. In essence, there is no right or wrong way; it rather depends on the company's objective and the business problem. Some data science projects aim to predict these outliers, and so training on datasets that contain outliers is important. Other projects require removing the outliers according to some criteria. For each problem, there is a different approach.
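
For illustration, here is one common (certainly not the only) way to flag such outliers in code: an interquartile-range rule applied to the residuals of a simple fit. The 1.5 x IQR threshold and the simulated data are assumptions made for this sketch, not choices taken from the article.

```python
import numpy as np

# Simulated Case 2 stand-in: linear trend plus two injected outliers.
rng = np.random.default_rng(seed=1)
x = np.linspace(0, 100, 60)
y = x + rng.normal(0, 3, size=x.size)
y[10] += 40
y[45] -= 35

# Fit a simple line and look at how far each point falls from it.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Flag points whose residuals fall outside 1.5 * IQR of the residual distribution.
q1, q3 = np.percentile(residuals, [25, 75])
iqr = q3 - q1
is_outlier = (residuals < q1 - 1.5 * iqr) | (residuals > q3 + 1.5 * iqr)
print(f"flagged {is_outlier.sum()} outlier(s) at x = {x[is_outlier].round(1)}")
```

Whether the flagged points are then dropped, capped or kept is a business decision rather than a purely technical one.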


Case 3: Nightmare Data


If Case 2 gave the impression that it is not that bad, let me share a dataset that is closer to reality than one may think.


Figure 8: The more realistic data for Case 3.


The true signal is sometimes so hidden by the noise that data scientists need to gather all their skills and expertise to understand what is going on. In the scenario above, one would resort to a continuous train-test process to ensure consistency and correctness. Don't get me wrong, high accuracy is still possible, but then the hypothesis and the explainability would become complicated.


As I did for Case 2, I fit 3 models to the data as shown below.


Figure 9: Three models were fitted to Case 3 data.


Toggle the models on and off, and think about which one makes sense. Once the logic takes shape, think of a story to tell a business person, such as:


- In general, y increases with x (Model - Fit 1).

- Initially, y increases with x, but then starts to decrease until y begins to increase again with x, at a faster rate (Model - Fit 2).

- The relationship between y and x is complicated, but the model shows the best R-Squared (given that the model passes closer to the points than other models), so it is the true relationship (Model - Fit 3).


Which one do you think is the correct answer?


I’ll give you a hint: I used the same clean data throughout (a straight line where y = x) and added different levels of shifts to simulate different severities of noise. Now, was your answer accurate or correct?
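
For anyone who wants to recreate something similar, here is a minimal sketch of how datasets like Cases 1 to 3 can be generated: the same clean signal y = x, with increasingly large random shifts added on top. The exact noise levels (and outlier positions) used in the article are not stated, so the numbers below are guesses for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
x = np.linspace(0, 100, 60)

great = x + rng.normal(0, 1, size=x.size)       # Case 1: slight deviations only
good = x + rng.normal(0, 3, size=x.size)        # Case 2: moderate noise...
good[10] += 40                                  # ...plus a couple of hand-made outliers
good[45] -= 35
nightmare = x + rng.normal(0, 25, size=x.size)  # Case 3: the signal is buried in noise

for name, data in [("great", great), ("good", good), ("nightmare", nightmare)]:
    print(f"{name}: typical deviation from y = x is {np.std(data - x):.1f}")
```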



