
Accuracy vs Correct: Part 2

  • Writer: Deandra Cutajar
  • Jun 30, 2022
  • 5 min read

The second part of the series Accuracy vs Correct uses classifiers to draw out some practices for evaluating whether a model is accurate or correct. In a previous article, I discussed this concept in relation to linear models, which are easier to visualise and explain. I hope that with patience I can do the same with classifiers.


A classifier is a model that looks at the data and attempts to place each data point in its respective category. As an example, suppose there are 100 marbles and, for each marble colour, a bucket. The buckets are numbered, and by looking at the data, a data scientist can tell that knowing a marble's colour is enough to predict the bucket number it belongs to. The classifier first learns the pattern, and then for each new marble, the model can predict which bucket to put it into. This is a simple example since our data contains marble colour as input and bucket number as output.
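To make the marble example concrete, below is a minimal sketch of such a classifier in Python; the colours, bucket numbers and choice of a decision tree are all made up for illustration.

# A minimal sketch of the marble example; all data here is made up.
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Input: marble colours. Output: the bucket each marble belongs to.
colours = ["red", "blue", "green", "red", "blue", "green"]
buckets = [1, 2, 3, 1, 2, 3]

# Encode colours as numbers, since the model needs numeric input.
encoder = LabelEncoder()
X = encoder.fit_transform(colours).reshape(-1, 1)

# Learn the colour-to-bucket pattern.
model = DecisionTreeClassifier()
model.fit(X, buckets)

# Predict the bucket for a new red marble.
new_marble = encoder.transform(["red"]).reshape(-1, 1)
print(model.predict(new_marble))  # -> [1]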


As one may have guessed from the image, I decided to use the Titanic dataset that was made publicly available. It contains information about the passengers and whether they survived the tragedy or not. In this article, I'm going to build two models to predict whether a passenger should have survived the tragedy or not, and then compare that prediction against the actual status. I would like to add that I won't be concluding with an analysis of whether there was space for Jack on the door, even though the answer is clearly YES.



Figure 1: Titanic illustration by Geoff Sence.


The first five rows of the data are shown below:



Figure 2: Data.
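
For readers following along in Python, a snippet like the one below would reproduce this view; the file name titanic.csv is an assumption, as the article does not say how the data was loaded.

import pandas as pd

# Load the Titanic data; the file path is an assumption for illustration.
df = pd.read_csv("titanic.csv")

# Show the first five rows, as in Figure 2.
print(df.head())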


Looking at the columns one by one, they contain the following information:

  • pclass - Passenger Class (1 = 1st, 2 = 2nd and 3 = 3rd)

  • survived - Status of Passenger Survival (1 = Yes and 0 = No)

  • name - Name of the Passenger

  • gender - Gender of Passenger

  • age - Age

  • sibsp - Number of Siblings (Sib) or Spouses (Sp) aboard

  • parch - Number of Parents (Par) or Children (Ch) aboard

  • ticket - Ticket number

  • fare - Price paid

  • cabin - Cabin number

  • embarked - Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

  • boat - Lifeboat number if passenger survived

  • body - Body number if the passenger did not survive and the body was recovered

  • home.dest - Passenger Destination


It is important to understand what each column means and how it can be used. In the end, I want to predict whether a passenger survived using the rest of the information and then compare it with the actual outcome.

But, I need to be careful not to suffer from data leakage.

Data leakage occurs when a data scientist builds a model using some data, a.k.a. training data, and includes information that will not be available in the unseen data, a.k.a. testing data. In the above dataset, if I were to use the features boat and body to try and predict whether a passenger will survive, the model would suffer from said data leakage. These features are only generated once the survival status is known. In other words, it is easy to predict that an event happened after learning something about it. In reality, one cannot know the body number or lifeboat number before the tragedy happens. I shall get back to this later on.
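
In code, guarding against this kind of leakage can be as simple as dropping the post-outcome columns before any modelling; a sketch, assuming the DataFrame df from the loading snippet above:

# boat and body only exist because the outcome is already known.
# A leakage-free model must not see them.
leaky_columns = ["boat", "body"]
df_safe = df.drop(columns=leaky_columns)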


Firstly, I split the data into training and testing parts, and run some analysis on the training data. It is important to keep the testing data hidden both from the analysis and the model since this data shall represent unseen, new information.
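
A common way to perform this split is shown below; the 80/20 ratio and fixed random seed are my assumptions, since the article does not state them.

from sklearn.model_selection import train_test_split

# Hold out a test set and keep it hidden from all analysis and training.
# The 80/20 split and seed are assumptions; the article does not state them.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# All exploratory analysis that follows uses train_df only.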


Figure 3: % of Survived Passengers per Gender.


The above chart makes a clear statement: a substantially larger proportion of females survived than males. This is unfair and resonates with the understanding of "women and children first", known as the Birkenhead drill. Whilst there seems to be no legal basis for this, it was part of the Sea Promise until 2020. According to the source, the Sea Promise changed from:

"seek to preserve the motto of the sea: women and children first"

to

"to let those who are weaker and less able than myself come first."

It is unclear why it was then reverted to the original form, but it has since changed to:

"to let those less able come first."

The above Sea Promise explains the difference in survival between genders. There is also another characteristic that shows a significant discrepancy in the % of survival.


Figure 4: % of Survived Passengers by Class.


Figures 3 and 4 clearly show discrimination. In fact, if we were to write two rules to predict the chance of survival, these would be (sketched in code after the list below):

- Gender = Female

- Passenger Class = 1st.
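
Expressed as code, such a rule-based baseline might look like the sketch below; note that the article does not say how the two rules combine, so the OR below is an assumption.

# A naive rule-based baseline, not one of the article's trained models.
# Combining the two rules with OR is an assumption.
def rule_based_prediction(row):
    if row["gender"] == "female" or row["pclass"] == 1:
        return 1  # predicted to survive
    return 0      # predicted not to survive

train_df["rule_prediction"] = train_df.apply(rule_based_prediction, axis=1)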


Knowing these two variables already gives a good prediction. The other variables have some missing information, which I filled using a process called imputation. I shall write another article on that, but in brief, it is a process by which a data scientist determines what value a variable would have had if it had been recorded. At times it is straightforward or requires a reasonable approximation; other times it needs a lot of thought.
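
As an illustration (the article does not state which imputation method was used), one common approach fills numeric gaps with the training median:

from sklearn.impute import SimpleImputer

# Fill missing ages and fares with the training median.
# Median imputation is an assumption; the article does not name its method.
imputer = SimpleImputer(strategy="median")
train_df[["age", "fare"]] = imputer.fit_transform(train_df[["age", "fare"]])
test_df[["age", "fare"]] = imputer.transform(test_df[["age", "fare"]])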

The first model - Model 1 - attempted to learn which passengers would survive the Titanic tragedy based on the following information:

- pclass

- gender

- age

- sibsp

- parch

- fare


The second model - Model 2 - included all of the above plus two additional features (derived in the sketch after this list):

- is_body: whether a body was recovered (1 = Yes, 0 = No)

- is_boat: whether the lifeboat number is known (1 = Yes, 0 = No)
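
These two indicators can be derived from the boat and body columns; a sketch, assuming a missing value means no body was recovered and no lifeboat was recorded:

# Derive the two extra features for Model 2.
# Assumes a missing value means no body recovered / no lifeboat recorded.
for part in (train_df, test_df):
    part["is_body"] = part["body"].notna().astype(int)
    part["is_boat"] = part["boat"].notna().astype(int)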


Both models were tasked with learning the patterns during training and validating them during testing.
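
A sketch of how the two models could be trained and scored is given below; the random forest and the simple gender encoding are my assumptions, as the article does not name the algorithm it used.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Encode gender numerically; this simple binary encoding is an assumption.
for part in (train_df, test_df):
    part["gender_encoded"] = (part["gender"] == "female").astype(int)

model_1_features = ["pclass", "gender_encoded", "age", "sibsp", "parch", "fare"]
model_2_features = model_1_features + ["is_body", "is_boat"]

# The random forest is an assumption; the article does not name its algorithm.
for name, features in [("Model 1", model_1_features), ("Model 2", model_2_features)]:
    clf = RandomForestClassifier(random_state=42)
    clf.fit(train_df[features], train_df["survived"])
    train_acc = accuracy_score(train_df["survived"], clf.predict(train_df[features]))
    test_acc = accuracy_score(test_df["survived"], clf.predict(test_df[features]))
    print(f"{name}: training accuracy {train_acc:.1%}, testing accuracy {test_acc:.1%}")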


The statistics of the models are as follows:

ACCURACY    Training    Testing

Model A     99.9 %      95.4 %

Model B     98.3 %      73.8 %

One reads the above table as follows:

Model A got an accuracy of 99.9% during Training. This means that Model A predicted the actual status of survival 99.9% of the time. When new information was provided in the Testing phase, Model A predicted the true survival status (accuracy) 95.4% of the time.

Replace Model A with Model B and its respective statistics from the table.
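
For reference, accuracy is simply the fraction of predictions that match the true labels; a toy example with made-up values:

import numpy as np

# Accuracy = correct predictions / total predictions.
actual = np.array([1, 0, 1, 1, 0])       # made-up true labels
predictions = np.array([1, 0, 0, 1, 0])  # made-up predictions
print(f"{(predictions == actual).mean():.1%}")  # -> 80.0%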


Model A did very well both in training and testing, whilst Model B seemed to have learned well but did not perform as well during testing. Could you relate Model A/B to either Model 1 or 2? Which model would you trust most? More importantly, which model is accurate and/or correct?


A data scientist must take the time to understand what each feature means, and which model used what. Only then can one understand and predict the performance of the models.


Model A represents the statistics of Model 2, and Model B those of Model 1. Using fewer features meant lower accuracy on the test data, but which features were used to gain that extra accuracy?


Both is_body and is_boat contain information that is only gathered after the tragedy has occurred and the survival of the passenger is known. If a body was found, then the passenger did not survive, whereas if the person was on a lifeboat, they most likely survived. Model A/2 performed exceptionally well only because it used information about the outcome that it should not have known before the tragedy.


Now that you know the difference, can you tell me which one is correct?

This example is one of many where automated data modelling may, though not necessarily will, either fail or lead to undesirable outcomes in due time. A machine learning model does not know what each feature means but tries to complete a task and reach accuracy using 1s and 0s. It is a data scientist's job to ensure that what the model uses to learn is reliable, makes sense and does not conflict with realistic scenarios. If there were a way to know whether one would survive a shipwreck before it happens, I am sure I would not step on a cruise liner unless I was 100% sure I would survive. Would you?
