
The Scientific Method - Part 3: Experiment

  • Writer: Deandra Cutajar
  • Aug 27, 2021
  • 9 min read

Updated: Dec 8, 2021



Experimenting in data science is much what one imagines an experiment to be. Instead of chemicals, there is data describing different characteristics of the problem, such as age, distance travelled and so on. Instead of putting ingredients into a beaker, data scientists put the data into a model, which is a machine built to process and learn from the data. More importantly, instead of risking blowing up the lab, data scientists risk overheating or overloading the computer. The latter may still cause a minor explosion of its own if it crashes a computer that other people are also using, or ends up utilising more resources than budgeted for, if you know what I mean.

"I know what everyone's gonna say. But they're already saying it. We're mad scientists. We've gotta own it." - Tony Stark, Avengers - Age of Ultron

My interpretation of Tony Stark's comment relates to the different ways a data scientist aspires to test the hypotheses. During the research, the data scientist would have linked the data to different potential models or statistical equations all whilst keeping in mind the output that they are after. In the movie, Dr Banner was projecting the same disastrous output while Stark's mind was set on [a] Vision.


Keeping the same reference above, the danger in a data science experiment arises when the scientist has either overlooked the input data or misinterpreted the model's performance on that same data.


In the research article, I shared some of the steps I take before beginning the experiments. These are not deterministic, and I believe every scientist has their own way of doing research. I also like to use a (virtual) notebook where I list the action points and it more or less looks something like this:

  • Hypothesis A:

    • Model 1

      • Train/Test/Validate

      • Important Features

      • Interpret

  • Hypothesis B:

    • Model 1

    • Model 2

Listing the steps helps me break down a grand task into small actions and achievable deliverables that make it easier to monitor progress. The hypotheses have already been established and researched. Essentially, what needs to be done is to test each hypothesis using the models that were deemed best suited for the data, and compare.


Suppose we have a hypothesis A for which a Model 1 was selected. To assess the model's performance, a data scientist needs to guide the model to learn from a set of data (call it the textbook), be able to complete the homework and ultimately sit for the exam. These keywords translate into train-validate-test a model. All of you have done this in real life, and I am going to share how I did it during my primary education.


In our language subjects, I used to have "dictations" on verbs in the past and present tense. My mother, who was very present during my early education, taught me how to separate the words I needed to train on from the words resembling what I would be tested on. So we would spend hours building these cards based on the school textbooks. As part of my study, she would say the word and the tense to be applied, and I needed to answer accordingly. During the training part, I was allowed to see the answers. This helped me relate to how a verb changes according to tense. However, as you are all aware, the test answers would be hidden. The validation part of my study was a set of cards randomly reserved to act as a mock test.


Breaking it down into simpler terms, the train part allowed me to see all the answers in order to learn the patterns and conditions. Once I felt confident about my skills, I used to validate them against a set of cards that I had not seen. Up until this stage, I was still "studying in the kitchen", so whatever I got wrong I would try to learn better. Ultimately, the day of the test would come: I sat at my desk and answered the questions to the best of my abilities, based on what I had learned and understood. Then, a couple of days later, the teacher would give me the result of how well I did, or otherwise.


A model, whether machine learning or a robust mathematical and statistical solution, needs to learn from the data and then be validated against another data set. This matters most in machine learning and artificial intelligence (AI), but for the purpose of this article, a model must train on some data and be tested on other data it has not yet seen. Once that part is over, the next step is grading its performance. In other words, the model sat for the exam and the data scientist will give the model a result.
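To make the analogy concrete, here is a minimal sketch of a train-validate-test split, assuming scikit-learn and one of its bundled datasets purely for illustration; the article does not prescribe any particular library or model.

```python
# Minimal sketch of a train/validate/test split, assuming scikit-learn and
# one of its bundled datasets; these choices are illustrative, not prescribed.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Set aside the "exam" (test set) that the model never sees while studying.
X_study, X_test, y_study, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Reserve a "mock test" (validation set) from the remaining study material.
X_train, X_val, y_train, y_val = train_test_split(X_study, y_study, test_size=0.25, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)  # training: studying with the answers visible

print("Validation score:", accuracy_score(y_val, model.predict(X_val)))    # the mock test
print("Test score:      ", accuracy_score(y_test, model.predict(X_test)))  # the real exam
```

The validation score is the mock test taken while still "studying in the kitchen"; the test score is the final exam on data the model has never seen.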


A model's performance is assessed by the following, amongst others:

  • Execution Time

  • Accuracy

  • Interpretation

First, a model must efficiently handle large amounts of data and return an output within an acceptable amount of time. Imagine a business wants to understand which advert will most likely attract the most customers. If the algorithm takes two years to run, the advert will be outdated by the time it finishes. Data scientists, with plenty of support from engineers, developers and testers, aim for the optimal execution time for the model.


Focusing back on the science aspect of the experiment, accuracy is a score that shows how many correct answers the model got. Much like the mathematics exam or multiple-choice quiz you may have taken, accuracy can tell you how well the model did. However, this interpretation needs an elaborate discussion in a separate article, because being accurate and being correct may not mean the same thing in data science.
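As a hedged illustration of the first two measures, one could time the training step and then score the predictions; the dataset and model below are stand-ins chosen only to make the sketch runnable.

```python
# Sketch: grading a model on execution time and accuracy.
# The dataset and model are illustrative stand-ins only.
import time

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

start = time.perf_counter()
model.fit(X_train, y_train)              # the step whose runtime we care about
elapsed = time.perf_counter() - start

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Execution time: {elapsed:.3f} s, accuracy: {accuracy:.2%}")
```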


For the rest of the article, I will talk about Interpretation as a measure of performance. Suppose a machine was built that takes a variable X as input and outputs a prediction Y, in a relation similar to the one we discussed for correlation. The data can have whatever shape.


I gathered these two plots from a famous dataset experiment called Anscombe's Quartet, which also aims to raise awareness of how different datasets can lead to the same statistics. The reason I am presenting two of the quartet's datasets is to emphasise the importance of going beyond the statistics and understanding the data.



Look at the two graphs above. Our machine takes the value x as input and tries to model a linear relationship with y. This relationship is represented by a solid line in each plot and is described by two characteristics:

  • gradient: a measure of the inclination (so to speak) of the solid line

  • y-intercept: the value of y when x = 0, or where the solid line hits the y-axis (vertical)

The y-intercept tells you what happens when x = 0. In other words, what you expect when your input is 0. Sometimes the range of the data is so far off from the 0 value that this is overlooked or ignored. Other times, the value of y when x is 0 is an indication of a potential bias: a shift of your data depending on what the actual value of y should be at x = 0. More on bias will be discussed in a separate article.


Now let us discuss the gradient. In a linear scenario such as the above, the gradient gives a measure of how y is expected to change with x. Specifically, a linear relationship describes how a step in x will result in a jump in y. The size of that jump is defined by the gradient, and regardless of where along the x-axis (horizontal) the change occurs, the jump in y always has the same intensity. Whether you move from x = 5 to x = 6 or from x = 9 to x = 10, y changes in the same way and lands on the value dictated by the linear relationship.


The equation that governs a linear relationship is:

y = mx + c

where:

  • m is the gradient

  • c is the y-intercept

I invite you to select two values for m and c, and then start varying x to see what you get for y. Example: set c = 0 and m = 2. Can you predict how y will change with x? Can you predict how increasing x by 1 will affect y? Hint: write down a list of x values, calculate the respective y for each, and then find the difference between each row. You should get the same difference throughout, and that is why linear relationships are so easy to understand and visualise.
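For those who prefer to let the computer do the arithmetic, a short sketch of the same exercise, with the suggested values c = 0 and m = 2, might look like this:

```python
# Sketch of the exercise above: c = 0, m = 2, and a short list of x values.
m, c = 2, 0
xs = list(range(0, 6))
ys = [m * x + c for x in xs]

print("x:", xs)
print("y:", ys)

# The difference between consecutive y values is always m (here, 2),
# which is what makes a linear relationship so easy to reason about.
print("differences:", [ys[i + 1] - ys[i] for i in range(len(ys) - 1)])
```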


For those who are reading this article on a desktop, I invite you to hover the cursor over the solid line; you should see a box with OLS trendline written in bold. Those reading from their smartphone can tap the line. The bottom part shows information about the dataset, which can be obtained from the link on Anscombe's Quartet. What I'd like you to focus on are the two lines under OLS trendline for either of the plots.


Data set 1 has statistics:

y = 0.5x + 3, R-squared = 0.67

while Data set 2 has statistics:

y = 0.5x + 3, R-squared = 0.67

each of which has been rounded to 2 decimal places (i.e. two numbers after the point). A very basic explanation of the second statistic, R-squared, is that it relates to the average distance between the data points and the solid line. So a high value (maximum = 1) means that most of the data points lie close to the line, while a low value (minimum = 0) represents a dataset that, on average, sits furthest away from the line.
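For readers who would like to reproduce these numbers themselves, here is a small sketch that fits an ordinary least squares line to the first two quartet datasets; it assumes seaborn's sample copy of Anscombe's Quartet and scipy, neither of which the article prescribes.

```python
# Sketch: reproducing the trendline statistics for Anscombe's Quartet.
# Assumes seaborn's sample copy of the dataset (fetched by load_dataset)
# and scipy; the article itself does not prescribe any particular library.
import seaborn as sns
from scipy import stats

anscombe = sns.load_dataset("anscombe")   # columns: dataset, x, y

for name in ["I", "II"]:
    subset = anscombe[anscombe["dataset"] == name]
    fit = stats.linregress(subset["x"], subset["y"])
    print(f"Data set {name}: m = {fit.slope:.2f}, "
          f"c = {fit.intercept:.2f}, R-squared = {fit.rvalue ** 2:.2f}")
```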


If I just look at the statistics, with no idea what the graphs look like, I would assume that the datasets look the same: that y increases with x by an amount m each time. Moreover, the relation of the data points to the solid line appears the same as well, and the R-squared is higher than 0.6, which is not too bad a measure. I may be inclined to communicate the following story to the business.

The variable y increases consistently with x. At x = 0, y = 3, and as x increases by 1, y increases by 0.5. Moreover, the bigger the x, the larger the y, and the data points share a nice correlation.

Would I be correct? Does the above statement apply to the graph on the right? Let's recap and focus on the graph.


The equation says that as x increases, y increases with it. This is not true: once x increases past 11, y begins to drop continuously, and not just for one point. The model tried to draw a straight line because that is what it was told to do. In doing so, the line was forced and the model did not capture the true underlying information. Essentially, if a data scientist goes to the business with this equation, a business person who knows their data well will most likely not trust the model, and for good reason. It is one thing to show how the business's beliefs do not align with the data, but it is a whole different thing to force it. Data-driven insight means that the data has the power, and a smart model should learn from it and adapt.


However, there is one problem with this approach. The real data science world deals with a number of variables that can grow to 100 or more. In the above example, our input was x, a single feature investigated against the output y. So one can imagine how impractical it would be to look up these relationships for each machine learning algorithm across, say, 100 features. Moreover, most algorithms are not as straightforward as a linear one, and they require more work (though it is not impossible) to extract the same information as above. Depending on the time allocated, one can investigate. Data scientists usually aim to rank each feature by how important it is to the model, a process referred to as Feature Importance.


In another article, I will go deeper into shaping the explanation. For the time being, I will share how I use Feature Importance. This method ranks the variables the model used according to how useful they were in learning to predict or categorise. Assume you want to know whether or not to carry an umbrella. Is it more useful to check the weather forecast, or what you are going to eat for dinner? The most important variable would have the highest score, followed by the next most important variable and so on. Following the logic through the importance of the features reveals the understanding that the model acquired during training.


In fact, by computing feature importance, a data scientist can run an analysis of the hypothesis learnt by the model based on its top 10 features. The number of features is arbitrary, but it should give a general understanding of the model. For the full hypothesis, a data scientist can go through all the features and translate their importance based on the type of machine learning that was used. It is also common practice to drop variables that have low importance to avoid over-fitting. More on over/under-fitting in another article when we discuss accuracy.
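As a hedged sketch of what ranking the top 10 features might look like in practice (using a random forest and a bundled dataset purely as stand-ins; the article does not commit to a specific model):

```python
# Sketch: ranking features by importance with a random forest,
# used here purely as an example model on an example dataset.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Rank every feature by how useful it was to the model, highest first,
# and keep the top 10 for a first-pass interpretation of its hypothesis.
importances = pd.Series(model.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(10))
```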


If the hypothesis is not right, then it makes sense to revisit the experiment or try a different one on your list. At this point, we are still "studying in the kitchen", so this is the part where data scientists should refine their understanding, experiment and interpretation.


Some data science courses promote accuracy and statistics as the means to determine the success or failure of a model. They are an indication, but explainability has always been the key factor that determines success. I will delve deeper into this in another article, where I focus on how explainability is shaping the data science and artificial intelligence industry in the hope of working together with the public. Look at it this way: would you buy something just because a salesperson tells you it's worth the money, no questions asked?


Before I move on to how I choose to communicate the findings of a model, I will conclude with a concept that will most definitely remain imprinted, or so I hope. The two graphs below were plotted from the Datasaurus dataset to show how the same statistics can come from very different underlying plots. I encourage you to make a run for it.


