The Scientific Method - Part 2: Research
- Deandra Cutajar
- Aug 4, 2021
- 7 min read
Research is, in my opinion, another very important step in the project lifecycle. I read articles claiming that Data Science is dead because most of the work can be automated. However, research can never be automated, and because of that, I believe data science shall live on. I make this claim with full awareness of how businesses are evolving and how the industry is educating itself on data richness and machine learning.
Research is about searching for something. According to the Cambridge Dictionary, research is
a detailed study of a subject, especially in order to discover (new) information or reach a (new) understanding.

In this definition, there is nothing repetitive to be automated, and this is where the value of a data scientist comes into play. However, a person need not be a data scientist to understand what research is and to do it.
Conducting research is like opening a puzzle box. The pieces are scattered around and, slowly, one piece at a time, you start forming the picture. In this case, the final picture is shown on the cover of the box, so the task is more guided and relatively simple.
Renting or buying a property follows the same analogy. An individual sets a financial budget, which will be a condition. They then go on to see what the market (data) is like and what properties (information) can be purchased within that budget. Different requirements come into play: one buyer may want a garage, whereas another can afford a pool. The market is wide, with lots of different properties in different locations, so the interested party refines the search by selecting their preferences. I am sure that most readers have now recognised the kind of research they have conducted in their own lives.
Data Science isn't any different. The research we conduct using the data must relate to the hypothesis discussed earlier with the business and the conditions set forth. Moreover, it must be refined to align with the requirements and the intended application. In general, there is no "ideal" amount of time dedicated to research, since it greatly depends on how well the construction of the hypothesis went. Nonetheless, I recommend spending at least one week researching the following:
- the Business problem
- the Data
- Machine-learning techniques or statistical tools
- Scoring metrics
Researching the business problem is useful for gathering broader knowledge about the project. In the first article, I expressed that I love doing a job when I know what I am doing it for, and that stretches beyond knowing the data science techniques. Adding value to current knowledge requires searching for new insight, and it starts with learning how the problem affects the business. It is useful to see the different approaches that other scientists may have taken to tackle the problem. The market is diverse, and some industrial problems may have already been studied, especially by large tech companies. Perhaps not as specifically as the project requires but, in my experience, there will always be some overlap.
In the end, science is universal.
There may also be a paper published by a student about a study conducted towards a degree. Even if the information is purely theoretical, wherever it comes from, as long as the source is reliable, it can either confirm your understanding or show you something new. This part may not take a lot of time if the hypothesis was constructed carefully. Nonetheless, one never knows what other information has already been found. Moreover, it is always useful for the data scientist to do some reading around the problem, such as "customer behaviour in industry A".
Now that all the literature and theory have been gathered, the next step is wrangling the data and getting acquainted with it. Before the data scientist begins to figure out which model to apply to the problem, or which statistical measure to use, the data ought to be explored. During Exploratory Data Analysis, a data scientist can identify the following, amongst other properties:
- Correlation
- Multicollinearity
- Missing Data
- Sparsity
- Outliers
All these will indicate how the model will perform, its limitations and its interpretability.
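Before going through each of these in turn, it helps to take a quick first look at the data. Below is a minimal, hypothetical sketch of such a first pass using pandas; the tiny DataFrame is only a stand-in for whatever the project's real data source happens to be.

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the project's real dataset, purely for illustration.
df = pd.DataFrame({
    "age": [34, 45, 29, 52, np.nan],
    "spend": [120.5, 80.0, 0.0, 230.1, 95.3],
    "segment": ["A", "B", "A", "C", "B"],
})

# A quick first look before any modelling decisions are made.
print(df.shape)       # rows and columns
print(df.dtypes)      # data type of each feature
print(df.describe())  # basic statistics for the numeric features
print(df.head())      # a glance at the raw records
```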
Correlation and Multicollinearity measure how variables (features) change with respect to other variables. Correlation shows whether a change in feature A implies a change in feature B. For example, look at the two graphs below.
Each graph shows how y (vertical) relates to x (horizontal). The titles at the top of the charts indicate which relationship shows correlation and which does not. At first glance, correlation can be judged by how far the points spread from a linear trend. This is a common rule of thumb that will do for now. The more scattered the points are, the less the two variables are correlated. In fact, the graph on the right shows points that form a circular spread.
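As a rough illustration of the idea (not taken from the article's own figures), the snippet below generates one pair of variables with a linear relationship and one without, and compares their Pearson correlation coefficients.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=500)

# Left-hand case: y follows a linear trend in x plus a little noise -> correlated.
y_corr = 2.0 * x + rng.normal(scale=0.5, size=500)

# Right-hand case: y is generated independently of x -> roughly no correlation.
y_uncorr = rng.normal(size=500)

print(np.corrcoef(x, y_corr)[0, 1])    # close to 1
print(np.corrcoef(x, y_uncorr)[0, 1])  # close to 0
```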
Using both x and y from the left chart (Correlated) as inputs to a machine-learning algorithm could confuse the model. Due to their relationship, both features will essentially relate to the output in the same way. The model may choose to learn from one feature and not the other. There is nothing scientifically wrong with this. Nonetheless, the business may have a preference for a particular feature based on their experience and how the business operates. Presenting features the business does not favour will lead to questions and doubts about the model, which could have been avoided if correlation analysis had been conducted and the same information obtained from a more accepted feature.
Correlation analysis can also lead to understanding which variable is best for the model to predict an output. If the output of a model is y, the input is x and the model is linear, the left graph above shows that the variable x will be a good indicator of our output. That is to say, by knowing the value of x, a scientist can predict, more or less, the value of y. Suppose a data scientist wants to predict their personal savings in one year, and the available features include salary, bills, annual bonus, hours spent at the gym, hours spent reading and hours spent with family. Which features would the model use to understand the scientist's saving patterns? A quick look at one's bank statement can help shed light.
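To make the savings example concrete, here is a small hypothetical sketch: the feature names mirror the ones listed above, the data is synthetic, and the candidate features are simply ranked by the strength of their correlation with the target.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical monthly records; savings are driven mostly by salary and bills.
salary = rng.normal(3000, 300, n)
bills = rng.normal(1200, 150, n)
gym_hours = rng.normal(8, 2, n)
savings = salary - bills + rng.normal(0, 100, n)

df = pd.DataFrame({"salary": salary, "bills": bills,
                   "gym_hours": gym_hours, "savings": savings})

# Rank candidate features by the strength of their correlation with the target.
print(df.corr()["savings"].drop("savings").abs().sort_values(ascending=False))
```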
Multicollinearity is similar but involves more variables. In a case where a variable is the sum or difference of two other variables, such as X = Y + Z, the three variables are linearly related, i.e. when either Y or Z increases, X increases by the same amount, and so on. The model will struggle to capture the true impact of each variable, which may lead to imprecise modelling and errors. This is usually the case when the data contains a total of X people, of whom Y are of type A and Z belong to type B. Essentially, if the *type* is deemed important, it is advisable to remove the total (X) from the model and avoid risking multicollinearity. In any case, each variable can be calculated from the other two, so no information is lost.
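One common diagnostic for this situation (my addition here, not something prescribed in the article) is the variance inflation factor. The sketch below builds an X = Y + Z style dataset and shows how the VIF flags all three columns as nearly redundant; the statsmodels function and the tiny noise term are illustrative choices.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(1)
n = 300

# Y and Z are independent counts; X is (almost exactly) their sum.
y = rng.poisson(50, n).astype(float)
z = rng.poisson(30, n).astype(float)
x = y + z + rng.normal(0, 0.5, n)  # a tiny noise term keeps the numbers finite

features = add_constant(pd.DataFrame({"X_total": x, "Y_typeA": y, "Z_typeB": z}))

# A VIF far above ~5-10 flags a feature that is largely explained by the others.
for i in range(1, features.shape[1]):  # skip the constant column
    print(features.columns[i], round(variance_inflation_factor(features.values, i), 1))
```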
It is important to say that not all algorithms are affected by these properties. Those of a decision-based nature are deemed to be largely immune. These models use the following logic:
If <condition is true> then do <action 1>, otherwise do <action 2>
and are therefore not strongly influenced by such relationships within the data. The only cause for concern is which feature the model chooses to use the most. In this case, one decides how to handle it on a project-by-project basis.
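A short, hypothetical sketch of this behaviour: a decision tree is trained on two almost identical features and, rather than breaking, it simply concentrates its importance on one of them.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
n = 1000

a = rng.normal(size=n)
b = a + rng.normal(scale=0.05, size=n)   # b is almost a copy of a
target = (a > 0).astype(int)

X = np.column_stack([a, b])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, target)

# The tree still predicts well; it simply leans on one of the twin features.
print(tree.score(X, target))
print(tree.feature_importances_)   # importance typically lands mostly on one feature
```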
Data quality depends on the instruments used to gather it and on human error. Downtime of one service or another can result in Missing Data, which introduces a gap in the information provided. It is therefore useful to know how many data points are missing. Dealing with these missing points is tricky and requires thought. Any assumption may induce a preference towards a value. Such behaviour is referred to as bias, to which I will dedicate an article. Missingness goes somewhat hand-in-hand with Sparsity in the sense that, if the missing data is replaced with a value of 0, and the volume of rows with a value of zero is high, then one is introducing another problem, amongst others.
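A minimal sketch of how such a check might look in pandas, on made-up sensor data: count the missing values per column and, separately, the share of exact zeros, which hints at sparsity or at gaps that were silently filled with 0.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps caused by downtime.
df = pd.DataFrame({
    "sensor_a": [1.2, np.nan, 0.0, 3.4, np.nan, 0.0],
    "sensor_b": [0.0, 0.0, 0.0, 5.1, 0.0, 2.2],
})

print(df.isna().sum())    # count of missing points per feature
print(df.isna().mean())   # fraction missing per feature
print((df == 0).mean())   # fraction of exact zeros per feature (sparsity hint)
```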
Another good analysis to conduct is to check whether there are data points that look odd or different, points that do not follow the rest of the data. These are referred to as Outliers because they usually lie away from the central location of the majority of the data. Each industry may choose to deal with these points differently, and I will share some of the ways they can be handled in another article.
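As one simple, illustrative way of flagging such points (the classic interquartile-range rule, not necessarily what any given industry would choose), the sketch below plants two extreme values in otherwise well-behaved data and recovers them.

```python
import numpy as np

rng = np.random.default_rng(3)
values = np.concatenate([rng.normal(100, 10, 500), [300.0, -50.0]])  # two planted outliers

# Interquartile-range rule: flag points far outside the middle 50% of the data.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)   # should recover the two planted points
```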
At this point, the data scientist should have understood the kinds of data types to work with and can begin selecting the tools to deliver the project. Choosing which machine-learning technique to use, or whether one needs a machine-learning algorithm at all, depends on the data available. There are instances where a machine-learning algorithm is a bit over the top for the problem. Other times, the data is large and, during the exploration, the data scientist will have deemed it suitable for training a model without inducing errors or biases. Sometimes a statistical mathematical model is all that is needed, and occasionally a bit of both. There are projects where a statistical model is utilised to help the model use the features in the way it knows how to. More on this will be explained in a separate article when I discuss the transformation of variables.
Finally, or rather in conjunction with the previous paragraph, the data scientist must understand the scoring metrics that ought to be used, and why. A metric is a measure that judges whether the performance of a model is a success, a failure, or still requires more work. An example of a metric is a rank. There are lots of metrics to choose from and, in any case, the mathematical language is always flexible enough to come up with new ways to score models. Remember that the metric needs to be understood by both the data scientist and the business.
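As a hypothetical illustration, the snippet below computes a few widely used metrics from scikit-learn on toy predictions; the actual metric would, of course, be chosen together with the business.

```python
from sklearn.metrics import accuracy_score, precision_score, mean_absolute_error

# Toy classification outcome: did a customer churn (1) or not (0)?
y_true_cls = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy_score(y_true_cls, y_pred_cls))    # overall share of correct calls
print(precision_score(y_true_cls, y_pred_cls))   # how trustworthy the positive calls are

# Toy regression outcome: predicted vs actual monthly savings.
y_true_reg = [410, 520, 300, 615]
y_pred_reg = [395, 560, 290, 600]
print(mean_absolute_error(y_true_reg, y_pred_reg))  # average error, in the target's units
```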
Unfortunately, research is sometimes considered a waste of time. The person doing the research can be judged as delaying the actual work when, in fact, it is quite the opposite. The research should help the data scientist list action points and experiments to conduct towards the project deliverable. That means that once the research is complete and communicated to the business, especially if there is new information or additional understanding of the problem, the data scientist can begin testing the hypotheses based on that research.



