
To understand AI, understand Science

  • Writer: Deandra Cutajar
  • Mar 4
  • 4 min read

Everyone wants to use AI, but very few try to understand it. Even fewer want to understand it, and most simply ask AI to explain itself (as if the AI would know when it is wrong). To understand AI, and consequently comprehend its reliability, its ethics and everything associated with it, such as data privacy and copyright infringement,

one must understand Science.

I am not saying you should understand every theorem, lemma or corollary. However, you need to understand and acknowledge what the field of Science aims to do and how everything derived from it ultimately works.


The Oxford Dictionary defines Science as:

knowledge about the structure and behaviour of the natural and physical world, based on facts that you can prove.

Before machine learning or AI started trending, science was conducted via experiments of trial and error, requiring three elements:

  • input,

  • hypothesis,

  • output.


The input (variables or features) can be measured and should contribute to the output. The output refers to what is observed and measured, while the hypothesis is a theoretical understanding, usually expressed as an equation.


The whole concept of science is to perfect the hypothesis and predict the output depending on the inputs. In other words, science is the process by which, knowing an input, one can predict observable phenomena with high accuracy and repeatability.


Such a concept is applied in (supervised) machine learning, whereby several variables are fed into the algorithm together with the output. The machine is expected to construct the theory in between. If the theory is accurate, then every observable prediction can be calculated by knowing its input. In fact, before bias and error were widely acknowledged, most stakeholders demanded a predictive accuracy of 100% because it was expected that the theory would always hold.
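As a rough sketch of that workflow (using scikit-learn and entirely made-up numbers, purely for illustration), one hands the algorithm the inputs together with the observed outputs and lets it estimate the theory in between:

```python
# Hypothetical example: the inputs and outputs are given, and the machine fits the
# "theory" (here a linear hypothesis) that connects them.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.random((100, 3))                    # inputs: three measurable features (made up)
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.3     # outputs: the phenomenon we observed

model = LinearRegression().fit(X, y)        # the machine constructs the hypothesis
print(model.coef_, model.intercept_)        # the "equation" it recovered
print(model.predict(X[:5]))                 # predicting outputs from known inputs
```

Because this toy data follows an exact linear rule, the model recovers it almost perfectly; real data rarely behaves so well, which is exactly where the 100% expectation breaks down.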


If one knows a bunch of variables, one can predict the outcome some time ahead, which is the whole concept of predictive analytics, machine learning and AI.


AI attempts to predict what the user wishes to know via the prompt. Its sole purpose in life is to take the input and figure out the output as accurately as possible. If one reads about encoders, this becomes remarkably obvious. Every machine built was designed to carry out some logic to predict behaviour and act on it. When the data quality is excellent and the theory holds, this becomes possible, which is what led many to build predictive models for revenue, churn, retention, demand and so on.


So now that we have established that AI's sole purpose is to build logic that matches the output to the input, where do copyright and data privacy issues arise?


In unsupervised learning, the output is unknown. Instead, the algorithm searches for patterns in the input and attempts to generate an output that reflects the input. Those working with clustering algorithms know this might take several rounds of trial and error.
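A minimal sketch of that trial and error, assuming scikit-learn's KMeans and a made-up dataset: no outputs are given, so we try several cluster counts and keep the one that best reflects the input's structure.

```python
# Hypothetical example: no labels exist, so we search for structure in the input
# and score each attempt to pick the grouping that reflects the data best.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc, 0.3, (50, 2)) for loc in (0.0, 3.0, 6.0)])  # made-up data

for k in (2, 3, 4, 5):                               # the trial-and-error loop
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # higher means the clusters fit better
```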


Clustering algorithms are also applicable to text and are usually the basis for topic modelling, sentiment analysis and segmentation models. If the output is known, then one opts for a classifier because there is more information to provide to the algorithm, so in theory, there should be more success. When the output is unknown, the algorithm builds the hypothesis; the sketch below contrasts the two.
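To make the distinction concrete, here is a small illustrative sketch (toy sentences and invented labels, using scikit-learn): the same texts can be clustered when no output is available, or classified when the output is known.

```python
# Hypothetical example: cluster texts when the output is unknown,
# train a classifier when the output (the labels) is known.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

texts = ["great product, loved it", "terrible service, very slow",
         "loved the fast delivery", "slow and terrible experience"]
X = TfidfVectorizer().fit_transform(texts)

# output unknown: group similar texts (the basis of topic modelling and segmentation)
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))

# output known: the classifier receives the labels as extra information
labels = [1, 0, 1, 0]                       # invented labels: 1 = positive, 0 = negative
print(LogisticRegression().fit(X, labels).predict(X))
```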


How does AI build a hypothesis?

It doesn't, not in the way scientists are used to. Instead, it relies on standard equations, and the gazillion hypotheses it holds are all variations of that one global hypothesis. The whole purpose of AI is to reconstruct an output based on the input, using the hypothesis stored in the neurons.


The image below was taken from Sparse Autoencoders lecture notes by Andrew Ng. It describes how encoders work, where Layer 1 has the input, and the whole process aims to construct Layer 3 outputs.


[Image taken from Sparse Autoencoders: the autoencoder network, Layer 1 (input) through Layer 3 (output).]

Moreover, Andrew Ng explains the following just below the image:

[Image taken from Sparse Autoencoders: Andrew Ng's explanation below the network diagram.]

The neurons in the layers are trying to construct a hypothesis so that the output is similar to the input. So when we prompt an AI tool, it aims to produce an output based on derivative data stored in the neurons and the activation functions learnt during training, with the sole purpose

to produce an "output that is similar to" the input.
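A minimal sketch of that reconstruction objective, written in plain NumPy with a single hidden layer and made-up data (a toy stand-in for the network in the lecture notes, not a production model):

```python
# Hypothetical example: a tiny autoencoder trained so Layer 3 reproduces Layer 1.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 6))                 # 200 made-up samples with 6 features

n_in, n_hidden = X.shape[1], 3           # the hidden layer is the bottleneck
W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_in)); b2 = np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(2000):
    H = sigmoid(X @ W1 + b1)             # Layer 2: the "neurons" holding the hypothesis
    X_hat = sigmoid(H @ W2 + b2)         # Layer 3: the reconstruction of the input
    loss = np.mean((X_hat - X) ** 2)

    # gradient descent: nudge the weights so the output looks more like the input
    dZ2 = 2 * (X_hat - X) / X.size * X_hat * (1 - X_hat)
    dW2, db2 = H.T @ dZ2, dZ2.sum(axis=0)
    dZ1 = dZ2 @ W2.T * H * (1 - H)
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"reconstruction error after training: {loss:.4f}")
```

The details differ wildly between models, but the objective is the one described above: an output that is similar to the input.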

In all fairness, this fact is valid for all algorithms because, in Science, one conducts experiments to learn features that predict observable phenomena. In other words, we want the AI to be able to do precisely that.

This is where issues arise!

Since the whole point of AI is to reconstruct the input as accurately as possible, the information in the neurons includes enough features that, when joined together, the output is similar to, if not exactly like, the input. Anything less than that is dubbed a "hallucination".


This mechanism gives rise to copyright and proprietary issues and data privacy concerns.

Ok, what do we do?

The problem is not the AI. AI is doing what we want it to: learning patterns and hypotheses to predict observable phenomena accurately and repeatedly. The problem revolves around the ethics of the input data, specifically:

  • how was it collected,

  • how was it curated, especially around PII or sensitive data,

  • how was the proprietor of the data compensated?


If we can truthfully answer the above questions, we can move away from seeing AI as a data-theft machine and instead see it for what it is: a machine designed to construct a hypothesis so that the output mirrors the input.

 
 
 
