Data Privacy in Data Science

Deandra Cutajar
May 30, 2024

Data privacy is often considered a concern only for Big Tech companies. People think that only tech companies are at risk of breaching data privacy because that is what we hear about on social media. One need only speak to another person to see how poor data privacy literacy is amongst private citizens.

Design by Geoff Sence

From personal experience, I once brought to an individual's attention that their camera was breaching my data privacy rights. They replied, as if it were obvious, that it was "just a camera". That is where the problem in our society lies. There is nothing "just" about technology. It is not just a camera, never just a photo, not just a recording, not just a video and not just a selfie. With today's technology, anything can be turned into something more. As citizens, we need to become aware, protect ourselves, and ensure we are not putting anyone else in danger, especially minors. Our duty as adults is to ensure that no photo is handled as "just a photo".

Data professionals carry a more significant responsibility in ensuring that the data privacy of individuals is respected and not exploited in the name of "AI for Good". If AI is intended for good, and I believe we can achieve that, then we start by respecting data proprietors. In this article, I shall explain how data privacy governance in data analytics and science ensures that the data team doesn't see data they don't need. At the same time, it must remain possible to link anonymous data back to identifiable data for the teams that, as agreed with the data proprietors, require personally identifiable information (PII).

In a previous article, I focused on three data roles: analyst, engineer, and scientist. I shall refer to those roles again to explain each role's different stages and responsibilities when ensuring that a data point is handled carefully because no data point is "just a data point" anymore.

The first thing a company needs to realise when making a data agreement with a client is that the client trusts you to take care of their data. That trust must be respected, adhered to, and, as I will explain, enforced throughout the pipeline through which the data is processed.

When a client and a business agree that some data will need to be stored in the company's storage for the performance of the service, product or even improvements, the data engineering team needs to get involved to:

Separate PII from non-PII data.
Understand which team or users will require access to that PII.
Link each unique PII record to an internal unique identifier.
Keep non-PII data linked to an internal unique identifier.
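The four steps above can be sketched in a few lines of Python. This is a minimal illustration and not a production design: the column names, the `PII_COLUMNS` set and the `split_record` helper are all hypothetical, and which fields count as PII is decided by the data agreement, not by the code. The sketch also issues a fresh identifier per record; a real pipeline would reuse the identifier when the same person appears again.

```python
import uuid

# Hypothetical column names; the real list comes from the data agreement.
PII_COLUMNS = {"name", "email", "phone"}


def split_record(record, pii_columns=PII_COLUMNS):
    """Split one raw record into a PII part and a non-PII part.

    Both parts carry the same internal identifier, so they can be
    re-joined later by a team that is allowed to see the PII.
    """
    # Random identifier that carries no personal information itself.
    internal_id = uuid.uuid4().hex
    pii = {"internal_id": internal_id}
    non_pii = {"internal_id": internal_id}
    for column, value in record.items():
        target = pii if column in pii_columns else non_pii
        target[column] = value
    return pii, non_pii


raw = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "plan": "premium",
    "monthly_usage": 42,
}
pii_row, non_pii_row = split_record(raw)
```

Here `pii_row` would go to a restricted table and `non_pii_row` to the table that analysts and scientists can query; the shared `internal_id` is the only link between the two.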

For each data entry, the engineers work with the company's data owners, i.e., the department that reached an agreement with the data proprietors, to define the description of each column and its purpose of use.

When clients trust you with their data, they set some rules on what your business can and cannot do with that data. Any negotiations about the data must happen at this stage. Whatever the agreement at that stage, it is final unless the discussion is revisited and a new agreement is reached.

Respecting the client's wishes regarding the use of their data is crucial for the company's reputation.

Once the purpose of the data's use is agreed upon, and thus the company is now storing the client's data, that agreement needs to be shared with the rest of the company, especially those who may require access to the data. Without oversimplifying, data analysts and scientists who may sit in different business functions must be aware of this.

When a project commences, both data analysts and scientists search the database for the data they need to complete the requested work. Guided by a data glossary or dictionary, they either already have access to the relevant table or must request it. Either way, a process needs to be in place.

The likelihood that a data professional has access to a table is high because they need access to data for their work. However, the chances that the same person knows whether they can use that data for their project are remarkably low. Moreover, data scientists are more likely to collect data to build machine learning or AI models. Thus, it is becoming increasingly important for data professionals to know what they can and cannot do with the data.

Going by the purpose of each role described in my previous article, a data analyst may have less to worry about here: if role-based permissions were set up properly at the start, the analyst can only access data that was deemed necessary for the role. They can then carry out their analysis and deliver their project.

Data scientists, on the other hand, cannot assume the same. While they can access the data, it doesn't imply that they can use that same data to build predictive models, train those models, and commercialise them. That assumption is wrong.

The key to understanding Data Privacy in Data Science is to separate what can be known because it is available from what needs to be known for the project.

I'd like to add that it is generally in the client's interest to allow such training, as it would enhance the product's performance for their business model.

However, that permission is not the default, which is why data privacy in data science is important.
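The difference between having access to data and being allowed to use it for a given purpose can be made explicit in the data dictionary. The sketch below assumes a hypothetical dictionary in which each column lists the purposes agreed with the data proprietor; the column names, purpose labels and the `check_purpose` helper are illustrative only.

```python
# Hypothetical data dictionary: each column maps to the purposes
# agreed with the data proprietor.
DATA_DICTIONARY = {
    "monthly_usage": {"reporting", "model_training"},
    "plan": {"reporting"},
    "location": set(),  # stored for the service, not cleared for analytics
}


def check_purpose(columns, purpose, dictionary=DATA_DICTIONARY):
    """Return the columns that are NOT cleared for the given purpose.

    Columns missing from the dictionary are treated as not cleared,
    so the safe answer is the default.
    """
    return [c for c in columns if purpose not in dictionary.get(c, set())]


# A data scientist can query both columns, but only one may be used
# to train a model under this (made-up) agreement.
blocked = check_purpose(["monthly_usage", "plan"], "model_training")
```

An empty result means every requested column is cleared for that purpose; anything else goes back to the data owners before the project continues.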

Data privacy in data science doesn't stop at ensuring the data accessed can be used for modelling. It also includes the model itself. Some articles ago, I wrote about Sharing Data with AI. Specifically, I explained how data scientists can take a sample of a dataset and, with enough computational power, recreate different variations of the true dataset, most of which will closely resemble the original data.

When data scientists build models using anonymised data, we try to find behavioural patterns. This is where correctly labelling sensitive data becomes important. If I know that a person of a given gender and ethnicity goes from 'Lat A, Lon A' to 'Lat B, Lon B', that information becomes dangerous and identifiable. Even if I don't use gender and ethnicity, knowing the exact latitude and longitude of a person's routine can be identifying. Plot it on a map and, voilà, that person can be identified without a name, surname, or ID.

I know what some of you will argue. Some will say, "Tough luck," as it seems that the data privacy ship has sailed, whilst others, in their attempt to do the right thing, will ask the correct question: "How can we protect the individual without jeopardising the project?"

Normalise or Standardise. In other words, transform the data while maintaining the pattern.

Always keep a reference of your transformation so that data professionals will not know the exact position of the person, but those who need to know can reverse the transformation. This mindset can be adapted and extended to other forms of data whereby a data privacy concern is raised.
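As a minimal sketch of this idea, the code below standardises a list of latitudes and keeps the mean and standard deviation as the reference needed to reverse the transformation. The function names and coordinate values are made up for illustration. It also assumes the reference is stored separately under restricted access, because anyone holding it can recover the original positions.

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale values to zero mean and unit standard deviation.

    Returns the transformed values together with the reference
    (mean and standard deviation) needed to reverse the transformation.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; nothing to standardise")
    scaled = [(v - mu) / sigma for v in values]
    return scaled, {"mean": mu, "std": sigma}


def reverse(scaled, reference):
    """Recover the original values using the stored reference."""
    return [s * reference["std"] + reference["mean"] for s in scaled]


# Made-up latitudes of one person's routine.
latitudes = [35.8989, 35.9042, 35.8853, 35.9121]

scaled, reference = standardise(latitudes)  # what the data team works with
restored = reverse(scaled, reference)       # only for those who need to know
```

The relative pattern between the points is preserved in `scaled`, so a model can still learn from it, while the exact positions are only recoverable with `reference`. The same function would be applied to longitudes with their own reference.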

Data privacy is everybody's problem and responsibility. It starts with each of us understanding our rights and then extending that respect to our family, friends, neighbours, colleagues, and clients. Data professionals have that responsibility, too. Precisely because we are the Data team, we can't leave it to the Business alone to know what may be done with the data. We need to collaborate with the Business and ensure that the company's reputation grows on ethical grounds and respect for its clients.
