How to anonymize your data?

It is becoming increasingly important to anonymize your data before using it for AI. This guide presents some ideas on how to do it.

This is because as technology advances, we are able to gather, process, and use more data than ever before. And while this data can be incredibly useful for improving our understanding of the world and making new discoveries, it can also be used to identify individuals and track their behavior.

When it comes to using data for AI, privacy must be paramount. In our experience, PII is often useless for AI. By anonymizing your patient data, you can therefore protect privacy and potentially improve your AI results at the same time. This document will guide you on how to do this.

What data should be anonymized?

When it comes to anonymizing a dataset, the choice is not always black and white. While there's consensus on what makes 'sensitive data', like Personally Identifiable Information (PII), interpretations of its boundaries may vary from person to person or industry to industry. For example, contact information could be seen as impersonal by a marketing agency, but it may be seen as highly sensitive by security personnel. AI developers need to weigh their options carefully when deciding which datasets require an extra layer of security. Get this wrong and you risk running into hot water!

The broad agreement is to protect and anonymize all Personally Identifiable Information (PII), separating it from the rest of the variables, irrespective of legal or industry influence. This includes:

Names
Mobile numbers
Photographs
Passwords
Addresses
Legal documents

The following data types cannot be uploaded to Arkangel AI and must be protected:

Data regulated by the Payment Card Industry Data Security Standards, or other financial account numbers or credentials.
Information regulated by the U.S. Health Insurance Portability and Accountability Act.
Social security numbers (or local equivalent), driver’s license numbers or other government ID numbers.
Sensitive personal data (including special categories of personal data defined under Article 9 and criminal offense data defined under Article 10 of the E.U. and U.K. General Data Protection Regulation).
Any personal data of individuals under 16 years old.
Information subject to regulation or protection under the U.S. Gramm-Leach-Bliley Act, U.S. Children’s Online Privacy Protection Act or similar foreign or domestic laws.

6 techniques to anonymize your data

There are many tools and techniques for anonymizing your information, but these are the most common:

Data Masking

Data masking means giving access to a modified version of the sensitive data. To apply this technique, you create a masked mirror version of the database. Masking methods range from encryption to character shuffling or dictionary substitution.
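As a minimal sketch of the character-substitution flavor of masking, the Python snippet below hides all but the last few characters of a value. The function name mask_value, the sample record, and the choice to keep four visible characters are assumptions made for illustration, not part of any Arkangel AI tooling.

```python
def mask_value(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Replace all but the last `visible` characters with `mask_char`.

    Values that are too short to keep anything visible safely are
    masked completely.
    """
    if visible <= 0 or len(value) <= visible:
        return mask_char * len(value)
    hidden = len(value) - visible
    return mask_char * hidden + value[hidden:]


# Hypothetical record used only for this example.
record = {"name": "Laura Velasquez", "phone": "3001234567"}

# Build the masked "mirror" copy; the original record is left untouched.
masked = {**record, "phone": mask_value(record["phone"])}
print(masked["phone"])  # ******4567
```

In a real pipeline you would apply the same function to every sensitive column and share only the masked copy.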

Pseudonymization

Pseudonymization means replacing the private identifiers in a sensitive variable with artificial ones (pseudonyms). For example, the name "Laura Velasquez" might be switched with "Olivia Smith". This helps maintain data confidentiality while preserving statistical accuracy.
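One possible way to generate pseudonyms consistently is a keyed hash, so the same input always maps to the same pseudonym. The sketch below uses Python's standard hmac module; the key value, the "patient_" prefix, and the function name pseudonymize are hypothetical. Note that this variant produces opaque identifiers rather than realistic replacement names like the example above.

```python
import hashlib
import hmac

# Hypothetical secret; in practice, keep it out of source control.
SECRET_KEY = b"replace-with-a-secret-key"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a sensitive value to a stable, opaque pseudonym."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Keep a short prefix of the digest so the identifier stays readable.
    return "patient_" + digest[:12]


first = pseudonymize("Laura Velasquez")
second = pseudonymize("Laura Velasquez")
print(first == second)  # True: the same input gives the same pseudonym
```

Because the mapping is stable, records belonging to the same person can still be linked across tables, which is what preserves the statistical structure of the dataset.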

Generalization

Generalization means transforming certain data types to make them less identifiable. To do this, you can replace exact values with ranges whose boundaries make sense for your use case. For example, a user's specific address could be replaced by the zip code alone. You could do the same with age, dates, and other fields. The idea is to remove the specifics of the data without compromising the understanding of the information.

Tip: Generalization is a technique that could enhance your AI performance.
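The idea can be sketched in Python as follows, assuming ten-year age bands and dates reduced to the year. Both choices, as well as the function names and sample values, are illustrative assumptions; the right boundaries depend on your use case.

```python
from datetime import date


def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_date(d: date) -> str:
    """Keep only the year of a date."""
    return str(d.year)


print(generalize_age(37))                   # 30-39
print(generalize_date(date(1986, 4, 17)))   # 1986
```

Wider bands give stronger protection but less detail, so the width parameter is the main trade-off to tune.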

Data Swapping

Data swapping (also called shuffling or data permutation) exchanges the values of certain columns, such as date of birth, between records. This can make individual records unrecognizable and protect confidential information, while column-level statistics stay the same. Shuffling lets businesses protect the privacy of their dataset while still gaining valuable insights from analysis.
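A minimal Python sketch of swapping, assuming the dataset is a list of dictionaries: the values of one column are shuffled across rows while every other column stays in place. The function name swap_column and the sample rows are hypothetical.

```python
import random


def swap_column(rows, column, seed=None):
    """Return copies of `rows` with the values of `column` shuffled."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    # Reassign the shuffled values; other columns are left untouched.
    return [{**row, column: value} for row, value in zip(rows, values)]


# Hypothetical rows used only for this example.
rows = [
    {"id": 1, "birth_date": "1986-04-17"},
    {"id": 2, "birth_date": "1990-11-02"},
    {"id": 3, "birth_date": "1975-01-30"},
]
swapped = swap_column(rows, "birth_date", seed=7)
```

The set of values in the column is unchanged, so counts and distributions for that column are preserved, but its relationship to the other columns is deliberately broken.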

Data Perturbation

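Data perturbation changes the initial dataset slightly by using rounding methods and random noise. Noise, in this case, means small random values added to the original data so that individual records no longer match their true values, while overall patterns remain approximately intact. The amount of noise must be chosen carefully: too little offers weak protection, and too much degrades the usefulness of the data.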
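As an illustrative sketch, the Python function below adds Gaussian noise to numeric values and then rounds them. The choice of Gaussian noise, the default standard deviation, and the name perturb are assumptions; other noise distributions are equally possible.

```python
import random


def perturb(values, noise_std=1.0, digits=0, seed=None):
    """Add zero-mean Gaussian noise to each value, then round."""
    rng = random.Random(seed)
    return [round(v + rng.gauss(0.0, noise_std), digits) for v in values]


# Hypothetical weights in kilograms.
weights = [70.4, 82.6, 65.1]
noisy = perturb(weights, noise_std=2.0, seed=42)
```

A larger noise_std hides individual values better but shifts statistics such as the mean further from their true values.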

Synthetic Data

Synthetic data is created by algorithms that generate new datasets based on patterns discovered in pre-existing information. Using statistical methods such as linear regression and standard deviation calculations, artificial records are generated that are not tied to any real individual.
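A deliberately simple Python sketch of the idea: estimate the mean and standard deviation of one numeric column and draw new values from a normal distribution with those parameters. The assumption of a normal distribution, the function name synthesize, and the sample values are illustrative only. Real synthetic-data tools also model relationships between columns, which this example does not.

```python
import random
import statistics


def synthesize(values, n, seed=None):
    """Draw `n` artificial values matching the mean and spread of `values`."""
    rng = random.Random(seed)
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical systolic blood pressure readings.
real = [118.0, 125.0, 131.0, 122.0, 140.0]
synthetic = synthesize(real, n=100, seed=0)
```

None of the generated values is copied from an original record, yet their overall distribution resembles that of the real column.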

