Data anonymization plays an important role in today's data-driven society, because much of the data we collect is sensitive. This article uses the `anonympy` package to address the problem. The right method depends on the type of data being anonymized. Data anonymization has many pitfalls, so no single-step approach achieves it. Apply every method with care: before applying anything, understand the data thoroughly and keep the end goal in mind.
Introduction
Our world is flooded with digital data: an often-quoted estimate puts the amount produced every day at 2.5 quintillion bytes. Much of that data is personal and sensitive, something the person it relates to would not want disclosed. Examples of personal and sensitive data include names, identification card numbers, and ethnicity. However, data also contains valuable business insights. So, how do we balance privacy with the need to gather and share valuable information? That's where data anonymization comes in.
Background
With the rising need for data anonymization and the extensibility of Python's packages, I thought it would be nice to create a library that provides numerous data anonymization techniques and is easy to use. Please meet my very first package, anonympy, created in the hope of contributing to the open-source community and helping other users deal with sensitive data. For now, the package provides functions to anonymize tabular (pd.DataFrame) and image data.
Using the Code
As a usage example, let's anonymize the following dataset - sample.csv.
Let's start by installing the package. It can be achieved in two steps:
pip install anonympy
pip install cape-privacy==0.3.0 --no-deps
Next, load our sample dataset which we will try to anonymize:
import pandas as pd
url = r'https://raw.githubusercontent.com/ArtLabss/open-data-anonimizer/0287f675a535101f145cb975baf361a96ff71ed3/examples/files/new.csv'
df = pd.read_csv(url, parse_dates=['birthdate'])
df.head()
Looking at the columns, we can see that all of them are personal and sensitive, so we will have to apply a relevant technique to each one. First, we need to initialize our dfAnonymizer object.
from anonympy.pandas import dfAnonymizer
anonym = dfAnonymizer(df)
It’s important to know the data type of a column before applying any function to it. Let’s check the data types and see which methods are available to us.
print(anonym.numeric_columns)
print(anonym.categorical_columns)
print(anonym.datetime_columns)
... ['salary', 'age']
... ['first_name', 'address', 'city', 'phone', 'email', 'web']
... ['birthdate']
from anonympy.pandas.utils import available_methods
print(available_methods())
... `numeric`:
* Perturbation - "numeric_noise"
* Binning - "numeric_binning"
* PCA Masking - "numeric_masking"
* Rounding - "numeric_rounding"
`categorical`:
* Synthetic Data - "categorical_fake"
* Synthetic Data Auto - "categorical_fake_auto"
* Resampling from same Distribution - "categorical_resampling"
* Tokenazation - "categorical_tokenization"
* Email Masking - "categorical_email_masking"
`datetime`:
* Synthetic Date - "datetime_fake"
* Perturbation - "datetime_noise"
`general`:
* Drop Column - "column_suppression"
In our dataset, we have 6 categorical columns, 2 numerical and 1 of datetime type. Also, the list that available_methods returned shows the functions available for each data type.
Let’s add some random noise to the age column, round the values in the salary column and partially mask the email column.
anonym.numeric_noise('age')
anonym.numeric_rounding('salary')
anonym.categorical_email_masking('email')
# or, with a single call:
# anonym.anonymize({'age': 'numeric_noise',
#                   'salary': 'numeric_rounding',
#                   'email': 'categorical_email_masking'})
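To make the three techniques concrete, here is a minimal standard-library sketch of what perturbation, rounding and partial email masking do to a value. It is not anonympy's implementation: the helper names (add_noise, round_to_base, mask_email), the noise range, the rounding base and the sample values are all assumptions made for illustration.

```python
import random


def add_noise(values, low=-5, high=5, seed=None):
    """Perturbation: shift each number by a random integer offset in [low, high]."""
    rng = random.Random(seed)
    return [v + rng.randint(low, high) for v in values]


def round_to_base(values, base=10):
    """Rounding: snap each number to the nearest multiple of `base`."""
    return [base * round(v / base) for v in values]


def mask_email(email, mask_char="*"):
    """Partial masking: keep the first character of the local part and the domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return email  # not an email-shaped string; leave it untouched
    return local[0] + mask_char * (len(local) - 1) + "@" + domain


# Made-up sample values, not taken from the article's dataset.
ages = [33, 48, 27]
salaries = [59234.32, 49324.53, 21810.89]
emails = ["alice@example.com", "bob@example.org"]

noisy_ages = add_noise(ages, seed=0)
rounded_salaries = round_to_base(salaries)
masked_emails = [mask_email(e) for e in emails]

print(noisy_ages)
print(rounded_salaries)
print(masked_emails)
```

The trade-off is visible even in this toy version: a wider noise range or a larger rounding base hides more, but also distorts the statistics that made the column useful in the first place.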
To see the changes, call to_df(); for a short summary, call the info() method.
anonym.info()
Now we would like to substitute the names in the first_name column with fake ones. For that, we first have to check whether Faker has a corresponding method.
from anonympy.pandas.utils import fake_methods
print(fake_methods('f'))
... factories, file_extension, file_name, file_path, firefox, first_name,
first_name_female, first_name_male, first_name_nonbinary, fixed_width,
format, free_email, free_email_domain, future_date, future_datetime
Good, Faker has a method called first_name, so let’s replace the values in the column.
anonym.categorical_fake('first_name')
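Checking fake_methods for the other column names, it turns out that Faker also has methods for address and city. The web column can be substituted using the url method, and phone using phone_number. Since address and city match Faker method names, categorical_fake_auto can handle them without arguments, while web and phone need an explicit mapping.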
anonym.categorical_fake_auto()
anonym.categorical_fake({'web': 'url', 'phone': 'phone_number'})
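The idea behind synthetic substitution can be sketched without any third-party package. The snippet below is only an illustration under stated assumptions: the substitute helper and the small PLACEHOLDER_NAMES pool are hypothetical stand-ins for what a generator such as Faker would provide, and this is not how anonympy implements categorical_fake.

```python
import random

# Hypothetical stand-in pool; a generator such as Faker offers far more variety.
PLACEHOLDER_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Morgan"]


def substitute(values, pool, seed=None):
    """Replace every value with a random draw from `pool`.

    No mapping from original to replacement is kept, so the substitution
    cannot be reversed from the output alone.
    """
    rng = random.Random(seed)
    return [rng.choice(pool) for _ in values]


# Made-up input values for illustration.
first_names = ["Bruce", "Tony", "Bruce"]
fake_names = substitute(first_names, PLACEHOLDER_NAMES, seed=42)
print(fake_names)
```

In this sketch, repeated originals are drawn independently, so the two "Bruce" rows may end up with different replacements. If equal inputs must stay equal after anonymization, a consistent mapping such as tokenization is the better fit.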
The last column left to anonymize is birthdate. Since the age column contains the same information, we could drop birthdate using the column_suppression method. However, for the sake of demonstration, let’s add some noise to it.
anonym.datetime_noise('birthdate')
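Conceptually, datetime perturbation shifts each date by a random offset. The following is a minimal sketch using only the standard library; the add_date_noise helper, the 30-day window and the sample dates are assumptions for illustration and do not reflect anonympy's internals or defaults.

```python
import random
from datetime import date, timedelta


def add_date_noise(dates, max_days=30, seed=None):
    """Shift each date by a random whole number of days in [-max_days, max_days]."""
    rng = random.Random(seed)
    return [d + timedelta(days=rng.randint(-max_days, max_days)) for d in dates]


# Made-up birthdates for illustration.
birthdates = [date(1985, 3, 14), date(1992, 11, 2)]
noisy_birthdates = add_date_noise(birthdates, max_days=30, seed=1)
print(noisy_birthdates)
```

Note that when two columns carry the same information, as age and birthdate do here, perturbing them independently means they are no longer guaranteed to agree with each other.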
That’s it. Let’s now compare our datasets before and after anonymization.
Before:
After:
And now, your dataset is safe for public release.
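Before releasing anything, it is worth checking how much of the original survived. One simple sanity check, sketched here with plain lists of dictionaries, is the share of rows per column whose value is identical before and after anonymization. The unchanged_fraction helper and the sample rows are hypothetical, and a low score alone does not prove a dataset is safe.

```python
def unchanged_fraction(before, after):
    """Return, per column, the share of rows whose value is identical in both tables."""
    if len(before) != len(after):
        raise ValueError("both tables must have the same number of rows")
    if not before:
        return {}
    n = len(before)
    return {
        col: sum(b[col] == a[col] for b, a in zip(before, after)) / n
        for col in before[0]
    }


# Made-up rows for illustration.
before = [
    {"first_name": "Bruce", "age": 33},
    {"first_name": "Tony", "age": 48},
]
after = [
    {"first_name": "Alex", "age": 33},
    {"first_name": "Sam", "age": 50},
]

report = unchanged_fraction(before, after)
print(report)
```

A column with a high unchanged share, like age in this toy example, is a hint that the chosen noise range may be too narrow for the data.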
Points of Interest
Data privacy and protection are an important part of data handling and deserve proper attention. Everyone wants their personal and sensitive data to be protected and secure. In this article, I showed how to use anonympy for simple anonymization and pseudonymization with Python. The library is not a magic wand that does everything: you still have to thoroughly understand your data and the techniques being applied, and always keep your end goal in mind.
Here is the GitHub repository for the package - anonympy.
Good luck anonymizing your data!
History
- 9th February, 2022: Initial version