anonympy - Data Anonymization with Python

9 Feb 2022 | Public Domain | 3 min read
An overview of the newly written anonympy package and a walk-through of some of its methods and functionality
Data anonymization plays a huge role in today's data-driven society, where much of the data collected is sensitive. We will use the `anonympy` package to address this. Which method to apply depends on the type of data we are trying to anonymize. Data anonymization has many pitfalls, so there is no single-step approach to achieving it. Every method should be used carefully, and before applying anything we should thoroughly understand the data and keep our end goal in mind.

Introduction

Our world is flooded with digital data: 2.5 quintillion bytes are produced every day. Much of that data is personal and sensitive, something the person it relates to would not want disclosed. Examples of personal and sensitive data include names, identification card numbers, and ethnicity. However, data also contains valuable business insights. So how do we balance privacy with the need to gather and share valuable information? That's where data anonymization comes in.

Background

With the rising need for data anonymization and the extensibility of Python's package ecosystem, I thought it would be nice to create a library that provides numerous data anonymization techniques and is easy to use. Please meet my very first package, anonympy, created in the hope of contributing to the open-source community and helping other users deal with sensitive data. For now, the package provides functions to anonymize tabular (pd.DataFrame) and image data.

Using the Code

As a usage example, let's anonymize the following dataset: sample.csv.
Let's start by installing the package. This takes two steps:

Python
pip install anonympy
pip install cape-privacy==0.3.0 --no-deps

Next, load our sample dataset which we will try to anonymize:

Python
import pandas as pd

# the URL is split across two adjacent string literals to keep the lines short
url = ('https://raw.githubusercontent.com/ArtLabss/open-data-anonimizer/'
       '0287f675a535101f145cb975baf361a96ff71ed3/examples/files/new.csv')
df = pd.read_csv(url, parse_dates=['birthdate'])
df.head()

Image 1

Looking at the columns, we can see that all of them are personal and sensitive. Therefore, we will have to apply a relevant technique to each and every column. First, we need to initialize our dfAnonymizer object.

Python
from anonympy.pandas import dfAnonymizer 

anonym = dfAnonymizer(df)

It’s important to know the data type of a column before applying any functions to it. Let’s check the data types and see which methods are available to us.

Python
# check dtypes 
print(anonym.numeric_columns) 
print(anonym.categorical_columns) 
print(anonym.datetime_columns) 

... ['salary', 'age']
... ['first_name', 'address', 'city', 'phone', 'email', 'web']
... ['birthdate']

# available methods for each data type
from anonympy.pandas.utils import available_methods

print(available_methods())

... `numeric`:        
  * Perturbation - "numeric_noise"         
  * Binning - "numeric_binning"         
  * PCA Masking - "numeric_masking"        
  * Rounding - "numeric_rounding" 
`categorical`:         
  * Synthetic Data - "categorical_fake"         
  * Synthetic Data Auto - "categorical_fake_auto"         
  * Resampling from same Distribution - "categorical_resampling"         
  * Tokenazation - "categorical_tokenization"         
  * Email Masking - "categorical_email_masking" 
`datetime`:         
  * Synthetic Date - "datetime_fake"         
  * Perturbation - "datetime_noise" 
`general`:         
  * Drop Column - "column_suppression" 

In our dataset, we have 6 categorical columns, 2 numeric columns, and 1 datetime column. The list returned by available_methods shows the functions available for each data type.
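To make the idea of grouping columns by data type concrete, here is a minimal plain-Python sketch. It is not how anonympy or pandas detects types; the function name `classify_columns` and the two sample rows are made up for illustration.

```python
from datetime import date, datetime
from numbers import Number


def classify_columns(rows):
    """Group column names into numeric, datetime and categorical buckets."""
    groups = {"numeric": [], "datetime": [], "categorical": []}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (date, datetime)) for v in values):
            groups["datetime"].append(name)
        elif all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
            groups["numeric"].append(name)
        else:
            # anything that is neither a date nor a number is treated as categorical
            groups["categorical"].append(name)
    return groups


# Hypothetical two-row sample; the values are invented for this sketch.
rows = [
    {"first_name": "Ann", "age": 33, "salary": 59234.5, "birthdate": date(1989, 1, 2)},
    {"first_name": "Bob", "age": 41, "salary": 48000.0, "birthdate": date(1981, 6, 30)},
]
groups = classify_columns(rows)
print(groups)
```

The grouping matters because each bucket calls for different techniques: adding noise makes sense for a number or a date, but not for a name.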

Let’s add some random noise to the age column, round the values in the salary column, and partially mask the email column.

Python
anonym.numeric_noise('age')   
anonym.numeric_rounding('salary')  
anonym.categorical_email_masking('email') 

# or with a single call
# anonym.anonymize({'age': 'numeric_noise',
#                   'salary': 'numeric_rounding',
#                   'email': 'categorical_email_masking'})

To see the changes, call to_df(); for a short summary, call the info() method.

Python
anonym.info()

Image 2
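For readers who want to see what noise, rounding, and partial email masking amount to, here is a rough plain-Python sketch of each idea. These are not anonympy's implementations: the function names, the uniform noise distribution, the `base` parameter, and the masking scheme (keep the first character of the local part) are all assumptions chosen for illustration.

```python
import random


def add_noise(values, spread, seed=None):
    """Shift each number by a random offset drawn from [-spread, spread]."""
    rng = random.Random(seed)
    return [v + rng.uniform(-spread, spread) for v in values]


def round_to_base(value, base=1000):
    """Round a number to the nearest multiple of `base`."""
    return base * round(value / base)


def mask_email(email, mask_char="*"):
    """Keep the first character and the domain, mask the rest of the local part."""
    local, sep, domain = email.partition("@")
    if not sep:
        # no "@" found: mask the whole string rather than leak it
        return mask_char * len(email)
    return local[:1] + mask_char * max(len(local) - 1, 0) + sep + domain


ages = [33, 41]
noisy_ages = add_noise(ages, spread=5, seed=0)
rounded_salary = round_to_base(59234, base=1000)
masked = mask_email("jane.doe@example.com")
print(noisy_ages, rounded_salary, masked)
```

Each function trades accuracy for privacy in a different way: noise keeps the overall distribution roughly intact, rounding reduces precision, and masking hides most of an identifier while keeping its shape recognizable.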

Now we would like to replace the names in the first_name column with fake ones. To do that, we first have to check whether Faker has a corresponding method.

Python
from anonympy.pandas.utils import fake_methods  

print(fake_methods('f')) # args: None / 'all' / any letter  

... factories, file_extension, file_name, file_path, firefox, first_name, 
first_name_female, first_name_male, first_name_nonbinary, fixed_width, 
format, free_email, free_email_domain, future_date, future_datetime 

Good, Faker has a method called first_name. Let’s use it to replace the values in the column.

Python
anonym.categorical_fake('first_name') 

# passing a dictionary is also valid -> {column_name: method_name} 
# anonym.categorical_fake({'first_name': 'first_name_female'})
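Conceptually, this kind of substitution replaces every real value with a generated one. The sketch below shows the idea in plain Python with a small hard-coded pool; `NAME_POOL` and `substitute` are hypothetical names, and a generator such as Faker would supply a far larger, locale-aware set of values.

```python
import random

# Hypothetical pool of replacement names, invented for this sketch.
NAME_POOL = ["Alex", "Casey", "Jordan", "Morgan", "Riley", "Taylor"]


def substitute(values, pool, seed=None):
    """Replace every value with one drawn at random from the pool."""
    rng = random.Random(seed)
    return [rng.choice(pool) for _ in values]


original = ["Bruce", "Tony", "Clark"]
fake = substitute(original, NAME_POOL, seed=42)
print(fake)
```

Note that in this sketch the same original name can map to different fake names on different rows, so the substitution does not preserve which rows shared a value.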

Checking fake_methods for the other column names, it turns out that Faker also has methods for address and city. The web column can be replaced using the url method, and phone using phone_number.

Python
anonym.categorical_fake_auto() # this will change `address` and `city`
                               # because the column names match Faker method names
anonym.categorical_fake({'web': 'url', 'phone': 'phone_number'}) # here we need to specify the method,
                               # because the column names differ from the method names
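The automatic matching described above boils down to comparing column names with the names of the available generator methods. The following is a minimal sketch of that idea, not anonympy's actual code: `match_columns_to_methods` is a hypothetical helper, and the method list is only the handful of names mentioned in this article rather than Faker's full catalogue.

```python
def match_columns_to_methods(columns, method_names):
    """Return {column: method} for columns whose name equals a method name."""
    available = set(method_names)
    return {col: col for col in columns if col in available}


# Columns still untouched at this point in the walk-through.
columns = ["address", "city", "phone", "web"]
# A small subset of method names mentioned in this article.
methods = ["first_name", "address", "city", "url", "phone_number"]

matched = match_columns_to_methods(columns, methods)
unmatched = [col for col in columns if col not in matched]
print(matched)
print(unmatched)
```

Columns that end up in `unmatched` are the ones that need an explicit column-to-method mapping, as with web and phone above.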

The last column left to anonymize is birthdate. Since the age column carries the same information, we could drop birthdate using the column_suppression method. However, for the sake of demonstration, let’s add some noise to it instead.

Python
anonym.datetime_noise('birthdate')
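Adding noise to a date usually means shifting it by a random amount of time. Here is a minimal sketch of that idea using only the standard library; the `shift_dates` name, the day-level granularity, and the `max_days` bound are assumptions for illustration and may differ from what anonympy does internally.

```python
import random
from datetime import date, timedelta


def shift_dates(dates, max_days, seed=None):
    """Move each date by a random whole number of days within +/- max_days."""
    rng = random.Random(seed)
    return [d + timedelta(days=rng.randint(-max_days, max_days)) for d in dates]


# Hypothetical birthdates, invented for this sketch.
birthdates = [date(1989, 1, 2), date(1981, 6, 30)]
shifted = shift_dates(birthdates, max_days=30, seed=1)
print(shifted)
```

One thing to keep in mind: when a noisy birthdate and a noisy age are published side by side, the two columns may disagree with each other or, taken together, narrow down the real value.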

That’s it. Let’s now compare our datasets before and after anonymization.

Before:

Image: the dataset before anonymization

After:

Image: the dataset after anonymization
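Besides comparing the two tables by eye, a quick programmatic check can show how many values survived unchanged in each column. The sketch below is a plain-Python illustration with made-up rows; `unchanged_share` is a hypothetical helper and not part of anonympy.

```python
def unchanged_share(before, after):
    """Per column, the fraction of rows whose value survived unchanged."""
    shares = {}
    for column in before[0]:
        # a column dropped from `after` yields None and counts as changed
        same = sum(1 for b, a in zip(before, after) if b[column] == a.get(column))
        shares[column] = same / len(before)
    return shares


# Hypothetical rows, invented for this sketch.
before = [{"first_name": "Bruce", "age": 33}, {"first_name": "Tony", "age": 41}]
after = [{"first_name": "Casey", "age": 33}, {"first_name": "Riley", "age": 44}]

shares = unchanged_share(before, after)
print(shares)
```

A low share only shows that values were altered. It does not by itself prove that individuals cannot be re-identified, for example through a combination of several columns.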

And now, your dataset is safe for public release.

Points of Interest

Data privacy and protection are an important part of data handling and deserve proper attention. Everyone wants their personal and sensitive data to be protected and secure. In this article, I showed you how to use anonympy for simple anonymization and pseudonymization with Python. This library should not be treated as a magic wand that will do everything: you still have to thoroughly understand your data and the techniques being applied, and always keep your end goal in mind.
Here is the GitHub repository for the package - anonympy.

Good luck with anonymizing your data!

History

  • 9th February, 2022: Initial version

License

This article, along with any associated source code and files, is licensed under A Public Domain dedication