Once we have data loaded into a flexible structure, we need to decide how we’re going to deal with missing and duplicate data elements.
Introduction
This article is part of the Data Cleaning with Python and Pandas series. It’s aimed at getting developers up and running quickly with data science tools and techniques.
Now that our data is loaded and ready to go, we need to delve into some of the problems the dataset may contain.
Note that we've created a complete Jupyter Notebook with the source data files for this series of modules, which you can download and run locally.
The first step in most data cleaning processes is to analyze the dataset to find and deal with any missing data. Missing values generally occur when no value has been stored for a field, typically because the value isn’t mandatory in a form or because it’s calculated from other variables. Missing data is a common occurrence and can have a significant effect on how a dataset is visualized and interpreted.
Finding the Missing Data Elements
First, we have to find out how much data is missing and which values are affected before deciding how to handle them. Pandas DataFrames have some great methods for doing this, including:
- The isnull method checks each field and reports whether it is null.
- The sum method adds together the values passed to it; applied to the result of isnull, it counts the missing values in each column.
First, let’s start a new code block and enter the following:
customers.isnull()
This outputs the table with each field replaced by a Boolean showing whether it’s a null value. Let's expand this a little to get a summary:
customers.isnull().sum()
The result is a count of missing values for each column. This gives a much better summary of our data and, apart from the first and last name columns, it looks like there are a number of values to fix. Let's look at some of the options available when dealing with missing data.
Dropping Missing Pandas Columns with drop and dropna
Sometimes missing values are in columns we don't really need to report on anyway, or there are so few of them that we can drop the affected rows entirely. Let’s use the following methods to drop some unneeded values:
- The drop method removes specified rows or columns by label
- The dropna method removes rows or columns that contain null values
If we look at the example columns first, two of the three columns with the greatest number of missing values are email and street_num. These data elements are normally unique, and while it could be interesting to look at email domain names (everything after the @ symbol), most of the time we won't want to report on these columns. Drop them by modifying the code block to:
# Remove the email and street_num columns entirely
customers.drop(columns='email', inplace=True)
customers.drop(columns='street_num', inplace=True)

# Recount the missing values per column
customers.isnull().sum()
If you run this code block now, you’ll see that the two columns have been dropped from the DataFrame. That improves things somewhat, but the number of missing items in the postcode column is still pretty high, so drop postcode as well. We’re also going to assume that visualization down to the street level isn’t important, so we’ll remove those columns too. It's important to note that because we’re using in-place removal, you need to run the notebook from the start if you change these cells, or you’ll get errors.
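The extra drops follow the same pattern as before. The postcode column comes straight from the missing-value summary we printed above; the street-level column name used here (street_name) is only an assumption, so match it to whatever actually appears in your dataset:

# Drop the remaining columns we don't plan to report on.
# 'street_name' is an assumed column name -- adjust it to your dataset;
# errors='ignore' skips the drop if the column isn't present.
customers.drop(columns='postcode', inplace=True)
customers.drop(columns='street_name', inplace=True, errors='ignore')

customers.isnull().sum()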
Dropping Missing Pandas Rows
Next, we’re going to remove some of the rows that we can’t interpret based on their fields, in particular the city and state values. If the customers haven't supplied this information, we’ll try dropping those rows to see how much of our total data this affects.
Start a new code block and add the following:
# Drop any row that has a null value in the city column
customers.dropna(subset=['city'], inplace=True)

# Check the new row count and the remaining missing values
print(customers.shape)
print(customers.isnull().sum())
This drops all the rows with a null value in the city column.
You’ll see, based on the printouts, that we now have no null values in the city field, and we’re down to 921 records from 1000.
Modify the code above and do the same removal process for the state field. This takes the total down to 848 records, so there was a small amount of overlap, and 848 records out of 1000 is still pretty good.
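One way to do this is to add a new code block that repeats the same call with the state column:

# Drop any row that has a null value in the state column
customers.dropna(subset=['state'], inplace=True)

print(customers.shape)
print(customers.isnull().sum())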
Replacing Values
This just leaves the gender field with blank values, but we’re going to use a different method for this field and replace them with the static value "Unspecified".
- The fillna method fills missing values in a column or row with a specified value
Start a new code block and add the following:
customers["gender"].fillna('Unspecified', inplace=True)
print(customers.shape)
print(customers.isnull().sum())
This scans the gender column of the customer dataset and changes any missing values to "Unspecified".
If you look at the output now, all of our columns have values. Even though we haven't looked too closely at the products, we’ll also add the value "Unspecified" to missing values in the company column in the products DataFrame.
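A minimal sketch of that step, assuming the products data was loaded into a DataFrame named products with a company column as in the earlier modules:

# Replace missing company values in the products DataFrame
products["company"].fillna('Unspecified', inplace=True)

print(products.isnull().sum())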
Summary
We briefly went through some simple options for dealing with missing values in our dataset. Next, we’ll take a look at combining DataFrames into larger, more complete sets of normalized data so we can finish our data cleaning and start on visualizations.
Puzzle image source: https://medium.com/@ODSC/missing-data-in-supervised-machine-learning-b6df0f02a731