Once we have data loaded into a flexible structure, we need to decide how we’re going to deal with missing and duplicate data elements.
Introduction
This article is part of the Data Cleaning with Python and Pandas series. It’s aimed at getting developers up and running quickly with data science tools and techniques.
Now that our data is loaded and ready to go, we need to delve into some of the problems the dataset may contain.
Note that we've created a complete Jupyter Notebook with the source data files for this series of modules, which you can download and run locally.
The first step in most data cleaning processes is to analyze the dataset to find and deal with any missing data. Missing values generally occur when no value has been stored for a field, typically because the value isn’t mandatory in a form or because it’s calculated from other variables. Missing data is a common occurrence and can have a significant effect on how a dataset is visualized and interpreted.
Finding the Missing Data Elements
First, we have to find out how much data is missing and which values are affected before deciding how to handle them. Pandas DataFrames have some great methods for doing this, including:
- The isnull method checks each field and reports whether it is null.
- The sum method adds together the values passed to it; applied to the result of isnull, it counts the missing values in each column.
First, let’s start a new code block and enter the following:
customers.isnull()
This outputs the table with each field replaced by a Boolean showing whether it’s a null value. Let's expand this a little to get a summary:
customers.isnull().sum()
The result is a count of missing values for each column. This gives a much better summary of our data and, apart from the first and last name columns, it looks like there are a number of values to fix. Let's look at some of the options available when dealing with missing data.
Dropping Missing Pandas Columns with drop and dropna
Sometimes missing values are in columns we don't really need to report on anyway, or there are so few of them that we can drop the affected rows entirely. Let’s use the following methods to drop some unneeded values:
- The drop method removes specified rows or columns by label
- The dropna method removes rows or columns that contain null values
If we look at the example columns first, two of the three columns with the greatest number of missing values are email and street_num. These data elements are normally unique, and while it could be interesting to look at email domain names (everything after the @ symbol), most of the time we won't want to report on these columns. Drop them by modifying the code block to:
# Remove the email and street_num columns entirely
customers.drop(columns='email', inplace=True)
customers.drop(columns='street_num', inplace=True)

# Recount the missing values per column
customers.isnull().sum()
If you run this code block now, you’ll see that the two columns have been dropped from the DataFrame. That improves things somewhat, but the number of missing items in the postcode column is still pretty high, so drop postcode as well. We’re also going to assume that visualization down to the street level isn’t important, so we’ll remove those columns too. It's important to note that because we’re using in-place removal, you need to run the notebook from the start if you change these cells, or you’ll get errors.
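The extra drops follow the same pattern as before. The postcode column comes straight from the missing-value summary we printed above; the street-level column name used here (street_name) is only an assumption, so match it to whatever actually appears in your dataset:

# Drop the remaining columns we don't plan to report on.
# 'street_name' is an assumed column name -- adjust it to your dataset;
# errors='ignore' skips the drop if the column isn't present.
customers.drop(columns='postcode', inplace=True)
customers.drop(columns='street_name', inplace=True, errors='ignore')

customers.isnull().sum()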
Dropping Missing Pandas Rows
Next, we’re going to remove some of the rows that we can’t interpret based on their fields, in particular the city and state values. If the customers haven't supplied this information, we’ll try dropping those rows to see how much of our total data this affects.
Start a new code block and add the following:
# Drop any row that has a null value in the city column
customers.dropna(subset=['city'], inplace=True)

# Check the new row count and the remaining missing values
print(customers.shape)
print(customers.isnull().sum())
This drops all the rows with a null value in the city column.
You’ll see, based on the printouts, that we now have no null values in the city field, and we’re down to 921 records from 1000.
Modify the code above and do the same removal process for the state field. This takes the total down to 848 records, so there was a small amount of overlap, and 848 records out of 1000 is still pretty good.
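One way to do this is to add a new code block that repeats the same call with the state column:

# Drop any row that has a null value in the state column
customers.dropna(subset=['state'], inplace=True)

print(customers.shape)
print(customers.isnull().sum())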
Replacing Values
This just leaves the gender field with blank values, but we’re going to use a different method for this field and replace them with the static value "Unspecified".
- The fillna method fills missing values in a column or row with a specified value
Start a new code block and add the following:
customers["gender"].fillna('Unspecified', inplace=True)
print(customers.shape)
print(customers.isnull().sum())
This scans the gender column of the customer dataset and changes any missing values to "Unspecified".
If you look at the output now, all of our columns have values. Even though we haven't looked too closely at the products, we’ll also add the value "Unspecified" to missing values in the company column in the products DataFrame.
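A minimal sketch of that step, assuming the products data was loaded into a DataFrame named products with a company column as in the earlier modules:

# Replace missing company values in the products DataFrame
products["company"].fillna('Unspecified', inplace=True)

print(products.isnull().sum())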
Summary
We briefly went through some simple options for dealing with missing values in our dataset. Next, we’ll take a look at combining DataFrames into larger, more complete sets of normalized data so we can finish our data cleaning and start on visualizations.
Puzzle image source: https://medium.com/@ODSC/missing-data-in-supervised-machine-learning-b6df0f02a731