Most analysis on data is not performed on a single system or dataset, so in this step we look at combining multiple data sets to provide a bigger picture of our data.
Introduction
This article is part of the Data Cleaning with Python and Pandas series. It’s aimed at getting developers up and running quickly with data science tools and techniques.
If you’d like to check out the other articles in the series, you can find them here:
With all the missing values dealt with, let’s combine data from the product, customer, and purchase datasets to get a more complete set of data in a single DataFrame. This will provide a better view of where we’re going with this data set and what overall insights we can leverage.
Note that we've created a complete Jupyter Notebook with the source data files for this series of modules, which you can download and install locally.
Pandas offers a number of methods for combining data with different sets of logic. Before diving into some of the more complex combination sets we might use, let’s take a look at a few of the simpler methods.
Concatenating DataFrames
The core function for combining data is concat()
. This function provides simple joining of two DataFrames that can be expanded with the union option or intersection logic similar to traditional relational databases.
- The
concat
method joins DataFrames together when columns match.
By itself, concat()
will join two or more DataFrames with the same keys or "column headings," and push the rows together one after the other. For example, two DataFrames with the columns X, Y, Z and 10 rows each will join together into a single DataFrame with the columns X, Y, Z and 20 rows of data.
The concat
function has a number of different options for combining data, including, but not limited to:
- Outer join joins the data from two or more DataFrames and includes rows that don't have matching keys (and the result may contain no values).
- Inner join joins the data from two or more DataFrames only where the frames match keys (and the result may drop rows that don't match).
Merging DataFrames
Pandas also includes options to merge datasets using the rows of one set of data as inputs against keys from another set of data. This form of joining and merging is pretty powerful and it’s what we’re going to do with our datasets.
- The
merge
method joins DataFrames together using left and right joins
At the moment, our dataset includes three separate DataFrames: customers, products, and purchases. For an application, this makes a lot of sense as your products and customers don't change too much, but your purchases may change every day.
For data analysis and visualization, having all the data in one big DataFrame makes more sense. So we’re going to merge our customer and product datasets into our purchases data.
Create a new code block and add the following:
combinedData = pd.merge(purchases, customers, left_on='customer_num', right_on='id', sort=False)
combinedData = pd.merge(combinedData, products, left_on='product_num', right_on='id', sort=False)
We’re using the Pandas merge function to merge the three DataFrames. The first merge takes the purchases DataFrame and merges it with the customers DataFrame.
We’re also using two optional parameters here, left_on
and right_on
. These parameters merge the table based on the knowledge that the left_on
key matches the right_on
key even if the key names are different.
Finally, we specify the sort
option to be false as we don't need to sort data yet and it makes this method a little quicker to execute.
Add the following lines to examine our new combined DataFrame:
print(combinedData.head(5))
print(combinedData.shape)
print(purchases.shape)
And here’s the result:
As you can see, we now have one big DataFrame with a number of columns combined from all three DataFrames.
You’ll also see that when we compare row counts between the purchases
DataFrame and the resultant DataFrame, we’re down to 5069 rows out of 6000. Because merge
uses an inner join by default, the rows that couldn't be matched to a customer (as they were removed through the first stage of data cleaning) were dropped from the combined DataFrame.
Summary
We looked at the Pandas function to concatenate data sets, then moved on to merge our three DataFrames into a single, complete DataFrame that we can manipulate further to present data for visualization.
We are now going to look at cleaning up the last of the values and keys that may cause some issues before reshaping our data for visualization.