With our dataset cleaned up, we’ll take a quick look at the ways you can reshape a Pandas DataFrame to get more out of the data, then walk through examples of pivoting and grouping.
Introduction
This article is part of the Data Cleaning with Python and Pandas series. It’s aimed at getting developers up and running quickly with data science tools and techniques.
If you’d like to check out the other articles in the series, you can find them here:
Sometimes, even after you’ve cleaned up your dataset, you still need to reshape your Pandas DataFrame to get the most out of the data. Reshaping is the term used when the table structure is manipulated to form different datasets, such as making "wide" data tables "long."
This will feel familiar if you’ve worked with Pivot Tables in Excel or the built-in pivot and crosstab support included in many relational databases.
As an example, the tables above (from the Pandas documentation) have been reshaped by pivoting, stacking or unstacking the table.
- The pivot method takes a "long" data set and spreads one of its columns out into new columns, producing a "wide" table
- The stack method takes a table with multiple columns and folds (groups) them into the row index, making the table longer
- The unstack method takes a table with multiple index levels and ungroups one of them back out into columns

All three are illustrated in the short sketch below.
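Here’s a minimal sketch of those three operations on a tiny made-up table; the dates, cities, and temperatures are invented purely for illustration and aren’t part of our dataset:

import pandas as pd

# A tiny made-up table: one row per date/city temperature reading ("long" form)
df = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
    'temp': [60, 65, 58, 63]
})

# pivot: one row per date, one column per city ("long" to "wide")
wide = df.pivot(index='date', columns='city', values='temp')

# stack: fold the city columns back into the row index ("wide" to "long")
longAgain = wide.stack()

# unstack: move that index level back out into columns, undoing the stack
print(longAgain.unstack())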
At this stage, we’re going to look at a number of methods to reshape the data with Pandas. We’ll see how we can use pivoting and stacking of DataFrames to get a different picture of our data.
Note that we've created a complete Jupyter Notebook with the source data files for this series of modules, which you can download and run locally.
Pivoting a Pandas DataFrame
With Pandas, we can use the pivot function to create a new DataFrame from an existing one. At the moment our table is indexed by purchase ID, but let’s pivot our previously-created combinedData table into something a little more interesting.
First, let's try the pivot method by starting a new code block and adding:
productsByState = combinedData.pivot(index='product_id', columns='company', values='paid')
The result looks like this:
Running this command generates a duplicate index error, as pivot works only on DataFrames whose index/column combinations are unique.
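You can see why it fails by checking for duplicate pairs yourself. A quick sketch, using the same product_id and company columns as the pivot call above:

# Rows that share the same product_id/company pair are what trips up pivot
dupes = combinedData.duplicated(subset=['product_id', 'company'], keep=False)
print('duplicate product_id/company pairs:', dupes.sum())
print(combinedData[dupes].head())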
But there’s another method that gets around this: pivot_table works much like pivot, except it aggregates the duplicate values instead of generating an error.
- The pivot_table method takes a large data set and summarizes it by aggregating duplicates
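To see the difference on a toy example (again, made-up data rather than our dataset), pivot_table happily averages the duplicate rows that pivot would reject:

import pandas as pd

# Two Austin readings would make pivot raise a duplicate error;
# pivot_table aggregates them instead (its default aggregation is the mean)
df = pd.DataFrame({
    'city': ['Austin', 'Austin', 'Dallas'],
    'measure': ['temp', 'temp', 'temp'],
    'value': [60, 70, 65]
})
print(df.pivot_table(index='city', columns='measure', values='value'))  # Austin -> 65.0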
Let's use this method with the defaults:
productsByState = combinedData.pivot_table(index=['product_id', 'product'], columns='state', values='paid')
You can see the result here:
This produces a DataFrame listing the products, with the mean amount paid in each state across the columns (the mean is pivot_table's default aggregation). That's not really that useful, so let's change the aggregation method:
import numpy as np  # needed for np.sum below, if NumPy isn't already imported in your notebook

reshapedData = combinedData.pivot_table(index=['product_id', 'product'], columns='state', values='paid', aggfunc=np.sum)
reshapedData = reshapedData.fillna(0)
print(reshapedData.head(10))
This now produces a table of products with the sum of all the sales of those products by state. The fillna call also removes the NaN values and replaces them with 0, the assumption being that there are no sales in that state for those products.
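As a side note, pivot_table can do both steps in one call: it accepts a fill_value argument that replaces the missing cells for you, so the separate fillna step isn't strictly needed. A sketch of that variant:

# Same result in a single call: aggregate with sum and fill empty cells with 0
reshapedData = combinedData.pivot_table(index=['product_id', 'product'], columns='state',
                                        values='paid', aggfunc='sum', fill_value=0)
print(reshapedData.head(10))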
Grouping Data in a Pandas DataFrame
The other reshaping activity we’ll look at is grouping the data elements together. Let's go back to our original big DataFrame and create a new DataFrame that groups a single customer's transactions together.
- The groupby method takes a large data set and groups it by a column's values
Start a new code block and add:
volumesData = combinedData.groupby(by='customer_id')
print(volumesData.head(10))
Here's the result:
It doesn’t really look like that did anything: groupby by itself only defines the groups, and because our DataFrame is indexed on the purchase_id, head still shows the individual purchase rows.
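Under the hood, groupby doesn’t produce a new table at all; it returns a GroupBy object that only describes the groups. You can poke at it like this:

# groupby returns a DataFrameGroupBy object, not a DataFrame
groups = combinedData.groupby(by='customer_id')
print(type(groups))    # a DataFrameGroupBy, not yet a summarized table
print(groups.ngroups)  # the number of distinct customers it found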
Let's add an aggregate function to summarize the data so our grouping works as intended:
volumesData = combinedData.groupby(by='customer_id').sum()
print(volumesData.head(10))
Again, here's the result:
This groups our dataset the way we were expecting, but the sum has dropped the non-numeric columns (such as the customer and product names), and summing purchase_id doesn't really make sense. Let's expand our groupby method to keep those columns and trim off the purchase_id column:
volumesData = combinedData.groupby(by=['customer_id','first_name','last_name','product_id','product']).sum()
volumesData.drop(columns='purchase_id', inplace=True)
print(volumesData.head(10))
Here's our new result:
The final result looks pretty good and gives us a good idea of what the customers are buying, the amounts, and how much they’re paying.
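If all you want is total spend per customer, a slightly leaner variant is to select the paid column before aggregating, which sidesteps the column trimming entirely. A sketch, assuming the same column names as above:

# Group, pick just the 'paid' column, and sum it per customer
spendPerCustomer = combinedData.groupby(by=['customer_id', 'first_name', 'last_name'])['paid'].sum()
print(spendPerCustomer.head(10))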
Finally, we’ll make one more groupby change to our dataset. Add the following to create a totals-by-state DataFrame:
totalsData = combinedData.groupby(by='state').sum().reset_index()
totalsData.drop(columns=['purchase_id','customer_id','product_id'], inplace=True)
The key change here is that we added a reset_index method after the sum method. This is to ensure the resulting DataFrame has a usable index for our visualization work.
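To see what reset_index buys us, compare the two forms, following the same call pattern as above. Without it, state ends up in the index rather than as an ordinary column, which is awkward to pass to plotting functions later:

# Without reset_index, 'state' becomes the index of the result
withIndex = combinedData.groupby(by='state').sum()
print(withIndex.index.name)   # 'state' -- it ended up in the index

# With reset_index, 'state' is an ordinary column again, ready for plotting
print(totalsData.columns)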
Summary
We took our complete, clean dataset and reshaped it a few different ways to give us some added insights into our data.
Next we’ll look at visualizations and see how they’re an important tool for presenting our data and ensuring the results are clean.
Header image source: http://ohi-science.org/data-science-training/tidyr