Simple Graphing with IPython and Pandas
Plotting Some Data
We have our data read in and have completed some basic analysis. Let’s start plotting it.
First remove some columns to make additional analysis easier.
customers = sales[['name','ext price','date']]
customers.head()
 | name | ext price | date |
---|---|---|---|
0 | Carroll PLC | 578.24 | 2014-09-27 07:13:03 |
1 | Heidenreich-Bosco | 1018.78 | 2014-07-29 02:10:44 |
2 | Kerluke, Reilly and Bechtelar | 289.92 | 2014-03-01 10:51:24 |
3 | Waters-Walker | 413.40 | 2013-11-17 20:41:11 |
4 | Waelchi-Fahey | 1793.52 | 2014-01-03 08:14:27 |
This representation has multiple lines for each customer. In order to understand purchasing patterns, let’s group all the customers by name. We can also look at the number of entries per customer to get an idea of the distribution.
customer_group = customers.groupby('name')
customer_group.size()
name
Berge LLC                        52
Carroll PLC                      57
Cole-Eichmann                    51
Davis, Kshlerin and Reilly       41
Ernser, Cruickshank and Lind     47
Gorczany-Hahn                    42
Hamill-Hackett                   44
Hegmann and Sons                 58
Heidenreich-Bosco                40
Huel-Haag                        43
Kerluke, Reilly and Bechtelar    52
Kihn, McClure and Denesik        58
Kilback-Gerlach                  45
Koelpin PLC                      53
Kunze Inc                        54
Kuphal, Zieme and Kub            52
Senger, Upton and Breitenberg    59
Volkman, Goyette and Lemke       48
Waelchi-Fahey                    54
Waters-Walker                    50
dtype: int64
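To make the grouping step easy to try without the full sales file, here is a minimal sketch on a made-up frame. The customer names and amounts are hypothetical; only the groupby and size calls mirror the step above.

```python
import pandas as pd

# Hypothetical miniature of the sales data; names and amounts are made up.
sales = pd.DataFrame({
    "name": ["Acme", "Acme", "Bolt", "Acme", "Bolt"],
    "ext price": [100.0, 250.0, 75.0, 30.0, 125.0],
})

# size() counts the rows in each group, giving one entry per customer.
counts = sales.groupby("name").size()
print(counts)
```

The result is a Series indexed by customer name, which is why the output above ends with a dtype line rather than a table.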
Now that our data is in a simple format to manipulate, let’s determine how much each customer purchased during our time frame.
The sum function allows us to quickly sum up all the values by customer.
We can also sort the data using the sort_values method (older pandas releases called it sort).
sales_totals = customer_group.sum(numeric_only=True)
sales_totals.sort_values(by='ext price').head()
name | ext price |
---|---|
Davis, Kshlerin and Reilly | 19054.76 |
Huel-Haag | 21087.88 |
Gorczany-Hahn | 22207.90 |
Hamill-Hackett | 23433.78 |
Heidenreich-Bosco | 25428.29 |
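For readers who want to check the sum-then-sort step in isolation, the following sketch runs it on a small made-up frame. The data is hypothetical, and it uses sort_values, which is the DataFrame sorting method in recent pandas releases.

```python
import pandas as pd

# Hypothetical sample data standing in for the real sales frame.
sales = pd.DataFrame({
    "name": ["Acme", "Acme", "Bolt", "Acme", "Bolt", "Cask"],
    "ext price": [100.0, 250.0, 75.0, 30.0, 125.0, 500.0],
})

# Select the numeric column explicitly so only 'ext price' is summed.
totals = sales.groupby("name")[["ext price"]].sum()

# Ascending sort puts the smallest customers first, as in the table above.
smallest_first = totals.sort_values(by="ext price")
print(smallest_first.head())
```

Selecting the column before summing is one way to keep non-numeric columns such as dates out of the totals.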
Now that we know what the data look like, it is very simple to create a quick bar chart plot. Using the IPython notebook, the graph will automatically display.
my_plot = sales_totals.plot(kind='bar')
Unfortunately this chart is a little ugly. With a few tweaks we can make it a little more impactful. Let’s try:
- sorting the data in descending order
- removing the legend
- adding a title
- labeling the axes
my_plot = sales_totals.sort_values(by='ext price',ascending=False).plot(kind='bar',legend=None,title="Total Sales by Customer")
my_plot.set_xlabel("Customers")
my_plot.set_ylabel("Sales ($)")
<matplotlib.text.Text at 0x7ff9bf23c510>
This actually tells us a little about our biggest customers and how much difference there is between their sales and those of our smallest customers.
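It may help to see what the plot call hands back. The sketch below uses made-up totals and assumes a non-interactive matplotlib backend so it can run outside a notebook. The plot method returns a matplotlib Axes, and set_ylabel returns a Text object, which is what the notebook echoes after the last line of the cell.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so draw off-screen
import pandas as pd

# Hypothetical totals; the real values come from the grouped sales data.
totals = pd.DataFrame(
    {"ext price": [200.0, 380.0, 500.0]},
    index=pd.Index(["Bolt", "Acme", "Cask"], name="name"),
)

# plot() returns the Axes, so labels can be set on the returned object.
ax = totals.sort_values(by="ext price", ascending=False).plot(
    kind="bar", legend=False, title="Total Sales by Customer"
)
ax.set_xlabel("Customers")
label = ax.set_ylabel("Sales ($)")  # a matplotlib Text instance

# Bar heights follow the sorted order, largest first.
heights = [patch.get_height() for patch in ax.patches]
```

Ending a notebook cell with a semicolon, or assigning the return value as done here, is a common way to suppress that echoed Text line.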
Now, let’s try to see how the sales break down by category.
customers = sales[['name','category','ext price','date']]
customers.head()
 | name | category | ext price | date |
---|---|---|---|---|
0 | Carroll PLC | Belt | 578.24 | 2014-09-27 07:13:03 |
1 | Heidenreich-Bosco | Shoes | 1018.78 | 2014-07-29 02:10:44 |
2 | Kerluke, Reilly and Bechtelar | Shirt | 289.92 | 2014-03-01 10:51:24 |
3 | Waters-Walker | Shirt | 413.40 | 2013-11-17 20:41:11 |
4 | Waelchi-Fahey | Shirt | 1793.52 | 2014-01-03 08:14:27 |
We can use groupby to organize the data by category and name.
category_group = customers.groupby(['name','category']).sum(numeric_only=True)
category_group.head()
name | category | ext price |
---|---|---|
Berge LLC | Belt | 6033.53 |
 | Shirt | 9670.24 |
 | Shoes | 14361.10 |
Carroll PLC | Belt | 9359.26 |
 | Shirt | 13717.61 |
The category representation looks good but we need to break it apart to graph it as a stacked bar graph. unstack can do this for us.
category_group.unstack().head()
name | ext price: Belt | ext price: Shirt | ext price: Shoes |
---|---|---|---|
Berge LLC | 6033.53 | 9670.24 | 14361.10 |
Carroll PLC | 9359.26 | 13717.61 | 12857.44 |
Cole-Eichmann | 8112.70 | 14528.01 | 7794.71 |
Davis, Kshlerin and Reilly | 1604.13 | 7533.03 | 9917.60 |
Ernser, Cruickshank and Lind | 5894.38 | 16944.19 | 5250.45 |
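To see what unstack does to the index, here is a small sketch on hypothetical data. It moves the inner index level (category) into the columns, and any customer and category pair with no rows becomes NaN.

```python
import pandas as pd

# Hypothetical rows; "Acme" has no Shoes purchases and "Bolt" has no Belt purchases.
sales = pd.DataFrame({
    "name": ["Acme", "Acme", "Acme", "Bolt", "Bolt"],
    "category": ["Belt", "Shirt", "Belt", "Shirt", "Shoes"],
    "ext price": [10.0, 20.0, 5.0, 7.0, 3.0],
})

# Long form: one row per (name, category) pair.
grouped = sales.groupby(["name", "category"])[["ext price"]].sum()

# Wide form: one row per name, one column per category under 'ext price'.
wide = grouped.unstack()
print(wide)
```

If the gaps should count as zero sales, unstack accepts a fill_value argument, for example grouped.unstack(fill_value=0).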
Now plot it.
my_plot = category_group.unstack().plot(kind='bar',stacked=True,title="Total Sales by Customer")
my_plot.set_xlabel("Customers")
my_plot.set_ylabel("Sales")
<matplotlib.text.Text at 0x7ff9bf03fc10>
In order to clean this up a little bit, we can specify the figure size and customize the legend.
my_plot = category_group.unstack().plot(kind='bar',stacked=True,title="Total Sales by Customer",figsize=(9, 7))
my_plot.set_xlabel("Customers")
my_plot.set_ylabel("Sales")
my_plot.legend(["Total","Belts","Shirts","Shoes"], loc=9,ncol=4)
<matplotlib.legend.Legend at 0x7ff9bed5f710>
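A hand-written label list has to match the plotted series in both number and order, and how the labels are paired with the bars can depend on the pandas and matplotlib versions in use. One way to sidestep that, sketched here on made-up data, is to select the ext price column before unstacking so the columns are plain category names that the legend can pick up on its own. This is an alternative to the call above, not the original approach.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: running without a display
import pandas as pd

# Hypothetical sales rows covering every name and category pair.
sales = pd.DataFrame({
    "name": ["Acme", "Acme", "Acme", "Bolt", "Bolt", "Bolt"],
    "category": ["Belt", "Shirt", "Shoes", "Belt", "Shirt", "Shoes"],
    "ext price": [15.0, 20.0, 5.0, 8.0, 7.0, 3.0],
})

# Selecting one column yields a Series, so unstack gives single-level columns.
wide = sales.groupby(["name", "category"])["ext price"].sum().unstack()

ax = wide.plot(kind="bar", stacked=True, figsize=(9, 7),
               title="Total Sales by Customer")
ax.set_xlabel("Customers")
ax.set_ylabel("Sales")

# No label list: the legend text comes from the column names.
legend = ax.legend(loc="upper center", ncol=3)  # loc=9 means "upper center"
legend_labels = [text.get_text() for text in legend.get_texts()]
```

The trade-off is that the labels are the raw category values (Belt rather than Belts), so renaming the columns first is needed if different wording is wanted.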
Now that we know who the biggest customers are and how they purchase products, we might want to look at purchase patterns in more detail.
Let’s take another look at the data and try to see how large the individual purchases are. A histogram allows us to group purchases together so we can see how big the customer transactions are.
purchase_patterns = sales[['ext price','date']]
purchase_patterns.head()
 | ext price | date |
---|---|---|
0 | 578.24 | 2014-09-27 07:13:03 |
1 | 1018.78 | 2014-07-29 02:10:44 |
2 | 289.92 | 2014-03-01 10:51:24 |
3 | 413.40 | 2013-11-17 20:41:11 |
4 | 1793.52 | 2014-01-03 08:14:27 |
We can create a histogram with 20 bins to show the distribution of purchasing patterns.
purchase_plot = purchase_patterns['ext price'].hist(bins=20)
purchase_plot.set_title("Purchase Patterns")
purchase_plot.set_xlabel("Order Amount($)")
purchase_plot.set_ylabel("Number of orders")
<matplotlib.text.Text at 0x7ff9becdc210>
Looking at the histogram, we can see that most of our transactions are less than $500 and only a very few are about $1500.
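To make the binning concrete, this sketch draws a histogram of a few made-up order amounts with only three bins and reads the bar heights back from the Axes. The amounts are hypothetical, and the explicit subplots call is just a convenience for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: running without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical order amounts, mostly small with a couple of large ones.
amounts = pd.Series([50.0, 120.0, 130.0, 480.0, 900.0, 1500.0])

fig, ax = plt.subplots()
# bins=3 splits the range from the minimum to the maximum into three equal-width bins.
amounts.hist(bins=3, ax=ax)
ax.set_xlabel("Order Amount ($)")
ax.set_ylabel("Number of orders")

# Each bar's height is the number of orders that fell into that bin.
counts = [int(patch.get_height()) for patch in ax.patches]
print(counts)
```

With more bins the same data spreads over narrower ranges, which is why the choice of 20 above is a judgment call rather than a fixed rule.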
Another interesting way to look at the data would be by sales over time. A chart might help us understand, “Do we have certain months where we are busier than others?”
Let’s get the data down to order size and date.
purchase_patterns = sales[['ext price','date']]
purchase_patterns.head()
 | ext price | date |
---|---|---|
0 | 578.24 | 2014-09-27 07:13:03 |
1 | 1018.78 | 2014-07-29 02:10:44 |
2 | 289.92 | 2014-03-01 10:51:24 |
3 | 413.40 | 2013-11-17 20:41:11 |
4 | 1793.52 | 2014-01-03 08:14:27 |
If we want to analyze the data by date, we need to set the date column as the index using set_index.
purchase_patterns = purchase_patterns.set_index('date')
purchase_patterns.head()
date | ext price |
---|---|
2014-09-27 07:13:03 | 578.24 |
2014-07-29 02:10:44 | 1018.78 |
2014-03-01 10:51:24 | 289.92 |
2013-11-17 20:41:11 | 413.40 |
2014-01-03 08:14:27 | 1793.52 |
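Time-based operations such as resampling expect the index to hold real datetimes rather than strings. Whether the date column was already parsed when the file was read is not shown in this section, so the conversion below is an assumption; the sketch reuses the first three rows from the table above.

```python
import pandas as pd

# First three rows from the table above, with dates deliberately left as strings.
purchases = pd.DataFrame({
    "ext price": [578.24, 1018.78, 289.92],
    "date": ["2014-09-27 07:13:03", "2014-07-29 02:10:44", "2014-03-01 10:51:24"],
})

# Assumption: the column may be text, so convert it before indexing.
purchases["date"] = pd.to_datetime(purchases["date"])

# set_index returns a new frame; the date column becomes a DatetimeIndex.
indexed = purchases.set_index("date")
print(type(indexed.index).__name__)
```

If the column is already a datetime type, to_datetime leaves the values unchanged, so the extra step is harmless.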
One of the really cool things that pandas allows us to do is resample the data. If we want to look at the data by month, we can easily resample and sum it all up. You’ll notice I’m using ‘M’ as the period for resampling which means the data should be resampled on a month boundary.
purchase_patterns.resample('M').sum()
Plotting the data is now very easy.
purchase_plot = purchase_patterns.resample('M').sum().plot(title="Total Sales by Month",legend=None)
Looking at the chart, we can easily see that December is our peak month and April is the slowest.
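Reading the peak and slowest months off a chart works, but they can also be pulled out of the monthly totals directly. The sketch below uses made-up purchases and groups by calendar month with to_period, which is a swapped-in alternative to resample that produces the same monthly sums for months that contain data.

```python
import pandas as pd

# Hypothetical purchases spread across three months.
purchases = pd.DataFrame(
    {"ext price": [100.0, 50.0, 300.0, 25.0]},
    index=pd.to_datetime(["2014-01-05", "2014-01-20", "2014-02-10", "2014-03-01"]),
)

# Map each timestamp to its calendar month, then total each month.
monthly = purchases.groupby(purchases.index.to_period("M")).sum()

# idxmax and idxmin return the index label (the month) of the largest and smallest total.
peak = monthly["ext price"].idxmax()
slowest = monthly["ext price"].idxmin()
print(peak, slowest)
```

Unlike resample, this grouping skips months with no purchases entirely instead of reporting them as empty periods.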
Let’s say we really like this plot and want to save it somewhere for a presentation.
fig = purchase_plot.get_figure()
fig.savefig("total-sales.png")
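For a presentation it is often worth controlling the resolution and trimming the surrounding whitespace. The sketch below saves a made-up line plot to a temporary directory; the data, the dpi value, and the output location are all assumptions chosen for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: running without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly totals standing in for the resampled sales data.
ax = pd.Series([1.0, 3.0, 2.0]).plot(title="Total Sales by Month")
fig = ax.get_figure()

# Write into a temporary directory so the sketch does not clutter the working folder.
out_path = os.path.join(tempfile.mkdtemp(), "total-sales.png")

# dpi sets the resolution; bbox_inches="tight" trims extra margin around the plot.
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)  # release the figure once it has been written
```

The file format is inferred from the extension, so changing the name to end in .pdf or .svg produces a vector version of the same figure.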