Quick and Dirty Data Analysis with Pandas

Before you can select and prepare your data for modeling, you need to understand what you’ve got to start with.

If you’re using the Python stack for machine learning, a library you can use to better understand your data is Pandas.

In this post you will discover some quick and dirty recipes for Pandas to improve the understanding of your data in terms of its structure, distribution and relationships.

  • Update March/2018: Added alternate link to download the dataset as the original appears to have been taken down.

Data Analysis

Data analysis is about asking and answering questions about your data.

As a machine learning practitioner, you may not be very familiar with the domain in which you’re working. It’s ideal to have subject matter experts on hand, but this is not always possible.

These problems also apply when you are learning applied machine learning, whether you are working with standard machine learning datasets, doing consulting work, or working on competition datasets.

You need to spark questions about your data that you can pursue. You need to better understand the data that you have. You can do that by summarizing and visualizing your data.

Pandas

The Pandas Python library is built for fast data analysis and manipulation. It is amazing in its simplicity, and it will feel familiar if you have done this kind of work on other platforms such as R.

The strength of Pandas seems to be in the data manipulation side, but it comes with very handy and easy-to-use tools for data analysis, providing wrappers around standard statistical methods in statsmodels and graphing methods in matplotlib.

Onset of Diabetes

We need a small dataset that we can use to explore the different data analysis recipes with Pandas.

The UCI Machine Learning Repository provides a vast array of standard machine learning datasets that you can use to study and practice applied machine learning. A favorite of mine is the Pima Indians diabetes dataset.

The dataset describes the onset or lack of onset of diabetes in female Pima Indians using details from their medical records. (update: download from here). Download the dataset and save it into your current working directory with the name pima-indians-diabetes.data.


Summarize Data

We will start out by understanding the data that we have by looking at its structure.

Load Data

Start by loading the CSV data from the file into memory as a data frame. We know the names of the attributes in the data, so we will set those names when loading the data from the file.

Python
import pandas as pd
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima-indians-diabetes.data', names=names)

Describe Data

We can now look at the shape of the data.

We can take a look at a sample of the rows by printing the data frame directly (Pandas truncates the output for large data frames).

Python
print(data)

We can see that all of the data is numeric and that the class value on the end is the dependent variable that we want to make predictions about.

At the end of the data dump we can see the description of the data frame itself: 768 rows and 9 columns. So now we have an idea of the shape of our data.
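If you only want the dimensions and a short preview rather than a full dump, a minimal sketch (not from the original post) is:

Python
print(data.shape)     # (768, 9): 768 rows and 9 columns
print(data.head(20))  # preview the first 20 rows only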

Next we can get a feeling for the distribution of each attribute by reviewing summary statistics.

Python
print(data.describe())

This displays a table of detailed distribution information for each of the 9 attributes in our data frame. Specifically: the count, mean, standard deviation, min, max, and the 25th, 50th (median) and 75th percentiles.

We can review these statistics and start noting interesting facts about our problem, such as: the average number of pregnancies is 3.8, the minimum age is 21, and some people have a body mass index of 0, which is impossible and a sign that some of the attribute values should be marked as missing.

Learn more about the describe function on DataFrame.
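Building on the observation above about impossible zero values, a minimal follow-up sketch (not one of the original recipes; the column list and the data_clean name are assumptions) counts the zeros and marks them as missing:

Python
import numpy as np

# columns where a value of 0 is physically implausible for this dataset (assumed)
zero_invalid = ['plas', 'pres', 'skin', 'test', 'mass']
for col in zero_invalid:
    print(col, (data[col] == 0).sum())  # how many suspicious zeros per column

# work on a copy so the plotting recipes below still use the raw data
data_clean = data.copy()
data_clean[zero_invalid] = data_clean[zero_invalid].replace(0, np.nan)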

Visualize Data

A graph is a lot more telling about the distribution and relationships of attributes.

Nevertheless, it is important to take your time and review the statistics first. Each time you review the data a different way, you open yourself up to noticing different aspects and potentially achieving different insights into the problem.

Pandas uses matplotlib for creating graphs and provides convenient functions to do so. You can learn more about data visualization in Pandas.

Feature Distributions

The first and easiest property to review is the distribution of each attribute.

We can start out and review the spread of each attribute by looking at box and whisker plots.

Python
import matplotlib.pyplot as plt
pd.options.display.mpl_style = 'default'
data.boxplot()

This snippet changes the style for drawing graphs (via matplotlib) to the default style, which looks better.
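Note that the pd.options.display.mpl_style option was deprecated and later removed from Pandas, so the snippet above may raise an error on newer versions. A rough equivalent (an assumption about your setup, not part of the original post) is to set a matplotlib style directly:

Python
import matplotlib.pyplot as plt

plt.style.use('ggplot')  # or any other built-in matplotlib style
data.boxplot()
plt.show()               # needed when running as a script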

Attribute box and whisker plots

We can see that the test attribute has a lot of outliers. We can also see that the plas attribute seems to have a relatively even normal distribution.

We can also look at the distribution of each attribute by discretizing the values into buckets and reviewing the frequency in each bucket as histograms.

Python
data.hist()

This lets you note interesting properties of the attribute distributions such as the possible normal distribution of attributes like pres and skin.

Attribute Histogram Matrix

You can review more details about the boxplot and hist functions on DataFrame.

Feature-Class Relationships

The next important relationship to explore is that of each attribute to the class attribute.

One approach is to visualize the distribution of the attributes for the data instances of each class and note any differences. You can generate a matrix of histograms for each attribute, with one matrix for each class value, as follows:

Python
data.groupby('class').hist()

The data is grouped by the class attribute (two groups), then a matrix of histograms is created for the attributes in each group. The result is two images.

Attribute Histogram Matrix for Class 0

Attribute Histogram Matrix for Class 1

This helps to point out differences in the distributions between the classes like those for the plas attribute.

You can better contrast the attribute values for each class on the same plot:

Python
data.groupby('class').plas.hist(alpha=0.4)

This groups the data by class but only plots the histogram of plas, showing the class value of 0 in red and the class value of 1 in blue. You can see a similarly shaped normal distribution, but with a shift. This attribute is likely going to be useful for discriminating between the classes.

Overlapping Attribute Histograms for Each Class
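If you want explicit control over the axes and a legend for the two classes, a small sketch (not part of the original post) is to loop over the groups yourself:

Python
import matplotlib.pyplot as plt

# overlay the 'plas' histogram for each class value on one set of axes
fig, ax = plt.subplots()
for label, group in data.groupby('class'):
    group['plas'].hist(ax=ax, alpha=0.4, label='class {}'.format(label))
ax.set_xlabel('plas')
ax.legend()
plt.show()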

You can read more about the groupby function on DataFrame.

Feature-Feature Relationships

The final important relationship to explore is that between the attributes themselves.

We can review the relationships between attributes by looking at the distribution of the interactions of each pair of attributes.

Python
from pandas.plotting import scatter_matrix
scatter_matrix(data, alpha=0.2, figsize=(6, 6), diagonal='kde')

This uses a built-in function to create a matrix of scatter plots of all attributes against all attributes. The diagonal, where each attribute would be plotted against itself, shows the kernel density estimation of the attribute instead.

Attribute Scatter Plot Matrix

This is a powerful plot from which a lot of inspiration about the data can be drawn. For example, we can see a possible correlation between age and preg and another possible relationship between skin and mass.
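To back up these visual impressions with numbers (a quick check, not one of the original recipes), you can print the pairwise correlation matrix:

Python
# Pearson correlation for every pair of attributes;
# values near +1 or -1 suggest a strong linear relationship
print(data.corr(method='pearson'))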

Summary

We have covered a lot of ground in this post.

We started out looking at quick and dirty one-liners for loading our data in CSV format and describing it using summary statistics.

Next we looked at various different approaches to plotting our data to expose interesting structures. We looked at the distribution of the data in box and whisker plots and histograms, then we looked at the distribution of attributes compared to the class attribute and finally at the relationships between attributes in pair-wise scatter plots.
