Prepare Data for Machine Learning in Python with Pandas

If you are using the Python stack for studying and applying machine learning, then the library that you will want to use for data analysis and data manipulation is Pandas.

This post gives you a quick introduction to the Pandas library and points you in the right direction for getting started.

Pandas for data analysis.
Photo by gzlu, some rights reserved.

Data Analysis In Python

The Python SciPy stack is popular for scientific computing in general. It provides powerful libraries for handling gridded data (like NumPy) and plotting (like matplotlib). Until recently, a piece that had been missing from the suite was a good library for handling data.

Data typically does not come in a form that is ready to be used. A very large part of working on a data-driven problem like machine learning is data analysis and data munging.

  • Data Analysis: This is using tools like statistics and data visualization to better understand the problem by understanding the data.
  • Data Munging: This is the process of transforming raw data into a form that is appropriate for your job, like data analysis or machine learning.
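
To make the distinction concrete, here is a minimal sketch using Pandas. The city names and temperature values are made up for illustration; the point is only that munging turns raw text into usable numbers, and analysis then summarizes them.

```python
import pandas as pd

# Hypothetical raw records: numbers stored as text, with one missing entry.
raw = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Lima"],
    "temp_c": ["4.5", "19.0", None, "21.0"],
})

# Data munging: convert the text column to numbers and drop incomplete rows.
clean = raw.assign(temp_c=pd.to_numeric(raw["temp_c"])).dropna()

# Data analysis: summarize the cleaned data to better understand it.
mean_by_city = clean.groupby("city")["temp_c"].mean()
print(mean_by_city)
```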

Traditionally, you had to cobble together your own tool-chain of scripts in Python to perform these tasks.

These days, if you search for data analysis in Python you can’t avoid learning about Pandas. It has quickly become the go-to library for data handling in Python.

What is Pandas?

Pandas is a Python library for data analysis and data manipulation. It adds the missing piece to the SciPy framework for handling data.

Pandas was created by Wes McKinney in 2008, primarily for quantitative financial work. As such, it has a strong foundation in handling time series data and charting.

You use Pandas to load data into Python and perform your data analysis tasks. It is perfect for working with tabular data like data from a relational database or data from a spreadsheet.
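
As a rough illustration of loading tabular data, the sketch below reads a small CSV into a DataFrame. The column names and values are invented, and the data is read from an in-memory string so the example is self-contained; in practice you would likely pass a file path to `read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical spreadsheet-style data, kept in a string for this example.
csv_text = """name,age,score
Ann,34,88.5
Bob,29,92.0
Cy,41,79.5
"""

# read_csv accepts a path or any file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # number of rows and columns
print(df.dtypes)  # the type inferred for each column
print(df.head())  # the first few rows
```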

Wes describes the vision of Pandas: to create the most powerful and flexible open source data analysis and manipulation tool available in any language.

An admirable mission that makes you want to support his cause, if only to make your own data analysis tasks easier.

Pandas Features

Pandas is a pleasure to use.

In my experience it is simple, elegant and intuitive. Having come from R, the idioms and operations are familiar and relevant.

Pandas is built on top of standard libraries in the SciPy stack. It uses NumPy for fast array handling, and provides convenient wrappers around some statistical operations from StatsModels and charting from Matplotlib.
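
One way to see the NumPy connection is to move between the two libraries. This is a small sketch with made-up values, not a description of Pandas internals: it only shows that a Series can be converted to a NumPy array and that NumPy functions can be applied to a Series directly.

```python
import numpy as np
import pandas as pd

# A small Series of made-up values.
s = pd.Series([1.0, 4.0, 9.0, 16.0])

# Convert the Series to a plain NumPy array.
values = s.to_numpy()
print(type(values))

# NumPy functions also work on a Series and return a Series,
# so the index labels are preserved.
roots = np.sqrt(s)
print(roots)
```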

There is a strong focus on time series given the library's inception in the financial domain. It also has a strong focus on data frames for handling standard gridded data. Data handling is a core requirement of a library of this kind and speed has been made a priority. It is fast and provides data structures and operations like indexing and handling of sparsity.
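
To give a feel for the time series support, here is a minimal sketch of resampling and a moving window. The dates and prices are invented for illustration.

```python
import pandas as pd

# Six days of made-up daily prices, indexed by date.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], index=idx)

# Resampling: aggregate the daily values into 3-day means.
three_day = prices.resample("3D").mean()
print(three_day)

# Moving window: mean of each value and the one before it.
# The first entry is NaN because its window is incomplete.
rolling = prices.rolling(window=2).mean()
print(rolling)
```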

Some important features to note include:

  • Manipulation: moving columns, slicing, reshaping, merging, joining, filtering, and others.
  • Time-series Handling: operations on date/times, resampling, moving windows and auto-alignment of datasets.
  • Missing Data Handling: auto-exclude, drop, replace, interpolate missing values.
  • Group-by Operations: SQL like group by.
  • Hierarchical Indexing: data structure level, powerful for efficiently organizing data by columns.
  • Summary Statistics: Fast and powerful summary statistics of data.
  • Visualization: Simplified access to plots on data structures, such as histograms, box plots, general plots and a scatter matrix.
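
A few of the features above can be sketched in a handful of lines. The team names and point values are hypothetical, and filling with the median is just one possible choice for replacing a missing value.

```python
import pandas as pd

# Made-up data with one missing value.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "points": [10.0, None, 7.0, 9.0],
})

# Missing data handling: replace the missing value with the column median.
filled = df.fillna({"points": df["points"].median()})

# Manipulation: filter rows with a boolean condition.
high = filled[filled["points"] > 8]

# Group-by: SQL-like aggregation per team.
totals = filled.groupby("team")["points"].sum()

# Summary statistics: count, mean, std, min, quartiles and max.
summary = filled["points"].describe()

print(high)
print(totals)
print(summary)
```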

Pandas is available under a permissive BSD license and can be easily installed along with the rest of SciPy.

Pandas Resources

This has been a quick introduction to the Pandas library and there is more to learn. Install the library, grab a dataset and start to try things out. There is no better way to get started.

Visit the Pandas homepage and have a read of the library vision and features. You can also check out the GitHub page for the project.

A great place to start is the list of tutorials which includes links to cookbooks, lessons, and various notable IPython notebooks around the web.

Finally, for me, I live in the API documentation.

Papers

I find papers can give a good overview of an open source library, particularly in the Python and R ecosystems. Take a look at the following papers for a structured overview of what Pandas is all about.

Videos

There are a lot of great videos on YouTube of people demonstrating Pandas on their own data and at conferences.

A great starting point is Wes’ own 10-minute tour of pandas. Take a look. It’s a little heavy on time series data, but it’s a great and quick overview. You can also check out his IPython notebook for this tour.

Books

Finally, Wes is the author of the definitive book on data analysis in Python. If you want to get serious, practice, but also consider grabbing the book. It’s called: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython.
