Quick and Dirty Data Analysis for your Machine Learning Problem

To understand the machine learning problem you are working on, you need to know the data intimately.

I personally find this step onerous sometimes and just want to get on with defining my test harness, but I know it always flushes out interesting ideas and assumptions to test. As such, I use a step-by-step process to capture a minimum number of observations about the actual dataset before moving on from this step in the process of applied machine learning.

Quick and Dirty Data Analysis
Photo by timparkinson, some rights reserved

In this post you will discover my quick and easy process to analyse a dataset and get a minimum set of observations (and a minimum understanding) from a given dataset.

Data Analysis

The objective of the data analysis step is to increase your understanding of the problem by better understanding the problem's data.

This involves providing multiple different ways to describe the data as an opportunity to review and capture observations and assumptions that can be tested in later experiments.

There are two different approaches I use to describe a given dataset:

  1. Summarize Data: Describe the data and the data distributions.
  2. Visualize Data: Create various graphical summaries of the data.

The key here is to create different perspectives or views on the dataset in order to elicit insights about the data.

1. Summarize Data

Summarizing the data is about describing the actual structure of the data. I typically use a lot of automated tools to describe things like attribute distributions. The minimum aspects of the data I like to summarize are the structure and the distributions.

Data Structure

Summarizing the data structure is about describing the number and data types of attributes. For example, going through this process highlights ideas for transforms in the Data Preparation step for converting attributes from one type to another (such as real to ordinal or ordinal to binary).

Some motivating questions for this step include:

  • How many attributes and instances are there?
  • What are the data types of each attribute (e.g. nominal, ordinal, integer, real, etc.)?
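The two questions above can be answered with a few lines of code. The following is a minimal sketch using only the Python standard library; the toy dataset, its column names and the `infer_type` helper are all made up for illustration. Note that it can only guess coarse types (integer, real, nominal) from the raw text. Whether an attribute is ordinal is something you have to decide from your knowledge of the problem.

```python
import csv
import io

# Hypothetical toy dataset; the column names and values are invented for illustration.
RAW = """sepal_length,petal_count,colour,label
5.1,3,red,apple
4.9,,yellow,banana
6.2,4,red,apple
5.8,5,green,apple
"""

def infer_type(values):
    """Guess a coarse type for a column from its non-missing string values."""
    present = [v for v in values if v != ""]
    if not present:
        return "unknown"
    try:
        [int(v) for v in present]
        return "integer"
    except ValueError:
        pass
    try:
        [float(v) for v in present]
        return "real"
    except ValueError:
        return "nominal"

def describe_structure(text):
    """Return (number of instances, {attribute name: inferred type})."""
    rows = list(csv.DictReader(io.StringIO(text)))
    names = list(rows[0].keys()) if rows else []
    types = {name: infer_type([row[name] for row in rows]) for name in names}
    return len(rows), types

n_instances, attribute_types = describe_structure(RAW)
print(n_instances, "instances,", len(attribute_types), "attributes")
for name, kind in attribute_types.items():
    print(f"  {name}: {kind}")
```

An integer attribute with only a handful of distinct values is often a hint that it could be treated as ordinal or nominal instead, which is exactly the kind of idea this step is meant to surface.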

Data Distributions

Summarizing the distribution of each attribute can also flush out ideas for possible data transforms in the Data Preparation step, such as the need for and effects of Discretization, Normalization and Standardization.

I like to capture a summary of the distribution of each real valued attribute. This typically includes the minimum, maximum, median, mode, mean, standard deviation and number of missing values.

Some motivating questions for this step include:

  • What is the five-number summary (minimum, first quartile, median, third quartile, maximum) of each real-valued attribute?
  • What is the distribution of the values for the class attribute?
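As a rough illustration of such a summary, here is a sketch built on Python's `statistics` module. The `summarize` function and the sample values are hypothetical, and missing values are assumed to be represented as `None`. Quartile definitions differ between tools, so the quartiles here (computed with the "inclusive" method) may not match what R or Weka report on the same data.

```python
import statistics

def summarize(values):
    """Summarize one real-valued attribute; None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    # quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "missing": len(values) - len(present),
        "min": present[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": present[-1],
        "mean": statistics.mean(present),
        "mode": statistics.mode(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
    }

# Hypothetical attribute values, invented for illustration.
heights = [2.0, 4.0, 4.0, None, 5.0, 7.0, 9.0, None, 4.0]
summary = summarize(heights)
for key, value in summary.items():
    print(f"{key}: {value}")
```

A mean that sits well above the median, as it does in this toy attribute, is a quick sign of a skewed distribution and a prompt to consider a transform.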

Knowing the distribution of the class attribute (or mean of a regression output variable) is useful because you can use it to define the minimum accuracy of a predictive model.

For example, if there is a binary classification problem (2 classes) with a distribution of 80% apples and 20% bananas, then a predictor can predict “apples” for every test instance and, provided the test data follows the same distribution, expect to achieve an accuracy of about 80%. This is the worst-case baseline that all algorithms in the test harness must beat when Evaluating Algorithms.
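The apples and bananas example can be reproduced in a few lines. This is a small sketch; the `majority_baseline` name is made up, and the labels are constructed to match the 80/20 split described above.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# 80 apples and 20 bananas, matching the example in the text.
labels = ["apple"] * 80 + ["banana"] * 20
majority_class, baseline_accuracy = majority_baseline(labels)
print(majority_class, baseline_accuracy)
```

The accuracy is measured here on the same labels used to find the majority class, so treat it as an estimate of the baseline rather than a guarantee on unseen data.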

Additionally, if I have the time or interest, I like to generate a summary of the pair-wise attribute correlations using a parametric (Pearson’s) and non-parametric (Spearman’s) correlation coefficient. This can highlight attributes that might be candidates for removal (highly correlated with each other) and others that may be highly predictive (highly correlated with the outcome attribute).
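To make the difference between the two coefficients concrete, here is a hand-rolled sketch in plain Python. The function names are my own, and Spearman's coefficient is computed as Pearson's coefficient on average ranks (ties share the mean of their ranks). It assumes neither attribute is constant, since that would make the denominator zero.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's coefficient on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical attributes: y is monotonic but non-linear in x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 8.0, 27.0, 64.0, 125.0]
print("pearson:", round(pearson(x, y), 3))
print("spearman:", round(spearman(x, y), 3))
```

On this toy pair Spearman's coefficient is 1 because the relationship is perfectly monotonic, while Pearson's is lower because the relationship is not linear. A gap like that between the two is itself a useful observation to record.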

2. Visualize Data

Visualizing the data is about creating graphs that summarize the data, capturing them and studying them for interesting structure that can be described.

There is a seemingly infinite number of graphs you can create (especially in software like R), so I like to keep it simple and focus on histograms and scatter plots.

Attribute Histograms

I like to create histograms of all attributes and mark class values. I like this because I used Weka a lot when I was learning machine learning and it does this for you. Nevertheless, it’s easy to do in other software like R.

Seeing a discretized view of the distribution graphically can quickly highlight things like the possible family of distribution (such as Normal or Exponential) and how the class values map onto those distributions.

Attribute Histograms Showing Class Values

Some motivating questions for this step include:

  • What families of distributions are shown (if any)?
  • Are there any obvious structures in the attributes that map to class values?
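If you are not working in Weka or R, the same idea can be approximated without any plotting library. The sketch below bins one attribute into equal-width bins and counts the class values in each bin, then prints a crude text histogram. The `class_histogram` function, the sample values and the choice of four bins are all assumptions for illustration.

```python
from collections import Counter

def class_histogram(values, labels, n_bins):
    """Count values per equal-width bin, split by class label."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        width = 1.0  # all values identical: everything lands in the first bin
    bins = [Counter() for _ in range(n_bins)]
    for value, label in zip(values, labels):
        # The maximum value would fall just past the last bin, so clamp it in.
        index = min(int((value - low) / width), n_bins - 1)
        bins[index][label] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, bins

# Hypothetical attribute and class values, invented for illustration.
values = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0]
labels = ["a", "a", "a", "a", "b", "a", "b", "b"]
edges, bins = class_histogram(values, labels, n_bins=4)

for i, counts in enumerate(bins):
    bars = " ".join(f"{label}:{'#' * count}" for label, count in sorted(counts.items()))
    print(f"{edges[i]:.1f} to {edges[i + 1]:.1f}  {bars}")
```

In this toy output class "a" dominates the low bins and class "b" the high bins, which is the kind of structure the second question above is asking about.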

Pairwise Scatter-plots

Scatter plots plot one attribute on each axis. In addition, a third dimension can be added in the form of the color of the plotted points, mapped to class values. Pairwise scatter plots can be created for all pairs of attributes.

These graphs can quickly highlight 2-dimensional structure between attributes (such as correlation) as well as cross-attribute trends in the mapping of attributes to class values.

Pairwise Scatter-plots Showing Class Values

Some motivating questions for this step include:

  • What interesting two-dimensional structures are shown?
  • What interesting relationships between the attributes and the class values are shown?
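As a very rough stand-in for a proper plotting tool, the sketch below enumerates every pair of attributes and draws each pair on a small character grid, using the first letter of the class label in place of color. The `ascii_scatter` function, the attribute names and the data are hypothetical, and points that land on the same cell simply overwrite each other, so this is only suitable for a quick look at small datasets.

```python
from itertools import combinations

def ascii_scatter(xs, ys, labels, width=20, height=8):
    """Render a character-grid scatter plot, one letter per point (its class)."""
    x_low, x_high = min(xs), max(xs)
    y_low, y_high = min(ys), max(ys)
    x_span = (x_high - x_low) or 1.0  # avoid dividing by zero for constant attributes
    y_span = (y_high - y_low) or 1.0
    grid = [[" "] * width for _ in range(height)]
    for x, y, label in zip(xs, ys, labels):
        col = round((x - x_low) / x_span * (width - 1))
        row = round((y - y_low) / y_span * (height - 1))
        # Flip the row so that larger y values are drawn higher up.
        grid[height - 1 - row][col] = label[0]
    return ["".join(line) for line in grid]

# Hypothetical dataset, invented for illustration.
data = {
    "length": [1.0, 2.0, 3.0, 4.0],
    "width": [1.0, 2.0, 3.0, 4.0],
    "weight": [4.0, 3.0, 2.0, 1.0],
}
labels = ["apple", "apple", "banana", "banana"]

plots = {}
for name_x, name_y in combinations(data, 2):
    plots[(name_x, name_y)] = ascii_scatter(
        data[name_x], data[name_y], labels, width=4, height=4
    )
    print(f"{name_y} (vertical) against {name_x} (horizontal)")
    for line in plots[(name_x, name_y)]:
        print("|" + line + "|")
```

Even at this resolution the toy data shows a rising diagonal for length against width and a falling one for length against weight, with the two classes separated along it.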

Summary

In this post you discovered a process for data analysis that seeks to create different views on the data in order to elicit observations and assumptions about the data.

The two approaches used are:

  1. Summarize Data: Describe the data and the data distributions.
  2. Visualize Data: Create various graphical summaries of the data.
