How To Talk About Data in Machine Learning (Terminology from Statistics and Computer Science)

Data plays a big part in machine learning.

It is important to understand and use the right terminology when talking about data.

In this post you will discover exactly how to describe and talk about data in machine learning. After reading this post you will know the terminology and nomenclature used in machine learning to describe data.

This will greatly help you with understanding machine learning algorithms in general.

Let’s get started.

Data As You Know It

How do you think about data?

Think of a spreadsheet, like Microsoft Excel. You have columns, rows, and cells.

  • Column: A column describes data of a single type. For example, you could have a column of weights or heights or prices. All the data in one column will have the same scale and have meaning relative to each other.
  • Row: A row describes a single entity or observation and the columns describe properties about that entity or observation. The more rows you have, the more examples from the problem domain that you have.
  • Cell: A cell is a single value in a row and column. It may be a real value (1.5), an integer (2), or a category (“red”).

This is how you probably think about data: columns, rows and cells.

Generally, we can call this type of data tabular data. This form of data is easy to work with in machine learning.
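
As a rough illustration of the column, row and cell vocabulary, here is a minimal sketch in plain Python. The column names, the values and the helper functions get_column and get_cell are all made up for this example and do not come from any particular library.

```python
# A small hypothetical table: each inner list is one row (one observation).
column_names = ["height_cm", "weight_kg", "color"]
rows = [
    [170.0, 65.5, "red"],
    [182.5, 80.0, "blue"],
    [165.0, 54.2, "red"],
]


def get_column(rows, column_names, name):
    """Return every value stored under one column name."""
    index = column_names.index(name)
    return [row[index] for row in rows]


def get_cell(rows, column_names, row_index, name):
    """Return the single value at a given row and column."""
    return rows[row_index][column_names.index(name)]


heights = get_column(rows, column_names, "height_cm")  # one column
first_row = rows[0]                                    # one row
cell = get_cell(rows, column_names, 1, "color")        # one cell

print(heights)
print(first_row)
print(cell)
```

Note that the column of heights holds values of one type on one scale, while a row mixes real values and a category, as described above.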

Data As It Is Known in Machine Learning

There are different flavors of machine learning that give different perspectives on the field. For example, there is the statistical perspective and the computer science perspective.

Next we will look at the different terms used to refer to data as you know it.

Statistical Learning Perspective

The statistical perspective frames data in the context of a hypothetical function (f) that the machine learning algorithm is trying to learn.

That is, given some input variables (input), what is the predicted output variable (output)?

output = f(input)

Those columns that are the inputs are referred to as input variables.

The column of data that you may not always have, and that you would like to predict for new input data in the future, is called the output variable. It is also called the response variable.

output variable = f(input variables)

Typically, you have more than one input variable. In this case, the group of input variables is referred to as the input vector.

output variable = f(input vector)

If you have done a little statistics in the past, you may know of another, more traditional terminology.

For example, a statistics text may talk about the input variables as independent variables and the output variable as the dependent variable. This is because, in the phrasing of the prediction problem, the output is dependent on, or a function of, the input or independent variables.

dependent variable = f(independent variables)

The data is described using a shorthand in equations and descriptions of machine learning algorithms. The standard shorthand used in the statistical perspective is to refer to the input variables as capital “x” (X) and the output variable as capital “y” (Y).

Y = f(X)

When you have multiple input variables, they may be indexed with an integer to indicate their position in the input vector, for example X1, X2 and X3 for data in the first three columns.
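
The X and Y shorthand can be sketched in plain Python as follows. This is only an illustration: the numbers are invented, and the choice to store the output variable in the last column is an assumption of this example, not something the notation requires.

```python
# Hypothetical table: the first three columns are input variables and,
# by assumption for this sketch, the last column is the output variable.
table = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.3, 3.3, 6.0, 2.5],
]


def split_inputs_outputs(table):
    """Split each row into an input vector (X) and an output value (Y)."""
    X = [row[:-1] for row in table]
    Y = [row[-1] for row in table]
    return X, Y


X, Y = split_inputs_outputs(table)

# X1, X2 and X3 are the first, second and third input columns.
X1 = [input_vector[0] for input_vector in X]
X2 = [input_vector[1] for input_vector in X]
X3 = [input_vector[2] for input_vector in X]

print(X[0])  # the input vector for the first row
print(Y)     # the output variable for every row
print(X1)
```

Each element of X is one input vector, so a learned function f would be applied row by row to produce the matching element of Y.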

Computer Science Perspective

There is a lot of overlap in the computer science terminology for data with the statistical perspective. We will look at the key differences.

A row often describes an entity (like a person) or an observation about an entity. As such, the columns for a row are often referred to as attributes of the observation. When modeling a problem and making predictions, we may refer to input attributes and output attributes.

output attribute = program(input attributes)

Another name for columns is features, used for the same reason as attribute, where a feature describes some property of the observation. This is more common when working with data where features must be extracted from the raw data in order to construct an observation.

Examples of this include analog data like images, audio and video.

output = program(input features)

Another computer science phrasing refers to a row of data or an observation as an instance. This is used because a row may be considered a single example or single instance of data observed or generated by the problem domain.

prediction = program(instance)
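
To make features and instances a little more concrete, the sketch below extracts a few features from a short raw signal and passes the resulting instance to a program. Everything here is hypothetical: the signal values, the chosen features, and the hand-written rule inside program, which stands in for a real learned predictor.

```python
def extract_features(raw_signal):
    """Turn a raw sequence of samples into a fixed set of named features."""
    pairs = zip(raw_signal, raw_signal[1:])
    return {
        "mean": sum(raw_signal) / len(raw_signal),
        "peak": max(abs(sample) for sample in raw_signal),
        # Count how often consecutive samples change sign.
        "zero_crossings": sum(1 for a, b in pairs if (a < 0) != (b < 0)),
    }


def program(instance):
    """A stand-in predictor: a hand-written rule, not a learned model."""
    return "loud" if instance["peak"] > 0.5 else "quiet"


# Raw data (for example, a few audio samples) is not yet an observation.
raw_signal = [0.1, -0.2, 0.9, -0.7, 0.3]

# The extracted features together form one instance.
instance = extract_features(raw_signal)
prediction = program(instance)

print(instance)
print(prediction)
```

The raw signal could be any length, but the instance always has the same three features, which is what lets it sit in a single row of a table.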

Models and Algorithms

There is one final note of clarification that is important and that is between algorithms and models.

This can be confusing, as the terms algorithm and model are often used interchangeably.

A perspective that I like is to think of the model as the specific representation learned from data and the algorithm as the process for learning it.

model = algorithm(data)

For example, a decision tree or a set of coefficients is a model, and C5.0 and Least Squares Linear Regression are the algorithms used to learn those respective models.
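
The split between algorithm and model can be sketched with a small least squares example in plain Python. This is a simplified illustration for a single input variable; the data and the function names least_squares and predict are assumptions made for this sketch.

```python
def least_squares(xs, ys):
    """The algorithm: fit a line to the data and return the coefficients."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The returned coefficients are the model.
    return {"intercept": intercept, "slope": slope}


def predict(model, x):
    """Use the learned model to predict an output for a new input."""
    return model["intercept"] + model["slope"] * x


# Hypothetical data that follows y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = least_squares(xs, ys)  # model = algorithm(data)
print(model)
print(predict(model, 5.0))
```

Here least_squares is the process and the dictionary of two coefficients is the learned representation, which can be kept and reused without running the algorithm again.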

Summary

In this post you discovered the key terminology used to describe data in machine learning.

  • You started with the standard understanding of tabular data as seen in a spreadsheet as columns, rows and cells.
  • You learned the statistical terms of input and output variables that may be denoted as X and Y respectively.
  • You learned the computer science terms of attribute, feature and instance.
  • Finally, you learned that models and algorithms can be distinguished as the learned representation and the process for learning it.

Do you have any questions about this post or about data terminology used in machine learning? Leave a comment and ask your question and I will do my best to answer it.
