How to Scale Machine Learning Data From Scratch With Python

Many machine learning algorithms expect data to be scaled consistently.

There are two popular methods that you should consider when scaling your data for machine learning.

In this tutorial, you will discover how you can rescale your data for machine learning. After reading this tutorial you will know:

  • How to normalize your data from scratch.
  • How to standardize your data from scratch.
  • When to normalize as opposed to standardize data.

Let’s get started.

  • Update Feb/2018: Fixed minor typo in min/max code example.
  • Update Mar/2018: Added alternate link to download the dataset as the original appears to have been taken down.
  • Update Aug/2018: Tested and updated to work with Python 3.6.
How To Prepare Machine Learning Data From Scratch With Python
Photo by Ondra Chotovinsky, some rights reserved.

Description

Many machine learning algorithms expect the input variables, and sometimes even the output variable, to share a consistent scale.

Consistent scaling can help methods that weight inputs in order to make a prediction, such as linear regression and logistic regression.

It is practically required in methods that combine weighted inputs in complex ways, such as artificial neural networks and deep learning.

In this tutorial, we are going to practice rescaling one standard machine learning dataset in CSV format.

Specifically, the Pima Indians diabetes dataset. It contains 768 rows and 9 columns. All of the values in the file are numeric, specifically floating point values. We will first learn how to load the file, then how to convert the loaded strings to numeric values.

Tutorial

This tutorial is divided into 3 parts:

  1. Normalize Data.
  2. Standardize Data.
  3. When to Normalize and Standardize.

These steps will provide the foundations you need to handle scaling your own data.

1. Normalize Data

Normalization can refer to different techniques depending on context.

Here, we use normalization to refer to rescaling an input variable to the range between 0 and 1.

Normalization requires that you know the minimum and maximum values for each attribute.

This can be estimated from training data or specified directly if you have deep knowledge of the problem domain.

You can easily estimate the minimum and maximum values for each attribute in a dataset by enumerating through the values.

The snippet of code below defines the dataset_minmax() function that calculates the min and max value for each attribute in a dataset, then returns an array of these minimum and maximum values.

# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

We can contrive a small dataset for testing as follows:

x1 x2
50 30
20 90

With this contrived dataset, we can test our function for calculating the min and max for each column.

# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

# Contrive small dataset
dataset = [[50, 30], [20, 90]]
print(dataset)

# Calculate min and max for each column
minmax = dataset_minmax(dataset)
print(minmax)

Running the example produces the following output.

First, the dataset is printed in a list of lists format, then the min and max for each column are printed as [min, max] pairs, one pair per column.

For example:

[[50, 30], [20, 90]]
[[20, 50], [30, 90]]

Once we have estimates of the minimum and maximum values for each column, we can normalize the raw data to the range 0 to 1.

The calculation to normalize a single value for a column is:

scaled_value = (value - min) / (max - min)
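For example, using the contrived dataset above, the first value in the first column (50) is scaled as (50 - 20) / (50 - 20) = 1.0, and the second value (20) as (20 - 20) / (50 - 20) = 0.0.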

Below is an implementation of this in a function called normalize_dataset() that normalizes values in each column of a provided dataset.

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

We can tie this function together with the dataset_minmax() function and normalize the contrived dataset.

# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

# Contrive small dataset
dataset = [[50, 30], [20, 90]]
print(dataset)

# Calculate min and max for each column
minmax = dataset_minmax(dataset)
print(minmax)

# Normalize columns
normalize_dataset(dataset, minmax)
print(dataset)

Running this example prints the output below, including the normalized dataset.

[[50, 30], [20, 90]]
[[20, 50], [30, 90]]
[[1.0, 0.0], [0.0, 1.0]]

We can combine this code with code for loading a CSV file to load and normalize the Pima Indians diabetes dataset.

Download the Pima Indians dataset from the UCI Machine Learning Repository and place it in your current directory with the name pima-indians-diabetes.csv (update: download from here). Open the file and delete any empty lines at the bottom.

The example first loads the dataset and converts the values for each column from string to floating point values. The minimum and maximum values for each column are estimated from the dataset, and finally, the values in the dataset are normalized.

from csv import reader

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue  # skip any empty lines
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

# Load pima-indians-diabetes dataset
filename = 'pima-indians-diabetes.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))

# Convert string columns to float
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)
print(dataset[0])

# Calculate min and max for each column
minmax = dataset_minmax(dataset)

# Normalize columns
normalize_dataset(dataset, minmax)
print(dataset[0])

Running the example produces the output below.

The first record from the dataset is printed before and after normalization, showing the effect of the scaling.

Loaded data file pima-indians-diabetes.csv with 768 rows and 9 columns
[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
[0.35294117647058826, 0.7437185929648241, 0.5901639344262295, 0.35353535353535354, 0.0, 0.5007451564828614, 0.23441502988898377, 0.48333333333333334, 1.0]

2. Standardize Data

Standardization is a rescaling technique that centers the distribution of the data on the value 0 and scales the standard deviation to the value 1.

Together, the mean and the standard deviation can be used to summarize a normal distribution, also called the Gaussian distribution or bell curve.

It requires that the mean and standard deviation of the values for each column be known prior to scaling. As with normalizing above, we can estimate these values from training data, or use domain knowledge to specify their values.
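Mirroring the normalization calculation above, standardizing a single value for a column takes the following form, where mean and stdev are the column statistics we develop next:

standardized_value = (value - mean) / stdev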

Let’s start with creating functions to estimate the mean and standard deviation statistics for each column from a dataset.

The mean describes the middle or central tendency for a collection of numbers. The mean for a column is calculated as the sum of all values for a column divided by the total number of values.

mean = sum(values) / total_values

The function below named column_means() calculates the mean values for each column in the dataset.

# Calculate the mean of each column
def column_means(dataset):
    means = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        means[i] = sum(col_values) / float(len(dataset))
    return means
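We can test column_means() on the contrived dataset; with columns [50, 20] and [30, 90], the expected means are 35.0 and 60.0. From here, the same pattern extends to estimating the standard deviation and to standardizing the full dataset. The sketch below assumes helper names column_stdevs() and standardize_dataset() in keeping with the naming convention used so far, and uses the sample standard deviation (dividing by N-1); treat it as an illustrative sketch rather than an exact listing.

from math import sqrt

# Estimate the standard deviation of each column, given the column means
def column_stdevs(dataset, means):
    stdevs = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        # Sum of squared deviations from the column mean
        variance = [pow(row[i] - means[i], 2) for row in dataset]
        stdevs[i] = sum(variance)
    # Sample standard deviation: divide by N-1 before taking the square root
    stdevs = [sqrt(x / float(len(dataset) - 1)) for x in stdevs]
    return stdevs

# Rescale dataset columns to zero mean and unit standard deviation
def standardize_dataset(dataset, means, stdevs):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - means[i]) / stdevs[i]

# Test on the contrived dataset, building on column_means() above
dataset = [[50, 30], [20, 90]]
means = column_means(dataset)
stdevs = column_stdevs(dataset, means)
standardize_dataset(dataset, means, stdevs)
print(dataset)

With this contrived dataset, each column holds two values symmetric about its mean, so every standardized value comes out at approximately plus or minus 0.707.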
