Why Data Normalization is necessary for Machine Learning models

阿新 • Published: 2018-12-28


Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale without distorting differences in the ranges of values. Not every dataset requires normalization for machine learning; it is needed only when features have different ranges.

For example, consider a data set containing two features, age and income. Age ranges from 0–100, while income ranges from 0–20,000 and higher (say, 20,000–500,000), so income values are about 1,000 times larger than age values. These two features are therefore in very different ranges. When we do further analysis, such as linear regression, the income attribute will intrinsically influence the result more because of its larger values, but this doesn't necessarily mean it is a more important predictor.
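To make the effect of scale concrete, here is a minimal sketch. It uses Euclidean distance rather than linear regression, purely because the effect is easy to see there, and the three people and their values are made up for illustration. The helper standardize_columns is a hypothetical name, not part of any library.

```python
import math
import statistics

# Hypothetical people described by (age in years, income in dollars).
people = {
    "a": (25, 50_000),
    "b": (60, 51_000),   # very different age, similar income
    "c": (26, 80_000),   # similar age, very different income
}

def standardize_columns(rows):
    """Rescale each column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]

raw = list(people.values())
scaled = standardize_columns(raw)

# On the raw data the distance is dominated by income ...
raw_ab = math.dist(raw[0], raw[1])
raw_ac = math.dist(raw[0], raw[2])
# ... while after standardization both features contribute.
scaled_ab = math.dist(scaled[0], scaled[1])
scaled_ac = math.dist(scaled[0], scaled[2])

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.2f}  d(a,c)={scaled_ac:.2f}")
```

On the raw values, the 35-year age gap between a and b barely registers next to the income gap between a and c. After rescaling, the two gaps carry comparable weight.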

To explain further, let's build and compare two deep neural network models: one trained on the original data and another trained on normalized data.

Below is a neural network model built using the original data:

'''Using the covertype dataset from Kaggle to predict forest cover type'''

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
# sklearn.cross_validation has been removed from scikit-learn;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split

df = pd.read_csv('covtype.csv')

x = df[df.columns[:54]]  # feature columns
y = df.Cover_Type        # target column

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=90)

# y is a multi-class categorical variable, so the output layer uses softmax
# and the loss function is sparse categorical cross-entropy.

model = keras.Sequential([
    keras.layers.Dense(64, activation=tf.nn.relu, input_shape=(x_train.shape[1],)),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(8, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history1 = model.fit(x_train, y_train, epochs=26, batch_size=60,
                     validation_data=(x_test, y_test))

Output:

Epoch 1/26
406708/406708 [==============================] - 19s 47us/step - loss: 8.2614 - acc: 0.4874 - val_loss: 8.2531 - val_acc: 0.4880
Epoch 2/26
406708/406708 [==============================] - 18s 45us/step - loss: 8.2614 - acc: 0.4874 - val_loss: 8.2531 - val_acc: 0.4880
....................
Epoch 26/26
406708/406708 [==============================] - 17s 42us/step - loss: 8.2614 - acc: 0.4874 - val_loss: 8.2531 - val_acc: 0.4880

The validation accuracy of the above model is 0.4880, and it does not change from the first epoch to the last.
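The model's output layer and loss can be sketched in a few lines of plain Python. This is only an illustration of what softmax and sparse categorical cross-entropy compute, not the Keras implementation, and the raw scores below are made up. The output has 8 units because the loss uses the integer label directly as an index; assuming the labels run from 1 to 7, indices 0 to 7 are needed.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_class, probabilities):
    """Negative log-probability assigned to the integer class label."""
    return -math.log(probabilities[true_class])

# Hypothetical raw scores for 8 output units (indices 0-7).
logits = [0.1, 2.0, 0.3, 0.0, -1.0, 0.5, 0.2, -0.5]
probs = softmax(logits)

loss_correct = sparse_categorical_crossentropy(1, probs)  # label is the top-scoring class
loss_wrong = sparse_categorical_crossentropy(4, probs)    # label is the lowest-scoring class

print(f"sum of probabilities: {sum(probs):.3f}")
print(f"loss when label=1: {loss_correct:.3f}, loss when label=4: {loss_wrong:.3f}")
```

The loss is small when the model puts high probability on the true class and large when it does not, which is why it suits integer-encoded class labels.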

There are different methods to normalize data. Here I normalize features by removing the mean and scaling to unit variance (standardization).
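Removing the mean and scaling to unit variance means replacing each value x with z = (x - mean) / std. A minimal sketch of that formula follows, with made-up values and hypothetical helper names (fit_standardizer, standardize). It uses the population standard deviation, which is also what scikit-learn's StandardScaler uses, and it learns the statistics from the training values only.

```python
import statistics

def fit_standardizer(values):
    """Learn the mean and (population) standard deviation of one feature."""
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical values for a single feature, split into training and test sets.
train = [2500.0, 2700.0, 2900.0, 3100.0, 3300.0]
test = [2600.0, 3400.0]

mean, std = fit_standardizer(train)          # statistics come from training data only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # test data reuses the training statistics

print("train:", [round(z, 3) for z in train_scaled])
print("test: ", [round(z, 3) for z in test_scaled])
```

Reusing the training mean and standard deviation for the test set keeps information about the test data out of the preprocessing step.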

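import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

df = pd.read_csv('covtype.csv')

# Use the same 54 feature columns as before. Slicing with [:55] would also
# pull in the target column (Cover_Type) as a feature.
x = df[df.columns[:54]]
y = df.Cover_Type

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=90)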

# Select the first 10 columns, which are the ones to be normalized
train_norm = x_train[x_train.columns[0:10]]
test_norm = x_test[x_test.columns[0:10]]

# Normalize training data
std_scale = preprocessing.StandardScaler().fit(train_norm)
x_train_norm = std_scale.transform(train_norm)

# Convert the numpy array back to a dataframe and write it into x_train
training_norm_col = pd.DataFrame(x_train_norm, index=train_norm.index, columns=train_norm.columns)
x_train.update(training_norm_col)
print(x_train.head())

# Normalize testing data using the mean and SD of the training set
x_test_norm = std_scale.transform(test_norm)
testing_norm_col = pd.DataFrame(x_test_norm, index=test_norm.index, columns=test_norm.columns)
x_test.update(testing_norm_col)
print(x_test.head())

# Build the neural network model with normalized data
model = keras.Sequential([
    keras.layers.Dense(64, activation=tf.nn.relu, input_shape=(x_train.shape[1],)),
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(8, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history2 = model.fit(x_train, y_train, epochs=26, batch_size=60,
                     validation_data=(x_test, y_test))

Output:

Train on 464809 samples, validate on 116203 samples
Epoch 1/26
464809/464809 [==============================] - 16s 34us/step - loss: 0.5433 - acc: 0.7675 - val_loss: 0.4701 - val_acc: 0.8022
Epoch 2/26
464809/464809 [==============================] - 16s 34us/step - loss: 0.4436 - acc: 0.8113 - val_loss: 0.4410 - val_acc: 0.8124
Epoch 3/26
....................
Epoch 26/26
464809/464809 [==============================] - 16s 34us/step - loss: 0.2703 - acc: 0.8907 - val_loss: 0.2773 - val_acc: 0.8893

The validation accuracy of this model is much better: 88.93%.

# Plot the accuracy of the two models

def plot_history(history):
    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    # These Keras logs use 'acc' and 'val_acc'; newer Keras versions
    # name the keys 'accuracy' and 'val_accuracy'.
    plt.plot(history.epoch, np.array(history.history['acc']), label='Train Accuracy')
    plt.plot(history.epoch, np.array(history.history['val_acc']), label='Val Accuracy')
    plt.legend()
    plt.ylim([0, 1])  # wide enough to show the curves of both models

# First model
plot_history(history1)

# Second model
plot_history(history2)

Comparing the two graphs, we see that model 1 has a very low validation accuracy (48%). The model is not learning at all, which is why the graph shows a flat line for both the training and test data. To overcome this learning problem, we normalize the data. Model 2 was trained on normalized data, and as a result its accuracy increased over the epochs, reaching 89% at epoch 26.
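One common way that unscaled inputs hamper gradient-based training can be reproduced on a toy problem. The sketch below is not the Keras network above and is not a diagnosis of that experiment: it fits a one-feature linear model with plain batch gradient descent, on made-up data, using the same learning rate and number of steps for raw and standardized inputs. The function names and values are all illustrative assumptions.

```python
import statistics

def mse(w, b, xs, ys):
    """Mean squared error of the line w * x + b."""
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(xs)

def gradient_descent(xs, ys, lr=0.1, steps=10):
    """Fit y = w * x + b with plain batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: income-like inputs and a target that is linear in them.
incomes = [20_000.0, 40_000.0, 60_000.0, 80_000.0, 100_000.0]
targets = [1.0, 2.0, 3.0, 4.0, 5.0]

mean = statistics.fmean(incomes)
std = statistics.pstdev(incomes)
incomes_scaled = [(x - mean) / std for x in incomes]

# With w = b = 0 the starting loss is the same for both versions of the input.
initial_loss = mse(0.0, 0.0, incomes, targets)

w_raw, b_raw = gradient_descent(incomes, targets)
raw_loss = mse(w_raw, b_raw, incomes, targets)

w_scaled, b_scaled = gradient_descent(incomes_scaled, targets)
scaled_loss = mse(w_scaled, b_scaled, incomes_scaled, targets)

print(f"initial loss:             {initial_loss:.3f}")
print(f"after training (raw):     {raw_loss:.3e}")
print(f"after training (scaled):  {scaled_loss:.3f}")
```

With raw inputs the gradients are so large that each update overshoots and the loss grows instead of shrinking. With standardized inputs the same settings reduce the loss steadily.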

Thanks. Happy Learning :)
