How to One Hot Encode Sequence Data in Python

Machine learning algorithms cannot work with categorical data directly.

Categorical data must be converted to numbers.

This applies when you are working with a sequence classification type problem and plan on using deep learning methods such as Long Short-Term Memory recurrent neural networks.

In this tutorial, you will discover how to convert your input or output sequence data to a one hot encoding for use in sequence classification problems with deep learning in Python.

After completing this tutorial, you will know:

  • What an integer encoding and one hot encoding are and why they are necessary in machine learning.
  • How to calculate an integer encoding and one hot encoding by hand in Python.
  • How to use the scikit-learn and Keras libraries to automatically encode your sequence data in Python.

Let’s get started.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. What is One Hot Encoding?
  2. Manual One Hot Encoding
  3. One Hot Encode with scikit-learn
  4. One Hot Encode with Keras

What is One Hot Encoding?

A one hot encoding is a representation of categorical variables as binary vectors.

This first requires that the categorical values be mapped to integer values.

Then, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.

Worked Example of a One Hot Encoding

Let’s make this concrete with a worked example.

Assume we have a sequence of labels with the values ‘red’ and ‘green’.

We can assign ‘red’ an integer value of 0 and ‘green’ the integer value of 1. As long as we always assign these numbers to these labels, this is called an integer encoding. Consistency is important so that we can invert the encoding later and get labels back from integer values, such as in the case of making a prediction.

Next, we can create a binary vector to represent each integer value. The vector will have a length of 2 for the 2 possible integer values.

The ‘red’ label encoded as a 0 will be represented with a binary vector [1, 0] where the zeroth index is marked with a value of 1. In turn, the ‘green’ label encoded as a 1 will be represented with a binary vector [0, 1] where the first index is marked with a value of 1.

If we had the sequence:

'red', 'red', 'green'

We could represent it with the integer encoding:

0, 0, 1

And the one hot encoding of:

[1, 0]
[1, 0]
[0, 1]
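
The worked example above can be reproduced in a few lines of plain Python. This is a minimal sketch; the names label_to_int and one_hot are chosen here for illustration and are not part of any library.

```python
# Fixed mapping from label to integer (the integer encoding).
label_to_int = {'red': 0, 'green': 1}

def one_hot(index, length):
    """Return a list of zeros of the given length with a 1 at position index."""
    vector = [0] * length
    vector[index] = 1
    return vector

sequence = ['red', 'red', 'green']
integer_encoded = [label_to_int[label] for label in sequence]
onehot_encoded = [one_hot(i, len(label_to_int)) for i in integer_encoded]

print(integer_encoded)  # [0, 0, 1]
print(onehot_encoded)   # [[1, 0], [1, 0], [0, 1]]
```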

Why Use a One Hot Encoding?

A one hot encoding allows the representation of categorical data to be more expressive.

Many machine learning algorithms cannot work with categorical data directly. The categories must be converted into numbers. This is required for both input and output variables that are categorical.

We could use an integer encoding directly, rescaled where needed. This may work for problems where there is a natural ordinal relationship between the categories, and in turn the integer values, such as the temperature labels ‘cold’, ‘warm’, and ‘hot’.

There may be problems when there is no ordinal relationship between the categories. In that case, allowing the representation to lean on an ordering that does not exist might make the problem harder to learn. An example might be the labels ‘dog’ and ‘cat’.

In these cases, we would like to give the network more expressive power to learn a probability-like number for each possible label value. This can help make the problem easier for the network to model. When a one hot encoding is used for the output variable, it may offer a more nuanced set of predictions than a single label.
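
To illustrate the last point, a network trained against one hot encoded outputs typically produces one probability-like score per label. The sketch below uses made-up scores (not the output of a real model) and hypothetical label names to show how such a vector carries more information than a single label while still reducing to one when needed.

```python
# Hypothetical labels and made-up scores, for illustration only.
labels = ['dog', 'cat']
predicted = [0.35, 0.65]  # one probability-like value per label

# Collapse the scores to a single label by taking the largest one.
best_index = max(range(len(predicted)), key=lambda i: predicted[i])
best_label = labels[best_index]

print(best_label)             # cat
print(predicted[best_index])  # 0.65
```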

Manual One Hot Encoding

In this example, we will assume that we have a string of alphabet letters, but that the string does not contain every possible character.

We will use the input sequence of the following characters:

hello world

We will assume that the universe of all possible inputs is the complete alphabet of lower case characters, and space. We will therefore use this as an excuse to demonstrate how to roll our own one hot encoding.

The complete example is listed below.

from numpy import argmax
# define input string
data = 'hello world'
print(data)
# define universe of possible input values
alphabet = 'abcdefghijklmnopqrstuvwxyz '
# define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
# integer encode input data
integer_encoded = [char_to_int[char] for char in data]
print(integer_encoded)
# one hot encode
onehot_encoded = list()
for value in integer_encoded:
    letter = [0 for _ in range(len(alphabet))]
    letter[value] = 1
    onehot_encoded.append(letter)
print(onehot_encoded)
# invert encoding
inverted = int_to_char[argmax(onehot_encoded[0])]
print(inverted)

Running the example first prints the input string.

A mapping of all possible inputs is created from char values to integer values. This mapping is then used to encode the input string. We can see that the first letter in the input ‘h’ is encoded as 7, or the index 7 in the array of possible input values (alphabet).

The integer encoding is then converted to a one hot encoding. This is done one integer encoded character at a time. A list of 0 values is created with the length of the alphabet so that any expected character can be represented.

Next, the index of the specific character is marked with a 1. We can see that the first letter ‘h’ integer encoded as a 7 is represented by a binary vector with the length 27 and the 7th index marked with a 1.

Finally, we invert the encoding of the first letter and print the result. We do this by locating the index of the largest value in the binary vector using the NumPy argmax() function and then looking up that integer in the reverse lookup table of integers to character values.

Note: output was formatted for readability.

1234567891011121314151617 hello world[7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3][[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]h
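
The example inverts only the first character. A small extension, sketched below with helper names (encode, decode) that are assumptions for illustration, round-trips the entire string. It uses plain Python instead of NumPy's argmax() by looking up the position of the 1 in each vector.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz '
char_to_int = {c: i for i, c in enumerate(alphabet)}
int_to_char = {i: c for i, c in enumerate(alphabet)}

def encode(text):
    """One hot encode each character of text against the fixed alphabet."""
    vectors = []
    for char in text:
        vector = [0] * len(alphabet)
        vector[char_to_int[char]] = 1  # raises KeyError for unknown characters
        vectors.append(vector)
    return vectors

def decode(vectors):
    """Map each one hot vector back to its character."""
    return ''.join(int_to_char[vector.index(1)] for vector in vectors)

encoded = encode('hello world')
print(len(encoded), len(encoded[0]))  # 11 27
print(decode(encoded))                # hello world
```

Characters outside the alphabet, such as uppercase letters, raise a KeyError in this sketch, which makes gaps in the assumed universe of inputs visible early.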

Now that we have seen how to roll our own one hot encoding from scratch, let’s see how we can use the scikit-learn library to perform this mapping automatically for cases where the input sequence fully captures the expected range of input values.

One Hot Encode with scikit-learn

In this example, we will assume the case where you have an output sequence of the following 3 labels:

"cold"
"warm"
"hot"

An example sequence of 10 time steps may be:

cold, cold, warm, cold, hot, hot, warm, cold, warm, hot

This would first require an integer encoding, such as 0, 1, 2. This would be followed by a one hot encoding of integers to a binary vector with 3 values, such as [1, 0, 0].

The sequence provides at least one example of every possible value in the sequence. Therefore we can use automatic methods to define the mapping of labels to integers and integers to binary vectors.

In this example, we will use the encoders from the scikit-learn library. Specifically, the LabelEncoder for creating an integer encoding of labels and the OneHotEncoder for creating a one hot encoding of integer encoded values.

The complete example is listed below.

from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# define example
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
print(values)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)
# binary encode
# note: scikit-learn 1.2 renamed this argument to sparse_output; sparse was removed in 1.4
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print(inverted)

Running the example first prints the sequence of labels. This is followed by the integer encoding of the labels and finally the one hot encoding.

The training data contained the set of all possible examples so we could rely on the integer and one hot encoding transforms to create a complete mapping of labels to encodings.

By default, the OneHotEncoder class will return a more efficient sparse encoding. This may not be suitable for some applications, such as use with the Keras deep learning library. In this case, we disabled the sparse return type by setting the sparse=False argument. In scikit-learn 1.2 and later, this argument is named sparse_output instead.

If we receive a prediction in this 3-value one hot encoding, we can easily invert the transform back to the original label.

First, we can use the argmax() NumPy function to locate the index of the column with the largest value. This can then be fed to the LabelEncoder to calculate an inverse transform back to a text label.

This is demonstrated at the end of the example with the inverse transform of the first one hot encoded example back to the label value ‘cold’.

Again, note that the output was formatted for readability.

['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']
[0 0 2 0 1 1 2 0 2 1]
[[ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]
 [ 0.  1.  0.]]
['cold']
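
Note that the integer encoding in the output is 0 for 'cold', 1 for 'hot', and 2 for 'warm'. This is because LabelEncoder assigns integers to the sorted unique labels rather than in order of appearance or temperature. The pure Python sketch below mimics that behavior for illustration; it is not scikit-learn's implementation.

```python
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']

# Sorted unique labels define the integer codes: cold=0, hot=1, warm=2.
classes = sorted(set(data))
label_to_int = {label: i for i, label in enumerate(classes)}

integer_encoded = [label_to_int[label] for label in data]
onehot_encoded = [[1 if i == value else 0 for i in range(len(classes))]
                  for value in integer_encoded]

# Invert every time step, not just the first.
inverted = [classes[vector.index(1)] for vector in onehot_encoded]

print(classes)           # ['cold', 'hot', 'warm']
print(integer_encoded)   # [0, 0, 2, 0, 1, 1, 2, 0, 2, 1]
print(inverted == data)  # True
```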

In the next example, we look at how we can directly one hot encode a sequence of integer values.

One Hot Encode with Keras

You may have a sequence that is already integer encoded.

You could work with the integers directly, after some scaling. Alternately, you can one hot encode the integers directly. This is important to consider if the integers do not have a real ordinal relationship and are really just placeholders for labels.

The Keras library offers a function called to_categorical() that you can use to one hot encode integer data.

In this example, we have 4 integer values [0, 1, 2, 3] and we have the input sequence of the following 10 numbers:

data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]

The sequence has an example of all known values so we can use the to_categorical() function directly. Alternately, if the sequence was 0-based (started at 0) and was not representative of all possible values, we could specify the num_classes argument, for example to_categorical(data, num_classes=4).

A complete example of this function is listed below.

from numpy import array
from numpy import argmax
from keras.utils import to_categorical
# define example
data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
data = array(data)
print(data)
# one hot encode
encoded = to_categorical(data)
print(encoded)
# invert encoding
inverted = argmax(encoded[0])
print(inverted)

Running the example first defines and prints the input sequence.

The integers are then encoded as binary vectors and printed. We can see that the first integer value 1 is encoded as [0, 1, 0, 0] just like we would expect.

We then invert the encoding by using the NumPy argmax() function on the first value in the sequence that returns the expected value 1 for the first integer.

[1 3 2 0 3 2 2 1 0 1]
[[ 0.  1.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  1.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]]
1
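
For intuition, the same encoding can be sketched with NumPy alone by indexing rows of an identity matrix. This is an illustrative equivalent, not how Keras implements to_categorical(); the num_classes variable here plays the role of the Keras argument of the same name and fixes the vector width even when the sequence does not contain every class.

```python
from numpy import array, argmax, eye

data = array([1, 3, 2, 0, 3, 2, 2, 1, 0, 1])
num_classes = 4

# Row i of the identity matrix is the one hot vector for integer i.
encoded = eye(num_classes)[data]
print(encoded.shape)  # (10, 4)

# Invert the whole sequence by taking the argmax of each row.
inverted = argmax(encoded, axis=1)
print(inverted)  # [1 3 2 0 3 2 2 1 0 1]

# A sequence that is missing classes 2 and 3 still gets width-4 vectors.
partial = eye(num_classes)[array([0, 1, 1])]
print(partial.shape)  # (3, 4)
```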

Summary

In this tutorial, you discovered how to encode your categorical sequence data for deep learning using a one hot encoding in Python.

Specifically, you learned:

  • What integer encoding and one hot encoding are and why they are necessary in machine learning.
  • How to calculate an integer encoding and one hot encoding by hand in Python.
  • How to use the scikit-learn and Keras libraries to automatically encode your sequence data in Python.

Do you have any questions about preparing your sequence data?
Ask your questions in the comments and I will do my best to answer.
