
Data Preprocessing: One-Hot Encoding

Introducing the problem

In many machine learning tasks, features are not always continuous values; they may be categorical.

For example, consider the following three features:

["male", "female"]

["from Europe", "from US", "from Asia"]

["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]

Representing these features as numbers is far more efficient. For example:

["male", "from US", "uses Internet Explorer"] is represented as [0, 1, 3]

["female", "from Asia", "uses Chrome"] is represented as [1, 2, 1]

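However, even after conversion to numbers, this data cannot be fed directly to a classifier. An estimator treats such integer inputs as continuous and assumes the categories are ordered, when in fact they are not (the browser categories, for example, are listed in arbitrary order).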
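The integer mapping described above can be sketched in a few lines of Python. The lookup table FEATURES and the function encode_as_integers are illustrative names, not part of the original article; each value is simply mapped to its position in its feature's list.

```python
# Illustrative lookup table: one list of categories per feature.
FEATURES = [
    ["male", "female"],
    ["from Europe", "from US", "from Asia"],
    ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"],
]

def encode_as_integers(sample):
    """Map each categorical value to its index within its feature's list."""
    return [FEATURES[i].index(value) for i, value in enumerate(sample)]

print(encode_as_integers(["male", "from US", "uses Internet Explorer"]))  # [0, 1, 3]
print(encode_as_integers(["female", "from Asia", "uses Chrome"]))         # [1, 2, 1]

# The codes imply an order (Firefox 0 < Chrome 1 < Safari 2 < Internet Explorer 3)
# that the browsers themselves do not have.
```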

 

1、Why do we binarize categorical features?
We binarize the categorical inputs so that they can be thought of as vectors in Euclidean space (we call this embedding the vector in Euclidean space). One-hot encoding extends the values of a discrete feature into Euclidean space, so that each value of the feature corresponds to a point in that space.


2、Why do we embed the feature vectors in the Euclidean space?
Because many algorithms for classification, regression, clustering and so on require computing distances or similarities between features, and many definitions of distance and similarity are defined over features in Euclidean space. So we would like our features to lie in Euclidean space as well.

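Discrete features are mapped to Euclidean space with one-hot encoding because distance or similarity computations between features matter a great deal in regression, classification, clustering and other machine learning algorithms. The distance and similarity measures we commonly use, cosine similarity included, are defined over Euclidean space.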


3、Why does embedding the feature vector in Euclidean space require us to binarize categorical features?
Let us take an example of a dataset with just one feature (say job_type as per your example) and let us say it takes three values 1,2,3.
Now, let us take three feature vectors x_1 = (1), x_2 = (2), x_3 = (3). What is the Euclidean distance between x_1 and x_2, x_2 and x_3, and x_1 and x_3? d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. This says that job type 1 is closer to job type 2 than to job type 3. Does this make sense? Can we even rationally define a proper distance between different job types? In many cases of categorical features, we cannot properly define a distance between the different values that the feature takes. In such cases, isn't it fair to assume that all values of the categorical feature are equally far away from each other?
Now, let us see what happens when we binarize the same feature vectors. Then x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1). What are the distances between them now? They are all sqrt(2). So, essentially, when we binarize the input, we implicitly state that all values of the categorical feature are equally far away from each other.
One-hot encoding a discrete feature does make distance computations between feature values more reasonable. Take a discrete feature for job type with three possible values. Without one-hot encoding they are represented as x_1 = (1), x_2 = (2), x_3 = (3), and the distances between jobs are d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. Does that mean jobs x_1 and x_3 are less similar than the other pairs? Clearly, distances computed from this representation are not reasonable. With one-hot encoding we get x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1), and the distance between any two jobs is sqrt(2). Every pair of jobs is equally far apart, which is more reasonable.
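The distance argument above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names euclidean and one_hot are illustrative and not from the original article.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def one_hot(index, size):
    """Vector of length `size` with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

# Integer codes for three job types: the distances are unequal.
x1, x2, x3 = (1,), (2,), (3,)
print(euclidean(x1, x2), euclidean(x2, x3), euclidean(x1, x3))  # 1.0 1.0 2.0

# One-hot codes for the same three job types: every pair is sqrt(2) apart.
h1, h2, h3 = one_hot(0, 3), one_hot(1, 3), one_hot(2, 3)
pairwise = [euclidean(h1, h2), euclidean(h2, h3), euclidean(h1, h3)]
print(all(math.isclose(d, math.sqrt(2)) for d in pairwise))  # True
```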


4、About the original question?
Note that our reason for binarizing the categorical features is independent of the number of values the categorical features take. So yes, even if the categorical feature takes 1000 values, we would still prefer to binarize.


5、Are there cases when we can avoid doing binarization?

Cases where one-hot encoding is unnecessary
Yes. As we figured out earlier, the reason we binarize is because we want some meaningful distance relationship between the different values. As long as there is some meaningful distance relationship, we can avoid binarizing the categorical feature. For example, if you are building a classifier to classify a webpage as important entity page (a page important to a particular entity) or not and let us say that you have the rank of the webpage in the search result for that entity as a feature, then 1] note that the rank feature is categorical, 2] rank 1 and rank 2 are clearly closer to each other than rank 1 and rank 3, so the rank feature defines a meaningful distance relationship and so, in this case, we don't have to binarize the categorical rank feature.
More generally, if you can cluster the categorical values into disjoint subsets such that the subsets have a meaningful distance relationship amongst them, then you don't have to binarize fully; instead you can split them only over these clusters. For example, if there is a categorical feature with 1000 values, but you can split these 1000 values into 2 groups of 400 and 600 (say) and within each group the values have a meaningful distance relationship, then instead of fully binarizing, you can just add 2 features, one for each cluster, and that should be fine.
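The purpose of one-hot encoding a discrete feature is to make distance computations more reasonable. If a feature is discrete but distances can be computed sensibly without one-hot encoding, there is no need to encode it. For example, if a discrete feature has 1000 values that we split into two groups of 400 and 600, and distance is well defined both between the two groups and within each group, then one-hot encoding is unnecessary.
After a discrete feature has been one-hot encoded, each dimension of the encoded feature can be treated as a continuous feature. Each dimension can then be normalized in the same way as a continuous feature, for example to the range [-1, 1] or to zero mean and unit variance.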
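As a minimal sketch of that last point, one encoded column can be standardized to zero mean and unit variance using only the standard library. The helper name standardize and the column values are made up for illustration.

```python
import statistics

def standardize(column):
    """Rescale a column to zero mean and unit (population) variance."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]

# Hypothetical one-hot column: 1 where the category is present, else 0.
embarked_s = [1, 0, 0, 1]
scaled = standardize(embarked_s)
print(scaled)  # [1.0, -1.0, -1.0, 1.0]
```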
In some situations feature normalization is not needed:
      It depends on your ML algorithm. Some methods require almost no effort to normalize features or to handle both continuous and discrete features, such as tree-based methods: C4.5, CART, random forest, bagging or boosting. But most parametric models (generalized linear models, neural networks, SVM, etc.) and methods using distance metrics (KNN, kernels, etc.) require careful work to achieve good results. Standard approaches include binarizing all features, scaling all continuous features to zero mean and unit variance, etc.
      Tree-based methods such as random forests, bagging and boosting do not need feature normalization. Parameter-based models and distance-based models do need it.

 

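Tree models have little need for one-hot encoding

For a decision tree, the essence of one-hot encoding is to increase the depth of the tree.
A tree model dynamically builds a mechanism similar to One-Hot + Feature Crossing as it grows.
1. One or more features are ultimately converted into a leaf node that serves as the encoding; one-hot can be understood as three independent events.
2. A decision tree has no notion of a feature's magnitude, only of which part of its distribution a feature value falls in.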

One-hot encoding

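One possible solution to the problem above is one-hot encoding, also known as one-bit-effective encoding. It uses an N-bit state register to encode N states: each state has its own register bit, and at any time only one bit is active. For example:

Natural binary codes: 000, 001, 010, 011, 100, 101

One-hot codes: 000001, 000010, 000100, 001000, 010000, 100000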
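The two code tables for six states can be reproduced with a short sketch. The function names natural_code and one_hot_code are illustrative and not from the original article.

```python
def natural_code(state, n_states):
    """Plain binary code, using as few bits as n_states requires."""
    width = max(1, (n_states - 1).bit_length())
    return format(state, f"0{width}b")

def one_hot_code(state, n_states):
    """One bit per state; only the bit belonging to `state` is set."""
    return format(1 << state, f"0{n_states}b")

print([natural_code(s, 6) for s in range(6)])
# ['000', '001', '010', '011', '100', '101']
print([one_hot_code(s, 6) for s in range(6)])
# ['000001', '000010', '000100', '001000', '010000', '100000']
```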

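Put another way, a feature with m possible values becomes m binary features after one-hot encoding (for example, a grade feature with the values good, medium and poor becomes 100, 010, 001). These features are mutually exclusive, with exactly one active at a time, so the data becomes sparse.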
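A minimal sketch of this idea, turning one feature with m values into m binary features. The function name one_hot_encode and the sample grades are illustrative assumptions.

```python
def one_hot_encode(values, categories):
    """Return one row per value, with a single 1 in the column of its category."""
    column_of = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1
        rows.append(row)
    return rows

grades = ["good", "medium", "poor"]  # m = 3 possible values
encoded = one_hot_encode(["good", "poor", "medium"], grades)
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Each row sums to 1, which reflects the mutual exclusivity of the m binary features.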

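The main benefits of doing this are:

1. It solves the problem that classifiers handle categorical data poorly

2. To some extent it also expands the feature set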

Practical application

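In the Kaggle Titanic problem there are three ports of embarkation, recorded in the data as S, C and Q.

These three values are unrelated to one another, and they could be encoded as 0, 1, 2. In theory the distances between the three values should be equal, but the Euclidean distances between these integer codes are not. We therefore use one-hot encoding; the Python code is as follows: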

Filling in missing data:

def dataPreprocess(df):
    # Encode Sex as an integer
    df.loc[df['Sex'] == 'male', 'Sex'] = 0
    df.loc[df['Sex'] == 'female', 'Sex'] = 1

    # Two Embarked values are missing, so fill them first
    df['Embarked'] = df['Embarked'].fillna('S')
    # Some Age values are missing; fill them with the median
    df['Age'] = df['Age'].fillna(df['Age'].median())

    # Encode Embarked as an integer: S -> 0, C -> 1, Q -> 2
    df.loc[df['Embarked'] == 'S', 'Embarked'] = 0
    df.loc[df['Embarked'] == 'C', 'Embarked'] = 1
    df.loc[df['Embarked'] == 'Q', 'Embarked'] = 2

    # Bucket Fare into five ranges, stored in NewFare
    df['NewFare'] = df['Fare']
    df.loc[(df.Fare < 40), 'NewFare'] = 0
    df.loc[((df.Fare >= 40) & (df.Fare < 100)), 'NewFare'] = 1
    df.loc[((df.Fare >= 100) & (df.Fare < 150)), 'NewFare'] = 2
    df.loc[((df.Fare >= 150) & (df.Fare < 200)), 'NewFare'] = 3
    df.loc[(df.Fare >= 200), 'NewFare'] = 4
    return df
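The fare bucketing in dataPreprocess can also be expressed with pandas.cut. This is a minimal sketch, assuming the same thresholds (40, 100, 150, 200) and left-closed intervals; the function name bucket_fare and the sample fares are hypothetical, and the sketch assumes no missing fares.

```python
import pandas as pd

def bucket_fare(fares):
    """Assign each fare to bucket 0..4 using left-closed intervals [low, high)."""
    edges = [float("-inf"), 40, 100, 150, 200, float("inf")]
    # right=False makes each bin include its left edge, matching `Fare >= 40` etc.
    buckets = pd.cut(fares, bins=edges, right=False, labels=[0, 1, 2, 3, 4])
    return buckets.astype(int)

# Made-up fares chosen to sit on and around the bucket boundaries.
fares = pd.Series([7.25, 40.0, 99.9, 100.0, 150.0, 200.0, 512.33])
print(bucket_fare(fares).tolist())  # [0, 1, 1, 2, 3, 4, 4]
```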

One-hot encoding the 'Embarked' attribute:

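from sklearn.preprocessing import OneHotEncoder

def data_process_onehot(df):
    # OneHotEncoder expects a 2-D array: one row per sample, one column per feature
    train_Embarked = df["Embarked"].values.reshape(-1, 1)

    # scikit-learn >= 1.2 names this parameter sparse_output; older versions use sparse=False
    onehot_encoder = OneHotEncoder(sparse_output=False)
    train_OneHotEncoded = onehot_encoder.fit_transform(train_Embarked)
    # Columns follow the sorted integer codes from dataPreprocess: 0 = S, 1 = C, 2 = Q
    df["EmbarkedS"] = train_OneHotEncoded[:, 0]
    df["EmbarkedC"] = train_OneHotEncoded[:, 1]
    df["EmbarkedQ"] = train_OneHotEncoded[:, 2]
    return df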
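As an alternative sketch, pandas can build the same kind of indicator columns directly from the raw string labels with get_dummies. This assumes the Embarked column still holds 'S', 'C' and 'Q' (that is, before the integer mapping in dataPreprocess); the small DataFrame is made-up illustrative data, and the resulting column names (Embarked_C and so on) differ from the ones used above.

```python
import pandas as pd

# Hypothetical sample with raw port labels.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per distinct label, in sorted label order.
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked", dtype=int)
df = pd.concat([df, dummies], axis=1)

print(list(dummies.columns))  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
print(dummies.values.tolist())
# [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```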

Result after encoding:

The whole data processing pipeline:

# ReadData.readSourceData and linearRegression are the author's own helpers (not shown here)
data_train = ReadData.readSourceData()
data_train = dataPreprocess(data_train)
data_train = data_process_onehot(data_train)
precent = linearRegression(data_train)

 

Reference:

https://blog.csdn.net/wl_ss/article/details/78508367