
Character-level or word-level one-hot encoding vs. word embeddings (Keras implementation)

1. One-hot encoding

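# Character-level one-hot encoding
import string
import numpy as np

samples = ['zzh is a pig', 'he loves himself very much', 'pig pig han']
characters = string.printable  # all printable ASCII characters
# Map each character to an integer index; index 0 is left unused
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 20  # only the first 20 characters of each sample are encoded
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
print(results.shape)  # (3, 20, 101)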

characters = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

 

# Word-level one-hot encoding with Keras
from keras.preprocessing.text import Tokenizer

samples = ['zzh is a pig', 'he loves himself very much', 'pig pig han']

# Create a tokenizer configured to consider only the 100 most common words
tokenizer = Tokenizer(num_words=100)
# Build the word index
tokenizer.fit_on_texts(samples)

# Turn each string into a list of integer indices
sequences = tokenizer.texts_to_sequences(samples)

# Get the one-hot binary representation directly
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
# one_hot_results.shape --> (3, 100)

# Recover the word index that was computed
word_index = tokenizer.word_index
print('Found %s unique tokens' % len(word_index))

 

sequences = [[2, 3, 4, 1],
[5, 6, 7, 8, 9],
[1, 1, 10]]


Found 10 unique tokens

word_index =
{'pig': 1, 'zzh': 2, 'is': 3, 'a': 4, 'he': 5, 
'loves': 6,'himself': 7, 'very': 8, 'much': 9,
'han': 10}
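
To make the link between the integer sequences and `one_hot_results` concrete, here is a minimal NumPy sketch of what `texts_to_matrix(..., mode='binary')` amounts to for this example: one row per sample, one column per word index, with a 1 wherever that word occurs. The helper name `sequences_to_binary_matrix` is hypothetical, and the sequences are hard-coded from the word index above rather than produced by Keras.

```python
import numpy as np

def sequences_to_binary_matrix(sequences, num_words):
    """Hypothetical helper: mark which word indices occur in each sequence."""
    matrix = np.zeros((len(sequences), num_words))
    for i, seq in enumerate(sequences):
        for index in seq:
            if index < num_words:  # this sketch ignores indices outside the matrix
                matrix[i, index] = 1.0
    return matrix

# Sequences implied by the word index above (pig=1, zzh=2, ..., han=10)
sequences = [[2, 3, 4, 1], [5, 6, 7, 8, 9], [1, 1, 10]]
binary = sequences_to_binary_matrix(sequences, num_words=100)

print(binary.shape)    # (3, 100)
print(binary[2, :12])  # columns 1 and 10 are set; the repeated 'pig' still yields a single 1
```

Binary mode only records whether a word is present, so word order and repetition within a sample are lost. Column 0 stays empty because the word index starts at 1.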
  

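A variant of one-hot encoding is the one-hot hashing trick.

You can use it when the number of unique tokens in the vocabulary is too large to handle explicitly.

Instead of explicitly assigning an index to each word and keeping those indices in a dictionary, this approach hashes each word into a vector of fixed size, usually with a very lightweight hash function.

Advantage: it saves memory and allows online encoding of the data (you can generate token vectors right away, before you have seen all of the data).

Drawback: hash collisions can occur, in which case two different words end up with the same hash and the model cannot tell them apart.

The likelihood of hash collisions decreases when the dimensionality of the hash space is much larger than the number of unique tokens being hashed.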
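
The effect of the hash-space size can be estimated with the birthday-problem formula. The sketch below assumes an idealized hash that spreads tokens uniformly and independently over the buckets, which real hash functions only approximate. The function name `collision_probability` and the token count of 10 are illustrative choices.

```python
def collision_probability(num_tokens, dimensionality):
    """Probability that at least two of num_tokens distinct tokens share a bucket,
    assuming an ideal uniform hash (an assumption, not a property of any real hash)."""
    p_no_collision = 1.0
    for k in range(num_tokens):
        # The (k+1)-th token must avoid the k buckets that are already occupied
        p_no_collision *= max(0.0, 1.0 - k / dimensionality)
    return 1.0 - p_no_collision

for d in (100, 1000, 100000):
    print(d, round(collision_probability(10, d), 4))
```

Under this assumption, with 10 distinct tokens the chance of at least one collision is more than a third at 100 buckets and falls to a few percent at 1000 buckets.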

 
import numpy as np

samples = ['the cat sat on the mat the cat sat on the mat the cat sat on the mat', 'the dog ate my homework']
dimensionality = 1000  # store each word as a vector of size 1000
max_length = 10  # only the first 10 words of each sample are encoded

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into an integer index between 0 and dimensionality - 1
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
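
One caveat with the built-in `hash()`: in Python 3, string hashes are salted per process by default, so the indices computed this way can change from run to run. If reproducible indices matter, a sketch using `hashlib.md5` as a stable hash might look like the following. The helper `stable_index` is a hypothetical name, and MD5 is just one convenient choice rather than something the code above uses.

```python
import hashlib
import numpy as np

def stable_index(word, dimensionality):
    """Hypothetical helper: map a word to a bucket that is the same in every run."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % dimensionality

samples = ["the cat sat on the mat", "the dog ate my homework"]
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, stable_index(word, dimensionality)] = 1.0

# Detect hash collisions: distinct words that landed in the same bucket
vocab = {w for s in samples for w in s.split()}
buckets = {}
for w in vocab:
    buckets.setdefault(stable_index(w, dimensionality), []).append(w)
collisions = [ws for ws in buckets.values() if len(ws) > 1]
print(collisions)
```

Grouping the vocabulary by bucket, as in the last few lines, is a cheap way to check whether a chosen `dimensionality` is large enough for a given corpus.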
    

 

 

 

2. Word embeddings