word2vec error when looking up a word vector: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

Loading the word2vec model raises an error:

    import word2vec

    model_path = "model/Hanlp_cut_news.bin"
    w2v_dict = word2vec.load(model_path)
    print(w2v_dict["奧運"])

Traceback (most recent call last):
  File "/home/iiip/PycharmProjects/smp_yinglish/demo1/data_preprocess.py", line 10, in <module>
    w2v_dict = word2vec.load(model_path)
  File "/home/iiip/.local/lib/python3.5/site-packages/word2vec/io.py", line 18, in load
    return word2vec.WordVectors.from_binary(fname, *args, **kwargs)
  File "/home/iiip/.local/lib/python3.5/site-packages/word2vec/wordvectors.py", line 202, in from_binary
    vocab[i] = word.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

I checked the encoding of my word-segmented text file. It is UTF-8, so no problem there:

$ file Hanlp_cut_news.txt
Hanlp_cut_news.txt: UTF-8 Unicode text, with very long lines, with no line terminators
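
As far as I know, file only inspects part of a file, so on a very large corpus it is not a full guarantee. A minimal sketch of a complete check in Python follows; the function name is_valid_utf8 and the chunk size are my own choices for illustration, not something from the original workflow:

```python
import codecs


def is_valid_utf8(path, chunk_size=1 << 20):
    """Return True if the whole file decodes as UTF-8, reading it in chunks."""
    # The incremental decoder keeps partial multi-byte characters between
    # chunks, so a character split across a chunk boundary is not an error.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                decoder.decode(chunk)
        # final=True raises if the file ends in the middle of a character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical usage on the file from this post:
# print(is_valid_utf8("Hanlp_cut_news.txt"))
```

Reading in chunks keeps memory use flat, which matters for a file with very long lines and no line terminators.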

Then I checked the trained .bin file. file reports data, meaning a binary file, so that is fine too:

$ file Hanlp_cut_news.bin 
Hanlp_cut_news.bin: data

Back to the error message. Opening line 202 of wordvectors.py (the file named in the traceback), I noticed:

                if include:
                    vocab[i] = word.decode(encoding)

I temporarily changed the source to the following for debugging:

                if include:
                    try:
                        # Print the raw bytes first so the failing word is visible.
                        print(word)
                        print(word.decode(encoding))
                        vocab[i] = word.decode(encoding)
                    except UnicodeDecodeError:
                        vocab[i] = word

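In the resulting output, the program stopped at one especially long byte string. This suggests that one of the segmented tokens has corrupted encoding or is too long.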

I copied that long byte string out to test it:

line = '\xe9\x98\xbf\xe5\xb0\x94\xe6\xaf\x94\xe5\xb7\xb4\xe9\x87\x8c\xe5\xb8\x83\xe9\x9b\xb7\xe8\xa5\xbf\xe6\xa0\xbc\xe7\xbd\x97\xe7\x91\x9f\xe6\x9b\xbc\xe6\x89\x98\xe7\x93\xa6\xe6\xa2\x85\xe8\xa5\xbf\xe7\xba\xb3\xe6\x91\xa9\xe5\xbe\xb7\xe7\xba\xb3\xe7\x89\xb9\xe9\x87\x8c\xe5\x9f\x83\xe8\xb5\xab\xe5\xba\x93\xe6\x96\xaf\xe9\x98\xbf\xe6\x8b\x89\xe7\xbb\xb4\xe6\xb2\x99\xe6\x8b\x89\xe6\x9b\xbc\xe5\x8d'
print (line)

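The output is a pile of garbage. Note that the sequence ends with \xe5\x8d, which is only the first two bytes of a three-byte UTF-8 character; that matches the "unexpected end of data" in the error message.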
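
A minimal sketch for inspecting the same bytes more precisely, assuming Python 3 (as in the traceback). It uses a bytes literal (b'...'), because in Python 3 the \x escapes in a plain str literal are treated as code points rather than bytes, which may be why the print above looks garbled. The variable names are mine:

```python
# The byte string copied from the debug output, written as a bytes literal.
raw = (
    b"\xe9\x98\xbf\xe5\xb0\x94\xe6\xaf\x94\xe5\xb7\xb4\xe9\x87\x8c\xe5\xb8\x83\xe9\x9b\xb7\xe8\xa5\xbf"
    b"\xe6\xa0\xbc\xe7\xbd\x97\xe7\x91\x9f\xe6\x9b\xbc\xe6\x89\x98\xe7\x93\xa6\xe6\xa2\x85\xe8\xa5\xbf"
    b"\xe7\xba\xb3\xe6\x91\xa9\xe5\xbe\xb7\xe7\xba\xb3\xe7\x89\xb9\xe9\x87\x8c\xe5\x9f\x83\xe8\xb5\xab"
    b"\xe5\xba\x93\xe6\x96\xaf\xe9\x98\xbf\xe6\x8b\x89\xe7\xbb\xb4\xe6\xb2\x99\xe6\x8b\x89\xe6\x9b\xbc"
    b"\xe5\x8d"
)

print(len(raw))  # total number of bytes in the token

err_span = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # err.start and err.end mark the undecodable byte range (end is exclusive).
    err_span = (err.start, err.end)
    print(err_span, err.reason)

# errors="ignore" drops the incomplete trailing bytes and keeps the rest.
clean = raw.decode("utf-8", errors="ignore")
print(len(clean), clean)
```

Everything except the last two bytes decodes to readable Chinese characters, so the token looks like a long run of text that was cut off in the middle of a character rather than random binary data.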

The fix for the original problem is to change the source to:

                if include:
                    try:
                        vocab[i] = word.decode(encoding)
                    except UnicodeDecodeError:
                        # The token is not valid UTF-8: store a placeholder
                        # and log the raw bytes instead of crashing.
                        vocab[i] = 'UNK'
                        print(word, 'UNK')
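This effectively skips the offending word by mapping it to 'UNK'. Of course, the better approach would be to clean the data at the segmentation stage (but my segmented file is very large, so rerunning the segmentation is not worth it).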
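
For reference, a rough sketch of what such a cleaning step could look like, assuming the tokens in the segmented file are separated by whitespace. The function name drop_long_tokens and the max_bytes threshold of 90 are hypothetical; the threshold is only a guess based on the failing token being 98 bytes long, not a documented limit:

```python
def drop_long_tokens(text, max_bytes=90):
    """Remove whitespace-separated tokens whose UTF-8 encoding exceeds max_bytes.

    max_bytes is an assumed threshold, chosen below the 98-byte length of
    the token that failed to decode.
    """
    kept = [tok for tok in text.split() if len(tok.encode("utf-8")) <= max_bytes]
    return " ".join(kept)


sample = "奧運 " + "阿" * 40 + " 北京"  # the middle token is 120 bytes
print(drop_long_tokens(sample))
```

This version holds the whole text in memory, so a very large corpus would need to be processed in pieces instead.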