
Training word2vec Word Vectors on an English Corpus

 

Introduction to word2vec

word2vec official site: https://code.google.com/p/word2vec/

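  • word2vec is an open-source tool from Google that, given a collection of words as input, computes the distances between words.
  • It converts each term into a vector, so that processing text content reduces to vector operations in a vector space; similarity in that vector space is used to represent semantic similarity in the text.
  • word2vec measures similarity with the cosine. In general a cosine lies between -1 and 1, and the larger the value, the more closely related the two words are. All the scores shown in the results further down are positive.
  • Word vectors: words are represented with a Distributed Representation, often also called a "Word Representation" or "Word Embedding".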
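To make the cosine measure concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration; real word vectors have hundreds of dimensions and are learned from a corpus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "word vectors"; the numbers are invented for this example.
king = [0.9, 0.1, 0.4]
queen = [0.8, 0.2, 0.5]
opposite = [-0.9, -0.1, -0.4]  # king pointing the other way

print(cosine_similarity(king, queen))     # close to 1: similar direction
print(cosine_similarity(king, opposite))  # close to -1: opposite direction
```

Only the direction of the vectors matters here, not their length, which is why the measure is bounded.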

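Usage

Running and testing also require the text8 and questions-words.txt files. The corpus can be downloaded from http://mattmahoney.net/dc/text8.zip
The corpus is UTF-8 encoded and stored as a single line. Training statistics from one run: training on 85026035 raw words (62529137 effective words) took 197.4s, 316692 effective words/s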
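Because text8 is a single line of text, it has to be cut into pieces before training. The log further down reports 1701 sentences for 17005207 raw words, which is consistent with chunks of 10000 words each. Below is a minimal sketch of that chunking, assuming a fixed chunk size of 10000. The helper names `read_words` and `chunk_words` are invented for this example and are not gensim APIs.

```python
def read_words(path):
    """Yield whitespace-separated words from a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            for word in line.split():
                yield word


def chunk_words(words, max_words=10000):
    """Group an iterable of words into lists of at most max_words items."""
    chunk = []
    for word in words:
        chunk.append(word)
        if len(chunk) == max_words:
            yield chunk
            chunk = []
    if chunk:  # the last, shorter chunk
        yield chunk


# Small demonstration with a chunk size of 3 instead of 10000.
demo_chunks = list(chunk_words("a b c d e f g".split(), max_words=3))
print(demo_chunks)  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```

With the real corpus this would be called as `chunk_words(read_words("data/text8"))`; 17005207 words in chunks of 10000 give 1700 full chunks plus one shorter one.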

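word2vec parameters explained

-train: the training data
-output: the output file for the results, i.e. the vector of each word
-cbow: whether to use the CBOW model; 0 selects the skip-gram model and 1 selects the CBOW model. The default has differed between releases of the tool, so it is safest to set it explicitly. CBOW is somewhat faster; skip-gram usually gives somewhat better results
-size: the dimensionality of the output word vectors
-window: the size of the training window; 8 means that up to 8 words before and 8 words after each word are considered (in the actual code the effective window is also chosen at random for each word, up to the configured size; the default size is 5)
-negative: whether to use negative sampling (NEG); 0 disables it, and a positive value is the number of negative examples drawn for each training example
-hs: whether to use hierarchical softmax (HS); 0 disables it, 1 enables it
-sample: the subsampling threshold; the more frequently a word occurs in the training data, the more likely its occurrences are to be randomly dropped (down-sampled)
-binary: whether to store the output file in binary form; 0 means plain text (which can be opened and inspected), 1 means binary, as in vectors.bin
-alpha: the learning rate
-min-count: the minimum frequency, 5 by default; a word that occurs fewer times than this threshold in the corpus is discarded
-classes: the number of word clusters; the source code shows that the clustering uses k-means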
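The effect of -sample is easier to see with numbers. The sketch below computes the probability that a single occurrence of a word survives subsampling, using the formula as I read it in the training loop of the C source; treat it as an illustration rather than a reference implementation. The function name `keep_probability` and the example counts are made up.

```python
import math


def keep_probability(word_count, total_words, sample=1e-3):
    """Chance that one occurrence of a word is kept during subsampling.

    Assumed formula: (sqrt(count / (sample * total)) + 1) * (sample * total) / count,
    capped at 1. Values of sample <= 0 disable subsampling.
    """
    if sample <= 0:
        return 1.0
    threshold = sample * total_words
    prob = (math.sqrt(word_count / threshold) + 1) * threshold / word_count
    return min(prob, 1.0)


# A word making up 5% of a one-million-word corpus is dropped most of the time.
print(keep_probability(50000, 1000000))
# A word making up 0.05% of the corpus is always kept.
print(keep_probability(500, 1000000))
```

Very frequent words such as "the" carry little information per occurrence, so thinning them out speeds up training without hurting the rarer words.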

 

 

Code

# -*- coding: utf-8 -*-

"""
Purpose: try out gensim's word2vec
Date: 2016-05-02 18:00:00
"""

from __future__ import print_function  # lets the prints below run on Python 2 and 3

import logging

from gensim.models import word2vec

# Note: this script targets the gensim release that was current in 2016.
# In gensim 4.0 and later, `size` was renamed `vector_size` and the query
# methods used below (similarity, most_similar, doesnt_match) live on `model.wv`.

# Main program
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus("data/text8")  # load the corpus
# gensim defaults to sg=0, i.e. CBOW (the log reports sg=0); pass sg=1 for skip-gram.
# The default window is 5.
model = word2vec.Word2Vec(sentences, size=200)

# Similarity / relatedness of two words
y1 = model.similarity("woman", "man")
print("Similarity between woman and man:", y1)
print("--------\n")

# List of words related to a given word
y2 = model.most_similar("good", topn=20)  # the 20 most related words
print("Words most related to good:\n")
for item in y2:
    print(item[0], item[1])
print("--------\n")

# Analogy: find the corresponding word
print(' "boy" is to "father" as "girl" is to ...? \n')
y3 = model.most_similar(['girl', 'father'], ['boy'], topn=3)
for item in y3:
    print(item[0], item[1])
print("--------\n")

more_examples = ["he his she", "big bigger bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.most_similar([x, b], [a])[0][0]
    print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))
print("--------\n")

# Find the word that does not belong
y4 = model.doesnt_match("breakfast cereal dinner lunch".split())
print("Word that does not match:", y4)
print("--------\n")

# Save the model for reuse
model.save("text8.model")
# Corresponding way to load it:
# model_2 = word2vec.Word2Vec.load("text8.model")

# Store the word vectors in a format that the C tool can parse
model.save_word2vec_format("text8.model.bin", binary=True)
# Corresponding way to load it:
# model_3 = word2vec.Word2Vec.load_word2vec_format("text8.model.bin", binary=True)
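The analogy queries in the script pass the words to add as the first list and the words to subtract as the second. The idea behind such a query can be sketched without gensim: combine the unit-length vectors, then rank the remaining words by cosine similarity to the result. This is a simplified illustration, not gensim's actual code, and the toy vectors (axes: gender, generation, "other") are invented.

```python
import math


def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the input words
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Hypothetical vectors: [gender, generation, other]
toy_vectors = {
    "boy": [1.0, -1.0, 0.2],
    "girl": [-1.0, -1.0, 0.2],
    "father": [1.0, 1.0, 0.2],
    "mother": [-1.0, 1.0, 0.2],
    "cereal": [0.0, 0.0, 1.0],
}

best_word, best_score = analogy(toy_vectors, ["girl", "father"], ["boy"])[0]
print(best_word, best_score)
```

In this toy space "girl + father - boy" lands exactly on "mother"; with trained vectors the match is only approximate, which is why the real output lists several candidates with their scores.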

Output on Ubuntu 16.04

2016-5-2 18:56:19,332 : INFO : collecting all words and their counts
2016-5-2 18:56:19,334 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-5-2 18:56:27,431 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2016-5-2 18:56:27,740 : INFO : min_count=5 retains 71290 unique words (drops 182564)
2016-5-2 18:56:27,740 : INFO : min_count leaves 16718844 word corpus (98% of original 17005207)
2016-5-2 18:56:27,914 : INFO : deleting the raw counts dictionary of 253854 items
2016-5-2 18:56:27,947 : INFO : sample=0.001 downsamples 38 most-common words
2016-5-2 18:56:27,947 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2016-5-2 18:56:27,947 : INFO : estimated required memory for 71290 words and 200 dimensions: 149709000 bytes
2016-5-2 18:56:28,176 : INFO : resetting layer weights
2016-5-2 18:56:29,074 : INFO : training model with 3 workers on 71290 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5
2016-5-2 18:56:29,074 : INFO : expecting 1701 sentences, matching count from corpus used for vocabulary survey
2016-5-2 18:56:30,086 : INFO : PROGRESS: at 0.86% examples, 531932 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:31,088 : INFO : PROGRESS: at 1.72% examples, 528872 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:32,108 : INFO : PROGRESS: at 2.68% examples, 549248 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:33,113 : INFO : PROGRESS: at 3.47% examples, 534255 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:34,135 : INFO : PROGRESS: at 4.43% examples, 545575 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:35,145 : INFO : PROGRESS: at 5.40% examples, 555220 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:36,147 : INFO : PROGRESS: at 6.34% examples, 560815 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:37,155 : INFO : PROGRESS: at 7.28% examples, 564712 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:56:38,172 : INFO : PROGRESS: at 8.24% examples, 568088 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:39,169 : INFO : PROGRESS: at 9.19% examples, 570872 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:40,191 : INFO : PROGRESS: at 10.16% examples, 573068 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:41,203 : INFO : PROGRESS: at 11.12% examples, 575184 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:56:42,217 : INFO : PROGRESS: at 12.09% examples, 577227 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:43,220 : INFO : PROGRESS: at 13.04% examples, 578418 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:56:44,235 : INFO : PROGRESS: at 14.00% examples, 579574 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:56:45,239 : INFO : PROGRESS: at 14.96% examples, 580577 words/s, in_qsize 6, out_qsize 2
2016-5-2 18:56:46,243 : INFO : PROGRESS: at 15.86% examples, 578374 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:47,252 : INFO : PROGRESS: at 16.70% examples, 574918 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:56:48,256 : INFO : PROGRESS: at 17.66% examples, 576221 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:49,258 : INFO : PROGRESS: at 18.61% examples, 577045 words/s, in_qsize 4, out_qsize 0
2016-5-2 18:56:50,260 : INFO : PROGRESS: at 19.54% examples, 576947 words/s, in_qsize 4, out_qsize 1
2016-5-2 18:56:51,261 : INFO : PROGRESS: at 20.47% examples, 577120 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:52,284 : INFO : PROGRESS: at 21.43% examples, 577251 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:56:53,287 : INFO : PROGRESS: at 22.34% examples, 576556 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:54,308 : INFO : PROGRESS: at 23.20% examples, 574618 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:56:55,306 : INFO : PROGRESS: at 24.15% examples, 575304 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:56,329 : INFO : PROGRESS: at 25.09% examples, 575610 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:56:57,333 : INFO : PROGRESS: at 26.04% examples, 576358 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:58,340 : INFO : PROGRESS: at 26.97% examples, 576745 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:59,337 : INFO : PROGRESS: at 27.91% examples, 577161 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:00,338 : INFO : PROGRESS: at 28.84% examples, 577303 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:01,346 : INFO : PROGRESS: at 29.65% examples, 575087 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:02,353 : INFO : PROGRESS: at 30.55% examples, 574516 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:57:03,356 : INFO : PROGRESS: at 31.36% examples, 572590 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:04,371 : INFO : PROGRESS: at 32.10% examples, 569320 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:05,380 : INFO : PROGRESS: at 32.95% examples, 568088 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:06,389 : INFO : PROGRESS: at 33.78% examples, 566886 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:57:07,399 : INFO : PROGRESS: at 34.60% examples, 565345 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:08,418 : INFO : PROGRESS: at 35.51% examples, 564685 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:09,432 : INFO : PROGRESS: at 36.39% examples, 564093 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:10,441 : INFO : PROGRESS: at 37.21% examples, 562778 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:57:11,453 : INFO : PROGRESS: at 38.14% examples, 563163 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:57:12,449 : INFO : PROGRESS: at 38.98% examples, 562072 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:13,461 : INFO : PROGRESS: at 39.88% examples, 561949 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:14,464 : INFO : PROGRESS: at 40.75% examples, 561493 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:15,482 : INFO : PROGRESS: at 41.60% examples, 560419 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:57:16,503 : INFO : PROGRESS: at 42.40% examples, 558807 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:17,520 : INFO : PROGRESS: at 43.27% examples, 558287 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:18,534 : INFO : PROGRESS: at 44.13% examples, 557685 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:19,538 : INFO : PROGRESS: at 44.93% examples, 556591 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:20,540 : INFO : PROGRESS: at 45.83% examples, 556881 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:21,541 : INFO : PROGRESS: at 46.75% examples, 557341 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:22,553 : INFO : PROGRESS: at 47.69% examples, 557860 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:57:23,557 : INFO : PROGRESS: at 48.51% examples, 557066 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:24,564 : INFO : PROGRESS: at 49.42% examples, 557201 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:25,571 : INFO : PROGRESS: at 50.31% examples, 557231 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:57:26,585 : INFO : PROGRESS: at 51.26% examples, 557820 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:57:27,586 : INFO : PROGRESS: at 52.22% examples, 558455 words/s, in_qsize 4, out_qsize 0
2016-5-2 18:57:28,588 : INFO : PROGRESS: at 53.16% examples, 558932 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:57:29,609 : INFO : PROGRESS: at 54.11% examples, 559389 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:30,616 : INFO : PROGRESS: at 55.01% examples, 559415 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:31,642 : INFO : PROGRESS: at 55.87% examples, 558596 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:32,647 : INFO : PROGRESS: at 56.78% examples, 558665 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:33,656 : INFO : PROGRESS: at 57.57% examples, 557526 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:34,660 : INFO : PROGRESS: at 58.39% examples, 556830 words/s, in_qsize 4, out_qsize 0
2016-5-2 18:57:35,664 : INFO : PROGRESS: at 59.31% examples, 557019 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:36,670 : INFO : PROGRESS: at 60.12% examples, 556187 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:37,683 : INFO : PROGRESS: at 60.94% examples, 555461 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:38,686 : INFO : PROGRESS: at 61.78% examples, 554836 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:39,705 : INFO : PROGRESS: at 62.54% examples, 553555 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:40,710 : INFO : PROGRESS: at 63.35% examples, 552863 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:41,719 : INFO : PROGRESS: at 64.12% examples, 551760 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:42,726 : INFO : PROGRESS: at 64.93% examples, 551152 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:43,741 : INFO : PROGRESS: at 65.74% examples, 550535 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:44,743 : INFO : PROGRESS: at 66.51% examples, 549746 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:45,743 : INFO : PROGRESS: at 67.23% examples, 548498 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:46,773 : INFO : PROGRESS: at 67.98% examples, 547297 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:47,786 : INFO : PROGRESS: at 68.81% examples, 546808 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:48,792 : INFO : PROGRESS: at 69.58% examples, 546028 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:49,798 : INFO : PROGRESS: at 70.37% examples, 545344 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:50,807 : INFO : PROGRESS: at 71.19% examples, 545012 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:57:51,802 : INFO : PROGRESS: at 72.09% examples, 545184 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:52,806 : INFO : PROGRESS: at 72.98% examples, 545315 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:53,827 : INFO : PROGRESS: at 73.92% examples, 545714 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:54,827 : INFO : PROGRESS: at 74.86% examples, 546256 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:55,840 : INFO : PROGRESS: at 75.79% examples, 546379 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:56,851 : INFO : PROGRESS: at 76.73% examples, 546823 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:57,843 : INFO : PROGRESS: at 77.66% examples, 547189 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:58,847 : INFO : PROGRESS: at 78.50% examples, 546858 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:59,849 : INFO : PROGRESS: at 79.39% examples, 546959 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:00,854 : INFO : PROGRESS: at 80.27% examples, 546954 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:58:01,856 : INFO : PROGRESS: at 81.22% examples, 547394 words/s, in_qsize 3, out_qsize 0
2016-5-2 18:58:02,875 : INFO : PROGRESS: at 82.13% examples, 547429 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:03,888 : INFO : PROGRESS: at 83.07% examples, 547815 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:04,880 : INFO : PROGRESS: at 84.00% examples, 548153 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:05,895 : INFO : PROGRESS: at 84.91% examples, 548428 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:06,888 : INFO : PROGRESS: at 85.77% examples, 548357 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:07,901 : INFO : PROGRESS: at 86.64% examples, 548365 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:08,897 : INFO : PROGRESS: at 87.50% examples, 548265 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:09,902 : INFO : PROGRESS: at 88.42% examples, 548504 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:10,916 : INFO : PROGRESS: at 89.18% examples, 547765 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:11,921 : INFO : PROGRESS: at 89.94% examples, 547006 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:12,923 : INFO : PROGRESS: at 90.81% examples, 546992 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:13,930 : INFO : PROGRESS: at 91.72% examples, 547225 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:14,935 : INFO : PROGRESS: at 92.59% examples, 547187 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:15,939 : INFO : PROGRESS: at 93.46% examples, 547133 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:16,944 : INFO : PROGRESS: at 94.18% examples, 546224 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:17,953 : INFO : PROGRESS: at 94.93% examples, 545497 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:18,959 : INFO : PROGRESS: at 95.70% examples, 544697 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:19,967 : INFO : PROGRESS: at 96.40% examples, 543702 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:20,974 : INFO : PROGRESS: at 97.26% examples, 543612 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:21,978 : INFO : PROGRESS: at 98.17% examples, 543801 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:22,994 : INFO : PROGRESS: at 99.07% examples, 543908 words/s, in_qsize 4, out_qsize 2
2016-5-2 18:58:23,989 : INFO : PROGRESS: at 99.91% examples, 543692 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:24,067 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-5-2 18:58:24,083 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-5-2 18:58:24,086 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-5-2 18:58:24,086 : INFO : training on 85026035 raw words (62534095 effective words) took 115.0s, 543725 effective words/s
2016-5-2 18:58:24,086 : INFO : precomputing L2-norms of word weight vectors
Similarity between woman and man: 0.699695936218
--------
Words most related to good:

bad 0.721469461918
poor 0.567566931248
safe 0.534923613071
luck 0.518905758858
courage 0.510788619518
useful 0.498157411814
quick 0.497716665268
easy 0.497328162193
everyone 0.485905945301
pleasure 0.483758479357
true 0.482762247324
simple 0.480014979839
practical 0.479516804218
fair 0.479104012251
happy 0.476968646049
wrong 0.476797521114
reasonable 0.476701617241
you 0.475801795721
fun 0.472196519375
helpful 0.471719056368
--------

 "boy" is to "father" as "girl" is to ...? 

mother 0.76334130764
grandmother 0.690031766891
daughter 0.684129178524
--------

'he' is to 'his' as 'she' is to 'her'
'big' is to 'bigger' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'was'
--------

Word that does not match: cereal
--------

2016-5-2 18:58:24,185 : INFO : saving Word2Vec object under text8.model, separately None
2016-5-2 18:58:24,185 : INFO : storing numpy array 'syn1neg' to text8.model.syn1neg.npy
2016-5-2 18:58:24,235 : INFO : not storing attribute syn0norm
2016-5-2 18:58:24,235 : INFO : storing numpy array 'syn0' to text8.model.syn0.npy
2016-5-2 18:58:24,278 : INFO : not storing attribute cum_table
2016-5-2 18:58:25,083 : INFO : storing 71290x200 projection weights into text8.model.bin
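The numbers in the log can be cross-checked with a little arithmetic, which also explains why training reports about 85 million raw words for a 17-million-word corpus. The sketch below only uses figures copied from the log; reading the factor of five as the number of training passes is my interpretation, not something the log states outright.

```python
# Figures copied from the log above.
corpus_raw_words = 17005207     # "corpus of 17005207 raw words"
trained_raw_words = 85026035    # "training on 85026035 raw words"
word_types = 253854             # "collected 253854 word types"
kept_types = 71290              # "min_count=5 retains 71290 unique words"
dropped_types = 182564          # "(drops 182564)"
retained_words = 16718844       # "min_count leaves 16718844 word corpus"
downsampled_words = 12506280    # "downsampling leaves estimated 12506280"
effective_words = 62534095      # "(62534095 effective words)"

passes, remainder = divmod(trained_raw_words, corpus_raw_words)
print(passes, remainder)        # 5 0: the corpus was read exactly five times

print(kept_types + dropped_types == word_types)   # vocabulary split adds up
print(round(100.0 * retained_words / corpus_raw_words))          # percent kept by min_count
print(round(100.0 * downsampled_words / retained_words, 1))      # percent kept by subsampling

# Estimated effective words over all passes versus the reported figure.
estimated_effective = passes * downsampled_words
print(estimated_effective, effective_words)
```

The estimate and the reported effective word count differ slightly because subsampling is random, so each pass drops a slightly different set of occurrences.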

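Common corpus resources

Below are some good Chinese corpora that can be downloaded online, for researchers to study with.
(1) Chinese and English news corpus from the Institute of Automation, Chinese Academy of Sciences http://www.datatang.com/data/13484
The Chinese news classification corpus was collected from sections of Phoenix (ifeng), Sina, NetEase, Tencent and other sites. The English news classification corpus is the ModApte version of Reuters-21578.
(2) Sogou's Chinese news corpus http://www.sogou.com/labs/dl/c.html
Contains a large amount of Sohu news together with the corresponding category labels. Versions of different sizes are available for download.
(3) Li Ronglu's Chinese corpus http://www.datatang.com/data/11968
About 240 MB compressed.
(4) Tan Songbo's Chinese text classification corpus http://www.datatang.com/data/11970
It contains not only broad categories such as economy and sports, but also specific subcategories under each of them; sports, for example, includes basketball and football. It can serve as a corpus for hierarchical classification and is very practical. This URL requires no download points (Tan Songbo's home page): http://www.searchforum.org.cn/tansongbo/corpus1.PHP
(5) NetEase categorized text data http://www.datatang.com/data/11965
Contains 4000 texts in six broad categories such as sports and cars.
(6) Chinese text classification corpus http://www.datatang.com/data/11963
Contains corpus texts in categories such as Arts and Literature.
(7) A more complete Sogou text classification corpus http://www.sogou.com/labs/dl/c.html
A text classification corpus released by Sogou Labs, with data versions of different sizes available for free download.
(8) 2002 Chinese web page classification training set http://www.datatang.com/data/15021

In the autumn of 2002, the Tianwang group of Peking University's Network and Distributed Systems Laboratory mobilized dozens of students from different majors to hand-pick a brand-new, large-scale set of Chinese web page samples based on a hierarchical model. It contains 11678 training pages and 3630 test pages, spread over 11 top-level categories.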

 

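Common word segmentation tools

Segment the corpus into words and remove the stop words. Commonly used segmentation tools include:

StandardAnalyzer (Chinese and English), ChineseAnalyzer (Chinese), CJKAnalyzer (Chinese and English), IKAnalyzer (Chinese and English, also compatible with Korean and Japanese), paoding (Chinese), MMAnalyzer (Chinese and English), MMSeg4j (Chinese and English), imdict (Chinese and English), NLTK (Chinese and English), Jieba (Chinese and English).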
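As a rough idea of what this preprocessing step does, here is a minimal English-only sketch in plain Python. It is a stand-in, not one of the tools listed above: the regular expression and the tiny stop word list are assumptions made for this example, and for Chinese text the splitting step would be replaced by a real segmenter such as Jieba.

```python
import re

# A deliberately tiny, illustrative stop word list.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("The quick brown fox is a friend of the lazy dog.")
print(tokens)  # ['quick', 'brown', 'fox', 'friend', 'lazy', 'dog']
```

The resulting token lists, one per sentence or document, are the kind of input a word2vec trainer expects.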

A demo corpus

Raw corpus http://pan.baidu.com/s/1nviuFc1
Training corpus http://pan.baidu.com/s/1kVEmNTd