
Technical Documentation Translation: GloVe README (1)


Package Contents

To train your own GloVe vectors, first you'll need to prepare your corpus as a single text file with all words separated by a single space. If your corpus has multiple documents, simply concatenate documents together with a single space. If your documents are particularly short, it's possible that padding the gap between documents with e.g. 5 "dummy" words will produce better vectors. Once you create your corpus, you can train GloVe vectors using the following 4 tools. An example is included in demo.sh, which you can modify as necessary.

The four main tools in this package are:

1) vocab_count
This tool requires an input corpus that should already consist of whitespace-separated tokens. Use something like the Stanford Tokenizer first on raw text. From the corpus, it constructs unigram counts and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count.

2) cooccur
Constructs word-word cooccurrence statistics from a corpus. The user should supply a vocabulary file, as produced by vocab_count, and may specify a variety of parameters, as described by running ./build/cooccur.

3) shuffle
Shuffles the binary file of cooccurrence statistics produced by cooccur. For large files, the file is automatically split into chunks, each of which is shuffled and stored on disk before being merged and shuffled together. The user may specify a number of parameters, as described by running ./build/shuffle.

4) glove
Trains the GloVe model on the specified cooccurrence data, which typically will be the output of the shuffle tool. The user should supply a vocabulary file, as given by vocab_count, and may specify a number of other parameters, which are described by running ./build/glove.
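The four tools above are meant to be chained, each one consuming the output of the previous step. The following is a minimal sketch of that chaining in Python, not the package's own driver script. The output file names and the parameter values are hypothetical, and the flag names are assumed to match those used in demo.sh; check them against the usage text each tool prints when run without arguments.

```python
import subprocess
from pathlib import Path


def build_pipeline(corpus, build_dir="build", out_dir="."):
    """Return the four steps as (argv, stdin_path, stdout_path) tuples.

    Flag names are assumed to match the package's demo.sh; file names and
    parameter values below are illustrative only.
    """
    build = Path(build_dir)
    out = Path(out_dir)
    vocab = out / "vocab.txt"                  # hypothetical file names
    cooc = out / "cooccurrence.bin"
    cooc_shuf = out / "cooccurrence.shuf.bin"
    vectors = out / "vectors"

    return [
        # 1) unigram counts: corpus on stdin, vocabulary on stdout
        ([str(build / "vocab_count"), "-min-count", "5"],
         Path(corpus), vocab),
        # 2) cooccurrence counts: corpus on stdin, binary counts on stdout
        ([str(build / "cooccur"), "-vocab-file", str(vocab),
          "-window-size", "15"],
         Path(corpus), cooc),
        # 3) shuffle: binary counts on stdin, shuffled counts on stdout
        ([str(build / "shuffle")], cooc, cooc_shuf),
        # 4) training: input and output are passed as flags, not streams
        ([str(build / "glove"), "-input-file", str(cooc_shuf),
          "-vocab-file", str(vocab), "-save-file", str(vectors),
          "-vector-size", "50", "-iter", "15"],
         None, None),
    ]


def run_pipeline(steps):
    """Run each step in order, stopping at the first non-zero exit code."""
    for argv, stdin_path, stdout_path in steps:
        stdin = open(stdin_path, "rb") if stdin_path else None
        stdout = open(stdout_path, "wb") if stdout_path else None
        try:
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
        finally:
            for handle in (stdin, stdout):
                if handle:
                    handle.close()
```

Calling run_pipeline(build_pipeline("corpus.txt")) would execute the steps in order, provided the binaries have already been built into ./build.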
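If you want to train your own GloVe word vectors, you first need to prepare a single text file containing your corpus, in which all words are separated by a single space. If your corpus consists of multiple documents, join them with a single space between consecutive documents. If your documents are very short, padding the gaps between documents with, for example, 5 "dummy" words may produce better vectors. Once the corpus is ready, you can train GloVe vectors with the four tools below. demo.sh contains an example, which you can modify as needed.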
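The corpus preparation described above can be sketched in a few lines of Python. This is only an illustration: the function name build_corpus, its parameters, and the literal padding token "dummy" are assumptions made for the example, and the documents are assumed to be tokenized already.

```python
def build_corpus(documents, pad_words=0, pad_token="dummy"):
    """Join tokenized documents into one single-space-separated string.

    pad_words copies of pad_token are placed in each gap between two
    consecutive documents (hypothetical parameters, for illustration).
    """
    gap = [pad_token] * pad_words
    tokens = []
    for index, doc in enumerate(documents):
        if index > 0:
            tokens.extend(gap)
        # split() with no argument collapses newlines and repeated spaces
        tokens.extend(doc.split())
    return " ".join(tokens)


if __name__ == "__main__":
    docs = ["the cat sat", "on the\nmat"]
    print(build_corpus(docs))               # plain concatenation
    print(build_corpus(docs, pad_words=5))  # padded gaps for short documents
```

Note that the padding token becomes an ordinary word in the corpus, so it will appear in the vocabulary that vocab_count produces.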
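
The four main tools in this package are:

1) vocab_count
This tool requires that the input corpus already consist of whitespace-separated tokens, so run something like the Stanford Tokenizer on raw text first. It counts the unigrams in the corpus and optionally thresholds the resulting vocabulary by total vocabulary size or minimum frequency count.

2) cooccur
Builds word-word cooccurrence statistics from the corpus. The user should supply a vocabulary file produced by vocab_count and may specify a variety of parameters, which are described by running ./build/cooccur.

3) shuffle
Shuffles the binary file of cooccurrence statistics produced by cooccur. Large files are automatically split into chunks, each of which is shuffled and stored on disk before the chunks are merged and shuffled together. The user may specify a number of parameters, which are described by running ./build/shuffle.

4) glove
Trains the GloVe model on the specified cooccurrence data, which is typically the output of the shuffle tool. The user should supply a vocabulary file produced by vocab_count and may specify a number of other parameters, which are described by running ./build/glove.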
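To make the first two tools more concrete, here is a small pure-Python sketch of what unigram counting with thresholds and windowed cooccurrence counting amount to. It is not the package's C implementation. The function names, the symmetric counting, and the 1/distance weighting are assumptions chosen for the example; the real cooccur tool has its own options and binary output format.

```python
from collections import Counter, defaultdict


def count_vocab(tokens, min_count=1, max_vocab=None):
    """Return (word, count) pairs, most frequent first, after thresholding."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    ranked = [(word, count) for word, count in ranked if count >= min_count]
    if max_vocab is not None:
        ranked = ranked[:max_vocab]
    return ranked


def count_cooccurrences(tokens, vocab, window_size=2):
    """Accumulate symmetric, distance-weighted counts for in-vocabulary pairs.

    Out-of-vocabulary tokens are skipped but still occupy a position, so
    they still add to the distance between their neighbours.
    """
    known = {word for word, _ in vocab}
    cooc = defaultdict(float)
    for i, word in enumerate(tokens):
        if word not in known:
            continue
        for distance in range(1, window_size + 1):
            j = i - distance
            if j < 0:
                break
            context = tokens[j]
            if context not in known:
                continue
            weight = 1.0 / distance  # assumed weighting: closer pairs count more
            cooc[(word, context)] += weight
            cooc[(context, word)] += weight
    return dict(cooc)


if __name__ == "__main__":
    tokens = "a b a c".split()
    vocab = count_vocab(tokens, min_count=1)
    print(vocab)
    print(count_cooccurrences(tokens, vocab, window_size=1))
```

Raising min_count or lowering max_vocab shrinks the vocabulary, which in turn shrinks the cooccurrence table, since only pairs of in-vocabulary words are kept.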
