Experiments on the English Wikipedia

阿新 • Published: 2019-01-02

This article is a translation.
It describes the process of obtaining and processing Wikipedia, so that anyone can reproduce the results.
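Preparing the corpus
1. First, download the dump of all Wikipedia articles (you want the file enwiki-latest-pages-articles.xml.bz2, or enwiki-YYYYMMDD-pages-articles.xml.bz2 for date-specific dumps). This is a single file of about 8 GB that contains all articles of the English Wikipedia.
2. Convert the articles to plain text (processing the Wiki markup) and store the result as sparse TF-IDF vectors. In Python this is easy to do, and we do not even need to decompress the whole archive to disk. Gensim includes a script that does exactly this; run it as follows: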

$ python -m gensim.scripts.make_wiki

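This preprocessing step makes two passes over the 8.2 GB compressed wiki dump (one to build the dictionary, one to build and store the sparse vectors).
In addition, you will need about 35 GB of free disk space to store the sparse output matrix. It is recommended to compress these files immediately; Gensim can work with compressed files directly.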
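The two-pass idea can be illustrated on a toy scale. The following is a minimal sketch, not the code of the make_wiki script: it streams a small bz2-compressed file twice without unpacking it to disk, first to count document frequencies and then to emit sparse TF-IDF vectors. The file layout (one document per line), the helper names and the tf * log2(N/df) weighting with unit-length normalization are all assumptions made for illustration.

```python
import bz2
import math
import os
import tempfile
from collections import Counter

def stream_docs(path):
    """Yield one tokenized document per line, reading the bz2 file lazily."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            tokens = line.lower().split()
            if tokens:
                yield tokens

def build_dictionary(path):
    """Pass 1: assign integer ids to words and count document frequencies."""
    word2id, doc_freq, num_docs = {}, Counter(), 0
    for tokens in stream_docs(path):
        num_docs += 1
        # dict.fromkeys removes duplicates while keeping a stable order.
        for word in dict.fromkeys(tokens):
            word_id = word2id.setdefault(word, len(word2id))
            doc_freq[word_id] += 1
    return word2id, doc_freq, num_docs

def tfidf_vectors(path, word2id, doc_freq, num_docs):
    """Pass 2: yield each document as a sorted list of (word_id, weight) pairs."""
    for tokens in stream_docs(path):
        counts = Counter(word2id[w] for w in tokens)
        weights = {i: tf * math.log2(num_docs / doc_freq[i]) for i, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0.0:
            # Every word occurs in every document: nothing informative to store.
            yield []
            continue
        # Zero weights are dropped so that the vectors stay sparse.
        yield sorted((i, w / norm) for i, w in weights.items() if w > 0.0)

# Hypothetical three-document corpus, written to a temporary bz2 file.
sample = "river lake island\nriver film series\nfilm episode television\n"
corpus_path = os.path.join(tempfile.mkdtemp(), "toy_corpus.txt.bz2")
with bz2.open(corpus_path, "wt", encoding="utf-8") as out:
    out.write(sample)

word2id, doc_freq, num_docs = build_dictionary(corpus_path)
vectors = list(tfidf_vectors(corpus_path, word2id, doc_freq, num_docs))
for vec in vectors:
    print(vec)
```

Reading the compressed file twice costs extra CPU time, but memory use stays bounded by the dictionary rather than by the corpus.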
Latent Semantic Analysis
First, let's load the corpus iterator and the dictionary created above:

>>> import logging, gensim, bz2
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> # load id->word mapping (the dictionary), one of the results of step 2 above
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
>>> # load corpus iterator
>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
>>> # mm = gensim.corpora.MmCorpus(bz2.BZ2File('wiki_en_tfidf.mm.bz2')) # use this if you compressed the TFIDF output (recommended)

>>> print(mm)
MmCorpus(3931787 documents, 100000 features, 756379027 non-zero entries)

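Building the LSI model takes several hours.
Total processing time is dominated by the preprocessing step.
The algorithm used in Gensim only needs to see each input document once, so it is suitable for settings where the documents arrive as a non-repeatable stream.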
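What a non-repeatable stream means in Python terms can be shown with a plain generator. This is a sketch only: `article_stream` is a hypothetical stand-in for documents arriving from a pipe or network feed, and the word counting stands in for a single-pass model update.

```python
from collections import Counter

def article_stream():
    """Hypothetical one-shot source: each document can be read only once."""
    yield ["river", "lake", "island"]
    yield ["film", "series", "episode"]
    yield ["river", "dam"]

def single_pass_counts(stream):
    """Consume the stream exactly once, updating the state per document."""
    totals = Counter()
    for tokens in stream:
        totals.update(tokens)
    return totals

stream = article_stream()
counts = single_pass_counts(stream)
print(counts["river"])

# The generator is now exhausted: a second pass sees no documents at all,
# which is why multi-pass algorithms cannot be used on such a source.
leftover = list(stream)
print(len(leftover))
```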

Latent Dirichlet Allocation
First, load the corpus iterator and the dictionary:

>>> import logging, gensim, bz2
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> # load id->word mapping (the dictionary), one of the results of step 2 above
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
>>> # load corpus iterator
>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
>>> # mm = gensim.corpora.MmCorpus(bz2.BZ2File('wiki_en_tfidf.mm.bz2')) # use this if you compressed the TFIDF output

>>> print(mm)
MmCorpus(3931787 documents, 100000 features, 756379027 non-zero entries)

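We will run online LDA, an algorithm that takes a chunk of documents, updates the LDA model, takes another chunk, updates the model again, and so on.
Online LDA can be contrasted with batch LDA, which processes the whole corpus (one full pass), then updates the model, then makes another pass, another update, and so on.
The difference is that, given a reasonably stationary document stream, the online updates over the smaller chunks (subcorpora) are already fairly good, so the model estimate converges faster. As a result, we may need only a single full pass over the corpus: if the corpus has 3 million articles and we update the model once every 10,000 articles, one pass amounts to 300 updates, which is quite likely enough for an accurate estimate of the topics: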
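The update arithmetic can be checked with a few lines of Python: 3 million articles at one update per 10,000 articles gives 300 updates. This sketch is not part of Gensim; `chunked` and `updates_per_pass` are hypothetical helpers, where a chunk is the unit after which an online model would be updated.

```python
import math
from itertools import islice

def chunked(stream, chunk_size):
    """Yield consecutive lists of at most chunk_size items from any iterable."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

def updates_per_pass(num_docs, chunk_size):
    """Number of model updates in one pass when updating after every chunk."""
    return math.ceil(num_docs / chunk_size)

print(updates_per_pass(3_000_000, 10_000))   # round figure used in the text
print(updates_per_pass(3_931_787, 10_000))   # corpus size reported by print(mm)

# Toy stream of 25 document ids, chunked by 10: the last chunk is shorter.
chunk_sizes = [len(chunk) for chunk in chunked(range(25), 10)]
print(chunk_sizes)
```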

>>> # extract 100 LDA topics, using 1 pass and updating once every 1 chunk (10,000 documents)
>>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
using serial LDA version on this node
running online LDA training, 100 topics, 1 passes over the supplied corpus of 3931787 documents, updating model once every 10000 documents
...

Unlike LSA, the topics coming from LDA are easier to interpret:

>>> # print the most contributing words for 20 randomly selected topics
>>> lda.print_topics(20)
topic #0: 0.009*river + 0.008*lake + 0.006*island + 0.005*mountain + 0.004*area + 0.004*park + 0.004*antarctic + 0.004*south + 0.004*mountains + 0.004*dam
topic #1: 0.026*relay + 0.026*athletics + 0.025*metres + 0.023*freestyle + 0.022*hurdles + 0.020*ret + 0.017*divisão + 0.017*athletes + 0.016*bundesliga + 0.014*medals
topic #2: 0.002*were + 0.002*he + 0.002*court + 0.002*his + 0.002*had + 0.002*law + 0.002*government + 0.002*police + 0.002*patrolling + 0.002*their
topic #3: 0.040*courcelles + 0.035*centimeters + 0.023*mattythewhite + 0.021*wine + 0.019*stamps + 0.018*oko + 0.017*perennial + 0.014*stubs + 0.012*ovate + 0.011*greyish
topic #4: 0.039*al + 0.029*sysop + 0.019*iran + 0.015*pakistan + 0.014*ali + 0.013*arab + 0.010*islamic + 0.010*arabic + 0.010*saudi + 0.010*muhammad
topic #5: 0.020*copyrighted + 0.020*northamerica + 0.014*uncopyrighted + 0.007*rihanna + 0.005*cloudz + 0.005*knowles + 0.004*gaga + 0.004*zombie + 0.004*wigan + 0.003*maccabi
topic #6: 0.061*israel + 0.056*israeli + 0.030*sockpuppet + 0.025*jerusalem + 0.025*tel + 0.023*aviv + 0.022*palestinian + 0.019*ifk + 0.016*palestine + 0.014*hebrew
topic #7: 0.015*melbourne + 0.014*rovers + 0.013*vfl + 0.012*australian + 0.012*wanderers + 0.011*afl + 0.008*dinamo + 0.008*queensland + 0.008*tracklist + 0.008*brisbane
topic #8: 0.011*film + 0.007*her + 0.007*she + 0.004*he + 0.004*series + 0.004*his + 0.004*episode + 0.003*films + 0.003*television + 0.003*best
topic #9: 0.019*wrestling + 0.013*château + 0.013*ligue + 0.012*discus + 0.012*estonian + 0.009*uci + 0.008*hockeyarchives + 0.008*wwe + 0.008*estonia + 0.007*reign
topic #10: 0.078*edits + 0.059*notability + 0.035*archived + 0.025*clearer + 0.022*speedy + 0.021*deleted + 0.016*hook + 0.015*checkuser + 0.014*ron + 0.011*nominator
topic #11: 0.013*admins + 0.009*acid + 0.009*molniya + 0.009*chemical + 0.007*ch + 0.007*chemistry + 0.007*compound + 0.007*anemone + 0.006*mg + 0.006*reaction
topic #12: 0.018*india + 0.013*indian + 0.010*tamil + 0.009*singh + 0.008*film + 0.008*temple + 0.006*kumar + 0.006*hindi + 0.006*delhi + 0.005*bengal
topic #13: 0.047*bwebs + 0.024*malta + 0.020*hobart + 0.019*basa + 0.019*columella + 0.019*huon + 0.018*tasmania + 0.016*popups + 0.014*tasmanian + 0.014*modèle
topic #14: 0.014*jewish + 0.011*rabbi + 0.008*bgwhite + 0.008*lebanese + 0.007*lebanon + 0.006*homs + 0.005*beirut + 0.004*jews + 0.004*hebrew + 0.004*caligari
topic #15: 0.025*german + 0.020*der + 0.017*von + 0.015*und + 0.014*berlin + 0.012*germany + 0.012*die + 0.010*des + 0.008*kategorie + 0.007*cross
topic #16: 0.003*can + 0.003*system + 0.003*power + 0.003*are + 0.003*energy + 0.002*data + 0.002*be + 0.002*used + 0.002*or + 0.002*using
topic #17: 0.049*indonesia + 0.042*indonesian + 0.031*malaysia + 0.024*singapore + 0.022*greek + 0.021*jakarta + 0.016*greece + 0.015*dord + 0.014*athens + 0.011*malaysian
topic #18: 0.031*stakes + 0.029*webs + 0.018*futsal + 0.014*whitish + 0.013*hyun + 0.012*thoroughbred + 0.012*dnf + 0.012*jockey + 0.011*medalists + 0.011*racehorse
topic #19: 0.119*oblast + 0.034*uploaded + 0.034*uploads + 0.033*nordland + 0.025*selsoviet + 0.023*raion + 0.022*krai + 0.018*okrug + 0.015*hålogaland + 0.015*russiae + 0.020*manga + 0.017*dragon + 0.012*theme + 0.011*dvd + 0.011*super + 0.011*hunter + 0.009*ash + 0.009*dream + 0.009*angel

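Note two differences between the LDA and LSA runs. First, we asked LSA to extract 400 topics but LDA only 100 (so the difference in speed is in fact even greater). Second, the LSA implementation in Gensim is truly online: if the nature of the input stream changes over time, the LSA model re-orients itself to reflect those changes within a reasonably small number of updates. In contrast, LDA is not truly online, because the impact of later updates on the model gradually diminishes. If there is topic drift in the document stream, LDA will have trouble adapting.
In short, be careful when using LDA to add new documents to the model incrementally over time. Batch usage of LDA, where the entire training corpus is either known beforehand or shows no topic drift, is fine.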
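Why later updates count for less can be seen from the step-size schedule commonly used for online LDA, rho_t = (tau_0 + t)^(-kappa), where t is the update number. The sketch below only evaluates that formula. The values kappa = 0.5 and tau_0 = 1.0 are assumptions chosen for illustration, not settings confirmed by this article.

```python
def step_size(update_number, kappa=0.5, tau0=1.0):
    """Weight given to the newest chunk when it is blended into the model."""
    return (tau0 + update_number) ** (-kappa)

# Weight of the new chunk at selected points of a 300-update pass.
for t in (0, 10, 100, 299):
    print(t, round(step_size(t), 4))

# Full schedule for one pass; every value is smaller than the one before it.
weights = [step_size(t) for t in range(300)]
```

Because the weight shrinks with every update, a change in the stream that arrives late is absorbed much more slowly than the same change arriving early.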
To run batch LDA:

>>> # extract 100 LDA topics, using 20 full passes, no online updates
>>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, passes=20)

As usual, a trained model can be used to transform new, unseen documents (bag-of-words count vectors) into LDA topic distributions:

>>> doc_lda = lda[doc_bow]
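For reference, a bag-of-words vector such as `doc_bow` is a sparse list of (word id, count) pairs. The sketch below builds one by hand with a hypothetical `word2id` mapping and a hypothetical `to_bow` helper; in practice the ids must come from the same dictionary that the model was trained with.

```python
from collections import Counter

def to_bow(tokens, word2id):
    """Return sorted (word_id, count) pairs, skipping out-of-vocabulary words."""
    counts = Counter(word2id[t] for t in tokens if t in word2id)
    return sorted(counts.items())

# Hypothetical vocabulary; real ids would come from the training dictionary.
word2id = {"river": 0, "lake": 1, "film": 2}
doc_bow = to_bow("the river feeds the lake and the river delta".split(), word2id)
print(doc_bow)
```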
