Natural Language Communication System, phxnet Team, Innovation Practical Training, Personal Blog (14)

Study notes on WikiExtractor:

WikiExtractor is a Python script built specifically to extract and clean Wikipedia dump data. It supports Python 2.7 or Python 3.3+, has no extra dependencies, and is easy to install and use:

Installation:

git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor/
sudo python setup.py install

Usage:

WikiExtractor.py -o enwiki enwiki-latest-pages-articles.xml.bz2
......
INFO: 53665431  Pampapaul
INFO: 53665433  Charles Frederick Zimpel
INFO: Finished 11-process extraction of 5375019 articles in 8363.5s (642.7 art/s)

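The whole run took a little over 2 hours and extracted roughly 5.37 million articles. For my machine's configuration, see my earlier post "Notes on Building a Deep Learning Workstation".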
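As a quick sanity check on the figures in the final INFO line, the arithmetic can be reproduced in a few lines of Python. This is only a sketch: the two numbers are copied from the log above, and the variable names are my own.

```python
# Figures copied from the final INFO line of the log above.
articles = 5375019
seconds = 8363.5

# Throughput in articles per second.
rate = articles / seconds

# Break the elapsed time into hours, minutes, and seconds.
hours, remainder = divmod(seconds, 3600)
minutes, secs = divmod(remainder, 60)

print(f"{rate:.1f} art/s")                           # 642.7 art/s
print(f"{int(hours)}h {int(minutes)}m {secs:.1f}s")  # 2h 19m 23.5s
```

The result matches the 642.7 art/s reported by the tool and confirms the "a little over 2 hours" figure.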

The extracted files are split up in a fixed order and stored across multiple subdirectories.

Each subdirectory in turn holds a number of files named wiki_num, each about 1M in size. This size can be controlled with the -b option:

-b n[KMG], --bytes n[KMG] maximum bytes per output file (default 1M)
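To check the layout and file sizes after a run, the output directory can be walked with a short script. This is a minimal sketch, assuming the files sit exactly one level below the output root (enwiki in the command above) and that their names start with wiki_. The function name list_extracted_files is hypothetical and not part of WikiExtractor.

```python
from pathlib import Path
from typing import List, Tuple


def list_extracted_files(root: str) -> List[Tuple[Path, int]]:
    """Return (path, size in bytes) for every wiki_* file one level below root."""
    files = []
    # Assumption: output files live in <root>/<subdirectory>/wiki_<num>.
    for path in sorted(Path(root).glob("*/wiki_*")):
        if path.is_file():
            files.append((path, path.stat().st_size))
    return files


if __name__ == "__main__":
    for path, size in list_extracted_files("enwiki"):
        print(f"{path}\t{size / 1024:.0f} KiB")
```

With the default setting, most of the sizes printed should be close to 1M.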

Let's look at what wiki_00 actually contains:

<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, and harmful.
...
Criticisms of anarchism include moral criticisms and pragmatic criticisms. Anarchism is often evaluated as unfeasible or utopian by its critics.


</doc>
<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">
Autism

Autism is a neurodevelopmental disorder characterized by impaired social interaction, verbal and non-verbal communication, and restricted and repetitive behavior. Parents usually notice signs in the first two years of their child's life. These signs often develop gradually, though some children with autism reach their developmental milestones at a normal pace and then regress. The diagnostic criteria require that symptoms become apparent in early childhood, typically before age three.
...
</doc>
...

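Each wiki_num file in turn holds a number of docs. Each doc is marked with a tag carrying attributes such as id, url, and title, so the docs are easy to tell apart.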
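Because the docs are delimited this regularly, a wiki_num file can be split back into individual articles with a small line-based parser. The following is a minimal sketch, not WikiExtractor's own API. It assumes the opening tag always sits on a single line with the attributes in the order shown in the sample (id, url, title), and that the closing tag is on a line of its own. The names DOC_OPEN and iter_docs are my own.

```python
import re
from typing import Dict, Iterable, Iterator

# Assumption: the opening tag looks exactly like the sample, e.g.
# <doc id="12" url="https://..." title="Anarchism">
DOC_OPEN = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">'
)


def iter_docs(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict (id, url, title, text) per <doc>...</doc> block."""
    meta = None   # attributes of the doc currently being read, if any
    body = []     # text lines collected for the current doc
    for line in lines:
        line = line.rstrip("\n")
        if meta is None:
            match = DOC_OPEN.match(line)
            if match:
                meta = match.groupdict()
                body = []
        elif line == "</doc>":
            yield {**meta, "text": "\n".join(body).strip()}
            meta = None
        else:
            body.append(line)
```

Since the function accepts any iterable of lines, an open file handle for one of the wiki_num files (opened with encoding="utf-8") can be passed in directly, and the docs are produced one at a time without loading the whole file into memory.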
