
Chinese Word Segmentation, Word Frequency Counting, and Word Cloud Generation

from collections import Counter

import jieba
# Installing jieba is straightforward; there are plenty of tutorials online.

import matplotlib.pyplot as plt

from wordcloud import WordCloud
# Installing wordcloud hit a bug; the workaround is described in another blog post:
# http://blog.csdn.net/qq_35273499/article/details/79078692

## Build the stop-word list
# The stop words are filtered out after segmentation.
def Stopwordlist(filepath):
    stopwords = []
    for line in open(filepath, 'r').readlines():
        stopwords.append(line.strip())  # strip the trailing newline
    #print(stopwords)
    return stopwords
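
Stopwordlist expects one stop word per line in the file. A few illustrative lines of such a file (example entries only; the actual list is linked at the end of the post):

的
了
是
在
和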

## Segment the sentences
def Cut_Sentence(rawfile, stopwordpath):
    outstr = []
    stopwords = Stopwordlist(stopwordpath)
    for line in rawfile:
        sentence_seged = jieba.cut(line.strip(), cut_all=False)  # line.strip() removes the newline
        for word in sentence_seged:
            if word not in stopwords:
                if word != '\t':
                    outstr.append(word)
    #print(outstr)
    return outstr
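
As a quick illustration of what jieba.cut returns in precise mode (cut_all=False), here is a minimal standalone sketch using the sample sentence from jieba's own documentation:

import jieba

# Precise mode (cut_all=False) yields non-overlapping tokens.
print(list(jieba.cut("我来到北京清华大学", cut_all=False)))
# ['我', '来到', '北京', '清华大学']

# Full mode (cut_all=True) yields every dictionary word it can find,
# including overlapping candidates, so the token list is longer.
print(list(jieba.cut("我来到北京清华大学", cut_all=True)))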

# Word frequency counting
def Countword(outstrlist):
    data = dict(Counter(outstrlist))
    data1 = sorted(data.items(), key=lambda d: d[1], reverse=True)
    data2 = dict((key, value) for key, value in data1)
    '''
    data.items() yields the (key, value) pairs of the dictionary.
    sorted() with the key argument sorts the pairs by their value,
    i.e. the second element d[1]. reverse=True flips the order:
    the default is ascending, so the result is descending.
    '''
    return data2
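
Incidentally, Counter already provides this sort, so the function body above could be collapsed into one line; a minimal equivalent sketch:

from collections import Counter

def Countword(outstrlist):
    # most_common() returns (word, count) pairs sorted by count, descending.
    return dict(Counter(outstrlist).most_common())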

# Generate the word cloud
def Wordcloud(text):
    wc = WordCloud(
                background_color = 'white',  # background colour
                max_words = 2000,            # maximum number of words displayed
                font_path = r'H:\cutword\msyhbd.ttf',  # font file; without one, Chinese characters will not render
               # mask = trump_coloring,
                width = 800,
                height = 600,
                max_font_size = 50,          # maximum font size
                random_state = 30,           # number of random states, i.e. colour schemes
                )
    wc.generate(text)
    wc.to_file(r'H:\cutword\wordcloud.png')  # raw string so the backslashes are taken literally
    #my_wordcloud = WordCloud().generate(str(strlist))
    plt.imshow(wc)
    plt.axis("off")  # no axes
    plt.show()
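
The commented-out mask parameter hints at shaping the cloud with an image. A minimal sketch of that technique, assuming a hypothetical silhouette image at H:\cutword\mask.png (dark shape on a white background):

import numpy as np
from PIL import Image
from wordcloud import WordCloud

mask = np.array(Image.open(r'H:\cutword\mask.png'))  # hypothetical path
wc = WordCloud(background_color = 'white',
               font_path = r'H:\cutword\msyhbd.ttf',
               mask = mask)  # words are drawn only inside the dark region

When a mask is supplied, the library takes the canvas size from the mask image, so the width and height parameters are ignored.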

if __name__ == '__main__':
    stopwordpath = r'H:\cutword\stopwords.txt'  # path to the stop-word list
    rawfile = open(r'H:\cutword\dqdg\dqdg.txt', 'r')  # raw text to be segmented
    outfile = open(r'H:\cutword\outwords.txt', 'w+')  # text after segmentation and stop-word removal
    countfile = open(r'H:\cutword\wordcount.txt', 'w')  # file for the word-frequency counts
    outstrlist = Cut_Sentence(rawfile, stopwordpath)
    countword = Countword(outstrlist)
    Wordcloud(' '.join(outstrlist))  # generate() expects a space-separated string, not str() of a list
    for line in outstrlist:
        outfile.write(line + " ")  # write the segmentation result to file
    for key in countword.keys():
        countfile.write(key + '  ' + str(countword[key]) + '\n')  # write the counts to the txt file
    #Wordcloud()
    rawfile.close()
    outfile.close()
    countfile.close()

The corpus used for segmentation is Sun Haohui's novel 《大秦帝國》 (The Qin Empire): https://pan.baidu.com/s/1o94kRGY (password: 2q2t)

A partial screenshot of the stop-word list: https://pan.baidu.com/s/1dGMeivn

The final segmentation result:

After segmentation, we also counted the word frequencies, sorted them from highest to lowest, and saved the result to a file, as shown in the figure below.

The required font (without it, Chinese characters cannot be rendered): https://pan.baidu.com/s/1oAj2wJ4 (password: mq2a)

Making a word cloud from a subset of the words gives the final result: we can see that '秦國' (the state of Qin), '衛鞅' (Wei Yang), '龐涓' (Pang Juan), '商鞅' (Shang Yang), and '國君' (ruler) appear with relatively high frequency.