
Python Data Mining: Word Clouds


Drawing a word cloud

1. Building the corpus, tracking the source of each segment, removing stop words, and counting word frequencies

Usage: os.path.join(path, name)  # joins a directory with a file name or subdirectory; the result is path/name (path\name on Windows)
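As a minimal sketch of how os.path.join behaves, the snippet below joins a few path components and checks the result against the platform separator. The directory and file names are hypothetical and serve only as an illustration.

```python
import os

# Hypothetical directory and file names, used only for illustration.
joined = os.path.join("Sample", "C000007", "10.txt")

# The separator depends on the operating system: "/" on POSIX, "\\" on Windows.
expected = os.sep.join(["Sample", "C000007", "10.txt"])

print(joined)
print(joined == expected)
```

Letting os.path.join choose the separator keeps the corpus-walking code portable across operating systems.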

import os
import codecs

import jieba
import pandas

# Build the corpus: read every file under the sample directory
filePaths = []
fileContents = []
for root, dirs, files in os.walk("D:\\Python\\Python數據挖掘\\Python數據挖掘實戰課程課件\\2.4\\SogouC.mini\\Sample"):
    for name in files:
        filePath = os.path.join(root, name)
        filePaths.append(filePath)
        f = codecs.open(filePath, "r", "utf-8")
        fileContent = f.read()
        f.close()
        fileContents.append(fileContent)

corpos = pandas.DataFrame({
    "filePath": filePaths,
    "fileContent": fileContents})

# Segment each article and record which file every segment came from
segments = []
filePaths = []
for index, row in corpos.iterrows():
    filePath = row["filePath"]
    fileContent = row["fileContent"]
    segs = jieba.cut(fileContent)
    for seg in segs:
        segments.append(seg)
        filePaths.append(filePath)

segmentDataFrame = pandas.DataFrame({
    "segment": segments,
    "filePath": filePaths})

# Count word frequencies: group by segment, count the rows in each group,
# then sort by the count in descending order
segStat = segmentDataFrame.groupby(
    by="segment"
).size().reset_index(name="計數").sort_values(
    by="計數",
    ascending=False)

# Remove stop words
stopwords = pandas.read_csv(
    "D:\\Python\\Python數據挖掘\\Python數據挖掘實戰課程課件\\2.4\\StopwordsCN.txt",
    encoding="utf-8",
    index_col=False)

fSegStat = segStat[
    ~segStat.segment.isin(stopwords.stopword)]

# Second way to remove stop words: filter them out while segmenting
segments = []
filePaths = []
for index, row in corpos.iterrows():
    filePath = row["filePath"]
    fileContent = row["fileContent"]
    segs = jieba.cut(fileContent)
    for seg in segs:
        # Skip stop words and whitespace-only segments
        if seg not in stopwords.stopword.values and len(seg.strip()) > 0:
            segments.append(seg)
            filePaths.append(filePath)

segmentDataFrame = pandas.DataFrame({
    "segment": segments,
    "filePath": filePaths})

segStat = segmentDataFrame.groupby(
    by="segment"
).size().reset_index(name="計數").sort_values(
    by="計數",
    ascending=False)
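To check the counting and stop-word logic without the corpus files, the same idea can be sketched with the standard library's collections.Counter. This is a swapped-in alternative to the pandas pipeline, not the article's method; the segments and the stop-word set below are made up and stand in for the output of jieba.cut and the contents of StopwordsCN.txt.

```python
from collections import Counter

# Toy segments standing in for the output of jieba.cut; the words are made up.
segments = ["數據", "的", "挖掘", "數據", " ", "的", "詞雲", "數據"]
# Hypothetical stop-word set; the article loads its list from StopwordsCN.txt.
stopwords = {"的", "了"}

# Count only non-blank segments that are not stop words.
counts = Counter(
    seg for seg in segments
    if seg not in stopwords and len(seg.strip()) > 0
)

# most_common() returns (word, count) pairs sorted by descending count.
for word, count in counts.most_common():
    print(word, count)
```

The resulting mapping from word to count has the same shape as the dictionary that is later passed to the word cloud.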

2. Drawing the word cloud

First import WordCloud, then import the pyplot module from the matplotlib plotting library.

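Usually you start by setting the word cloud's background and font with the background_color and font_path parameters. For Chinese text, font_path must point to a font that contains Chinese glyphs, such as simhei.ttf.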

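The word frequencies are passed in as a dictionary. Set the segment column as the index, keep the count as the only column, convert the result to a dictionary, and pass the word-to-count mapping to fit_words.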
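The shape of that dictionary is easy to get wrong, so here is a minimal sketch of the conversion on a toy frequency table. The rows are made up; only the column names segment and 計數 follow the article's code.

```python
import pandas

# Toy frequency table with the same column names as fSegStat; the rows are made up.
fSegStat = pandas.DataFrame({
    "segment": ["數據", "挖掘", "詞雲"],
    "計數": [3, 1, 1],
})

# to_dict() on a DataFrame returns {column: {index: value}}.
words = fSegStat.set_index("segment").to_dict()

# The inner dictionary maps each word to its count.
frequencies = words["計數"]
print(frequencies)
```

Because to_dict() nests the data under the column name, the word-to-count mapping is words["計數"], not words itself.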

The image is displayed with plt.imshow().

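from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Use a font that contains Chinese glyphs and a black background
wordcloud = WordCloud(
    font_path="D:\\Python\\Python數據挖掘\\Python數據挖掘實戰課程課件\\2.4\\simhei.ttf",
    background_color="black"
    )

# to_dict() returns {"計數": {word: count}}
words = fSegStat.set_index("segment").to_dict()

# Build the word cloud from the word-to-count mapping
wordcloud.fit_words(words["計數"])
plt.imshow(wordcloud)
# Show the figure before closing it
plt.show()
plt.close()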
