An Introduction to the wordcloud WordCloud Class, Making Word Clouds in Python, and Small Pitfalls Such as Garbled Characters

阿新 • Published: 2017-07-12


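Everyone has seen word clouds; they are everywhere in the big-data era. Today we will use Python's third-party wordcloud library to make one, and along the way cover the pitfalls encountered in the process.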

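As an example, below is a word cloud I made from the signatures of my WeChat friends, scraped from my own WeChat account. It looks like the most used phrase is still “方得始終” (roughly, "only then can one see it through to the end").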

[Figure: word cloud of WeChat friends' signatures]

First we need a few libraries. Install them with pip, then import them:

import chardet                         # detects the character encoding of raw bytes
from wordcloud import WordCloud        # the word cloud library
import matplotlib.pyplot as plt        # plotting library

This example has two steps. Step one: read a piece of text from a file. Step two: build the word cloud and display it.

Here is the code, which reads a file from the desktop:

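with open("C:\\Users\\fyc\\Desktop\\virgo.txt", "rb") as f:  # read raw bytes so chardet can inspect them
    text = f.read()
detected = chardet.detect(text)             # a dict that includes the guessed "encoding"
text1 = text.decode(detected["encoding"])   # decode the bytes into a Unicode string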

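The decoding step is necessary because WordCloud's generate() function expects a Unicode object; anything else leads to an exception. After tracing through the call chain layer by layer, I finally found these lines in wordcloud.py: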

stopwords = set(map(str.lower, self.stopwords))

flags = (re.UNICODE if sys.version < '3' and type(text) is unicode
         else 0)
regexp = self.regexp if self.regexp is not None else r"\w[\w']+"

words = re.findall(regexp, text, flags)

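The problem lies in the regular expression. If text is not of type unicode, the re.UNICODE flag is not set, so \w matches only ASCII word characters and re.findall matches nothing in Chinese text. words ends up as an empty list, and an exception is raised shortly afterwards.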
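The effect can be reproduced on Python 3, where str is already Unicode, by forcing ASCII-only matching with re.ASCII. This is only an analogy for how Python 2 treated undecoded byte strings, not the library's own code, and the sample string is made up for illustration.

```python
import re

# Default token pattern, as quoted from wordcloud.py above.
TOKEN_PATTERN = r"\w[\w']+"

# Hypothetical sample mixing Chinese and English words.
sample = "白羊座 女生 hello"

# Unicode-aware matching (the Python 3 default for str): Chinese words are found.
unicode_words = re.findall(TOKEN_PATTERN, sample)

# ASCII-only matching, similar to what undecoded bytes got on Python 2:
# \w no longer matches Chinese characters, so only the English word survives.
ascii_words = re.findall(TOKEN_PATTERN, sample, re.ASCII)

print(unicode_words)
print(ascii_words)
```

If the input were purely Chinese, the ASCII-only list would be empty, which matches the failure described above.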

So always decode the text to Unicode before calling generate().
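chardet only guesses the encoding, so it can help to have a fallback. The sketch below is a different technique from the chardet call used in this article: it simply tries a list of candidate encodings in order. The helper name decode_with_fallback and the candidate list are assumptions for illustration; adjust the candidates to the files you actually handle.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gb18030")):
    """Try each candidate encoding in turn and return (text, encoding_used).

    `raw` must be bytes. The candidate list is an assumption, not something
    the wordcloud library prescribes.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


# Bytes as they might come from a GBK/GB18030-encoded text file.
raw_bytes = "白羊座".encode("gb18030")
text, used = decode_with_fallback(raw_bytes)
print(text, used)
```

Strict UTF-8 is tried first because it rejects most text written in other encodings, which makes a successful decode a fairly reliable signal.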

Step two: generate the word cloud and display it:

wc1 = WordCloud(
    background_color="white",
    width=1000,
    height=860,
    font_path="C:\\Windows\\Fonts\\STFANGSO.ttf",  # without this line, Chinese renders as empty boxes
    margin=2)
wc2 = wc1.generate(text1)  # generate() expects a Unicode object, which is why the text was decoded earlier

plt.imshow(wc2)
plt.axis("off")
plt.show()

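WordCloud() constructs a word cloud object, and its generate() method then sizes each word according to how often it appears in the text passed in. For the text, I used an introduction to Aries, reproduced below: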

豪放率真的白羊座女生，富有強大的想象力，熱情勇敢，女漢子味十足。勇往直前，是你們最大的特點。所以即便面對困難挫折，白羊女都敢於迎接挑戰。可以說，這是個極具戰鬥精神的新時代女性。如此強悍個性的白羊女，在異性的眼中卻永遠都少了點女人味專屬的溫柔，往往都是稱兄道弟的份。 

　　如果你期待一個林黛玉般的女孩來滿足你的男子氣概，碰到了她，你可真是門兒都沒有了，識相的話，趕快找個窗子開溜吧！在火星守護下的白羊座女子，通常是積極而且堅強的。像小鳥依人、楚楚可憐這一類的形容詞很難加諸在她身上。

　　白羊座的女子應該算得上十二個星座中，獨立性最強的女性。她絕對不是那種整天守在家裏，等著你來接她、送她，完全缺乏獨立行動能力的典型。對於大多數的白羊座女孩來說，她寧願相信如果沒有你在身邊礙手礙腳，她辦事效率或許要高得多。聽我這麽說，你或許會以為這樣的女人是不需要男人的！那可就錯啦！自信而驕傲的白羊座女性確有著堅強的獨立生存能力。但她們內心都深深地渴望著她夢中的白馬王子快點出現呢！很難相信吧！看起來那麽銳利的她，其實是充滿著童話般夢想的。而對所有的白羊座女子來說，她們心裏最大的矛盾就是渴望征服對方，又期待著被對方征服的微妙心裏。

　　你現在可能有些擔心，不知道該如何扮演好自已的角色了，是嗎？別慌！先把你的“真心”準備好，以後的辦法就好商量了，雖然有一點點辛苦，不過保證值得。

　　首先，你必須認清，白羊座的女子基本上是“英雄主義”的。她會傾心於一個令她佩服的男人。她要嫁一個讓她引以為傲的丈夫。她或許會比較欣賞事業有成的男人，但這並不表示她是個拜金主義者。家財萬貫的花花公子是不會讓她心動的，滿腔理想的熱血青年反而會受她的青睞。因此，如果你愛上了一個白羊座的女子，請先不必展開熱烈的追求。哈巴狗一樣的男人會讓她既討厭又害怕，她深怕給你一個禮貌的微笑之後，你就會死纏著她不放了。你最好先讓她了解你的才幹、你的魅力，引起她對你的好奇（或者應該說引起她征服你的興趣），等你感受到她對你真有好感之後，你再誠懇的對她表示愛意，那麽前途就大有可為了！

　　你要像個“大男人”的樣子（我說過她很英雄主義），但是你絕不能對她頤指氣使。你一定要真心的關懷她，但絕對不要太縱容她。我想你應該用一種“英雄惜英雄”的態度來待你的白羊座女人，才是比較適當的。

　　絕大多數的白羊座女子都很好強，她們堅持要在對方的心目中保持最重要的地位。當然她也會把你擺在她心中最重要的位置。而且她很忠實、很大方。她願意與你分享她的一切。當然，她也會認為你應該與她分享一切，包括你的秘密。欺騙你的白羊座女人，有如犯了欺君之罪一樣的嚴重。這一點你可千萬要記住，她情願聽你令她心碎的懺悔，也不要接受美麗的謊言。你最好少在她面前誇獎其她的女孩，尤其是那種由衷的贊美，極可能會引起雷霆大火。

　　由於她們那麽積極堅強的個性，許多白羊座的女子都會給人一種尖銳、而且愛找麻煩的印象。在表面上，她們是不會讓別人（尤其是男人）占便宜的。很多人會認為白羊座的女人總是尖嘴利牙的得理不饒人。正因為如此，她們常常會吃些暗虧，受到挫折，她們總是比其它的女孩活得辛苦一些。其實，你應該明白，她們的內心多半是正直、善良，而且脆弱的。只要你真心的關懷她，在她受了委屈的時侯，給她一個溫暖的懷抱，她會成為你一生忠實可靠的伴侶。

　　白羊座的女人幾乎都可以成為出色的職業婦女，也同時能做稱職的家庭主婦。事實上，讓她擁有自已的事業對你們的婚姻是有幫助的。當她盡量在工作上發揮了她的好勝心和征服欲之後，回到家裏做綿羊的機會就比較大了。如果要一個精力旺盛的白羊座女子，把心思全放在“你”的身上，我擔心你會有點受不了的。至於家庭，你大可放心，她雖然柴米油鹽之類的瑣事，不是那麽有興趣，但好強的她，不會讓自已成為一個失敗的主婦的。

　　還有件事你該慶幸，那就是我很少看到一個邋遢的白羊座妻子，她們大多都在婚後依然保持光鮮亮麗，不願意別人譏笑她老公娶了個黃臉婆。就算偶爾懶散一下，只要老公稍加提醒，她們立刻就會警覺。我有個產後稍微發福的白羊座女友，就因為老公說了一句“以前我最欣賞你那雙修長的腿了。”她硬是兩個月減肥了二十磅。就憑她這股堅強的毅力，你能不相信她會全力以赴的做個好妻子嗎？
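As a rough picture of what "size follows frequency" means, the sketch below counts tokens with the default pattern and scales each count against the most common word. relative_frequencies is a hypothetical helper and the sample string is made up; the library's real pipeline also deals with stopwords, collocations and plurals, so treat this as an approximation only.

```python
import re
from collections import Counter


def relative_frequencies(text, regexp=r"\w[\w']+"):
    """Count tokens and scale every count by the highest count (range 0..1)."""
    counts = Counter(re.findall(regexp, text))
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}


# Hypothetical sample: tokens are separated by spaces or punctuation.
sample = "白羊座 女生，白羊座 女子，白羊座"
freqs = relative_frequencies(sample)
print(freqs)

# An unbroken run of Chinese characters is matched as one single token.
print(re.findall(r"\w[\w']+", "白羊座女生"))
```

The second print shows why long passages without spaces tend to produce whole clauses rather than individual words: the pattern only splits on characters that are not word characters, such as spaces and punctuation.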

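A few notes on the constructor. First, it takes keyword arguments, so there is no need to remember parameter positions; remembering the parameter names is enough.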

As for what the parameters mean, the quick documentation in PyCharm shows the following:

class WordCloud(object)
def __init__(self, font_path=None, width=400, height=200, margin=2, ranks_only=None, prefer_horizontal=.9, mask=None, scale=1, color_func=None, max_words=200, min_font_size=4, stopwords=None, random_state=None, background_color='black', max_font_size=None, font_step=1, mode="RGB", relative_scaling=.5, regexp=None, collocations=True, colormap=None, normalize_plurals=True)
Documentation is missing. The following is copied from class WordCloud.
Word cloud object for generating and drawing.
  
font_path:
(string) Font path to the font that will be used (OTF or TTF). Defaults to DroidSansMono path on a Linux machine. If you are on another OS or don't have this font, you need to adjust this path.
width:
(int (default=400)) Width of the canvas.
height:
(int (default=200)) Height of the canvas.
prefer_horizontal:
(float (default=0.90)) The ratio of times to try horizontal fitting as opposed to vertical. If prefer_horizontal < 1, the algorithm will try rotating the word if it doesn't fit. (There is currently no built-in way to get only vertical words.)
mask:
(nd-array or None (default=None)) If not None, gives a binary mask on where to draw words. If mask is not None, width and height will be ignored and the shape of mask will be used instead. All white (#FF or #FFFFFF) entries will be considered "masked out" while other entries will be free to draw on. [This changed in the most recent version!]
scale:
(float (default=1)) Scaling between computation and drawing. For large word-cloud images, using scale instead of larger canvas size is significantly faster, but might lead to a coarser fit for the words.
min_font_size:
(int (default=4)) Smallest font size to use. Will stop when there is no more room in this size.
font_step:
(int (default=1)) Step size for the font. font_step > 1 might speed up computation but give a worse fit.
max_words:
(number (default=200)) The maximum number of words.
stopwords:
(set of strings or None) The words that will be eliminated. If None, the built-in STOPWORDS list will be used.
background_color:
(color value (default="black")) Background color for the word cloud image.
max_font_size:
(int or None (default=None)) Maximum font size for the largest word. If None, height of the image is used.
mode:
(string (default="RGB")) Transparent background will be generated when mode is "RGBA" and background_color is None.
relative_scaling:
(float (default=.5)) Importance of relative word frequencies for font-size. With relative_scaling=0, only word-ranks are considered. With relative_scaling=1, a word that is twice as frequent will have twice the size. If you want to consider the word frequencies and not only their rank, relative_scaling around .5 often looks good. 
color_func:
(callable, default=None) Callable with parameters word, font_size, position, orientation, font_path, random_state that returns a PIL color for each word. Overwrites "colormap". See colormap for specifying a matplotlib colormap instead.
regexp:
(string or None (optional)) Regular expression to split the input text into tokens in process_text. If None is specified, r"\w[\w']+" is used.
collocations:
(bool, default=True) Whether to include collocations (bigrams) of two words. 
colormap:
(string or matplotlib colormap, default="viridis") Matplotlib colormap to randomly draw colors from for each word. Ignored if "color_func" is specified. 
normalize_plurals:
(bool, default=True) Whether to remove trailing 's' from words. If True and a word appears with and without a trailing 's', the one with trailing 's' is removed and its counts are added to the version without trailing 's', unless the word ends with 'ss'.
Notes
Larger canvases will make the code significantly slower. If you need a large word cloud, try a lower canvas size, and set the scale parameter.
The algorithm might give more weight to the ranking of the words than their actual frequencies, depending on the max_font_size and the scaling heuristic.
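The relative_scaling description can be made concrete with a little arithmetic. The function below is a simplified interpolation that reproduces the documented endpoints (0 means rank only, 1 means size proportional to frequency); it is a sketch under that assumption and is not guaranteed to be the library's exact formula. The name next_font_size is hypothetical.

```python
def next_font_size(prev_size, relative_scaling, freq, prev_freq):
    """Font size for a word, given the size and frequency of the previous,
    more frequent word. Simplified model of the documented behaviour."""
    if relative_scaling == 0:
        # Rank only: frequency does not shrink the size in this model.
        return prev_size
    ratio = freq / float(prev_freq)
    return int(round((relative_scaling * ratio + (1 - relative_scaling)) * prev_size))


# A word half as frequent as the previous one, which was drawn at size 100.
for rs in (0, 0.5, 1):
    print(rs, next_font_size(100, rs, freq=0.5, prev_freq=1.0))
```

With relative_scaling=1 the half-as-frequent word gets half the size; with the default 0.5 it lands halfway between "same size" and "proportional".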

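The documentation is in English, but with a tool such as Baidu Translate it should be easy enough to follow. A few of the more important parameters are explained here: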

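font_path: the path to the font file used to render the words. This parameter is especially important when displaying Chinese. If it is left at the default, the text tends to come out garbled, as follows: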

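[Figure: word cloud in which Chinese characters are rendered as empty boxes because font_path was not set]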

width, height: as the names suggest, the width and height of the canvas.

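prefer_horizontal: how strongly words are placed horizontally rather than vertically. The default of 0.9 means horizontal placement is tried most of the time.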

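mask: the shape in which words may be drawn. White entries in the mask are left empty. By default the whole canvas is used; when a mask is given, width and height are ignored and the shape of the mask is used instead.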

The remaining parameters are not covered here.
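Since a missing or wrong font_path is the most common cause of garbled Chinese, it can be worth checking that the font file exists before building the word cloud. The sketch below uses only the standard library. pick_font is a hypothetical helper, and apart from the first path (the one used in this article) the candidate locations are assumptions that may not exist on a given machine.

```python
import os


def pick_font(candidates):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# The first path is the one used in this article; the others are assumed
# locations of fonts with Chinese glyphs and may be absent on your system.
CANDIDATE_FONTS = [
    "C:\\Windows\\Fonts\\STFANGSO.ttf",
    "C:\\Windows\\Fonts\\simhei.ttf",
    "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc",
]

font_path = pick_font(CANDIDATE_FONTS)
if font_path is None:
    print("No candidate font found; Chinese text would render as boxes.")
else:
    print("Using font:", font_path)
```

The returned path can then be passed as font_path=... to the WordCloud constructor shown earlier.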

The final part is displaying the word cloud with matplotlib. This code is fairly fixed and rarely changes, so it is worth simply memorizing:

plt.imshow(wc2)
plt.axis("off")
plt.show()

Here, axis controls whether the coordinate axes are shown; we choose not to show them. The overall result looks like this:
[Figure: the finished word cloud]
