Web Scraping: Crawling People's Daily Online (人民網) Data to Generate a Word Cloud

阿新 • Published: 2018-12-13

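1. Using news data from People's Daily Online (人民網) as an example, this post briefly walks through scraping a site with Python and turning the result into a word cloud.

First, the requests library. You can think of it as the "crawler's hand": it goes to the page the user specifies and fetches the required content for later use.

requests can be downloaded with Python's pip: run pip install requests in a cmd window. The libraries used later are installed the same way. (It was already installed on my machine, so pip reported that it already exists.)

Once downloaded, the package is stored under the lib\site-packages directory.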
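Before moving on, it can help to confirm that every library used in this post is importable. The following is a minimal sketch using only the standard library; the helper name missing_modules and the module list are illustrative assumptions, and note that some pip packages are imported under a different name (beautifulsoup4 as bs4, Pillow as PIL).

```python
from importlib.util import find_spec

# Import names used in this post (assumed list; adjust to your own script).
# pip package "beautifulsoup4" is imported as "bs4", and "Pillow" as "PIL".
REQUIRED_MODULES = ["requests", "bs4", "jieba", "wordcloud",
                    "matplotlib", "numpy", "PIL"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Not installed yet:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Anything reported as missing can then be installed with pip in the same way as requests.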

2. Now we can use requests to fetch the page.

import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko)'
                         ' Chrome/55.0.2883.87 Safari/537.36'}  # the request header can be looked up in the browser for the page in question
url = 'http://www.people.com.cn/'
r = requests.get(url, headers=headers)

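requests.get() is called here with two arguments, url and headers.

url is the link of the page to scrape. The request header carries information about the request, the response, or other entities being sent. (If the header is missing, or does not match what the site expects from a real browser, the content may not be returned.)

Taking People's Daily Online as an example, here is how to obtain the headers.

Open http://www.people.com.cn/ (the People's Daily Online address) and press F12 to open the developer tools. Select the Network tab and refresh the page. Find our page in the list on the left and click it; the panel on the right then shows its headers. We need the user-agent entry: copy its value into the headers dictionary.

We assign the fetched response to r; printing r.text is then enough to view the page content in a Python IDE.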
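The request step above can be wrapped in a small helper that also checks the HTTP status and lets you override the text encoding. This is only a sketch under stated assumptions: the name fetch_html, the timeout value, and the injectable get parameter are hypothetical additions for illustration, not part of the original script.

```python
def fetch_html(url, headers, encoding=None, timeout=10, get=None):
    """Download a page and return its decoded text.

    `get` is a hypothetical injection point: it defaults to requests.get,
    but any callable accepting the same keyword arguments can be passed,
    which makes the helper easy to exercise without network access.
    """
    if get is None:
        import requests  # imported lazily so offline callers do not need it
        get = requests.get
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    if encoding is not None:
        # Override the encoding requests guessed, e.g. 'gb2312'
        response.encoding = encoding
    return response.text
```

With the headers and url defined earlier, fetch_html(url, headers, encoding='gb2312') would return the same text as r.text, with the encoding set the way the complete code at the end of this post sets it.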

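3. Not everything on the page is useful to us, so the content has to be parsed and the relevant parts extracted. Here we parse it with BeautifulSoup.

In this example we want the news headlines, so open People's Daily Online and view the page source.

The headlines all sit in ul lists with class="list14", so we use the find_all method to extract every list with class="list14".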

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, "html.parser")
for news_list in soup.find_all(class_="list14"):
    content = news_list.text.strip()  # text of one list, surrounding whitespace removed
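The complete code at the end of this post keeps a commented-out regular-expression alternative to BeautifulSoup. As a rough illustration of what that pattern captures, here is a sketch that runs it on a small made-up HTML fragment; the sample markup and the function name extract_titles are assumptions, and the real page is of course much larger and may be structured differently.

```python
import re

# Pattern taken from the commented-out line in the complete code: it captures
# the link text of every <li><a href=...>...</a></li> item.
TITLE_PATTERN = re.compile(r'<li><a href=.*?>(.*?)</a></li>')

# Made-up sample markup standing in for r.text.
SAMPLE_HTML = """
<ul class="list14">
<li><a href="/n1/a.html">Headline one</a></li>
<li><a href="/n1/b.html">Headline two</a></li>
</ul>
"""


def extract_titles(html):
    """Return the stripped link text of every matching list item."""
    return [title.strip() for title in TITLE_PATTERN.findall(html)]


if __name__ == "__main__":
    print(extract_titles(SAMPLE_HTML))
```

A regex like this is brittle compared with an HTML parser (it breaks if an item spans several lines or carries extra attributes on the li tag), which is one reason to prefer find_all for the actual extraction.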

4. For convenience, we save the scraped content to a txt file. Next, we read that file back and run the word-frequency analysis.
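The save-and-reload step could look roughly like the sketch below. The helper names save_titles and load_text are hypothetical, and writing one entry per line is an assumption made here so that neighbouring entries do not run together; opening the file with an explicit utf-8 encoding matches what the rest of this post does.

```python
import os
import tempfile


def save_titles(titles, path):
    """Write one stripped title per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as fp:
        for title in titles:
            fp.write(title.strip() + "\n")


def load_text(path):
    """Read the whole file back as a single string."""
    with open(path, encoding="utf-8") as fp:
        return fp.read()


if __name__ == "__main__":
    # A temporary directory keeps the demo from touching real files.
    demo_path = os.path.join(tempfile.mkdtemp(), "renming.txt")
    save_titles(["Headline one", "Headline two"], demo_path)
    print(load_text(demo_path))
```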

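The jieba library segments the text using its own dictionary, scores the resulting words, and returns them to the developer. The code below calls jieba.analyse.textrank, so the score attached to each word is a TextRank weight rather than a raw occurrence count.

We use a dictionary, keywords, to store the extracted keywords and their weights.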

import jieba.analyse

with open('renming.txt', encoding="utf-8") as fp:  # read the saved text
    coco = fp.read()
result = jieba.analyse.textrank(coco, topK=50, withWeight=True)  # top 50 keywords as (word, weight) pairs
keywords = dict()
for word, weight in result:
    keywords[word] = weight
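If plain occurrence counts are preferred over keyword weights, the counting itself needs nothing beyond the standard library. The sketch below assumes the text has already been split into tokens, for example with jieba.lcut; the function name top_frequencies and the min_length filter (used to drop single-character words) are illustrative assumptions.

```python
from collections import Counter


def top_frequencies(tokens, top_k=50, min_length=2):
    """Count how often each token occurs and keep the top_k most common.

    Tokens shorter than min_length characters are skipped, which removes
    most single-character function words and punctuation.
    """
    counts = Counter(t for t in tokens if len(t.strip()) >= min_length)
    return dict(counts.most_common(top_k))


if __name__ == "__main__":
    # Hand-written tokens standing in for the output of a segmenter.
    demo_tokens = ["经济", "发展", "的", "经济", "改革", "发展", "经济"]
    print(top_frequencies(demo_tokens, top_k=2))
```

The resulting dictionary has the same word-to-number shape as keywords, so it could be used in its place when drawing the cloud.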

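5. Readers who often plot charts with Python may already know the wordcloud library. It draws a word cloud from word frequencies and offers convenient settings for the font, colors, and so on.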

from wordcloud import WordCloud
# font_path must point to a font that contains Chinese glyphs
wc = WordCloud(font_path='./fonts/simhei.ttf', background_color='White', max_words=50)
wc.generate_from_frequencies(keywords)

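The word cloud is a plain rectangle by default. Its shape can be changed by loading an image, converting it with np.array, and passing the result as a mask (see the commented-out lines in the complete code for the details).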
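A mask does not have to come from an image file. As a minimal sketch, the array can also be built directly with numpy; the function name circular_mask and the default sizes are assumptions chosen for illustration. wordcloud treats pure-white (255) entries of the mask as areas where no words are drawn.

```python
import numpy as np


def circular_mask(size=300, radius=130):
    """Build a square uint8 array: 0 inside the circle, 255 (white) outside."""
    center = size // 2
    y, x = np.ogrid[:size, :size]
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    return (255 * outside).astype(np.uint8)


if __name__ == "__main__":
    mask = circular_mask()
    # Shape of the array, then the value at the center and at a corner
    print(mask.shape, mask[150, 150], mask[0, 0])
```

Passing the array as WordCloud(font_path='./fonts/simhei.ttf', background_color='White', max_words=50, mask=circular_mask()) should then confine the words to the circle.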

6. Finally, to keep the code clean and tidy, we wrap the steps above into functions. The complete code follows.

# coding:utf-8
import requests
from bs4 import BeautifulSoup
import jieba.analyse
import matplotlib.pyplot as plt
from PIL import Image  # only needed for the optional mask below
import numpy as np  # only needed for the optional mask below
from wordcloud import WordCloud, ImageColorGenerator


def spider(url, headers):
    """Fetch the page and write the text of every class="list14" block to renming.txt."""
    # The with block closes the file automatically, so no explicit fp.close() is needed
    with open('renming.txt', 'w', encoding='utf-8') as fp:
        r = requests.get(url, headers=headers)
        r.encoding = 'gb2312'  # decode the page as gb2312
        # test = re.findall('<li><a href=.*?>(.*?)</a></li>', r.text)  # alternative: parse with a regex (needs import re)
        # print(test)
        soup = BeautifulSoup(r.text, "html.parser")
        for news_list in soup.find_all(class_="list14"):
            content = news_list.text.strip()
            fp.write(content + '\n')  # newline keeps separate lists from running together


def analyse():
    """Extract keywords from renming.txt, draw the word cloud, and save it as dream.png."""
    with open('renming.txt', encoding="utf-8") as fp:  # read the saved text
        coco = fp.read()
    result = jieba.analyse.textrank(coco, topK=50, withWeight=True)  # top 50 keywords with weights
    keywords = dict()
    for word, weight in result:
        keywords[word] = weight
    print(keywords)
    wc = WordCloud(font_path='./fonts/simhei.ttf', background_color='White', max_words=50)
    wc.generate_from_frequencies(keywords)
    # Optional: shape the cloud with an image mask
    # image = Image.open('./reqiqiu.jpg')
    # graph = np.array(image)
    # wc = WordCloud(font_path='./fonts/simhei.ttf', background_color='White', max_words=50, mask=graph)
    # wc.generate_from_frequencies(keywords)
    # image_color = ImageColorGenerator(graph)
    plt.imshow(wc)
    plt.axis("off")
    plt.show()
    wc.to_file('dream.png')


if __name__ == "__main__":
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko)'
                             ' Chrome/55.0.2883.87 Safari/537.36'}  # the request header can be looked up in the browser for the page in question
    url = 'http://www.people.com.cn/'
    spider(url, headers)
    analyse()
