Python 3 web crawling: scraping Douban movie reviews with requests + MongoDB + wordcloud
Python version: Python 3.x
Environment: macOS
IDE: PyCharm
I. Introduction
I've spent the past few days tinkering with the wordcloud library and found it great fun: you can use a custom background image, pick all kinds of fonts, and control the word colors. Generating a word cloud is very satisfying. But where does the data come from? That's when Douban movie reviews came to mind.
Along the way I (somewhat pretentiously) tried a top-down design:
- Scrape 5,000 short reviews of a given movie from Douban;
- Store them in a database;
- Fetch and process the comments from the database to get word frequencies;
- Generate a word cloud from those frequencies.
Simple enough on paper; let's work through it one step at a time.
Before building, though, get the tools ready:
- Install MongoDB: there are plenty of guides online, with tutorials for macOS/Linux/Windows;
- The jieba segmentation library: see its GitHub page for installation and basic usage;
- The wordcloud library: see its GitHub page for installation and basic usage.
II. Scraping Douban Movie Reviews
1. Page Analysis
We'll use the movie 春宵苦短,少女前進吧! (夜は短し歩けよ乙女, "Night Is Short, Walk On Girl") as the example.
URL: https://movie.douban.com/subject/26935251/
Scroll down the page to the comments section and click "more short reviews" to jump to the comments page.
Notice how the URL changes:
https://movie.douban.com/subject/26935251/comments?sort=new_score&status=P
Its core is https://movie.douban.com/subject/26935251/comments; the trailing ?sort=new_score&status=P shows that the page takes its parameters via GET.
In Chrome, open Developer Tools from the toolbar (shortcut: command+alt+i).
You'll find that all of the comment data sits inside //div[@class="comment-item"] elements.
Now scroll to the bottom of the page, click "next page", and inspect the request in the Network tab.
There you can see the parameters being sent. Combined with what the page shows, it's easy to work out that start is the index of the first comment on the page and limit is how many comments the page displays. So in a loop we can simply update start = start + limit on each request, until a response comes back containing no comments at all.
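To make the paging rule concrete, here is a tiny sketch of the parameter sequence; the 5,000-comment cap and page size of 20 match the crawler below:

```python
def page_params(limit=20, cap=5000):
    """Yield the GET parameters for each comments page:
    start advances by limit until the cap is reached."""
    start = 0
    while start < cap:
        yield {'start': start, 'limit': limit, 'sort': 'new_score', 'status': 'P'}
        start += limit

pages = list(page_params())
# pages[0] requests comments 0-19, pages[1] comments 20-39, and so on
```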
2. Writing the Code
With that observation, we can build the crawler we want. The code is as follows:
import requests
from lxml import etree
import time


def get_comments(url, headers, start, max_restart_num, movie_name, collection):
    '''
    :param url: URL of the comments page
    :param headers: request headers
    :param start: index of the first comment to fetch
    :param max_restart_num: maximum number of retries after a failed request
    :param movie_name: movie title
    :param collection: MongoDB collection
    :return:
    '''
    if start >= 5000:
        print("5000 comments crawled, stopping")
        return
    data = {
        'start': start,
        'limit': 20,
        'sort': 'new_score',
        'status': 'P',
    }
    response = requests.get(url=url, headers=headers, params=data)
    tree = etree.HTML(response.text)
    comment_item = tree.xpath('//div[@id="comments"]/div[@class="comment-item"]')
    len_comments = len(comment_item)
    if len_comments > 0:
        for i in range(1, len_comments + 1):
            votes = tree.xpath('//div[@id="comments"]/div[@class="comment-item"][{}]//span[@class="votes"]'.format(i))
            commenters = tree.xpath(
                '//div[@id="comments"]/div[@class="comment-item"][{}]//span[@class="comment-info"]/a'.format(i))
            ratings = tree.xpath(
                '//div[@id="comments"]/div[@class="comment-item"][{}]//span[@class="comment-info"]/span[contains(@class,"rating")]/@title'.format(i))
            comments_time = tree.xpath(
                '//div[@id="comments"]/div[@class="comment-item"][{}]//span[@class="comment-info"]/span[@class="comment-time "]'.format(i))
            comments = tree.xpath(
                '//div[@id="comments"]/div[@class="comment-item"][{}]/div[@class="comment"]/p'.format(i))
            vote = votes[0].text.strip()
            commenter = commenters[0].text.strip()
            try:
                rating = str(ratings[0])
            except IndexError:  # the reviewer left no star rating
                rating = 'null'
            comment_time = comments_time[0].text.strip()
            comment = comments[0].text.strip()
            comment_dict = {
                'vote': vote,
                'commenter': commenter,
                'rating': rating,
                'comments_time': comment_time,
                'comments': comment,
                'movie_name': movie_name,
            }
            # store into the database -- left commented out at this stage
            print("Saving comment #{}".format(start + i))
            print(comment_dict)
            # collection.update({'commenter': comment_dict['commenter']}, {'$setOnInsert': comment_dict}, upsert=True)
        headers['Referer'] = response.url
        start += 20
        data['start'] = start
        time.sleep(5)
        return get_comments(url, headers, start, max_restart_num, movie_name, collection)
    else:
        if max_restart_num > 0:
            if response.status_code != 200:
                print("fail to crawl, waiting 10s before retrying...")
                time.sleep(10)
                # headers['User-Agent'] = Headers.getUA()
                return get_comments(url, headers, start, max_restart_num - 1, movie_name, collection)
            else:
                print("finished crawling")
                return
        else:
            print("max_restart_num has run out")
            with open('log.txt', "a") as fp:
                fp.write('\n{}--latest start:{}'.format(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())), start))
            return


if __name__ == '__main__':
    base_url = 'https://movie.douban.com/subject/26935251'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'Upgrade-Insecure-Requests': '1',
        'Connection': 'keep-alive',
        'Host': 'movie.douban.com',
    }
    start = 0
    response = requests.get(base_url, headers=headers)  # headers must be passed as a keyword argument
    tree = etree.HTML(response.text)
    movie_name = tree.xpath('//div[@id="content"]/h1/span')[0].text.strip()
    url = base_url + '/comments'
    try:
        get_comments(url, headers, start, 5, movie_name, None)
    finally:
        pass
Run this program and it keeps printing the comments it fetches to the console. But after only 200-odd comments, the following message appears:
This happens because we aren't logged in. There are two ways around it:
1. Simulate the login to Douban in code and save the cookies.
2. Log in through the browser, capture the request, and copy the logged-in cookie value straight into our request headers.
For convenience I chose the second option and modified headers:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Cookie': '**********************',  # fill in the cookie from your own account
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Host': 'movie.douban.com',
}
To verify this works, change start in the program above to 220, the value where the last run stopped, and run it again: the comments flow once more. With that, the core of the crawler is done. All that's left is database storage, Chinese word segmentation, frequency filtering and counting, and drawing the word cloud, and the task is complete.
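As a side note, if you would rather hand the captured cookie to requests via its `cookies=` argument instead of a raw `Cookie` header, the standard library can parse the copied header string into a dict. A small sketch (the cookie names and values here are made up for illustration):

```python
from http.cookies import SimpleCookie

# stand-in for the real value copied from the browser's request headers
raw = 'bid=abc123; ck=xyz; push_noty_num=0'
cookie = SimpleCookie()
cookie.load(raw)
cookies = {key: morsel.value for key, morsel in cookie.items()}
# requests.get(url, headers=headers, cookies=cookies) would then send them
```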
III. Adding the Database
To drive MongoDB from Python I use the pymongo library. Detailed tutorials are easy to find online; here I'll only show the calls we need.

# Connecting to the database
client = MongoClient('localhost', 27017)  # connect to MongoDB; the arguments are host and port
db = client.douban  # equivalent to `use douban`: switch to the douban database
db.authenticate('douban_sa', 'sa')  # authenticate; skip this step if mongod was not started with --auth
collection = db.movie_comments  # pick the collection inside the database
# Writing to the database
collection.update({'commenter': comment_dict['commenter']}, {'$setOnInsert': comment_dict}, upsert=True)  # in short: insert if absent, leave untouched if present
# Concretely: try to upsert comment_dict; if a document with commenter == comment_dict['commenter'] already exists, nothing is changed; otherwise the document is inserted
# '$setOnInsert' works together with upsert=True: its fields are only applied when the upsert actually inserts
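Note that newer pymongo versions deprecate `update` in favour of `update_one`, which takes the same filter/update/upsert arguments. The `$setOnInsert` semantics can be illustrated with a small in-memory stand-in, so no running MongoDB is needed (the helper name and sample data are mine, not pymongo's):

```python
def upsert_set_on_insert(docs, filt, fields):
    """Mimic update_one(filt, {'$setOnInsert': fields}, upsert=True)
    against a plain list of dicts: insert only when no document
    matches the filter; never modify an existing match."""
    for doc in docs:
        if all(doc.get(k) == v for k, v in filt.items()):
            return False  # matched: existing document left untouched
    docs.append(dict(fields))
    return True  # no match: document inserted

comments = []
upsert_set_on_insert(comments, {'commenter': 'alice'}, {'commenter': 'alice', 'rating': '力薦'})
# a second upsert for the same commenter changes nothing
upsert_set_on_insert(comments, {'commenter': 'alice'}, {'commenter': 'alice', 'rating': '推薦'})
```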
IV. Chinese Word Segmentation
For segmentation I use jieba. Its official tagline, "the best Chinese word segmentation module", is quite a claim, but it doesn't feel far-fetched: in my experience it is simple, lightweight, and highly extensible.
def get_words_frequency(collection, stop_set):
    '''
    Segment the comments and return word frequencies
    :param collection: MongoDB collection
    :param stop_set: set of stopwords
    :return: dict mapping word -> count
    '''
    array = collection.find({"movie_name": "春宵苦短,少女前進吧! 夜は短し歩けよ乙女", "rating": {"$in": ['力薦', '推薦']}}, {"comments": 1})
    num = 0
    words_list = []
    for doc in array:
        num += 1
        comment = doc['comments']
        t_list = jieba.lcut(str(comment), cut_all=False)
        for word in t_list:  # keep a word only if it is not a stopword and is 2-4 characters long
            if word not in stop_set and 5 > len(word) > 1:
                words_list.append(word)
    words_dict = dict(Counter(words_list))
    return words_dict
def classify_frequenc(word_dict, minment=5):
    '''
    Filter the frequency table: drop every word that appears fewer than minment times, giving a more meaningful count
    :param word_dict: word -> count dict
    :param minment: minimum number of occurrences to keep
    :return: filtered dict
    '''
    num = minment - 1
    result = {k: v for k, v in word_dict.items() if v > num}
    return result
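The two steps above, counting with `collections.Counter` and then dropping rare words, can be checked in isolation. A minimal sketch, using a whitespace split as a stand-in for jieba so it runs without the library (the sample comments are made up):

```python
from collections import Counter

comments = ["good movie good animation", "good story", "odd pacing"]
words = [w for text in comments for w in text.split()]
freq = dict(Counter(words))   # count every token
filtered = {k: v for k, v in freq.items() if v >= 2}  # keep words seen at least twice
```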
def load_stopwords_set(stopwords_path):
    '''
    Load the stopword set
    :param stopwords_path: path to the stopword file
    :return: a set of stopwords
    '''
    stop_set = set()
    with open(str(stopwords_path), 'r') as fp:
        line = fp.readline()
        while line is not None and line != "":
            stop_set.add(line.strip())
            line = fp.readline()
    return stop_set
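The stopword file is expected to hold one word per line. A quick round-trip through a temporary file shows the data shape, and also that a set comprehension does the same job as the readline loop (the file path and sample words here are only for illustration):

```python
import os
import tempfile

# write a tiny stopword file: one word per line
path = os.path.join(tempfile.mkdtemp(), 'stopwords.txt')
with open(path, 'w') as fp:
    fp.write('的\n了\n是\n')

# load it back as a set, skipping blank lines
with open(path) as fp:
    stop_set = {line.strip() for line in fp if line.strip()}
```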
V. Word Cloud Generation
Wanting something simple, powerful, and easy to use, I chose the wordcloud library for drawing the cloud.
def get_wordcloud(freq_dict, title, save=False):
    '''
    :param freq_dict: word-frequency dict
    :param title: title (the movie name)
    :param save: whether to save the image locally
    :return:
    '''
    # Word cloud settings
    mask_color_path = "bg_1.png"  # path to the background image
    font_path = '*****'  # path to a Chinese font for matplotlib; it differs per OS, e.g. on macOS: '/Library/Fonts/華文黑體.ttf'
    imgname1 = "color_by_defualut.png"  # output image 1 (shaped by the background image only)
    imgname2 = "color_by_img.png"  # output image 2 (colored according to the background image)
    width = 1000
    height = 860
    margin = 2
    # Load the background image
    mask_coloring = imread(mask_color_path)
    # Configure WordCloud
    wc = WordCloud(font_path=font_path,  # font
                   background_color="white",  # background color
                   max_words=150,  # maximum number of words shown
                   mask=mask_coloring,  # background mask image
                   max_font_size=200,  # maximum font size
                   # random_state=42,
                   width=width, height=height, margin=margin,  # default image size; when a mask image is used, the saved image follows the mask's size; margin is the spacing around words
                   )
    # Generate the cloud from the frequencies
    wc.generate_from_frequencies(freq_dict)
    bg_color = ImageColorGenerator(mask_coloring)
    # Recolor the words based on the background image
    wc.recolor(color_func=bg_color)
    # Custom font for the matplotlib title (font file found among the system's Chinese fonts)
    myfont = FontProperties(fname=font_path)
    plt.figure()
    plt.title(title, fontproperties=myfont)
    plt.imshow(wc)
    plt.axis("off")
    plt.show()
    if save is True:  # save to disk
        wc.to_file(imgname2)
VI. Putting It All Together
import requests
import time
from lxml import etree
from all_headers import Headers  # the author's own helper module for rotating User-Agent strings
from pymongo import MongoClient
import jieba
from collections import Counter
from wordcloud import WordCloud, ImageColorGenerator
from scipy.misc import imread  # removed in SciPy >= 1.2; imageio.imread is a drop-in replacement
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties


def get_comments(url, headers, start, max_restart_num, movie_name, collection):
    if start >= 5000:
        print("5000 comments crawled, stopping")
        return
    data = {
        'start': start,
        'limit': 20,
        'sort': 'new_score',
        'status': 'P',
    }
    response = requests.get(url=url, headers=headers, params=data)
    tree = etree.HTML(response.text)
    comment_item = tree.xpath('//div[@id="comments"]/div[@class="comment-item"]')
    len_comments = len(comment_item)
    if len_comments > 0:
        for i in range(1, len_comments + 1):
            votes = tree.xpath('//div[@id="comments"]/div[@class="comment-item"][{}]//span[@class="votes"]'.format(i))
            commenters = tree.xpath(
                '//div[@id="comments"]/div[@class="comment-item"][{}]//span[@class="comment-info"]/a'.format(i))
            ratings = tree.xpath(
                '//div[@id="comments"]/div[@class="comment-item"][{}]//span[@class="comment-info"]/span[contains(@class,"rating")]/@title'.format(i))
            comments_time = tree.xpath(
                '//div[@id="comments"]/div[@class="comment-item"][{}]//span[@class="comment-info"]/span[@class="comment-time "]'.format(i))
            comments = tree.xpath(
                '//div[@id="comments"]/div[@class="comment-item"][{}]/div[@class="comment"]/p'.format(i))
            vote = votes[0].text.strip()
            commenter = commenters[0].text.strip()
            try:
                rating = str(ratings[0])
            except IndexError:  # the reviewer left no star rating
                rating = 'null'
            comment_time = comments_time[0].text.strip()
            comment = comments[0].text.strip()
            comment_dict = {
                'vote': vote,
                'commenter': commenter,
                'rating': rating,
                'comments_time': comment_time,
                'comments': comment,
                'movie_name': movie_name,
            }
            # store in the database: insert only if this commenter is not there yet
            print("Saving comment #{}".format(start + i))
            print(comment_dict)
            collection.update({'commenter': comment_dict['commenter']}, {'$setOnInsert': comment_dict}, upsert=True)
        headers['Referer'] = response.url
        start += 20
        data['start'] = start
        time.sleep(5)
        return get_comments(url, headers, start, max_restart_num, movie_name, collection)
    else:
        if max_restart_num > 0:
            if response.status_code != 200:
                print("fail to crawl, waiting 10s before retrying...")
                time.sleep(10)
                headers['User-Agent'] = Headers.getUA()
                print(start)
                return get_comments(url, headers, start, max_restart_num - 1, movie_name, collection)
            else:
                print("finished crawling")
                return
        else:
            print("max_restart_num has run out")
            with open('log.txt', "a") as fp:
                fp.write('\n{}--latest start:{}'.format(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())), start))
            return


def get_words_frequency(collection, stop_set):
    '''
    Segment the comments and return word frequencies
    :param collection: MongoDB collection
    :param stop_set: set of stopwords
    :return: dict mapping word -> count
    '''
    # array = collection.find({"movie_name": "春宵苦短,少女前進吧! 夜は短し歩けよ乙女", "rating": {"$in": ['力薦', '推薦']}}, {"comments": 1})
    array = collection.find({"movie_name": "春宵苦短,少女前進吧! 夜は短し歩けよ乙女", "$or": [{'rating': '力薦'}, {'rating': '推薦'}]}, {"comments": 1})
    num = 0
    words_list = []
    for doc in array:
        num += 1
        comment = doc['comments']
        t_list = jieba.lcut(str(comment), cut_all=False)
        for word in t_list:
            if word not in stop_set and 5 > len(word) > 1:
                words_list.append(word)
    words_dict = dict(Counter(words_list))
    return words_dict


def classify_frequenc(word_dict, minment=5):
    num = minment - 1
    result = {k: v for k, v in word_dict.items() if v > num}
    return result


def load_stopwords_set(stopwords_path):
    stop_set = set()
    with open(str(stopwords_path), 'r') as fp:
        line = fp.readline()
        while line is not None and line != "":
            stop_set.add(line.strip())
            line = fp.readline()
    return stop_set


def get_wordcloud(freq_dict, title, save=False):
    '''
    :param freq_dict: word-frequency dict
    :param title: title (the movie name)
    :param save: whether to save the image locally
    :return:
    '''
    # Word cloud settings
    mask_color_path = "bg_1.png"  # path to the background image
    font_path = '/Library/Fonts/華文黑體.ttf'  # path to a Chinese font for matplotlib
    imgname1 = "color_by_defualut.png"  # output image 1 (shaped by the background image only)
    imgname2 = "color_by_img.png"  # output image 2 (colored according to the background image)
    width = 1000
    height = 860
    margin = 2
    # Load the background image
    mask_coloring = imread(mask_color_path)
    # Configure WordCloud
    wc = WordCloud(font_path=font_path,  # font
                   background_color="white",  # background color
                   max_words=150,  # maximum number of words shown
                   mask=mask_coloring,  # background mask image
                   max_font_size=200,  # maximum font size
                   # random_state=42,
                   width=width, height=height, margin=margin,  # default image size; when a mask image is used, the saved image follows the mask's size; margin is the spacing around words
                   )
    # Generate the cloud from the frequencies
    wc.generate_from_frequencies(freq_dict)
    bg_color = ImageColorGenerator(mask_coloring)
    # Recolor the words based on the background image
    wc.recolor(color_func=bg_color)
    # Custom font for the matplotlib title
    myfont = FontProperties(fname=font_path)
    plt.figure()
    plt.title(title, fontproperties=myfont)
    plt.imshow(wc)
    plt.axis("off")
    plt.show()
    if save is True:  # save to disk
        wc.to_file(imgname2)


if __name__ == '__main__':
    base_url = 'https://movie.douban.com/subject/26935251'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'Upgrade-Insecure-Requests': '1',
        'Cookie': '******',  # fill in the cookie from your own logged-in account
        'Connection': 'keep-alive',
        'Host': 'movie.douban.com',
    }
    start = 0
    response = requests.get(base_url, headers=headers)
    tree = etree.HTML(response.text)
    movie_name = tree.xpath('//div[@id="content"]/h1/span')[0].text.strip()
    url = base_url + '/comments'
    stopwords_path = 'stopwords.txt'
    stop_set = load_stopwords_set(stopwords_path)
    # connect to the database
    client = MongoClient('localhost', 27017)
    db = client.douban
    db.authenticate('douban_sa', 'sa')
    collection = db.movie_comments
    try:
        # crawl the comments and store them in the database
        get_comments(url, headers, start, 5, movie_name, collection)
        # fetch the comments back from the database and segment them
        frequency_dict = get_words_frequency(collection, stop_set)
        # filter the frequency table further
        cl_dict = classify_frequenc(frequency_dict, 5)
        # generate the word cloud from the frequencies
        get_wordcloud(cl_dict, movie_name)
    finally:
        client.close()
The background image bg_1.png is shown below:
And the resulting word cloud:
VII. Wrap-up
See? Not that hard. But getting it all working end to end still feels great.
GitHub repo: https://github.com/hylusst/requests_douban