[Python] Analyzing reviews of 《無名之輩》 (A Cool Fish) from 20,000 collected records

Python · Published 2018-12-05 14:06:00


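1. Overview

This article describes how to collect user reviews from Maoyan Movies (貓眼電影) and analyze them. The crawler can be reused to collect reviews for other movies.

Environment: Win10 / Python 3.5.

Analysis tools: jieba, wordcloud, pyecharts, matplotlib.

Basic workflow: download content ---> parse out the key data ---> save to a local file ---> analyze the local file and build charts

Note: all images, text, and source code in this article are for learning only. Do not use them for anything else, and credit the source when reposting!

Main reference: https://mp.weixin.qq.com/s/mTxxkwRZPgBiKC3Sv-jo3g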

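2. Collecting the data

2.1 Analyzing the data interface:

To get a more complete data sample, the data is collected directly from the mobile interface. The link is shown below. The number 1208282 in it (highlighted in orange in the original post) is the Maoyan movie ID; change it to crawl a different movie.

Link: http://m.maoyan.com/mmdb/comments/movie/1208282.json?v=yes&offset=15&startTime=

The interface returns JSON. The crawler mainly collects the nickname, city, comment, score, and time; the user comments are in json['cmts'].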
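As a minimal sketch of how the link above is put together, the snippet below builds the URL from a movie ID, a page size, and a start time. The helper name build_comment_url and the constant BASE are hypothetical; only the URL pattern and the query parameters (v, offset, startTime) come from the link itself.

```python
from datetime import datetime
from urllib.parse import quote, urlencode

# URL pattern taken from the link above; BASE is a hypothetical name
BASE = "http://m.maoyan.com/mmdb/comments/movie/{movie_id}.json"


def build_comment_url(movie_id, start_time, offset=15):
    """Return the comment URL for one request (hypothetical helper)."""
    query = urlencode(
        {
            "v": "yes",
            "offset": offset,
            "startTime": start_time.strftime("%Y-%m-%d %H:%M:%S"),
        },
        quote_via=quote,  # encode the space as %20 rather than '+'
    )
    return BASE.format(movie_id=movie_id) + "?" + query


url = build_comment_url(1208282, datetime(2018, 12, 5))
print(url)
```

Changing the first argument is all it takes to point the crawler at another movie.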

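2.2 Core of the crawler (see the source code at the end for details):

> The script takes the following arguments (script name + Maoyan movie ID + release date + name of the output file): .\myMovieComment.py 1208282 2018-11-16 myCmts2.txt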
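The three positional arguments could also be handled with argparse, which validates the date and prints a usage message for free. This is only a sketch under assumptions: the function names parse_args and valid_date and the argument names movie_id, release_date, and filename are hypothetical, and the article's own script reads sys.argv directly.

```python
import argparse
from datetime import datetime


def valid_date(text):
    """Parse a YYYY-MM-DD string; argparse reports a ValueError as a usage error."""
    return datetime.strptime(text, "%Y-%m-%d")


def parse_args(argv=None):
    """Parse the three positional arguments described above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Maoyan review crawler (sketch)")
    parser.add_argument("movie_id", help="Maoyan movie ID, e.g. 1208282")
    parser.add_argument("release_date", type=valid_date, help="release date, YYYY-MM-DD")
    parser.add_argument("filename", help="name of the output file")
    return parser.parse_args(argv)


args = parse_args(["1208282", "2018-11-16", "myCmts2.txt"])
print(args.movie_id, args.release_date, args.filename)
```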

> Downloading the content: download(self, url) fetches the URL with Python's requests module and converts the response to JSON.

def download(self, url):
    """Download the content at url and return it as parsed JSON."""

    print("Downloading URL: " + url)
    response = requests.get(url, headers=self.headers)

    # Parse the body as JSON only when the request succeeded
    if response.status_code == 200:
        return response.json()
    else:
        print('No data downloaded! Status code:', response.status_code)
        return ""

> Next, parse the downloaded content and pull out the fields we need:

def parse(self, content):
    """Extract the fields we need from the downloaded JSON."""

    comments = []
    try:
        for item in content['cmts']:
            comment = {
                'nickName': item['nickName'],    # nickname
                'cityName': item['cityName'],    # city
                'content': item['content'],      # comment text
                'score': item['score'],          # score
                'startTime': item['startTime'],  # time
            }
            comments.append(comment)

    except Exception as e:
        # Keep whatever was parsed before the error
        print(e)

    return comments

> Save the parsed data locally so it is ready for the analysis that follows:

def save(self, data):
    """Write data to the output file."""

    print("Saving data, writing to file...")
    self.save_file.write(data)

> The crawler's core control method, which is also the program's entry point, runs the methods above in order:

def start(self):
    """Main control loop."""

    print("Crawler started...\r\n")

    start_time = self.start_time
    end_time = self.end_time

    num = 1
    while start_time > end_time:
        print("Iteration:", num)
        # 1. Download the content
        content = self.download(self.target_url + str(start_time))

        # 2. Parse out the key data
        comments = []
        if content != "":
            comments = self.parse(content)

        if len(comments) <= 0:
            print("No data in this batch, stopping the crawl!\r\n")
            break

        # 3. Write to the file
        res = ''
        for cmt in comments:
            res += "%s###%s###%s###%s###%s\n" % (cmt['nickName'], cmt['cityName'], cmt['content'], cmt['score'], cmt['startTime'])
        self.save(res)

        print("Records in this batch: %s\r\n" % len(comments))

        # Take the time of the last record and subtract one second
        start_time = datetime.strptime(comments[-1]['startTime'], "%Y-%m-%d %H:%M:%S") + timedelta(seconds=-1)

        # Sleep for 3 s
        num += 1
        time.sleep(3)

    self.save_file.close()
    print("Crawler finished...")
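The loop above pages backwards through time: each request starts one second before the last record of the previous batch. The sketch below isolates that idea so it can run offline. It assumes the interface returns records newest first with times no later than startTime; fake_fetch and the FAKE records are hypothetical stand-ins for the remote interface.

```python
from datetime import datetime, timedelta

TIME_FMT = "%Y-%m-%d %H:%M:%S"


def next_start_time(comments):
    """Return the next cursor: one second before the last record's time."""
    last = datetime.strptime(comments[-1]["startTime"], TIME_FMT)
    return last - timedelta(seconds=1)


def crawl(fetch_page, start_time, end_time):
    """Collect batches until one comes back empty or the cursor reaches end_time."""
    collected = []
    while start_time > end_time:
        page = fetch_page(start_time)
        if not page:
            break
        collected.extend(page)
        start_time = next_start_time(page)
    return collected


# Hypothetical in-memory records, newest first, standing in for the remote interface
FAKE = [{"startTime": "2018-11-18 10:00:0%d" % i} for i in range(5, 0, -1)]


def fake_fetch(start_time, page_size=2):
    """Return up to page_size records whose time is not later than start_time."""
    older = [c for c in FAKE if datetime.strptime(c["startTime"], TIME_FMT) <= start_time]
    return older[:page_size]


result = crawl(fake_fetch, datetime(2018, 11, 19), datetime(2018, 11, 16))
print(len(result))
```

One consequence of stepping back by a whole second: if several comments share the timestamp of the last record in a batch, the ones that did not fit in that batch may be skipped.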

2.3 Data sample: the crawl ended up with nearly 20,000 records. The fields of each record are separated by ###.
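Because a comment can contain line breaks, one record may span several lines of the file. The following sketch shows one way to read such a file back into dictionaries. The helper read_records and the sample rows are hypothetical; the field order matches what the crawler writes. It assumes the comment text never contains ### itself.

```python
import io

# Field order used when the crawler writes a record
FIELDS = ("nickName", "cityName", "content", "score", "startTime")


def read_records(lines):
    """Yield one dict per record; a record may span several physical lines."""
    buffer = ""
    for line in lines:
        buffer += line
        parts = buffer.split("###")
        if len(parts) < len(FIELDS):
            continue  # the comment contained a newline; keep accumulating
        values = (p.rstrip("\n") for p in parts[:len(FIELDS)])
        yield dict(zip(FIELDS, values))
        buffer = ""


# Hypothetical sample rows; the second comment contains a line break
sample = io.StringIO(
    "Fly###深圳###Great film###5###2018-11-18 10:00:05\n"
    "Tom###黔南###Funny,\nbut sad###4.5###2018-11-18 09:59:00\n"
)
records = list(read_records(sample))
print(len(records), repr(records[1]["content"]))
```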

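3. Visualizing the data

3.1 Heat map of audience cities (pyecharts-geo):

The chart makes it easy to see where the users are: they are concentrated mainly in the developed city clusters along the coast.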

def createCharts(self):
    """Generate the charts."""

    # Load the data, format: [("北京", 10), ("上海", 10)]
    data = self.readCityNum()

    # 1. Heat map
    geo1 = Geo("《無名之輩》觀眾位置分佈熱點圖", "資料來源：貓眼，Fly採集", title_color="#FFF", title_pos="center", width="100%", height=600, background_color="#404A59")

    attr1, value1 = geo1.cast(data)

    geo1.add("", attr1, value1, type="heatmap", visual_range=[0, 1000], visual_text_color="#FFF", symbol_size=15, is_visualmap=True, is_piecewise=False, visual_split_number=10)
    geo1.render("files/無名之輩-觀眾位置熱點圖.html")

    # 2. Location map
    geo2 = Geo("《無名之輩》觀眾位置分佈", "資料來源：貓眼，Fly採集", title_color="#FFF", title_pos="center", width="100%", height=600,
               background_color="#404A59")

    attr2, value2 = geo2.cast(data)
    geo2.add("", attr2, value2, visual_range=[0, 1000], visual_text_color="#FFF", symbol_size=15,
             is_visualmap=True, is_piecewise=False, visual_split_number=10)
    geo2.render("files/無名之輩-觀眾位置圖.html")

    # 3. Top 20 bar chart
    data_top20 = data[:20]
    bar = Bar("《無名之輩》觀眾來源排行 TOP20", "資料來源：貓眼，Fly採集", title_pos="center", width="100%", height=600)
    attr, value = bar.cast(data_top20)
    bar.add('', attr, value, is_visualmap=True, visual_range=[0, 3500], visual_text_color="#FFF", is_more_utils=True, is_label_show=True)
    bar.render("files/無名之輩-觀眾來源top20.html")

    print("Charts generated")
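The charts above consume a list of (city, count) tuples sorted from largest to smallest. A minimal sketch of producing that list with collections.Counter follows. The function count_cities and the sample records are hypothetical; the article's own readCityNum builds the same structure with a plain dict.

```python
from collections import Counter


def count_cities(records):
    """Return [(city, count), ...], largest count first, skipping empty city names."""
    counts = Counter(r["cityName"] for r in records if r["cityName"])
    return counts.most_common()


# Hypothetical records; only the cityName field matters here
records = [
    {"cityName": "深圳"},
    {"cityName": "北京"},
    {"cityName": "深圳"},
    {"cityName": ""},
]
data = count_cities(records)
print(data[:20])  # the slice used for the top 20 bar chart
```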

3.2 Bar chart of the top 20 cities by audience size (pyecharts-bar):

3.3 Comment word cloud (jieba, wordcloud):

Core code for generating the word cloud:

def createWordCloud(self):
    """Generate the comment word cloud."""
    comments = self.readAllComments()  # 19185 comments

    # Segment with jieba; WordCloud splits words on whitespace,
    # so the tokens must be joined with spaces
    comments_split = jieba.cut(" ".join(comments), cut_all=False)
    words = ' '.join(comments_split)

    # Add extra stop words
    stopwords = STOPWORDS.copy()
    for word in ("電影", "一部", "無名之輩", "一個", "有點", "覺得"):
        stopwords.add(word)

    # Load the background image
    bg_image = plt.imread("files/2048_bg.png")

    # Initialize WordCloud
    wc = WordCloud(width=1200, height=600, background_color='#FFF', mask=bg_image, font_path='C:/Windows/Fonts/STFANGSO.ttf', stopwords=stopwords, max_font_size=400, random_state=50)

    # Generate and display the image
    wc.generate_from_text(words)
    plt.imshow(wc)
    plt.axis('off')
    plt.show()
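To check which words will dominate the cloud before rendering it, the segmented tokens can be filtered and counted by hand. The sketch below assumes the text has already been segmented (for example by jieba.cut) and uses a hypothetical token list so that it runs without third-party packages; top_words is a hypothetical helper name.

```python
from collections import Counter

# The extra stop words added in the code above
EXTRA_STOPWORDS = {"電影", "一部", "無名之輩", "一個", "有點", "覺得"}


def top_words(tokens, stopwords=EXTRA_STOPWORDS, n=10):
    """Count tokens, dropping stop words and single-character tokens."""
    cleaned = (t.strip() for t in tokens)
    kept = [t for t in cleaned if len(t) > 1 and t not in stopwords]
    return Counter(kept).most_common(n)


# Hypothetical tokens, as a segmenter might produce them
tokens = ["電影", "好看", "演員", "演技", "好看", "，", "覺得", "感動", "好看", "演技"]
print(top_words(tokens, n=2))
```

A word-to-count dict built this way can also be handed to WordCloud.generate_from_frequencies instead of passing raw text.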

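4. Modifying the pyecharts source

4.1 Abbreviated city names in the sample do not match the full city names in the dataset:

When drawing the location heat map, many of the collected city names are abbreviations that do not match the city names in pyecharts' built-in data, so I modified the source to make abbreviations easier to match. For example:

黔南 => 黔南布依族苗族自治州

The module's built-in coordinates for the major cities and counties in China are in: [Python install path]\Lib\site-packages\pyecharts\datasets\city_coordinates.json

By default, a city name that does not match exactly raises an exception and the program stops (the error comes from Geo.add()), so I changed the source as follows. The part marked with comments is my own modification: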

def get_coordinate(self, name, region="中國", raise_exception=False):
    """
    Return coordinate for the city name.

    :param name: City name or any custom name string.
    :param raise_exception: Whether to raise exception if not exist.
    :return: A list like [longitude, latitude] or None
    """
    if name in self._coordinates:
        return self._coordinates[name]

    coordinate = get_coordinate(name, region=region)

    # [ added 2018-12-04
    # print(name, coordinate)
    if coordinate is None:
        # If the dictionary key does not match, fall back to a fuzzy search
        search_res = search_coordinates_by_region_and_keyword(region, name)
        # print("###", search_res)
        if search_res:
            coordinate = sorted(search_res.values())[0]
    # added 2018-12-04 ]

    if coordinate is None and raise_exception:
        raise ValueError("No coordinate is specified for {}".format(name))

    return coordinate
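The fallback idea can be illustrated without pyecharts: try an exact key first, then accept a full name that contains the abbreviation. In this sketch find_coordinate is a hypothetical helper and the coordinate values are placeholders, not real locations. Unlike the patch above, which takes the first of the sorted coordinate values, it picks the alphabetically first matching name.

```python
def find_coordinate(name, coordinates):
    """Exact lookup first, then fall back to the first key that contains name."""
    if name in coordinates:
        return coordinates[name]
    matches = sorted(key for key in coordinates if name in key)
    return coordinates[matches[0]] if matches else None


# Placeholder coordinates, for illustration only
COORDS = {
    "黔南布依族苗族自治州": [1.0, 2.0],
    "北京": [3.0, 4.0],
}
print(find_coordinate("黔南", COORDS))
print(find_coordinate("火星", COORDS))
```

Returning None for an unknown name, rather than raising, lets the caller skip that city and keep drawing the rest.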

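The __add() method needs a corresponding change:

5. Appendix: source code

*Note: I wrote this source code myself and the data comes from Maoyan. Everything here is for learning only and must not be used for other purposes! Credit the source when reposting!

5.1 Crawler source code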


# -*- coding:utf-8 -*-

import requests
from datetime import datetime, timedelta
import os
import time
import sys


class MaoyanFilmReviewSpider:
    """Maoyan movie review crawler."""

    def __init__(self, url, end_time, filename):
        # Request headers (mobile User-Agent)
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'
        }

        # Target URL
        self.target_url = url

        # Time window to crawl: start_time is the cutoff date, end_time is the release date
        now = datetime.now()

        # Midnight of the current day
        self.start_time = now.replace(hour=0, minute=0, second=0, microsecond=0)
        self.end_time = datetime.strptime(end_time, "%Y-%m-%d %H:%M:%S")

        # Create the output directory and open the output file
        self.save_path = "files/"
        if not os.path.exists(self.save_path):
            os.makedirs(self.save_path)
        self.save_file = open(self.save_path + filename, "a", encoding="utf-8")

    def download(self, url):
        """Download the content at url and return it as parsed JSON."""

        print("Downloading URL: " + url)
        response = requests.get(url, headers=self.headers)

        # Parse the body as JSON only when the request succeeded
        if response.status_code == 200:
            return response.json()
        else:
            print('No data downloaded! Status code:', response.status_code)
            return ""

    def parse(self, content):
        """Extract the fields we need from the downloaded JSON."""

        comments = []
        try:
            for item in content['cmts']:
                comment = {
                    'nickName': item['nickName'],    # nickname
                    'cityName': item['cityName'],    # city
                    'content': item['content'],      # comment text
                    'score': item['score'],          # score
                    'startTime': item['startTime'],  # time
                }
                comments.append(comment)

        except Exception as e:
            # Keep whatever was parsed before the error
            print(e)

        return comments

    def save(self, data):
        """Write data to the output file."""

        print("Saving data, writing to file...")
        self.save_file.write(data)

    def start(self):
        """Main control loop."""

        print("Crawler started...\r\n")

        start_time = self.start_time
        end_time = self.end_time

        num = 1
        while start_time > end_time:
            print("Iteration:", num)
            # 1. Download the content
            content = self.download(self.target_url + str(start_time))

            # 2. Parse out the key data
            comments = []
            if content != "":
                comments = self.parse(content)

            if len(comments) <= 0:
                print("No data in this batch, stopping the crawl!\r\n")
                break

            # 3. Write to the file
            res = ''
            for cmt in comments:
                res += "%s###%s###%s###%s###%s\n" % (cmt['nickName'], cmt['cityName'], cmt['content'], cmt['score'], cmt['startTime'])
            self.save(res)

            print("Records in this batch: %s\r\n" % len(comments))

            # Take the time of the last record and subtract one second
            start_time = datetime.strptime(comments[-1]['startTime'], "%Y-%m-%d %H:%M:%S") + timedelta(seconds=-1)

            # Sleep for 3 s
            num += 1
            time.sleep(3)

        self.save_file.close()
        print("Crawler finished...")


if __name__ == "__main__":
    # Make sure all arguments were given
    if len(sys.argv) != 4:
        print("Please provide [movie ID], [release date] and [output filename], e.g. xxx.py 42962 2018-11-09 text.txt")
        sys.exit()

    # Maoyan movie ID
    mid = sys.argv[1]  # "1208282"  # "42964"
    # Release date of the movie
    end_time = sys.argv[2]  # "2018-11-16"  # "2018-11-09"
    # Number of records per request
    offset = 15
    # Output filename
    filename = sys.argv[3]

    spider = MaoyanFilmReviewSpider(url="http://m.maoyan.com/mmdb/comments/movie/%s.json?v=yes&offset=%d&startTime=" % (mid, offset), end_time="%s 00:00:00" % end_time, filename=filename)
    spider.start()

MaoyanFilmReviewSpider.py

5.2 Analysis and charting source code


# -*- coding:utf-8 -*-
from pyecharts import Geo, Bar  # pyecharts 0.5.x API
import jieba
from wordcloud import STOPWORDS, WordCloud
import matplotlib.pyplot as plt


class ACoolFishAnalysis:
    """A Cool Fish (《無名之輩》): data analysis."""

    def __init__(self):
        pass

    def readCityNum(self):
        """Count the number of viewers per city."""
        d = {}

        with open("files/myCmts2.txt", "r", encoding="utf-8") as f:
            row = f.readline()

            while row != "":
                arr = row.split('###')

                # A comment may contain newlines: keep reading until the record has 5 fields
                while len(arr) < 5:
                    more = f.readline()
                    if more == "":
                        break  # end of file: stop instead of looping forever
                    row += more
                    arr = row.split('###')

                # Count the viewers in each city
                if len(arr) >= 5:
                    d[arr[1]] = d.get(arr[1], 0) + 1

                row = f.readline()

        # Convert the dict to a list of tuples, skipping empty city names
        res = [(city, num) for city, num in d.items() if city != ""]

        # Sort by viewer count, descending
        res = sorted(res, key=lambda x: x[1], reverse=True)
        return res

    def readAllComments(self):
        """Read the text of every comment."""
        comments = []

        # Open the file and read the data
        with open("files/myCmts2.txt", "r", encoding="utf-8") as f:
            row = f.readline()

            while row != "":
                arr = row.split('###')

                # Each record has 5 fields; keep reading if a comment spans several lines
                while len(arr) < 5:
                    more = f.readline()
                    if more == "":
                        break  # end of file: stop instead of looping forever
                    row += more
                    arr = row.split('###')

                if len(arr) == 5:
                    comments.append(arr[2])

                row = f.readline()

        return comments

    def createCharts(self):
        """Generate the charts."""

        # Load the data, format: [("北京", 10), ("上海", 10)]
        data = self.readCityNum()

        # 1. Heat map
        geo1 = Geo("《無名之輩》觀眾位置分佈熱點圖", "資料來源：貓眼，Fly採集", title_color="#FFF", title_pos="center", width="100%", height=600, background_color="#404A59")

        attr1, value1 = geo1.cast(data)

        geo1.add("", attr1, value1, type="heatmap", visual_range=[0, 1000], visual_text_color="#FFF", symbol_size=15, is_visualmap=True, is_piecewise=False, visual_split_number=10)
        geo1.render("files/無名之輩-觀眾位置熱點圖.html")

        # 2. Location map
        geo2 = Geo("《無名之輩》觀眾位置分佈", "資料來源：貓眼，Fly採集", title_color="#FFF", title_pos="center", width="100%", height=600,
                   background_color="#404A59")

        attr2, value2 = geo2.cast(data)
        geo2.add("", attr2, value2, visual_range=[0, 1000], visual_text_color="#FFF", symbol_size=15,
                 is_visualmap=True, is_piecewise=False, visual_split_number=10)
        geo2.render("files/無名之輩-觀眾位置圖.html")

        # 3. Top 20 bar chart
        data_top20 = data[:20]
        bar = Bar("《無名之輩》觀眾來源排行 TOP20", "資料來源：貓眼，Fly採集", title_pos="center", width="100%", height=600)
        attr, value = bar.cast(data_top20)
        bar.add('', attr, value, is_visualmap=True, visual_range=[0, 3500], visual_text_color="#FFF", is_more_utils=True, is_label_show=True)
        bar.render("files/無名之輩-觀眾來源top20.html")

        print("Charts generated")

    def createWordCloud(self):
        """Generate the comment word cloud."""
        comments = self.readAllComments()  # 19185 comments

        # Segment with jieba; WordCloud splits words on whitespace,
        # so the tokens must be joined with spaces
        comments_split = jieba.cut(" ".join(comments), cut_all=False)
        words = ' '.join(comments_split)

        # Add extra stop words
        stopwords = STOPWORDS.copy()
        for word in ("電影", "一部", "無名之輩", "一個", "有點", "覺得"):
            stopwords.add(word)

        # Load the background image
        bg_image = plt.imread("files/2048_bg.png")

        # Initialize WordCloud
        wc = WordCloud(width=1200, height=600, background_color='#FFF', mask=bg_image, font_path='C:/Windows/Fonts/STFANGSO.ttf', stopwords=stopwords, max_font_size=400, random_state=50)

        # Generate and display the image
        wc.generate_from_text(words)
        plt.imshow(wc)
        plt.axis('off')
        plt.show()


if __name__ == "__main__":
    demo = ACoolFishAnalysis()
    demo.createWordCloud()
