Scraping Douban movie comments with Scrapy (including login), plus a word cloud and a bar chart

Axin • Published: 2019-01-17

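Since May of this year Douban no longer displays all short comments. Only the 500 hottest ones are shown, and after about 240 of them have been fetched, a visitor who is not logged in is prompted to log in.
So the crawler I wrote over the past few days covers both automatic login to Douban and batch-saving the scraped data into a MySQL database through pymysql.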

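After finishing this crawler I looked around the site again: under "all comments" you can still see comments with page-number pagination. By changing the URL and the scraping logic in the code below, those should be crawlable as well.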
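
As a rough sketch of that idea, page URLs can be generated by stepping the `start` offset. The query parameters are copied from the comments URL used in the spider below; whether the paginated view accepts exactly these parameters is an assumption, and `comment_page_urls` is a hypothetical helper, not part of the original project.

```python
from urllib.parse import urlencode

BASE = "https://movie.douban.com/subject/{subject_id}/comments"


def comment_page_urls(subject_id, total, page_size=20):
    """Yield one comments URL per page by stepping the `start` offset."""
    for start in range(0, total, page_size):
        query = urlencode({
            "start": start,
            "limit": page_size,
            "sort": "new_score",
            "status": "P",
        })
        yield BASE.format(subject_id=subject_id) + "?" + query


# Three pages of 20 comments each for the movie used in this post
urls = list(comment_page_urls(26336252, total=60))
```

Each URL could then be requested with the same `cookiejar` meta and headers that the spider uses.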

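This post uses the Scrapy framework on Python 3.x. It scrapes the hottest short comments on Mission: Impossible 6 (碟中諜6) posted before September 3.
The crawler's file structure:
[image not preserved]
A screenshot of the scraped data:
[image not preserved]

Here is the code:
dzd-content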

# -*- coding: utf-8 -*- 

import scrapy
from dzd.items import DzdItem
import time
import random
from faker import Factory
from urllib import parse

f = Factory.create()


class DzdContentSpider(scrapy.Spider):
    name = 'dzd'
    allowed_domains = ['movie.douban.com']
    # Login form data. Douban may change some of the other fields over time,
    # but the account and password fields stay the same.
    formdata = {'source': 'index_nav',
                # 'redir': 'https://www.douban.com',
                # 'login': '登入',
                'form_email': 'your account',
                'form_password': 'your password'}

    # Request headers
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Connection': 'keep-alive',
        # Random User-Agent from faker's Factory (picked once, when the class is defined)
        'User-Agent': f.user_agent()
    }

    # Override start_requests
    def start_requests(self):
        print('------crawl started--------')
        # Douban blocks heavy anonymous browsing and asks for login,
        # so every request from the very first one carries a cookie jar
        return [scrapy.Request(url='https://www.douban.com/accounts/login',
                               headers=self.headers,
                               meta={'cookiejar': 1},
                               callback=self.parse_login)]

    def parse_login(self, response):
        # Check whether the login page shows a captcha
        link = response.xpath('//img[@class="captcha_image"]/@src').extract_first()
        if link is not None:
            print('Copy the link:')
            print(link)
            captcha_solution = input('captcha-solution:')
            # parse_qs returns a list of values per key, so take the first one
            captcha_id = parse.parse_qs(parse.urlparse(link).query, True)['id'][0]
            # Extend the form data with the captcha id and the typed solution
            self.formdata['captcha-solution'] = captcha_solution
            self.formdata['captcha-id'] = captcha_id
        # Submit the login form directly with FormRequest
        return [scrapy.FormRequest.from_response(response,
                                                 formdata=self.formdata,
                                                 headers=self.headers,
                                                 meta={'cookiejar': response.meta['cookiejar']},
                                                 callback=self.after_login,
                                                 dont_filter=True
                                                 )]

    def after_login(self, response):
        print('Checking whether login succeeded...')
        # Put the URL of your own profile page here; it is used to check
        # whether the session is really logged in
        test_url = "https://www.douban.com/people/181618569/"
        if response.url == test_url:
            if response.status == 200:
                print('***************')
                print('Login succeeded')
                print('***************\n')

            else:
                print('***************')
                print('Login failed')
                print('***************\n')

        # Requesting test_url re-enters this callback so the check above can run.
        # On that second pass both requests below are duplicates, and Scrapy's
        # duplicate filter drops them, so this does not loop.
        yield scrapy.Request(test_url,
                             meta={'cookiejar': response.meta['cookiejar']},
                             headers=self.headers,
                             callback=self.after_login)
        # This is the URL to crawl once login has succeeded
        yield scrapy.Request(
            url='https://movie.douban.com/subject/26336252/comments?start=0&limit=20&sort=new_score&status=P&percent_type=',
            meta={'cookiejar': response.meta['cookiejar']},
            headers=self.headers,
            callback=self.parse)
    # Parse one page of comments; the structure should be self-explanatory.
    # There is no distributed setup or proxy pool here, so to avoid an IP ban
    # from too many requests in a short time, the spider pauses with time.sleep()
    # before each next page. Note that time.sleep() blocks Scrapy's event loop.
    def parse(self, response):
        if response.status == 200:
            comments_list = response.css('#comments div[class="comment-item"]')
            # print(len(comments_list))
            next_url = response.css('#paginator a[class="next"]::attr(href)').extract_first()
            for comments in comments_list:
                # Create a fresh item per comment so yielded items do not share state
                item = DzdItem()
                # The data-cid attribute is stored in the user_id field
                user_id = comments.css('::attr(data-cid)').extract_first()
                comment = comments.css('.comment p span.short::text').extract_first() or ''
                nick_name = comments.css('span[class = "comment-info"] a::text').extract_first()
                # With spaces removed, characters 7-8 of the class hold the score;
                # fall back to '' when the user left no rating
                rating_class = comments.css(
                    'span[class = "comment-info"] span:nth-child(3)::attr(class)').extract_first() or ''
                rating = rating_class.replace(' ', '')[7:9]
                comment_time = comments.css(
                    'span[class = "comment-info"] span:nth-child(4)::attr(title)').extract_first()

                item['user_id'] = user_id
                item['comment'] = comment.replace(' ', '')
                item['nick_name'] = nick_name
                item['rating'] = rating
                item['comment_time'] = comment_time

                yield item
            if next_url is not None:
                next_url = 'https://movie.douban.com/subject/26336252/comments' + next_url

                time.sleep(random.random() * 3)
                # The crawl runs in a logged-in state, so a bare Request(url, callback=...)
                # is not enough: pass along the cookie jar obtained at login
                yield scrapy.Request(url=next_url,
                                     meta={'cookiejar': response.meta['cookiejar']},
                                     headers=self.headers,
                                     callback=self.parse)
            else:
                print('No more comments')
                print('Comment crawl finished')
        else:
            # Only reached if non-200 responses are allowed through to the callback
            print('Request failed, retrying...')
            time.sleep(5)
            # Retry the page that failed; dont_filter bypasses the duplicate filter
            yield scrapy.Request(url=response.url,
                                 meta={'cookiejar': response.meta['cookiejar']},
                                 headers=self.headers,
                                 dont_filter=True,
                                 callback=self.parse)
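
Two small pieces of the spider's logic, extracting the captcha id from the image URL and pulling the score out of the rating `class` attribute, can be sketched as standalone functions that are easy to test without running a crawl. The helper names, the example captcha URL, and the `allstar40 rating` class format are assumptions for illustration (the class format is inferred from the `[7:9]` slice above), not something confirmed by Douban's markup.

```python
import re
from urllib.parse import urlparse, parse_qs


def captcha_id_from_url(link):
    """Return the `id` query parameter of a captcha image URL, or None."""
    # parse_qs maps each key to a list of values
    values = parse_qs(urlparse(link).query).get("id")
    return values[0] if values else None


def rating_from_class(class_attr):
    """Extract the digits after 'allstar' from a class attribute, or None."""
    if not class_attr:
        return None
    match = re.search(r"allstar(\d+)", class_attr)
    return match.group(1) if match else None


# Hypothetical inputs
cid = captcha_id_from_url("https://www.douban.com/misc/captcha?id=abc123:en&size=s")
score = rating_from_class("allstar40 rating")
```

Unlike fixed-position slicing, the regular expression returns `None` instead of unrelated characters when a comment has no rating.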

Next is items.py:

import scrapy


class DzdItem(scrapy.Item):
    # Declare here which scraped fields are passed along from the spider
    # define the fields for your item here like:
    # name = scrapy.Field()
    user_id = scrapy.Field()  # user id
    nick_name = scrapy.Field()  # user nickname
    comment = scrapy.Field()  # comment text
    comment_time = scrapy.Field()  # comment time
    rating = scrapy.Field()  # rating

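pipelines.py
This part writes the data into the database in batches.
In practice, an over-long short comment can make a whole batch insert fail and roll back. When creating the table, try giving the comment column a larger varchar length.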

import pymysql


class DzdPipeline(object):

    def open_spider(self, spider):
        # Buffer of rows waiting to be written
        self.comments = []
        self.conn = pymysql.connect(host="localhost", user="root", passwd="Cs123456.", db="movie", charset="utf8")
        self.cursor = self.conn.cursor()

    # Batch insert into MySQL
    def bulk_insert_to_mysql(self, bulkdata):
        if not bulkdata:
            return
        try:
            sql = "insert into movie_comments (user_id,nick_name,comment,comment_time,rating) values(%s, %s,%s,%s,%s)"
            self.cursor.executemany(sql, bulkdata)
            self.conn.commit()
        except pymysql.MySQLError as e:
            print('Insert failed:', e)
            self.conn.rollback()

    def process_item(self, item, spider):
        self.comments.append([item['user_id'], item['nick_name'], item['comment'], item['comment_time'], item['rating']])
        if len(self.comments) >= 5:
            # Write the whole buffer (not just the current item), then empty it
            self.bulk_insert_to_mysql(self.comments)
            self.comments.clear()
        return item

    def close_spider(self, spider):
        # Flush whatever is still in the buffer
        self.bulk_insert_to_mysql(self.comments)
        self.cursor.close()
        self.conn.close()
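
The buffer-and-flush pattern used by the pipeline can be tried out in isolation. The sketch below swaps MySQL for an in-memory SQLite database from the standard library so that it runs without a server; `BatchWriter` and the sample rows are hypothetical and only illustrate the batching logic.

```python
import sqlite3


class BatchWriter:
    """Collect rows and write them in batches of `batch_size`."""

    def __init__(self, conn, batch_size=5):
        self.conn = conn
        self.batch_size = batch_size
        self.buffer = []

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        try:
            self.conn.executemany(
                "INSERT INTO movie_comments VALUES (?, ?, ?, ?, ?)", self.buffer)
            self.conn.commit()
        except sqlite3.Error:
            self.conn.rollback()
            raise
        finally:
            # The batch is dropped from the buffer whether or not it was written
            self.buffer.clear()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_comments "
             "(user_id TEXT, nick_name TEXT, comment TEXT, comment_time TEXT, rating TEXT)")
writer = BatchWriter(conn, batch_size=5)
for i in range(7):
    writer.add((str(i), "user%d" % i, "text", "2018-09-01", "40"))

stored_before_flush = conn.execute("SELECT COUNT(*) FROM movie_comments").fetchone()[0]
writer.flush()  # plays the role of close_spider
stored_after_flush = conn.execute("SELECT COUNT(*) FROM movie_comments").fetchone()[0]
```

Seven rows with a batch size of five leave two rows in the buffer, which is why the final flush on shutdown matters.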

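In settings.py, the main thing is to make sure COOKIES_ENABLED is True, because this spider carries cookies with its requests. Scrapy's built-in default is already True, but the generated settings.py contains a commented-out COOKIES_ENABLED = False line, so check that it has not been switched on. If you want to use a proxy pool, that can be done in a downloader middleware; a quick search will show how to rotate IPs with a proxy pool.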
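
A minimal sketch of those settings, together with a proxy-rotating downloader middleware, might look like the following. The proxy addresses, the `RandomProxyMiddleware` class, the `dzd.middlewares` module path, and the delay value are assumptions for illustration; `request.meta['proxy']` is the key that Scrapy's built-in HttpProxyMiddleware reads.

```python
import random

# Hypothetical proxy list; replace with proxies you actually control
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]


class RandomProxyMiddleware:
    """Downloader middleware sketch: choose a random proxy per request."""

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # let the request continue through the middleware chain


# settings.py fragment
COOKIES_ENABLED = True           # keep cookies between requests
DOWNLOAD_DELAY = 1.5             # non-blocking pause between requests (seconds)
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the pause around DOWNLOAD_DELAY
ITEM_PIPELINES = {"dzd.pipelines.DzdPipeline": 300}
DOWNLOADER_MIDDLEWARES = {"dzd.middlewares.RandomProxyMiddleware": 350}
```

With `DOWNLOAD_DELAY` set, the `time.sleep()` calls in the spider become unnecessary, since Scrapy spaces out requests itself without blocking its event loop.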

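That's it for today's code. If you have any questions or suggestions, feel free to leave a comment; I will reply when I see it.

I'm back: I went ahead and made a word cloud and a frequency chart from the data as well.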

import numpy as np
import pandas as pd
import jieba
import wordcloud
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import pymysql

mpl.rcParams['font.sans-serif'] = ['SimHei']  # default font that can render Chinese
mpl.rcParams['axes.unicode_minus'] = False  # render minus signs correctly with this font


def txt_cut(novel, stop_list):
    return [w for w in jieba.cut(novel) if w not in stop_list and len(w) > 1]


def Statistics(txtcut, save_path):
    # A Series is pandas' one-dimensional structure; take the 20 most frequent words in descending order
    word_count = pd.Series(txtcut).value_counts().sort_values(ascending=False)[0:20]
    # print(word_count)

    # Create a figure of 15 x 8 inches (width x height)
    plt.figure(figsize=(15, 8))
    x = word_count.index.tolist()  # the index (words) as a list
    y = word_count.values.tolist()  # the values (counts) as a list
    # barplot draws the chart from x and y; palette="BuPu_r" sets the bar colour scheme
    # BuPu_r goes from dark to light, left to right; BuPu is the reverse
    sns.barplot(x=x, y=y, palette="BuPu_r")
    plt.title('Word frequency Top 20')  # title
    plt.ylabel('count')  # y-axis label
    # Without this line the plot is framed by a full box. despine removes the top
    # and right spines, and bottom=True removes the bottom spine too
    sns.despine(bottom=True)
    # Save the figure
    plt.savefig(save_path, dpi=400)
    plt.show()


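def cloud(result, img_path, cloud_path, cloud_name):
    result = " ".join(result)  # the tokens must be joined with a separator, otherwise the word cloud cannot be drawn
    # 1. Load the custom background image
    image = Image.open(img_path)
    graph = np.array(image)

    # 2. Generate the word cloud
    # With a custom background image, the size of the word cloud is determined by the image's pixel size
    wc = wordcloud.WordCloud(font_path=r"I:\word-ttf\XingKai.ttf",  # font path
                             background_color='white',  # background colour
                             max_font_size=100,  # maximum font size
                             max_words=100,  # maximum number of words
                             mask=graph)  # the loaded image
    wc.generate(result)

    # 3. Colour the text using the background image as reference
    image_color = wordcloud.ImageColorGenerator(graph)  # derive colours from the background image
    wc.recolor(color_func=image_color)
    wc.to_file(cloud_path)  # saved at the background image's size, sharper than the on-screen display below

    # 4. Show the image
    plt.title(cloud_name)  # title of the plot
    plt.imshow(wc)  # display the word cloud as an image
    plt.axis("off")  # hide the axes
    plt.show()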

# Connect to the database and read back the stored data
def open_db():
    conn = pymysql.connect(host="localhost", user="root", passwd="Cs123456.", db="movie", charset="utf8")
    cursor = conn.cursor()
    sql = 'select * from movie_comments'
    cursor.execute(sql)
    data = cursor.fetchall()
    cursor.close()
    conn.close()
    return data

# Pick out the wanted column. In my table row[3] is the comment text;
# adjust the index to match the column order of your own table
def data_process(data):
    comment_ob = ''
    if data is not None:
        for row in data:
            comment_ob += row[3] + ','
    return comment_ob


def stop_list(stop_path):
    # Read the stop-word file, one word per line
    stop_list = []
    with open(stop_path, encoding="utf-8") as stop_file:
        for line in stop_file:
            stop_list.append(line.strip('\n').strip())
    return stop_list


if __name__ == '__main__':
    # Stop-word file path
    stop_path = 'I:/鬼吹燈小說分析/stop_word2.txt'
    # Use a separate name so the stop_list function is not overwritten
    stop_words = stop_list(stop_path)
    # Background image for the word cloud
    img_path = "C:/Users/Administrator/Desktop/timg.jpg"
    # Where to save the word cloud
    cloud_path = 'C:/Users/Administrator/Desktop/dzd6_cloud.jpg'
    # Where to save the bar chart
    stat_path = 'C:/Users/Administrator/Desktop/dzd6_stat.jpg'
    # Title of the word cloud
    cloud_name = 'Mission: Impossible 6 word cloud'
    data = open_db()
    ob = data_process(data)
    jb_comment = txt_cut(ob, stop_words)
    Statistics(jb_comment, stat_path)
    cloud(jb_comment, img_path, cloud_path, cloud_name)
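
The filtering and counting done by `txt_cut` and `Statistics` can be checked on a tiny example without jieba, pandas, or a database. The sketch below uses `collections.Counter` from the standard library in place of `pd.Series(...).value_counts()`; `top_words` and the sample tokens are hypothetical.

```python
from collections import Counter


def top_words(tokens, stop_words, n=20):
    """Drop stop words and single-character tokens, then count the rest."""
    kept = [w for w in tokens if w not in stop_words and len(w) > 1]
    return Counter(kept).most_common(n)


# Already-tokenized sample input (in the real script jieba.cut produces the tokens)
tokens = ["action", "the", "action", "stunt", "a", "stunt", "action", "plot"]
result = top_words(tokens, stop_words={"the"}, n=2)
```

Here `"the"` is removed as a stop word and `"a"` is removed by the length filter, the same two conditions that `txt_cut` applies.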


All right, this post really does end here.
See you next time.
