Python Crawler Tutorial for Beginners 32-100: Scraping Bilibili Boruto Comment Data with scrapy
1. Introduction to Scraping Bilibili Boruto Comment Data
I spent half the day with no idea what to scrape, went to Bilibili to watch dancing girls, and then noticed the comment section. So let's grab Bilibili comment data. With that many videos and anime I didn't know which one to pick, so I chose Boruto, the Naruto spin-off, and gave it a try. URL: https://www.bilibili.com/bangumi/media/md5978/?from=search&seid=16013388136765436883#short
The page shows 18,560 short comments. Not a huge amount of data, so let's scrape it, again using scrapy.
2. Bilibili Boruto Comment Data Case --- Getting the Link
From the developer tools you can easily find the API link below. Once you have the link, the rest is straightforward. I won't repeat how to create the project; let's get right to the point.
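Before wiring the link into Scrapy, it is worth poking the endpoint by hand to confirm the JSON shape. A quick sketch with requests (the cursor value is just a sample; the field names are the ones the spider below relies on):

import requests

url = ("https://bangumi.bilibili.com/review/web_api/short/list"
       "?media_id=5978&folded=0&page_size=20&sort=0&cursor=76742479839522")
data = requests.get(url).json()
print(data["code"])                          # 0 means success
print(len(data["result"]["list"]))           # up to page_size (20) comments
print(data["result"]["list"][-1]["cursor"])  # cursor for the next page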

In the parse function of my code I set up two yield statements: one returns items, the other returns requests. I then add a new feature, switching the User-Agent (UA) on every request, which requires the middleware mechanism.
import json

import scrapy

from borenzhuan.items import BorenzhuanItem


class BorenSpider(scrapy.Spider):
    BASE_URL = "https://bangumi.bilibili.com/review/web_api/short/list?media_id=5978&folded=0&page_size=20&sort=0&cursor={}"
    name = 'Boren'
    allowed_domains = ['bangumi.bilibili.com']
    start_urls = [BASE_URL.format("76742479839522")]

    def parse(self, response):
        print(response.url)
        # body_as_unicode() is response.text in newer Scrapy versions
        resdata = json.loads(response.body_as_unicode())
        if resdata["code"] == 0:
            if len(resdata["result"]["list"]) > 0:
                data = resdata["result"]["list"]
                # the cursor of the last record points at the next page
                cursor = data[-1]["cursor"]
                for one in data:
                    item = BorenzhuanItem()
                    item["author"] = one["author"]["uname"]
                    item["content"] = one["content"]
                    item["ctime"] = one["ctime"]
                    item["disliked"] = one["disliked"]
                    item["liked"] = one["liked"]
                    item["likes"] = one["likes"]
                    item["user_season"] = one["user_season"]["last_ep_index"] if "user_season" in one else ""
                    item["score"] = one["user_rating"]["score"]
                    yield item
                yield scrapy.Request(self.BASE_URL.format(cursor), callback=self.parse)
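With the spider saved, the crawl is started from the project root in the usual way (assuming the project was created as scrapy startproject borenzhuan, which the import paths suggest):

scrapy crawl Boren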
3. Bilibili Boruto Comment Data Case --- Implementing Random User-Agents
Step 1: add some User-Agent strings to the settings file. I collected a few from around the internet:
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
Step 2: configure DOWNLOADER_MIDDLEWARES in the settings file:
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'borenzhuan.middlewares.BorenzhuanDownloaderMiddleware': 543,
    'borenzhuan.middlewares.RandomUserAgentMiddleware': 400,
}

The value 400 matters here: Scrapy's built-in UserAgentMiddleware runs at priority 500, so our middleware sees each request first and sets the header; the built-in one only uses setdefault and therefore leaves it alone.
Step 3: in the middlewares.py file, import USER_AGENT_LIST from the settings module:
import random

from borenzhuan.settings import USER_AGENT_LIST  # the UA pool defined in step 1


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        rand_use = random.choice(USER_AGENT_LIST)
        if rand_use:
            request.headers.setdefault('User-Agent', rand_use)
Done: random User-Agents are now in place. To verify, you can add the following line to the parse function:
print(response.request.headers)
4. Bilibili Boruto Comment Data Case --- Filling In the Item
This step is straightforward; these fields are exactly the data we want to save:
import scrapy


class BorenzhuanItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()
    ctime = scrapy.Field()
    disliked = scrapy.Field()
    liked = scrapy.Field()
    likes = scrapy.Field()
    score = scrapy.Field()
    user_season = scrapy.Field()
5. Bilibili Boruto Comment Data Case --- Speeding Up the Crawl
Set the following parameters in settings.py:
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False
What these settings do:
1) Lower the download delay

DOWNLOAD_DELAY = 0

Dropping the delay to 0 makes the crawl fastest (the block above keeps a more cautious 1 second), but it calls for matching anti-ban measures. The standard one is User-Agent rotation: build a pool of user agents and pick one per request, which is exactly the middleware we set up in section 3.
2) Raise the concurrency

CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

Scrapy's networking sits on top of Twisted, which is event-driven and asynchronous: rather than spawning a thread per request, it keeps many requests in flight at once on a single event loop. Raising these limits lets Scrapy issue more simultaneous requests, which is where the speedup comes from.
3) Disable cookies

COOKIES_ENABLED = False

This comment API needs no login session, so switching cookies off shaves a little per-request overhead and gives the site one less signal to track the crawler with.
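The settings block above also mentions AutoThrottle in a comment. As an alternative to hand-tuning DOWNLOAD_DELAY, Scrapy can adjust the delay on the fly; a minimal sketch (the numbers are illustrative, not tuned for this site):

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1           # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10            # ceiling when the server responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0  # average requests Scrapy aims to keep in flight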
6. Bilibili Boruto Comment Data Case --- Saving the Data
Finally, write the saving code in the pipelines.py file:
import csv
import os


class BorenzhuanPipeline(object):
    def __init__(self):
        # write next to the spiders so the file is easy to find
        store_file = os.path.dirname(__file__) + '/spiders/bore.csv'
        self.file = open(store_file, "a+", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        try:
            self.writer.writerow((
                item["author"],
                item["content"],
                item["ctime"],
                item["disliked"],
                item["liked"],
                item["likes"],
                item["score"],
                item["user_season"]
            ))
        except Exception as e:
            print(e.args)
        return item  # pipelines should pass the item along

    def close_spider(self, spider):
        self.file.close()
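One step the article skips: the pipeline only runs once it is enabled in settings.py. With the default project layout that is:

ITEM_PIPELINES = {
    'borenzhuan.pipelines.BorenzhuanPipeline': 300,
}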
After running the code for a while, it threw an error

Took a look, and it turned out the crawl had simply finished all the data!
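If you want to confirm the run really covered all ~18,560 short comments, counting the rows of the output file is enough. A quick sketch (the path follows from the pipeline above, relative to the project root):

import csv

with open("borenzhuan/spiders/bore.csv", encoding="utf-8") as f:
    print(sum(1 for _ in csv.reader(f)))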