Python Crawler Tutorial for Beginners 32-100: Scraping Bilibili Boruto Comment Data with scrapy
1. Introduction to Scraping Bilibili Boruto Comment Data
I spent half the day with no idea what to scrape, went to Bilibili to watch dancing girls, and then noticed the comment section. So let's grab Bilibili comment data. With that many videos and anime I didn't know which one to pick, so I chose Boruto, the Naruto spin-off, and gave it a try. URL: https://www.bilibili.com/bangumi/media/md5978/?from=search&seid=16013388136765436883#short
The page shows 18,560 short comments. Not a huge amount of data, so let's scrape it, again using scrapy.
2. Bilibili Boruto Comment Data Case --- Getting the Link
From the developer tools you can easily find the API link below. Once you have the link, the rest is straightforward. I won't repeat how to create the project; let's get right to the point.
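Before wiring the link into Scrapy, it is worth poking the endpoint by hand to confirm the JSON shape. A quick sketch with requests (the cursor value is just a sample; the field names are the ones the spider below relies on):

import requests

url = ("https://bangumi.bilibili.com/review/web_api/short/list"
       "?media_id=5978&folded=0&page_size=20&sort=0&cursor=76742479839522")
data = requests.get(url).json()
print(data["code"])                          # 0 means success
print(len(data["result"]["list"]))           # up to page_size (20) comments
print(data["result"]["list"][-1]["cursor"])  # cursor for the next page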

In the parse function of my code I set up two yield statements: one returns items, the other returns requests. I then add a new feature, switching the User-Agent (UA) on every request, which requires the middleware mechanism.
import json

import scrapy

from borenzhuan.items import BorenzhuanItem


class BorenSpider(scrapy.Spider):
    BASE_URL = "https://bangumi.bilibili.com/review/web_api/short/list?media_id=5978&folded=0&page_size=20&sort=0&cursor={}"
    name = 'Boren'
    allowed_domains = ['bangumi.bilibili.com']
    start_urls = [BASE_URL.format("76742479839522")]

    def parse(self, response):
        print(response.url)
        # body_as_unicode() is response.text in newer Scrapy versions
        resdata = json.loads(response.body_as_unicode())
        if resdata["code"] == 0:
            if len(resdata["result"]["list"]) > 0:
                data = resdata["result"]["list"]
                # the cursor of the last record points at the next page
                cursor = data[-1]["cursor"]
                for one in data:
                    item = BorenzhuanItem()
                    item["author"] = one["author"]["uname"]
                    item["content"] = one["content"]
                    item["ctime"] = one["ctime"]
                    item["disliked"] = one["disliked"]
                    item["liked"] = one["liked"]
                    item["likes"] = one["likes"]
                    item["user_season"] = one["user_season"]["last_ep_index"] if "user_season" in one else ""
                    item["score"] = one["user_rating"]["score"]
                    yield item
                yield scrapy.Request(self.BASE_URL.format(cursor), callback=self.parse)
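With the spider saved, the crawl is started from the project root in the usual way (assuming the project was created as scrapy startproject borenzhuan, which the import paths suggest):

scrapy crawl Boren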
3. Bilibili Boruto Comment Data Case --- Implementing Random User-Agents
Step 1: add some User-Agent strings to the settings file. I collected a few from around the internet:
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
Step 2: configure DOWNLOADER_MIDDLEWARES in the settings file:
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'borenzhuan.middlewares.BorenzhuanDownloaderMiddleware': 543,
    'borenzhuan.middlewares.RandomUserAgentMiddleware': 400,
}

The value 400 matters here: Scrapy's built-in UserAgentMiddleware runs at priority 500, so our middleware sees each request first and sets the header; the built-in one only uses setdefault and therefore leaves it alone.
Step 3: in the middlewares.py file, import USER_AGENT_LIST from the settings module:
import random

from borenzhuan.settings import USER_AGENT_LIST  # the UA pool defined in step 1


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        rand_use = random.choice(USER_AGENT_LIST)
        if rand_use:
            request.headers.setdefault('User-Agent', rand_use)
Done: random User-Agents are now in place. To verify, you can add the following line to the parse function:
print(response.request.headers)
4. Bilibili Boruto Comment Data Case --- Filling In the Item
This step is straightforward; these fields are exactly the data we want to save:
import scrapy


class BorenzhuanItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()
    ctime = scrapy.Field()
    disliked = scrapy.Field()
    liked = scrapy.Field()
    likes = scrapy.Field()
    score = scrapy.Field()
    user_season = scrapy.Field()
5. Bilibili Boruto Comment Data Case --- Speeding Up the Crawl
Set the following parameters in settings.py:
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False
What these settings do:
1) Lower the download delay

DOWNLOAD_DELAY = 0

Dropping the delay to 0 makes the crawl fastest (the block above keeps a more cautious 1 second), but it calls for matching anti-ban measures. The standard one is User-Agent rotation: build a pool of user agents and pick one per request, which is exactly the middleware we set up in section 3.
2) Raise the concurrency

CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16

Scrapy's networking sits on top of Twisted, which is event-driven and asynchronous: rather than spawning a thread per request, it keeps many requests in flight at once on a single event loop. Raising these limits lets Scrapy issue more simultaneous requests, which is where the speedup comes from.
3) Disable cookies

COOKIES_ENABLED = False

This comment API needs no login session, so switching cookies off shaves a little per-request overhead and gives the site one less signal to track the crawler with.
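The settings block above also mentions AutoThrottle in a comment. As an alternative to hand-tuning DOWNLOAD_DELAY, Scrapy can adjust the delay on the fly; a minimal sketch (the numbers are illustrative, not tuned for this site):

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1           # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10            # ceiling when the server responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0  # average requests Scrapy aims to keep in flight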
6. Bilibili Boruto Comment Data Case --- Saving the Data
Finally, write the saving code in the pipelines.py file:
import csv
import os


class BorenzhuanPipeline(object):
    def __init__(self):
        # write next to the spiders so the file is easy to find
        store_file = os.path.dirname(__file__) + '/spiders/bore.csv'
        self.file = open(store_file, "a+", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        try:
            self.writer.writerow((
                item["author"],
                item["content"],
                item["ctime"],
                item["disliked"],
                item["liked"],
                item["likes"],
                item["score"],
                item["user_season"]
            ))
        except Exception as e:
            print(e.args)
        return item  # pipelines should pass the item along

    def close_spider(self, spider):
        self.file.close()
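One step the article skips: the pipeline only runs once it is enabled in settings.py. With the default project layout that is:

ITEM_PIPELINES = {
    'borenzhuan.pipelines.BorenzhuanPipeline': 300,
}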
After running the code for a while, it threw an error

Took a look, and it turned out the crawl had simply finished all the data!
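If you want to confirm the run really covered all ~18,560 short comments, counting the rows of the output file is enough. A quick sketch (the path follows from the pipeline above, relative to the project root):

import csv

with open("borenzhuan/spiders/bore.csv", encoding="utf-8") as f:
    print(sum(1 for _ in csv.reader(f)))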