Scrapy (3): CrawlSpider

Author: 阿新 • Published: 2018-11-10


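CrawlSpider

CrawlSpider is the spider class commonly used for crawling ordinary websites. It defines a set of rules (Rule objects) that provide a convenient mechanism for following links.
It may not be a perfect fit for your particular site or project, but it works for many cases. You can use it as a starting point and override methods as needed, or implement your own spider.

CrawlSpider uses rules to decide how the crawl proceeds and submits requests for the matched URLs to the engine. In normal use, therefore, a CrawlSpider does not need to return follow-up requests by hand.

rules contains one or more Rule objects. Each Rule defines a specific behavior for crawling the site, such as which links to extract from the current response, whether to follow the extracted links, and which callback to attach to the resulting requests.

If several rules match the same link, the first one, in the order the rules are defined in rules, is used.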
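
The "first matching rule wins" behavior can be pictured with a small standard-library sketch. It does not use Scrapy itself: the `RULES` list, the rule names, and `assign_links` are hypothetical stand-ins, and the second pattern is made up only to create an overlap.

```python
import re

# Hypothetical stand-ins for Rule objects: (name, pattern), in definition order.
RULES = [
    ("pagination", re.compile(r"/c/1073-all-\d+-new\.html")),
    ("catch_all", re.compile(r"/c/")),  # made-up pattern that overlaps the first
]


def assign_links(urls, rules=RULES):
    """Give each URL to the first rule whose pattern matches it."""
    seen = set()
    assigned = {name: [] for name, _ in rules}
    for name, pattern in rules:
        for url in urls:
            if url in seen:
                continue  # an earlier rule already claimed this link
            if pattern.search(url):
                seen.add(url)
                assigned[name].append(url)
    return assigned


urls = [
    "https://iask.sina.com.cn/c/1073-all-180-new.html",
    "https://iask.sina.com.cn/c/1073.html",
]
result = assign_links(urls)
```

Both patterns match the first URL, but only the earlier rule receives it; the second URL falls through to the later rule.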

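link_extractor: a Link Extractor object that defines which links to extract.
callback: the callback invoked for each link obtained from link_extractor. The callback receives a response as its first argument.
Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the crawl spider will fail to work.

follow: a boolean that specifies whether links extracted from the response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.

process_links: names the spider method that is called with the list of links obtained from link_extractor. It is mainly used for filtering.
process_request: names the spider method that is called for every request extracted by this rule. It is used to filter requests.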
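
As a rough illustration of what a process_links filter might look like, here is a minimal sketch that runs without Scrapy. The `Link` namedtuple is a stand-in for Scrapy's own link objects (which also expose `.url`), and `drop_non_question_links` is a hypothetical name, not something from the original project.

```python
from collections import namedtuple

# Stand-in for Scrapy's link objects; only the attributes used here are modeled.
Link = namedtuple("Link", ["url", "text"])


def drop_non_question_links(links):
    """Hypothetical process_links hook: keep only links to question pages (/b/...)."""
    return [link for link in links if "/b/" in link.url]


links = [
    Link("https://iask.sina.com.cn/b/gWP5Ttnm8NDB.html", "question"),
    Link("https://iask.sina.com.cn/c/1073.html", "category"),
]
kept = drop_non_question_links(links)
```

In a real spider the function would be a method (taking self as its first argument), and the Rule would reference it, for example process_links='drop_non_question_links'.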


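Hands-on: crawling data from the iask site

We use CrawlSpider rules to match the link to each question, scrape the question title and the asker from each linked page, and match the next-page URLs.

As covered earlier, we can test the matching first in scrapy shell:

scrapy shell https://iask.sina.com.cn/c/1073.html

In the shell, import LinkExtractor first, then match and fetch the links with extract_links(response).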

In [1]: from scrapy.linkextractors import LinkExtractor

In [2]: page = LinkExtractor(allow=r'/c/1073-all-\d+-new\.html').extract_links(response)  # match the next-page URLs

In [3]: page
Out[3]:
[Link(url='https://iask.sina.com.cn/c/1073-all-180-new.html', text='2', fragment='', nofollow=False),
 Link(url='https://iask.sina.com.cn/c/1073-all-191-new.html', text='3', fragment='', nofollow=False),
.........
 Link(url='https://iask.sina.com.cn/c/1073-all-8608-new.html', text='20', fragment='', nofollow=False),
 Link(url='https://iask.sina.com.cn/c/1073-all-8618-new.html', text='30', fragment='', nofollow=False)]
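
The allow pattern can also be checked outside the shell. The sketch below uses only the standard re module and assumes the pattern is searched against each absolute URL, which approximates how the extractor applies it; the candidate URLs are taken from the output shown in this post.

```python
import re

# Same pattern as passed to LinkExtractor(allow=...) above.
pattern = re.compile(r"/c/1073-all-\d+-new\.html")

candidates = [
    "https://iask.sina.com.cn/c/1073-all-180-new.html",  # pagination link
    "https://iask.sina.com.cn/c/1073.html",              # category start page
    "https://iask.sina.com.cn/b/1SXKZurG8ST9.html",      # question page
]

# Keep only the URLs in which the pattern occurs somewhere.
matched = [url for url in candidates if pattern.search(url)]
```

Only the pagination link survives, which is the behavior wanted for the next-page rule.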

In [4]: page = LinkExtractor(restrict_xpaths='//li[@class="list"]').extract_links(response)  # to get the title and asker, first find the URL of each question

In [5]: page
Out[5]:
[Link(url='https://iask.sina.com.cn/b/1SXKZurG8ST9.html', text='avg說獵殺潛航3主程序是backdoor.seed', fragment='', nofollow=False),
 Link(url='https://iask.sina.com.cn/b/1SWo9FvedMVJ.html', text='怎麽去掉關於應用程序錯誤的提示???', fragment='', nofollow=False),
 Link(url='https://iask.sina.com.cn/b/gWP5Ttnm8NDB.html', text='在超聲波測距儀的設計中用到了cx20106a,在protel中怎麽找不到啊？', fragment='', nofollow=False),
 ..........

 Link(url='https://iask.sina.com.cn/b/87xMZOVEB3Dr.html', text='景德鎮哪家公司做網頁設計比較靠譜？有電話嗎？', fragment='', nofollow=False),
 Link(url='https://iask.sina.com.cn/b/87L5oCMvbsKB.html', text='景德鎮有專業的做網頁設計的公司嗎？', fragment='', nofollow=False)]

For easier reading, .url pulls out just the URL of each link.

In [7]: page[0].url
Out[7]: 'https://iask.sina.com.cn/b/1SXKZurG8ST9.html'

In [8]: page[1].url
Out[8]: 'https://iask.sina.com.cn/b/1SWo9FvedMVJ.html'

In [9]: page[2].url
Out[9]: 'https://iask.sina.com.cn/b/gWP5Ttnm8NDB.html'

Pick any one of these URLs and open it in scrapy shell to work out how to extract the data we need.

scrapy shell https://iask.sina.com.cn/b/gWP5Ttnm8NDB.html

In [1]: question = response.xpath('//h2[@class="question-title "]/text()').extract_first()

In [2]: question
Out[2]: '在超聲波測距儀的設計中用到了cx20106a,在protel中怎麽找不到啊？'

In [3]: ask_people = response.xpath('//span[@class="user-name"]/text()').extract_first()

In [4]: ask_people
Out[4]: '距離會產生美只有時間不要太長'
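
Note the trailing space in "question-title " above: an @class="..." test compares the whole attribute string, so the space has to be reproduced exactly. The sketch below illustrates this with the standard library's ElementTree as a stand-in for Scrapy's selectors; the HTML fragment and its values are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up fragment that mimics the class attributes seen on the question page.
html = (
    "<div>"
    '<h2 class="question-title ">Sample question</h2>'
    '<span class="user-name">sample_user</span>'
    "</div>"
)
root = ET.fromstring(html)

# Exact attribute comparison: the trailing space must be part of the value.
with_space = root.findall(".//h2[@class='question-title ']")
without_space = root.findall(".//h2[@class='question-title']")

title = with_space[0].text if with_space else None
```

Here `with_space` finds the heading and `without_space` finds nothing, which is why the XPath in the shell session keeps the space.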

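With the preparation done, we can start writing code.

First create the project.
Note: generating a crawl spider uses a slightly different command: scrapy genspider -t crawl "spider_name" "url"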


IAsk\items.py: define the fields we need

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class IaskItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    question_title = scrapy.Field()
    ask_name = scrapy.Field()

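IAsk\settings.py: enable the pipeline and turn off robots.txt compliance (some sites set a robots.txt policy, a polite form of anti-crawling that Scrapy can be told to ignore)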
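
The two settings involved would look roughly like the fragment below. The module path "IAsk.pipelines.IaskPipeline" is inferred from the file names used in this post, and the priority value 300 is an assumption rather than something the original states.

```python
# IAsk/settings.py (fragment)

# Do not consult the site's robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Enable the pipeline; the number is its priority (assumed value).
ITEM_PIPELINES = {
    "IAsk.pipelines.IaskPipeline": 300,
}
```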


IAsk\spiders\iask.py: write the spider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import IaskItem


class IaskSpider(CrawlSpider):
    name = 'iask'
    allowed_domains = ['iask.sina.com.cn']
    start_urls = ['https://iask.sina.com.cn/c/1073.html']

    rules = (
        # Match the next-page URLs. No further parsing is needed; the callback only prints them.
        Rule(LinkExtractor(allow=r'/c/1073-all-\d+-new\.html'), callback='parse_item', follow=True),
        # Match the URL of each question, then extract the title and asker in the callback.
        Rule(LinkExtractor(restrict_xpaths='//li[@class="list"]'), callback='parse_item1', follow=True),
    )

    def parse_item(self, response):
        print(response.url)

    def parse_item1(self, response):
        ask_item = IaskItem()  # create the item object
        ask_item['question_title'] = response.xpath('//h2[@class="question-title "]/text()').extract_first()
        ask_item['ask_name'] = response.xpath('//span[@class="user-name"]/text()').extract_first()
        yield ask_item  # hand the item to the pipeline

IAsk\pipelines.py: save the data as JSON, one object per line

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class IaskPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.f = open('ask.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one JSON object per line, keeping non-ASCII text readable.
        s = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.f.write(s)
        return item

    def close_spider(self, spider):
        # Called once when the spider closes: release the file handle.
        self.f.close()
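
Because the pipeline writes one JSON object per line, the output is not a single JSON document and has to be read back line by line. A minimal sketch, using an in-memory stream in place of ask.json and made-up sample items; `read_json_lines` is a hypothetical helper name:

```python
import io
import json


def read_json_lines(stream):
    """Parse a stream that holds one JSON object per line."""
    return [json.loads(line) for line in stream if line.strip()]


# Made-up items, written the same way the pipeline writes them.
sample = io.StringIO()
for item in [
    {"question_title": "Q1", "ask_name": "user_a"},
    {"question_title": "Q2", "ask_name": "user_b"},
]:
    sample.write(json.dumps(item, ensure_ascii=False) + "\n")

sample.seek(0)
rows = read_json_lines(sample)
```

With a real file, the same helper could be called on open('ask.json', encoding='utf-8').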

Scrapy (2) walked through a hands-on example built on a plain scrapy Spider. This post shows the CrawlSpider approach for comparison.
