Scraping Data with Scrapy

阿新 • Published: 2018-11-15

Reposted from: http://blog.javachen.com/2014/05/24/using-scrapy-to-cralw-data.html

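Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing.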

Official home page: http://www.scrapy.org/
Chinese documentation: Scrapy 0.22 documentation
GitHub project page: https://github.com/scrapy/scrapy

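Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows (note: the image comes from the Internet):

[Image: Scrapy architecture diagram]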

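Scrapy consists mainly of the following components:

Engine: controls the data flow across the whole system and triggers events.
Scheduler: accepts requests from the engine, queues them, and hands them back when the engine asks for the next request.
Downloader: fetches web pages and returns their content to the spiders.
Spiders: do the main work; you write them to define the parsing rules for particular domains or pages.
Item Pipeline: processes the items that spiders extract from pages. Its main tasks are cleaning, validating, and storing data. After a page is parsed by a spider, its items are sent to the item pipeline and pass through several processing steps in a fixed order.
Downloader middlewares: a framework of hooks between the Scrapy engine and the downloader, mainly processing the requests and responses that pass between the two.
Spider middlewares: a framework of hooks between the Scrapy engine and the spiders, mainly processing the spiders' input (responses) and output (requests).
Scheduler middlewares: middleware between the Scrapy engine and the scheduler, processing the requests and responses sent from the engine to the scheduler.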
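The way these components hand work to one another can be sketched with a toy loop. This is only an illustration under stated assumptions, not how Scrapy is implemented: the functions `download`, `parse`, `pipeline`, and `run` and the in-memory `FAKE_WEB` table are all hypothetical stand-ins, and there is no networking, concurrency, or middleware.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, links found on the page)
FAKE_WEB = {
    "http://example.com/": (" home ", ["http://example.com/a"]),
    "http://example.com/a": (" page a ", []),
}


def download(url):
    """Downloader stand-in: return the stored page for a URL."""
    return FAKE_WEB[url]


def parse(url, page):
    """Spider stand-in: yield one item (a dict), then the URLs to follow."""
    text, links = page
    yield {"url": url, "text": text}
    for link in links:
        yield link


def pipeline(item, store):
    """Item pipeline stand-in: clean the item, then store it."""
    item["text"] = item["text"].strip().upper()
    store.append(item)


def run(start_urls):
    """Engine stand-in: move requests and items between the other parts."""
    scheduler = deque(start_urls)  # scheduler stand-in: queue of pending URLs
    seen = set(start_urls)
    store = []
    while scheduler:
        url = scheduler.popleft()
        page = download(url)
        for result in parse(url, page):
            if isinstance(result, dict):
                pipeline(result, store)      # items go to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)     # new URLs go back to the queue
    return store


items = run(["http://example.com/"])
print(items)
```

The point of the sketch is the division of labour: the spider only decides what to extract and what to follow, while queuing, fetching, and post-processing happen elsewhere.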

With Scrapy it is easy to collect data from the web: the framework already does much of the work, so you do not have to build it all yourself.

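1. Installation

Install Python

At the time of writing, the latest Scrapy release is 0.22.2, which requires Python 2.7, so Python 2.7 must be installed first. I am testing on a CentOS server; because the system ships with its own Python, check its version first.

Check the Python version: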

$ python -V
Python 2.6.6

Upgrade to 2.7:

$ wget http://python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
$ tar xf Python-2.7.6.tar.xz
$ cd Python-2.7.6
$ ./configure --prefix=/usr/local --enable-unicode=ucs4 --enable-shared LDFLAGS="-Wl,-rpath /usr/local/lib"
$ make && make altinstall

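Create a symbolic link so that the system default python points to python2.7. Note that yum on CentOS depends on the system Python, so after this change the first line of /usr/bin/yum may need to point to /usr/bin/python2.6.6 for the yum commands below to keep working.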

$ mv /usr/bin/python /usr/bin/python2.6.6 
$ ln -s /usr/local/bin/python2.7 /usr/bin/python

Check the Python version again:

$ python -V
Python 2.7.6

Install setuptools

Here setuptools is installed by fetching the bootstrap script with wget and piping it to Python:

$ wget https://bootstrap.pypa.io/ez_setup.py -O - | python

Install zope.interface

$ easy_install zope.interface

Install Twisted

Scrapy uses the Twisted asynchronous networking library to handle network communication, so Twisted must be installed.

Before installing Twisted, install gcc:

$ yum install gcc -y

Then install Twisted with easy_install:

$ easy_install twisted

If you see the following error:

$ easy_install twisted
Searching for twisted
Reading https://pypi.python.org/simple/twisted/
Best match: Twisted 14.0.0
Downloading https://pypi.python.org/packages/source/T/Twisted/Twisted-14.0.0.tar.bz2#md5=9625c094e0a18da77faa4627b98c9815
Processing Twisted-14.0.0.tar.bz2
Writing /tmp/easy_install-kYHKjn/Twisted-14.0.0/setup.cfg
Running Twisted-14.0.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-kYHKjn/Twisted-14.0.0/egg-dist-tmp-vu1n6Y
twisted/runner/portmap.c:10:20: error: Python.h: No such file or directory
twisted/runner/portmap.c:14: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
twisted/runner/portmap.c:31: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
twisted/runner/portmap.c:45: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘PortmapMethods’
twisted/runner/portmap.c: In function ‘initportmap’:
twisted/runner/portmap.c:55: warning: implicit declaration of function ‘Py_InitModule’
twisted/runner/portmap.c:55: error: ‘PortmapMethods’ undeclared (first use in this function)
twisted/runner/portmap.c:55: error: (Each undeclared identifier is reported only once
twisted/runner/portmap.c:55: error: for each function it appears in.)

install python-devel and then run the command again:

$ yum install python-devel -y
$ easy_install twisted

If you see the following error:

error: Not a recognized archive type: /tmp/easy_install-tVwC5O/Twisted-14.0.0.tar.bz2

download the archive and install it manually:

$ wget https://pypi.python.org/packages/source/T/Twisted/Twisted-14.0.0.tar.bz2#md5=9625c094e0a18da77faa4627b98c9815
$ tar -vxjf Twisted-14.0.0.tar.bz2
$ cd Twisted-14.0.0
$ python setup.py install

Install pyOpenSSL

First install some dependencies:

$ yum install libffi libffi-devel openssl-devel -y

Then install pyOpenSSL with easy_install:

$ easy_install pyOpenSSL

Install Scrapy

First install some dependencies:

$ yum install libxml2 libxslt libxslt-devel -y

Finally, install Scrapy itself:

$ easy_install scrapy
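After these steps, a quick way to confirm that everything is importable is a small check script. The following is a minimal sketch, not part of the original procedure: the helper name `is_installed` is made up for illustration, and the list assumes the usual import names of the packages installed above (pyOpenSSL is imported as `OpenSSL`).

```python
import importlib

# Import names of the packages installed above
MODULES = ["zope.interface", "twisted", "OpenSSL", "scrapy"]


def is_installed(name):
    """Return True if the module can be imported, False otherwise."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


status = dict((name, is_installed(name)) for name in MODULES)
for name in MODULES:
    print("%-15s %s" % (name, "ok" if status[name] else "MISSING"))
```

Any module reported as MISSING points at the installation step that needs to be repeated.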

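2. Using Scrapy

Once installation succeeds, you can learn some of Scrapy's basic concepts and usage and study the Scrapy example project dirbot.

The dirbot project lives at https://github.com/scrapy/dirbot and includes a README file that describes the project in detail. If you are familiar with git, you can check out its source code; otherwise you can click Downloads to get a tarball or zip archive.

The rest of this section uses that example to show how to create a crawler project with Scrapy.

Creating a project

Before crawling, you need to create a new Scrapy project. Change to the directory where you want to keep your code and run: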

$ scrapy startproject tutorial

This command creates a new directory named tutorial under the current directory, with the following structure:

.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

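These files are:

scrapy.cfg: the project configuration file
tutorial/: the project's Python module; your code will be imported from here
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders: the directory where spiders are placed

Defining an Item

Items are containers that hold the scraped data. They work like Python dictionaries but offer extra protection, such as rejecting assignments to undeclared fields to guard against typos.

An item is declared by creating a class derived from scrapy.item.Item and defining its attributes as scrapy.item.Field objects, much as in an object-relational mapper (ORM). We model the item we need in order to hold the site data obtained from dmoz.org: since we want each site's name, URL, and description, we define a field for each of these three attributes. To do so, edit the items.py file in the tutorial directory so that the Item class looks like this: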

from scrapy.item import Item, Field


class DmozItem(Item):
    # One Field per attribute scraped for each listed site
    title = Field()
    link = Field()
    desc = Field()

This may seem puzzling at first, but defining the item lets you know exactly what your items look like when you use other Scrapy components.
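The "dictionary with protection" idea can be illustrated without Scrapy at all. The classes below are a simplified, hypothetical stand-in (`StrictItem` and `DmozLikeItem` are not Scrapy classes), meant only to show why rejecting undeclared keys catches typos; Scrapy's own Item likewise raises a KeyError when an undeclared field is assigned.

```python
class StrictItem(dict):
    """Dict that only accepts the keys listed in `fields` (illustrative stand-in)."""
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)


class DmozLikeItem(StrictItem):
    fields = ("title", "link", "desc")


item = DmozLikeItem()
item["title"] = "Example site"   # declared field: accepted

try:
    item["titel"] = "oops"       # misspelled field: rejected
    rejected = False
except KeyError:
    rejected = True

print(item, rejected)
```

With a plain dict the misspelled key would be stored silently and the mistake would only surface later, when the expected field turns out to be empty.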

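Writing a Spider

A Spider is a user-written class that scrapes information from a domain (or group of domains). It defines the initial list of URLs to download, how to follow links, and how to parse page content to extract items.

To create a Spider, subclass scrapy.spider.BaseSpider and define three main, mandatory attributes:

name: the spider's identifier. It must be unique, so different spiders must have different names.
start_urls: the list of URLs where the spider begins crawling. The first pages downloaded are these URLs; further URLs are derived from the data found in them.
parse(): a method of the spider, called with the Response object downloaded from each start URL; the response is the method's only argument. This method is responsible for parsing the response data, extracting the scraped data (as items), and finding more URLs to follow.

Create DmozSpider.py in the tutorial/spiders directory: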

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the output file after the last path segment of the URL
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

Running the project

$ scrapy crawl dmoz

This command starts the spider for the dmoz.org domain; the argument dmoz is the value of the name attribute in DmozSpider.py.
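The parse method of this first spider saves each page under a name derived from its URL. The short sketch below isolates that one expression so its result is easy to see; the helper name `page_name` is introduced here purely for illustration.

```python
def page_name(url):
    """Return the second-to-last "/"-separated piece of a URL."""
    return url.split("/")[-2]


url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
name = page_name(url)
print(name)
```

Because the start URLs end with a slash, the last piece of the split is an empty string and index -2 is the final path segment. For a URL without the trailing slash the same expression would return the parent segment instead.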

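XPath selectors

Scrapy uses a mechanism called XPath selectors, which is based on XPath expressions. To learn more about selectors and other extraction mechanisms, consult further references.

Here are some example XPath expressions and their meanings:

/html/head/title: selects the <title> element inside the <head> element of an HTML document
/html/head/title/text(): selects the text inside the <title> element above
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements that have the attribute class="mine"

These are only a few simple examples; XPath is actually far more powerful. To learn more about XPath, working through an XPath tutorial is recommended.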
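To try expressions like these without running a crawl, Python's standard library can evaluate a limited subset of XPath. The sketch below uses xml.etree.ElementTree rather than Scrapy's selectors, so the paths are written relative to the root element and the sample document is a made-up, well-formed snippet; real HTML pages usually need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Small, well-formed sample document (hypothetical content)
DOC = """<html>
  <head><title>Example page</title></head>
  <body>
    <div class="mine"><table><tr><td>a</td><td>b</td></tr></table></div>
    <div class="other">ignored</div>
  </body>
</html>"""

root = ET.fromstring(DOC)  # root is the <html> element

# Counterpart of /html/head/title/text(), relative to <html>
title_text = root.find("head/title").text

# Counterpart of //td: every <td> anywhere below the root
cells = [td.text for td in root.findall(".//td")]

# Counterpart of //div[@class="mine"]
mine = root.findall(".//div[@class='mine']")

print(title_text, cells, len(mine))
```

ElementTree does not accept absolute paths such as /html/head/title on an element or the text() step, which is why the text is read through the .text attribute here.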

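To make XPath convenient to use, Scrapy provides a Selector class with four methods:

xpath(): returns a list of selectors, each representing a node selected by the XPath expression given as the argument.
extract(): returns a unicode string with the data selected by the XPath selector.
re(): returns a list of unicode strings extracted by applying the regular expression given as the argument.
css(): returns a list of selectors, each representing a node selected by the CSS expression given as the argument.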

Extracting data

We can select every <li> element on the site with:

sel.xpath('//ul/li')

The site descriptions:

sel.xpath('//ul/li/text()').extract()

The site titles:

sel.xpath('//ul/li/a/text()').extract()

The site links:

sel.xpath('//ul/li/a/@href').extract()

As noted earlier, each xpath() call returns a list of selectors, so xpath() calls can be chained to dig into deeper nodes. We will use that here:

sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc
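The same nested pattern, first selecting each list item and then querying relative to it, can be tried with the standard library alone. This is a sketch under assumptions: it uses xml.etree.ElementTree instead of Scrapy's Selector, and the two list entries are invented sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical directory listing in the shape the loop above expects
DOC = """<html><body><ul>
<li><a href="http://example.com/one">Site One</a> - First description</li>
<li><a href="http://example.com/two">Site Two</a> - Second description</li>
</ul></body></html>"""

root = ET.fromstring(DOC)
rows = []
for li in root.findall(".//ul/li"):     # like sel.xpath('//ul/li')
    a = li.find("a")                    # relative to this <li>
    title = a.text                      # counterpart of a/text()
    link = a.get("href")                # counterpart of a/@href
    desc = (a.tail or "").strip(" -")   # text inside <li> that follows the <a>
    rows.append((title, link, desc))

for row in rows:
    print(row)
```

Each inner query only sees the current list item, which is what keeps every title paired with its own link and description.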

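Using Items

scrapy.item.Item has an interface similar to Python's dict, and an Item contains multiple scrapy.item.Field objects. This resembles the relationship between a Django Model and its fields.

Items are typically used in the Spider's parse method, where they hold the parsed data.

Finally, modify the spider class to store the data in Items. The code below uses the dirbot project's Website item: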

from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re(r'-\s([^\n]*?)\n')
            items.append(item)

        return items

Now run the project again to see the results:

$ scrapy crawl dmoz
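The regular expression that fills the description field is easy to misread, so it helps to run it on its own. The sketch below applies the same pattern with the standard re module; the sample text is hypothetical, and re.findall is used as a rough counterpart of the selector's re() method, which also returns the captured groups as a list.

```python
import re

# The description pattern from the spider, written as a raw string:
# a dash, one whitespace character, then everything up to the end of the line
DESC_RE = re.compile(r'-\s([^\n]*?)\n')

# Hypothetical text node as it might follow a link inside an <li>
text = " - A guide to Python programming.\n"

descriptions = DESC_RE.findall(text)
print(descriptions)
```

A text node that does not end with a newline produces no match at all, so the description list can legitimately be empty for some entries.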

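Using an Item Pipeline

Set ITEM_PIPELINES in settings.py; it defaults to [] and is similar to Django's MIDDLEWARE_CLASSES. Items returned from the Spider's parse method are processed in turn by the Pipeline classes listed in ITEM_PIPELINES.

An Item Pipeline class must implement the following method:

process_item(item, spider): called for every item pipeline component. It must either return a scrapy.item.Item instance or raise a scrapy.exceptions.DropItem exception; once the exception is raised, the item is not processed by later pipelines. Parameters:
- item (Item object): the Item returned by the parse method
- spider (BaseSpider object): the spider that scraped the Item

It may additionally implement these two methods:

open_spider(spider): called when the spider is opened. Parameter: spider (BaseSpider object), the spider that was opened
close_spider(spider): called when the spider is closed. Parameter: spider (BaseSpider object), the spider that was closed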
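A pipeline built from these three methods might look like the sketch below. The class name `RequireTitlePipeline` and its rule (drop items with an empty title) are invented for illustration. The local DropItem fallback exists only so that the example runs where Scrapy is not installed, and the calls at the bottom imitate by hand what Scrapy would do during a crawl.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs where Scrapy is not installed
    class DropItem(Exception):
        pass


class RequireTitlePipeline(object):
    """Hypothetical pipeline: drop items without a title, count the rest."""

    def open_spider(self, spider):
        self.kept = 0

    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem("missing title")
        self.kept += 1
        return item

    def close_spider(self, spider):
        print("kept %d items" % self.kept)


# Exercising the class by hand; plain dicts stand in for Item objects
pipe = RequireTitlePipeline()
pipe.open_spider(spider=None)
result = pipe.process_item({'title': ['Example']}, spider=None)
try:
    pipe.process_item({'title': []}, spider=None)
    dropped = False
except DropItem:
    dropped = True
pipe.close_spider(spider=None)
```

To take effect in a project, the class would live in tutorial/pipelines.py and its import path would be listed in ITEM_PIPELINES in settings.py.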

Saving the scraped data

The simplest way to save the data is with Feed exports, using this command:

$ scrapy crawl dmoz -o items.json -t json

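All scraped items are saved in JSON format to the newly created items.json file.

Besides JSON, the JSON lines, CSV, and XML formats are supported, and further formats can be added through the exporter interface.

This approach is sufficient for small projects. More complex data may call for writing an Item Pipeline to process it.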
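Once the export exists, it can be read back with the standard json module. The sketch below works on an inline sample rather than a real file; the two records are hypothetical and only assume the shape suggested by the spider above, where each field holds a list because extract() and re() return lists.

```python
import json

# Hypothetical excerpt in the shape of an exported items.json
sample = '''[
  {"name": ["Site One"], "url": ["http://example.com/one"], "description": ["First"]},
  {"name": ["Site Two"], "url": ["http://example.com/two"], "description": []}
]'''

# With a real export, json.load() on the opened items.json file does the same
items = json.loads(sample)

# Flatten the one-element lists into plain strings ("" when nothing was extracted)
flat = [dict((k, v[0] if v else "") for k, v in item.items()) for item in items]

for row in flat:
    print(row["name"], row["url"], row["description"])
```

Flattening is a design choice for this example; keeping the lists is equally valid when a field can hold several values.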

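Summary

The steps above describe how to create a crawler project; you can practice by following them yourself. As a further learning example, see the article "scrapy 中文教程（爬cnbeta實例）" (a Chinese Scrapy tutorial that crawls cnbeta).

The spider class from that article is: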

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
 
from cnbeta.items import CnbetaItem
 
class CBSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.cnbeta.com']

    rules = (
        # Follow article links and pass each article page to parse_page
        Rule(SgmlLinkExtractor(allow=(r'/articles/.*\.htm', )),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item[‘url‘] = response.url
        return item

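Points to note:

The spider inherits from CrawlSpider and defines rules; the rule here matches every link containing /articles/.*\.htm.
The class does not implement a parse method; instead, the rule names parse_page as its callback. Consult further references to learn more about using CrawlSpider.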
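Which links such a rule would follow can be previewed with the standard re module. The sketch below assumes the pattern is applied with search semantics, meaning a match anywhere in the URL counts; the four URLs are made-up examples, not links taken from the site.

```python
import re

# The allow pattern from the rule above
ALLOW = re.compile(r'/articles/.*\.htm')

# Hypothetical candidate links
urls = [
    "http://www.cnbeta.com/articles/12345.htm",
    "http://www.cnbeta.com/articles/tech/678.html",
    "http://www.cnbeta.com/topics/9.htm",
    "http://www.cnbeta.com/articles/",
]

followed = [u for u in urls if ALLOW.search(u)]
for u in followed:
    print(u)
```

The second URL is kept because ".html" contains ".htm"; anchoring the pattern with a trailing $ would be one way to exclude it if that is not wanted.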
