Installing the Python Crawler Framework Scrapy, with a Crawling Example

阿新 • Published: 2019-01-03

Environment: Python 3.6 with its bundled pip

# Install
pip install scrapy

pip downloads the required packages automatically:

Installing collected packages: lxml, cssselect, six, w3lib, parsel, pyasn1, attrs, idna, asn1crypto, pycparser, cffi, cryptography, pyOpenSSL, pyasn1-modules,
service-identity, zope.interface, constantly, incremental, Automat, hyperlink, Twisted, queuelib, PyDispatcher, Scrapy

The installation then fails with this error:

running build_ext
building 'twisted.test.raiser' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

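Here is how I worked through the problem.

I downloaded Microsoft Visual C++ Build Tools from http://landinghub.visualstudio.com/visual-cpp-build-tools and installed it (the installation did not complete), then ran pip install scrapy again and got another error: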

error: command 'cl.exe' failed: No such file or directory

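So I took stock of the situation: Scrapy was not installed, but every package it depends on, including Twisted, had been downloaded, and only some of them had been installed. The failure happened while installing Twisted, so Twisted was not installed and the Scrapy installation stopped at that step. Twisted is written mostly in Python but ships C extension modules (such as twisted.test.raiser in the log above), and building them from source on Windows requires the Visual C++ compiler.

1. How to solve the second problem? My guess was that too few Microsoft Visual C++ Build Tools components had been installed (I cancelled the installer before it finished because my C: drive was nearly full, so only 52 components were installed). Uninstall and reinstall it? I decided against that: it takes too much disk space, and I would use very few of those components anyway, since I don't do C/C++ development.
2. See whether the first problem can be solved another way.
A second way to solve the first problem can be found online: visit https://www.lfd.uci.edu/~gohlke/pythonlibs/ and download the whl file that matches your Python version and system architecture; the file is small. My environment is Python 3.6 on 64-bit Windows, so I downloaded Twisted-17.9.0-cp36-cp36m-win_amd64.whl, opened a command prompt in the folder containing the file, and ran: pip install Twisted-17.9.0-cp36-cp36m-win_amd64.whl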

Installing collected packages: Twisted
Successfully installed Twisted-17.9.0
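
If you are not sure which wheel file matches your interpreter, a small script can print the tags to look for in the file name. This is only a sketch: wheel_tags is a helper name I made up, it assumes CPython on Windows, and it reports the bitness of the Python interpreter itself (a 32-bit Python on 64-bit Windows still needs a win32 wheel).

```python
import struct
import sys


def wheel_tags():
    """Return (python_tag, platform_tag) for the running interpreter.

    Hypothetical helper: assumes CPython on Windows.
    """
    # For example "cp36" for CPython 3.6
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    # Pointer size in bits tells us whether this Python is 32-bit or 64-bit
    bits = struct.calcsize("P") * 8
    platform_tag = "win_amd64" if bits == 64 else "win32"
    return python_tag, platform_tag


if __name__ == "__main__":
    python_tag, platform_tag = wheel_tags()
    print("Look for a wheel whose name contains:", python_tag, "and", platform_tag)
```

On the setup described in this post (Python 3.6, 64-bit), the two tags correspond to the cp36 and win_amd64 parts of the Twisted file name.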

# The "Successfully installed Twisted-17.9.0" message means Twisted is installed; now install Scrapy again
pip install scrapy

# The last two lines of the log show that Scrapy installed successfully
Installing collected packages: queuelib, Scrapy
Successfully installed queuelib-1.4.2 scrapy-1.4.0

# Check the version
scrapy
Scrapy 1.4.0 - no active project
···

# Uninstall Microsoft Visual C++ Build Tools
# List the installed packages
pip list

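If the following error appears while crawling, there is no need to download an executable (exe) installer as some online guides suggest; the single command below fixes it.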

File "c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\twisted\internet\_win32stdio.py", line 9, in <module>
import win32api
ModuleNotFoundError: No module named 'win32api'

# Install pypiwin32
pip install pypiwin32
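
Before starting a crawl, it can save time to check that the modules involved so far can actually be found. The following is a minimal sketch using only the standard library; missing_modules is a helper name I made up, and the list of names is my assumption based on the errors above. Note that it takes import names, not pip package names (pypiwin32 is the package that provides win32api).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names, not pip package names
    required = ["scrapy", "twisted", "win32api"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules can be found.")
```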

scrapydemo

Target URL: http://www.seu.edu.cn/138/list.htm
1. Create a new project

scrapy startproject scrapydemo

2. Create scrapydemo\scrapydemo\spiders\ActivityItem.py

import scrapy


class ActivityItem(scrapy.Item):
    title = scrapy.Field()
    time = scrapy.Field()
    url = scrapy.Field()

3. Create scrapydemo\scrapydemo\spiders\demo.py and use XPath to filter the data
The spider needs three attributes, name, allowed_domains, and start_urls, plus an override of Spider's parse method; allowed_domains can be omitted.

from scrapy.spiders import Spider
from scrapy.selector import Selector

from scrapydemo.spiders.ActivityItem import ActivityItem


class Demo2Spider(Spider):
    name = "demo"
    # allowed_domains = ["www.seu.edu.cn"]
    start_urls = [
        'http://www.seu.edu.cn/138/list.htm',
    ]

    def parse(self, response):

        content = response.body.decode("utf-8", "ignore")
        # Spider.log(self, "Open home page \n\n" + content)
        sel = Selector(response)
        sites = sel.xpath('//*[@id="wp_news_w6"]/ul/li')

        # fo = open("index.html", "w")
        # fo.write(content)
        # Spider.log(self, "\nFile written... ")

        print(len(sites))
        items = []
        for site in sites:
            item = ActivityItem()
            print("\n")
            print(site.xpath('span/a/text()').extract())
            item['title'] = site.xpath('span/a/text()').extract()

            print(site.xpath('span/a/@href').extract())
            item['url'] = site.xpath('span/a/@href').extract()

            print(site.xpath('span/text()').extract())
            item['time'] = site.xpath('span/text()').extract()

            items.append(item)

        # Returning the items is optional here, but it is needed later when writing them to a file or database
        return items

4. Start the spider

scrapy crawl demo

5. Write the crawl results to a file; this generates demo.json in the directory where the command is run (here, the project root)

scrapy crawl demo -o demo.json

If the Chinese text in the JSON file appears as Unicode escape sequences:

# Fix for Chinese text appearing as Unicode escapes in the JSON file
scrapy crawl demo -o demo.json -s FEED_EXPORT_ENCODING=utf-8
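
To see what these escapes are, the standard json module shows the same two forms. This is only an illustration with the standard library, not Scrapy's exporter, and the sample title is made up; as far as I know, Scrapy's JSON feed export behaves the same way when no export encoding is set.

```python
import json

# Made-up item for illustration
item = {"title": "東南大學"}

# Default: non-ASCII characters are written as \uXXXX escape sequences
escaped = json.dumps(item)

# ensure_ascii=False keeps the characters as they are; the output file then
# needs an encoding such as UTF-8 that can represent them
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data, so the escaped file is not corrupt; it is just harder to read.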

6. Debugging

First set a breakpoint in demo.py, then create run.py in the same directory as settings.py

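# -*- coding: utf-8 -*-
from scrapy import cmdline

# Name of the spider to run (the name attribute set in demo.py)
name = 'demo'
cmd = 'scrapy crawl {0}'.format(name)
# Same as running "scrapy crawl demo" on the command line
cmdline.execute(cmd.split())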

Right-click -> Debug Run, and the program stops at the breakpoint.

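7. Crawling every page of the target link
The target link is paginated. Do you have to write a new spider to crawl the next page, the page after that, or all of the pages? No: Scrapy provides a simple way to do it. Add the following code to the parse method in demo.py (it uses scrapy.Request, so make sure demo.py has import scrapy at the top):

next_page = response.xpath('//*[@id="wp_paging_w6"]/ul/li[2]/a[3]/@href').extract_first()
# extract_first() returns None when the XPath matches nothing
if next_page:
    url = response.urljoin(next_page)
    print(url)
    yield scrapy.Request(url=url, callback=self.parse)

This requests the URL of the next page and uses parse as the callback again, so each page schedules the one after it until no next link is found. Do not put this code inside the for loop. Also note that the yield turns parse into a generator, and Scrapy ignores a generator's return value, so replace items.append(item) and return items with yield item inside the loop; otherwise the scraped items are lost.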
