[Python 3 Crawler] Installing Scrapy on Windows 10 and Creating a New Scrapy Project

阿新 • Published: 2018-11-24

For detailed installation tutorials, see:

http://www.runoob.com/w3cnote/scrapy-detail.html

https://segmentfault.com/a/1190000013178839

Other tutorials:

https://oner-wv.gitbooks.io/scrapy_zh/content/%E5%9F%BA%E6%9C%AC%E6%A6%82%E5%BF%B5/%E9%80%89%E6%8B%A9%E5%99%A8.html

https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/images.html

Steps:

1. Install the framework:

pip install --user Scrapy
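
To confirm the install worked before moving on, a quick check from Python can help. The sketch below is only an illustration: the helper name `is_installed` is hypothetical, and it merely tests whether the current interpreter can locate a top-level module.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate `module_name`."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


# Hypothetical usage: report whether Scrapy is importable.
print("scrapy installed:", is_installed("scrapy"))
```

If this prints False right after a seemingly successful install, pip may have installed into a different Python than the one running the check.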

If the install fails with an error like this:

error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

----------------------------------------
Command ""c:\program files\python37\python.exe" -u -c "import setuptools, tokenize;__file__='C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\pip-install-vizrew_c\\Twisted\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\ADMINI~1\AppData\Local\Temp\pip-record-qka9_ywo\install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in C:\Users\ADMINI~1\AppData\Local\Temp\pip-install-vizrew_c\Twisted\

installing Microsoft Visual C++ 14.0 resolves it.

Download link 1: https://964279924.ctfile.com/fs/1445568-239446865

Download link 2: http://makeoss.oss-cn-hangzhou.aliyuncs.com/%E5%BE%AE%E8%BD%AFwin10/visualcppbuildtools_full.exe

2. Create a new project. Open cmd in the directory where you want the project to live, then run the create command:

scrapy startproject mySpider

A new folder named mySpider will appear in that directory.
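
To check that the project skeleton was generated, something like the following sketch could be used. The file list is an assumption based on the default project template of recent Scrapy releases, and `missing_files` is a hypothetical helper, not part of Scrapy.

```python
from pathlib import Path

# Files that `scrapy startproject mySpider` is expected to generate
# (assumption: the default project template of recent Scrapy releases).
EXPECTED_FILES = (
    "scrapy.cfg",
    "mySpider/__init__.py",
    "mySpider/items.py",
    "mySpider/middlewares.py",
    "mySpider/pipelines.py",
    "mySpider/settings.py",
    "mySpider/spiders/__init__.py",
)


def missing_files(project_root):
    """Return the expected files that do not exist under `project_root`."""
    root = Path(project_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Hypothetical usage, run from the directory where the project was created.
print(missing_files("mySpider"))
```

An empty list means every expected file is present. Note the nesting: the outer mySpider folder holds scrapy.cfg, and the inner mySpider package holds items.py and the spiders directory.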

Example: building a crawler project

The goal is to scrape the page http://www.itcast.cn/channel/teacher.shtml.

1) Add a class to items.py:

import scrapy  # already present in the generated items.py


class ItcastItem(scrapy.Item):
    # Declare the fields to scrape
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()

2) In the spiders directory, create a file named itcast.py with the following code:

import scrapy


class ItcastSpider(scrapy.Spider):
    name = "itcast"  # Spider name, used when launching the crawl
    allowed_domains = ["itcast.cn"]  # Restrict crawling to this domain
    # Start URLs. Several pages can be listed, provided their HTML
    # structure is similar enough to share one parse method.
    start_urls = (
        'http://www.itcast.cn/channel/teacher.shtml#aphp',
        'http://www.itcast.cn/channel/teacher.shtml#apython',
    )

    def parse(self, response):
        # Print the page HTML. The page encoding may be gb2312, utf-8, or GBK.
        print(response.body.decode('utf-8'))

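3. Run the project:

1). Install pywin32

Download the version matching your Python from https://github.com/mhammond/pywin32/releases and install it.

Otherwise, starting the project fails with ModuleNotFoundError: No module named 'win32api'

2). Start the crawl with python -m scrapy crawl followed by the spider name (the name attribute of the spider class, not the project name):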

python -m scrapy crawl itcast

Alternatively, scrapy crawl itcast also starts it.
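
The same command can also be launched from a Python script. The following is a minimal sketch, assuming the script is run with the interpreter that Scrapy was installed into; `crawl_command` and `run_spider` are hypothetical helper names.

```python
import subprocess
import sys


def crawl_command(spider_name):
    """Build the argument list for `python -m scrapy crawl <spider_name>`."""
    # sys.executable is the interpreter running this script, so Scrapy is
    # looked up in the same environment this script runs in.
    return [sys.executable, "-m", "scrapy", "crawl", spider_name]


def run_spider(spider_name, project_dir="."):
    """Run the spider from `project_dir` and return the process exit code."""
    return subprocess.run(crawl_command(spider_name), cwd=project_dir).returncode


# Hypothetical usage from the folder that contains scrapy.cfg:
# run_spider("itcast")
```

Going through `python -m scrapy` avoids depending on the scrapy executable being on PATH, which may not be the case after a `pip install --user`.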

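The print(response.body) call in parse outputs the page HTML; the pages fetched are the addresses listed in the start_urls tuple.

start_urls can list several different pages, as long as their HTML structure is similar.

Mind the page encoding, which may be gb2312, utf-8, or GBK. Calling .decode('utf8') on response.body decodes the raw HTML bytes into a string. Scrapy itself does not require this step: response.text already returns the decoded body.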
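
When the encoding of a page is not known in advance, one option is to try several candidates in turn. The sketch below is an illustration only: `decode_html` is a hypothetical helper, and the sample bytes are made up to stand in for response.body. GBK is a superset of gb2312, so trying "gbk" covers both.

```python
def decode_html(body, encodings=("utf-8", "gbk")):
    """Decode raw page bytes, trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return body.decode(encoding)
        except UnicodeDecodeError:
            continue  # Not valid in this encoding; try the next one.
    # Last resort: decode with the first encoding, replacing bad bytes.
    return body.decode(encodings[0], errors="replace")


# Made-up sample bytes standing in for response.body.
sample = "中文".encode("gbk")
print(decode_html(sample))
```

The order matters: utf-8 is strict enough that GBK bytes usually fail to decode as utf-8, whereas many byte sequences decode as GBK without error, so utf-8 is tried first.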

That completes a simple page-fetching project!

4. Extract data from the page:

Import the class defined earlier in items.py.

The complete code of itcast.py:

import scrapy
from mySpider.items import ItcastItem


class ItcastSpider(scrapy.Spider):
    name = "itcast"  # Spider name
    allowed_domains = ["itcast.cn"]  # Restrict crawling to this domain
    # Start URLs. Several pages can be listed, provided their HTML
    # structure is similar enough to share one parse method.
    start_urls = (
        'http://www.itcast.cn/channel/teacher.shtml#aphp',
        'http://www.itcast.cn/channel/teacher.shtml#apython',
    )

    def parse(self, response):
        # html = response.body.decode('utf-8')
        # print(html)

        for each in response.xpath("//div[@class='li_txt']"):
            # Wrap the extracted data in an `ItcastItem` object
            item = ItcastItem()

            # extract() returns a list of unicode strings
            name = each.xpath("h3/text()").extract()
            title = each.xpath("h4/text()").extract()
            info = each.xpath("p/text()").extract()

            # Each XPath matches a single element here, so take the first
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            # Yield inside the loop so every item reaches the engine,
            # not just the last one
            yield item

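The XPath expressions target the div nodes with class li_txt, each of which contains h3, h4, and p child tags.

Running the spider shows that the text inside those HTML tags is scraped successfully.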
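
The selection logic can also be tried offline, without Scrapy or a network connection. The sketch below swaps in the standard library's ElementTree for Scrapy's selectors, and the markup is a made-up, well-formed fragment that only imitates the structure described above; real pages are rarely valid XML, so this is for experimenting with the idea rather than for production scraping.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment imitating the structure the spider targets.
SAMPLE = """
<html><body>
  <div class="li_txt"><h3>Teacher A</h3><h4>Senior Lecturer</h4><p>Bio A</p></div>
  <div class="li_txt"><h3>Teacher B</h3><h4>Lecturer</h4><p>Bio B</p></div>
</body></html>
"""


def parse_teachers(markup):
    """Yield one dict per div.li_txt block, mirroring the spider's parse()."""
    root = ET.fromstring(markup.strip())
    for each in root.findall(".//div[@class='li_txt']"):
        # findtext returns the default instead of raising when a tag is absent.
        yield {
            "name": each.findtext("h3", default=""),
            "title": each.findtext("h4", default=""),
            "info": each.findtext("p", default=""),
        }


for teacher in parse_teachers(SAMPLE):
    print(teacher)
```

Because the function yields inside the loop, every matched block produces a result, which is the same behavior a Scrapy parse method needs in order to hand all items to the engine.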

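For a complete introduction, read the reference tutorials linked at the top of this post; they cover additional topics.

Personally, I find working with the DOM directly more satisfying: it gets done in one go, in a single file. It makes me feel even more that frameworks are tools for the lazy. So why learn a framework at all? To show that my ability to learn has not aged, and to get a raise.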
