[Python in Practice] A Scrapy Crawler for the Wandoujia App Market

阿新 • Published: 2019-01-18

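Given a large list of apps, how do we crawl the (app market) category and description that belong to each one? Read on.

1. Page analysis

When we type 微信 (WeChat) into the search box on the Wandoujia home page, we are taken to a search results page whose URL is http://www.wandoujia.com/search?key=%微信. Search results are generally sorted by relevance, so we treat the first result as the one to crawl. Clicking it leads to http://www.wandoujia.com/apps/com.tencent.mm, which shows that the URL of a Wandoujia app detail page is www.wandoujia.com/apps/ followed by the app's package name.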

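Back on the search results page, let us inspect the page elements, as shown in the figure:

All search results sit inside a <ul> unordered-list tag, and each individual result is an <li> tag. The corresponding CSS selector, which reads the data-pn attribute (the package name) of each result, is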

'#j-search-list>li::attr(data-pn)'

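Next, we analyze the app detail page. The HTML element that holds the app name is shown in the figure:

The element for the app category:

The element for the app description:

From these it is easy to derive the CSS selectors for the three elements (name, category, and description, in that order):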

.app-name>span::text
.crumb>.second>a>span::text
.desc-info>.con::text

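Based on the analysis above, the crawling strategy is:

Read the app file line by line and build the search page URL for each name;
Parse the search results page and follow the first result to its detail page;
Extract the relevant fields from the detail page and write them to the output file.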

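2. Crawler implementation

With the page analysis done, we can start coding the crawler. Written in bare Python, however, it would have to handle download delays, requests, page parsing, and serialization of the results by hand. Scrapy is a lightweight, fast web crawling framework that takes care of all of these; the Chinese documentation covers it in reasonable detail.

Data cleaning

Some names in the app file may be irregular (non-breaking spaces, bracketed annotations) and need to be cleaned: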

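# -*- coding: utf-8 -*-
import re


def clean_app_name(app_name):
    """Strip non-breaking spaces and bracketed annotations from an app name."""
    space = u'\u00a0'  # non-breaking space
    app_name = app_name.replace(space, u'')
    # unicode pattern, so the full-width brackets also match unicode input on Python 2
    brackets = u'\\(.*\\)|\\[.*\\]|【.*】|（.*）'
    return re.sub(brackets, u'', app_name)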
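
As a quick check of what the cleaning step does, here is a minimal Python 3 sketch that re-declares the function and runs it on a few made-up names (the sample names are illustrative, not taken from the original input file). It also shows one caveat of the greedy `.*` patterns: when a name contains two bracketed parts, everything between the first opening and the last closing bracket is removed. `BRACKETS_LAZY` is a hypothetical non-greedy variant, not part of the original code.

```python
import re

# Same bracket patterns as in the article: (), [], 【】 and （）
BRACKETS = re.compile(r'\(.*\)|\[.*\]|【.*】|（.*）')
# Hypothetical non-greedy variant for names with several bracketed parts
BRACKETS_LAZY = re.compile(r'\(.*?\)|\[.*?\]|【.*?】|（.*?）')


def clean_app_name(app_name, pattern=BRACKETS):
    """Remove non-breaking spaces and bracketed annotations."""
    app_name = app_name.replace('\u00a0', '')
    return pattern.sub('', app_name)


print(clean_app_name('微信（官方版）'))             # 微信
print(clean_app_name('QQ\u00a0[HD]'))               # QQ
print(clean_app_name('A(x)B(y)'))                   # A
print(clean_app_name('A(x)B(y)', BRACKETS_LAZY))    # AB
```

Whether the greedy or the non-greedy behavior is preferable depends on what the names in the input file actually look like.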

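URL handling

Take the cleaned app name and build the URL of the search results page. Because URLs cannot carry Chinese and other non-ASCII characters directly, the name has to be URL-encoded with urllib.quote (urllib.parse.quote in Python 3):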

# -*- coding: utf-8 -*-
from appMarket import clean

try:
    from urllib import quote  # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3


def get_kw_url(kw):
    """Build the search URL for an app name."""

    base_url = u"http://www.wandoujia.com/search?key=%s"
    kw = clean.clean_app_name(kw)
    # percent-encode the UTF-8 bytes of the keyword
    return base_url % quote(kw.encode("utf8"))


def get_pkg_url(pkg):
    """Build the detail page URL for a package name."""

    return 'http://www.wandoujia.com/apps/%s' % pkg
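
To make the encoding step concrete, the following Python 3 sketch percent-encodes the keyword 微信 and decodes it again. `search_url` is a simplified stand-in written for this illustration; it skips the cleaning step and is not the article's `get_kw_url`.

```python
from urllib.parse import quote, unquote

BASE_URL = "http://www.wandoujia.com/search?key=%s"


def search_url(keyword):
    """Build a search URL; quote() percent-encodes the UTF-8 bytes."""
    return BASE_URL % quote(keyword)


url = search_url("微信")
print(url)                            # http://www.wandoujia.com/search?key=%E5%BE%AE%E4%BF%A1
print(unquote(url.split("key=")[1]))  # 微信
```

Each Chinese character becomes three `%XX` groups because it occupies three bytes in UTF-8.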

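Crawling

Every Scrapy spider inherits from the scrapy.Spider class. Its main attributes and methods are:

name: the spider's name; pass it to the scrapy crawl command to start the spider
allowed_domains: the list of domains the spider is allowed to crawl
start_requests(): the entry point of the crawl; returns an iterable, usually of scrapy.Request objects
parse(response): processes the response and returns the extracted data, and may also return further URLs to follow (for the next processing step)

Items are the containers that hold the scraped data, similar to a Python dict: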

import scrapy


class AppMarketItem(scrapy.Item):
    # define the fields for your item here like:
    kw = scrapy.Field()  # key word
    name = scrapy.Field()  # app name
    tag = scrapy.Field()  # app tag
    desc = scrapy.Field()  # app description

The Wandoujia spider code:

# -*- coding: utf-8 -*-
# @Time    : 2016/6/23
# @Author  : rain
import scrapy
import codecs
from appMarket import util
from appMarket.util import wandoujia
from appMarket.items import AppMarketItem


class WandoujiaSpider(scrapy.Spider):
    name = "WandoujiaSpider"
    allowed_domains = ["www.wandoujia.com"]

    def __init__(self, *args, **kwargs):
        # let scrapy.Spider finish its own initialisation first
        super(WandoujiaSpider, self).__init__(*args, **kwargs)
        self.apps_path = './input/apps.txt'

    def start_requests(self):
        # one app name per line; each name becomes one search request
        with codecs.open(self.apps_path, 'r', 'utf-8') as f:
            for line in f:
                app_name = line.strip()
                if not app_name:
                    continue  # skip blank lines
                yield scrapy.Request(url=wandoujia.get_kw_url(app_name),
                                     callback=self.parse_search_result,
                                     meta={'kw': app_name})

    def parse(self, response):
        # default callback: handles the app detail page
        item = AppMarketItem()
        item['kw'] = response.meta['kw']
        item['name'] = response.css('.app-name>span::text').extract_first()
        item['tag'] = response.css('.crumb>.second>a>span::text').extract_first()
        desc = response.css('.desc-info>.con::text').extract()
        item['desc'] = util.parse_desc(desc)
        item['desc'] = u"" if not item["desc"] else item["desc"].strip()
        self.log(u'crawling the app %s' % item["name"])
        yield item

    def parse_search_result(self, response):
        # package name of the first search result
        pkg = response.css("#j-search-list>li::attr(data-pn)").extract_first()
        if pkg is None:
            self.log(u'no search result for %s' % response.meta['kw'])
            return
        yield scrapy.Request(url=wandoujia.get_pkg_url(pkg), meta=response.meta)

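The app name from the app file serves as the search keyword and should also be written to the output file. But the crawl hops from one URL to the next, so how do we pass a variable between requests at different levels? The meta (dict) argument of Request provides exactly this.

For the app description, extract() on .desc-info>.con::text returns a list, which is joined into a single string as follows: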

def parse_desc(desc):
    """Join the text fragments, trimming whitespace around each one."""
    return u''.join(part.strip() for part in desc)
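
To see what this concatenation produces, the following Python 3 sketch runs a join-based version on a made-up list of fragments (the fragment texts are invented for illustration).

```python
def parse_desc(desc):
    """Join text fragments, trimming whitespace around each one."""
    return ''.join(part.strip() for part in desc)


fragments = ['\n  Chat with friends. ', '  Share photos.\n', '']
print(parse_desc(fragments))  # Chat with friends.Share photos.
print(repr(parse_desc([])))   # ''
```

Note that the fragments are glued together without a separator; if the line breaks of the original description matter, joining with `'\n'` or `' '` would be one possible alternative.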

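Result processing

The serialization format Scrapy recommends is JSON. Its advantages are clear:

It is language-independent;
Its schema is explicit, so it is less error-prone to read than plain text split on '\t'.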
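
The second point can be illustrated with a small Python 3 sketch. The record below is made up; it only assumes that a description may itself contain a tab character.

```python
import json

record = {"name": "Demo App", "desc": "line one\tstill the same description"}

# Tab-separated: the tab inside desc looks exactly like a field delimiter
tsv_line = "\t".join([record["name"], record["desc"]])
print(len(tsv_line.split("\t")))  # 3 fields instead of 2

# JSON: the tab is escaped on write and restored on read
json_line = json.dumps(record, ensure_ascii=False)
print(json.loads(json_line) == record)  # True
```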

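The crawl results may contain duplicates and empty items (for apps with no search results). In addition, when Python 2 serializes to JSON, Chinese characters are written as \uXXXX escape sequences by default. A custom Pipeline can handle these issues: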

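# -*- coding: utf-8 -*-
import codecs
import json

from scrapy.exceptions import DropItem


class CheckPipeline(object):
    """check item, and drop the duplicate one"""
    def __init__(self):
        self.names_seen = set()

    def process_item(self, item, spider):
        if item['name']:
            if item['name'] in self.names_seen:
                raise DropItem("Duplicate item found: %s" % item)
            else:
                self.names_seen.add(item['name'])
                return item
        else:
            # empty name: the detail page yielded nothing usable
            raise DropItem("Missing name in %s" % item)


class JsonWriterPipeline(object):
    """write each item as one JSON object per line"""
    def __init__(self):
        self.file = codecs.open('./output/output.json', 'wb', 'utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # flush and close the output file when the crawl ends
        self.file.close()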
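
The effect of `ensure_ascii=False` is easiest to see side by side. This is a minimal Python 3 sketch with a made-up item; the default behavior of `json.dumps` (escaping non-ASCII characters) is the same in Python 2 and Python 3.

```python
import json

item = {"kw": "微信", "name": "微信"}

escaped = json.dumps(item)                       # default: ensure_ascii=True
readable = json.dumps(item, ensure_ascii=False)  # keep the characters as-is

print(escaped)   # {"kw": "\u5fae\u4fe1", "name": "\u5fae\u4fe1"}
print(readable)  # {"kw": "微信", "name": "微信"}
print(json.loads(escaped) == json.loads(readable))  # True
```

Both forms decode to the same data; the difference only matters for people reading the output file directly.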

The pipelines also need to be enabled in settings.py:

ITEM_PIPELINES = {
    'appMarket.pipelines.CheckPipeline': 300,
    'appMarket.pipelines.JsonWriterPipeline': 800,
}

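The integer assigned to each class determines the order in which they run: items pass through the pipelines from the lowest number to the highest. By convention these numbers are kept in the 0-1000 range.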
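
The ordering rule can be mimicked in a few lines of plain Python. This sketch only sorts the settings dict by its values to show the resulting order; it is not how Scrapy is implemented internally.

```python
ITEM_PIPELINES = {
    'appMarket.pipelines.JsonWriterPipeline': 800,
    'appMarket.pipelines.CheckPipeline': 300,
}

# Lower numbers run first, regardless of the order in the dict
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in order:
    print(ITEM_PIPELINES[path], path)
```

Running CheckPipeline first matters here: an item that raises DropItem is not passed to later pipelines, so duplicates and empty items never reach the JSON writer.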
