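An Attempt at Scraping Taobao Data with Scrapy

阿新 • Published: 2019-01-16

Because I wanted to learn databases and needed a fairly large amount of data, Taobao was naturally the first thing that came to mind. It has a huge amount of product information, but it also has quite a few anti-scraping measures, and the detail pages in particular contain annoying dynamic content.

This example uses the basic spider from the Scrapy framework (I still haven't quite figured out CrawlSpider = = b).

Here is the complete code first: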

import re

import pymongo
import scrapy

from tmail.items import TmailItem


class WeisuenSpider(scrapy.Spider):
    name = 'weisuen'
    # Search URL for the keyword "帽子" (hats). It already ends with s=300;
    # start_requests appends a further s= offset for each page.
    start_url = 'https://s.taobao.com/search?q=%E5%B8%BD%E5%AD%90&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170817&s=300'
    detail_urls = []
    data = []
    client = pymongo.MongoClient("localhost", 27017)
    # Despite the name, db is the "items" collection of the "taobao" database
    db = client.taobao.items

    def start_requests(self):
        for i in range(30):  # 30 pages of data is about enough
            url = self.start_url + '&s=' + str(i * 44)
            yield scrapy.Request(url=url, callback=self.parse)

    def url_decode(self, temp):
        # Remove each 7-character escape run (two backslashes plus a code such as u003d)
        while '\\' in temp:
            index = temp.find('\\')
            st = temp[index:index + 7]
            temp = temp.replace(st, '')

        # Put back the '=' and '&' separators that the escapes stood for
        index = temp.find('id')
        temp = temp[:index + 2] + '=' + temp[index + 2:]
        index = temp.find('ns')
        temp = temp[:index] + '&' + 'ns=' + temp[index + 2:]
        index = temp.find('abbucket')
        temp = 'https:' + temp[:index] + '&' + 'abbucket=' + temp[index + 8:]
        return temp

    def parse(self, response):
        scripts = response.xpath('//script/text()').extract()
        pat = '"raw_title":"(.*?)","pic_url".*?,"detail_url":"(.*?)","view_price":"(.*?)"'
        # str() of the list doubles each backslash, which url_decode relies on
        matches = re.findall(pat, str(scripts))
        # Skip the first match; parse each remaining one and store it
        for title, raw_url, price in matches[1:]:
            weburl = self.url_decode(temp=raw_url)
            item = TmailItem()
            item['name'] = title
            item['link'] = weburl
            item['price'] = price
            self.db.insert_one(dict(item))
            self.detail_urls.append(weburl)
            self.data.append(item)
            yield item
            # Detail-page request, for scraping reviews and other information later
            yield scrapy.Request(url=weburl, callback=self.detail)

    def detail(self, response):
        print(response.url)
        # First determine whether the URL belongs to Tmall or Taobao
        if 'tmall' in str(response.url):
            pass
        else:
            pass

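items.py defines three fields: name, price, and link.

The starting page is Taobao's search URL. I set the keyword to “帽子” (hats); to search for something else, just change the value after q= in the URL.

Because this product category has a great deal of data spread over many pages, I override start_requests(self) to fetch the first 30 pages.

First: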
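As a rough sketch of changing the keyword without hand-editing the percent-encoded value, the query string can be built with the standard library. `build_search_url` is a hypothetical helper, not part of the spider, and it keeps only the `q`, `ie`, and `s` parameters from the original URL; whether Taobao accepts this reduced parameter set is an assumption.

```python
from urllib.parse import quote, urlencode

SEARCH_BASE = 'https://s.taobao.com/search'  # host and path taken from the start URL


def build_search_url(keyword, offset=0):
    """Hypothetical helper: build a search URL for `keyword` starting at item `offset`."""
    # urlencode percent-encodes the keyword as UTF-8, matching the ie=utf8 parameter.
    query = urlencode({'q': keyword, 'ie': 'utf8', 's': offset})
    return SEARCH_BASE + '?' + query


print(quote('帽子'))              # the encoded form of the keyword used in this post
print(build_search_url('帽子'))
```

The first printed value is the same `%E5%B8%BD%E5%AD%90` string that appears after `q=` in the start URL.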

    name = 'weisuen'
    start_url = 'https://s.taobao.com/search?q=%E5%B8%BD%E5%AD%90&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170817&s=300'
    detail_urls = []
    data = []
    client = pymongo.MongoClient("localhost", 27017)
    # Despite the name, db is the "items" collection of the "taobao" database
    db = client.taobao.items

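The MongoDB connection is opened in the class definition. I initially wrote the results to a txt file and a CSV file to check them, and switched to the database once that worked.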
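The CSV step is not shown in the post, so the following is only a minimal sketch of how rows with the three item fields could be written out for inspection. `rows_to_csv` and the sample row are made up for illustration; an in-memory buffer stands in for a real file.

```python
import csv
import io

FIELDS = ['name', 'price', 'link']  # the three fields defined in items.py


def rows_to_csv(rows):
    """Serialize a list of dicts with the three item fields to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample row, only for illustration.
sample = [{'name': 'hat', 'price': '19.90', 'link': 'https://example.com/item?id=1'}]
print(rows_to_csv(sample))
```

To write to disk instead, the same writer can be pointed at a file opened with `open(path, 'w', newline='', encoding='utf-8')`.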

    def start_requests(self):
        for i in range(30):  # 30 pages of data is about enough
            url = self.start_url + '&s=' + str(i * 44)
            yield scrapy.Request(url=url, callback=self.parse)

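By observation, the page is determined by the s=xx parameter at the end of the URL, and its value equals the zero-based page index times 44.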
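The offset rule can be sketched as a small function. This version, which is an alternative to plain string concatenation rather than what the spider does, replaces any `s` already present in the URL so that only one offset is sent. `page_url` is a hypothetical helper, and the shortened `base` URL is used only to keep the example readable.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

ITEMS_PER_PAGE = 44  # observed in this post: each page advances s by 44


def page_url(base_url, page):
    """Return base_url with its s parameter set for the zero-based page index."""
    parts = urlsplit(base_url)
    # Drop any existing s so the URL never carries two conflicting offsets.
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != 's']
    params.append(('s', str(page * ITEMS_PER_PAGE)))
    return urlunsplit(parts._replace(query=urlencode(params)))


base = 'https://s.taobao.com/search?q=%E5%B8%BD%E5%AD%90&ie=utf8&s=300'
for page in range(3):
    print(page_url(base, page))
```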

    def parse(self, response):
        scripts = response.xpath('//script/text()').extract()
        pat = '"raw_title":"(.*?)","pic_url".*?,"detail_url":"(.*?)","view_price":"(.*?)"'
        # str() of the list doubles each backslash, which url_decode relies on
        matches = re.findall(pat, str(scripts))
        # Skip the first match; parse each remaining one and store it
        for title, raw_url, price in matches[1:]:
            weburl = self.url_decode(temp=raw_url)
            item = TmailItem()
            item['name'] = title
            item['link'] = weburl
            item['price'] = price
            self.db.insert_one(dict(item))
            self.detail_urls.append(weburl)
            self.data.append(item)
            yield item
            # Detail-page request, for scraping reviews and other information later
            yield scrapy.Request(url=weburl, callback=self.detail)
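The regular-expression step in parse() can be tried on its own, outside Scrapy. The fragment below is made up to match the shape the pattern expects and is not real Taobao output. Decoding the raw bytes with errors='replace' is shown as one possible way to avoid a decoding exception on mixed-encoding pages; that part is an assumption, not what the spider above does.

```python
import re

# Same pattern as in parse(): title, detail URL, and price from the embedded JSON.
PATTERN = re.compile(
    r'"raw_title":"(.*?)","pic_url".*?,"detail_url":"(.*?)","view_price":"(.*?)"'
)

# Made-up fragment shaped like the data the pattern expects.
body = (
    b'{"raw_title":"hat A","pic_url":"//img.example/a.jpg",'
    b'"detail_url":"//item.example/item.htm?id=1","view_price":"19.90"}'
    b'\xff'  # a stray byte that is not valid UTF-8
)

# errors='replace' swaps undecodable bytes for U+FFFD instead of raising.
text = body.decode('utf-8', errors='replace')
matches = PATTERN.findall(text)
print(matches)
```

Each match is a (title, detail URL, price) tuple, in the order of the three capture groups.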

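The callback parses the fetched page. The trouble here was that response.text raised an error of the form ‘GBK xxxxx’, because Taobao pages are not encoded purely in UTF-8 but also contain other encodings, so decoding that way fails. My approach is to first use XPath to grab all the relevant content and then extract the fields with a regular expression. Each product URL also contains dynamic content that has to be removed; a url_decode() method strips it out. The decoding method is as follows: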

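    def url_decode(self, temp):
        # Remove each 7-character escape run (two backslashes plus a code such as u003d)
        while '\\' in temp:
            index = temp.find('\\')
            st = temp[index:index + 7]
            temp = temp.replace(st, '')

        # Put back the '=' and '&' separators that the escapes stood for
        index = temp.find('id')
        temp = temp[:index + 2] + '=' + temp[index + 2:]
        index = temp.find('ns')
        temp = temp[:index] + '&' + 'ns=' + temp[index + 2:]
        index = temp.find('abbucket')
        temp = 'https:' + temp[:index] + '&' + 'abbucket=' + temp[index + 8:]
        return temp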

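The URL returned at the end can be opened directly. The parse callback writes the extracted fields to the database, and, to make the spider easier to extend, it also generates requests for the detail pages so that reviews, ratings, and other information can be scraped later.

Database contents:

The CSV file generated earlier: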
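As an alternative to stripping the escapes and re-inserting the separators by hand, the escapes can be decoded directly with the json module. This is a sketch under two assumptions: the captured value contains JSON escapes such as \u003d for '=' and \u0026 for '&', and the pattern is matched against the raw script text rather than against str() of a list (which doubles every backslash). `decode_detail_url` and the sample value are hypothetical.

```python
import json


def decode_detail_url(raw):
    """Decode JSON escapes in a protocol-relative URL and add the https scheme."""
    # Wrapping the fragment in quotes makes it a valid JSON string literal,
    # so json.loads turns \u003d into '=' and \u0026 into '&'.
    decoded = json.loads('"' + raw + '"')
    return 'https:' + decoded if decoded.startswith('//') else decoded


# Made-up value shaped like a captured "detail_url"; the backslashes are literal characters.
raw = r'//detail.example.com/item.htm?id\u003d123\u0026ns\u003d1\u0026abbucket\u003d14'
print(decode_detail_url(raw))
```

Because the separators come from the data itself, this approach does not depend on the parameters being named id, ns, and abbucket.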
