
Using Scrapy with Selenium to crawl Taobao and other dynamic websites


1. First, create the Scrapy project.

2. Open the spider file and write the spider class:

import scrapy
from selenium import webdriver


class TaobaoSpider(scrapy.Spider):
    name = 'taobao'
    allowed_domains = ['taobao.com']
    # Example start page: Taobao search results for laptops (q=笔记本电脑)
    start_urls = ['https://s.taobao.com/search?initiative_id=tbindexz_20170306&ie=utf8&spm=a21bo.2017.201856-taobao-item.2&sourceId=tb.index&search_type=item&ssid=s5-e&commend=all&imgfile=&q=%E7%AC%94%E8%AE%B0%E6%9C%AC%E7%94%B5%E8%84%91&suggest=0_1&_input_charset=utf-8&wq=%E7%AC%94%E8%AE%B0%E6%9C%AC&suggest_query=%E7%AC%94%E8%AE%B0%E6%9C%AC&source=suggest']

    # Next, define the initializer
    def __init__(self, *args, **kwargs):
        super(TaobaoSpider, self).__init__(*args, **kwargs)
        # PhantomJS (a headless browser) is used here; Firefox() or Chrome() work as well.
        # PhantomJS support was removed in Selenium 4, so newer installs need one of those.
        self.driver = webdriver.PhantomJS()

    # Then parse the rendered page source
    def parse(self, response):
        div_info = response.xpath('//div[@class="info-cont"]')
        for div in div_info:
            title = div.xpath('div[@class="title-row"]/a/text()').extract_first('')
            price = div.xpath('div[contains(@class, "sale-row")]/div/span[contains(@class, "price")]/strong/text()').extract_first('')
            print('Name: %s  Price: %s' % (title, price))

    # Called when the spider closes: shut down the browser as well
    def closed(self, reason):
        print('Spider closed, reason: %s' % reason)
        self.driver.quit()
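
The relative XPath expressions in parse() can be tried offline before pointing the spider at the live site. The following is only a rough sketch: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, the markup in SAMPLE is a hand-written, hypothetical fragment (not real Taobao HTML), and extract_items is an illustrative helper name. ElementTree does not support contains(), so exact class matches are used here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment that mimics the structure parse() expects.
SAMPLE = """
<root>
  <div class="info-cont">
    <div class="title-row"><a>Laptop A</a></div>
    <div class="sale-row"><div><span class="price"><strong>4999.00</strong></span></div></div>
  </div>
  <div class="info-cont">
    <div class="title-row"><a>Laptop B</a></div>
    <div class="sale-row"><div><span class="price"><strong>5899.00</strong></span></div></div>
  </div>
</root>
"""


def extract_items(markup):
    """Return (title, price) pairs, one per div with class "info-cont"."""
    root = ET.fromstring(markup.strip())
    items = []
    # Outer query: every product container, anywhere in the tree.
    for div in root.findall(".//div[@class='info-cont']"):
        # Inner queries are relative to the container, as in parse().
        title = div.findtext("div[@class='title-row']/a", default="")
        price = div.findtext(
            "div[@class='sale-row']/div/span[@class='price']/strong", default=""
        )
        items.append((title.strip(), price.strip()))
    return items


if __name__ == "__main__":
    for title, price in extract_items(SAMPLE):
        print("Name: %s  Price: %s" % (title, price))
```

As in the spider, the inner paths have no leading slashes, so each one is evaluated relative to the current product container rather than the whole page.
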
That completes the spider class. Next, set up the downloader middleware.
import time
from selenium import webdriver
from scrapy.http.response.html import HtmlResponse
from scrapy.http.response import Response
These are the modules the middleware needs.
Now write a custom downloader middleware class:

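class SeleniumRequestDownloadMiddleWare(object):
    # No separate browser is started here: the middleware reuses the driver
    # that the spider creates in __init__ and quits in closed().

    def process_request(self, request, spider):
        if spider.name == 'taobao':
            spider.driver.get(request.url)
            # Scroll down the page in steps so that lazily loaded content is rendered
            for x in range(1, 11, 2):
                i = float(x) / 10
                js = "document.body.scrollTop=document.body.scrollHeight * %f" % i
                spider.driver.execute_script(js)
                # Wait 1 second; if the page loads slowly, the data is missing otherwise
                time.sleep(1)
            # HtmlResponse (not the base Response) is required for response.xpath() in parse()
            response = HtmlResponse(url=request.url,
                                    body=spider.driver.page_source.encode('utf-8'),
                                    encoding='utf-8',
                                    request=request)
            # Returning a response makes Scrapy skip its own download of this request
            return response
        # Any other spider: return None so Scrapy downloads the request as usual
        return None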
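
For Scrapy to call the middleware, it also has to be enabled in the project's settings.py through the DOWNLOADER_MIDDLEWARES setting. A minimal sketch, assuming the project package is named myproject (a hypothetical name, since the project name is not given here) and that the class lives in that package's middlewares.py; 543 is simply a commonly used priority value and can be adjusted:

```python
# settings.py (fragment)
# "myproject" is a placeholder: replace it with the actual project package name.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SeleniumRequestDownloadMiddleWare": 543,
}
```

With this entry in place, requests from the taobao spider are rendered through the Selenium driver, while other spiders in the same project still use Scrapy's normal downloader.
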
