
A simple Scrapy application: scraping Lianjia data

I recently needed to scrape a batch of data with Scrapy, so let's use Lianjia as the test case.


Environment setup

pip install scrapy

Basic commands

  • Create a project

    scrapy startproject myproject

  • Run a spider in a project

    scrapy crawl myspider

  • How do you run a Scrapy project inside PyCharm?

    Create a file with the following code:

# run.py -- "dmoz" below is the spider's name; replace it with your own
from scrapy import cmdline
cmdline.execute('scrapy crawl dmoz'.split())
  • Create a spider (you can also add the file by hand)

    • scrapy genspider myspider baidu.com
    • This generates a spider named myspider (see the skeleton sketch after this list)
  • List all spiders in the current project, one per line

    • scrapy list
      • spider1
      • spider2
  • Run a spider defined in a single Python file, without creating a project

    • scrapy runspider myspider.py
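
For reference, the genspider command generates roughly the following skeleton (the exact template varies slightly between Scrapy versions):

# myspider.py -- skeleton produced by `scrapy genspider myspider baidu.com`
import scrapy

class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        # fill in your extraction logic here
        pass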

A quick XPath demo

For example, take the following snippet of HTML:

...
<div class="demo-class">this is the div text</div>
<div class="demo2-class"><a href="www.china.com">this is the a text</a>this is the div text</div>
...

response.xpath('//div[@class="demo-class"]/text()')
# outputs "this is the div text"

response.xpath('//div[@class="demo2-class"]/a/@href')
# outputs www.china.com

temp = response.xpath('//div[@class="demo2-class"]')
ou = temp.xpath('a/@href')
# This is a step-by-step (relative) lookup. Note that "a/@href" has no leading "//":
# "//" searches from the document root, while omitting it searches relative to the
# current selector, so this also outputs www.china.com
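
The queries above assume the response object you get inside a spider callback. If you only want to experiment with XPath, here is a minimal standalone sketch using scrapy.Selector on the same HTML (the file and variable names are mine, not from the project):

# xpath_demo.py -- standalone XPath experiment, no Scrapy project needed
from scrapy import Selector

html = '''
<div class="demo-class">this is the div text</div>
<div class="demo2-class"><a href="www.china.com">this is the a text</a>this is the div text</div>
'''

sel = Selector(text=html)
print(sel.xpath('//div[@class="demo-class"]/text()').extract_first())    # this is the div text
print(sel.xpath('//div[@class="demo2-class"]/a/@href').extract_first())  # www.china.com

# relative (chained) lookup: no leading "//", so the search starts from temp
temp = sel.xpath('//div[@class="demo2-class"]')
print(temp.xpath('a/@href').extract_first())                              # www.china.com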

Scraping Lianjia data (Beijing as the example)

  • Fetch all districts (e.g. Dongcheng, Xicheng, Haidian, ...)
# fetch all districts
import scrapy
# djLjRegion is a Django model defined elsewhere in the project;
# the import path below is an assumption, adjust it to your layout
from lianjia.models import djLjRegion

class LjregionSpider(scrapy.Spider):
    name = 'ljRegion'
    allowed_domains = ['bj.lianjia.com']
    start_urls = ['https://bj.lianjia.com/ershoufang']

    def parse(self, response):
        # district links sit under the div with data-role="ershoufang"
        regions = response.xpath('//div[@data-role="ershoufang"]/div/a')
        for r in regions:
            href = r.xpath('@href').extract()[0]
            name = r.xpath('text()').extract()[0]
            dj_ljRegion = djLjRegion(href=href, name=name)
            dj_ljRegion.save()
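
The spider above persists each district with djLjRegion, a Django model that the post never shows. A minimal sketch of what it might look like, with the field names inferred from the spider and everything else assumed:

# models.py -- hypothetical Django model behind the ljRegion spider
from django.db import models

class djLjRegion(models.Model):
    # relative URL of the district's listing page, e.g. /ershoufang/dongcheng/
    href = models.CharField(max_length=255)
    # district name, e.g. Dongcheng
    name = models.CharField(max_length=64)

    def __str__(self):
        return self.name

With the Django integration configured, scrapy crawl ljRegion then writes one row per district.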
  • Fetch the summary information for each listing
#-*- coding:utf-8 -*-
import scrapy
from lianjia.items import *
import json
import codecs
import logging

class ershoufang( scrapy.Spider ):
    name = "ershoufang"
    allowed_domains=['bj.lianjia.com']
    start_urls=['https://bj.lianjia.com/ershoufang/dongcheng/']
    file = codecs.open("scrapyUrl.txt", "w", encoding="utf-8")
    def parse(self, response):
        try:
            houseDetailClear = response.xpath('//div[@class="content "]/div[@class="leftContent"]/ul/li')
            for item in houseDetailClear:
                ljItem=LianjiaItem()
                ljItem['houseCode'] = item.xpath('a[@class="img "]/@data-housecode').extract()[0]
                ljItem['href'] = item.xpath('a[@class="img "]/@href').extract()[0]
                ljItem['title'] = item.xpath('div[@class="info clear"]/div[@class="title"]/a/text()').extract()[0]
                ljItem['houseInfoRegion']=item.xpath('div[@class="info clear"]/div[@class="address"]/div/a/text()').extract()[0]
                ljItem['houseInfo'] =ljItem['houseInfoRegion']+ item.xpath('div[@class="info clear"]/div[@class="address"]/div/text()').extract()[0]
                ljItem['houseInfoRegionHref']=item.xpath('div[@class="info clear"]/div[@class="address"]/div/a/@href').extract()[0]
                ljItem['positionInfo']=item.xpath('div[@class="info clear"]/div[@class="flood"]/div/text()').extract()[0]
                ljItem['positionInfoRegion']=item.xpath('div[@class="info clear"]/div[@class="flood"]/div/a/text()').extract()[0]
                ljItem['followInfo']=item.xpath('div[@class="info clear"]/div[@class="followInfo"]/text()').extract()[0]
                tagSubway=item.xpath('div[@class="info clear"]/div[@class="tag"]/span/text()').extract()
                if len(tagSubway)!=0:
                    ljItem['tagSubway']=tagSubway[0]
                else :
                    ljItem['tagSubway']=''
                tagTaxfree=item.xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="taxfree"]/text()').extract()
                if len(tagTaxfree)!=0:
                    ljItem['tagTaxfree'] =tagTaxfree[0]
                else:
                    ljItem['tagTaxfree'] = ''
                tagHaskey=item.xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="haskey"]/text()').extract()
                if len(tagHaskey) != 0:
                    ljItem['tagHaskey'] = tagHaskey[0]
                else:
                    ljItem['tagHaskey'] = ''
                ljItem['totalPrice']=item.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[@class="totalPrice"]/span/text()').extract()[0]
                ljItem['unitPrice']=item.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[@class="unitPrice"]/@data-price').extract()[0]
                yield ljItem
            ershoufangRegions=response.xpath('//div[@data-role="ershoufang"]/div/a/@href').extract()
            selectRegion = response.xpath('//div[@data-role="ershoufang"]/div/a[@class="selected"]')
            selectRegion = selectRegion.xpath('@href').extract()[0]
            resPageInfo = response.xpath('//div[@class="page-box house-lst-page-box"]/@page-data')[0].extract().encode('utf-8')
            pgInfo = json.loads(resPageInfo)
            totalPage = pgInfo['totalPage']
            curPage = pgInfo['curPage']
            if curPage < totalPage:
                next_href='https://bj.lianjia.com%spg%d/'%(selectRegion,curPage+1)
                self.file.write('\n'+next_href+'\n')
                # if the next page exists, build its absolute URL with urljoin:
                next_page = response.urljoin(next_href)
                # call back into parse to handle the next page's URL
                yield scrapy.Request(next_page, callback=self.parse)
            else:
                regionIndex=ershoufangRegions.index(selectRegion)
                if regionIndex < len(ershoufangRegions)-1:
                    selectRegion = ershoufangRegions[regionIndex+1]
                    next_href = 'https://bj.lianjia.com%s' % selectRegion  # selectRegion already starts with '/'
                    self.file.write("\n====================================\n")
                    self.file.write('\n')
                    self.file.write(next_href + '\n')
                    self.file.write('\n')
                    self.file.write("====================================\n")
                    # build the absolute URL of the next district's listing page with urljoin:
                    next_page = response.urljoin(next_href)
                    # call back into parse to handle the next district
                    yield scrapy.Request(next_page, callback=self.parse)
                else:
                    return
        except Exception as e:
            print(str(e))
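
The spider imports everything from lianjia.items, but the post never shows that file. A minimal LianjiaItem sketch with exactly the fields the spider assigns (the class name comes from the code above; the comments and file layout are my guesses):

# items.py -- hypothetical item definition matching the ershoufang spider
import scrapy

class LianjiaItem(scrapy.Item):
    houseCode = scrapy.Field()            # Lianjia's internal listing id
    href = scrapy.Field()                 # URL of the listing's detail page
    title = scrapy.Field()                # listing title
    houseInfoRegion = scrapy.Field()      # community (xiaoqu) name
    houseInfo = scrapy.Field()            # layout, area, orientation, ...
    houseInfoRegionHref = scrapy.Field()  # link to the community page
    positionInfo = scrapy.Field()         # floor and build year
    positionInfoRegion = scrapy.Field()   # sub-district name
    followInfo = scrapy.Field()           # followers / viewings / days listed
    tagSubway = scrapy.Field()            # "near subway" tag, if present
    tagTaxfree = scrapy.Field()           # "tax free" tag, if present
    tagHaskey = scrapy.Field()            # "key available" tag, if present
    totalPrice = scrapy.Field()           # total price
    unitPrice = scrapy.Field()            # price per square meter

Run it with scrapy crawl ershoufang once the item definition is in place.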