A simple Scrapy application: scraping Lianjia data
阿新 · Published 2018-11-05
I recently needed to scrape a batch of data with Scrapy, so let's experiment on Lianjia.
Environment setup
pip install scrapy
Basic commands
Create a project
scrapy startproject myproject
Run a spider
scrapy crawl myspider
How do you run a Scrapy project inside PyCharm?
Create a file with the following code:
# run.py
from scrapy import cmdline
# "dmoz" is the name of the spider to run
cmdline.execute('scrapy crawl dmoz'.split())
Create a spider (you can also add the file by hand)
- scrapy genspider myspider baidu.com
- This creates a spider named myspider
List all spiders in the current project, one per line
- scrapy list
- spider1
- spider2
Run a spider written in a single Python file, without creating a project
- scrapy runspider myspider.py
A quick XPath demo
Suppose we have the following HTML:
...
<div class="demo-class"> this is the div's text </div>
<div class="demo2-class"><a href="www.china.com">this is the a's text</a>this is the div's text</div>
response.xpath('//div[@class="demo-class"]/text()')
# outputs "this is the div's text"
response.xpath('//div[@class="demo2-class"]/a/@href')
# outputs www.china.com
temp = response.xpath('//div[@class="demo2-class"]')
ou = temp.xpath('a/@href')
# This is a step-by-step (relative) lookup. Note that "a/@href" has no leading "//":
# "//" searches from the document root, while a path without it is resolved relative
# to the current node. The output here is again www.china.com.
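The absolute-vs-relative distinction can be tried out without Scrapy at all. The sketch below uses the standard library's xml.etree.ElementTree, whose limited XPath support is enough to show the idea (Scrapy's response.xpath() accepts the same path shapes, plus full XPath features such as text()); the HTML is the snippet from above wrapped in a root element.

```python
# Absolute (".//") vs. relative lookups, illustrated with the standard library.
import xml.etree.ElementTree as ET

html = """
<root>
  <div class="demo-class"> this is the div's text </div>
  <div class="demo2-class"><a href="www.china.com">this is the a's text</a>this is the div's text</div>
</root>
"""
root = ET.fromstring(html)

# ".//" searches the whole tree under root, like "//" in Scrapy
div = root.find('.//div[@class="demo2-class"]')

# A path without a leading ".//" is resolved relative to the current node,
# just like temp.xpath('a/@href') in Scrapy
href = div.find('a').get('href')
print(href)  # www.china.com
```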
Scraping Lianjia data (Beijing as the example)
- Get all districts (e.g. Dongcheng, Xicheng, Haidian…)
# Get all regions
import scrapy
# djLjRegion is a Django model defined elsewhere in the project;
# the import path below is an assumption
from lianjia.models import djLjRegion

class LjregionSpider(scrapy.Spider):
    name = 'ljRegion'
    allowed_domains = ['bj.lianjia.com']
    start_urls = ['https://bj.lianjia.com/ershoufang']

    def parse(self, response):
        regions = response.xpath('//div[@data-role="ershoufang"]/div/a')
        for r in regions:
            href = r.xpath('@href').extract()[0]
            name = r.xpath('text()').extract()[0]
            dj_ljRegion = djLjRegion(href=href, name=name)
            dj_ljRegion.save()
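The hrefs saved here are site-relative paths; before they can be requested they have to be joined onto the site root, which is what response.urljoin does inside Scrapy. A minimal standalone sketch with urllib.parse (the example href is illustrative of what Lianjia's region links look like):

```python
# Joining a region's relative href onto the site root.
# The href value below is illustrative, not live data.
from urllib.parse import urljoin

base = 'https://bj.lianjia.com/ershoufang'
region_href = '/ershoufang/dongcheng/'

# An absolute path replaces the base URL's path, keeping scheme and host
region_url = urljoin(base, region_href)
print(region_url)  # https://bj.lianjia.com/ershoufang/dongcheng/
```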
- Get the summary info for each listing
# -*- coding:utf-8 -*-
import scrapy
from lianjia.items import LianjiaItem
import json
import codecs

class ershoufang(scrapy.Spider):
    name = "ershoufang"
    allowed_domains = ['bj.lianjia.com']
    start_urls = ['https://bj.lianjia.com/ershoufang/dongcheng/']
    file = codecs.open("scrapyUrl.txt", "w", encoding="utf-8")

    def parse(self, response):
        try:
            houseDetailClear = response.xpath('//div[@class="content "]/div[@class="leftContent"]/ul/li')
            for item in houseDetailClear:
                ljItem = LianjiaItem()
                ljItem['houseCode'] = item.xpath('a[@class="img "]/@data-housecode').extract()[0]
                ljItem['href'] = item.xpath('a[@class="img "]/@href').extract()[0]
                ljItem['title'] = item.xpath('div[@class="info clear"]/div[@class="title"]/a/text()').extract()[0]
                ljItem['houseInfoRegion'] = item.xpath('div[@class="info clear"]/div[@class="address"]/div/a/text()').extract()[0]
                ljItem['houseInfo'] = ljItem['houseInfoRegion'] + item.xpath('div[@class="info clear"]/div[@class="address"]/div/text()').extract()[0]
                ljItem['houseInfoRegionHref'] = item.xpath('div[@class="info clear"]/div[@class="address"]/div/a/@href').extract()[0]
                ljItem['positionInfo'] = item.xpath('div[@class="info clear"]/div[@class="flood"]/div/text()').extract()[0]
                ljItem['positionInfoRegion'] = item.xpath('div[@class="info clear"]/div[@class="flood"]/div/a/text()').extract()[0]
                ljItem['followInfo'] = item.xpath('div[@class="info clear"]/div[@class="followInfo"]/text()').extract()[0]
                # Optional tags: fall back to an empty string when absent
                tagSubway = item.xpath('div[@class="info clear"]/div[@class="tag"]/span/text()').extract()
                ljItem['tagSubway'] = tagSubway[0] if tagSubway else ''
                tagTaxfree = item.xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="taxfree"]/text()').extract()
                ljItem['tagTaxfree'] = tagTaxfree[0] if tagTaxfree else ''
                tagHaskey = item.xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="haskey"]/text()').extract()
                ljItem['tagHaskey'] = tagHaskey[0] if tagHaskey else ''
                ljItem['totalPrice'] = item.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[@class="totalPrice"]/span/text()').extract()[0]
                ljItem['unitPrice'] = item.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[@class="unitPrice"]/@data-price').extract()[0]
                yield ljItem

            # Pagination: the page state is embedded as JSON in the "page-data" attribute
            ershoufangRegions = response.xpath('//div[@data-role="ershoufang"]/div/a/@href').extract()
            selectRegion = response.xpath('//div[@data-role="ershoufang"]/div/a[@class="selected"]/@href').extract()[0]
            resPageInfo = response.xpath('//div[@class="page-box house-lst-page-box"]/@page-data')[0].extract()
            pgInfo = json.loads(resPageInfo)
            totalPage = pgInfo['totalPage']
            curPage = pgInfo['curPage']
            if curPage < totalPage:
                next_href = 'https://bj.lianjia.com%spg%d/' % (selectRegion, curPage + 1)
                self.file.write('\n' + next_href + '\n')
                # Build the absolute next-page URL and feed it back into parse
                next_page = response.urljoin(next_href)
                yield scrapy.Request(next_page, callback=self.parse)
            else:
                # Last page of this region: move on to the next region, if any
                regionIndex = ershoufangRegions.index(selectRegion)
                if regionIndex < len(ershoufangRegions) - 1:
                    selectRegion = ershoufangRegions[regionIndex + 1]
                    next_href = 'https://bj.lianjia.com%s' % selectRegion  # region hrefs already start with "/"
                    self.file.write("\n====================================\n")
                    self.file.write('\n' + next_href + '\n\n')
                    self.file.write("====================================\n")
                    next_page = response.urljoin(next_href)
                    yield scrapy.Request(next_page, callback=self.parse)
                else:
                    return
        except Exception as e:
            print(str(e))
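The pagination logic above hinges on that small JSON object stored in the page-data attribute. Here is the same step as a standalone sketch; the attribute value below is a sample of the shape Lianjia serves, not live data.

```python
# Sketch of the pagination step: parse the "page-data" JSON and build the
# next-page URL with the Lianjia "pgN" path convention.
import json

page_data = '{"totalPage":100,"curPage":1}'   # sample attribute value
select_region = '/ershoufang/dongcheng/'      # href of the selected region

pg = json.loads(page_data)
if pg['curPage'] < pg['totalPage']:
    next_href = 'https://bj.lianjia.com%spg%d/' % (select_region, pg['curPage'] + 1)
    print(next_href)  # https://bj.lianjia.com/ershoufang/dongcheng/pg2/
```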