A simple Scrapy application: scraping Lianjia data
阿新 · Published 2018-11-05
I recently needed to scrape a batch of data with Scrapy, so let's experiment on Lianjia.
Environment setup
pip install scrapy
Basic commands
Create a project
scrapy startproject myproject
Run a spider
scrapy crawl myspider
How do you run a Scrapy project inside PyCharm?
Create a file with the following code:
# run.py
from scrapy import cmdline
# "dmoz" is the name of the spider to run
cmdline.execute('scrapy crawl dmoz'.split())
Create a spider (you can also add the file by hand)
- scrapy genspider myspider baidu.com
- This creates a spider named myspider
List all spiders in the current project, one per line
- scrapy list
- spider1
- spider2
Run a spider written in a single Python file, without creating a project
- scrapy runspider myspider.py
A quick XPath demo
Suppose we have the following HTML:
...
<div class="demo-class"> this is the div's text </div>
<div class="demo2-class"><a href="www.china.com">this is the a's text</a>this is the div's text</div>
response.xpath('//div[@class="demo-class"]/text()')
# outputs "this is the div's text"
response.xpath('//div[@class="demo2-class"]/a/@href')
# outputs www.china.com
temp = response.xpath('//div[@class="demo2-class"]')
ou = temp.xpath('a/@href')
# This is a step-by-step (relative) lookup. Note that "a/@href" has no leading "//":
# "//" searches from the document root, while a path without it is resolved relative
# to the current node. The output here is again www.china.com.
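The absolute-vs-relative distinction can be tried out without Scrapy at all. The sketch below uses the standard library's xml.etree.ElementTree, whose limited XPath support is enough to show the idea (Scrapy's response.xpath() accepts the same path shapes, plus full XPath features such as text()); the HTML is the snippet from above wrapped in a root element.

```python
# Absolute (".//") vs. relative lookups, illustrated with the standard library.
import xml.etree.ElementTree as ET

html = """
<root>
  <div class="demo-class"> this is the div's text </div>
  <div class="demo2-class"><a href="www.china.com">this is the a's text</a>this is the div's text</div>
</root>
"""
root = ET.fromstring(html)

# ".//" searches the whole tree under root, like "//" in Scrapy
div = root.find('.//div[@class="demo2-class"]')

# A path without a leading ".//" is resolved relative to the current node,
# just like temp.xpath('a/@href') in Scrapy
href = div.find('a').get('href')
print(href)  # www.china.com
```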
Scraping Lianjia data (Beijing as the example)
- Get all districts (e.g. Dongcheng, Xicheng, Haidian…)
# Get all regions
import scrapy
# djLjRegion is a Django model defined elsewhere in the project;
# the import path below is an assumption
from lianjia.models import djLjRegion

class LjregionSpider(scrapy.Spider):
    name = 'ljRegion'
    allowed_domains = ['bj.lianjia.com']
    start_urls = ['https://bj.lianjia.com/ershoufang']

    def parse(self, response):
        regions = response.xpath('//div[@data-role="ershoufang"]/div/a')
        for r in regions:
            href = r.xpath('@href').extract()[0]
            name = r.xpath('text()').extract()[0]
            dj_ljRegion = djLjRegion(href=href, name=name)
            dj_ljRegion.save()
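The hrefs saved here are site-relative paths; before they can be requested they have to be joined onto the site root, which is what response.urljoin does inside Scrapy. A minimal standalone sketch with urllib.parse (the example href is illustrative of what Lianjia's region links look like):

```python
# Joining a region's relative href onto the site root.
# The href value below is illustrative, not live data.
from urllib.parse import urljoin

base = 'https://bj.lianjia.com/ershoufang'
region_href = '/ershoufang/dongcheng/'

# An absolute path replaces the base URL's path, keeping scheme and host
region_url = urljoin(base, region_href)
print(region_url)  # https://bj.lianjia.com/ershoufang/dongcheng/
```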
- Get the summary info for each listing
# -*- coding:utf-8 -*-
import scrapy
from lianjia.items import LianjiaItem
import json
import codecs

class ershoufang(scrapy.Spider):
    name = "ershoufang"
    allowed_domains = ['bj.lianjia.com']
    start_urls = ['https://bj.lianjia.com/ershoufang/dongcheng/']
    file = codecs.open("scrapyUrl.txt", "w", encoding="utf-8")

    def parse(self, response):
        try:
            houseDetailClear = response.xpath('//div[@class="content "]/div[@class="leftContent"]/ul/li')
            for item in houseDetailClear:
                ljItem = LianjiaItem()
                ljItem['houseCode'] = item.xpath('a[@class="img "]/@data-housecode').extract()[0]
                ljItem['href'] = item.xpath('a[@class="img "]/@href').extract()[0]
                ljItem['title'] = item.xpath('div[@class="info clear"]/div[@class="title"]/a/text()').extract()[0]
                ljItem['houseInfoRegion'] = item.xpath('div[@class="info clear"]/div[@class="address"]/div/a/text()').extract()[0]
                ljItem['houseInfo'] = ljItem['houseInfoRegion'] + item.xpath('div[@class="info clear"]/div[@class="address"]/div/text()').extract()[0]
                ljItem['houseInfoRegionHref'] = item.xpath('div[@class="info clear"]/div[@class="address"]/div/a/@href').extract()[0]
                ljItem['positionInfo'] = item.xpath('div[@class="info clear"]/div[@class="flood"]/div/text()').extract()[0]
                ljItem['positionInfoRegion'] = item.xpath('div[@class="info clear"]/div[@class="flood"]/div/a/text()').extract()[0]
                ljItem['followInfo'] = item.xpath('div[@class="info clear"]/div[@class="followInfo"]/text()').extract()[0]
                # Optional tags: fall back to an empty string when absent
                tagSubway = item.xpath('div[@class="info clear"]/div[@class="tag"]/span/text()').extract()
                ljItem['tagSubway'] = tagSubway[0] if tagSubway else ''
                tagTaxfree = item.xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="taxfree"]/text()').extract()
                ljItem['tagTaxfree'] = tagTaxfree[0] if tagTaxfree else ''
                tagHaskey = item.xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="haskey"]/text()').extract()
                ljItem['tagHaskey'] = tagHaskey[0] if tagHaskey else ''
                ljItem['totalPrice'] = item.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[@class="totalPrice"]/span/text()').extract()[0]
                ljItem['unitPrice'] = item.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[@class="unitPrice"]/@data-price').extract()[0]
                yield ljItem

            # Pagination: the page state is embedded as JSON in the "page-data" attribute
            ershoufangRegions = response.xpath('//div[@data-role="ershoufang"]/div/a/@href').extract()
            selectRegion = response.xpath('//div[@data-role="ershoufang"]/div/a[@class="selected"]/@href').extract()[0]
            resPageInfo = response.xpath('//div[@class="page-box house-lst-page-box"]/@page-data')[0].extract()
            pgInfo = json.loads(resPageInfo)
            totalPage = pgInfo['totalPage']
            curPage = pgInfo['curPage']
            if curPage < totalPage:
                next_href = 'https://bj.lianjia.com%spg%d/' % (selectRegion, curPage + 1)
                self.file.write('\n' + next_href + '\n')
                # Build the absolute next-page URL and feed it back into parse
                next_page = response.urljoin(next_href)
                yield scrapy.Request(next_page, callback=self.parse)
            else:
                # Last page of this region: move on to the next region, if any
                regionIndex = ershoufangRegions.index(selectRegion)
                if regionIndex < len(ershoufangRegions) - 1:
                    selectRegion = ershoufangRegions[regionIndex + 1]
                    next_href = 'https://bj.lianjia.com%s' % selectRegion  # region hrefs already start with "/"
                    self.file.write("\n====================================\n")
                    self.file.write('\n' + next_href + '\n\n')
                    self.file.write("====================================\n")
                    next_page = response.urljoin(next_href)
                    yield scrapy.Request(next_page, callback=self.parse)
                else:
                    return
        except Exception as e:
            print(str(e))
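The pagination logic above hinges on that small JSON object stored in the page-data attribute. Here is the same step as a standalone sketch; the attribute value below is a sample of the shape Lianjia serves, not live data.

```python
# Sketch of the pagination step: parse the "page-data" JSON and build the
# next-page URL with the Lianjia "pgN" path convention.
import json

page_data = '{"totalPage":100,"curPage":1}'   # sample attribute value
select_region = '/ershoufang/dongcheng/'      # href of the selected region

pg = json.loads(page_data)
if pg['curPage'] < pg['totalPage']:
    next_href = 'https://bj.lianjia.com%spg%d/' % (select_region, pg['curPage'] + 1)
    print(next_href)  # https://bj.lianjia.com/ershoufang/dongcheng/pg2/
```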