
Python Crawler Project: Scraping New Homes in Lianjia's Popular Cities


This hands-on project uses a crawler to scrape new-home listings from Lianjia. (Disclaimer: the content is for learning and exchange only; please do not use it for commercial purposes.)

Environment

Windows 8, Python 3.7, PyCharm

Main Content

1. Target Site Analysis

Analyze the site to find the relevant URLs, determine the request method, and check whether any JavaScript encryption is involved, as the sketch below illustrates.
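As a rough illustration of this step, the sketch below uses requests plus parsel (installed alongside Scrapy) to confirm that the listing data and the pagination count are present in the static HTML of a city's /loupan/ page. The class names are taken from the XPath expressions used later in the spider and may change if the site is redesigned.

import requests
from parsel import Selector

# Fetch one city's new-home list page and check that the data we need
# is already in the static HTML (i.e., no JS rendering or encryption required).
resp = requests.get("https://bj.fang.lianjia.com/loupan/",
                    headers={"User-Agent": "Mozilla/5.0"})
sel = Selector(resp.text)
print(resp.status_code)
# Total listing count used later to compute the number of pages
print(sel.xpath('//div[@class="page-box"]/@data-total-count').get())
# Number of listings rendered on this page
print(len(sel.xpath('//ul[@class="resblock-list-wrapper"]/li')))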

2. Create the Scrapy Project

1. In a cmd window, run the following command to create the lianjia project:

scrapy startproject lianjia

2. In cmd, change into the lianjia directory and create the spider file:

cd lianjia
scrapy genspider -t crawl xinfang lianjia.com

This creates a CrawlSpider subclass, which is well suited to crawling pages in batches.

3. Create a main.py file for running the Scrapy project.

At this point the project scaffold is complete; next we start writing the code.
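For reference, the resulting layout looks roughly like this (main.py is the file we add ourselves next to scrapy.cfg; the exact scaffold may vary slightly with the Scrapy version):

lianjia/
├── scrapy.cfg
├── main.py
└── lianjia/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── xinfang.py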

3. Define the Fields

Define the fields to be scraped in items.py:

import scrapy
from scrapy.item import Item, Field

class LianjiaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    city = Field()           # city name
    name = Field()           # development name
    type = Field()           # property type
    status = Field()         # sale status
    region = Field()         # district
    street = Field()         # street
    address = Field()        # full address
    area = Field()           # floor area
    average_price = Field()  # average price
    total_price = Field()    # total price
    tags = Field()           # tags

4. The Main Spider

Write the main spider code in xinfang.py:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from lianjia.items import LianjiaItem

class XinfangSpider(CrawlSpider):
    name = "xinfang"
    allowed_domains = ["lianjia.com"]
    start_urls = ["https://bj.fang.lianjia.com/"]
    # Crawl rules: LinkExtractor extracts links (allow restricts the allowed link pattern,
    # restrict_xpaths restricts where in the page structure the links may appear);
    # follow=True means the extracted links are followed, and callback names the parsing method.
    rules = (
        Rule(LinkExtractor(allow=r"\.fang.*com/$", restrict_xpaths='//div[@class="footer"]//div[@class="link-list"]/div[2]/dd'), follow=True),
        Rule(LinkExtractor(allow=r".*loupan/$", restrict_xpaths='//div[@class="xinfang-all"]/div/a'), callback="parse_item", follow=True)
    )

    def parse_item(self, response):
        '''Request the URL of each listing page'''
        counts = response.xpath('//div[@class="page-box"]/@data-total-count').extract_first()
        pages = int(counts) // 10 + 2
        # The site shows at most 100 pages, so cap the page count
        if pages > 100:
            pages = 101
        for page in range(1, pages):
            url = response.url + "pg" + str(page)
            yield scrapy.Request(url, callback=self.parse_detail, dont_filter=False)

    def parse_detail(self, response):
        '''Parse the page content'''
        item = LianjiaItem()
        item["city"] = response.xpath('//div[@class="resblock-have-find"]/span[3]/text()').extract_first()[1:]
        infos = response.xpath('//ul[@class="resblock-list-wrapper"]/li')
        for info in infos:
            item["name"] = info.xpath('div/div[1]/a/text()').extract_first()
            item["type"] = info.xpath('div/div[1]/span[1]/text()').extract_first()
            item["status"] = info.xpath('div/div[1]/span[2]/text()').extract_first()
            item["region"] = info.xpath('div/div[2]/span[1]/text()').extract_first()
            item["street"] = info.xpath('div/div[2]/span[2]/text()').extract_first()
            item["address"] = info.xpath('div/div[2]/a/text()').extract_first().replace(",", "")
            item["area"] = info.xpath('div/div[@class="resblock-area"]/span/text()').extract_first()
            item["average_price"] = "".join(info.xpath('div//div[@class="main-price"]//text()').extract()).replace(" ", "")
            item["total_price"] = info.xpath('div//div[@class="second"]/text()').extract_first()
            item["tags"] = ";".join(info.xpath('div//div[@class="resblock-tag"]//text()').extract()).replace(" ", "").replace("\n", "")
            yield item

5. Saving to a MySQL Database

Edit pipelines.py with the following code (a sketch for creating the target MySQL table follows the pipeline code):

import pymysql
class LianjiaPipeline(object):
    def __init__(self):
        # create the database connection
        self.db = pymysql.connect(
            host = "localhost",
            user = "root",
            password = "1234",
            port = 3306,
            db = "lianjia",
            charset = "utf8"
        )
        self.cursor = self.db.cursor()
    def process_item(self, item, spider):
        # insert the record into the database
        sql = "INSERT INTO xinfang(city, name, type, status, region, street, address, area, average_price, total_price, tags) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
        data = (item["city"], item["name"], item["type"], item["status"], item["region"], item["street"], item["address"], item["area"], item["average_price"], item["total_price"], item["tags"])
        try:
            self.cursor.execute(sql, data)
            self.db.commit()
        except:
            self.db.rollback()
        finally:
            return item
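The pipeline assumes a MySQL database named lianjia with an xinfang table already exists. The original post does not show the table definition, so the column names below follow the INSERT statement while the column types are assumptions; a minimal sketch for creating it:

import pymysql

# Assumed schema; adjust column types/lengths to taste.
db = pymysql.connect(host="localhost", user="root", password="1234",
                     port=3306, charset="utf8")
cursor = db.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS lianjia DEFAULT CHARACTER SET utf8")
cursor.execute("""
    CREATE TABLE IF NOT EXISTS lianjia.xinfang (
        id INT AUTO_INCREMENT PRIMARY KEY,
        city VARCHAR(32),
        name VARCHAR(128),
        type VARCHAR(32),
        status VARCHAR(32),
        region VARCHAR(64),
        street VARCHAR(64),
        address VARCHAR(255),
        area VARCHAR(64),
        average_price VARCHAR(64),
        total_price VARCHAR(64),
        tags VARCHAR(255)
    ) DEFAULT CHARSET=utf8
""")
db.close()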

6. Anti-Anti-Crawling Measures

Since this is a batch crawl, some anti-anti-crawling measures are worthwhile. Here I use free proxy IPs. Edit middlewares.py as follows (a quick way to test the proxy API is sketched after the middleware code):

from scrapy import signals
import logging
import requests
class ProxyMiddleware(object):
    def __init__(self, proxy):
        self.logger = logging.getLogger(__name__)
        self.proxy = proxy

    @classmethod
    def from_crawler(cls, crawler):
        '''Read the random-proxy API endpoint from the settings'''
        settings = crawler.settings
        return cls(
            proxy=settings.get("RANDOM_PROXY")
        )

    def get_random_proxy(self):
        '''Fetch a random proxy from the API'''
        try:
            response = requests.get(self.proxy)
            if response.status_code == 200:
                proxy = response.text
                return proxy
        except:
            return False

    def process_request(self, request, spider):
        '''Send the request through a randomly chosen proxy'''
        proxy = self.get_random_proxy()
        if proxy:
            url = "http://" + str(proxy)
            self.logger.debug("Using proxy: " + proxy)
            request.meta["proxy"] = url
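The middleware assumes a local proxy-pool service (the http://localhost:6686/random endpoint configured in settings below) that returns one plain "ip:port" string per request. A quick, hedged way to check that the endpoint behaves as expected before starting the crawl:

import requests

# Assumes the proxy-pool API returns a plain "ip:port" string.
resp = requests.get("http://localhost:6686/random", timeout=5)
print(resp.status_code, resp.text)

# Optionally verify that the returned proxy actually reaches the target site.
addr = "http://" + resp.text.strip()
proxies = {"http": addr, "https": addr}
print(requests.get("https://bj.fang.lianjia.com/",
                   proxies=proxies, timeout=10).status_code)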

7. Configure settings.py

import random

RANDOM_PROXY = "http://localhost:6686/random"
BOT_NAME = "lianjia"
SPIDER_MODULES = ["lianjia.spiders"]
NEWSPIDER_MODULE = "lianjia.spiders"
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = random.random() * 2
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}
DOWNLOADER_MIDDLEWARES = {
    "lianjia.middlewares.ProxyMiddleware": 543,
}
ITEM_PIPELINES = {
    "lianjia.pipelines.LianjiaPipeline": 300,
}

8. Run the Project

Put the following in main.py and run it:

from scrapy import cmdline
cmdline.execute("scrapy crawl xinfang".split())

The Scrapy project then starts running; in the end it scraped more than 14,000 records.
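To sanity-check the run, the row count can be read back from MySQL; a minimal sketch (connection parameters as in the pipeline above):

import pymysql

db = pymysql.connect(host="localhost", user="root", password="1234",
                     port=3306, db="lianjia", charset="utf8")
cursor = db.cursor()
cursor.execute("SELECT COUNT(*) FROM xinfang")
print("rows scraped:", cursor.fetchone()[0])
db.close()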
