Python Crawler Project: Scraping New Homes in Lianjia's Popular Cities
By A-Xin · Published 2018-11-09
In this hands-on project we use a crawler to scrape new-home listings from Lianjia. (Disclaimer: the content is for learning and exchange only; do not use it for commercial purposes.)
Environment
Win8, Python 3.7, PyCharm
Walkthrough
1. Analyze the target site
Through analysis, identify the relevant URLs, determine the request method, and check whether any JavaScript encryption is involved.
2. Create the Scrapy project
1. In a cmd window, run the following command to create the lianjia project:
scrapy startproject lianjia
2. In cmd, change into the lianjia directory and create the Spider file:
cd lianjia
scrapy genspider -t crawl xinfang lianjia.com
This generates a CrawlSpider subclass, which is well suited for crawling pages in bulk.
3. Create a main.py file, which will be used to run the Scrapy project.
With that, the project skeleton is complete; now we can start writing the code.
3. Define the fields
Define the fields to be scraped in items.py:
import scrapy
from scrapy.item import Item, Field

class LianjiaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    city = Field()            # city name
    name = Field()            # property (estate) name
    type = Field()            # property type
    status = Field()          # sale status
    region = Field()          # district
    street = Field()          # street
    address = Field()         # full address
    area = Field()            # floor area
    average_price = Field()   # average price
    total_price = Field()     # total price
    tags = Field()            # tags
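All of these fields are stored as raw strings. If you later want numeric prices, a small helper could normalize text such as "均價50000元/平米" before it reaches the database; this is a hypothetical sketch, not part of the project (`parse_average_price` is an invented name):

```python
import re

def parse_average_price(text):
    """Extract the numeric part of an average-price string, e.g. '均價50000元/平米'.

    Returns None when no digits are present (some listings show '價格待定')."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

print(parse_average_price("均價50000元/平米"))  # 50000
print(parse_average_price("價格待定"))          # None
```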
4. The spider
Write the main spider code in xinfang.py:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from lianjia.items import LianjiaItem

class XinfangSpider(CrawlSpider):
    name = 'xinfang'
    allowed_domains = ['lianjia.com']
    start_urls = ['https://bj.fang.lianjia.com/']
    # Crawling rules. LinkExtractor extracts links (allow restricts the URL
    # format, restrict_xpaths restricts where in the page the links may sit);
    # follow=True means extracted links are followed in turn, and callback
    # names the method that parses the response.
    rules = (
        Rule(LinkExtractor(allow=r'\.fang.*com/$',
                           restrict_xpaths='//div[@class="footer"]//div[@class="link-list"]/div[2]/dd'),
             follow=True),
        Rule(LinkExtractor(allow=r'.*loupan/$',
                           restrict_xpaths='//div[@class="xinfang-all"]/div/a'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        '''Request every page of the listing.'''
        counts = response.xpath('//div[@class="page-box"]/@data-total-count').extract_first()
        pages = int(counts) // 10 + 2
        # The site shows at most 100 pages, so cap the range
        if pages > 100:
            pages = 101
        for page in range(1, pages):
            url = response.url + "pg" + str(page)
            yield scrapy.Request(url, callback=self.parse_detail, dont_filter=False)

    def parse_detail(self, response):
        '''Parse the page content.'''
        # The city comes from the page header (the leading character is
        # stripped); the per-listing name comes from each listing's link.
        city = response.xpath('//div[@class="resblock-have-find"]/span[3]/text()').extract_first()[1:]
        infos = response.xpath('//ul[@class="resblock-list-wrapper"]/li')
        for info in infos:
            item = LianjiaItem()
            item["city"] = city
            item["name"] = info.xpath('div/div[1]/a/text()').extract_first()
            item["type"] = info.xpath('div/div[1]/span[1]/text()').extract_first()
            item["status"] = info.xpath('div/div[1]/span[2]/text()').extract_first()
            item["region"] = info.xpath('div/div[2]/span[1]/text()').extract_first()
            item["street"] = info.xpath('div/div[2]/span[2]/text()').extract_first()
            item["address"] = info.xpath('div/div[2]/a/text()').extract_first().replace(",", "")
            item["area"] = info.xpath('div/div[@class="resblock-area"]/span/text()').extract_first()
            item["average_price"] = "".join(info.xpath('div//div[@class="main-price"]//text()').extract()).replace(" ", "")
            item["total_price"] = info.xpath('div//div[@class="second"]/text()').extract_first()
            item["tags"] = ";".join(info.xpath('div//div[@class="resblock-tag"]//text()').extract()).replace(" ", "").replace("\n", "")
            yield item
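The page-count arithmetic in parse_item can be checked in isolation. A minimal sketch, assuming 10 listings per page and the site's 100-page cap (`page_range_upper` is an invented name for illustration):

```python
def page_range_upper(total_count, per_page=10, max_pages=100):
    """Exclusive upper bound for range(1, ...), mirroring parse_item:
    total // per_page + 2 covers a partial last page, capped at max_pages + 1."""
    pages = total_count // per_page + 2
    return min(pages, max_pages + 1)

# 95 listings -> pages 1..10; 2500 listings would need 250 pages but the site caps at 100
print(page_range_upper(95))    # 11
print(page_range_upper(2500))  # 101
```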
5. Save to a MySQL database
Edit pipelines.py as follows:
import pymysql

class LianjiaPipeline(object):
    def __init__(self):
        # Create the database connection
        self.db = pymysql.connect(
            host="localhost",
            user="root",
            password="1234",
            port=3306,
            db="lianjia",
            charset="utf8"
        )
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # Store the item in the database
        sql = ("INSERT INTO xinfang(city, name, type, status, region, street, "
               "address, area, average_price, total_price, tags) "
               "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)")
        data = (item["city"], item["name"], item["type"], item["status"],
                item["region"], item["street"], item["address"], item["area"],
                item["average_price"], item["total_price"], item["tags"])
        try:
            self.cursor.execute(sql, data)
            self.db.commit()
        except pymysql.MySQLError:
            self.db.rollback()
        finally:
            return item
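The INSERT statement must keep its column list and its %s placeholders in sync by hand. As a sketch of one way to avoid that, the statement could be generated from a single column list (the table and column names follow the pipeline above; `build_insert` is an invented helper):

```python
COLUMNS = ["city", "name", "type", "status", "region", "street",
           "address", "area", "average_price", "total_price", "tags"]

def build_insert(table, columns):
    """Build a parameterized INSERT with exactly one %s placeholder per column."""
    placeholders = ", ".join(["%s"] * len(columns))
    return "INSERT INTO {}({}) VALUES ({})".format(
        table, ", ".join(columns), placeholders)

sql = build_insert("xinfang", COLUMNS)
print(sql.count("%s"))  # 11
```

The tuple passed to cursor.execute can then be built from the same list, so adding a field touches only one place.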
6. Countering anti-crawling measures
Since this is a bulk crawl, it is worth taking some measures against the site's anti-crawling defenses. Here I use free proxy IPs. Edit middlewares.py as follows:
import logging

import requests
from scrapy import signals

class ProxyMiddleware(object):
    def __init__(self, proxy):
        self.logger = logging.getLogger(__name__)
        self.proxy = proxy

    @classmethod
    def from_crawler(cls, crawler):
        '''Read the random-proxy API endpoint from the settings.'''
        settings = crawler.settings
        return cls(proxy=settings.get('RANDOM_PROXY'))

    def get_random_proxy(self):
        '''Fetch a random proxy from the API.'''
        try:
            response = requests.get(self.proxy)
            if response.status_code == 200:
                return response.text
        except requests.RequestException:
            return False

    def process_request(self, request, spider):
        '''Send the request through the randomly chosen proxy.'''
        proxy = self.get_random_proxy()
        if proxy:
            url = 'http://' + str(proxy)
            self.logger.debug('Using proxy ' + proxy)
            request.meta['proxy'] = url
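Note that process_request always prepends "http://", so if the proxy API ever returned a full URL the scheme would be doubled. A defensive variant of that one line could normalize the value first; this is a hypothetical sketch, not part of the original middleware:

```python
def to_proxy_url(proxy):
    """Prefix the scheme only when it is missing."""
    proxy = proxy.strip()
    if proxy.startswith("http://") or proxy.startswith("https://"):
        return proxy
    return "http://" + proxy

print(to_proxy_url("127.0.0.1:8080"))         # http://127.0.0.1:8080
print(to_proxy_url("http://127.0.0.1:8080"))  # http://127.0.0.1:8080
```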
7. Configure the settings file
import random

RANDOM_PROXY = "http://localhost:6686/random"

BOT_NAME = 'lianjia'
SPIDER_MODULES = ['lianjia.spiders']
NEWSPIDER_MODULE = 'lianjia.spiders'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = random.random() * 2
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
DOWNLOADER_MIDDLEWARES = {
    'lianjia.middlewares.ProxyMiddleware': 543,
}
ITEM_PIPELINES = {
    'lianjia.pipelines.LianjiaPipeline': 300,
}
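One caveat: settings.py is plain Python, so random.random() * 2 is evaluated once at startup and every request then uses the same fixed delay. Scrapy already randomizes the delay per request when RANDOMIZE_DOWNLOAD_DELAY is enabled (it is by default), multiplying DOWNLOAD_DELAY by a uniform factor between 0.5 and 1.5. A sketch of that built-in behavior:

```python
import random

DOWNLOAD_DELAY = 1.0  # fixed base delay, as set once in settings.py

def randomized_delay(base):
    """Mimic Scrapy's RANDOMIZE_DOWNLOAD_DELAY: uniform in [0.5*base, 1.5*base]."""
    return random.uniform(0.5 * base, 1.5 * base)

d = randomized_delay(DOWNLOAD_DELAY)
print(0.5 <= d <= 1.5)  # True
```

So a constant DOWNLOAD_DELAY plus the default randomization gives varied inter-request timing without the one-shot random.random() trick.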
8. Run the project
Put the following in main.py and run it:
from scrapy import cmdline
cmdline.execute('scrapy crawl xinfang'.split())
The Scrapy project then starts running; in the end it scraped more than 14,000 records.