Scraping Ctrip Scenic-Spot Data with Python's Scrapy Framework

阿新 • Published: 2019-01-10

---------------------------------------------------------------------------------------------
[Copyright notice: this article is the author's original work; please credit the source when reposting.]
Source: https://blog.csdn.net/sdksdk0/article/details/82381198

Author: 朱培 ID: sdksdk0
--------------------------------------------------------------------------------------------

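This article uses the Scrapy framework with Python 3.6 to crawl Ctrip for scenic spots in Henan Province, collecting each spot's name, address, province/city/county, description, and image URL. Searching Ctrip for Henan gives the listing page http://piao.ctrip.com/dest/u-_ba_d3_c4_cf/s-tickets/P1/, which serves as the starting point of the crawl. The crawled data is stored in a MySQL database.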
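The `u-_ba_d3_c4_cf` segment of the start URL appears to be the GBK encoding of 河南, with each byte written as `_xx` in lowercase hex. The sketch below rebuilds the URL from the province name under that reading. The helper name `build_start_url` is hypothetical, and whether Ctrip forms the slug this way for other provinces is an assumption to verify against the real search results.

```python
def build_start_url(province: str, page: int = 1) -> str:
    """Rebuild the listing URL from a province name given without the 省 suffix."""
    # Each GBK byte becomes "_xx" in lowercase hex, e.g. 河 -> _ba_d3.
    slug = "".join(f"_{byte:02x}" for byte in province.encode("gbk"))
    return f"http://piao.ctrip.com/dest/u-{slug}/s-tickets/P{page}/"


print(build_start_url("河南"))
```

The trailing `P1` looks like a page number, which is why the sketch exposes it as the `page` parameter.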

1、Create the Scrapy project

scrapy startproject ctrip

2、Create the spider; first change into the ctrip directory

scrapy genspider scenic "ctrip.com"

3、In settings.py:

BOT_NAME = 'ctrip'

SPIDER_MODULES = ['ctrip.spiders']
NEWSPIDER_MODULE = 'ctrip.spiders'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
}
DOWNLOADER_MIDDLEWARES = {
    'ctrip.middlewares.UserAgentDownloadMiddleware': 543,
}
ITEM_PIPELINES = {
    'ctrip.pipelines.DBPipeline': 300,
}

4、In middlewares.py:

import random


class UserAgentDownloadMiddleware (object):
    USER_AGENTS = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
    ]

    def process_request(self,request,spider):
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent
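To see what the middleware does to a request without starting a crawl, the following sketch calls `process_request` by hand. `FakeRequest` is a hypothetical stand-in for `scrapy.Request` that only carries a `headers` dict, and the user-agent pool is shortened to two entries for the demo.

```python
import random


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request; it only carries headers."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


class UserAgentDownloadMiddleware:
    # Shortened pool for the demo; the full list is the one in middlewares.py.
    USER_AGENTS = [
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Implicitly returns None, which tells Scrapy to continue with the
        # remaining middlewares and then download the request.


middleware = UserAgentDownloadMiddleware()
request = FakeRequest("http://piao.ctrip.com/dest/u-_ba_d3_c4_cf/s-tickets/P1/")
result = middleware.process_request(request, spider=None)
print(request.headers["User-Agent"])
```

Because a fresh user agent is drawn for every request, consecutive requests in one crawl will usually present different browser identities.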

5、items.py

import scrapy


class ScenicItem(scrapy.Item):
    province = scrapy.Field()
    city = scrapy.Field()
    county = scrapy.Field()
    name = scrapy.Field()
    scenic_url = scrapy.Field()
    image_url = scrapy.Field()
    address = scrapy.Field()
    descript = scrapy.Field()
    code = scrapy.Field()

6、scenic.py

# -*- coding: utf-8 -*-
import scrapy
import re
from ctrip.items import ScenicItem

class ScenicSpider(scrapy.Spider):
    name = 'scenic'
    allowed_domains = ['ctrip.com']
    start_urls = ['http://piao.ctrip.com/dest/u-_ba_d3_c4_cf/s-tickets/P1/']
    count = 0

    def parse(self, response):
        trs = response.xpath("//div[@id='searchResultContainer']//div[@class='searchresult_product04']")

        for tr in trs:
            ctrip_url = tr.xpath(".//div[1]/a/@href").get()
            if not ctrip_url:
                # Skip result blocks that have no detail link
                continue
            # Use the absolute detail-page URL as the record's unique key;
            # the pipeline relies on it for its duplicate check
            scenic_url = response.urljoin(ctrip_url)
            image_url = tr.xpath(".//div[1]/a/img/@src").get()
            # get(default="") avoids an AttributeError when a node is missing
            address = tr.xpath(".//div[1]/div[@class='adress']//text()").get(default="").strip()
            address = re.sub(r"地址：", "", address)
            descript = tr.xpath(".//div[1]/div[@class='exercise']//text()").get(default="").strip()
            descript = re.sub(r"特色：", "", descript)
            name = tr.xpath(".//div[1]//h2/a/text()").get(default="").strip()

            cityinfo = address
            province = "河南省"
            city = ""
            county = ""
            # County/district suffixes. The Simplified forms (县/区) are accepted as well,
            # on the assumption that the live Ctrip pages use Simplified Chinese.
            suffix = "[縣县區区]"
            if "省" in cityinfo:
                # province + city + county/district
                matchObj = re.match(r'(.*)省(.+?)市(.+?)(' + suffix + ')', cityinfo)
                if matchObj:
                    province = matchObj.group(1) + "省"
                    city = matchObj.group(2) + "市"
                    # Reuse the suffix that actually matched
                    county = matchObj.group(3) + matchObj.group(4)
                else:
                    # province + city + county-level city, or province + city only
                    matchObj2 = re.match(r'(.*)省(.+?)市(.+?)市', cityinfo)
                    matchObj1 = re.match(r'(.*)省(.+?)市', cityinfo)
                    if matchObj2:
                        city = matchObj2.group(2) + "市"
                        county = matchObj2.group(3) + "市"
                    elif matchObj1:
                        city = matchObj1.group(2) + "市"
                    else:
                        # province + county/district, no city
                        matchObj1 = re.match(r'(.*)省(.+?)(' + suffix + ')', cityinfo)
                        if matchObj1:
                            county = matchObj1.group(2) + matchObj1.group(3)

            else:
                # No province in the address: city + county/district
                matchObj = re.match(r'(.+?)市(.+?)(' + suffix + ')', cityinfo)
                if matchObj:
                    city = matchObj.group(1) + "市"
                    county = matchObj.group(2) + matchObj.group(3)
                else:
                    matchObj = re.match(r'(.+?)市', cityinfo)
                    if matchObj:
                        city = matchObj.group(1) + "市"
                    else:
                        matchObj = re.match(r'(.+?)([縣县])', cityinfo)
                        if matchObj:
                            county = matchObj.group(1) + matchObj.group(2)

            self.count += 1
            code = "A" + str(self.count)

            item = ScenicItem(name=name,province=province,city=city,county=county,address=address,descript=descript,
                              scenic_url=scenic_url,image_url=image_url,code=code)

            yield item
        next_url = response.xpath('//*[@id="searchResultContainer"]/div[11]/a[11]/@href').get()
        if next_url:
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse,meta={})
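The province/city/county split can also be written by peeling off one administrative level at a time. The following is a minimal sketch of that alternative, not the spider's own code: `split_region` and `SUFFIX` are hypothetical names, the sample addresses are made up, and accepting both Simplified and Traditional suffixes is an assumption about what the pages may contain.

```python
import re

# County (县/縣) and district (区/區) suffixes in both scripts.
SUFFIX = "[县縣区區]"


def split_region(address, default_province="河南省"):
    """Return (province, city, county) parsed from the start of an address."""
    province, city, county = default_province, "", ""
    rest = address

    # Level 1: province, if the address names one.
    m = re.match(r"(.+?省)", rest)
    if m:
        province = m.group(1)
        rest = rest[m.end():]

    # Level 2: prefecture-level city.
    m = re.match(r"(.+?市)", rest)
    if m:
        city = m.group(1)
        rest = rest[m.end():]

    # Level 3: county, district, or county-level city.
    m = re.match(rf"(.+?(?:市|{SUFFIX}))", rest)
    if m:
        county = m.group(1)

    return province, city, county


print(split_region("河南省洛阳市栾川县老君山景区"))
print(split_region("郑州市登封市嵩山少林寺"))
```

Like the regular expressions in the spider, this is a heuristic: an address whose remaining text merely contains one of these characters (for example a place name ending in 区) will be misread as a county or district.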

7、pipelines.py, which stores the data in the MySQL database

import pymysql


# Stores scraped items in the database
class DBPipeline(object):
    def __init__(self):
        # Connect to the database
        self.connect = pymysql.connect(
            host='localhost',
            port=3306,
            db='edu_demo',
            user='root',
            passwd='123456',
            charset='utf8',
            use_unicode=True)

        # All inserts, deletes, queries and updates go through this cursor
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        try:
            # Duplicate check
            self.cursor.execute(
                """select * from a_scenic where ctrip_url = %s""",
                (item['scenic_url'],))
            # Is there already a row for this URL?
            repetition = self.cursor.fetchone()

            # Duplicate: skip it
            if repetition:
                pass

            else:
                # Insert the new row
                self.cursor.execute(
                    """insert into a_scenic(code,province, city, county, name ,description, ctrip_url,image_url,address,type)
                    value (%s,%s, %s, %s, %s, %s, %s, %s, %s, %s)""",
                    (item['code'],
                     item['province'],
                     item['city'],
                     item['county'],
                     item['name'],
                     item['descript'],
                     item['scenic_url'],
                     item['image_url'],
                     item['address'], '1'))

            # Commit the SQL statements
            self.connect.commit()

        except Exception as error:
            # Print the error when something goes wrong
            print(error)
        return item
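The check-then-insert logic of the pipeline can be tried out without a MySQL server. The sketch below swaps in the standard-library `sqlite3` module as a stand-in, so the placeholders are `?` instead of `%s`. The column names come from the INSERT statement above; the column types, the `id` column, the helper name `save_scenic`, and the sample item are assumptions, since the article does not show the table definition.

```python
import sqlite3

# Assumed layout of a_scenic; only the column names are taken from the article.
SCHEMA = """
CREATE TABLE a_scenic (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    code TEXT, province TEXT, city TEXT, county TEXT, name TEXT,
    description TEXT, ctrip_url TEXT, image_url TEXT, address TEXT, type TEXT
)
"""


def save_scenic(conn, item):
    """Insert the item unless its URL is already stored; return True if inserted."""
    row = conn.execute(
        "SELECT 1 FROM a_scenic WHERE ctrip_url = ?", (item["scenic_url"],)
    ).fetchone()
    if row:
        return False  # duplicate, nothing written
    conn.execute(
        "INSERT INTO a_scenic (code, province, city, county, name, description,"
        " ctrip_url, image_url, address, type) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (item["code"], item["province"], item["city"], item["county"], item["name"],
         item["descript"], item["scenic_url"], item["image_url"], item["address"], "1"),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Made-up sample item for the demo.
item = {
    "code": "A1", "province": "河南省", "city": "洛阳市", "county": "栾川县",
    "name": "Example scenic spot", "descript": "Example description",
    "scenic_url": "http://example.com/t/t1.html",
    "image_url": "http://example.com/1.jpg", "address": "河南省洛阳市栾川县",
}
first = save_scenic(conn, item)
second = save_scenic(conn, item)
print(first, second)
```

The duplicate check only works when `scenic_url` differs from item to item; if every item carried the same value, such as an empty string, everything after the first item would be skipped as a duplicate.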

8、start.py

from scrapy import cmdline

cmdline.execute("scrapy crawl scenic".split())

9、Run start.py to start the crawl
