
Crawler: Scraping Ctrip Hotel Listings, Slaying the Demons (Part 2)

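This post covers scraping the hotel list page. There are plenty of pitfalls here too: anti-scraping measures, and price data stored in a different place from the rest of the hotel data.
Analyze the responses carefully first and the rest takes half the effort.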

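1. Analysis shows that the hotel list is also loaded via AJAX and returned as JSON. The prices are in the same JSON but in a separate location, and they are matched to each hotel by hotel ID.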
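The ID-based matching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the site's documented format: the keys `hotelPositionJSON`, `HotelMaiDianData`, `value`, `htllist`, `name` and `amount` come from the script in this post, while the key names `id` and `hotelid` and the sample values are hypothetical.

```python
import json


def prices_by_hotel_id(htllist_raw):
    """Map each hotel ID to its price string.

    Assumes htllist_raw is a JSON-encoded list of objects and that
    each object has a "hotelid" key (hypothetical key name).
    """
    entries = json.loads(htllist_raw)
    return {str(entry["hotelid"]): entry["amount"] for entry in entries}


# Hypothetical, trimmed-down payload in the shape described above.
sample = {
    "hotelPositionJSON": [
        {"id": "1001", "name": "Hotel A"},
        {"id": "1002", "name": "Hotel B"},
    ],
    "HotelMaiDianData": {
        "value": {
            "htllist": '[{"hotelid":"1002","amount":"450"},'
                       '{"hotelid":"1001","amount":"320"}]'
        }
    },
}

prices = prices_by_hotel_id(sample["HotelMaiDianData"]["value"]["htllist"])

# Look each hotel's price up by ID rather than by list position,
# so a different ordering in the price list cannot mix up the rows.
rows = [
    (hotel["name"], prices.get(str(hotel["id"]), ""))
    for hotel in sample["hotelPositionJSON"]
]
print(rows)
```

Looking prices up by ID instead of by index also tolerates a price list that is shorter than the hotel list: a missing price becomes an empty string instead of raising an IndexError.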


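2. Next, extract the fields you need and save them. Here they are written to both a CSV file and a MySQL database. Pay particular attention to the two commented points in the code. The MySQL connection is made with pymysql and the table was created in Navicat; remember that the table name and column names must match the inserted data one to one.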
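Because the table was created in Navicat, its definition is not shown in this post. The sketch below is one possible layout, assuming column types that the post does not state; only the table name `jdlist` and the seven column names are taken from the script. It uses the standard-library `sqlite3` module as a stand-in so that it runs without a MySQL server, and the helper `build_insert` and the sample row are hypothetical.

```python
import sqlite3

# Column names as used in the script's INSERT statement.
COLUMNS = ["jd_num", "jd_name", "jd_star", "jd_good", "jd_fen", "jd_jiage", "jd_link"]

# Assumed column types; the post only gives the column names.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS jdlist (
    jd_num   INTEGER PRIMARY KEY,
    jd_name  VARCHAR(255),
    jd_star  VARCHAR(8),
    jd_good  VARCHAR(64),
    jd_fen   VARCHAR(8),
    jd_jiage VARCHAR(16),
    jd_link  VARCHAR(255)
)
"""


def build_insert(table, columns, placeholder="?"):
    """Build a parameterized INSERT from one column list, so the column
    names and the number of placeholders cannot drift apart."""
    marks = ", ".join([placeholder] * len(columns))
    return "INSERT INTO {} ({}) VALUES ({})".format(table, ", ".join(columns), marks)


demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(CREATE_SQL)

# Hypothetical sample row, in the same order as COLUMNS.
row = (1, "Sample Hotel", "05", "Luxury", "4.7", "888", "http://hotels.ctrip.com/hotel/0.html")
demo_conn.execute(build_insert("jdlist", COLUMNS), row)
demo_conn.commit()

stored = demo_conn.execute("SELECT jd_name, jd_jiage FROM jdlist").fetchall()
print(stored)
```

With pymysql the placeholder is `%s` rather than `?`, so the same helper would be called as `build_insert("jdlist", COLUMNS, "%s")` and the row passed as the second argument to `cursor.execute`.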

import requests
import json
import re
import csv
import pymysql

# Connect to MySQL; rows are inserted and committed inside the loop below
conn = pymysql.connect(host='localhost', port=3306, user='root', password='***', database='jiudian', charset='utf8mb4')
curor = conn.cursor()
lists=[]
dicts={}
ss=0
for i in range(1,20):
    url="http://hotels.ctrip.com/Domestic/Tool/AjaxHotelList.aspx"
    headers={

        "Connection": "keep-alive",
        "origin":"http://hotels.ctrip.com",
        "Host": "hotels.ctrip.com",
        "referer": "http://hotels.ctrip.com/hotel/beijing1",
        "user-agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",

    }
    data={
    "StartTime":"2018-10-09",
    "DepTime": "2018-10-10",
    "RoomGuestCount": "1,1,0",
    "cityId":1,
    "cityPY":" beijing",
    "cityCode":"010",
    "cityLat": 39.9105329229,
    "cityLng":116.413784021,
    "page":i,
    }

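    html=requests.post(url,headers=headers,data=data,timeout=10)
    # json.loads failed with "ValueError: Invalid \escape: line 1 column 35442 (char 35441)".
    # The response contains sequences such as \xa0 or \http, where the backslash
    # does not start a valid JSON escape. Workaround: double every backslash
    # that is not followed by /, u or " so it is read as a literal backslash.
    regex = re.compile(r'\\(?![/u"])')
    fixed = regex.sub(r"\\\\", html.text)

    aa=json.loads(fixed)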

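    # Iterate over however many hotels the page returned instead of assuming 25
    for n in range(len(aa["hotelPositionJSON"])):
        dianming = aa["hotelPositionJSON"][n]["name"]

        # "htllist" is a list serialized as a string. Assuming it is valid JSON,
        # json.loads turns it into a list without the risk of eval, which would
        # execute whatever code the server sends back.
        jiage=json.loads(aa["HotelMaiDianData"]["value"]["htllist"])[n]["amount"]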
        xinji=aa["hotelPositionJSON"][n]["star"][-2:]
        dangci=aa["hotelPositionJSON"][n]["stardesc"]
        pingfen=aa["hotelPositionJSON"][n]["score"]
        lianjie="http://hotels.ctrip.com"+aa["hotelPositionJSON"][n]["url"]
        ss += 1
        lists.append([ss, dianming,xinji,dangci,pingfen,jiage + " yuan",lianjie])

        dicts[ss]=["Hotel: "+dianming,"Stars: "+xinji,"Tier: "+dangci,"Rating: "+pingfen,"Price: "+jiage+" yuan","Link: "+lianjie]
        print("Fetching record "+str(ss))
        # Pass the values as parameters so the driver escapes them; formatting them
        # into the SQL string breaks on names containing quotes and allows SQL injection.
        hot = "insert into jdlist(jd_num,jd_name,jd_star,jd_good,jd_fen,jd_jiage,jd_link) values(%s,%s,%s,%s,%s,%s,%s)"
        curor.execute(hot, (ss,dianming,xinji,dangci,pingfen,jiage,lianjie))
        conn.commit()
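with open("bjjiudian.csv", "w", encoding="utf-8",newline="") as f:
    k = csv.writer(f, dialect="excel")
    k.writerow(["No.", "Hotel", "Stars", "Tier", "Rating", "Price","Link"])
    # Write all collected rows at once (also avoids shadowing the built-in name "list")
    k.writerows(lists)
print(lists)
print(dicts)
# Release the database resources once everything has been written
curor.close()
conn.close()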
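The backslash workaround in the script is easier to follow on a small input. The sketch below applies the same regular expression to a hypothetical response fragment (the sample string and the helper name `loads_lenient` are made up for illustration) and shows that the raw text is rejected by `json.loads` while the repaired text parses.

```python
import json
import re

# Same pattern as in the script: a backslash NOT followed by /, u or ".
STRAY_BACKSLASH = re.compile(r'\\(?![/u"])')


def loads_lenient(text):
    """Double stray backslashes, then parse the result as JSON."""
    fixed = STRAY_BACKSLASH.sub(r"\\\\", text)
    return json.loads(fixed)


# Hypothetical fragment: \x is not a valid JSON escape, \/ and \" are.
raw = r'{"name": "Hotel\xa0A", "url": "\/hotel\/1.html", "note": "say \"hi\""}'

try:
    json.loads(raw)
    raw_parses = True
except ValueError:
    # json.JSONDecodeError is a subclass of ValueError
    raw_parses = False

result = loads_lenient(raw)
print(raw_parses)
print(result)
```

Two side effects are worth knowing. The repaired value keeps `\xa0` as literal text (a backslash followed by `xa0`) rather than decoding it to a non-breaking space. The pattern also doubles backslashes in otherwise valid escapes such as `\n` or `\t`, so those arrive as literal backslash sequences instead of control characters.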
