25. Scraping product data from Qunar (part 1)


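1. First, analyze the page
Page address: http://touch.qunar.com/
The target is the free-travel (自由行) channel under the vacation (度假) section.
For a given city, the browser's network panel shows the XHR request that loads the listing data.

Request URL: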

https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&limit=0,24&includeAD=true&qsact=search

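The URL is assembled from query parameters. Every sequence that starts with % is Chinese text in percent-encoded (URL-escaped) form, written as UTF-8 bytes.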

Decoded URL:

https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=广州&query=厦门自由行&dappDealTrace=false&mobFunction=扩展自由行&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=厦门自由行&limit=0,24&includeAD=true&qsact=search
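The encoding can be checked with the standard library. This is a minimal sketch that decodes the `dep` and `query` values copied from the request URL above and then encodes them again; the variable names are only for illustration.

```python
from urllib.parse import quote, unquote

# Values copied verbatim from the request URL.
encoded_dep = "%E5%B9%BF%E5%B7%9E"
encoded_query = "%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C"

# unquote() turns the percent-escaped UTF-8 bytes back into text.
decoded_dep = unquote(encoded_dep)
decoded_query = unquote(encoded_query)
print(decoded_dep, decoded_query)

# quote() goes the other way and reproduces the escaped form.
reencoded_dep = quote(decoded_dep)
print(reencoded_dep)
```

Because the round trip is lossless, the scraper can keep city names as plain text and only escape them when building a URL.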

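Breaking the URL down:

dep: the departure city (I am in Guangzhou, so the site located me there)

query and originalquery: the destination

(So changing only these parameters in the request is enough to walk through all the product listings; each departure/destination pair returns different data.)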
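One way to vary these parameters without hand-editing the long URL is to build the query string from a dictionary. The sketch below is an assumption about how that could be organized, not the approach used later in this post: `build_list_url`, `offset`, and `page_size` are hypothetical names, and the fixed parameter values are copied from the captured request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

LIST_ENDPOINT = "https://touch.dujia.qunar.com/list"


def build_list_url(dep, query, offset=0, page_size=24):
    """Build a listing URL for one departure/destination pair (hypothetical helper)."""
    params = {
        "modules": "list,bookingInfo,activityDetail",
        "dep": dep,                      # departure city, plain text
        "query": query,                  # destination search term
        "dappDealTrace": "false",
        "mobFunction": "扩展自由行",
        "cfrom": "zyx",
        "it": "dujia_hy_destination",
        "date": "",
        "needNoResult": "true",
        "originalquery": query,          # same value as query in the captured URL
        "limit": f"{offset},{page_size}",
        "includeAD": "true",
        "qsact": "search",
    }
    # urlencode() percent-encodes every value, including the Chinese text.
    return LIST_ENDPOINT + "?" + urlencode(params)


url = build_list_url("广州", "厦门自由行")
print(url)

# Parsing the URL back shows the parameters survive the round trip.
parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(parsed["dep"], parsed["limit"])
```

Note that `urlencode` also escapes the comma in `limit` as `%2C`, whereas the captured URL sends it literally; the two forms decode to the same value.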

 

Opening the URL directly in a browser shows the actual response data.

 

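2. Get the departure (dep) parameter values
Request address: https://touch.dujia.qunar.com/p/public/dep
(The code below requests https://touch.dujia.qunar.com/depCities.qunar and parses its JSON response.)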

 

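# Fetch the departure-city parameters
import requests

url = 'https://touch.dujia.qunar.com/depCities.qunar'
html = requests.get(url)
# print(html.text)
dep_data = html.json()  # renamed from `dict` to avoid shadowing the built-in
# 'data' maps each group key to its departure cities
for i in dep_data['data']:
    for j in dep_data['data'][i]:
        print(j)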

Running it prints the departure cities one per line.
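The nested loop above can also be wrapped in a small function that returns a flat list, which is easier to reuse in later steps. This is a sketch only: `flatten_departures` is a hypothetical helper, and the sample payload is made up to mirror the shape the loop implies (a `data` object whose values are groups of city names), not a real response.

```python
def flatten_departures(payload):
    """Return every departure city found under payload['data'] as one list."""
    cities = []
    for group in payload["data"].values():
        cities.extend(group)
    return cities


# Hypothetical sample with the same shape the loop iterates over.
sample = {"data": {"G": ["广州", "桂林"], "S": ["上海"]}}

departures = flatten_departures(sample)
print(departures)
```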

 

3. Get the destination parameters for each departure city

import requests

url = 'https://touch.dujia.qunar.com/depCities.qunar'
html = requests.get(url)
# print(html.text)
dep_data = html.json()  # renamed from `dict` to avoid shadowing the built-in
# Iterate over the departure cities
for i in dep_data['data']:
    for j in dep_data['data'][i]:
        print(j)
        link_url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(j)
        html2 = requests.get(link_url)
        dict2 = html2.json()
        c_list = []
        # Collect the destination parameters
        for k in dict2['data']:
            for l in k['subModules']:
                for m in l['items']:
                    city = m['query']
                    # De-duplicate: keep each destination once
                    if city not in c_list:
                        c_list.append(city)
        print(c_list)

Each departure city maps to many destinations.
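The de-duplication step can be isolated and checked without touching the network. The sketch below assumes the response shape that the loops above walk through (`data` → `subModules` → `items` → `query`); `unique_destinations` and the sample payload are hypothetical and exist only to illustrate the idea.

```python
def unique_destinations(payload):
    """Collect each item's 'query' value once, keeping first-seen order."""
    seen = set()
    ordered = []
    for module in payload["data"]:
        for sub in module["subModules"]:
            for item in sub["items"]:
                query = item["query"]
                if query not in seen:
                    seen.add(query)
                    ordered.append(query)
    return ordered


# Hypothetical payload; a real response carries more fields per item.
sample = {
    "data": [
        {"subModules": [{"items": [{"query": "厦门"}, {"query": "三亚"}]}]},
        {"subModules": [{"items": [{"query": "厦门"}, {"query": "丽江"}]}]},
    ]
}

destinations = unique_destinations(sample)
print(destinations)
```

Using a set for the membership test keeps the check fast when a departure city has many destinations, while the separate list preserves the order in which they appeared.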

 

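4. Get the product list

With dep and query in hand, the next step is to request the JSON listing data, look at how the URL changes between requests, and note the routeCount field, which drives paging.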
https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&limit=0,24&includeAD=true&qsact=search

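The limit parameter changes as well: its offset advances in multiples of 24 with each request. Reading routeCount tells us how many requests to send, each with a different limit value in the URL.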
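The paging arithmetic is easy to get wrong by one page, so it helps to look at it in isolation. This is a minimal sketch assuming `routeCount` is the total number of products and each request returns 24 of them; `page_limits` is a hypothetical helper name.

```python
def page_limits(route_count, page_size=24):
    """Return the 'offset,size' strings needed to cover route_count items."""
    return [f"{offset},{page_size}" for offset in range(0, route_count, page_size)]


# 50 products need three requests: items 0-23, 24-47, and 48-49.
limits = page_limits(50)
print(limits)
```

Each string can be substituted directly into the `limit=` parameter of the listing URL.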
import requests
import urllib.request  # imported explicitly so urllib.request.quote is available
import random
import time

url = 'https://touch.dujia.qunar.com/depCities.qunar'
html = requests.get(url)
# print(html.text)
dep_data = html.json()  # renamed from `dict` to avoid shadowing the built-in
# Iterate over the departure cities
for i in dep_data['data']:
    for j in dep_data['data'][i]:
        print(j)
        link_url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(j)

        # Sleep for a random 1-2 seconds between requests
        time.sleep(random.randint(1,2))

        html2 = requests.get(link_url)
        dict2 = html2.json()
        c_list = []
        # Collect the destination parameters, skipping duplicates
        for k in dict2['data']:
            for l in k['subModules']:
                for m in l['items']:
                    city = m['query']
                    if city not in c_list:
                        c_list.append(city)
        # print(c_list)

        # Sleep for a random 1-2 seconds between requests
        time.sleep(random.randint(1,2))

        # Request the listing data for each destination
        for c in c_list:
            # Build the request URL from the loop variable c
            # (the original passed `city`, which always holds the last destination collected)
            url3 = 'https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep={}&query={}%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery={}&limit=0,24&qsact=scroll'.format(urllib.request.quote(j),urllib.request.quote(c),urllib.request.quote(c))
            A = url3.replace('https://touch.dujia.qunar.com','')
            # print(A)
            headers = {
                'cookie': 'QN48=tc_e1b5f5bb4d76a018_16730073949_ad75; csrfToken=d27163582839d6b8cbcb53110ed67077; QN300=organic; QN1=ezu0pVvzuB9qeVd2w90fAg==; _RF1=119.129.117.7; _RSG=AZ4soQG2oI5YMrcq1P6et8; _RDG=283bf2bcd3461d22ef1d94f9276d7c9b85; _RGUID=54b20906-b2d8-48ca-8de8-1990749b55a2; QN205=organic; QN234=home_free_t; _pk_ref.1.8600=%5B%22%22%2C%22%22%2C1542699072%2C%22http%3A%2F%2Ftouch.qunar.com%2F%22%5D; _pk_ses.1.8600=*; QN57=15427010307400.44337198739421924; QN58=1542701030742%7C1542701078367%7C4; QN233=dujia_hy_destination; _pk_id.1.8600=5f2ca9d25160d431.1542699072.1.1542705039.1542699072.; QN243=165',
                'referer': 'https://touch.dujia.qunar.com/p/list?dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&et=&it=dujia_hy_destination',
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'
            }

            html3 = requests.get(url=url3,headers=headers)
            print(url3)
            print(html3.json())
            # # Read the routeCount field from the first response
            # num = int(html3.json()['data']['limit']['routeCount'])
            #
            # # Each page returns only 24 items
            # for n in range(0,num,24):
            #     # The offset n must be passed through format(); the original left ",n" inside the string
            #     url4 = 'https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep={}&query={}%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery={}&limit={},24&qsact=scroll'.format(urllib.request.quote(j),urllib.request.quote(c),urllib.request.quote(c),n)
            #
            #     # Sleep for a random 1-2 seconds between requests
            #     time.sleep(random.randint(1, 2))
            #
            #     html4 = requests.get(url=url4,headers=headers)
            #     result = html4.json()
            #     print(result)