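25. Scraping product data from Qunar (去哪兒網), part 1
1. Analyze the page first
Page address: http://touch.qunar.com/
The target is the independent-travel (自由行) channel under the Vacations section.
In the browser's network panel, the XHR request that loads the listings for one city looks like this:
request.url: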
https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&limit=0,24&includeAD=true&qsact=search
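The URL is assembled from query parameters. Every run of characters starting with % is percent-encoded Chinese text, that is, the UTF-8 bytes of the original string after URL escaping.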
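A minimal sketch of that encoding, using Python's standard urllib.parse module. The two encoded values are copied from the request URL above; nothing else is assumed. unquote turns the escapes back into text, and quote produces them again.

```python
from urllib.parse import quote, unquote

# Values copied from the dep= and query= parameters of the request URL
encoded_dep = '%E5%B9%BF%E5%B7%9E'
encoded_query = '%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C'

# Decode the percent-escapes (UTF-8 by default)
dep = unquote(encoded_dep)
query = unquote(encoded_query)
print(dep, query)

# Encoding the decoded text again reproduces the original escapes
re_encoded_dep = quote(dep)
re_encoded_query = quote(query)
print(re_encoded_dep, re_encoded_query)
```

The same quote call is what the scraper needs later, when city names are substituted into the URL.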
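The decoded URL:
https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=广州&query=厦门自由行&dappDealTrace=false&mobFunction=扩展自由行&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=厦门自由行&limit=0,24&includeAD=true&qsact=search
Breaking the URL down:
dep parameter: the departure city (I am in Guangzhou, so the site located me there)
query and originalquery parameters: the destination
(So changing only the departure and destination parameters of the request is enough to walk through all product listings; each departure/destination combination returns different data.)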
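The parameter breakdown above can be sketched as a small URL builder. build_list_url and BASE_URL are hypothetical names introduced for illustration; the parameter names and fixed values are copied from the captured URL, and appending 自由行 to the destination is an assumption based on that one sample. Note that urlencode escapes the comma in limit as %2C, which decodes to the same value.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://touch.dujia.qunar.com/list'


def build_list_url(dep, destination, offset=0, page_size=24):
    """Hypothetical helper: build the list URL for one departure/destination pair."""
    # Assumption: the query is the destination name followed by 自由行,
    # as in the captured sample (厦门自由行)
    query = destination + '自由行'
    params = {
        'modules': 'list,bookingInfo,activityDetail',
        'dep': dep,
        'query': query,
        'dappDealTrace': 'false',
        'mobFunction': '扩展自由行',
        'cfrom': 'zyx',
        'it': 'dujia_hy_destination',
        'date': '',
        'needNoResult': 'true',
        'originalquery': query,
        'limit': '{},{}'.format(offset, page_size),
        'includeAD': 'true',
        'qsact': 'search',
    }
    # urlencode percent-encodes every value, including the Chinese text
    return BASE_URL + '?' + urlencode(params)


url = build_list_url('广州', '厦门')
print(url)

# Parse the URL back to confirm the parameters survive the round trip
parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(parsed['dep'], parsed['query'], parsed['limit'])
```

Building the query from a dict keeps the fixed parameters in one place, so only dep, the destination and the offset vary between requests.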
Opening the URL in a browser shows the actual response:
2. Get the departure (dep) parameter values
Request address: https://touch.dujia.qunar.com/p/public/dep
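# Fetch the departure-city parameters
import requests

url = 'https://touch.dujia.qunar.com/depCities.qunar'
html = requests.get(url)
# print(html.text)
dep_dict = html.json()  # named dep_dict so the built-in dict is not shadowed

# 'data' is keyed by group; each group holds a collection of city names
for i in dep_dict['data']:
    for j in dep_dict['data'][i]:
        print(j)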
As shown in the figure:
3. Get the destination parameters for each departure city
import requests

url = 'https://touch.dujia.qunar.com/depCities.qunar'
html = requests.get(url)
# print(html.text)
dep_dict = html.json()

# Get the departure-city parameters
for i in dep_dict['data']:
    for j in dep_dict['data'][i]:
        print(j)
        link_url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(j)
        html2 = requests.get(link_url)
        dest_dict = html2.json()
        c_list = []
        # Get the destination parameters
        for k in dest_dict['data']:
            for sub in k['subModules']:
                for m in sub['items']:
                    city = m['query']
                    # Deduplicate
                    if city not in c_list:
                        c_list.append(city)
        print(c_list)
As the output shows, a single departure city maps to many destinations.
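The deduplication step can be tried without touching the network. This is a sketch only: extract_destinations is a hypothetical helper, and the sample payload is made up, shaped after the keys the code above reads (data, subModules, items, query). A set tracks what has been seen, so the membership test stays fast while the list keeps first-seen order.

```python
def extract_destinations(payload):
    """Collect unique 'query' values from data -> subModules -> items, in first-seen order."""
    seen = set()
    destinations = []
    for module in payload.get('data', []):
        for sub in module.get('subModules', []):
            for item in sub.get('items', []):
                city = item.get('query')
                # Skip missing values and anything already collected
                if city and city not in seen:
                    seen.add(city)
                    destinations.append(city)
    return destinations


# Made-up sample payload, only to exercise the function
sample = {
    'data': [
        {'subModules': [
            {'items': [{'query': '厦门'}, {'query': '三亚'}]},
            {'items': [{'query': '厦门'}]},
        ]},
        {'subModules': [
            {'items': [{'query': '丽江'}, {'query': '三亚'}]},
        ]},
    ]
}

unique_destinations = extract_destinations(sample)
print(unique_destinations)
```

Using .get with defaults means a module without subModules or items is skipped instead of raising KeyError; whether the real response ever omits those keys is not something the source confirms.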
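4. Get the product list
The dep and query values are now available. The next step is to request the data that the page loads as JSON, and to look at how the URL changes between requests and at the routeCount value, which drives paging: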
https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&limit=0,24&includeAD=true&qsact=search
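The part that changes is limit: its first number (the offset) grows in multiples of 24 from one request to the next. Reading routeCount from the first response, which the code below treats as the total number of results, tells us which offsets to request.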
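The paging arithmetic can be sketched on its own. page_limits is a hypothetical helper, and the page size of 24 comes from the limit=0,24 value in the captured URL; the function only generates the limit strings and makes no requests.

```python
def page_limits(route_count, page_size=24):
    """Yield the 'offset,size' strings needed to cover route_count results."""
    for offset in range(0, route_count, page_size):
        yield '{},{}'.format(offset, page_size)


# Example: 60 results need three requests (offsets 0, 24 and 48)
limits = list(page_limits(60))
print(limits)
```

A route count of 0 yields no pages at all, so a destination without results costs no further requests.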
import random
import time
from urllib.parse import quote

import requests

url = 'https://touch.dujia.qunar.com/depCities.qunar'
html = requests.get(url)
# print(html.text)
dep_dict = html.json()

# Request headers copied from a browser session
headers = {
    'cookie': 'QN48=tc_e1b5f5bb4d76a018_16730073949_ad75; csrfToken=d27163582839d6b8cbcb53110ed67077; QN300=organic; QN1=ezu0pVvzuB9qeVd2w90fAg==; _RF1=119.129.117.7; _RSG=AZ4soQG2oI5YMrcq1P6et8; _RDG=283bf2bcd3461d22ef1d94f9276d7c9b85; _RGUID=54b20906-b2d8-48ca-8de8-1990749b55a2; QN205=organic; QN234=home_free_t; _pk_ref.1.8600=%5B%22%22%2C%22%22%2C1542699072%2C%22http%3A%2F%2Ftouch.qunar.com%2F%22%5D; _pk_ses.1.8600=*; QN57=15427010307400.44337198739421924; QN58=1542701030742%7C1542701078367%7C4; QN233=dujia_hy_destination; _pk_id.1.8600=5f2ca9d25160d431.1542699072.1.1542705039.1542699072.; QN243=165',
    'referer': 'https://touch.dujia.qunar.com/p/list?dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&et=&it=dujia_hy_destination',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'
}

# URL template; placeholders in order: dep, query, originalquery, limit offset
list_url = 'https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep={}&query={}%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery={}&limit={},24&qsact=scroll'

# Get the departure-city parameters
for i in dep_dict['data']:
    for j in dep_dict['data'][i]:
        print(j)
        link_url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(j)
        # Sleep for a random interval between requests
        time.sleep(random.randint(1, 2))
        html2 = requests.get(link_url)
        dest_dict = html2.json()
        c_list = []
        # Get the destination parameters
        for k in dest_dict['data']:
            for sub in k['subModules']:
                for m in sub['items']:
                    city = m['query']
                    if city not in c_list:
                        c_list.append(city)
        # print(c_list)
        # Sleep for a random interval between requests
        time.sleep(random.randint(1, 2))
        # Request the data for every destination of this departure city
        for c in c_list:
            # Build the request URL for the current destination c, first page
            url3 = list_url.format(quote(j), quote(c), quote(c), 0)
            # A = url3.replace('https://touch.dujia.qunar.com', '')
            # print(A)
            html3 = requests.get(url=url3, headers=headers)
            print(url3)
            print(html3.json())
            # # Get the routeCount parameter
            # num = int(html3.json()['data']['limit']['routeCount'])
            #
            # # Each page returns only 24 records
            # for n in range(0, num, 24):
            #     url4 = list_url.format(quote(j), quote(c), quote(c), n)
            #     # Sleep for a random interval between requests
            #     time.sleep(random.randint(1, 2))
            #     html4 = requests.get(url=url4, headers=headers)
            #     result = html4.json()
            #     print(result)