Python Crawler Series (5.3: Strategies for Crawling Dynamic Websites)

阿新 • Published: 2018-11-11

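I. Understanding dynamic websites

A dynamic website, in this context, is a page whose content is loaded through AJAX. The content displays normally when the page is opened in a browser, but the corresponding nodes cannot be found in the page source, because the data arrives in a separate request made after the initial HTML has loaded.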
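A quick way to tell whether a page is dynamic in this sense is to search the static HTML for text that is visible in the browser. The sketch below uses only the standard library; `static_html` is a made-up page source and `appears_in_source` is a hypothetical helper name, neither of which comes from the original article.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the visible text found in a static HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and whitespace-only text nodes.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def appears_in_source(html, marker):
    """Return True if `marker` occurs in the visible text of the static HTML."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return any(marker in chunk for chunk in collector.chunks)


# Hypothetical page source: the list container is empty because a script
# fills it in after the page has loaded.
static_html = """
<html><body>
  <h1>Jobs</h1>
  <ul id="job-list"></ul>
  <script>loadJobs("/jobs/positionAjax.json");</script>
</body></html>
"""

print(appears_in_source(static_html, "Jobs"))             # the heading is in the source
print(appears_in_source(static_html, "Python Engineer"))  # the job title is not
```

If text that is visible in the browser is missing from the source, the data is most likely fetched by a later request, which is the case the rest of this article deals with.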

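II. Common ways to scrape dynamic websites

1. Analyze the endpoint that the AJAX call requests, then request that endpoint directly from code.

2. Drive a simulated browser to request the dynamic site, then extract the data from the rendered page.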
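The second approach is not shown in code in this article, so here is a minimal sketch of what it might look like with Selenium 4 and headless Chrome. The function name `fetch_rendered_html` and its parameters `ready_selector` and `timeout` are assumptions made for illustration, and the sketch assumes Selenium and a matching Chrome driver are installed.

```python
def fetch_rendered_html(url, ready_selector, timeout=10):
    """Load `url` in headless Chrome and return the HTML after scripts have run.

    `ready_selector` is a CSS selector for an element that only exists once
    the AJAX data has been inserted into the page (a hypothetical parameter).
    """
    # Imported lazily so this function can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the dynamically loaded element is present, or time out.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ready_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

This trades speed for convenience: a real browser is slower than a direct request, but there is no need to work out which endpoint the page calls.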

I. Analyzing the page's data requests

II. Simulating the data request directly with the requests library

from urllib import parse

import requests


def get_html():
    """
    Fetch job postings from Lagou.

    :return: None; prints the status code and the decoded JSON body.
    """
    # Query-string parameters. Names are kept exactly as in the original
    # request, including the spelling 'needAddtionalResult'.
    params = {
        'px': 'default',
        'city': '深圳',
        'needAddtionalResult': 'false'
    }
    url = 'https://www.lagou.com/jobs/positionAjax.json?{0}'.format(parse.urlencode(params))
    # Form fields sent in the POST body: page number and search keyword.
    form_data = {
        'first': 'true',
        'pn': '1',
        'kd': 'python',
    }
    # Headers copied from the browser request. The Cookie value belongs to one
    # browser session and will expire; replace it with a current one.
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Content-Length': '25',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Cookie': 'JSESSIONID=ABAAABAAAGFABEFF0428531C82A2DC0C1C0DCAE2CA9FABE; _ga=GA1.2.484492928.1537518656; _gid=GA1.2.808273883.1537518656; _gat=1; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1537518657; user_trace_token=20180921163056-a947353a-bd78-11e8-a518-525400f775ce; LGSID=20180921163056-a9473670-bd78-11e8-a518-525400f775ce; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; LGUID=20180921163056-a9473875-bd78-11e8-a518-525400f775ce; TG-TRACK-CODE=index_search; LGRID=20180921163107-af919ae7-bd78-11e8-bb56-5254005c3644; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1537518667; index_location_city=%E6%B7%B1%E5%9C%B3; SEARCH_ID=a81f58b5627b460b8180ad514b967cf1',
        'Host': 'www.lagou.com',
        'Origin': 'https://www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_python?px=default&city=%E6%B7%B1%E5%9C%B3',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
        'X-Anit-Forge-Code': '0',
        'X-Anit-Forge-Token': 'None',
        'X-Requested-With': 'XMLHttpRequest',
    }
    response = requests.post(url=url, data=form_data, headers=headers)
    print(response.status_code)
    print(response.json())


if __name__ == '__main__':
    get_html()
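To go beyond the first page, the form fields can be built per page and the job records pulled out of the decoded JSON. The sketch below makes two assumptions that the article does not confirm: that `first` should be `'true'` only for page 1, and that the records sit under `content -> positionResult -> result` in the response. The helper names `build_form_data` and `extract_positions`, and the `sample` payload, are made up for illustration.

```python
from urllib import parse


def build_form_data(page, keyword):
    """Build the POST form fields for one page of search results."""
    return {
        # Assumption: 'first' is 'true' only when requesting the first page.
        'first': 'true' if page == 1 else 'false',
        'pn': str(page),
        'kd': keyword,
    }


def extract_positions(payload):
    """Return the list of job records from a decoded JSON response.

    Assumption: records live under content -> positionResult -> result.
    Returns an empty list if any level is missing or has the wrong type.
    """
    node = payload
    for key in ('content', 'positionResult', 'result'):
        if not isinstance(node, dict):
            return []
        node = node.get(key)
    return node if isinstance(node, list) else []


# The encoded form for page 1 is the 25-byte body that the hard-coded
# Content-Length header in the request above refers to.
body = parse.urlencode(build_form_data(1, 'python'))
print(body)       # first=true&pn=1&kd=python
print(len(body))  # 25

# Hypothetical payload shaped like the assumed response structure.
sample = {'content': {'positionResult': {'result': [{'positionName': 'Python Engineer'}]}}}
print([job['positionName'] for job in extract_positions(sample)])
print(extract_positions({'success': False}))
```

Because the body length changes with the keyword and page number, it is usually safer to drop the hard-coded `Content-Length` header and let the HTTP library compute it.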
