
Crawler Study Notes 17: Scraping Lagou Job Postings (Asynchronous Loading + Cookie-Simulated Login)

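Many websites require submitting a form to log in or to perform other actions. This can be done with the POST method of the requests library, working out which form fields to send by inspecting the form source and reverse engineering the requests. The code here scrapes Python-related job postings from Lagou as an exercise. Open Lagou and press F12 to bring up the browser developer tools; you will see that the site uses Ajax. Click the Network tab and select the XHR filter: the Headers pane shows the request URL, and the Response pane shows that the returned data is JSON. Because the JSON string is long and complex, it is easier to inspect it in the Preview pane, where it turns out to be exactly the job postings shown on the page. All of the postings sit under content > positionResult > result. After turning the page, the request URL stays the same, but the request method is POST, and one of the submitted fields, pn, changes with the page number. The crawler can be built on this basis. The code is as follows: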

import requests
import json
import time
import pymongo

client = pymongo.MongoClient('localhost',27017)
mydb = client['mydb']
lagou = mydb['lagou']

cookie = 'replace this with your own cookie'

headers = {'cookie': cookie,
           'origin': "https://www.lagou.com",
           'x-anit-forge-code': "0",
           'accept-encoding': "gzip, deflate, br",
           'accept-language': "zh-CN,zh;q=0.8,",
           'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
           'content-type': "application/x-www-form-urlencoded; charset=UTF-8",
           'accept': "application/json, text/javascript, */*; q=0.01",
           'referer': "https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=",
           'x-requested-with': "XMLHttpRequest",
           'connection': "keep-alive",
           'x-anit-forge-token': "None"}


def get_page(url, params):
    # Request the first page to find out how many results exist in total.
    html = requests.post(url,data=params,headers=headers)
    json_data = json.loads(html.text)
    total_count = json_data['content']['positionResult']['totalCount']
    # 15 results per page; round up so a partial last page is kept, capped at 30 pages.
    page_number = min((total_count + 14) // 15, 30)
    get_info(url,page_number)

def get_info(url,page):
    for pn in range(1,page+1):
        params={
            'first':'true',
            'pn':str(pn),
            'kd':'Python'
        }
        try:
            html = requests.post(url,data=params,headers=headers)
            json_data = json.loads(html.text)
            results = json_data['content']['positionResult']['result']
            for result in results:
                infos = {
                    'businessZones':result['businessZones'],
                    'city': result['city'],
                    'companyFullName': result['companyFullName'],
                    'companyLabelList': result['companyLabelList'],
                    'companySize': result['companySize'],
                    'district': result['district'],
                    'education': result['education'],
                    'financeStage': result['financeStage'],
                    'firstType': result['firstType'],
                    'formatCreateTime': result['formatCreateTime'],
                    'gradeDescription': result['gradeDescription'],
                    'imState': result['imState'],
                    'industryField': result['industryField'],
                    'positionAdvantage': result['positionAdvantage'],
                    'salary': result['salary'],
                    'workYear': result['workYear'],
                }
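                # Store the record in MongoDB, then pause to throttle the crawl.
                lagou.insert_one(infos)
                time.sleep(2)
        except requests.exceptions.ConnectionError:
            # Skip this page if the connection fails and continue with the next one.
            pass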

if __name__=='__main__':
    url = 'https://www.lagou.com/jobs/positionAjax.json'
    params = {
        'first': 'true',
        'pn': '1',
        'kd': 'Python'
    }
    get_page(url,params)
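
The page-count arithmetic and the POST form fields can be pulled out into small pure functions, which makes them easy to check without touching the network. This is only a minimal sketch: `page_count` and `build_form` are hypothetical helper names that do not appear in the script, and the page size of 15 and the cap of 30 pages are taken from the code above. This version rounds up, so a final partial page is still requested.

```python
import math

PAGE_SIZE = 15   # results per page, as assumed by the script above
MAX_PAGES = 30   # page cap used by the script above


def page_count(total_count, page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Number of pages to request, rounded up and capped at max_pages."""
    return min(math.ceil(total_count / page_size), max_pages)


def build_form(pn, keyword="Python"):
    """Form fields to POST for result page `pn` (1-based)."""
    return {"first": "true", "pn": str(pn), "kd": keyword}
```

For example, 31 results need three pages rather than two, and `build_form(2)` produces the same fields the script sends for the second page.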


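Lagou uses anti-scraping measures, so requests sent through a simple proxy or with only ordinary headers get blocked with the message “您的操作過於頻繁,請稍後再試” ("Your operations are too frequent, please try again later"). In my tests, sending the complete set of request headers avoids the problem. The scraped data is stored in a MongoDB database.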
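
When the site does block a request, the script above would fail while indexing into the JSON, because the blocked response presumably lacks the usual keys. The sketch below shows one way to detect that case and retry after a pause. It rests on an assumption the source does not confirm: that a blocked response is still JSON but has no `content` key. `extract_results` and `fetch_with_retry` are hypothetical helpers, and `fetch` stands for any callable that returns the decoded JSON of one request.

```python
import time


def extract_results(json_data):
    """Return the list of postings, or None if the response looks blocked.

    Assumption: a blocked response has no 'content' key.
    """
    try:
        return json_data["content"]["positionResult"]["result"]
    except (KeyError, TypeError):
        return None


def fetch_with_retry(fetch, retries=3, delay=5.0, sleep=time.sleep):
    """Call fetch() up to `retries` times, waiting longer after each blocked reply."""
    for attempt in range(retries):
        results = extract_results(fetch())
        if results is not None:
            return results
        # Linear backoff: delay, 2 * delay, 3 * delay, ...
        sleep(delay * (attempt + 1))
    return None
```

In the crawler, `fetch` could be something like `lambda: requests.post(url, data=params, headers=headers).json()`. Passing `sleep` as a parameter keeps the function testable without real waiting.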