
前程無憂 (51job) scraper in practice: crawl any job title by keyword and save the results to a .csv file automatically

![0e644a1fa9dc00c3e7c752bdf4382aa2.jpg](https://upload-images.jianshu.io/upload_images/9136378-72ab92577ff68f7d.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Here I only walk through the main idea and the points that need attention. If anything is unclear, ask in the comments and work through the code given below.

1. Searching 51job with a keyword shows that the results are not loaded via Ajax: it is an ordinary GET request, so all you need to do is construct the URL, send the request, and write the parsing rules. XPath is the recommended way to write them; for this kind of loosely structured page data, try XPath first and fall back to regular expressions only where XPath cannot express the rule. To know how many pages to crawl, locate the element holding the maximum page number with XPath, extract the digits from it, and end the loop with an `if` check once that page is reached, as the sketch below shows.
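A minimal sketch of that page-count step, assuming the count still lives in a `<span class="td">` as it did when this was written, and using `python` as an example keyword:

```python
import re

import requests
from lxml import etree

# First result page for the example keyword 'python'.
url = "https://search.51job.com/list/030800%252C040000%252C030200,000000,0000,00,9,99,python,2,1.html"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
html = etree.HTML(response.content.decode('gbk'))  # 51job serves GBK-encoded pages
# The span text is something like '共52頁,到第頁'; keep only the digits.
td_text = "".join(html.xpath("//span[@class='td']/text()"))
max_page = int("".join(re.findall(r"\d+", td_text)))
print(max_page)
```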

2. The full code is as follows:

```python
# _author_: 'DJS'
# date: 2018-11-19

import csv
import re

import requests
from lxml import etree

headers = {
    "cache-control": "no-cache",
    "postman-token": "72a56deb-825e-3ac3-dd61-4f77c4cbb4d8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
}

def get_url(key1):
    """Yield the listing-page URL for every result page of the keyword search."""
    try:
        i = 0
        # Note: non-ASCII keywords may need percent-encoding before being
        # substituted into the URL.
        url = "https://search.51job.com/list/030800%252C040000%252C030200,000000,0000,00,9,99,{},2,1.html"
        response = requests.get(url.format(key1), headers=headers)
        html = etree.HTML(response.content.decode('gbk'))  # 51job pages are GBK-encoded
        # The total page count sits in a <span class="td">; extract its digits.
        max_page = int("".join(re.findall(r"\d+", "".join(html.xpath("//span[@class='td']/text()")))))
        while True:
            i += 1
            url = "https://search.51job.com/list/030800%252C040000%252C030200,000000,0000,00,9,99,{},2,{}.html"
            url = url.format(key1, i)
            print("*" * 100)
            print("Crawling page %d" % i)
            print("*" * 100)
            yield url
            if max_page == i:  # stop once the last page has been yielded
                break
    except Exception as e:
        print("Could not build the page URLs:", e)

def parse_page(key1):
    """Request every result page and yield one dict per job posting."""
    try:
        for url in get_url(key1):
            response = requests.get(url, headers=headers)
            # The response body is GBK while Python strings are Unicode, so
            # decode the raw bytes as GBK before handing them to lxml; the
            # request encoding and the decoding must match.
            html = etree.HTML(response.content.decode('gbk'))
            rows = html.xpath("//div[@id='resultList']//div[@class='el']")
            for row in rows:
                item = {}
                item["職位"] = "".join(row.xpath("./p/span/a/text()")).replace('\r\n', '').replace(' ', '')
                item["公司名稱"] = "".join(row.xpath("./span[@class='t2']/a/text()")).replace('\r\n', '').replace(' ', '')
                item["工作地點"] = "".join(row.xpath("./span[@class='t3']/text()")).replace('\r\n', '').replace(' ', '')
                item["薪資"] = "".join(row.xpath("./span[@class='t4']/text()")).replace('\r\n', '').replace(' ', '')
                item["釋出時間"] = "".join(row.xpath("./span[@class='t5']/text()")).replace('\r\n', '').replace(' ', '')
                yield item
    except Exception as e:
        print("Could not parse the response:", e)

def save_excel(key1):
    """Stream the parsed items into a CSV file named after the keyword."""
    try:
        header = ['職位', '公司名稱', '工作地點', '薪資', '釋出時間']
        # Open the file once: 'w' creates/overwrites it, the header row goes in
        # first, and every parsed item is appended after it. utf-8-sig lets
        # Excel detect the encoding of the Chinese text.
        with open(key1 + '前程無憂職位資訊.csv', 'w', newline='', encoding='utf-8-sig') as f:
            writer = csv.DictWriter(f, header)
            writer.writeheader()
            for item in parse_page(key1):
                writer.writerow(item)
    except Exception as e:
        print("Could not save the data:", e)

if __name__ == '__main__':
    key1 = input('Enter the job keyword to crawl: ')
    save_excel(key1)
```
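
To sanity-check the output, you can read the file back with `csv.DictReader`; a quick sketch, assuming the script was run with the keyword `python` (so the filename below matches what `save_excel` produced):

```python
import csv

# Print two columns of every saved row as a quick sanity check.
with open('python前程無憂職位資訊.csv', newline='', encoding='utf-8-sig') as f:
    for row in csv.DictReader(f):
        print(row['職位'], row['薪資'])
```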