Web Scraping - 51job Job Listings - A Simple Example

BeautifulSoup · Published 2018-10-07 07:08:14


Index of the "Getting Started with Intelligent Decision-Making" tutorial series

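This is a simple single-page scraping example, but it still has a few pitfalls worth noting. Here is a quick walkthrough of the code.

The target is the 51job website: we search for "人工智能" (artificial intelligence) and scrape the basic information of the job postings returned, such as job title, company name, and salary.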

image.png

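The data is present directly in the page's HTML source (right-click > View Page Source). It can also be seen in the Elements panel (right-click > Inspect):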

image.png

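Notice that the whole job list sits under the element with class='dw_table'. The first element with class='el title' is the table header and must be excluded. It also contains t1, t2, t3 children, but its class='t1' child is a <span>, whereas in a normal job row the t1 element is a <p>.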
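The header-versus-job-row distinction can be tried offline before touching the live site. The snippet below is a minimal sketch: `SAMPLE_HTML` is hypothetical, heavily simplified markup modelled on the structure described above (not copied from 51job), and `job_rows` is a helper name introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the described layout: one header row
# (class "el title", with <span class="t1">) followed by two job rows
# (class "el", with <p class="t1">).
SAMPLE_HTML = """
<div class="dw_table">
  <div class="el title">
    <span class="t1">Job title</span>
    <span class="t2">Company</span>
  </div>
  <div class="el">
    <p class="t1"><a title="AI Engineer" href="#">AI Engineer</a></p>
    <span class="t2">Example Co.</span>
  </div>
  <div class="el">
    <p class="t1"><a title="Data Scientist" href="#">Data Scientist</a></p>
    <span class="t2">Sample Ltd.</span>
  </div>
</div>
"""


def job_rows(html):
    """Return the rows that hold real jobs, skipping the header row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("div", "dw_table")
    # Searching for class "el" also matches "el title", so the header row is
    # among the candidates; it is dropped because it has no <p class="t1">.
    return [row for row in table.find_all("div", "el") if row.find("p", "t1")]


titles = [row.find("p", "t1").find("a")["title"] for row in job_rows(SAMPLE_HTML)]
print(titles)
```

Filtering on the tag name of the t1 element, rather than on the row's class, mirrors the check used in the main code.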

Here is the main code:

from bs4 import BeautifulSoup
import requests
import time

headers = {
    'User-Agent': 'Mozilla/5.0'
}
url = 'https://search.51job.com/list/070300,000000,0000,00,9,99,%25E4%25BA%25BA%25E5%25B7%25A5%25E6%2599%25BA%25E8%2583%25BD,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='
html = requests.get(url, headers=headers)
html = html.text.encode('ISO-8859-1').decode('gbk')  # Watch out for this pitfall!
soup = BeautifulSoup(html, 'html.parser')

for item in soup.find('div', 'dw_table').find_all('div', 'el'):
    shuchu = []  # output fields for one row
    if item.find('p', 't1'):  # the header row has no <p class="t1">
        title = item.find('p', 't1').find('a')['title']  # job title
        company = item.find('span', 't2').string  # company name
        address = item.find('span', 't3').string  # location
        xinzi = item.find('span', 't4').string  # salary
        date = item.find('span', 't5').string  # posting date
        shuchu.append(str(title))
        shuchu.append(str(company))
        shuchu.append(str(address))
        shuchu.append(str(xinzi))
        shuchu.append(str(date))
        print('\t'.join(shuchu))
time.sleep(1)  # pause after the request; only matters if more pages are fetched afterwards

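A few pitfalls to watch out for:

html=html.text.encode('ISO-8859-1').decode('gbk') Without this line the Chinese text comes out garbled. Requests takes the encoding from the charset in the HTTP Content-Type header, and when a text response declares no charset there, it falls back to 'ISO-8859-1'. The 51job response happens not to declare one, while the page itself is actually encoded in GBK, so html.text is decoded with the wrong codec. Encoding the text back to 'ISO-8859-1' recovers the original bytes, and decoding those bytes as gbk makes the Chinese display correctly.
The code above uses if item.find('p','t1'): to skip the first row, which is the table header; compare the page screenshot above.
shuchu.append(str(title)) Every field is wrapped in str(...) because the company name, salary, or location may occasionally be empty, in which case .string gives None. str(None) is the string 'None', which is no longer an empty value. This matters because '\t'.join(shuchu) raises a TypeError if the shuchu list contains anything that is not a string.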
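The encoding and None pitfalls can both be reproduced without any network access. The following is a minimal sketch using only the standard library; `fix_mojibake` and `join_fields` are helper names made up for this illustration, and the byte string merely simulates what a GBK-encoded page would send.

```python
def fix_mojibake(garbled, wrong="ISO-8859-1", actual="gbk"):
    """Undo a decode done with the wrong codec, then decode with the right one."""
    # ISO-8859-1 maps every byte to exactly one character, so encoding back
    # with it recovers the original bytes without loss.
    return garbled.encode(wrong).decode(actual)


def join_fields(fields):
    """Join fields with tabs, turning None (or any non-string) into text first."""
    return "\t".join(str(field) for field in fields)


# Simulate the server side: the search keyword encoded as GBK bytes.
raw_bytes = "人工智能".encode("gbk")

# Simulate the wrong decode that happens when ISO-8859-1 is assumed.
garbled = raw_bytes.decode("ISO-8859-1")

restored = fix_mojibake(garbled)
print(restored == "人工智能")

# A row with a missing salary: joining it directly would raise TypeError.
row = ["AI Engineer", "Example Co.", None]
try:
    "\t".join(row)
    join_failed = False
except TypeError:
    join_failed = True

print(join_failed)
print(join_fields(row))
```

When working with a real Requests response, assigning the correct codec to the response's encoding attribute before reading its text is another way to get the same result, since text is decoded using that attribute.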

The final output looks roughly like this:

image.png


A new era of intelligent decision-making for everyone

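If you spot a mistake in this article, please leave a comment to point it out.

If you found it useful, please give it a like.

If you found it very useful, feel free to repost it.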
