
Python Web Scraping: Hands-On Parsing with bs4


1. Common methods

from bs4 import BeautifulSoup

html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
    <tr class="h">
        <td class="l" width="374">職位名稱</td>
        <td>職位類別</td>
        <td>人數</td>
        <td>地點</td>
        <td>發布時間</td>
    </tr>
    <tr class="even">
        <td class="l square"><a target="_blank" href="position_detail.php?id=45021&keywords=python&tid=0&lid=0">22989-騰訊雲計費PHP高級開發工程師</a></td>
        <td>技術類</td>
        <td>2</td>
        <td>深圳</td>
        <td>2018-10-23</td>
    </tr>
    <tr class="odd">
        <td class="l square"><a target="_blank" href="position_detail.php?id=45005&keywords=python&tid=0&lid=0">25663-騰訊雲高級後臺開發(互聯網業務)(北京)</a></td>
        <td>技術類</td>
        <td>1</td>
        <td>北京</td>
        <td>2018-10-23</td>
    </tr>
    <tr class="even">
        <td class="l square"><a target="_blank" href="position_detail.php?id=45007&keywords=python&tid=0&lid=0">TEG06-雲計算架構師(深圳)</a></td>
        <td>技術類</td>
        <td>1</td>
        <td>深圳</td>
        <td>2018-10-23</td>
    </tr>
    <tr class="odd">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44980&keywords=python&tid=0&lid=0">PCG04-PCG研發部數據科學家(深圳/北京)</a></td>
        <td>技術類</td>
        <td>1</td>
        <td>深圳</td>
        <td>2018-10-23</td>
    </tr>
    <tr class="even">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44981&keywords=python&tid=0&lid=0">PCG04-PCG研發部業務運維工程師(深圳)</a></td>
        <td>技術類</td>
        <td>1</td>
        <td>深圳</td>
        <td>2018-10-23</td>
    </tr>
    <tr class="odd">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44971&keywords=python&tid=0&lid=0">23674-騰訊新聞大數據分析工程師(北京)</a></td>
        <td>技術類</td>
        <td>2</td>
        <td>北京</td>
        <td>2018-10-23</td>
    </tr>
    <tr class="even">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44964&keywords=python&tid=0&lid=0">TEG05-高級數據挖掘工程師(深圳)</a></td>
        <td>技術類</td>
        <td>2</td>
        <td>深圳</td>
        <td>2018-10-23</td>
    </tr>
    <tr class="odd">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44968&keywords=python&tid=0&lid=0">PCG01-QQ後臺推薦算法工程師</a></td>
        <td>技術類</td>
        <td>1</td>
        <td>深圳</td>
        <td>2018-10-23</td>
    </tr>
    <tr class="even">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44969&keywords=python&tid=0&lid=0">PCG01-QQ後臺大數據開發工程師</a></td>
        <td>技術類</td>
        <td>1</td>
        <td>深圳</td>
        <td>2018-10-23</td>
    </tr>
    <tr class="odd">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44952&keywords=python&tid=0&lid=0">22989-騰訊雲AI產品高級咨詢顧問(深圳北京)</a></td>
        <td>技術類</td>
        <td>1</td>
        <td>深圳</td>
        <td>2018-10-23</td>
    </tr>
</table>    
"""

soup = BeautifulSoup(html, "lxml")

# 1. Find all tr tags
# trs = soup.find_all("tr")
# 2. Find the second tr tag. limit caps the number of results;
#    index into the returned list to get the tag itself
# tr = soup.find_all("tr", limit=2)[1]
# 3. Find all tr tags whose class is "even". class is a Python keyword,
#    so the argument takes a trailing underscore
# trs = soup.find_all("tr", class_="even")
# 4. attrs accepts several attributes as key-value pairs
# trs = soup.find_all("tr", attrs={"class": "even"})
# 5. Find every a tag with target="_blank"; more keyword arguments can be added
# aList = soup.find_all("a", target="_blank")
# 6. Get the href attribute of every a tag
# aList = soup.find_all("a")
# for a in aList:
#     # 6.1 By subscript
#     href = a["href"]
#     # 6.2 Through the attrs dictionary
#     href = a.attrs["href"]

# Extract every job posting, skipping the first (header) row
trs = soup.find_all("tr")[1:]
jobs = []
for tr in trs:
    job = {}
    # Alternative: read each cell individually
    # tds = tr.find_all("td")
    # title = tds[0].string
    # category = tds[1].string
    # nums = tds[2].string
    # city = tds[3].string
    # pubtime = tds[4].string
    # job["title"] = title
    # job["category"] = category
    # job["nums"] = nums
    # job["city"] = city
    # job["pubtime"] = pubtime
    # jobs.append(job)

    # Collect all text in the row, stripped of surrounding whitespace
    infos = list(tr.stripped_strings)
    job["title"] = infos[0]
    job["category"] = infos[1]
    job["nums"] = infos[2]
    job["city"] = infos[3]
    job["pubtime"] = infos[4]
    jobs.append(job)
print(jobs)
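
The find_all filters listed above can be tried side by side on a smaller table. This is only a sketch: the two-row sample and its values are made up for illustration, and it uses Python's built-in "html.parser" instead of "lxml" so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the job table in the article.
sample = """
<table class="tablelist">
  <tr class="h"><td>Title</td><td>City</td></tr>
  <tr class="even"><td><a target="_blank" href="detail.php?id=1">Backend Engineer</a></td><td>Shenzhen</td></tr>
  <tr class="odd"><td><a target="_blank" href="detail.php?id=2">Data Scientist</a></td><td>Beijing</td></tr>
</table>
"""

# Assumption: "html.parser" (standard library) stands in for "lxml".
soup = BeautifulSoup(sample, "html.parser")

all_rows = soup.find_all("tr")                                # every tr tag
second_row = soup.find_all("tr", limit=2)[1]                  # stop after two matches, take the second
even_by_keyword = soup.find_all("tr", class_="even")          # class_ avoids the Python keyword
even_by_attrs = soup.find_all("tr", attrs={"class": "even"})  # same filter as a dict
hrefs = [a["href"] for a in soup.find_all("a", target="_blank")]

print(len(all_rows))
print(second_row["class"])   # class is multi-valued, so bs4 returns a list
print(even_by_keyword == even_by_attrs)
print(hrefs)
```

Both ways of filtering on class select the same rows here; the attrs dictionary is mainly useful when an attribute name is not a valid Python identifier.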
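
The loop above offers two ways to read a row: .string on each td, or stripped_strings on the whole tr. The sketch below illustrates how they differ, assuming a made-up row in which one cell is padded with whitespace and another mixes text with a nested tag. The sample data and the use of "html.parser" are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical row: cell 3 has padding, cell 4 mixes text and a child tag.
row_html = """
<table>
<tr class="even">
    <td class="l square"><a href="detail.php?id=1">Backend Engineer</a></td>
    <td>Tech</td>
    <td>
        2
    </td>
    <td>Shenzhen <span>(HQ)</span></td>
</tr>
</table>
"""
soup = BeautifulSoup(row_html, "html.parser")
tds = soup.find_all("td")

title = tds[0].string    # a single nested <a>: .string reaches through to its text
padded = tds[2].string   # surrounding whitespace is kept as-is
mixed = tds[3].string    # more than one child, so .string is None

# stripped_strings yields every text node in the row, already stripped.
texts = list(soup.tr.stripped_strings)

# get_text with a separator keeps one value per cell.
cells = [td.get_text(" ", strip=True) for td in tds]

print(repr(title), repr(padded), mixed)
print(texts)
print(cells)
```

When every cell holds exactly one piece of text, as in the article's table, both approaches give the same five values. Indexing into stripped_strings becomes fragile once a cell is empty or contains several text nodes, because the positions shift.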

2. CSS selector methods

# These examples reuse the soup object built in section 1.
# 1. Get all tr tags
# trs = soup.select("tr")
# 2. Get the second tr tag
# tr = soup.select("tr")[1]
# 3. Get all tr tags whose class is "even"
# trs = soup.select("tr.even")
# trs = soup.select("tr[class='even']")
# 4. Get the href attribute of every a tag
# aList = soup.select("a")
# for a in aList:
#     print(a["href"])
# 5. Extract the text of every row (the header row is included)
# trs = soup.select("tr")
# for tr in trs:
#     infos = list(tr.stripped_strings)
#     print(infos)
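
The selectors above can be run on a small table as follows. This is a sketch under assumptions: the sample rows and the "title"/"city" keys are invented for illustration, "html.parser" replaces "lxml", and the installed bs4 is assumed to be recent enough that select() is backed by the soupsieve package.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the job table in the article.
sample = """
<table class="tablelist">
  <tr class="h"><td class="l">Title</td><td>City</td></tr>
  <tr class="even"><td class="l square"><a target="_blank" href="detail.php?id=1">Backend Engineer</a></td><td>Shenzhen</td></tr>
  <tr class="odd"><td class="l square"><a target="_blank" href="detail.php?id=2">Data Scientist</a></td><td>Beijing</td></tr>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")

even_rows = soup.select("tr.even")                 # class selector
even_rows_attr = soup.select("tr[class='even']")   # attribute selector, exact value
first_odd_link = soup.select_one("tr.odd a")       # first match only, or None
hrefs = [a["href"] for a in soup.select("a")]

# .l matches any element carrying the class "l";
# [class='l'] matches only when the whole attribute value is exactly "l".
by_class = soup.select("td.l")
by_exact_attr = soup.select("td[class='l']")

# Skip the header row and pair each row's text with field names.
fields = ("title", "city")
jobs = [dict(zip(fields, tr.stripped_strings)) for tr in soup.select("tr")[1:]]

print(len(even_rows), len(even_rows_attr))
print(first_odd_link.get_text())
print(hrefs)
print(len(by_class), len(by_exact_attr))
print(jobs)
```

The class selector is usually the safer choice for elements with several classes, such as the td cells with class="l square" in the article's table, since the exact-value attribute selector does not match them.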

Reference blog: https://www.cnblogs.com/zhangxinqi/p/9218395.html#_label5
