Learn BeautifulSoup with the Knowledge Seeker: if you still can't learn it, I'll take the blame without a word

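1 Preface

Beautiful Soup is a Python library for extracting data from HTML and XML files. Its extraction capabilities are strong enough that the Knowledge Seeker gave up matching HTML nodes with regular expressions. Beautiful Soup lets you fetch a node directly by its tag name or through a search function, which greatly speeds up development. If you finish this article and still can't use Beautiful Soup, nobody can save you. If you find the Knowledge Seeker's articles interesting, please follow and like. Thanks!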

2 Basic usage of Beautiful Soup

Beautiful Soup supports the following parsers:

Parser                   Usage example
Python standard library  BeautifulSoup(markup, "html.parser")
lxml HTML parser         BeautifulSoup(markup, "lxml")
lxml XML parser          BeautifulSoup(markup, "xml")
html5lib                 BeautifulSoup(markup, "html5lib")

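Readers can follow this article with either the Python standard library parser or the lxml HTML parser. Keep in mind that wherever the text below says it "gets a tag", what you actually get is a tag object;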

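A brief summary of the attributes:

Attribute                   Meaning
soup.tag.name               Name of the tag
soup.tag.string             Text content of the tag (None if the tag has several children)
soup.tag                    The first matching tag
soup.tag.attrs              All attributes of the tag, as a dict
soup.tag.attrs['class']     Value of the tag's class attribute
soup.tag1.tag2              First tag2 nested inside tag1
soup.tag.contents           All direct children of the tag, as a list
soup.tag.children           Direct children, as an iterator
soup.tag.descendants        All descendants, as a generator
soup.tag.parent             Direct parent node
soup.tag.parents            Ancestor nodes, as a generator
soup.tag.next_sibling       Next sibling node
soup.tag.previous_sibling   Previous sibling node
soup.tag.next_siblings      All following sibling nodes, as a generator
soup.tag.previous_siblings  All preceding sibling nodes, as a generator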
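
To make the table concrete, here is a minimal sketch that exercises several of these attributes at once. The markup is a made-up sample (it is not the HTML used in the sections below), deliberately written without whitespace between tags so that no whitespace text nodes show up among the children:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, no whitespace between tags
html = '<ul id="menu"><li class="item first">one</li><li class="item">two</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.li                          # first <li> in the document
print(first.name)                        # li
print(first.string)                      # one
print(first.attrs)                       # {'class': ['item', 'first']}
print(first.parent.name)                 # ul
print(first.next_sibling.string)         # two
print(len(soup.ul.contents))             # 2
print([p.name for p in first.parents])   # ['ul', '[document]']
```

The last line shows that the chain of ancestors ends at the BeautifulSoup object itself, whose name is [document].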

2.1 Formatting HTML

  1. Create a BeautifulSoup instance, passing the HTML string and the parser name html.parser
  2. Call prettify() to format the HTML document
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

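# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

The output is shown below. It is neatly indented and the structure is easy to read; the missing closing tags </form> and </div> have been filled in as well: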

<div class="filter-box d-flex align-items-center">
 <form action="" id="seeOriginal">
  <dl class="filter-sort-box d-flex align-items-center">
   <dt>
    排序:
   </dt>
   <dd>
    <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">
     預設
    </a>
   </dd>
   <dd>
    <a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
     <svg aria-hidden="true" class="icon">
      <use xlink:href="#csdnc-rss">
      </use>
     </svg>
     RSS訂閱
    </a>
   </dd>
  </dl>
 </form>
</div>

2.2 Getting a tag node

  1. soup.dt directly returns the first dt tag object that matches;
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Prints the node <dt>排序:</dt>
print(soup.dt)

2.3 Getting a node's text

soup.dt.string returns the text contained in the dt tag;

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Prints the text content: 排序:
print(soup.dt.string)
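
One catch with .string is worth a quick sketch: it only returns text when the tag has a single child, and returns None when the tag mixes text with other tags. The markup here is a made-up sample, and get_text() is shown as one possible fallback rather than the only way to do it:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: <p> has one text child, <div> has text plus a <b> tag
html = '<p>plain</p><div>text <b>bold</b></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.p.string)          # plain
print(soup.div.string)        # None, because <div> has two children
print(soup.div.get_text())    # text bold
```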

2.4 Getting a node's name

soup.dt.name returns the name of the dt tag;

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Prints dt
print(soup.dt.name)

2.5 Getting a node's object type

Passing a tag to the built-in type() function shows that its type is <class 'bs4.element.Tag'>

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
dt = soup.dt
# Prints <class 'bs4.element.Tag'>
print(type(dt))

2.6 Getting all attributes

soup.a.attrs returns all attributes of the first a tag that matches;

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.a.attrs)

The output lists every attribute of the first matching a tag:

{'href': 'javascript:void(0);', 'data-report-query': '', 'class': ['btn-filter-sort', 'active'], 'target': '_self'}

2.7 Getting a specific attribute

soup.a.attrs['href'] returns the value of the href attribute of the first a tag that matches

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Prints javascript:void(0);
print(soup.a.attrs['href'])
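
Going through .attrs is not the only way to read an attribute. The sketch below, on a made-up a tag, shows the dictionary-style shortcuts on the tag itself; get() is handy when the attribute may be missing, since indexing a missing attribute raises KeyError:

```python
from bs4 import BeautifulSoup

# Hypothetical sample tag
html = '<a class="btn rss" href="/feed">RSS</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.a

print(link['href'])            # /feed, same value as link.attrs['href']
print(link['class'])           # ['btn', 'rss'], class is multi-valued
print(link.get('target'))      # None, the attribute is absent
print(link.has_attr('href'))   # True
```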

2.8 Getting a child node

soup.form.dd returns the first dd tag nested inside the form tag (it does not have to be a direct child)

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.form.dd)

Output:

<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>

2.9 Getting all direct child nodes

soup.form.contents returns all direct children of form as a list, including whitespace text nodes such as '\n';

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.form.contents)

Output:

['\n', <dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>]
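
The '\n' entry in that list is a whitespace text node, which is easy to trip over. One possible way to keep only the tag children is an isinstance check against Tag, sketched here on a small made-up dl:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample with newlines between tags, like real-world HTML
html = """<dl>
<dt>A</dt>
<dd>B</dd>
</dl>"""
soup = BeautifulSoup(html, 'html.parser')

# '\n', <dt>, '\n', <dd>, '\n'
print(len(soup.dl.contents))                 # 5

# Keep only Tag objects, dropping the whitespace strings
tags_only = [c for c in soup.dl.contents if isinstance(c, Tag)]
print([t.name for t in tags_only])           # ['dt', 'dd']
```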

2.10 Getting an iterator over direct children

soup.svg.children returns an iterator over all direct children of the svg tag;

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
for index, child in enumerate(soup.svg.children):
    print(index, child)

Output:

0 

1 <use xlink:href="#csdnc-rss"></use>
2 

2.11 Getting a generator of all descendants

soup.dl.descendants yields every descendant of the dl tag, not only its direct children;

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
for index, child in enumerate(soup.dl.descendants):
    print(index, child)

Output:

0 

1 <dt>排序:</dt>
2 排序:
3 

4 <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>
5 <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a>
6 預設
7 

8 <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
9 <a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
10 

11 <svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>
12 

13 <use xlink:href="#csdnc-rss"></use>
14 

15 RSS訂閱
16 

17 
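
As the listing shows, descendants mixes tags, text and whitespace. A minimal sketch of two ways to narrow it down, on made-up markup: filter for Tag objects, or use stripped_strings to get only the non-blank text:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup
html = """<dl>
<dt>A</dt>
<dd><a>B</a></dd>
</dl>"""
soup = BeautifulSoup(html, 'html.parser')

# Only the tags, in document order
tag_names = [node.name for node in soup.dl.descendants if isinstance(node, Tag)]
print(tag_names)                          # ['dt', 'dd', 'a']

# Only the text, with whitespace-only strings skipped
texts = list(soup.dl.stripped_strings)
print(texts)                              # ['A', 'B']
```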

2.12 Getting the direct parent

soup.a.parent returns the parent tag object of the first a tag that matches;

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.a.parent)

Output:

<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>

2.13 Getting a generator of ancestors

soup.a.parents yields every ancestor of the first a tag that matches, from its parent up to the document root, as a generator;

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# The last ancestor is the BeautifulSoup object itself, whose name is [document]
for node in soup.a.parents:
    print(node.name)

Output:

dd
dl
form
div
[document]

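2.14 Getting sibling nodes

Sibling attributes have a pitfall: the whitespace between two tags is itself a text node, so next_sibling usually returns that whitespace rather than the next tag. For that reason they are only covered briefly here

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup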

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

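# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.dt.next_sibling)

The output is blank, because the sibling is the whitespace text node between </dt> and the first <dd>. The other sibling attributes are not shown; on markup like this they mostly return whitespace or None;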
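
If what you want is the next tag rather than the whitespace, one option is the find_next_sibling() and find_previous_sibling() functions listed in the next section, which match tags by name and so step over text nodes. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample with newlines between tags
html = """<dl>
<dt>A</dt>
<dd>B</dd>
</dl>"""
soup = BeautifulSoup(html, 'html.parser')

# The attribute returns the whitespace text node
print(repr(soup.dt.next_sibling))             # '\n'

# The functions return the matching sibling tag
print(soup.dt.find_next_sibling('dd'))        # <dd>B</dd>
print(soup.dd.find_previous_sibling('dt'))    # <dt>A</dt>
```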

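3 Searching the document

Section 2 only laid the groundwork; this section is the important one;

Function                                                                        Meaning
find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)  Find all matching nodes
find(name=None, attrs={}, recursive=True, text=None, **kwargs)                  Find the first matching node
find_parent(name=None, attrs={}, **kwargs)                                      Return the closest matching parent of the current node
find_parents(name=None, attrs={}, **kwargs)                                     Return all matching ancestors of the current node
find_next_sibling(name=None, attrs={}, text=None, **kwargs)                     Return the first matching sibling after the current node
find_next_siblings(name=None, attrs={}, text=None, **kwargs)                    Return all matching siblings after the current node
find_previous_sibling(name=None, attrs={}, text=None, **kwargs)                 Return the first matching sibling before the current node
find_previous_siblings(name=None, attrs={}, text=None, **kwargs)                Return all matching siblings before the current node
find_next(name=None, attrs={}, text=None, **kwargs)                             Return the first matching node after the current node in the document
find_all_next(name=None, attrs={}, text=None, limit=None, **kwargs)             Return all matching nodes after the current node in the document
find_previous(name=None, attrs={}, text=None, **kwargs)                         Return the first matching node before the current node in the document
find_all_previous(name=None, attrs={}, text=None, limit=None, **kwargs)         Return all matching nodes before the current node in the document
  1. name is the tag name to match
  2. attrs searches on the given attributes
  3. recursive controls whether all descendants are searched; the default is True, and setting it to False searches direct children only
  4. limit caps the number of results returned
  5. **kwargs lets you search on common attributes directly, for example id='zszxz'
  6. text filters on the text content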
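
The parameters above can be sketched together on a small made-up list. One assumption to flag: the text filter is written here as string=, which is the name newer Beautiful Soup 4 releases use for the same filter; on an older release you may need text= instead:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = '<ul><li id="a">apple</li><li id="b">banana</li><li id="c">cherry</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first_two = soup.find_all('li', limit=2)                  # name + limit
by_kwarg = soup.find_all(id='b')                          # **kwargs
by_attrs = soup.find_all(attrs={'id': 'b'})               # attrs dict, same result
by_text = soup.find_all('li', string=re.compile('an'))    # text filter
missing = soup.find('li', id='zzz')                       # find returns None on no match

print([t['id'] for t in first_two])   # ['a', 'b']
print(by_kwarg)                       # [<li id="b">banana</li>]
print(by_attrs)                       # [<li id="b">banana</li>]
print(by_text)                        # [<li id="b">banana</li>]
print(missing)                        # None
```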

This section concentrates on find_all; find works the same way but returns only the first match, so once you know one you can use the other;

3.1 The name parameter

soup.find_all(name='dd') finds all dd tag objects and returns them as a list;

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(name='dd'))

Output:

[<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>, <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>]

Note: soup.find_all(name='dd') is equivalent to soup.find_all('dd');
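
The name argument is not limited to a single string. As a rough sketch on made-up markup, it can also be a list of names, True (any tag), or a function that receives each tag and returns True or False:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = '<dl><dt>k</dt><dd>v1</dd><dd>v2</dd></dl>'
soup = BeautifulSoup(html, 'html.parser')

# A list matches any of the given names
both = soup.find_all(['dt', 'dd'])
print([t.name for t in both])             # ['dt', 'dd', 'dd']

# True matches every tag
every = soup.find_all(True)
print([t.name for t in every])            # ['dl', 'dt', 'dd', 'dd']

# A function is called once per tag
picked = soup.find_all(lambda tag: tag.name == 'dd' and tag.string == 'v2')
print(picked)                             # [<dd>v2</dd>]
```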

3.2 The attrs parameter

soup.find_all(attrs={'id':'seeOriginal'}) returns all tag objects whose id attribute equals seeOriginal

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(attrs={'id': 'seeOriginal'}))

Output:

[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl></form>]

3.3 The recursive parameter

soup.find_all('dl', recursive=False) searches only the direct children of soup. The dl tag is nested inside div and form, so with recursive set to False it is no longer found;

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('dl', recursive=False))

The output is an empty list: []
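
The empty list does not mean recursive=False is useless; it depends on which node you call it on. A minimal sketch on made-up markup with the same div > form > dl nesting:

```python
from bs4 import BeautifulSoup

# Hypothetical sample with the same nesting as above
html = '<div><form><dl><dt>x</dt></dl></form></div>'
soup = BeautifulSoup(html, 'html.parser')

# dl is not a direct child of the document
top_dl = soup.find_all('dl', recursive=False)
print(top_dl)                                   # []

# div is a direct child of the document
top_div = soup.find_all('div', recursive=False)
print(len(top_div))                             # 1

# dl is a direct child of form
form_dl = soup.form.find_all('dl', recursive=False)
print(form_dl)                                  # [<dl><dt>x</dt></dl>]
```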

3.4 The limit parameter

soup.find_all('dd', limit=1) limits the result to a single item

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('dd', limit=1))

Output:

[<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>]
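
find_all with limit=1 and find look for the same node but hand it back differently: a one-element list versus the tag itself, and an empty list versus None when nothing matches. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = '<dl><dd>first</dd><dd>second</dd></dl>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('dd', limit=1))     # [<dd>first</dd>]
print(soup.find('dd'))                  # <dd>first</dd>

# No <span> in the document
print(soup.find_all('span', limit=1))   # []
print(soup.find('span'))                # None
```

Because find can return None, it is worth checking the result before reading attributes from it.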

3.5 kwargs: matching an attribute

soup.find_all(id='seeOriginal') searches on the id attribute directly

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(id='seeOriginal'))

Output:

[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl></form>]

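3.6 kwargs: matching with a regular expression

soup.find_all(href=re.compile("java.*?")) returns the tags whose href attribute matches the regular expression. Beautiful Soup applies the pattern with search(), so this matches any href that contains java; use "^java" to require that the value starts with java;

# -*- coding: utf-8 -*-
import re

from bs4 import BeautifulSoup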

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(href=re.compile("java.*?")))

Output:

[<a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a>]
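
Beautiful Soup applies a compiled pattern with search(), so an unanchored pattern matches anywhere in the attribute value. The sketch below uses two made-up links to show the difference an anchor makes:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: one href starts with "java", the other only contains it
html = '<a href="javascript:void(0);">js</a><a href="/docs/java/intro">docs</a>'
soup = BeautifulSoup(html, 'html.parser')

# Unanchored: matches both links
loose = soup.find_all(href=re.compile('java'))
print(len(loose))                       # 2

# Anchored with ^: only the href that starts with "java"
anchored = soup.find_all(href=re.compile('^java'))
print([a['href'] for a in anchored])    # ['javascript:void(0);']
```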

3.7 Searching by CSS class

soup.find_all("a", class_="btn") finds a tags whose class attribute includes the class btn

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all("a", class_="btn"))

Output:

[<a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>]
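
Notice that the first a tag, whose class is btn-filter-sort active, was not returned: class_ compares against each whole class name, not a prefix. A minimal sketch of the matching rules on two made-up links:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = '<a class="btn btn-sm rss">rss</a><a class="btn-filter-sort active">default</a>'
soup = BeautifulSoup(html, 'html.parser')

# Matches a single class name; "btn-filter-sort" is a different class
single = soup.find_all('a', class_='btn')
print(len(single))                                        # 1

# The exact attribute string also matches
exact = soup.find_all('a', class_='btn btn-sm rss')
print(len(exact))                                         # 1

# The same classes in another order do not match as a string
reordered = soup.find_all('a', class_='rss btn')
print(len(reordered))                                     # 0

# A CSS selector matches several classes in any order
selected = soup.select('a.rss.btn')
print(len(selected))                                      # 1
```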

4 CSS selectors

Beautiful Soup also supports searching with CSS selectors directly through select(); the commonly used forms are shown below;

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
        <svg class="icon" aria-hidden="true">
            <use xlink:href="#csdnc-rss"></use>
        </svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Select dt tags nested under a dl tag
dt_list = soup.select('dl dt')
print(dt_list)
dd_list = soup.select('dl dd')
print(dd_list[0])
# Search with an id selector
by_id = soup.select('#seeOriginal')
print(by_id)
# Search with a class selector
by_class = soup.select('.btn-filter-sort')
print(by_class[0])

The outputs are, in order:

soup.select('dl dt')

[<dt>排序:</dt>]

soup.select('dl dd')[0]

<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>

soup.select('#seeOriginal')

[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl></form>]

soup.select('.btn-filter-sort')[0]

<a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a>
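
A few more selector forms are sketched below on made-up markup: select_one() for a single result, the child combinator >, and an attribute prefix selector. This assumes a Beautiful Soup 4 install that bundles the Soup Sieve selector engine, which recent releases do:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = ('<dl id="list"><dt>k</dt>'
        '<dd><a href="https://example.com/rss">RSS</a></dd>'
        '<dd><a href="javascript:void(0);">x</a></dd></dl>')
soup = BeautifulSoup(html, 'html.parser')

# select_one returns the first match, or None
print(soup.select_one('dl > dt'))                 # <dt>k</dt>
print(soup.select_one('table'))                   # None

# Attribute selector: href starts with "https"
https_links = soup.select('a[href^="https"]')
print([a['href'] for a in https_links])           # ['https://example.com/rss']

# id selector combined with descendant selectors
print(soup.select_one('#list dd a').string)       # RSS
```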