"Learning Python from Scratch" (《零基礎入門學習Python》), Lecture 060: The Self-Cultivation of a Web Crawler 8: Regular Expressions 4

阿新 • Published: 2018-12-16

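With the groundwork from the last few lessons in place, we can finally get our hands dirty with something real. Before the hands-on part, though, we still need to cover some practical regular expression methods and the extension syntax. Bear with me a little longer.

Let's start by flipping through the documentation:

The first example is the method we have talked about most, search(). There is a module-level version, which you call directly as re.search(), and a compiled regular expression pattern object has a search() method too. So let me ask: is there any difference between them?

If your only answer is that the module-level search() takes one more argument (the regular expression) than the pattern-level search(), then you clearly haven't read the documentation.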

re.search(pattern, string, flags=0)

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

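This is the module-level search() method. Note its parameters: it has a flags parameter, which takes the compilation flags we covered last lesson. A module-level call has no separate compile step, so you simply pass the flags here directly.

pattern is the regular expression pattern.

string is the string to search.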
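As a minimal sketch of where the flags go in each style (the sample text and variable names here are made up for illustration), the two calls below should find the same match:

```python
import re

text = "I love PYTHON"  # hypothetical sample string

# Module-level call: the flag is passed directly as the third argument.
module_match = re.search(r"python", text, re.IGNORECASE)

# Compiled pattern: the flag is fixed once, at compile time.
pattern = re.compile(r"python", re.IGNORECASE)
compiled_match = pattern.search(text)

print(module_match.group())    # PYTHON
print(compiled_match.group())  # PYTHON
```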

Now let's see what parameters the search() method of a compiled pattern object takes:

regex.search(string[, pos[, endpos]])

Scan through string looking for the first location where this regular expression produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

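The leading pattern argument is no longer needed, because the pattern object already holds it.

The first argument, string, is the string to search.

After it come two optional parameters that the module-level search() does not have: the start position (pos) and the end position (endpos) of the search.

So you can restrict where it searches with a call like rx.search(string, 0, 50), which is equivalent to rx.search(string[:50], 0).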
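The following sketch tries out pos and endpos on a made-up string (the names rx, caret, and text are illustrative only). It also shows the caveat from the documentation quoted above: passing pos is not quite the same as slicing, because '^' still refers to the real start of the string.

```python
import re

rx = re.compile(r"\d+")
text = "abc 123 def 456"  # hypothetical sample string

first = rx.search(text)          # search the whole string
later = rx.search(text, 7)       # start searching at index 7
limited = rx.search(text, 0, 5)  # behave as if the string were text[:5]

print(first.group(), first.span())  # 123 (4, 7)
print(later.group(), later.span())  # 456 (12, 15)
print(limited.group())              # 1

# '^' anchors to the real beginning of the string, not to pos.
caret = re.compile(r"^def")
print(caret.search(text, 8))           # None
print(caret.search(text[8:]).group())  # def
```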

One more point that is easy to overlook: search() does not hand you the matched string right away. Instead, it returns a match object. Here is an example:

>>> import re
>>> result = re.search(r" (\w+) (\w+)", "I love Python.com")
>>> result
<_sre.SRE_Match object; span=(1, 13), match=' love Python'>

As we can see, result is a match object, not a string. The match object has several methods, and you use those to get at the matched string:

For example, the group() method:

>>> result.group()
' love Python'

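That prints the matched text. The pattern is a space, then \w+ (one or more word characters, that is, letters, digits, or underscores), which here matches love, then another space, then another \w+, which here matches Python. The match stops before the dot because . is not a word character.

Speaking of group(), it is worth mentioning that if the regular expression contains subgroups, each subgroup captures the text it matched. Passing an index to group() retrieves the string captured by the corresponding subgroup (indices start at 1). For example: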

>>> result.group(1)
'love'
>>> result.group(2)
'Python'

Besides group(), the match object also has start(), end(), and span() methods, which return the start position, end position, and range of the match, respectively.

match.start([group])

match.end([group])

Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g (equivalent to m.group(g)) is
m.string[m.start(g):m.end(g)]
Note that m.start(group) will equal m.end(group) if group matched a null string. For example, after m = re.search('b(c?)', 'cba'), m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both 2, and m.start(2) raises an IndexError exception.

An example that will remove remove_this from email addresses:
>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'

match.span([group])

For a match m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). group defaults to zero, the entire match.

>>> result.start()
1
>>> result.end()
13
>>> result.span()
(1, 13)
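These methods also accept a group index. As a small sketch reusing the same pattern and string as above, the loop below checks the relationship stated in the documentation, namely that m.string[m.start(g):m.end(g)] equals m.group(g):

```python
import re

m = re.search(r" (\w+) (\w+)", "I love Python.com")

# Group 0 is the whole match; groups 1 and 2 are the two subgroups.
spans = [m.span(g) for g in range(3)]
for g in range(3):
    piece = m.string[m.start(g):m.end(g)]
    print(g, m.span(g), repr(piece), piece == m.group(g))

print(m.groups())  # ('love', 'Python')
```

groups() returns all captured subgroups at once as a tuple, which is handy when you need more than one of them.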

Next, let's talk about the findall() method:

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

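Some of you may think findall() is easy: doesn't it just find everything that matches and return it as a list?

That's right, but only when the regular expression contains no subgroups. When it does contain subgroups, findall() gets clever.

Let's take an example and scrape some images from Baidu Tieba:

Say we want to download all the images on this page: https://tieba.baidu.com/p/4863860271

First we scout the page source and find that each image uses a tag of the form <img class="BDE_Image" src="...jpg">.

Let's go straight to the code. First, we write the following to extract the image addresses: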

import re

# html holds the decoded page source (fetched by open_url() in the full script further down)
p = r'<img class="BDE_Image" src="[^"]+\.jpg"'
imglist = re.findall(p, html)

for each in imglist:
    print(each)

The printed result is:

============== RESTART: C:\Users\XiangyangDai\Desktop\tieba.py ==============
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=65ac7c3d9e0a304e5222a0f2e1c9a7c3/4056053b5bb5c9ea8d7d0bdadc39b6003bf3b34e.jpg"
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=d887aa03394e251fe2f7e4f09787c9c2/77f65db5c9ea15ceaf60e830bf003af33b87b24e.jpg"
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=0db90d472c1f95caa6f592bef9167fc5/2f78cfea15ce36d34f8a8b0933f33a87e850b14e.jpg"
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=abfd18169ccad1c8d0bbfc2f4f3f67c4/bd2713ce36d3d5392db307fa3387e950342ab04e.jpg"

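Clearly these are not the addresses we need; we only want the URL inside src. So the next problem is how to pull that address out. Some of you may already be reaching for the keyboard, but hold on, I have a better way.

Just wrap the image address in parentheses, that is, change

p = r'<img class="BDE_Image" src="[^"]+\.jpg"' to p = r'<img class="BDE_Image" src="([^"]+\.jpg)"'.

Now look at the result when we run it again: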

============== RESTART: C:\Users\XiangyangDai\Desktop\tieba.py ==============
https://imgsa.baidu.com/forum/w%3D580/sign=65ac7c3d9e0a304e5222a0f2e1c9a7c3/4056053b5bb5c9ea8d7d0bdadc39b6003bf3b34e.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=d887aa03394e251fe2f7e4f09787c9c2/77f65db5c9ea15ceaf60e830bf003af33b87b24e.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=0db90d472c1f95caa6f592bef9167fc5/2f78cfea15ce36d34f8a8b0933f33a87e850b14e.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=abfd18169ccad1c8d0bbfc2f4f3f67c4/bd2713ce36d3d5392db307fa3387e950342ab04e.jpg

Excited? Surprised? Hold on; let me finish typing the code first, and then I'll explain.

import urllib.request
import re

def open_url(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36')
    # Pass req (not url) so that the User-Agent header is actually sent
    response = urllib.request.urlopen(req)
    html = response.read()
    return html

def get_img(url):
    html = open_url(url).decode('utf-8')
    # The parentheses make findall() return only the image address
    p = r'<img class="BDE_Image" src="([^"]+\.jpg)"'
    imglist = re.findall(p, html)

    # To inspect the addresses instead of downloading them:
    # for each in imglist:
    #     print(each)

    for each in imglist:
        # Use the last path segment of the URL as the local file name
        filename = each.split('/')[-1]
        urllib.request.urlretrieve(each, filename, None)

if __name__ == '__main__':
    url = "https://tieba.baidu.com/p/4863860271"
    get_img(url)

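Run it, and a pile of images shows up on the desktop (assuming the program is run from the desktop; the images are saved to the current working directory, normally the folder the script is run from).

Now to clear up the puzzle: why does adding a pair of parentheses make things so convenient?

It is because, when the regular expression passed to findall() contains a subgroup, findall() returns just the content of that subgroup. If there are multiple subgroups, it bundles the captured contents of each match into a tuple and returns a list of those tuples.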
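To see the three behaviours side by side without touching the network, here is a small sketch on a made-up HTML fragment (the html string and its alt attributes are invented for illustration and are not taken from the Tieba page):

```python
import re

# Hypothetical HTML fragment, not the real page source
html = '<img src="a.jpg" alt="first"> <img src="b.jpg" alt="second">'

# No subgroup: each item is the whole match.
no_group = re.findall(r'src="[^"]+\.jpg"', html)

# One subgroup: each item is just what the subgroup captured.
one_group = re.findall(r'src="([^"]+\.jpg)"', html)

# Two subgroups: each item is a tuple with one entry per subgroup.
two_groups = re.findall(r'src="([^"]+\.jpg)" alt="([^"]+)"', html)

print(no_group)    # ['src="a.jpg"', 'src="b.jpg"']
print(one_group)   # ['a.jpg', 'b.jpg']
print(two_groups)  # [('a.jpg', 'first'), ('b.jpg', 'second')]
```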

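Let's look at another example, because when findall() is not used well, it leaves a lot of students confused and lost...

Take the regular expression for matching IP addresses from earlier. We use findall() to try to automatically pull the IP addresses from https://www.xicidaili.com/wt/:

The initial code is as follows: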

import urllib.request
import re

def open_url(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36')
    response = urllib.request.urlopen(req)
    html = response.read()
    return html

def get_ip(url):
    html = open_url(url).decode('utf-8')
    # [01] is an optional leading 0 or 1; writing [0,1] would also accept a comma
    p = r'(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])'
    iplist = re.findall(p, html)

    for each in iplist:
        print(each)

if __name__ == "__main__":
    url = "https://www.xicidaili.com/wt/"
    get_ip(url)

The result is as follows:

============== RESTART: C:\Users\XiangyangDai\Desktop\getIP.py ==============
('180.', '180', '122')
('248.', '248', '79')
('129.', '129', '198')
('217.', '217', '7')
('40.', '40', '35')
('128.', '128', '21')
('118.', '118', '106')
('101.', '101', '46')
('3.', '3', '4')

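The result is baffling. Why does this happen? It is obviously not what we wanted. The reason is that our regular expression uses 3 subgroups, so findall() presumptuously splits each match up by subgroup and hands it back to us as a tuple. (The first subgroup is repeated three times, and a repeated group keeps only its last repetition, which is why the first element looks like '180.'.)

Is there a way around this?

To solve it, we can make the subgroups not capture anything.

We consult the reference "Python3 正則表示式特殊符號及用法（詳細列表）" (Python 3 regular expression special symbols and usage, detailed list) and look for the extension syntax.

The extension syntax that stops a subgroup from capturing is the non-capturing group, written (?:...).

So we modify our initial code as follows: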

import urllib.request
import re

def open_url(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36')
    response = urllib.request.urlopen(req)
    html = response.read()
    return html

def get_ip(url):
    html = open_url(url).decode('utf-8')
    # (?:...) groups without capturing, so findall() returns the whole match
    # [01] is an optional leading 0 or 1; writing [0,1] would also accept a comma
    p = r'(?:(?:[01]?\d?\d|2[0-4]\d|25[0-5])\.){3}(?:[01]?\d?\d|2[0-4]\d|25[0-5])'
    iplist = re.findall(p, html)

    for each in iplist:
        print(each)

if __name__ == "__main__":
    url = "https://www.xicidaili.com/wt/"
    get_ip(url)

Running it now gives us the IP addresses we wanted:

============== RESTART: C:\Users\XiangyangDai\Desktop\getIP.py ==============
183.47.40.35
61.135.217.7
221.214.180.122
101.76.248.79
182.88.129.198
175.165.128.21
42.48.118.106
60.216.101.46
219.245.3.4
117.85.221.45
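The difference can be reproduced offline. The sketch below runs both versions of the pattern over a made-up sample string (the sample text is an assumption and is not the real page source; the octet pattern is written with [01] for the optional leading digit). It also illustrates one limitation worth knowing about: without boundaries, the first alternative can stop early on an octet such as 255, and wrapping the pattern in \b is one possible way to avoid that.

```python
import re

# Hypothetical sample text standing in for the downloaded page
sample = "proxy 183.47.40.35 port 8118, proxy 61.135.217.7 port 80"

# One octet of an IP address
octet = r"[01]?\d?\d|2[0-4]\d|25[0-5]"

capturing = rf"(({octet})\.){{3}}({octet})"
non_capturing = rf"(?:(?:{octet})\.){{3}}(?:{octet})"

tuples = re.findall(capturing, sample)
addresses = re.findall(non_capturing, sample)
print(tuples)     # [('40.', '40', '35'), ('217.', '217', '7')]
print(addresses)  # ['183.47.40.35', '61.135.217.7']

# Limitation: "255" is cut short, because [01]?\d?\d already accepts "25".
truncated = re.findall(non_capturing, "gateway 10.0.0.255")
print(truncated)  # ['10.0.0.25']

# One possible fix (an addition, not part of the lesson): require word boundaries.
bounded = rf"\b{non_capturing}\b"
complete = re.findall(bounded, "gateway 10.0.0.255")
print(complete)   # ['10.0.0.255']
```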

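Now back to the documentation:

There are a few other practical methods as well, for example:

finditer() returns the results as an iterator, which makes it convenient to fetch the data by iterating.

sub() performs substitution.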
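A brief sketch of both methods follows; the sample strings and patterns are invented for illustration. finditer() yields match objects, so everything covered above for search() results (group(), span(), and so on) applies to each item. sub() accepts either a replacement string, which may refer back to subgroups, or a function.

```python
import re

text = "183.47.40.35:8118 and 61.135.217.7:80"  # hypothetical sample string

# finditer() yields one match object per match, lazily.
proxies = []
for m in re.finditer(r"(\d+(?:\.\d+){3}):(\d+)", text):
    proxies.append((m.group(1), int(m.group(2)), m.span()))
print(proxies)
# [('183.47.40.35', 8118, (0, 17)), ('61.135.217.7', 80, (22, 37))]

# sub() replaces every match; \1, \2, \3 refer to the captured subgroups.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "Published: 2018-12-16")
print(reordered)  # Published: 16/12/2018

# The replacement can also be a function that receives each match object.
masked = re.sub(r"\d+$", lambda m: "x" * len(m.group()), "61.135.217.7")
print(masked)  # 61.135.217.x
```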

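The reference "Python3 正則表示式特殊符號及用法（詳細列表）" also lists some further special syntax, for example:

(?=...): positive lookahead assertion.

(?!...): negative lookahead assertion.

(?<=...): positive lookbehind assertion.

(?<!...): negative lookbehind assertion.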
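As a rough illustration of the four assertions (the sample string is made up), each pattern below matches a number depending on what comes right before or after it. The assertions only check the surrounding text and do not include it in the match.

```python
import re

text = "price: $30, cost: 45 USD, id: 30"  # hypothetical sample string

# (?=...)  the number must be followed by " USD"
followed = re.findall(r"\d+(?= USD)", text)

# (?!...)  whole numbers that are NOT followed by " USD"
not_followed = re.findall(r"\b\d+\b(?! USD)", text)

# (?<=...) the number must be preceded by "$"
preceded = re.findall(r"(?<=\$)\d+", text)

# (?<!...) whole numbers that are NOT preceded by "$"
not_preceded = re.findall(r"(?<!\$)\b\d+\b", text)

print(followed)      # ['45']
print(not_followed)  # ['30', '30']
print(preceded)      # ['30']
print(not_preceded)  # ['45', '30']
```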

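These are all very useful, but there is quite a lot of material here. If we spent all our time on regular expressions, the sideshow would take over the main act; our main topic is web crawlers, after all.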

So you will still need to study on your own: read more, learn more, and practice more.
