The Python re module

阿新 • Published: 2018-11-05

A regular expression uses a single pattern string to describe, and match, a whole set of strings that follow some syntactic rule.

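Common functions in re:

re.match(pattern, string)
    pattern: the regular expression you write
    string: the string to match against
  Matching starts at the first character of the string and proceeds from left to right. On success it returns a Match object, whose group(), groups() and similar methods return the matched parts of the string. Otherwise it returns None (note: not the empty string "").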
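The behaviour described above can be sketched as follows. The sample strings and the pattern are made up for illustration; only re.match, group() and groups() come from the text.

```python
import re

# re.match only tries to match at the very start of the string.
m = re.match(r'(\d+)-(\d+)', '0311-12345678 is the number')
if m is not None:
    whole = m.group()     # the entire matched text
    first = m.group(1)    # text captured by the first pair of parentheses
    parts = m.groups()    # tuple holding every captured group

# The same pattern fails when the digits are not at the start,
# and the result is None rather than an empty string.
no_match = re.match(r'(\d+)-(\d+)', 'number: 0311-12345678')

print(whole, first, parts, no_match)
```

Checking the result against None before calling group() avoids an AttributeError when nothing matched.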
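re.search(pattern, string)
Same parameters as above. Scans the string for content that matches pattern and stops at the first hit. On success it returns a Match object; otherwise it returns None.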

In[1]: import re
In[2]: s = '123abcnvios321avnie'
In[3]: result = re.search(r'\d+[a-z]', s)
In[4]: result.group()
Out[4]: '123a'

re.findall(pattern, string)
Same parameters as above. Finds every non-overlapping piece of the string that matches pattern and returns them as a list.

In[1]: s = '123abcnvios321avnie'
In[2]: result = re.findall(r'\d+[a-z]', s)
In[3]: result
Out[3]: ['123a', '321a']
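To make the difference between the three functions concrete, here is a small side-by-side sketch on the same sample string. The pattern r'[a-z]+' is chosen for illustration because the string does not start with a letter.

```python
import re

s = '123abcnvios321avnie'
pattern = r'[a-z]+'

at_start = re.match(pattern, s)      # None: s does not start with a letter
first_hit = re.search(pattern, s)    # Match object for the first run of letters
all_hits = re.findall(pattern, s)    # list with every run of letters

print(at_start)            # None
print(first_hit.group())   # abcnvios
print(all_hits)            # ['abcnvios', 'avnie']
```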

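re.sub(pattern, repl, string)
repl: either a string that replaces each match, or a function that receives each match, processes it, and returns the text to substitute for it. re.sub returns a string.

# repl is a string
In[1]: s = 'name=xxx score=66'
# Change the score from 66 to 88
In[2]: result = re.sub(r'\d+', '88', s)
In[3]: result
Out[3]: 'name=xxx score=88'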

# repl is a function
In[1]: s = 'name=xxx math_score=66 english_score=77'
In[2]: def replace(score):
           print(score.group())
           print(type(score))
           return '0'
# The argument passed to replace is a Match object, and the function runs once for every match.
# (On Python 3.7 and later the type prints as <class 're.Match'>.)
In[3]: result = re.sub(r'\d+', replace, s)
66
<class '_sre.SRE_Match'>
77
<class '_sre.SRE_Match'>
In[4]: result
Out[4]: 'name=xxx math_score=0 english_score=0'

# Add 10 points to every score
In[5]: def replace(score):
           return str(int(score.group()) + 10)
In[6]: result = re.sub(r'\d+', replace, s)
In[7]: result
Out[7]: 'name=xxx math_score=76 english_score=87'

re.split(pattern, string)
Splits the string wherever the pattern matches and returns a list.

In[1]: s = 'aaa-bbb:ccc,ddd'
# Split on "-", ":" or ","
In[2]: result = re.split(r'-|:|,', s)
In[3]: result
Out[3]: ['aaa', 'bbb', 'ccc', 'ddd']
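When every separator is a single character, a character class is an equivalent and slightly shorter way to write the same split. The sketch below also shows the optional maxsplit argument of re.split, which the text above does not cover.

```python
import re

s = 'aaa-bbb:ccc,ddd'

by_alternation = re.split(r'-|:|,', s)   # the form used above
by_char_class = re.split(r'[-:,]', s)    # "-" first in the class is a literal "-"
first_only = re.split(r'[-:,]', s, maxsplit=1)  # stop after one split

print(by_alternation)
print(by_char_class)
print(first_only)
```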

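Characters

Character	Meaning
.	Matches any single character (except \n)
[ ]	Matches any one of the characters listed inside [ ]
\d	Matches a digit, i.e. 0-9
\D	Matches a non-digit
\s	Matches whitespace, e.g. space, tab, newline
\S	Matches non-whitespace
\w	Matches a word character, i.e. a-z, A-Z, 0-9, _
\W	Matches a non-word character

^: placed first inside [], it negates the set, i.e. matches any character not listed in the []
\d == [0-9]
\D == [^0-9]
\w == [0-9a-zA-Z_]
(These equivalences hold for ASCII text. For str patterns, Python 3 also matches Unicode digits and letters unless the re.ASCII flag is set.)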
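A quick way to check the table and the equivalences is to run each class over one short ASCII string. The sample text is made up for illustration.

```python
import re

text = 'user_42: ok\tdone\n'

digits = re.findall(r'\d', text)        # digit characters
spaces = re.findall(r'\s', text)        # space, tab and newline
non_word = re.findall(r'\W', text)      # everything that is not a-z, A-Z, 0-9, _

# On ASCII-only text the shorthand classes and the explicit sets agree.
same_digits = re.findall(r'\d', text) == re.findall(r'[0-9]', text)
same_non_digits = re.findall(r'\D', text) == re.findall(r'[^0-9]', text)
same_word = re.findall(r'\w', text) == re.findall(r'[0-9a-zA-Z_]', text)

# "." matches any character except the newline.
dot_hits = re.findall(r'.', 'a\nb')

print(digits, spaces, non_word, dot_hits)
```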

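Quantifiers

Character	Meaning
*	Matches the preceding character 0 or more times, i.e. it is optional
+	Matches the preceding character 1 or more times, i.e. at least once
?	Matches the preceding character 1 or 0 times, i.e. once or not at all
{m}	Matches the preceding character exactly m times
{m,}	Matches the preceding character at least m times
{m,n}	Matches the preceding character from m to n times

{1,} == +
{0,} == *
{0,1} == ?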
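The quantifier table can be checked with a small sketch. It uses re.fullmatch (not mentioned above, available since Python 3.4) so that the whole sample string has to match; full_matches and samples are helper names invented for this example.

```python
import re

samples = ['', 'a', 'aa', 'aaa', 'aaaa']

def full_matches(pattern):
    """Return the samples that the pattern matches from start to end."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(full_matches(r'a*'))      # every sample, including ''
print(full_matches(r'a+'))      # one or more a
print(full_matches(r'a?'))      # '' and 'a'
print(full_matches(r'a{2}'))    # exactly two
print(full_matches(r'a{2,}'))   # two or more
print(full_matches(r'a{2,3}'))  # two or three
```

The three equivalences listed above give identical results here, e.g. full_matches(r'a{1,}') == full_matches(r'a+').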

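Boundaries

Character	Meaning
^	Matches the start of the string
$	Matches the end of the string
\b	Matches a word boundary
\B	Matches a position that is not a word boundary

String boundaries:

In[1]: re.match(r'.+w$', 'windows') # No match (returns None): w is not at the end of the string.
In[2]: result = re.match(r'.+s$', 'windows')
In[3]: result.group()
Out[3]: 'windows'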

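Word boundaries:
For a character to sit at a word boundary, it must either be followed by a non-word character such as a space (marking the end of the word) or be the last character of the string.

In[1]: result = re.match(r'^\w+w\b', 'windows') # No match (returns None): w is neither at the end of a word nor at the end of the string.
# w at the end of a word
In[2]: result = re.match(r'^\w+ow\b', 'window sssss')
In[3]: result.group()
Out[3]: 'window'
In[4]: result = re.match(r'^\w+ow\B', 'windows')
In[5]: result.group()
Out[5]: 'window'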
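The same idea can be seen with re.findall over a sentence containing several words. The sample text is invented for illustration.

```python
import re

text = 'window windows rewind'

# \b before "win": only words that START with "win" qualify.
starts_with_win = re.findall(r'\bwin\w*', text)

# \b after "ow": only words that END with "ow" qualify.
ends_with_ow = re.findall(r'\w*ow\b', text)

# \B after "ow": "ow" must be followed by another word character.
ow_inside_word = re.findall(r'ow\B', text)

print(starts_with_win)   # ['window', 'windows']
print(ends_with_ow)      # ['window']
print(ow_inside_word)    # ['ow']  (the one inside "windows")
```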

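Grouping

Character	Meaning
|	Matches either the expression on the left or the one on the right
(ab)	Treats the characters in parentheses as a group
\num	Refers back to the string matched by group number num
(?P<name>...)	Gives the group a name
(?P=name)	Refers back to the string matched by the group named name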

Groups:

In[1]: s = '<html><p>windows</p></html>'
# Treat the characters inside each pair of parentheses as a group
In[2]: result = re.match(r'<(.+)><(.+)>(.+)</.+></.+>', s)
# Show them with groups()
In[3]: result.groups()
Out[3]: ('html', 'p', 'windows')

Referencing groups:

In[1]: s = '<html><p>windows</html></p>'
# Even though the tags in s are not properly paired, this still matches, which is a problem in practice.
In[2]: result = re.match(r'<.+><.+>(.+)</.+></.+>', s)
In[3]: result.group()
Out[3]: '<html><p>windows</html></p>'

# Fix this with group references
In[1]: s = '<html><p>windows</html></p>'
# With group references there is no match, because the tags are not paired: \2 refers to p, but html sits at that position in s; \1 refers to html, but p sits at that position in s.
In[2]: result = re.match(r'<(.+)><(.+)>(.+)</\2></\1>', s)

# When the tags in s are paired, the match succeeds
In[1]: s = '<html><p>windows</p></html>'
In[2]: result = re.match(r'<(.+)><(.+)>(.+)</\2></\1>', s)
In[3]: result.group()
Out[3]: '<html><p>windows</p></html>'
In[4]: result.groups()
Out[4]: ('html', 'p', 'windows')
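Numbered references are not limited to tags. As a further illustration (an example of my own, not from the walkthrough above), \1 can be used to find a word that is immediately repeated:

```python
import re

text = 'this is is a test test ok'

# (\w+) captures a word; \1 requires the very same text again after one space.
# The \b anchors keep "is" inside "this" from counting as a word on its own.
repeated = re.findall(r'\b(\w+) \1\b', text)

print(repeated)   # ['is', 'test']
```

With a single capturing group in the pattern, re.findall returns the text of that group rather than the whole match.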

Naming groups:
This is similar to referencing groups by number. A group is given a name, and inside the pattern it is referred to by that name with (?P=name) instead of by number (\x).

In[1]: s = '<html><p>windows</p></html>'
In[2]: result = re.match(r'<(?P<name1>.+)><(?P<name2>.+)>(.+)</(?P=name2)></(?P=name1)>', s)
In[3]: result.group()
Out[3]: '<html><p>windows</p></html>'
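Named groups can also be read back by name after the match, via group('name') or groupdict(). A minimal sketch, in which the group names outer, inner and text are chosen for illustration and \w+ is used for the tag names instead of .+:

```python
import re

pattern = (r'<(?P<outer>\w+)><(?P<inner>\w+)>'
           r'(?P<text>.+)'
           r'</(?P=inner)></(?P=outer)>')

good = re.match(pattern, '<html><p>windows</p></html>')
bad = re.match(pattern, '<html><p>windows</html></p>')   # closing tags swapped

print(good.group('outer'))   # html
print(good.group('inner'))   # p
print(good.groupdict())      # {'outer': 'html', 'inner': 'p', 'text': 'windows'}
print(bad)                   # None
```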

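Greedy and non-greedy:
Regular expression matching in Python is greedy by default, i.e. a quantifier always tries to match as many characters as possible. Non-greedy matching is the opposite: it tries to match as few characters as possible.
For example, when matching a run of digits, some of the digits may go missing.

In[1]: s = 'this is a telephone number: 0311-12345678'
# We want to extract 0311-12345678
In[2]: result = re.match(r'^.+(\d+-\d+)$', s)
In[3]: result.group(1)
Out[3]: '1-12345678'

In the example above, matching is greedy by default, so ".+" grabs as many characters as it can. The first three digits "031" are therefore consumed by ".+", and since "\d+" needs only one character to succeed, it is left with the single digit "1". The pattern can be changed in either of the following ways:

Add a count to \d: re.match(r'^.+(\d{4}-\d+)$', s)
Switch to non-greedy mode: re.match(r'^.+?(\d+-\d+)$', s)
To make a quantifier non-greedy, put the non-greedy operator "?" right after "*", "+" or "?", which asks the regex to match as little as possible.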
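The three patterns discussed above can be compared directly on the same string:

```python
import re

s = 'this is a telephone number: 0311-12345678'

greedy = re.match(r'^.+(\d+-\d+)$', s).group(1)      # ".+" swallows "031"
bounded = re.match(r'^.+(\d{4}-\d+)$', s).group(1)   # exactly 4 digits before "-"
lazy = re.match(r'^.+?(\d+-\d+)$', s).group(1)       # ".+?" stops as early as it can

print(greedy)    # 1-12345678
print(bounded)   # 0311-12345678
print(lazy)      # 0311-12345678
```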

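Working out a pattern for mobile phone numbers:
1. Total length of 11 digits: pattern = r'\d\d\d\d\d\d\d\d\d\d\d'
(This still has many problems, e.g. the first digit may not be 1, or the string s may be too long.)
2. The first digit is 1: pattern = r'1\d\d\d\d\d\d\d\d\d\d'
3. The second digit is 3, 4, 5, 7 or 8: pattern = r'1[34578]\d\d\d\d\d\d\d\d\d'
4. The remaining nine digits: pattern = r'1[34578]\d{9}'
5. Start and end: pattern = r'^1[34578]\d{9}$'
Note: this is only a simplified walkthrough, not a production pattern for mobile numbers. No constraints are placed on the remaining digits.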
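The final pattern from the steps above can be wrapped in a small helper. PHONE_PATTERN and looks_like_mobile are names invented for this sketch, and the sample numbers are made up; as the note says, this is a simplified rule rather than a real-world validator.

```python
import re

# Final pattern from the walkthrough: 1, then one of 3/4/5/7/8, then nine digits.
PHONE_PATTERN = re.compile(r'^1[34578]\d{9}$')

def looks_like_mobile(number):
    """Return True if the string satisfies the simplified 11-digit rule."""
    return PHONE_PATTERN.match(number) is not None

print(looks_like_mobile('13812345678'))    # True
print(looks_like_mobile('12812345678'))    # False: second digit is 2
print(looks_like_mobile('1381234567'))     # False: only 10 digits
print(looks_like_mobile('138123456789'))   # False: 12 digits
```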

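Matching any number from 0 to 100
Analysis:
One digit: 0 1 2 …
Two digits: 23 81 39 …
Three digits: 100
When a number has two or more digits, its first digit cannot be 0, so 0 is a special case.
1. Match the first digit: pattern = r'[1-9]'
2. Match the second digit: pattern = r'[1-9]\d?', which now matches any one- or two-digit number except 0.
3. Special cases: pattern = r'[1-9]\d?|0|100'
4. Add boundaries: pattern = r'[1-9]\d?$|0$|100$'
Variant: pattern = r'[1-9]?\d?$|0$|100$'. Note that because both [1-9]? and \d? are optional, this variant also matches the empty string.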
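A short check shows why the boundary step matters. It compares the step 3 pattern with the step 4 pattern, using re.match as in the rest of the article; the variable names are chosen for this sketch.

```python
import re

unbounded = re.compile(r'[1-9]\d?|0|100')      # step 3: no boundaries
bounded = re.compile(r'[1-9]\d?$|0$|100$')     # step 4: each branch ends with $

# Without "$" the first branch wins and only part of the input is matched.
partial = unbounded.match('100').group()       # '10'
too_long = unbounded.match('1000').group()     # '10', although 1000 is out of range

# With "$" exactly the integers 0..100 are accepted.
accepted = [n for n in range(0, 1000) if bounded.match(str(n))]

print(partial, too_long)
print(accepted == list(range(0, 101)))         # True
print(bounded.match('007'))                    # None: leading zeros are rejected
```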
