Python Regular Expressions: Basics in Practice

阿新 • Published: 2019-02-18

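The workflow for using regular expressions in Python is roughly as follows.

First, re.compile() builds a pattern object, which can be reused as many times as needed.

Next, call the pattern object's match(string) method to match against string. On success it returns a match object; on failure it returns None.

Finally, call the match object's group() method to see what was matched.

Here is a quick example: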

>>> import re
>>> p=re.compile(r'imooc')  # create a pattern object; it can be reused many times
>>> p
<_sre.SRE_Pattern object at 0x7f121d1a2418>
>>> type(p)
<type '_sre.SRE_Pattern'>
>>> m=p.match('imoocabc')  # create a match object
>>> m
<_sre.SRE_Match object at 0x7f121d194238>
>>> type(m)
<type '_sre.SRE_Match'>
>>> m.group()  # the matched substring
'imooc'
>>> m.string  # the string that was searched
'imoocabc'
>>> m.span()  # start and end index of the matched substring in the original string
(0, 5)
>>> m.re  # the pattern instance
<_sre.SRE_Pattern object at 0x7f121d1a2418>
>>> 

>>> p=re.compile(r'imooc')
>>> m=p.match('IMOOcjfjd')  # no match, because matching is case-sensitive
>>> type(m)
<type 'NoneType'>
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> p=re.compile(r'imooc',re.I)  # re.I means ignore case
>>> p
<_sre.SRE_Pattern object at 0x7f121d1b71f8>
>>> type(p)
<type '_sre.SRE_Pattern'>
>>> m=p.match('imoocjdf')
>>> m.group()
'imooc'
>>> m=p.match('IMOOcjdf')
>>> m.group()
'IMOOc'
>>> 
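Because match() returns None on failure, calling group() blindly raises the AttributeError shown above. A minimal sketch of a safer wrapper follows, written for Python 3; the helper name first_match is made up for illustration.

```python
import re

pattern = re.compile(r'imooc', re.I)

def first_match(text):
    """Return the matched text, or None if the pattern does not match at the start."""
    m = pattern.match(text)
    # match() returns None on failure, so test it before calling group()
    return m.group() if m else None

print(first_match('IMOOcjdf'))  # IMOOc
print(first_match('hello'))     # None
```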

>>> p=re.compile(r'(imooc)',re.I)  # put imooc inside parentheses
>>> p
<_sre.SRE_Pattern object at 0x7f121f265f48>
>>> m=p.match('iMOOCidf')
>>> m
<_sre.SRE_Match object at 0x7f121d1d9198>
>>> m.group()
'iMOOC'
>>> m.groups()  # because imooc is in parentheses, m.groups() returns a tuple of the captured groups
('iMOOC',)
>>>

In the examples above we first called re.compile() to create a pattern object and then called that object's match() method. The two steps can be combined into one:

>>> m=re.match(r'imooc','imoocjdf')  # compiles the pattern and matches in one call; the pattern object is not kept for explicit reuse
>>> m
<_sre.SRE_Match object at 0x7f121f2c2ac0>
>>> type(m)
<type '_sre.SRE_Match'>
>>> m.group()
'imooc'
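When the same pattern is applied to many strings, compiling it once keeps the intent explicit. The re module does cache recently compiled patterns internally, so the gain is mostly readability. A small Python 3 sketch, with made-up sample inputs:

```python
import re

# Compile once, then reuse the same pattern object for every input
pattern = re.compile(r'imooc', re.I)

samples = ['imooc123', 'IMOOC', 'python']  # made-up inputs
results = [bool(pattern.match(s)) for s in samples]
print(results)  # [True, True, False]
```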

That covers basic usage. Now for the basic syntax of regular expressions:

【1】Matching a single character

1》
. matches any single character (except '\n')

>>> m=re.match(r'.','adg')
>>> m.group()
'a'
>>> m=re.match(r'.','123adg')
>>> m.group()
'1'
>>> m=re.match(r'.','@123adg')
>>> m.group()
'@'
>>> m=re.match(r'.','%@123adg')
>>> m.group()
'%'
>>> m=re.match(r'.','\n%@123adg')  # . cannot match '\n'
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> type(m)
<type 'NoneType'>
>>> m=re.match(r'{.}','{a}\n%@123adg')
>>> m.group()
'{a}'
>>> m=re.match(r'{.}','{8}\n%@123adg')
>>> m.group()
'{8}'
>>> m=re.match(r'{.}','{8ab}\n%@123adg')
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> type(m)
<type 'NoneType'>
>>> m=re.match(r'{...}','{8ab}\n%@123adg')
>>> m.group()
'{8ab}'

2》
[a-z] matches any one character from a to z; [0-9] matches any one digit from 0 to 9

>>> m=re.match(r'{[abc]}','{b}\n%@123adg')
>>> m.group()
'{b}'
>>> m=re.match(r'{[abc]}','{g}\n%@123adg')
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> type(m)
<type 'NoneType'>
>>> m=re.match(r'{[a-z]}','{g}\n%@123adg')
>>> m.group()
'{g}'
>>> m=re.match(r'{[a-zA-Z]}','{H}\n%@123adg')
>>> m.group()
'{H}'
>>> m=re.match(r'[[a-z]]','[d]g%@123adg')  # no match: use '\[' to match a literal '['
>>> m=re.match(r'([a-z])','(d)kdjkf')      # no match: use '\(' to match a literal '('
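The two failing patterns above work once the brackets and parentheses are escaped. The sketch below (Python 3) shows the escaped forms, plus re.escape() for escaping a literal string automatically.

```python
import re

# Escape [ ] and ( ) with a backslash to match them literally
m1 = re.match(r'\[[a-z]\]', '[d]g%@123adg')
m2 = re.match(r'\([a-z]\)', '(d)kdjkf')
print(m1.group())  # [d]
print(m2.group())  # (d)

# re.escape() escapes the special characters in a literal string
escaped = re.escape('[d]')
m3 = re.match(escaped, '[d]g')
print(m3.group())  # [d]
```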

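3》\w matches one word character, equivalent to [A-Za-z0-9_] when matching ASCII only: uppercase letters, lowercase letters, digits, and the underscore
\W matches one non-word character (the opposite of \w)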

>>> m=re.match(r'{[\w]}','{H}\n%@123adg')
>>> m.group()
'{H}'
>>> m=re.match(r'{[\w]}','{h}\n%@123adg')
>>> m.group()
'{h}'
>>> m=re.match(r'{[\w]}','{9}\n%@123adg')
>>> m.group()
'{9}'
>>> m=re.match(r'{[\w]}','{&}\n%@123adg')
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> type(m)
<type 'NoneType'>
>>> m=re.match(r'{[\W]}','{&}\n%@123adg')
>>> m.group()
'{&}'
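The transcripts in this article come from Python 2. In Python 3, \w applied to a str pattern also matches non-ASCII letters and digits unless the re.ASCII flag is given. A short sketch of the difference, using a made-up input string:

```python
import re

text = '中文abc_1'  # made-up input that starts with non-ASCII letters

# Python 3 default for str patterns: \w is Unicode-aware
unicode_match = re.match(r'\w+', text)
# With re.ASCII, \w is limited to [A-Za-z0-9_]
ascii_match = re.match(r'\w+', text, re.ASCII)

print(unicode_match.group())  # 中文abc_1
print(ascii_match)            # None
```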

4》\d matches one digit
\D matches one non-digit

>>> m=re.match(r'\d','9agd')
>>> m.group()
'9'
>>> m=re.match(r'\d','0agd')
>>> m.group()
'0'
>>> m=re.match(r'\d','ghh0agd')
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> m=re.match(r'\D','ghh0agd')
>>> m.group()
'g'
>>> m=re.match(r'\D','Aghh0agd')
>>> m.group()
'A'
>>> m=re.match(r'\D','+Aghh0agd')
>>> m.group()
'+'
>>> m=re.match(r'\D','*+Aghh0agd')
>>> m.group()
'*'

5》\s matches one whitespace character (space, tab, '\n', and so on)
\S matches one non-whitespace character

>>> m=re.match(r'\s',' *+Aghh0agd')
>>> m.group()
' '
>>> m=re.match(r'\S','*+Aghh0agd')
>>> m.group()
'*'
>>> m=re.match(r'\S','$Aghh0agd')
>>> m.group()
'$'
>>> m=re.match(r'\S','h$Aghh0agd')
>>> m.group()
'h'

【2】Matching multiple characters

1》* matches the preceding character zero or more times

>>> m=re.match(r'[A-Z][a-z]*','H')
>>> m.group()
'H'
>>> m=re.match(r'[A-Z][a-z]*','Hjdkjfksdjkff')
>>> m.group()
'Hjdkjfksdjkff'
>>> m=re.match(r'[A-Z][a-z]*','Hjdkjf98ksdjkff')
>>> m.group()
'Hjdkjf'
>>> m=re.match(r'[_a-zA-Z][\w]*','_this')
>>> m.group()
'_this'
>>> m=re.match(r'[_a-zA-Z][\w]*','_')
>>> m.group()
'_'
>>> m=re.match(r'[_a-zA-Z][\w]*','3abc')
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

2》+ matches the preceding character one or more times

>>> m=re.match(r'A[0-9]+','A')
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> m=re.match(r'A[0-9]+','A8')
>>> m.group()
'A8'
>>> m=re.match(r'A[0-9]+','A88787gg')
>>> m.group()
'A88787'

3》? matches the preceding character zero times or once

>>> m=re.match(r'[1-9]?[0-9]','100')
>>> m.group()
'10'
>>> m=re.match(r'[1-9]?[0-9]','990')
>>> m.group()
'99'
>>> m=re.match(r'[1-9]?[0-9]','0990')
>>> m.group()
'0'

4》{m} / {m,n} matches the preceding character exactly m times, or between m and n times

>>> m=re.match(r'[a-z0-9]{6}','123abckkk')
>>> m.group()
'123abc'
>>> m=re.match(r'[a-z0-9]{6}','123ab')
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> m=re.match(r'[a-z0-9]{6}@163.com','[email protected]')
>>> m.group()
'[email protected]'
>>> m=re.match(r'[a-z0-9]{6,10}@163.com','[email protected]')
>>> m.group()
'[email protected]'
>>> m=re.match(r'[a-z0-9]{6,10}@163.com','[email protected]')
>>> m.group()
'[email protected]'
>>> m=re.match(r'[a-z0-9]{6,10}@163.com','[email protected]')
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> m=re.match(r'[a-z0-9]{6,10}@163.com','[email protected]')
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
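One detail worth noting in the patterns above: the dot in @163.com is unescaped, so it matches any character, not only a literal period. A minimal Python 3 sketch of the difference follows; the test strings are made up and are not real addresses.

```python
import re

loose = re.compile(r'[a-z0-9]{6,10}@163.com')    # . matches any character
strict = re.compile(r'[a-z0-9]{6,10}@163\.com')  # \. matches only a period

text = 'abc123@163xcom'  # made-up string with 'x' where the period should be
loose_hit = bool(loose.match(text))
strict_hit = bool(strict.match(text))
strict_ok = bool(strict.match('abc123@163.com'))

print(loose_hit)   # True
print(strict_hit)  # False
print(strict_ok)   # True
```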

5》*? +? ?? make the quantifier non-greedy (match as few characters as possible)

>>> m=re.match(r'a[0-9]*','a90987abc')
>>> m.group()
'a90987'
>>> m=re.match(r'a[0-9]*?','a90987abc')
>>> m.group()
'a'
>>> m=re.match(r'a[0-9]+','a90987abc')
>>> m.group()
'a90987'
>>> m=re.match(r'a[0-9]+?','a90987abc')
>>> m.group()
'a9'
>>> m=re.match(r'a[0-9]?','a90987abc')
>>> m.group()
'a9'
>>> m=re.match(r'a[0-9]??','a90987abc')
>>> m.group()
'a'
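The difference between greedy and non-greedy matching is easiest to see on text with repeated delimiters. A small Python 3 sketch with a made-up snippet of markup:

```python
import re

text = '<b>bold</b> and <i>italic</i>'  # made-up sample

# Greedy: .* runs to the last '>' in the string
greedy = re.match(r'<.*>', text).group()
# Non-greedy: .*? stops at the first '>'
lazy = re.match(r'<.*?>', text).group()

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```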

【3】Matching boundaries with regular expressions

1》^ matches the start of the string (match() always starts at the beginning anyway)
2》$ matches the end of the string

>>> m=re.match(r'[\w]{6,10}@163.com','[email protected]')  # match() starts at the beginning by default
>>> m.group()
'[email protected]'
>>> m=re.match(r'^[\w]{6,10}@163.com','[email protected]')  # explicitly anchor at the start
>>> m.group()
'[email protected]'
>>> m=re.match(r'^[\w]{6,10}@163.com$','[email protected]')  # the string must end with @163.com
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> m=re.match(r'^[\w]{6,10}@163.com$','[email protected]')  # the string must end with @163.com
>>> m.group()
'[email protected]'

3》\A \Z require the match to sit at the very start / very end of the string

>>> m=re.match(r'\Aimooc[0-9]+','imooc123')
>>> m.group()
'imooc123'
>>> m=re.match(r'\Aimooc[0-9]+','iimooc123')
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> m=re.match(r'[0-9]+imooc\Z','123imooc')
>>> m.group()
'123imooc'
>>> m=re.match(r'[0-9]+imooc\Z','123imoocbook')
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
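$ and \Z are not quite interchangeable: $ also matches just before a trailing newline, and with re.MULTILINE both ^ and $ match at every line boundary. A brief Python 3 sketch with made-up strings:

```python
import re

# $ matches before a trailing newline; \Z only at the very end of the string
dollar_ok = bool(re.match(r'imooc$', 'imooc\n'))
z_ok = bool(re.match(r'imooc\Z', 'imooc\n'))
print(dollar_ok)  # True
print(z_ok)       # False

# With re.MULTILINE, ^ and $ match at the start and end of each line
lines = re.findall(r'^\w+$', 'one\ntwo\nthree', re.MULTILINE)
print(lines)  # ['one', 'two', 'three']
```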

【4】Other useful features

1》| matches either the expression on its left or the one on its right

>>> m=re.match(r'abc|d','abckkk')
>>> m.group()
'abc'
>>> m=re.match(r'abc|d','dabckkk')
>>> m.group()
'd'
>>> m=re.match(r'[1-9]?[0-9]|100','0')
>>> m.group()
'0'
>>> m=re.match(r'[1-9]?[0-9]|100','59')
>>> m.group()
'59'
>>> m=re.match(r'[1-9]?[0-9]|100','100')
>>> m.group()
'10'
>>> m=re.match(r'[1-9]?[0-9]$|100','100')
>>> m.group()
'100'
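Alternatives are tried from left to right, which is why '100' first came back as '10'. Two other ways to get the intended result are to list the longer alternative first, or to use re.fullmatch() (Python 3.4+) so the whole string must match. The helper name is_score below is made up for illustration.

```python
import re

# Put the longer literal first so it is tried before the shorter alternative
first = re.match(r'100|[1-9]?[0-9]', '100').group()
print(first)  # 100

def is_score(text):
    """Return True if text is an integer from 0 to 100 with no extra characters."""
    return re.fullmatch(r'100|[1-9]?[0-9]', text) is not None

print(is_score('100'), is_score('59'), is_score('101'))  # True True False
```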

2》(163|126|qq) the expression in parentheses forms a group; any one of its alternatives may match

>>> m=re.match(r'[\w]{4,6}@(163|126|qq).com','[email protected]')
>>> m.group()
'[email protected]'
>>> m=re.match(r'[\w]{4,6}@(163|126|qq).com','[email protected]')
>>> m.group()
'[email protected]'
>>> m=re.match(r'[\w]{4,6}@(163|126|qq).com','[email protected]')
>>> m.group()
'[email protected]'
>>> m=re.match(r'[\w]{4,6}@(163|126|qq).com','[email protected]')
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
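Besides choosing between alternatives, the parentheses also capture whichever alternative matched, which can be read back with group(1). A minimal Python 3 sketch; the address is hypothetical and used only for illustration.

```python
import re

address = 'user1@qq.com'  # hypothetical address for illustration
m = re.match(r'[\w]{4,6}@(163|126|qq)\.com', address)

provider = m.group(1)  # text captured by the first pair of parentheses
print(provider)        # qq
print(m.groups())      # ('qq',)
```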

3》\number refers back to the string matched by the group with that number

>>> m=re.match(r'<([\w]+>)\1\1','<book>book>book>')
>>> m.group()
'<book>book>book>'
>>> m.groups()
('book>',)
>>> m=re.match(r'<([\w]+>)[\w]+</\1','<book>python</book>')
>>> m.groups()
('book>',)
>>> m.group()
'<book>python</book>'

4》(?P<name>...) gives a group a name; (?P=name) refers back to the string matched by the group named name

>>> m=re.match(r'<(?P<mark>[\w]+>)[\w]+</\1','<book>python</book>')  # back-reference by number
>>> m.group()
'<book>python</book>'
>>> m=re.match(r'<(?P<mark>[\w]+>)[\w]+</(?P=mark)','<book>python</book>')  # back-reference by name
>>> m.group()
'<book>python</book>'
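Named groups can also be read back by name, which tends to be clearer than numeric indexes. The sketch below (Python 3) uses a slightly different pattern from the one above, with the group names tag and body chosen for illustration.

```python
import re

pattern = re.compile(r'<(?P<tag>\w+)>(?P<body>\w+)</(?P=tag)>')

m = pattern.match('<book>python</book>')
print(m.group('tag'))   # book
print(m.group('body'))  # python
print(m.groupdict())    # {'tag': 'book', 'body': 'python'}

# The closing tag must repeat the text captured by 'tag'
mismatch = pattern.match('<book>python</page>')
print(mismatch)  # None
```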

With the basic syntax covered, let's take a brief look at Python's regular expression module: the re module.

To use the module, import it first: import re

1》search(pattern,string,flags=0) scans a string for a match and returns the first one

>>> s='imooc hours=1000'
>>> m=re.search(r'\d+',s)
>>> m
<_sre.SRE_Match object at 0x7fb4ae3da4a8>
>>> m.group()
'1000'
>>> s='imooc c++=1230,python=1200'
>>> m=re.search(r'\d+',s)
>>> m
<_sre.SRE_Match object at 0x7fb4ae3da4a8>
>>> m.group()
'1230'
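The key difference from match() is that search() scans the whole string, whereas match() only tries at the start. A short Python 3 sketch using the same string as above:

```python
import re

s = 'imooc hours=1000'

at_start = re.match(r'\d+', s)   # only tries position 0, where 'i' is not a digit
anywhere = re.search(r'\d+', s)  # scans forward until a match is found

print(at_start)          # None
print(anywhere.group())  # 1000
print(anywhere.span())   # (12, 16)
```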

2》findall(pattern,string,flags=0) finds all matches and returns them as a list

>>> s='c++=100,java=200,python=300'
>>> m=re.search(r'\d+',s)  # returns the first match
>>> m
<_sre.SRE_Match object at 0x7fb4ae3da578>
>>> m.group()
'100'
>>> info=re.findall(r'\d+',s)  # returns a list of all matched parts
>>> info
['100', '200', '300']
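When the pattern contains more than one group, findall() returns a list of tuples, one item per group. A minimal Python 3 sketch that turns the same string into a dictionary; the variable names are made up for illustration.

```python
import re

s = 'c++=100,java=200,python=300'

# Two groups in the pattern, so each result is a (name, value) tuple
pairs = re.findall(r'([\w+]+)=(\d+)', s)
print(pairs)  # [('c++', '100'), ('java', '200'), ('python', '300')]

scores = {name: int(value) for name, value in pairs}
print(scores['python'])  # 300
```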

3》sub(pattern,repl,string,count=0,flags=0) substitute
Replaces the parts of the string that match the regular expression with another value

3.1>repl is a string

>>> s='imooc c++=1000'
>>> info=re.sub(r'\d+','1001',s)
>>> info
'imooc c++=1001'
>>> s='imooc c++=1000,python=2000'
>>> info=re.sub(r'\d+','1001',s)
>>> info
'imooc c++=1001,python=1001'
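Two further options apply when repl is a string: count limits how many replacements are made, and \g<1> in the replacement refers back to a captured group. A short Python 3 sketch:

```python
import re

s = 'imooc c++=1000,python=2000'

# count=1 replaces only the first match
first_only = re.sub(r'\d+', '1001', s, count=1)
print(first_only)  # imooc c++=1001,python=2000

# \g<1> inserts the text captured by group 1 into the replacement
wrapped = re.sub(r'(\d+)', r'<\g<1>>', s)
print(wrapped)  # imooc c++=<1000>,python=<2000>
```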

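3.2>repl is a function object
sub(pattern,repl,string,count=0,flags=0) If repl is a function object, the function receives a match object as its argument:
first, the regular expression pattern is compiled into a pattern object;
then that pattern object is used to search string, and each successful match yields a match object;
finally, the match object is passed to repl, and the string it returns is used as the replacement.
【The following code may help in understanding the description above: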
>>> p=re.compile(r'\d+')
>>> p
<_sre.SRE_Pattern object at 0x7fb4ae486a48>
>>> m=p.match('123sdjf')
>>> m
<_sre.SRE_Match object at 0x7fb4ae3da4a8>
>>> m.group()
'123'
】

>>> def add1(match):
...     s=match.group()
...     value=int(s)+1
...     return str(value)
... 
>>> s='imooc c++=1000,java=2000,python=3000'
>>> re.sub(r'\d+',add1,s)
'imooc c++=1001,java=2001,python=3001'

4》split(pattern,string,maxsplit=0,flags=0) splits the string at each match and returns the pieces as a list

>>> s='imooc:c c++ python java'
>>> re.split(r':| ',s)  # use ':' or ' ' as the separator (several separators can be given)
['imooc', 'c', 'c++', 'python', 'java']
>>> s='imooc:c c++ python java,c#'
>>> re.split(r':| |,',s)  # use ':' or ' ' or ',' as the separator
['imooc', 'c', 'c++', 'python', 'java', 'c#']
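For single-character separators, a character class is an equivalent and slightly more compact way to write the alternation. The Python 3 sketch below also shows maxsplit, and how a capturing group keeps the separators in the result.

```python
import re

s = 'imooc:c c++ python java,c#'

parts = re.split(r'[: ,]', s)  # same separators as r':| |,'
print(parts)  # ['imooc', 'c', 'c++', 'python', 'java', 'c#']

head_tail = re.split(r'[: ,]', s, maxsplit=1)  # split at most once
print(head_tail)  # ['imooc', 'c c++ python java,c#']

with_sep = re.split(r'(:)', 'imooc:c')  # a capturing group keeps the separator
print(with_sep)  # ['imooc', ':', 'c']
```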

Finally, a simple example that downloads images from the web:

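# Python 2 script (urllib2 and the print statement)
import urllib2,re
req=urllib2.urlopen('http://www.imooc.com/course/list')
buf=req.read()
# non-greedy, and stop at quotes/whitespace so one match cannot span several URLs
urllist=re.findall(r'http:[^\s"\'<>]+?\.jpg',buf)
print urllist
for i,url in enumerate(urllist):
    request=urllib2.urlopen(url)
    data=request.read()  # avoid shadowing the built-in name buffer
    # open in binary mode ('wb'); text mode corrupts image data on Windows
    f=open('C:\\Users\\91135\\Desktop\\image\\'+str(i)+'.jpg','wb')
    f.write(data)
    f.close()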
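The URL-extraction step can be tried without any network access. The sketch below is written for Python 3; the function name extract_jpg_urls and the example.com markup are made up for illustration, and the pattern assumes image URLs contain no quotes, spaces, or angle brackets.

```python
import re

def extract_jpg_urls(html):
    """Return the .jpg URLs found in html, in order, without duplicates."""
    # [^\s"'<>]+? stops at quotes and whitespace, so one match cannot span two URLs
    found = re.findall(r'https?://[^\s"\'<>]+?\.jpg', html)
    return list(dict.fromkeys(found))  # drop repeats, keep first-seen order

sample = ('<img src="http://example.com/a.jpg">'
          '<img src="http://example.com/b.jpg">'
          '<img src="http://example.com/a.jpg">')
urls = extract_jpg_urls(sample)
print(urls)  # ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```

In a real script each URL would then be fetched (for example with urllib.request.urlopen in Python 3) and written to a file opened in 'wb' mode.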

That wraps up a basic introduction to regular expressions in Python.
