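Python Regular Expression Basics
Regular expressions are not unique to Python; in Python they are implemented by the re module.
re.match
re.match tries to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.
re.match(pattern, string, flags=0)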
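The start-of-string behaviour described above can be sketched as follows. The two sample strings are assumptions made up for illustration.

```python
import re

# re.match only succeeds when the pattern matches at position 0.
at_start = re.match(r'Hello', 'Hello World')
not_at_start = re.match(r'World', 'Hello World')

print(at_start.group())
print(not_at_start)

# Guard against None before calling methods on the result.
if not_at_start is None:
    print('no match at the start of the string')
```

Checking for None first avoids an AttributeError when calling group() or span() on a failed match.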
A conventional match
import re
content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)
print(result)
print(len(content))
print(result.span())
print(result.group())
<_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
41
(0, 41)
Hello 123 4567 World_This is a Regex Demo
Generic matching
import re
content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match('^Hello.*Demo$', content)
print(result)
<_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
Capturing a target
content = 'Hello 1234456 World, Nice_to meet u'
result = re.match(r'^Hello\s(\d+)\sWorld', content)
print(result)
print(result.group(1))
<_sre.SRE_Match object; span=(0, 19), match='Hello 1234456 World'>
1234456
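When a pattern has more than one pair of parentheses, each group can be read by index. The sketch below adds a second group, (\w+), to the pattern above; that extra group is an assumption for illustration.

```python
import re

content = 'Hello 1234456 World, Nice_to meet u'
result = re.match(r'^Hello\s(\d+)\s(\w+)', content)

whole = result.group(0)    # the entire match
number = result.group(1)   # first parenthesised group
word = result.group(2)     # second parenthesised group
both = result.groups()     # all groups as a tuple

print(whole)
print(number, word)
print(both)
```

group(0), or group() with no argument, is always the full matched text; numbered groups start at 1.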
Greedy matching
content = 'Hello 1234456 World, Nice to meet u_This is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))
<_sre.SRE_Match object; span=(0, 56), match='Hello 1234456 World, Nice to meet u_This is a Reg>
6 # the greedy .* consumes as many characters as it can, leaving only the last digit for (\d+)
Non-greedy matching
content = 'Hello 1234456 World, Nice to meet u_This is a Regex Demo'
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))
<_sre.SRE_Match object; span=(0, 56), match='Hello 1234456 World, Nice to meet u_This is a Reg>
1234456 # .*? matches as few characters as possible
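Running the greedy and non-greedy patterns side by side on the same string makes the difference easier to see. This is a minimal sketch that reuses the content string from the examples above.

```python
import re

content = 'Hello 1234456 World, Nice to meet u_This is a Regex Demo'

# Greedy: .* grabs as much as possible, then backtracks just far
# enough to leave a single digit for (\d+).
greedy = re.match(r'^He.*(\d+).*Demo$', content)

# Non-greedy: .*? stops as early as possible, so (\d+) gets the
# whole run of digits.
lazy = re.match(r'^He.*?(\d+).*Demo$', content)

print(greedy.group(1))
print(lazy.group(1))
```

Both patterns match the full string; only the captured group differs.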
Match modes
By default, . does not match a newline.
content = '''Hello 1234456 World, Nice to meet u_This
is A Regex Demo
'''
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
None
Pass re.S as the third argument (flags) so that . also matches newlines:
result = re.match(r'^He.*?(\d+).*Demo$', content, re.S)
print(result)
print(result.group(1))
<_sre.SRE_Match object; span=(0, 57), match='Hello 1234456 World, Nice to meet u_This \nis A R>
1234456
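re.S is the short name for re.DOTALL, and the same mode can also be switched on inside the pattern with the inline flag (?s). The sketch below compares the three spellings on a shortened two-line string, which is an assumption made up for illustration.

```python
import re

content = 'Hello 1234456 World\nThis is a Regex Demo'
pattern = r'^He.*?(\d+).*Demo$'

without_flag = re.match(pattern, content)            # . stops at the newline
with_flag = re.match(pattern, content, re.DOTALL)    # same as re.S
inline_flag = re.match(r'(?s)' + pattern, content)   # flag written inside the pattern

print(without_flag)
print(with_flag.group(1))
print(inline_flag.group(1))
```

The inline form is convenient when the pattern is stored as a plain string and the flags argument cannot be passed.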
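Escaping
An unescaped $ is the end-of-string anchor and . matches any character, so a pattern containing a literal price fails to match:
content = 'price is $5.00'
result = re.match('price is $5.00', content)
print(result)
None
After escaping $ and . with backslashes:
result = re.match(r'price is \$5\.00', content)
print(result)
<_sre.SRE_Match object; span=(0, 14), match='price is $5.00'>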
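When the text to match is fully literal, the standard library's re.escape can add the backslashes automatically. This is a minimal sketch using the same price string; it is an alternative to hand-written escapes, not something the examples above rely on.

```python
import re

content = 'price is $5.00'

# Unescaped, $ is an anchor, so the literal text does not match.
unescaped = re.match('price is $5.00', content)

# re.escape backslash-escapes every character that is special in a regex.
escaped_pattern = re.escape('price is $5.00')
escaped = re.match(escaped_pattern, content)

print(unescaped)
print(escaped.group())
```

re.escape is mainly useful when the literal text comes from user input or another variable rather than being typed into the pattern.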
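Summary: prefer generic matching where possible, use parentheses to capture the target, prefer non-greedy mode, and pass re.S when the text contains newlines.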
re.search
re.search scans the whole string and returns the first successful match.
import re
content = 'Extra strings Hello 1234556 World_This is a Regex Demo Extra strings'
result = re.match(r'Hello.*?(\d+).*?Demo', content)
print(result)
None
re.match finds nothing because the string does not start with Hello.
result = re.search(r'Hello.*?(\d+).*?Demo', content)
print(result)
print(result.group(1))
<_sre.SRE_Match object; span=(14, 54), match='Hello 1234556 World_This is a Regex Demo'>
1234556
Summary: for convenience, prefer search over match.
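The difference between the two functions comes down to where the match is allowed to begin. A minimal sketch, reusing the content string above; the anchored variable is an illustrative name.

```python
import re

content = 'Extra strings Hello 1234556 World_This is a Regex Demo Extra strings'
pattern = r'Hello.*?(\d+).*?Demo'

matched = re.match(pattern, content)          # must match at position 0
found = re.search(pattern, content)           # may match anywhere
anchored = re.search('^' + pattern, content)  # ^ pins search to the start again

print(matched)
print(found.span(), found.group(1))
print(anchored)
```

Adding ^ to a pattern passed to re.search gives the same start-of-string restriction as re.match.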
Matching practice
import re
html = '''<div id="songs-list">
<h2 class="title">經典老歌</h2>
<p class="introduction">經典老歌列表</p>
<ul id="list" class="list-group">
<li data-view="2">一路上有你</li>
<li data-view="7">
<a href="2.mp3" singer="任賢齊">滄海一聲笑</a>
</li>
<li data-view="4" class="active">
<a href="3.mp3" singer="齊秦">往事隨風</a>
</li>
<li data-view="6"><a href="4.mp3" singer="beyond">光輝歲月</a></li>
<li data-view="5"><a href="5.mp3" singer="陳慧琳">記事本</a></li>
<li data-view="5">
<a href="6.mp3" singer="鄧麗君"><i class="fa fa-user"></i>但願人長久</a>
</li>
</ul>
</div>
'''
result = re.search('<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))
齊秦 往事隨風
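result = re.search('<li.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))
任賢齊 滄海一聲笑
By default, search returns the first match.
result = re.search('<li.*?singer="(.*?)">(.*?)</a>', html)
if result:
    print(result.group(1), result.group(2))
beyond 光輝歲月
This is the result without re.S: . no longer matches newlines, so only an <li> written on a single line can match.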
re.findall
re.findall searches the string and returns all matching substrings as a list.
results = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
print(results)
[('2.mp3', '任賢齊', '滄海一聲笑'), ('3.mp3', '齊秦', '往事隨風'), ('4.mp3', 'beyond', '光輝歲月'), ('5.mp3', '陳慧琳', '記事本'), ('6.mp3', '鄧麗君', '<i class="fa fa-user"></i>但願人長久')]
for result in results:
    print(result[0], result[1], result[2])
2.mp3 任賢齊 滄海一聲笑
3.mp3 齊秦 往事隨風
4.mp3 beyond 光輝歲月
5.mp3 陳慧琳 記事本
6.mp3 鄧麗君 <i class="fa fa-user"></i>但願人長久
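results = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
print(results)
[('', '一路上有你', ''), ('<a href="2.mp3" singer="任賢齊">', '滄海一聲笑', '</a>'), ('<a href="3.mp3" singer="齊秦">', '往事隨風', '</a>'), ('<a href="4.mp3" singer="beyond">', '光輝歲月', '</a>'), ('<a href="5.mp3" singer="陳慧琳">', '記事本', '</a>'), ('<a href="6.mp3" singer="鄧麗君"><i class="fa fa-user"></i>', '但願人長久', '</a>')]
for result in results:
    print(result[1])
一路上有你
滄海一聲笑
往事隨風
光輝歲月
記事本
但願人長久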
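The shape of the list that re.findall returns depends on how many capturing groups the pattern has, which is why the examples above print tuples. A minimal sketch, using a made-up sample string:

```python
import re

text = 'a=1, b=22, c=333'

# No groups: a list of whole matches.
no_groups = re.findall(r'\w=\d+', text)

# One group: a list of that group's text.
one_group = re.findall(r'\w=(\d+)', text)

# Several groups: a list of tuples, one item per group.
two_groups = re.findall(r'(\w)=(\d+)', text)

print(no_groups)
print(one_group)
print(two_groups)
```

A group that does not take part in a match shows up as an empty string in the tuple.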
re.sub
re.sub replaces every matching substring and returns the resulting string.
content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content = re.sub(r'\d+', '', content)
print(content)
Extra strings Hello  World_This is a Regex Demo Extra strings
content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content = re.sub(r'\d+', 'Replacement', content)
print(content)
Extra strings Hello Replacement World_This is a Regex Demo Extra strings
In the replacement string, \1 refers to the text captured by the first group.
content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content = re.sub(r'(\d+)', r'\1 8910', content)
print(content)
Extra strings Hello 1234567 8910 World_This is a Regex Demo Extra strings
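A backreference followed directly by a digit is ambiguous: \10 would be read as group 10. The \g<1> form avoids this. The sketch below uses a shortened sample string, which is an assumption for illustration.

```python
import re

content = 'Hello 1234567 World'

# \1 followed by a space is unambiguous.
spaced = re.sub(r'(\d+)', r'\1 8910', content)

# \g<1> lets a digit follow the group reference directly.
appended = re.sub(r'(\d+)', r'\g<1>0', content)

print(spaced)
print(appended)
```

Writing the replacement as a raw string keeps Python from interpreting the backslash before re.sub sees it.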
html = re.sub('<a.*?>|</a>', '', html)
print(html)
<div id="songs-list">
<h2 class="title">經典老歌</h2>
<p class="introduction">經典老歌列表</p>
<ul id="list" class="list-group">
<li data-view="2">一路上有你</li>
<li data-view="7">
滄海一聲笑
</li>
<li data-view="4" class="active">
往事隨風
</li>
<li data-view="6">光輝歲月</li>
<li data-view="5">記事本</li>
<li data-view="5">
<i class="fa fa-user"></i>但願人長久
</li>
</ul>
</div>
results = re.findall('<li.*?>(.*?)</li>', html, re.S)
print(results)
['一路上有你', '\n 滄海一聲笑\n ', '\n 往事隨風\n ', '光輝歲月', '記事本', '\n <i class="fa fa-user"></i>但願人長久\n ']
for result in results:
    print(result.strip())
一路上有你
滄海一聲笑
往事隨風
光輝歲月
記事本
<i class="fa fa-user"></i>但願人長久
re.compile
re.compile compiles a regex string into a pattern object so that the pattern can be reused.
content = '''Hello 1234567 World_This
is a Regex Demo'''
pattern = re.compile('Hello.*Demo', re.S)
result = re.match(pattern, content)
print(result)
<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This\nis a Regex Demo'>
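A compiled pattern object also has its own match, search, findall and sub methods, so it can be reused without going through the module-level functions. A minimal sketch with the same content string; the digits pattern is an assumption added for illustration.

```python
import re

content = '''Hello 1234567 World_This
is a Regex Demo'''

# Flags are given once, at compile time.
pattern = re.compile(r'Hello.*Demo', re.S)
result = pattern.match(content)

# The same compiled object can be reused for different operations.
digits = re.compile(r'\d+')
numbers = digits.findall(content)
stripped = digits.sub('', content)

print(result.span())
print(numbers)
```

Calling pattern.match(content) is equivalent to re.match(pattern, content) in the example above.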
Practical exercise
(Warning: this can freeze the machine.)
Fetching book information from Douban
import requests
import re
content = requests.get('https://book.douban.com/').text
pattern = re.compile('<li.*?cover.*?href="(.*?)".*?alt="(.*?)".*?more-meta.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>', re.S)
results = re.findall(pattern, content)
for result in results:
    url, name, author, date = result
    # Remove all whitespace from the author and date fields
    author = re.sub(r'\s', '', author)
    date = re.sub(r'\s', '', date)
    print(url, name, author, date)
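Because the live page can change or be slow to fetch, the same pattern can be tried offline first. The sketch below runs it against a small hypothetical snippet: the markup, URLs, titles and author names are all invented for illustration, and only the class names mirror what the pattern looks for.

```python
import re

# Hypothetical markup, invented for illustration; the live page may differ.
sample = '''
<ul>
  <li>
    <div class="cover">
      <a href="https://example.com/book/1"><img src="1.jpg" alt="Book One"></a>
    </div>
    <div class="more-meta">
      <span class="author">
        Alice
      </span>
      <span class="year">
        2017-1
      </span>
    </div>
  </li>
  <li>
    <div class="cover">
      <a href="https://example.com/book/2"><img src="2.jpg" alt="Book Two"></a>
    </div>
    <div class="more-meta">
      <span class="author">
        Bob
      </span>
      <span class="year">
        2018-5
      </span>
    </div>
  </li>
</ul>
'''

pattern = re.compile(
    '<li.*?cover.*?href="(.*?)".*?alt="(.*?)".*?more-meta'
    '.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>',
    re.S,
)

books = []
for url, name, author, date in pattern.findall(sample):
    # Remove all whitespace, as in the exercise above.
    author = re.sub(r'\s', '', author)
    date = re.sub(r'\s', '', date)
    books.append((url, name, author, date))

print(books)
```

Note that removing every whitespace character also removes spaces inside a multi-word author name; result.strip() would keep them.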