Python Regular Expression Basics

Regular expressions are not unique to Python. In Python they are implemented by the re module.

re.match

re.match tries to match a pattern starting at the beginning of the string. If the pattern does not match at the start position, match() returns None.
re.match(pattern, string, flags=0)
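
As a minimal sketch of the anchoring behaviour described above (the sample string and variable names are illustrative assumptions, not part of the original article), re.match succeeds only when the pattern matches at position 0:

```python
import re

# Illustrative sample string (an assumption, not from the article).
text = 'Hello World'

at_start = re.match(r'Hello', text)   # pattern matches at index 0
not_start = re.match(r'World', text)  # 'World' is not at index 0, so None

print(at_start is not None)
print(not_start)
```

Checking the result against None before calling group() avoids an AttributeError when nothing matched.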

A conventional match

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
# \s matches whitespace, \d a digit, \w{10} exactly ten word characters
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)
print(result)
print(len(content))
print(result.span())
print(result.group())

<_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
41
(0, 41)
Hello 123 4567 World_This is a Regex Demo

Generic matching

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
# .* stands in for everything between Hello and Demo
result = re.match(r'^Hello.*Demo$', content)
print(result)

<_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>

Capturing the match target

content = 'Hello 1234456 World, Nice_to meet u'
# The parentheses capture the digits as group 1
result = re.match(r'^Hello\s(\d+)\sWorld', content)
print(result)
print(result.group(1))

<_sre.SRE_Match object; span=(0, 19), match='Hello 1234456 World'>
1234456
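
To illustrate how captured groups are read back, here is a small sketch (the sample string and pattern are made up for illustration). group(0) is the whole match, group(n) is the n-th parenthesised group, and groups() returns all groups as a tuple:

```python
import re

# Hypothetical sample data, not from the article.
m = re.match(r'^(\w+)\s(\d+)', 'item 42 in stock')

whole = m.group(0)    # the entire matched text
name = m.group(1)     # first parenthesised group
number = m.group(2)   # second parenthesised group
both = m.groups()     # all groups as a tuple

print(whole, name, number, both)
```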

Greedy matching

content = 'Hello 1234456 World, Nice to meet u_This is a Regex Demo'

result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))

<_sre.SRE_Match object; span=(0, 56), match='Hello 1234456 World, Nice to meet u_This is a Reg>
6 # the greedy .* consumes as many characters as it can, leaving only the last digit for \d+

Non-greedy matching

content = 'Hello 1234456 World, Nice to meet u_This is a Regex Demo'

result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))

<_sre.SRE_Match object; span=(0, 56), match='Hello 1234456 World, Nice to meet u_This is a Reg>
1234456 # .*? matches as few characters as possible
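
The difference between the two quantifiers can also be seen on a shorter string. The following sketch uses a made-up input (not from the article) in which a closing tag appears twice, so the greedy and non-greedy forms stop at different places:

```python
import re

# Hypothetical sample string with two tagged items.
text = '<b>one</b><b>two</b>'

greedy = re.search(r'<b>(.*)</b>', text).group(1)   # runs to the last </b>
lazy = re.search(r'<b>(.*?)</b>', text).group(1)    # stops at the first </b>

print(greedy)
print(lazy)
```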

Match modes

By default, . does not match a newline character.

# The first line of the string ends with a space, followed by a newline.
content = '''Hello 1234456 World, Nice to meet u_This 
is A Regex Demo
'''
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)

None

Pass re.S as the third argument so that . also matches newlines:

result = re.match(r'^He.*?(\d+).*Demo$', content, re.S)
print(result)
print(result.group(1))

<_sre.SRE_Match object; span=(0, 57), match='Hello 1234456 World, Nice to meet u_This \nis A R>
1234456
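
re.S is the short name for re.DOTALL. A minimal sketch of the flag's effect, using a made-up two-line string rather than the article's data:

```python
import re

text = 'first line\nsecond line'  # hypothetical two-line string

without_flag = re.match(r'first.*second', text)           # . stops at the newline
with_flag = re.match(r'first.*second', text, re.DOTALL)   # . also matches '\n'

print(without_flag)
print(with_flag.group())
```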

Escaping

content = 'price is $5.00'
# $ and . are special characters, so this unescaped pattern does not match
result = re.match(r'price is $5.00', content)
print(result)

None

After escaping the special characters:

result = re.match(r'price is \$5\.00', content)
print(result)

<_sre.SRE_Match object; span=(0, 14), match='price is $5.00'>
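
When the text to match comes from a variable, escaping by hand is error-prone. One option, shown here as a sketch and not something the original article uses, is the standard library's re.escape, which builds a pattern that matches its argument literally:

```python
import re

content = 'price is $5.00'

# re.escape returns a pattern that matches the given text literally.
pattern = re.escape(content)
result = re.match(pattern, content)
print(result.group())

# An unescaped dot matches any character; an escaped dot matches only '.'.
loose = re.match(r'5.00', '5x00')
strict = re.match(r'5\.00', '5x00')
print(loose is not None, strict)
```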

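Summary: use generic matching where you can, use parentheses to capture the target, prefer non-greedy mode, and pass re.S when the string contains newlines.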

re.search

re.search scans the whole string and returns the first successful match.

import re

content = 'Extra strings Hello 1234556 World_This is a Regex Demo Extra strings'
result = re.match(r'Hello.*?(\d+).*?Demo', content)
print(result)

None

re.match finds nothing, because the string does not begin with Hello.

result = re.search(r'Hello.*?(\d+).*?Demo', content)
print(result)
print(result.group(1))

<_sre.SRE_Match object; span=(14, 54), match='Hello 1234556 World_This is a Regex Demo'>
1234556

Summary: for convenience, prefer search over match.
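
A compact side-by-side sketch of the two functions, using a hypothetical input string, shows where each one looks:

```python
import re

text = 'id: 42'  # hypothetical input

missed = re.match(r'\d+', text)    # None: the string starts with 'id', not a digit
found = re.search(r'\d+', text)    # scans forward until the digits are found

print(missed)
print(found.group(), found.span())
```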

Matching practice

import re

html = '''<div id="songs-list">
   <h2 class="title">經典老歌</h2>
   <p class="introduction">經典老歌列表</p>
   <ul id="list" class="list-group">
     <li data-view="2">一路上有你</li>
     <li data-view="7">
       <a href="2.mp3" singer="任賢齊">滄海一聲笑</a>
     </li>
     <li data-view="4" class="active">
       <a href="3.mp3" singer="齊秦">往事隨風</a>
     </li>
     <li data-view="6"><a href="4.mp3" singer="beyond">光輝歲月</a></li>
    <li data-view="5"><a href="5.mp3" singer="陳慧琳">記事本</a></li>
     <li data-view="5">
       <a href="6.mp3" singer="鄧麗君"><i class="fa fa-user"></i>但願人長久</a>
     </li>
   </ul>
 </div>
'''
# Capture the singer and the song title inside the <li> marked "active"
result = re.search(r'<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))

齊秦 往事隨風

result = re.search(r'<li.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))

任賢齊 滄海一聲笑

By default, search returns the first match.

result = re.search(r'<li.*?singer="(.*?)">(.*?)</a>', html)
if result:
    print(result.group(1), result.group(2))

beyond 光輝歲月

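Without re.S, . cannot match across line breaks, so the result changes: the first <li> whose <a> tag sits on the same line is matched.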

re.findall

re.findall searches the string and returns all matching substrings as a list.

results = re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
print(results)

[('2.mp3', '任賢齊', '滄海一聲笑'), ('3.mp3', '齊秦', '往事隨風'), ('4.mp3', 'beyond', '光輝歲月'), ('5.mp3', '陳慧琳', '記事本'), ('6.mp3', '鄧麗君', '但願人長久')]

for result in results:
    print(result[0], result[1], result[2])

2.mp3 任賢齊 滄海一聲笑
3.mp3 齊秦 往事隨風
4.mp3 beyond 光輝歲月
5.mp3 陳慧琳 記事本
6.mp3 鄧麗君 但願人長久

# The optional groups (<a.*?>)? and (</a>)? allow for titles with or without a link
results = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
print(results)

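[('', '一路上有你', ''), ('<a href="2.mp3" singer="任賢齊">', '滄海一聲笑', '</a>'), ('<a href="3.mp3" singer="齊秦">', '往事隨風', '</a>'), ('<a href="4.mp3" singer="beyond">', '光輝歲月', '</a>'), ('<a href="5.mp3" singer="陳慧琳">', '記事本', '</a>'), ('<a href="6.mp3" singer="鄧麗君"><i class="fa fa-user"></i>', '但願人長久', '</a>')]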

for result in results:
    print(result[1])

一路上有你
滄海一聲笑
往事隨風
光輝歲月
記事本
但願人長久
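
The shape of the list returned by re.findall depends on how many groups the pattern contains. The sketch below uses a made-up input string to show the three cases: no groups, one group, and several groups:

```python
import re

text = 'a=1, b=2'  # hypothetical input

no_group = re.findall(r'\w=\d', text)        # list of whole matches
one_group = re.findall(r'\w=(\d)', text)     # list of the group's text only
two_groups = re.findall(r'(\w)=(\d)', text)  # list of tuples, one per match

print(no_group)
print(one_group)
print(two_groups)
```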

re.sub

re.sub replaces every matching substring in the string and returns the resulting string.

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
# Replace every run of digits with an empty string
content = re.sub(r'\d+', '', content)
print(content)

Extra strings Hello  World_This is a Regex Demo Extra strings

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content = re.sub(r'\d+', 'Replacement', content)
print(content)

Extra strings Hello Replacement World_This is a Regex Demo Extra strings

\1 in the replacement string refers to the text captured by the first group:

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content = re.sub(r'(\d+)', r'\1 8910', content)
print(content)

Extra strings Hello 1234567 8910 World_This is a Regex Demo Extra strings
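
Two further uses of group references in the replacement string, sketched with made-up inputs: numbered references can reorder captured text, and the \g<1> form keeps the group number separate from a digit that follows it.

```python
import re

# Swap two words by referring to the groups in reverse order.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')

# \g<1>0 means "group 1 followed by a literal 0"; r'\10' would be read as group 10.
appended = re.sub(r'(\d+)', r'\g<1>0', 'a1 b22')

print(swapped)
print(appended)
```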

# Remove the opening and closing <a> tags, keeping the text between them
html = re.sub(r'<a.*?>|</a>', '', html)
print(html)

<div id="songs-list">
   <h2 class="title">經典老歌</h2>
   <p class="introduction">經典老歌列表</p>
   <ul id="list" class="list-group">
     <li data-view="2">一路上有你</li>
     <li data-view="7">
       滄海一聲笑
     </li>
     <li data-view="4" class="active">
       往事隨風
     </li>
     <li data-view="6">光輝歲月</li>
    <li data-view="5">記事本</li>
     <li data-view="5">
       <i class="fa fa-user"></i>但願人長久
     </li>
   </ul>
 </div>

results = re.findall(r'<li.*?>(.*?)</li>', html, re.S)
print(results)

['一路上有你', '\n       滄海一聲笑\n     ', '\n       往事隨風\n     ', '光輝歲月', '記事本', '\n       <i class="fa fa-user"></i>但願人長久\n     ']

for result in results:
    print(result.strip())

一路上有你
滄海一聲笑
往事隨風
光輝歲月
記事本
<i class="fa fa-user"></i>但願人長久

re.compile

re.compile compiles a regular expression string into a pattern object so that the pattern can be reused.

content = '''Hello 1234567 World_This
is a Regex Demo'''
pattern = re.compile(r'Hello.*Demo', re.S)
result = re.match(pattern, content)
print(result)

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This\nis a Regex Demo'>
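
A compiled pattern object also has its own match, search, findall, and sub methods, which is where the reuse pays off. A minimal sketch with made-up input lines:

```python
import re

digits = re.compile(r'\d+')

# Hypothetical input lines, not from the article.
lines = ['order 12', 'no number here', 'items 3 and 45']

# The same compiled pattern is applied to every line.
found = [digits.findall(line) for line in lines]
first = digits.search(lines[0])

print(found)
print(first.group())
```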

Practice example

(Warning: this may freeze your machine.)
Fetching book information from Douban

import requests
import re

content = requests.get('https://book.douban.com/').text
# Capture the link, title, author, and year of each book entry
pattern = re.compile(r'<li.*?cover.*?href="(.*?)".*?alt="(.*?)".*?more-meta.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>', re.S)
results = re.findall(pattern, content)
for result in results:
    url, name, author, date = result
    # Strip all whitespace from the captured text
    author = re.sub(r'\s', '', author)
    date = re.sub(r'\s', '', date)
    print(url, name, author, date)
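
To try the same pattern without a network request, it can be run against a small local string. The markup below is invented for illustration and only mimics the class names that the pattern looks for; the real page's structure is an assumption and may differ.

```python
import re

# Hypothetical markup, invented for illustration; the real page may differ.
sample = '''<li class="">
  <div class="cover">
    <a href="https://example.com/book/1"><img alt="Sample Book" src="cover.jpg"></a>
  </div>
  <div class="more-meta">
    <span class="author">
      Anonymous
    </span>
    <span class="year">
      2017-1
    </span>
  </div>
</li>'''

pattern = re.compile(
    r'<li.*?cover.*?href="(.*?)".*?alt="(.*?)".*?more-meta'
    r'.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>',
    re.S,
)

books = []
for url, name, author, date in pattern.findall(sample):
    # Remove the whitespace that surrounds the text inside each <span>.
    books.append((url, name, re.sub(r'\s', '', author), re.sub(r'\s', '', date)))

print(books)
```

Note that re.sub(r'\s', '', ...) also removes spaces inside the text, so an author name with several words would be run together; str.strip() is an alternative when only the surrounding whitespace should go.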
