
Teaching Yourself Python Web Scraping (3): Regular Expressions

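1. What is a regular expression

A regular expression is a kind of logical formula for operating on strings. A set of predefined special characters, and combinations of them, form a "rule string", and that rule string expresses a filtering logic for strings. (Regular expressions are not unique to Python; in Python they are implemented by the re module.)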

2. Common matching patterns

re.match
re.match tries to match a pattern from the start of the string. If the pattern does not match at the start, match() returns None.
The most basic match:

import re

content = "Hello 123 4567 World_This is a Regex Demo"
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)
print(result)

import re

content = "Hello 123 4567 World_This is a Regex Demo"
print(len(content))    # 41
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)
print(result)
print(result.group())  # the full matched text
print(result.span())   # (0, 41): start and end index of the match

Generic matching

import re

content = "Hello 123 4567 World_This is a Regex Demo"
result = re.match(r'^Hello.*Demo$', content)
print(result)
print(result.group())
print(result.span())

Capturing a target

import re

content = "Hello 1234567 World_This is a Regex Demo"
result = re.match(r'^Hello\s(\d+)\sWorld.*Demo$', content)
print(result)
print(result.group(1))
print(result.span())
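Parentheses can be used more than once, and each pair gets its own group number. The sketch below is a small variation on the example above (the two-group pattern is my own, for illustration) that shows how group(0), group(1), group(2), groups() and span(1) relate:

```python
import re

content = "Hello 123 4567 World_This is a Regex Demo"
# Two capture groups: one for each run of digits.
result = re.match(r'^Hello\s(\d+)\s(\d+)\sWorld', content)

print(result.group(0))   # whole match: Hello 123 4567 World
print(result.group(1))   # 123
print(result.group(2))   # 4567
print(result.groups())   # ('123', '4567')
print(result.span(1))    # (6, 9): where group 1 sits in the string
```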

Greedy matching

import re

content = "Hello 1234567 World_This is a Regex Demo"
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))
print(result.span())

Non-greedy matching

import re

content = "Hello 1234567 World_This is a Regex Demo"
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))
print(result.span())
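To see the difference side by side, the short sketch below runs the greedy and the non-greedy pattern on the same string. The variable names greedy and lazy are just illustrative:

```python
import re

content = "Hello 1234567 World_This is a Regex Demo"

# Greedy: '.*' grabs as much as it can, leaving only one digit for (\d+).
greedy = re.match(r'^He.*(\d+).*Demo$', content)
# Non-greedy: '.*?' stops as early as possible, so (\d+) gets the whole number.
lazy = re.match(r'^He.*?(\d+).*Demo$', content)

print(greedy.group(1))  # 7
print(lazy.group(1))    # 1234567
```

This is why the non-greedy form is usually the safer choice when the part you want sits right after a wildcard.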

Match modes

import re

content = '''Hello 1234567 World_This
is a Regex Demo'''
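result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)  # None: without re.S, '.' does not match the newline
# result.group(1) and result.span() would raise AttributeError here,
# because result is None.

import re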

content = '''Hello 1234567 World_This
is a Regex Demo'''
result = re.match(r'^He.*(\d+).*Demo$', content, re.S)
print(result)
print(result.group(1))
print(result.span())
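The effect of re.S can be checked directly by running one pattern with and without the flag. A minimal sketch, using the non-greedy pattern from earlier so that the whole number is captured (the variable names are my own):

```python
import re

content = '''Hello 1234567 World_This
is a Regex Demo'''

pattern = r'^He.*?(\d+).*Demo$'

without_flag = re.match(pattern, content)      # '.' stops at the newline
with_flag = re.match(pattern, content, re.S)   # '.' also matches '\n'

print(without_flag)        # None
print(with_flag.group(1))  # 1234567
```

Because a failed match returns None, it is worth checking the result before calling group() on it.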

Escaping

import re

content = 'price is $5.00'
result = re.match(r'^price is \$5\.00', content)
print(result)
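When the literal text comes from somewhere else and you cannot escape it by hand, the standard library's re.escape() can do it for you. This goes a bit beyond the example above, so treat it as a sketch; the literal variable is made up for illustration:

```python
import re

content = 'price is $5.00'
literal = 'price is $5.00'   # text we want to match literally

# Unescaped: '$' means end-of-string and '.' means any character, so this fails.
print(re.match(literal, content))  # None

# re.escape() backslash-escapes the special characters for us.
escaped = re.escape(literal)
result = re.match(escaped, content)
print(result.group())  # price is $5.00
```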

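Summary: prefer generic matching where you can, use parentheses to capture the target, prefer non-greedy mode, and pass re.S whenever the text contains newlines.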

re.search

re.search scans the whole string and returns the first successful match.

import re

content = "Extra String Hello 123 4567 World_This is a Regex Demo"
result = re.match(r'Hello.*?(\d+).*?Demo$', content)
print(result)  # None: match() only tries at the start, which is "Extra"

import re

content = "Extra String Hello 123 4567 World_This is a Regex Demo"
result = re.search(r'Hello.*?(\d+).*?Demo$', content)
print(result)

Summary: for convenience, use search rather than match whenever you can.
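In practice search is often wrapped in a small helper that also handles the no-match case. The function first_number below is a hypothetical name of my own, just to sketch the idea:

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    result = re.search(r'\d+', text)
    return result.group() if result else None

print(first_number("Extra String Hello 123 4567 World"))  # 123
print(first_number("no digits here"))                      # None
```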
re.findall
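re.findall returns every non-overlapping match as a list, instead of only the first one. A minimal sketch follows; the sample HTML string is made up for illustration and is not from a real page:

```python
import re

# Hypothetical sample text for illustration.
html = '''<li><a href="/song/1">First Song</a></li>
<li><a href="/song/2">Second Song</a></li>'''

# With two groups, findall returns a list of (href, title) tuples.
results = re.findall(r'<a href="(.*?)">(.*?)</a>', html)
print(results)
# [('/song/1', 'First Song'), ('/song/2', 'Second Song')]

# With no groups, findall returns the matched strings themselves.
numbers = re.findall(r'\d+', "Hello 123 4567 World")
print(numbers)  # ['123', '4567']
```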
re.sub

import re

content = "Extra String Hello 1234567 World_This is a Regex Demo"
content = re.sub(r'\d+', 'replacement', content)
print(content)

import re

content = "Extra String Hello 1234567 World_This is a Regex Demo"
content = re.sub(r'\d+', '', content)
print(content)

import re

content = "Extra String Hello 1234567 World_This is a Regex Demo"
content = re.sub(r'(\d+)', r'\1 8910', content)  # \1 refers back to the captured digits
print(content)

re.compile
Compiles a regular expression into a pattern object so that the pattern can be reused.

import re

content = "Hello 123 4567 World_This is a Regex Demo"
pattern = re.compile(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$')
result = pattern.match(content)  # equivalent to re.match(pattern, content)
print(result)
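The benefit of compiling shows up when the same pattern is applied to many strings. A small sketch, assuming a list of made-up input lines:

```python
import re

# Compile once, reuse many times.
pattern = re.compile(r'^Hello\s(\d+)\sWorld')

lines = [                      # hypothetical inputs for illustration
    "Hello 123 World",
    "Hello 4567 World_This is a Regex Demo",
    "Goodbye 89 World",
]

numbers = []
for line in lines:
    result = pattern.match(line)   # same as re.match(pattern, line)
    if result:
        numbers.append(result.group(1))

print(numbers)  # ['123', '4567']
```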