
A Detailed Look at compile() and findall() in Python's re Module

First, let's look at what the official documentation says about compile:

re.compile(pattern, flags=0)
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below. The expression’s behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).

The sequence:
prog = re.compile(pattern)
result = prog.match(string)
is equivalent to
result = re.match(pattern, string)
but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
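
A minimal sketch of the equivalence quoted above. The pattern and input strings are made up for illustration, and the names `pattern`, `prog`, and `lines` are not from the documentation:

```python
import re

# Hypothetical pattern and inputs, chosen only for illustration.
pattern = r"\d+"
lines = ["42 apples", "no digits", "7 pears"]

# Compile once, then reuse the compiled object for every string.
prog = re.compile(pattern)
compiled_values = []
for line in lines:
    m = prog.match(line)
    compiled_values.append(m.group() if m else None)

# The module-level function produces the same matches.
module_values = []
for line in lines:
    m = re.match(pattern, line)
    module_values.append(m.group() if m else None)

print(compiled_values)  # ['42', None, '7']
print(module_values)    # ['42', None, '7']
```

Both forms return the same results; the compiled form simply avoids looking up the pattern again on every call.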

Here is the description of the DOTALL flag:

re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
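
A short sketch of the difference the flag makes, assuming a made-up two-line string (`text` and the pattern are illustrative only):

```python
import re

# Hypothetical two-line input.
text = "first line\nsecond line"

# Without DOTALL, '.' stops at the newline.
without_flag = re.compile(r"first.*").findall(text)

# With DOTALL, '.' also matches '\n', so the match runs to the end of the string.
with_flag = re.compile(r"first.*", re.DOTALL).findall(text)

print(without_flag)  # ['first line']
print(with_flag)     # ['first line\nsecond line']
```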

》》》》》》》》》》》》》》》》》》》》

Here is the description of findall:

re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
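
To make "non-overlapping" and "scanned left-to-right" concrete, here is a small sketch with an invented input string. `re.finditer` is used only to show where the match starts:

```python
import re

# Hypothetical input in which the pattern could match at index 0 and index 2
# if overlapping matches were allowed.
text = "ababa"

matches = re.findall(r"aba", text)
positions = [m.start() for m in re.finditer(r"aba", text)]

print(matches)    # ['aba']
print(positions)  # [0]
```

After the match at index 0 consumes "aba", scanning resumes at index 3, so the overlapping candidate at index 2 is never reported.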

》》》》》》》》》》》》》》》》》》》》

Here is an example to walk through:
>>> import re
>>> s = "adfad asdfasdf asdfas asdfawef asd adsfas "

>>> # Two groups: each element is a tuple (outer group, inner group).
>>> reObj1 = re.compile(r'((\w+)\s+\w+)')
>>> reObj1.findall(s)
[('adfad asdfasdf', 'adfad'), ('asdfas asdfawef', 'asdfas'), ('asd adsfas', 'asd')]

>>> # One group: each element is the text matched by that group only.
>>> reObj2 = re.compile(r'(\w+)\s+\w+')
>>> reObj2.findall(s)
['adfad', 'asdfas', 'asd']

>>> # No groups: each element is the whole match.
>>> reObj3 = re.compile(r'\w+\s+\w+')
>>> reObj3.findall(s)
['adfad asdfasdf', 'asdfas asdfawef', 'asd adsfas']

Refer to the figure below to understand the code:


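From the code above, we can see the following:

findall always returns a list of all matches of the regular expression in the string. The discussion here focuses on how each "result" in that list is presented, that is, what information each element of the returned list contains.

1. When the regular expression contains more than one pair of parentheses, each list element is a tuple of strings. The tuple holds one string per pair of parentheses, each string is the text matched by the corresponding parenthesized subexpression, and the strings are ordered by where the opening parentheses appear.

2. When the regular expression contains exactly one pair of parentheses, each list element is a string holding the text matched by the parenthesized subexpression (not the text matched by the whole regular expression).

3. When the regular expression contains no parentheses, each list element is a string holding the text matched by the whole regular expression.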
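
The three cases can be sketched on a second, made-up input (the `key=value` string and the variable names are illustrative only). The last line shows one related detail that goes slightly beyond the list above: "parentheses" here means capturing groups, so a non-capturing group `(?:...)` does not count.

```python
import re

# Hypothetical input of key=value pairs.
s = "key1=val1 key2=val2"

# Case 1: two groups -> list of tuples, ordered by opening parenthesis.
two_groups = re.findall(r"(\w+)=(\w+)", s)

# Case 2: one group -> list of strings holding only the group's text.
one_group = re.findall(r"(\w+)=\w+", s)

# Case 3: no groups -> list of strings holding the whole match.
no_group = re.findall(r"\w+=\w+", s)

# A non-capturing group behaves like case 3.
non_capturing = re.findall(r"(?:\w+)=\w+", s)

print(two_groups)     # [('key1', 'val1'), ('key2', 'val2')]
print(one_group)      # ['key1', 'key2']
print(no_group)       # ['key1=val1', 'key2=val2']
print(non_capturing)  # ['key1=val1', 'key2=val2']
```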

《《《《《《《《《《《《《《《《《

The data returned by re.compile(...).findall(data) can be turned into a str, for easier processing, either by indexing into the list or by calling str.join(). Below is the str.join() documentation from Python 3.5:
str.join(iterable)
Return a string which is the concatenation of the strings in the iterable iterable. A TypeError will be raised if there are any non-string values in iterable, including bytes objects. The separator between elements is the string providing this method.
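
A minimal sketch of both approaches, reusing the sample string from the example earlier in this article. The separators `","` and `"; "` are arbitrary choices. Note that when the pattern has more than one group, findall returns tuples, so a string has to be picked out of each tuple before joining:

```python
import re

s = "adfad asdfasdf asdfas asdfawef asd adsfas "

# One group: findall returns a list of strings.
words = re.findall(r"(\w+)\s+\w+", s)
first = words[0]          # pick one element by index
joined = ",".join(words)  # or join them all into a single str

# Two groups: findall returns a list of tuples, which str.join() cannot
# concatenate directly, so select the wanted string from each tuple first.
pairs = re.findall(r"((\w+)\s+\w+)", s)
joined_full = "; ".join(full for full, _first_word in pairs)

print(first)        # adfad
print(joined)       # adfad,asdfas,asd
print(joined_full)  # adfad asdfasdf; asdfas asdfawef; asd adsfas
```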


With the introduction above, using regular expressions in a crawler should be much easier.