The Python re module: compile() and findall() explained
阿新 • Published: 2019-02-19
First, let's look at what the official documentation says about compile:
re.compile(pattern, flags=0)
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below. The expression’s behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).
The sequence:
prog = re.compile(pattern)
result = prog.match(string)
is equivalent to
result = re.match(pattern, string)
but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
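To make the equivalence concrete, here is a minimal sketch. The pattern and the sample strings are made up for illustration and do not come from the documentation quoted above.

```python
import re

# Hypothetical pattern and inputs, chosen only for illustration.
pattern = r"\d+"
lines = ["42 apples", "no digits", "7 pears"]

# Compile once, then reuse the pattern object for every line.
prog = re.compile(pattern)
compiled_results = [prog.match(line) for line in lines]

# The module-level function gives the same outcome for each line.
module_results = [re.match(pattern, line) for line in lines]

# match() returns None when the string does not start with a match.
compiled_texts = [m.group() if m else None for m in compiled_results]
module_texts = [m.group() if m else None for m in module_results]

print(compiled_texts)  # ['42', None, '7']
print(module_texts)    # ['42', None, '7']
```

Both approaches return the same matches; the compiled object mainly saves repeating the pattern and flags at every call site.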
Below is the description of the DOTALL flag:
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
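The effect of re.DOTALL is easiest to see on a string that contains a newline. The sketch below uses a hypothetical two-line sample and also shows flags combined with bitwise OR, as described in the compile() documentation.

```python
import re

text = "first line\nsecond line"  # hypothetical sample text

# Without DOTALL, '.' stops at the newline.
without_flag = re.findall(r"first.*", text)

# With DOTALL, '.' also matches the newline, so the match runs to the end.
with_flag = re.findall(r"first.*", text, flags=re.DOTALL)

# Flags can be combined with the | operator.
combined = re.compile(r"FIRST.*", re.DOTALL | re.IGNORECASE).findall(text)

print(without_flag)  # ['first line']
print(with_flag)     # ['first line\nsecond line']
print(combined)      # ['first line\nsecond line']
```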
》》》》》》》》》》》》》》》》》》》》
Below is the description of findall:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
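A small sketch of what "non-overlapping" and "left-to-right" mean in practice. The input string is a hypothetical one, and re.finditer is used only to expose the match positions.

```python
import re

# Hypothetical input: four 'a' characters in a row.
text = "aaaa"

# Matches do not overlap: after 'aa' is found at index 0,
# scanning resumes at index 2, so 'aa' at index 1 is never reported.
pairs = re.findall(r"aa", text)

# finditer yields match objects, whose spans confirm the scan order.
spans = [m.span() for m in re.finditer(r"aa", text)]

print(pairs)  # ['aa', 'aa']
print(spans)  # [(0, 2), (2, 4)]
```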
》》》》》》》》》》》》》》》》》》》》
Here is an example to illustrate:
>>> import re
>>> s = "adfad asdfasdf asdfas asdfawef asd adsfas "
>>> reObj1 = re.compile(r'((\w+)\s+\w+)')
>>> reObj1.findall(s)
[('adfad asdfasdf', 'adfad'), ('asdfas asdfawef', 'asdfas'), ('asd adsfas', 'asd')]
>>> reObj2 = re.compile(r'(\w+)\s+\w+')
>>> reObj2.findall(s)
['adfad', 'asdfas', 'asd']
>>> reObj3 = re.compile(r'\w+\s+\w+')
>>> reObj3.findall(s)
['adfad asdfasdf', 'asdfas asdfawef', 'asd adsfas']
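Refer to the figure below to understand the code:
findall always returns a list of all matches of the regular expression in the string. The discussion here is mainly about how each "result" in that list is presented, that is, what information each element of the returned list contains.
1. When the regular expression contains more than one capturing group (pair of parentheses), each list element is a tuple of strings. The tuple has as many strings as there are capturing groups, each string is the text matched by the corresponding group, and the strings are ordered by where each group's opening parenthesis appears.
2. When the regular expression contains exactly one capturing group, each list element is a string holding the text matched by that group (not the text matched by the whole expression).
3. When the regular expression contains no capturing groups, each list element is a string holding the text matched by the whole expression.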
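The three cases can be sketched side by side on a second input. The sample string and variable names below are hypothetical. The last line is an addition not covered by the rules as written: a non-capturing group `(?:...)` does not count as a group for findall.

```python
import re

s = "key1=val1 key2=val2"  # hypothetical sample string

# Case 1: two capturing groups -> list of tuples, one string per group.
two_groups = re.findall(r"(\w+)=(\w+)", s)

# Case 2: one capturing group -> list of strings matched by that group.
one_group = re.findall(r"(\w+)=\w+", s)

# Case 3: no capturing groups -> list of strings matched by the whole pattern.
no_group = re.findall(r"\w+=\w+", s)

# (?:...) groups without capturing, so only the second group is returned.
non_capturing = re.findall(r"(?:\w+)=(\w+)", s)

print(two_groups)     # [('key1', 'val1'), ('key2', 'val2')]
print(one_group)      # ['key1', 'key2']
print(no_group)       # ['key1=val1', 'key2=val2']
print(non_capturing)  # ['val1', 'val2']
```

If parentheses are needed only for grouping or alternation, `(?:...)` keeps findall returning whole matches.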
《《《《《《《《《《《《《《《《《
The data returned by re.compile(...).findall(data) is a list. To turn it into a str that is convenient to process, we can index into the list or use str.join(). Below is the documentation for str.join() from Python 3.5:
str.join(iterable)
Return a string which is the concatenation of the strings in the iterable iterable. A TypeError will be raised if there are any non-string values in iterable, including bytes objects. The separator between elements is the string providing this method.
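As a sketch of both approaches, the code below reuses the string s from the earlier example. The separators and variable names are arbitrary choices for illustration. It also shows the TypeError mentioned in the documentation, which occurs when findall returns tuples rather than strings.

```python
import re

s = "adfad asdfasdf asdfas asdfawef asd adsfas "

# One capturing group: findall returns a list of strings.
words = re.findall(r"(\w+)\s+\w+", s)
first = words[0]          # pick a single result by list index
joined = ",".join(words)  # or concatenate all results into one str

# Two capturing groups: findall returns a list of tuples,
# and str.join() rejects non-string items with a TypeError.
pairs = re.findall(r"((\w+)\s+\w+)", s)
try:
    ",".join(pairs)
    join_failed = False
except TypeError:
    join_failed = True

# Join each tuple first, then join the resulting strings.
joined_pairs = ";".join(" -> ".join(t) for t in pairs)

print(first)         # adfad
print(joined)        # adfad,asdfas,asd
print(join_failed)   # True
print(joined_pairs)  # adfad asdfasdf -> adfad;asdfas asdfawef -> asdfas;asd adsfas -> asd
```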
With the above covered, working with regular expressions in a crawler should be much easier.