Python re.sub: filtering out specified characters with a regular expression
阿新 • Published: 2018-12-14
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern. For example:
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):' ,
... r'static PyObject*\npy_\1(void)\n{',
... 'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'
The pattern may be a string or an RE object.
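To illustrate the sentence above, here is a minimal sketch that passes the same pattern once as a plain string and once as a compiled object. The variable names are made up for this example and are not part of the quoted documentation.

```python
import re

text = 'Baked Beans And Spam'

# Pattern given as a plain string; flags are passed to re.sub() directly.
from_string = re.sub(r'\sAND\s', ' & ', text, flags=re.IGNORECASE)

# The same pattern compiled first; here the flags belong to re.compile().
compiled = re.compile(r'\sAND\s', re.IGNORECASE)
from_compiled = re.sub(compiled, ' & ', text)

# A compiled pattern also offers its own sub() method.
from_method = compiled.sub(' & ', text)

print(from_string)    # Baked Beans & Spam
print(from_compiled)  # Baked Beans & Spam
print(from_method)    # Baked Beans & Spam
```

Compiling first is mainly worthwhile when the same pattern is reused many times.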
The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.
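The effect of count and of empty matches can be checked with a short sketch like the one below. The sample string 'a-b-c-d' and the variable names are assumptions chosen for illustration.

```python
import re

text = 'a-b-c-d'

# count omitted (or 0): every occurrence is replaced.
replace_all = re.sub('-', '+', text)

# count=2: only the two leftmost occurrences are replaced.
replace_two = re.sub('-', '+', text, count=2)

# 'x*' matches the empty string at every position of 'abc',
# so a '-' is inserted before, between and after the characters.
empty_matches = re.sub('x*', '-', 'abc')

print(replace_all)    # a+b+c+d
print(replace_two)    # a+b+c-d
print(empty_matches)  # -a-b-c-
```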
In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
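A small sketch of the three \g<...> forms described above. The date-like input and the group names year and month are hypothetical, picked only to make the substitutions easy to follow.

```python
import re

# \g<name>: refer to groups defined with (?P<name>...).
swapped = re.sub(r'(?P<year>\d{4})-(?P<month>\d{2})',
                 r'\g<month>/\g<year>',
                 '2018-12')

# \g<1>0: group 1 followed by a literal '0'
# (a bare \10 would be read as a reference to group 10).
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b')

# \g<0>: the entire matched substring.
bracketed = re.sub(r'\d+', r'[\g<0>]', 'x42y')

print(swapped)    # 12/2018
print(padded)     # a10b
print(bracketed)  # x[42]y
```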
Changed in version 2.7: Added the optional flags argument.
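Put simply, pattern is the regular expression describing what should be replaced; wherever it matches, the matched text is replaced with repl, and string is the text being processed.
For example, suppose we need to clean up a string: first delete everything inside parentheses, including the parentheses themselves; next delete all whitespace; then split the result on commas.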
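import re

s = '物流 企業, 生產效率, 資料包絡分析(DEA),Window Analysis,'
r1 = re.compile(r'[(（].*?[)）]|\s|,$')
# [(（] matches an ASCII ( or a full-width （
# [(（].*?[)）] matches text enclosed in ASCII or full-width parentheses
# .*? matches any characters except a newline, as few as possible (non-greedy)
# | alternation (logical OR)
# \s any whitespace character
# ,$ a comma at the very end of the string
s1 = re.sub(r1, '', s)  # replacing with '' deletes everything the pattern matches
# split on commas
s2 = s1.split(',')
for x in s2:
    print(x)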
The resulting s1 is 物流企業,生產效率,資料包絡分析,WindowAnalysis
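The same three steps can also be wrapped in one reusable helper. This is only a sketch for Python 3, assuming the input uses ASCII or full-width parentheses and ASCII commas as in the sample string; the names split_keywords and NOISE are hypothetical and do not come from the original example.

```python
import re

# Parenthesised text (ASCII or full-width brackets), any whitespace,
# or a single trailing comma. NOISE is a made-up name for this sketch.
NOISE = re.compile(r'[(（].*?[)）]|\s|,$')

def split_keywords(raw):
    """Strip bracketed text, whitespace and a trailing comma, then split on commas."""
    cleaned = NOISE.sub('', raw)
    return cleaned.split(',')

keywords = split_keywords('物流 企業, 生產效率, 資料包絡分析(DEA),Window Analysis,')
for keyword in keywords:
    print(keyword)
```

Because whitespace is removed before splitting, 'Window Analysis' comes out as 'WindowAnalysis'; if inner spaces should be kept, one option would be to split first and strip each piece instead.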