[Pattern Matching] The Aho-Corasick Automaton

阿新 • Published: 2019-01-18

1. Multi-pattern matching

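The AC automaton (Aho-Corasick automaton) is a multi-pattern matching algorithm. Multi-pattern matching means that a string-matching task involves several patterns at once. KMP and BM, covered in earlier posts, are single-pattern algorithms: they search for exactly one pattern. Let the text be \(T[1 \cdots m]\), let there be k patterns \(\mathbb{P} = \{ P_1, \cdots, P_k\}\), and let the total length of all patterns be \(n\). Running KMP once per pattern costs: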

\[ O(|P_1|+m+\cdots + |P_k|+m)=O(n+km) \]
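To make the \(km\) term concrete, here is a minimal sketch of the one-pass-per-pattern baseline. It is illustrative only: Python's built-in `str.find` stands in for a real KMP search (it gives the same answers but not KMP's worst-case guarantee), and the names `find_all` and `naive_multi_match` are hypothetical.

```python
def find_all(text, pattern):
    """Yield every start index of pattern in text.

    str.find is a stand-in for a single-pattern matcher such as KMP.
    """
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)


def naive_multi_match(text, patterns):
    """Scan the whole text once per pattern, i.e. k passes in total."""
    return {pattern: list(find_all(text, pattern)) for pattern in patterns}


hits = naive_multi_match("ushers", ["he", "she", "his", "hers"])
print(hits)
```

Each pattern triggers its own scan of the text, which is exactly the repeated work the AC automaton avoids.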

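KMP, however, does not exploit the character structure that the patterns share, and every pattern requires a full scan of the text from start to end. In 1975, Aho and Corasick of Bell Labs proposed the AC automaton algorithm [1], which is based on finite state machines. As an aside, the AC algorithm actually predates the publication of KMP, which did not appear until 1977.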

2. The AC algorithm

The AC automaton

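An automaton consists of states (circles labeled with numbers) and transitions (labeled arrows); each transition corresponds to one character. The core of the AC algorithm is three functions, goto, failure, and output, which together make up the AC automaton. For the pattern set {he, his, hers, she}, the goto function describes how characters move along the patterns, and it implicitly encodes the prefixes that the patterns share.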
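The goto function for these patterns can be sketched as a plain trie. The snippet below is an illustration under two assumptions: the helper name `build_goto` is hypothetical, and states are numbered in the order they are created when the patterns are inserted as he, she, his, hers.

```python
def build_goto(patterns):
    """Return a list of dicts where goto[state][char] is the next state.

    State 0 is the root; a new state is created for every unseen prefix.
    """
    goto = [{}]
    for pattern in patterns:
        state = 0
        for char in pattern:
            if char not in goto[state]:
                goto.append({})
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
    return goto


goto = build_goto(["he", "she", "his", "hers"])
for state, edges in enumerate(goto):
    for char, target in sorted(edges.items()):
        print(f"g({state}, {char!r}) = {target}")
```

The patterns he, his, and hers all pass through state 1 (the prefix h), and he and hers share state 2, which is how the trie stores common prefixes only once.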

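The failure function gives the state to fall back to when a match fails.

The output function records which patterns are recognized at each state.

Together, the three functions form the complete AC automaton.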
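For the example patterns, the failure and output functions can be tabulated as below. This is a reconstruction rather than the original figure. It assumes that states are numbered in the order they are created when the patterns are inserted as he, she, his, hers (so 1 = h, 2 = he, 3 = s, 4 = sh, 5 = she, 6 = hi, 7 = his, 8 = her, 9 = hers), which agrees with the state numbers used in the worked examples of this article.

```latex
\[
\begin{array}{c|ccccccccc}
s    & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ \hline
f(s) & 0 & 0 & 0 & 1 & 2 & 0 & 3 & 0 & 3
\end{array}
\qquad
\begin{array}{c|l}
s & output(s) \\ \hline
2 & \{\text{he}\} \\
5 & \{\text{she}, \text{he}\} \\
7 & \{\text{his}\} \\
9 & \{\text{hers}\}
\end{array}
\]
```

Each \(f(s)\) is the state spelling the longest proper suffix of the string of \(s\) that is also a prefix of some pattern; for instance state 5 (she) falls back to state 2 (he), which is why output(5) also contains he.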

Matching

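The AC algorithm matches patterns by walking the automaton, and the procedure is simple. Start at the first character of the text and at the initial state 0; then, for each character:

If the character matches, follow the goto function to the next state; if that state has a non-empty output, report the patterns matched there.

If the character does not match, follow the failure function repeatedly until a state with a goto transition on that character is found (state 0 never fails), then take that transition.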

Applying these two rules to each character in turn matches the whole text in a single left-to-right pass.
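The matching procedure can be sketched as follows. The tables are hard-coded for the pattern set {he, she, his, hers} under the assumption that states are numbered in insertion order; the names `GOTO`, `FAILURE`, `OUTPUT`, and `ac_match` are illustrative and not part of the article's implementation.

```python
# Hand-written tables for {he, she, his, hers}; the state numbering is an assumption.
GOTO = [{"h": 1, "s": 3}, {"e": 2, "i": 6}, {"r": 8}, {"h": 4}, {"e": 5},
        {}, {"s": 7}, {}, {"s": 9}, {}]
FAILURE = [0, 0, 0, 0, 1, 2, 0, 3, 0, 3]
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}


def ac_match(text):
    """Return sorted (end_index, pattern) pairs; end_index is 0-based."""
    matches = []
    state = 0
    for index, char in enumerate(text):
        # Follow failure links until some state has a goto edge for char.
        while state != 0 and char not in GOTO[state]:
            state = FAILURE[state]
        # The root never fails: a character without an edge stays at state 0.
        state = GOTO[state].get(char, 0)
        for pattern in OUTPUT.get(state, ()):
            matches.append((index, pattern))
    return sorted(matches)


matches = ac_match("ushers")
print(matches)  # [(3, 'he'), (3, 'she'), (5, 'hers')]
```

On the text ushers the automaton reaches state 5 after reading she and reports both she and he; on the next character r it falls back from state 5 to state 2 and continues to state 8, so hers is still found without rescanning any text.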

Construction

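The AC automaton is indeed simple and efficient, but how do we construct its goto, failure, and output functions? Start with goto. On closer inspection, the goto function is essentially a trie, with the failure function supplying the fallback pointers. It exploits the common prefixes of the patterns and, together with the output function, captures the character structure of the pattern set.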

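The failure function is the ingenious part of the AC algorithm; it drives the fallback when a match fails. The fallback state must be reachable by a goto transition on the current state's transition character, and it must yield the longest possible match. Write \(g(r,a)=s\) when state r can goto state s on character a; we then call r the previous state of s and a the transition character of s. The failure function follows this rule: on a failed match, the fallback state is the state reached by goto, on the transition character, from the failure value of the previous state (which we call its failure transition state). If no such goto exists, it is the state reached on the transition character from the failure transition state of that failure transition state; if that fails as well, repeat the step, and so on. This sounds convoluted, so let us take the AC automaton for the patterns above as an example: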

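For state 7, the previous state 6 has failure transition state 0, and state 0 can goto state 3 on the transition character s, so \(f(7)=3\);
For state 2, the previous state 1 has failure transition state 0, and state 0 goes to state 0 on the transition character e (the root loops back to itself on characters with no outgoing edge), so \(f(2)=0\).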

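Every non-root state that the root (state 0) can goto directly has failure value 0. By the nature of the goto table (a trie), the previous state and the transition character of any state are uniquely determined. So define \(\beta(s)=r\) to mean that the previous state of \(s\) is \(r\), and \(\tau(s)=a\) to mean that the transition character of \(s\) is \(a\); write \(f^{i}(s)=f\left( f^{(i-1)}(s)\right)\), with \(f^{1}(s)=f(s)\). The failure function of state s is then: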

\[ f(s) = \begin{cases} g\left( f^{n}(\beta(s)), \tau(s) \right) & n = \arg \min_{i} \left\{ g\left( f^{i}(\beta(s)), \tau(s) \right) \neq failure \right\} \\ 0 & \text{otherwise} \end{cases} \]
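As a check on the formula, the sketch below evaluates \(f\) directly from \(\beta\) and \(\tau\) using memoized recursion. It is not the queue-based construction used in the implementation section. The dictionaries `beta` and `tau` and the insertion-order state numbering are assumptions made for illustration.

```python
from functools import lru_cache

patterns = ["he", "she", "his", "hers"]

# Build goto together with beta (previous state) and tau (transition character).
goto, beta, tau = [{}], {}, {}
for pattern in patterns:
    state = 0
    for char in pattern:
        if char not in goto[state]:
            goto.append({})
            new_state = len(goto) - 1
            goto[state][char] = new_state
            beta[new_state], tau[new_state] = state, char
        state = goto[state][char]


@lru_cache(maxsize=None)
def f(s):
    """Failure function evaluated directly from the formula."""
    if s == 0 or beta[s] == 0:
        return 0  # the root and its direct children fall back to the root
    state = f(beta[s])  # f^1(beta(s))
    # Apply f again and again until goto succeeds or the root is reached.
    while state != 0 and tau[s] not in goto[state]:
        state = f(state)
    # At the root, a missing edge means the fallback state is 0.
    return goto[state].get(tau[s], 0)


failure = [f(s) for s in range(len(goto))]
print(failure)  # [0, 0, 0, 0, 1, 2, 0, 3, 0, 3]
```

The result reproduces \(f(7)=3\) and \(f(2)=0\) from the worked examples. Recursion is convenient for short patterns; processing states level by level with a queue avoids deep recursion on long ones.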

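The failure function is computed with a queue: states are processed in breadth-first order, so the failure value of every shallower state is already known by the time a deeper state needs it.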

3. Implementation

# coding=utf-8
from collections import deque, namedtuple

automaton = []
# state_id: int, value: char, goto: dict, failure: int, output: set
Node = namedtuple("Node", "state value goto failure output")


def init_trie(words):
    """
    Create an AC automaton: build an empty trie, add each word to it,
    then set the failure transitions.
    """
    create_empty_trie()
    # An explicit loop: map() is lazy in Python 3 and would never call add_word.
    for word in words:
        add_word(word)
    set_fail_transitions()


def create_empty_trie():
    """ initialize the root of the trie """
    automaton.append(Node(0, '', {}, 0, set()))


def add_word(word):
    """add word into trie"""
    node = automaton[0]
    for char in word:
        # char is not in trie
        if goto_state(node, char) is None:
            next_state = len(automaton)
            node.goto[char] = next_state  # modify goto(state, char)
            automaton.append(Node(next_state, char, {}, 0, set()))
            node = automaton[next_state]
        else:
            node = automaton[goto_state(node, char)]
    node.output.add(word)


def goto_state(node, char):
    """goto function"""
    if char in node.goto:
        return node.goto[char]
    else:
        return None


def set_fail_transitions():
    """construction of failure function, and update the output function"""
    queue = deque()
    # initialization
    for char in automaton[0].goto:
        s = automaton[0].goto[char]
        queue.append(s)
        automaton[s] = automaton[s]._replace(failure=0)
    while queue:
        r = queue.popleft()
        node = automaton[r]
        for a in node.goto:
            s = node.goto[a]
            queue.append(s)
            state = node.failure
            # follow failure links until some state has a goto edge on `a`
            while goto_state(automaton[state], a) is None and state != 0:
                state = automaton[state].failure
            # the root never fails: a char with no goto edge from the root loops back to the root
            if state == 0 and goto_state(automaton[state], a) is None:
                goto_a = 0
            else:
                goto_a = automaton[state].goto[a]
            automaton[s] = automaton[s]._replace(failure=goto_a)
            fs = automaton[s].failure
            # inherit the outputs of the failure state: those patterns are suffixes of this state's string
            automaton[s].output.update(automaton[fs].output)


def search_result(strings):
    """AC pattern matching machine"""
    result_set = set()
    node = automaton[0]
    for char in strings:
        while goto_state(node, char) is None and node.state != 0:
            node = automaton[node.failure]
        if node.state == 0 and goto_state(node, char) is None:
            node = automaton[0]
        else:
            node = automaton[goto_state(node, char)]
        if len(node.output) >= 1:
            result_set.update(node.output)
    return result_set


init_trie(['he', 'she', 'his', 'hers'])
print(search_result("ushersm"))  # prints the matched patterns he, she, hers (set order may vary)

-------------------------------------------------------- Update, 2016-06-14 --------------------------------------------------------
I have implemented a Scala version that supports attaching attributes to words; the code is hosted at scala-AC.

4. References

[1] Aho, Alfred V., and Margaret J. Corasick. "Efficient string matching: an aid to bibliographic search." Communications of the ACM 18.6 (1975): 333-340.
[2] Pekka Kilpeläinen, Lecture 4: Set Matching and Aho-Corasick Algorithm.
