Edit Distance

Reference: Dynamic Programming Algorithm (DPA) for Edit-Distance

Edit distance

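The difference between two strings s1 and s2 can be measured by computing their minimum edit distance.

The edit distance is the minimum number of the following operations needed to turn s1 and s2 into the same string:

1. Change a character ch1 into ch2

2. Delete a character

3. Insert a character

For example, take s1 = "12433" and s2 = "1233".

Inserting 4 into the middle of s2 gives 12433, which is identical to s1.

d(s1,s2) = 1 (one insertion was performed)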

Properties of edit distance

The edit distance of two strings s1+ch1 and s2+ch2 has the following properties:

1. d(s1, "") = d("", s1) = |s1|, and d("ch1", "ch2") = (ch1 == ch2 ? 0 : 1);

2. d(s1+ch1, s2+ch2) = min( d(s1, s2) + (ch1 == ch2 ? 0 : 1),

d(s1+ch1, s2) + 1,

d(s1, s2+ch2) + 1 );

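The first property is obvious.

The second property follows from the three operations we chose as the measure of edit distance.

The only operations that can be applied to ch1 and ch2 are:

1. Change ch1 into ch2: d = d(s1, s2) + (ch1 == ch2 ? 0 : 1)

2. Delete ch1 from the end of s1+ch1: d = 1 + d(s1, s2+ch2)

3. Insert ch2 after s1+ch1: d = 1 + d(s1+ch1, s2)

Operations 2 and 3 are equivalent to:

_2. Append ch1 after s2+ch2: d = 1 + d(s1, s2+ch2)

_3. Delete ch2 from the end of s2+ch2: d = 1 + d(s1+ch1, s2)

Taking the minimum over these cases gives property 2.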
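Properties 1 and 2 translate almost directly into a recursive function. The following is only a minimal Python sketch of that idea; the name edit_distance_naive is chosen here for illustration and does not come from the reference. It runs in exponential time, so it is only suitable for very short strings.

```python
def edit_distance_naive(s1: str, s2: str) -> int:
    """Edit distance by direct recursion on property 2 (exponential time)."""
    # Property 1: against an empty string, the distance is the other length.
    if not s1:
        return len(s2)
    if not s2:
        return len(s1)
    # Split off the last characters: s1 = a + ch1, s2 = b + ch2.
    a, ch1 = s1[:-1], s1[-1]
    b, ch2 = s2[:-1], s2[-1]
    return min(
        edit_distance_naive(a, b) + (0 if ch1 == ch2 else 1),  # keep or change ch1
        edit_distance_naive(s1, b) + 1,                        # insert ch2
        edit_distance_naive(a, s2) + 1,                        # delete ch1
    )


print(edit_distance_naive("12433", "1233"))  # prints 1
```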

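Complexity analysis

Property 2 shows that the computation unfolds as a recursion tree. Label each level by the lengths of the strings being compared and assume both strings have length n: the root (n, n) branches into (n-1, n-1), (n, n-1) and (n-1, n), and each of those branches three ways again.

The cost of this direct computation is therefore exponential, at least on the order of 3^n, which is intolerably slow for longer strings.

Analysis: in the structure above, (n-1, n-1), (n-1, n-2), ... appear many times. In other words, the structure has overlapping subproblems. Together with the optimal substructure given by property 2, this satisfies the basic requirements of dynamic programming, so a dynamic programming algorithm can bring the complexity down to polynomial.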
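The effect of the overlapping subproblems can be observed by counting calls. The sketch below is illustrative only: distance_and_calls is a hypothetical helper, and caching with functools.lru_cache stands in for "store each subproblem once". It works on prefix lengths (i, j) instead of copying strings.

```python
from functools import lru_cache


def distance_and_calls(s1: str, s2: str, memoize: bool):
    """Return (edit distance, number of times the recurrence body ran)."""
    calls = 0

    def d(i: int, j: int) -> int:
        # d(i, j) is the distance between the first i chars of s1
        # and the first j chars of s2.
        nonlocal calls
        calls += 1
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if s1[i - 1] == s2[j - 1] else 1
        return min(d(i - 1, j - 1) + cost, d(i, j - 1) + 1, d(i - 1, j) + 1)

    if memoize:
        # Rebinding d makes the recursive calls go through the cache too,
        # so the body runs at most once per distinct (i, j).
        d = lru_cache(maxsize=None)(d)

    return d(len(s1), len(s2)), calls


naive = distance_and_calls("computer", "commuter", memoize=False)
cached = distance_and_calls("computer", "commuter", memoize=True)
print("naive :", naive)
print("cached:", cached)
```

With caching, the body can run at most (|s1|+1) * (|s2|+1) times, while the uncached version revisits the same (i, j) pairs over and over.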

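Solving with dynamic programming

First, to avoid recomputing subproblems, add two auxiliary arrays.

1. Store the subproblem results.

M[ 0..|s1| , 0..|s2| ] , where

M[ i , j ] is the edit distance between the first i characters of s1 and the first j characters of s2

2. Store the edit distance between individual characters.

E[ 1..|s1| , 1..|s2| ] , where E[ i, j ] = (s1[i] == s2[j] ? 0 : 1)

3. The new formulas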

From property 1:

M[ 0, 0 ] = 0;

M[ i, 0 ] = i, for i = 1..|s1|;

M[ 0, j ] = j, for j = 1..|s2|;

From property 2:

M[ i, j ] = min( M[i-1, j-1] + E[ i, j ],

M[i, j-1] + 1,

M[i-1, j] + 1 );
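The formulas above can be filled in bottom-up. The following Python sketch assumes 0-based strings, so s1[i-1] plays the role of the i-th character; the function name edit_distance is illustrative, and E is computed on the fly instead of being stored as a separate array.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Edit distance via the table M[i][j] from the formulas above."""
    rows, cols = len(s1) + 1, len(s2) + 1
    M = [[0] * cols for _ in range(rows)]

    # Property 1: distance to the empty string is the prefix length.
    for i in range(rows):
        M[i][0] = i
    for j in range(cols):
        M[0][j] = j

    # Property 2: each cell depends on its upper-left, left and upper cells.
    for i in range(1, rows):
        for j in range(1, cols):
            E = 0 if s1[i - 1] == s2[j - 1] else 1
            M[i][j] = min(M[i - 1][j - 1] + E,  # keep or change
                          M[i][j - 1] + 1,      # insert
                          M[i - 1][j] + 1)      # delete
    return M[rows - 1][cols - 1]


print(edit_distance("12433", "1233"))  # prints 1
```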

Complexity

From the new formulas, the computation is:

i = 1 -> |s1|

j = 1 -> |s2|

M[i][j] = ……

The complexity is therefore O( |s1| * |s2| ). If both strings have length n, the complexity is O(n^2).

Reference: Dynamic Programming Algorithm (DPA) for Edit-Distance

The words `computer' and `commuter' are very similar, and a change of just one letter, p->m will change the first word into the second. The word `sport' can be changed into `spot' by the deletion of the `p', or equivalently, `spot' can be changed into `sport' by the insertion of `p'.

The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:

  1. change a letter,
  2. insert a letter or
  3. delete a letter

The following recurrence relations define the edit distance, d(s1,s2), of two strings s1 and s2:

d('', '') = 0
d(s, '')  = d('', s) = |s|   -- i.e. length of s
d(s1+ch1, s2+ch2)
  = min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi,
         d(s1+ch1, s2) + 1,
         d(s1, s2+ch2) + 1 )

The first two rules above are obviously true, so it is only necessary to consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be explained in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched for no penalty, i.e. 0, and the overall edit distance is d(s1,s2). If ch1 differs from ch2, then ch1 could be changed into ch2, i.e. 1, giving an overall cost d(s1,s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, d(s1,s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, d(s1+ch1,s2)+1. There are no other alternatives. We take the least expensive, i.e. min, of these alternatives.

The recurrence relations imply an obvious ternary-recursive routine. This is not a good idea because it is exponentially slow, and impractical for strings of more than a very few characters.

Examination of the relations reveals that d(s1,s2) depends only on d(s1',s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used.

A two-dimensional matrix, m[0..|s1|,0..|s2|] is used to hold the edit distance values:

m[i,j] = d(s1[1..i], s2[1..j])

m[0, 0] = 0
m[i, 0] = i,  i=1..|s1|
m[0, j] = j,  j=1..|s2|

m[i,j] = min( m[i-1,j-1]
              + if s1[i]=s2[j] then 0 else 1 fi,
              m[i-1, j] + 1,
              m[i, j-1] + 1 ),    i=1..|s1|, j=1..|s2|

m[,] can be computed row by row. Row m[i,] depends only on row m[i-1,]. The time complexity of this algorithm is O(|s1|*|s2|). If s1 and s2 have a `similar' length, about `n' say, this complexity is O(n^2), much better than exponential!
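Because row m[i,] needs only row m[i-1,] (plus the entries already computed to its left in the same row), the full matrix does not have to be kept. The Python sketch below illustrates this with two rows; the name edit_distance_two_rows is hypothetical, and the version is one possible reading of the remark above, not code from the reference.

```python
def edit_distance_two_rows(s1: str, s2: str) -> int:
    """Edit distance keeping only the previous and the current row."""
    prev = list(range(len(s2) + 1))  # row m[0, ]: m[0, j] = j
    for i, ch1 in enumerate(s1, start=1):
        curr = [i]  # m[i, 0] = i
        for j, ch2 in enumerate(s2, start=1):
            cost = 0 if ch1 == ch2 else 1
            curr.append(min(prev[j - 1] + cost,  # m[i-1, j-1] + 0 or 1
                            prev[j] + 1,         # m[i-1, j] + 1
                            curr[j - 1] + 1))    # m[i, j-1] + 1
        prev = curr
    return prev[-1]


print(edit_distance_two_rows("computer", "commuter"))  # prints 1
print(edit_distance_two_rows("sport", "spot"))         # prints 1
```

The running time is still O(|s1|*|s2|), but the extra memory drops to O(|s2|).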
