Spelling Correction
When using Word or other text editors, we often come across a spelling-correction feature. In Word, for example, the word "Chinab" gets a red underline, indicating a misspelling, and the spell checker suggests several candidate words to help the user correct it. So, could we implement this feature ourselves?
Why not?
For the idea behind spelling correction, see Peter Norvig's famous page: http://norvig.com/spell-correct.html . The main concept involved is the edit distance between strings; readers may also refer to the article "Dynamic Programming (11): Edit Distance".
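As background for the edit-distance concept referenced above, here is a minimal dynamic-programming sketch of Levenshtein distance (not part of the article's original code; note that this plain version counts a transposition as two edits):

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

print(edit_distance('fianlly', 'finally'))  # 2 (the swapped 'an' costs two edits)
```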
The Spelling-Correction Algorithm
First, we need a corpus; virtually every NLP task relies on one. The corpus for spelling correction is big.txt, which contains:
- data from the Gutenberg corpus;
- Wiktionary;
- the list of most frequent words from the British National Corpus.
It can be downloaded from: https://github.com/percent4/-word- .
Next, we extract all the English words from the corpus and count their occurrences. For a given English word (whether or not it is misspelled), we collect the words at edit distance 0, 1, and 2 from it. These candidates are prioritized as follows: the distance-0 word (the word itself) > distance-1 words > distance-2 words. Finally, we rank the candidates by whether they appear in the corpus, then by their priority, then by their frequency in the corpus, and choose the candidate that appears in the corpus, has the highest priority, and occurs most often as the correction. Of course, the result may be the word itself, in which case it was already spelled correctly.
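The selection rule just described can be sketched as follows, assuming a toy word-frequency table `word_counts` and precomputed candidate sets (the full implementation follows in the next section):

```python
from collections import Counter

# toy corpus frequencies (hypothetical numbers for illustration)
word_counts = Counter({'finally': 50, 'final': 80})

def known(words):
    """Keep only candidates that actually occur in the corpus."""
    return {w for w in words if w in word_counts}

def pick(word, dist0, dist1, dist2):
    # distance 0 beats distance 1 beats distance 2; fall back to the word itself
    candidates = known(dist0) or known(dist1) or known(dist2) or [word]
    # among candidates of equal priority, prefer the most frequent one
    return max(candidates, key=word_counts.get)

# 'fianlly' itself is unknown; 'finally' is one edit away and in the corpus
print(pick('fianlly', {'fianlly'}, {'finally'}, set()))  # finally
```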
Python Implementation
The complete Python code for spelling correction (spelling_correcter.py) is as follows:
```python
# -*- coding: utf-8 -*-
import re, collections


def tokens(text):
    """Get all words from the corpus."""
    return re.findall('[a-z]+', text.lower())


with open('E://big.txt', 'r') as f:
    WORDS = tokens(f.read())

WORD_COUNTS = collections.Counter(WORDS)


def known(words):
    """Return the subset of words that are actually in our WORD_COUNTS dictionary."""
    return {w for w in words if w in WORD_COUNTS}


def edits0(word):
    """Return all strings that are zero edits away from the input word (i.e., the word itself)."""
    return {word}


def edits1(word):
    """Return all strings that are one edit away from the input word."""
    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def splits(word):
        """Return a list of all possible (first, rest) pairs that the input word is made of."""
        return [(word[:i], word[i:]) for i in range(len(word) + 1)]

    pairs = splits(word)
    deletes = [a + b[1:] for (a, b) in pairs if b]
    transposes = [a + b[1] + b[0] + b[2:] for (a, b) in pairs if len(b) > 1]
    replaces = [a + c + b[1:] for (a, b) in pairs for c in alphabet if b]
    inserts = [a + c + b for (a, b) in pairs for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """Return all strings that are two edits away from the input word."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def correct(word):
    """Get the best correct spelling for the input word."""
    # Priority is for edit distance 0, then 1, then 2;
    # else defaults to the input word itself.
    candidates = (known(edits0(word)) or known(edits1(word)) or
                  known(edits2(word)) or [word])
    return max(candidates, key=WORD_COUNTS.get)


def correct_match(match):
    """Spell-correct word in match, and preserve proper upper/lower/title case."""
    word = match.group()

    def case_of(text):
        """Return the case-function appropriate for text: upper, lower, title, or just str."""
        return (str.upper if text.isupper() else
                str.lower if text.islower() else
                str.title if text.istitle() else
                str)

    return case_of(word)(correct(word.lower()))


def correct_text_generic(text):
    """Correct all the words within a text, returning the corrected text."""
    return re.sub('[a-zA-Z]+', correct_match, text)
```
Testing
With the spelling-correction program above, let's test it on some words and sentences:
```python
original_word_list = ['fianlly', 'castel', 'case', 'monutaiyn', 'foresta',
                      'helloa', 'forteen', 'persreve', 'kisss', 'forteen helloa',
                      'phons forteen Doora. This is from Chinab.']

for original_word in original_word_list:
    correct_word = correct_text_generic(original_word)
    print('Original word: %s\nCorrect word: %s' % (original_word, correct_word))
```
The output is as follows:
Original word: fianlly
Correct word: finally
Original word: castel
Correct word: castle
Original word: case
Correct word: case
Original word: monutaiyn
Correct word: mountain
Original word: foresta
Correct word: forest
Original word: helloa
Correct word: hello
Original word: forteen
Correct word: fourteen
Original word: persreve
Correct word: preserve
Original word: kisss
Correct word: kiss
Original word: forteen helloa
Correct word: fourteen hello
Original word: phons forteen Doora. This is from Chinab.
Correct word: peons fourteen Door. This is from China.
Next, we test it on the following Word document (Spelling Error.docx; download from https://github.com/percent4/-word-).
The Python code for correcting the document is as follows:
```python
from docx import Document
from nltk import sent_tokenize, word_tokenize
from spelling_correcter import correct_text_generic
from docx.shared import RGBColor

# number of words corrected in the document
COUNT_CORRECT = 0

# open the source document
file = Document("E://Spelling Error.docx")
# print("Number of paragraphs: " + str(len(file.paragraphs)))

punkt_list = r",.?\"'!()/\\-<>:@#$%^&*~"

document = Document()  # handle for the output Word document


def write_correct_paragraph(i):
    global COUNT_CORRECT
    # text of the paragraph
    paragraph = file.paragraphs[i].text.strip()
    # split into sentences
    sentences = sent_tokenize(text=paragraph)
    # split into words
    words_list = [word_tokenize(sentence) for sentence in sentences]
    p = document.add_paragraph(' ' * 7)  # paragraph handle
    for word_list in words_list:
        for word in word_list:
            # capitalize the first word of the paragraph's first sentence and indent
            if word_list.index(word) == 0 and words_list.index(word_list) == 0:
                if word not in punkt_list:
                    p.add_run(' ')
                    # correct the word; a correct word is returned unchanged
                    correct_word = correct_text_generic(word)
                    # if the word was changed, color it
                    if correct_word != word:
                        colored_word = p.add_run(correct_word[0].upper() + correct_word[1:])
                        font = colored_word.font
                        font.color.rgb = RGBColor(0x00, 0x00, 0xFF)
                        COUNT_CORRECT += 1
                    else:
                        p.add_run(correct_word[0].upper() + correct_word[1:])
                else:
                    p.add_run(word)
            else:
                p.add_run(' ')
                # correct the word; a correct word is returned unchanged
                correct_word = correct_text_generic(word)
                if word not in punkt_list:
                    # if the word was changed, color it red
                    if correct_word != word:
                        colored_word = p.add_run(correct_word)
                        font = colored_word.font
                        font.color.rgb = RGBColor(0xFF, 0x00, 0x00)
                        COUNT_CORRECT += 1
                    else:
                        p.add_run(correct_word)
                else:
                    p.add_run(word)


for i in range(len(file.paragraphs)):
    write_correct_paragraph(i)

document.save('E://correct_document.docx')
print('Corrected document saved!')
print('%d corrections were made in total.' % COUNT_CORRECT)
```
The output is as follows:
Corrected document saved!
19 corrections were made in total.
The corrected Word document looks like this:
The words in red are the corrected versions of originally misspelled words; 19 corrections were made in total.
Summary
Spelling correction is not as hard to implement as one might imagine, but it is not trivial either.
The code and documents for this article have been uploaded to GitHub: https://github.com/percent4/-word- .
Finally, I hope every reader interested in this topic will read Peter Norvig's famous article: http://norvig.com/spell-correct.html .