
Python Natural Language Processing (Part 1)

1 Tokenization

What is tokenization? It is the process of splitting a raw string into a sequence of meaningful tokens. Its complexity varies with the NLP application, and much of it comes from the target language itself; tokenizing Chinese, for example, is considerably harder than tokenizing English.
word_tokenize() is a general-purpose tokenizer that works across corpora and handles the vast majority of cases.
regexp_tokenize() is based on regular expressions and offers a much higher degree of customization.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re
import nltk

string = "Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics. plus comprehensive API documentation. NLTK is suitable for linguists ."
w = re.split(r'\W+', string)                        # split on all non-word characters
word = nltk.word_tokenize(string)                   # tokenize into words, keeping punctuation
sentence = nltk.sent_tokenize(string)               # split into sentences (English '.' boundaries)
w2 = nltk.tokenize.regexp_tokenize(string, r'\w+')  # regex match on word characters
w3 = nltk.tokenize.regexp_tokenize(string, r'\d+')  # match digits only (none here)
w4 = nltk.wordpunct_tokenize(string)                # split on both word and punctuation boundaries
print(w)
print(word)
print(sentence)
print(w2)
print(w3)
print(w4)

The output is:

['Thanks', 'to', 'a', 'hands', 'on', 'guide', 'introducing', 'programming', 'fundamentals', 'alongside', 'topics', 'in', 'computational', 'linguistics', 'plus', 'comprehensive', 'API', 'documentation', 'NLTK', 'is', 'suitable', 'for', 'linguists', '']

['Thanks', 'to', 'a', 'hands-on', 'guide', 'introducing', 'programming', 'fundamentals', 'alongside', 'topics', 'in', 'computational', 'linguistics', '.', 'plus', 'comprehensive', 'API', 'documentation', '.', 'NLTK', 'is', 'suitable', 'for', 'linguists', '.']

['Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics.', 'plus comprehensive API documentation.', 'NLTK is suitable for linguists .']

['Thanks', 'to', 'a', 'hands', 'on', 'guide', 'introducing', 'programming', 'fundamentals', 'alongside', 'topics', 'in', 'computational', 'linguistics', 'plus', 'comprehensive', 'API', 'documentation', 'NLTK', 'is', 'suitable', 'for', 'linguists']

[]

['Thanks', 'to', 'a', 'hands', '-', 'on', 'guide', 'introducing', 'programming', 'fundamentals', 'alongside', 'topics', 'in', 'computational', 'linguistics', '.', 'plus', 'comprehensive', 'API', 'documentation', '.', 'NLTK', 'is', 'suitable', 'for', 'linguists', '.']
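Because regexp_tokenize() is fully customizable, you can also tailor the pattern to your data. A small sketch (the pattern below is my own, not from the original post) that keeps hyphenated words such as 'hands-on' as single tokens:

import nltk

string = "Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics."
# \w+(?:-\w+)* matches word characters optionally joined by hyphens,
# so 'hands-on' is kept as one token instead of being split in two.
print(nltk.tokenize.regexp_tokenize(string, r'\w+(?:-\w+)*'))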

2 Stemming

Stemming is commonly used to ignore grammatical variation by reducing inflected words to a common root. It is a fairly crude technique; complex NLP tasks call for lemmatization instead. The Porter stemmer generally achieves better than 70% accuracy, which is enough for basic use.

from nltk.stem import PorterStemmer
pst = PorterStemmer()
pst.stem('looking')  # strips the '-ing' suffix
#Out[4]: 'look'
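NLTK also ships other stemmers that differ in how aggressively they strip suffixes. A quick comparison sketch (the example words are my own, not from the post):

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()          # more aggressive than Porter
snowball = SnowballStemmer('english')   # an improved variant of the Porter algorithm

for w in ['looking', 'maximum', 'presumably']:
    print(w, '->', porter.stem(w), '|', lancaster.stem(w), '|', snowball.stem(w))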

3 Lemmatization

Lemmatization is a more methodical approach: it covers all of a word's root forms and variations, using the surrounding context and part of speech to determine the word's inflected form and applying different normalization rules to obtain the lemma appropriate to that part of speech.
First, download the WordNet semantic dictionary:

import nltk
nltk.download()

A window pops up; select the two WordNet packages and download them.
On macOS the downloader window may hang or error out; an alternative is to download the data directly from http://www.nltk.org/nltk_data/, then place the files in your nltk_data directory.
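With the WordNet data in place, here is a minimal lemmatization sketch using NLTK's WordNetLemmatizer (the example words are my own). Note that lemmatize() assumes a noun unless you pass a part-of-speech tag:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('books'))         # 'book'  (noun is the default POS)
print(lemmatizer.lemmatize('ate'))           # 'ate'   (unchanged without a POS tag)
print(lemmatizer.lemmatize('ate', pos='v'))  # 'eat'   (correct once told it is a verb)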

4 Stop Words

Stop-word removal is a common preprocessing step across NLP applications. Put simply, stop words are words that appear in virtually every document in a corpus, so they carry little information and are filtered out.

from nltk.corpus import stopwords
stoplist = stopwords.words('english')  # choose the language
text = "This is just a test"

cleanlist = [word for word in text.split() if word not in stoplist]  # filter out stop words
cleanlist
Out[13]: ['This', 'test']
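Note that 'This' survives because NLTK's stop list is all lowercase and the comparison is case-sensitive. A small sketch that lowercases each token before checking:

cleanlist = [word for word in text.split() if word.lower() not in stoplist]
cleanlist
Out[14]: ['test']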

5 Spelling Correction with edit_distance

To illustrate: this algorithm computes the minimum number of single-character edits needed to transform one word into another.
For example, going from rain to shine takes three steps: "rain" -> "sain" -> "shin" -> "shine".

from nltk.metrics import edit_distance
edit_distance('he','she')
Out[18]: 1
edit_distance('rain','shine')
Out[19]: 3
edit_distance('a','b')
Out[20]: 1
edit_distance('like','love')
Out[21]: 2
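The heading promises spelling correction, but the snippet above only prints raw distances. A minimal corrector sketch (the vocabulary below is hypothetical; in practice use a real word list) picks the known word with the smallest edit distance:

from nltk.metrics import edit_distance

vocabulary = ['rain', 'shine', 'like', 'love', 'language', 'processing']  # hypothetical word list

def correct(word, vocab=vocabulary):
    # Return the vocabulary entry closest to `word` by edit distance.
    return min(vocab, key=lambda candidate: edit_distance(word, candidate))

print(correct('lamguage'))   # 'language'
print(correct('procesing'))  # 'processing'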

The official documentation and implementation are as follows:

def edit_distance(s1, s2, substitution_cost=1, transpositions=False):
    """
    Calculate the Levenshtein edit-distance between two strings.
    The edit distance is the number of characters that need to be
    substituted, inserted, or deleted, to transform s1 into s2.  For
    example, transforming "rain" to "shine" requires three steps,
    consisting of two substitutions and one insertion:
    "rain" -> "sain" -> "shin" -> "shine".  These operations could have
    been done in other orders, but at least three steps are needed.

    Allows specifying the cost of substitution edits (e.g., "a" -> "b"),
    because sometimes it makes sense to assign greater penalties to substitutions.

    This also optionally allows transposition edits (e.g., "ab" -> "ba"),
    though this is disabled by default.

    :param s1, s2: The strings to be analysed
    :param transpositions: Whether to allow transposition edits
    :type s1: str
    :type s2: str
    :type substitution_cost: int
    :type transpositions: bool
    :rtype int
    """
    # set up a 2-D array
    len1 = len(s1)
    len2 = len(s2)
    lev = _edit_dist_init(len1 + 1, len2 + 1)

    # iterate over the array
    for i in range(len1):
        for j in range(len2):
            _edit_dist_step(lev, i + 1, j + 1, s1, s2,
                            substitution_cost=substitution_cost, transpositions=transpositions)
    return lev[len1][len2]
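As the docstring notes, transposition edits are disabled by default, so swapping two adjacent characters counts as two substitutions unless you opt in:

from nltk.metrics import edit_distance
edit_distance('ab', 'ba')
Out[22]: 2
edit_distance('ab', 'ba', transpositions=True)
Out[23]: 1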