Text Recognition (Natural Language Processing, NLP)

阿新 • Published: 2018-12-20

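Speech Recognition

Speech -----------------------> Text ---------------------> Semantics

NLTK - Natural Language Toolkit

Tokenization

import nltk.tokenize as tk
tk.sent_tokenize(text) -> list of sentences
tk.word_tokenize(text) -> list of words
tokenizer = tk.WordPunctTokenizer() -> splits slightly differently (for example, 's becomes ' and s)
tokenizer.tokenize(text) -> list of words
Code: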

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk
doc = "Are you curious about tokenization? " \
      "Let's see how it works! " \
      "We need to analyze a couple of sentences " \
      "with punctuations to see it in action."
print(doc)
tokens = tk.sent_tokenize(doc, language='english')
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokens = tk.word_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
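
The way a punctuation-splitting tokenizer breaks up 's can be seen with a plain regular expression. The sketch below is not NLTK code. It assumes the pattern \w+|[^\w\s]+ (runs of word characters, or runs of punctuation), which is the pattern NLTK documents for WordPunctTokenizer; the helper name word_punct_tokenize is hypothetical.

```python
import re

# Runs of word characters, or runs of non-space punctuation.
PATTERN = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return PATTERN.findall(text)


print(word_punct_tokenize("Let's see how it works!"))
# ['Let', "'", 's', 'see', 'how', 'it', 'works', '!']
```

The apostrophe becomes its own token because it is neither a word character nor whitespace.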

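Stemming

import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
Note: a stem is not necessarily a valid word; it may be only a fragment of one.
pt.PorterStemmer() -> Porter stemmer, relatively lenient
lc.LancasterStemmer() -> Lancaster stemmer, relatively strict
sb.SnowballStemmer(language) -> Snowball stemmer, in between the two
stemmer.stem(word) -> stem
Code: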

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
pt_stemmer = pt.PorterStemmer()
lc_stemmer = lc.LancasterStemmer()
sb_stemmer = sb.SnowballStemmer('english')
for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s %8s %8s %8s' % (
        word, pt_stem, lc_stem, sb_stem))
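
To see why a stem may not be a real word, here is a deliberately naive suffix stripper. It is a toy for illustration only, not the Porter, Lancaster, or Snowball algorithm; the suffix list and the name toy_stem are assumptions made for this sketch.

```python
# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip one common English suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


for word in ["playing", "beaches", "wolves", "grounded", "is"]:
    print("%8s %8s" % (word, toy_stem(word)))
```

Here 'wolves' becomes 'wolv', which is a fragment rather than a word. Real stemmers apply many more rules, but they share this property.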

Lemmatization

Nouns: plural -> singular
Verbs: participle -> base form

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.stem as ns
words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    n_lemma = lemmatizer.lemmatize(word, pos='n')
    v_lemma = lemmatizer.lemmatize(word, pos='v')
    print('%8s %8s %8s' % (word, n_lemma, v_lemma))

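Bag of Words

Similar words tend to appear in sentences with similar meanings. On the principle that similar inputs should produce similar outputs, count how many times each dictionary word occurs in each sample. Sentences with similar count patterns can then be treated as similar, which is how a chatbot, for example, can choose a response.
The brown dog is running. The black dog is in the black room. Running in the room is forbidden.
1 The brown dog is running
2 The black dog is in the black room
3 Running in the room is forbidden
    the  brown  dog  is  running  black  in  room  forbidden
1     1      1    1   1        1      0   0     0          0
2     2      0    1   1        0      2   1     1          0
3     1      0    0   1        1      0   1     1          1
Code: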

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
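# Split the document into sentences
sentences = tk.sent_tokenize(doc)
print(sentences)
# Count vectorizer
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
# fit_transform returns a sparse matrix; toarray() makes it dense
print(bow)
# get_feature_names() was removed in scikit-learn 1.2
words = cv.get_feature_names_out()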
print(words)
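
The counting itself needs no library. The following sketch builds the count matrix for the three example sentences with only the standard library. The helper name words_of and the lowercase alphabetic tokenization are assumptions of this sketch, and the columns follow order of first appearance, whereas CountVectorizer orders them alphabetically.

```python
import re
from collections import Counter

sentences = [
    "The brown dog is running.",
    "The black dog is in the black room.",
    "Running in the room is forbidden.",
]


def words_of(sentence):
    """Lowercase a sentence and return its alphabetic words."""
    return re.findall(r"[a-z]+", sentence.lower())


# Vocabulary in order of first appearance.
vocab = []
for sentence in sentences:
    for word in words_of(sentence):
        if word not in vocab:
            vocab.append(word)

# One row of counts per sentence, one column per vocabulary word.
bow = []
for sentence in sentences:
    counts = Counter(words_of(sentence))
    bow.append([counts[word] for word in vocab])

print(vocab)
for row in bow:
    print(row)
```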

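Term Frequency

Term frequency is the normalized bag-of-words matrix: each word's count in a sample is divided by the total number of words in that sample, giving the word's frequency of occurrence.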

import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
import sklearn.preprocessing as sp
doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
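# Extract word-count features from the text
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
# get_feature_names() was removed in scikit-learn 1.2
words = cv.get_feature_names_out()
print(words)
# Compute term frequencies (each row is scaled to sum to 1)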
tf = sp.normalize(bow, norm='l1')
print(tf)
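
L1 normalization simply divides each row by its sum. A minimal sketch on the count matrix of the three example sentences, written out by hand here with the columns assumed to be in the order the, brown, dog, is, running, black, in, room, forbidden:

```python
# Word counts for the three example sentences.
bow = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0],
    [2, 0, 1, 1, 0, 2, 1, 1, 0],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
]

# Divide every count by the total number of words in its sentence.
tf = [[count / sum(row) for count in row] for row in bow]

for row in tf:
    print(["%.3f" % value for value in row])
```

In the second sentence 'the' occurs 2 times out of 8 words, so its term frequency is 0.25.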

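Document Frequency (DF)

For each word in the dictionary, divide the number of samples that contain the word by the total number of samples. The rarer the word, the lower its document frequency, and a word's rarity helps characterize the documents that contain it.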

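Inverse Document Frequency (IDF)

The higher the inverse document frequency, the lower the document frequency, the rarer the word, and the more it contributes to distinguishing one sample from another.
Higher term frequency ----------------------------------------------> greater contribution to semantic expressiveness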
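
Both quantities can be computed by hand for the three example sentences. This sketch assumes the textbook definition idf = ln(N / n), where N is the number of samples and n the number of samples containing the word; libraries such as scikit-learn use smoothed variants, so their numbers differ.

```python
import math

# Each sentence reduced to the set of distinct words it contains.
docs = [
    {"the", "brown", "dog", "is", "running"},
    {"the", "black", "dog", "is", "in", "room"},
    {"running", "in", "the", "room", "is", "forbidden"},
]
vocab = sorted(set().union(*docs))
n_docs = len(docs)

# Number of sentences that contain each word.
containing = {word: sum(word in doc for doc in docs) for word in vocab}

df = {word: containing[word] / n_docs for word in vocab}
idf = {word: math.log(n_docs / containing[word]) for word in vocab}

for word in vocab:
    print("%10s  df=%.3f  idf=%.3f" % (word, df[word], idf[word]))
```

A word such as 'the', which appears in every sentence, gets an IDF of 0, while 'brown', which appears in only one, gets the highest value.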

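Term Frequency-Inverse Document Frequency (TF-IDF)

Term frequency multiplied by inverse document frequency captures both how much a word contributes to a sample's meaning and how well it distinguishes that sample from others.

Each element of the term-frequency matrix is multiplied by the inverse document frequency of the corresponding word. The larger the value, the more the word contributes to the sample's semantics; a learning model is then built on these per-word contributions.
Code: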

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
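# Feature extractor: counts how often each word occurs in each sentence
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
# Get the feature names (the vocabulary);
# get_feature_names() was removed in scikit-learn 1.2
words = cv.get_feature_names_out()
print(words)
tt = ft.TfidfTransformer()
# Compute TF-IDF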
tfidf = tt.fit_transform(bow).toarray()
print(tfidf)
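
The multiplication can be written out by hand. The sketch below assumes a smoothed IDF, ln((1 + N) / (1 + n)) + 1, applied to raw counts and followed by scaling each row to unit Euclidean length. That is my understanding of the default TfidfTransformer settings, so treat it as an assumption rather than a specification of the library.

```python
import math

vocab = ["the", "brown", "dog", "is", "running",
         "black", "in", "room", "forbidden"]
bow = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0],
    [2, 0, 1, 1, 0, 2, 1, 1, 0],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
]
n_docs = len(bow)

# Number of sentences in which each word occurs at least once.
doc_counts = [sum(1 for row in bow if row[j] > 0)
              for j in range(len(vocab))]

# Smoothed inverse document frequency (assumed formula, see above).
idf = [math.log((1 + n_docs) / (1 + dc)) + 1 for dc in doc_counts]

tfidf = []
for row in bow:
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    tfidf.append([value / norm for value in weighted])

for row in tfidf:
    print(["%.3f" % value for value in row])
```

In the first sentence every word occurs once, yet 'brown' ends up with a larger weight than 'the' because it appears in fewer sentences.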

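Sentiment Analysis with Multinomial Naive Bayes

Multinomial naive Bayes classifier
Through supervised learning, associate key words with sentiments; for an unseen sentence, match its words against what was learned to judge whether the sentiment is positive or negative.
Sentiment analysis
A B C
1 2 3 -> {'A': 1, 'C': 3, 'B': 2}
4 5 6 -> {'C': 6, 'A': 4, 'B': 5}
7 8 9 ...
Code: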

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu
# List of (word dictionary, label) pairs, one per positive review
pdata = []
# Positive reviews from NLTK's bundled movie review corpus
# (requires nltk.download('movie_reviews'))
fileids = nc.movie_reviews.fileids('pos')
for fileid in fileids:
    # nc.movie_reviews.words is an instance method of
    # nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader
    words = nc.movie_reviews.words(fileid)
    # Build a dictionary keyed by every word in the review
    sample = {}
    for word in words:
        sample[word] = True
    pdata.append((sample, 'POSITIVE'))
# List of (word dictionary, label) pairs, one per negative review
ndata = []
# Negative reviews from the same corpus
fileids = nc.movie_reviews.fileids('neg')
for fileid in fileids:
    words = nc.movie_reviews.words(fileid)
    sample = {}
    for word in words:
        sample[word] = True
    ndata.append((sample, 'NEGATIVE'))
# Split into training and test sets; cross-validation is not considered here
pnumb, nnumb = int(0.8 * len(pdata)), int(0.8 * len(ndata))
train_data = pdata[:pnumb] + ndata[:nnumb]
test_data = pdata[pnumb:] + ndata[nnumb:]
# Train a naive Bayes classifier using NLTK's implementation
model = cf.NaiveBayesClassifier.train(train_data)
# Evaluate the model's accuracy on the test set
ac = cu.accuracy(model, test_data)
print('%.2f%%' % round(ac * 100, 2))
# Most informative features
tops = model.most_informative_features()
for top in tops[:5]:
    print(top[0])
reviews = [
    'It is an amazing movie.',
    'This is a dull movie. I would never recommend it to anyone.',
    'The cinematography is pretty great in this movie.',
    'The direction was terrible and the story was all over the place.']
sents, probs = [], []
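# Build the word dictionaries; no tokenizer is used here, just str.split().
for review in reviews:
    words = review.split()
    sample = {}
    for word in words:
        sample[word] = True
    # Probability distribution over the classes
    pcls = model.prob_classify(sample)
    # Most probable class
    sent = pcls.max()
    # Probability of that class, i.e. the model's confidence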
    prob = pcls.prob(sent)
    sents.append(sent)
    probs.append(prob)
for review, sent, prob in zip(
        reviews, sents, probs):
    print(review, '->', sent, '%.2f%%' % round(
        prob * 100, 2))
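
The idea of linking words to sentiments through supervised learning can be sketched without NLTK. The following is a minimal multinomial naive Bayes with add-one smoothing. It is not NLTK's implementation, and the four training sentences, like the function names train and classify, are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (words, label) pairs. Returns the fitted counts."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in samples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(model, words):
    """Return the label with the highest log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in words:
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training data.
toy = [
    ("an amazing and great movie".split(), "POSITIVE"),
    ("great story and great acting".split(), "POSITIVE"),
    ("a dull and terrible movie".split(), "NEGATIVE"),
    ("terrible story and dull acting".split(), "NEGATIVE"),
]
model = train(toy)
print(classify(model, "a great movie".split()))
print(classify(model, "dull story".split()))
```

Words such as 'great' and 'dull' decide the outcome because they occur far more often under one label than the other, while words like 'and' occur equally under both and cancel out.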

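Topic Extraction

Code: topic.py

For text classification, a statistics-based classifier is usually chosen for training, because natural language has pronounced statistical characteristics.

Code: doc.py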
1 2 3 4 5 6
2 3 0 0 1 4
0 4 1 1 2 2
Gender Recognition
Code: gndr.py
