
NLTK01 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》

01 My take on NLTK

Most introductions to NLP mention the NLTK library, which had me assuming it was an indispensable powerhouse. Having worked through it, I find NLTK of limited use for real projects: much of it attacks NLP from the semantics and grammar side, which does not feel very reliable in practice, and it ships with few Chinese corpora. Most books and blog posts covering NLTK are also rather dated.
Although 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》 was first published in June 2017, much of its content is already dated, the material is almost entirely English-language, and the book contains many typesetting and textual errors.
《Python自然語言處理》 (Natural Language Processing with Python) by Steven Bird, Ewan Klein & Edward Loper, translated by 陳濤, 張旭, 催楊, 劉海平, gives a more comprehensive introduction: its code is extremely dated, but its coverage of the concepts is thorough.

02 Organized code from selected chapters

Below is the code from chapters 1, 2, 3, 4, 6 and 8, cleaned up so that it runs under Windows 10 with nltk 3.2.4 and Python 3.5.3/3.6.1. Be sure to download the nltk_data packages, and install any missing libraries as you go. In particular, pywin32-221.win-amd64-py3.6.exe / pywin32-221.win-amd64-py3.5.exe has to be downloaded manually [https://sourceforge.net/projects/pywin32/files/pywin32/Build%20221/].
Any data that needs to be downloaded is linked or described in the code itself.
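If the full nltk.download() GUI keeps failing, the individual packages can be fetched one by one instead. The package list below is my own summary of what the later chapters appear to need, so adjust it as required:

import nltk
# download only the corpora/models used by these scripts
for pkg in ['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger',
            'maxent_ne_chunker', 'words', 'brown', 'ieer']:
    nltk.download(pkg)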

02.01 Introduction to Natural Language Processing (NLTKEssentials01.py)

# 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》 Chapter 01: Introduction to Natural Language Processing
# win10 nltk3.2.4 python3.5.3/python3.6.1
# filename: NLTKEssentials01.py # Introduction to Natural Language Processing

import nltk
#nltk.download() # a full download takes a long time and may need several attempts before it succeeds
print("Python and NLTK installed successfully")
'''Python and NLTK installed successfully'''

# 1.2 先從Python開始
# 1.2.1 列表
lst = [1, 2, 3, 4]
print(lst)
'''[1, 2, 3, 4]'''
# print('Fisrt element: ' + lst[0])
# '''TypeError: must be str, not int'''
print('Fisrt element: ' + str(lst[0]))
'''Fisrt element: 1'''
print('First element: ' + str(lst[0]))
print('last element: ' + str(lst[-1]))
print('first three elemenets: ' + str(lst[0:2]))
print('last three elements: ' + str(lst[-3:]))
'''
First element: 1
last element: 4
first three elemenets: [1, 2]
last three elements: [2, 3, 4]
'''

# 1.2.2 Helping yourself
print(dir(lst))
'''
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
'''
print(' , '.join(dir(lst)))
'''
__add__ , __class__ , __contains__ , __delattr__ , __delitem__ , __dir__ , __doc__ , __eq__ , __format__ , __ge__ , __getattribute__ , __getitem__ , __gt__ , __hash__ , __iadd__ , __imul__ , __init__ , __init_subclass__ , __iter__ , __le__ , __len__ , __lt__ , __mul__ , __ne__ , __new__ , __reduce__ , __reduce_ex__ , __repr__ , __reversed__ , __rmul__ , __setattr__ , __setitem__ , __sizeof__ , __str__ , __subclasshook__ , append , clear , copy , count , extend , index , insert , pop , remove , reverse , sort
'''
help(lst.index)
'''
Help on built-in function index:

index(...) method of builtins.list instance
    L.index(value, [start, [stop]]) -> integer -- return first index of value.
    Raises ValueError if the value is not present.
'''

mystring = "Monty Python ! And the holy Grail ! \n"
print(mystring.split())
'''['Monty', 'Python', '!', 'And', 'the', 'holy', 'Grail', '!']'''
print(mystring.strip())
'''Monty Python ! And the holy Grail !'''
print(mystring.lstrip())
'''Monty Python ! And the holy Grail ! '''
print(mystring.rstrip())
'''Monty Python ! And the holy Grail !'''
print(mystring.upper())
'''MONTY PYTHON ! AND THE HOLY GRAIL ! '''
print(mystring.replace('!', ''))
'''Monty Python  And the holy Grail  '''

# 1.2.3 Regular expressions
import re
if re.search('Python', mystring):
    print("We found python ")
else:
    print("No ")
'''We found python '''
import re
print(re.findall('!', mystring))
'''['!', '!']'''

# 1.2.4 Dictionaries
word_freq = {}
for tok in mystring.split():
    if tok in word_freq:
        word_freq[tok] += 1
    else:
        word_freq[tok] = 1
print(word_freq)
'''{'Monty': 1, 'Python': 1, '!': 2, 'And': 1, 'the': 1, 'holy': 1, 'Grail': 1}'''

# 1.2.5 Writing functions
import sys
def wordfreq(mystring):
    ''' Function to generate the frequency distribution of the given text '''
    print(mystring)
    word_freq = {}
    for tok in mystring.split():
        if tok in word_freq:
            word_freq[tok] += 1
        else:
            word_freq[tok] = 1
    print(word_freq)

def main():
    str = "This is my fist python program"
    wordfreq(str)

if __name__ == '__main__':
    main()
'''
This is my fist python program
{'This': 1, 'is': 1, 'my': 1, 'fist': 1, 'python': 1, 'program': 1}
'''

# 1.3 Diving into NLTK
from urllib import request
response = request.urlopen('http://python.org/')
html = response.read()
html = html.decode('utf-8')
print(len(html))
'''48141'''
#print(html)
tokens = [tok for tok in html.split()]
print("Total no of tokens :" + str(len(tokens)))
'''Total no of tokens :2901'''
print(tokens[0: 100])
'''
['<!doctype', 'html>', '<!--[if', 'lt', 'IE', '7]>', '<html', 'class="no-js', 'ie6', 'lt-ie7', 'lt-ie8', 'lt-ie9">', '<![endif]-->', '<!--[if', 'IE', '7]>', '<html', 'class="no-js', 'ie7', 'lt-ie8', 'lt-ie9">', '<![endif]-->', '<!--[if', 'IE', '8]>', '<html', 'class="no-js', 'ie8', 'lt-ie9">', '<![endif]-->', '<!--[if', 'gt', 'IE', '8]><!--><html', 'class="no-js"', 'lang="en"', 'dir="ltr">', '<!--<![endif]-->', '<head>', '<meta', 'charset="utf-8">', '<meta', 'http-equiv="X-UA-Compatible"', 'content="IE=edge">', '<link', 'rel="prefetch"', 'href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">', '<meta', 'name="application-name"', 'content="Python.org">', '<meta', 'name="msapplication-tooltip"', 'content="The', 'official', 'home', 'of', 'the', 'Python', 'Programming', 'Language">', '<meta', 'name="apple-mobile-web-app-title"', 'content="Python.org">', '<meta', 'name="apple-mobile-web-app-capable"', 'content="yes">', '<meta', 'name="apple-mobile-web-app-status-bar-style"', 'content="black">', '<meta', 'name="viewport"', 'content="width=device-width,', 'initial-scale=1.0">', '<meta', 'name="HandheldFriendly"', 'content="True">', '<meta', 'name="format-detection"', 'content="telephone=no">', '<meta', 'http-equiv="cleartype"', 'content="on">', '<meta', 'http-equiv="imagetoolbar"', 'content="false">', '<script', 'src="/static/js/libs/modernizr.js"></script>', '<link', 'href="/static/stylesheets/style.css"', 'rel="stylesheet"', 'type="text/css"', 'title="default"', '/>', '<link', 'href="/static/stylesheets/mq.css"', 'rel="stylesheet"', 'type="text/css"', 'media="not', 'print,', 'braille,']
'''
import re
tokens = re.split('\W+', html)
print(len(tokens))
'''6131'''
print(tokens[0: 100])
'''
['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no', 'js', 'ie6', 'lt', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '7', 'html', 'class', 'no', 'js', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '8', 'html', 'class', 'no', 'js', 'ie8', 'lt', 'ie9', 'endif', 'if', 'gt', 'IE', '8', 'html', 'class', 'no', 'js', 'lang', 'en', 'dir', 'ltr', 'endif', 'head', 'meta', 'charset', 'utf', '8', 'meta', 'http', 'equiv', 'X', 'UA', 'Compatible', 'content', 'IE', 'edge', 'link', 'rel', 'prefetch', 'href', 'ajax', 'googleapis', 'com', 'ajax', 'libs', 'jquery', '1', '8', '2', 'jquery', 'min', 'js', 'meta', 'name', 'application', 'name', 'content', 'Python', 'org', 'meta', 'name', 'msapplication', 'tooltip', 'content', 'The', 'official']
'''

'''pip3 install bs4 lxml'''
import nltk
from bs4 import BeautifulSoup
#clean = nltk.clean_html(html)
#tokens = [tok for tok in clean.split()]
soup = BeautifulSoup(html, "lxml")
clean = soup.get_text()
tokens = [tok for tok in clean.split()]
print(tokens[:100])
'''
['Welcome', 'to', 'Python.org', '{', '"@context":', '"http://schema.org",', '"@type":', '"WebSite",', '"url":', '"https://www.python.org/",', '"potentialAction":', '{', '"@type":', '"SearchAction",', '"target":', '"https://www.python.org/search/?q={search_term_string}",', '"query-input":', '"required', 'name=search_term_string"', '}', '}', 'var', '_gaq', '=', '_gaq', '||', '[];', "_gaq.push(['_setAccount',", "'UA-39055973-1']);", "_gaq.push(['_trackPageview']);", '(function()', '{', 'var', 'ga', '=', "document.createElement('script');", 'ga.type', '=', "'text/javascript';", 'ga.async', '=', 'true;', 'ga.src', '=', "('https:'", '==', 'document.location.protocol', '?', "'https://ssl'", ':', "'http://www')", '+', "'.google-analytics.com/ga.js';", 'var', 's', '=', "document.getElementsByTagName('script')[0];", 's.parentNode.insertBefore(ga,', 's);', '})();', 'Notice:', 'While', 'Javascript', 'is', 'not', 'essential', 'for', 'this', 'website,', 'your', 'interaction', 'with', 'the', 'content', 'will', 'be', 'limited.', 'Please', 'turn', 'Javascript', 'on', 'for', 'the', 'full', 'experience.', 'Skip', 'to', 'content', '', 'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', '', 'The', 'Python', 'Network']
'''
import operator
freq_dis = {}
for tok in tokens:
    if tok in freq_dis:
        freq_dis[tok] += 1
    else:
        freq_dis[tok] = 1
sorted_freq_dist = sorted(freq_dis.items(), key = operator.itemgetter(1), reverse = True)
print(sorted_freq_dist[:25])
'''
[('Python', 60), ('>>>', 24), ('and', 22), ('is', 18), ('the', 18), ('to', 17), ('of', 15), ('=', 14), ('Events', 11), ('News', 11), ('a', 10), ('for', 10), ('More', 9), ('#', 9), ('3', 8), ('in', 8), ('Community', 7), ('with', 7), ('...', 7), ('Docs', 6), ('Guide', 6), ('Software', 6), ('now', 5), ('that', 5), ('The', 5)]
'''
import nltk
Freq_dist_nltk = nltk.FreqDist(tokens)
print(Freq_dist_nltk)
'''<FreqDist with 600 samples and 1105 outcomes>'''
for k, v in Freq_dist_nltk.items():
    print(str(k) + ':' + str(v))
'''
This:1
[fruit.upper():1
Forums:2
Check:1
...
GUI:1
Intuitive:1
X:2
growth:1
advance:1
'''
# below is the plot for the frequency distributions
# plot the word-frequency distribution (cumulative=False)
Freq_dist_nltk.plot(50, cumulative=False)

## stop word removal
#stopwords = [word.strip().lower() for word in open("PATH/english.stop.txt")]
#clean_tokens = [tok for tok in tokens if len(tok.lower()) > 1 and (tok.lower() not in stopwords)]
#Freq_dist_nltk = nltk.FreqDist(clean_tokens)
#Freq_dist_nltk.plot(50, cumulative = False)

02.02 Text Wrangling and Cleansing (NLTKEssentials02.py)

# 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》 Chapter 02: Text Wrangling and Cleansing
# win10 nltk3.2.4 python3.5.3/python3.6.1
# filename: NLTKEssentials02.py # Text Wrangling and Cleansing

# tokenization, stemming, lemmatization, stop word removal

# 2.1 Text wrangling
'''
# examples.csv
"test01",99
"test02",999
"test03",998
"test04",997
"test05",996
'''
import csv
with open('examples.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter = ',', quotechar = '"')
    for line in reader:
        print(line[1])
'''
99
999
998
997
996
'''

'''
# examples.json
{
  "array": [1, 2, 3, 4],
  "boolean": true,
  "object": {"a": "b"},
  "string": "Hello, World"
}
'''
import json
jsonfile = open('examples.json')
data = json.load(jsonfile)
print(data['string'])
'''Hello, World'''
with open('examples.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
    print(data['string'])
'''Hello, World'''

# 2.2 Text cleansing
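# The book treats cleansing in prose only; this is a small illustrative sketch of my own
# (not from the book): strip markup with a regex, drop punctuation, normalise case/whitespace.
import re
raw = "<p>Hello,   World! This is <b>noisy</b> text.</p>"   # made-up sample string
no_tags = re.sub(r'<[^>]+>', ' ', raw)        # remove HTML tags
no_punct = re.sub(r'[^\w\s]', ' ', no_tags)   # remove punctuation
cleaned = ' '.join(no_punct.lower().split())  # collapse whitespace and lowercase
print(cleaned)
'''hello world this is noisy text'''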

# 2.3 Sentence splitting
import nltk
inputstring = 'This is an examples sent. The sentence splitter will split on sent markers. Ohh really !!'
from nltk.tokenize import sent_tokenize
#all_sent = sent_tokenize(inputstring, language="english")
all_sent = sent_tokenize(inputstring)
print(all_sent)

import nltk.tokenize.punkt
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
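# quick usage check (my addition): the untrained Punkt tokenizer splits on the same sentence markers
print(tokenizer.tokenize(inputstring))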


# 2.4 Tokenization  http://text-processing.com/demo
s = "Hi Everyone ! hola gr8"
print(s.split())
from nltk.tokenize import word_tokenize
word_tokenize(s)

from nltk.tokenize import regexp_tokenize, wordpunct_tokenize, blankline_tokenize
regexp_tokenize(s, pattern = '\w+')
regexp_tokenize(s, pattern = '\d+')
wordpunct_tokenize(s)
blankline_tokenize(s)


# 2.5 Stemming
# eat eating eaten eats ==> eat
# stemming is hard to get right for Chinese and Japanese
import nltk
from nltk.stem import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer
pst = PorterStemmer()
lst = LancasterStemmer()
print(lst.stem("eating"))
'''eat'''
print(pst.stem("shopping"))
'''shop'''


# 2.6 Lemmatization; a lemma is the root form of a word
from nltk.stem import WordNetLemmatizer
wlem = WordNetLemmatizer()
wlem.lemmatize("ate")
# Resource 'corpora/wordnet.zip/wordnet/' not found.  Please use the NLTK Downloader to obtain the resource:  >>> nltk.download()
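# Note (my addition): lemmatize() assumes the noun POS by default, which is why "ate" comes
# back unchanged; passing the verb POS returns the expected lemma.
print(wlem.lemmatize("ate"))           # 'ate' (treated as a noun)
print(wlem.lemmatize("ate", pos="v"))  # 'eat'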


# 2.7 Stop word removal
import nltk
from nltk.corpus import stopwords
stoplist = stopwords.words('english')
text = "This is just a test"
cleanwordlist = [word for word in text.split() if word not in stoplist]
print(cleanwordlist)
'''['This', 'test']'''


# 2.8 Rare word removal
'''
import nltk
token = text.split()
freq_dist = nltk.FreqDist(token)
# FreqDist keys are not sorted by count in Python 3, so take the 50 least common explicitly
rarewords = [word for word, count in freq_dist.most_common()[-50:]]
after_rare_words = [word for word in token if word not in rarewords]
print(after_rare_words)
'''

# 2.9 Spelling correction (spellchecker)
from nltk.metrics import edit_distance
print(edit_distance("rain", "shine")) # 3
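# A toy spelling-correction sketch (my addition, not from the book): pick, from a small
# illustrative candidate list, the word closest to the misspelling by edit distance.
candidates = ['rain', 'shine', 'brain', 'rainy']
misspelt = 'raiin'
print(min(candidates, key=lambda w: edit_distance(misspelt, w)))  # 'rain'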

02.03 Part-of-Speech Tagging (NLTKEssentials03.py)

# 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》 Chapter 03: Part-of-Speech Tagging
# win10 nltk3.2.4 python3.5.3/python3.6.1
# filename: NLTKEssentials03.py # Part-of-Speech Tagging

# 3.1 Part-of-speech tagging
# part of speech (POS)
# Penn Treebank

import nltk
from nltk import word_tokenize
s = "I was watching TV"
print(nltk.pos_tag(word_tokenize(s)))


tagged = nltk.pos_tag(word_tokenize(s))
allnoun = [word for word, pos in tagged if pos in ['NN', 'NNP']]
print(allnoun)


# 3.1.1 The Stanford tagger
# https://nlp.stanford.edu/software/stanford-postagger-full-2017-06-09.zip
from nltk.tag.stanford import StanfordPOSTagger
import nltk
stan_tagger = StanfordPOSTagger('D:/nltk_data/stanford-postagger-full-2017-06-09/models/english-bidirectional-distsim.tagger',
                                'D:/nltk_data/stanford-postagger-full-2017-06-09/stanford-postagger.jar')
s = "I was watching TV"
tokens = nltk.word_tokenize(s)
stan_tagger.tag(tokens)

# 3.1.2 A closer look at taggers
from nltk.corpus import brown
import nltk
tags = [tag for (word, tag) in brown.tagged_words(categories = 'news')]
print(nltk.FreqDist(tags))

brown_tagged_sents = brown.tagged_sents(categories = 'news')
default_tagger = nltk.DefaultTagger('NN')
print(default_tagger.evaluate(brown_tagged_sents))


# 3.1.3 Sequential taggers
# 1 N-gram taggers
from nltk.tag import UnigramTagger
from nltk.tag import DefaultTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
train_data = brown_tagged_sents[:int(len(brown_tagged_sents) * 0.9)]
test_data = brown_tagged_sents[int(len(brown_tagged_sents) * 0.9):]
unigram_tagger = UnigramTagger(train_data, backoff=default_tagger)
print(unigram_tagger.evaluate(test_data))
bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
print(bigram_tagger.evaluate(test_data))
trigram_tagger = TrigramTagger(train_data, backoff=bigram_tagger)
print(trigram_tagger.evaluate(test_data))

# 2 The regular expression tagger
from nltk.tag.sequential import RegexpTagger
regexp_tagger = RegexpTagger(
    [(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
     (r'(The|the|A|a|An|an)$', 'AT'),  # articles
     (r'.*able$', 'JJ'), # adjectives
     (r'.*ness$', 'NN'), # nouns formed from adj
     (r'.*ly$', 'RB'),   # adverbs
     (r'.*s$', 'NNS'),   # plural nouns
     (r'.*ing$', 'VBG'), # gerunds
     (r'.*ed$', 'VBD'),  # past tense verbs
     (r'.*', 'NN')       # nouns (default)
    ])
print(regexp_tagger.evaluate(test_data))

# 3.1.4 The Brill tagger
# 3.1.5 Machine learning based taggers
# maximum entropy classifier (MEC)
# hidden Markov model (HMM)
# conditional random field (CRF)
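# The book stops at naming these models. As an optional sketch (my addition, not the book's),
# NLTK's built-in HMM tagger can be trained on the same Brown split prepared above; training
# may take a while, and unseen words hurt its accuracy.
from nltk.tag.hmm import HiddenMarkovModelTagger
hmm_tagger = HiddenMarkovModelTagger.train(train_data)
print(hmm_tagger.tag(nltk.word_tokenize("I was watching TV")))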

# 3.2 Named entity recognition (NER)
# the NER tagger
import nltk
from nltk import ne_chunk
sent = "Mark is studying at Stanford University in California"
print(ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary = False))

from nltk.tag.stanford import StanfordNERTagger
# https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip
st = StanfordNERTagger('D:/nltk_data/stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz',
                       'D:/nltk_data/stanford-ner-2017-06-09/stanford-ner.jar')
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())

02.04 Parsing Text Structure (NLTKEssentials04.py)

# 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》 Chapter 04: Parsing Text Structure
# win10 nltk3.2.4 python3.5.3/python3.6.1
# filename: NLTKEssentials04.py # Parsing Text Structure

# 4.1 Shallow versus deep parsing
# CFG (context-free grammar)
# PCFG (probabilistic context-free grammar)
# shallow parsing
# deep parsing

# 4.2 Two approaches to parsing
# rule-based
# probability-based

# 4.3 Why do we need parsing?
# syntactic parser
'''
import nltk
from nltk import CFG
toy_grammar = nltk.CFG.fromstring(
"""
S -> NP VP  # S indicate the entire sentence
VP -> V NP  # VP is verb phrase the
V -> "eats" | "drinks" # V is verb
NP -> Det N # NP is noun phrase (chunk that has noun in it)
Det -> "a" | "an" | "the" # Det is determiner used in the sentences
N -> "president" | "Obama" | "apple" | "coke" # N some example nouns
""")
toy_grammar.productions()
'''
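# A minimal parsing sketch (my addition): the same toy grammar, rewritten without the inline
# comments so that CFG.fromstring accepts it, parsed with NLTK's chart parser.
import nltk
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
V -> "eats" | "drinks"
NP -> Det N
Det -> "a" | "an" | "the"
N -> "president" | "Obama" | "apple" | "coke"
""")
toy_parser = nltk.ChartParser(toy_grammar)
for tree in toy_parser.parse("the president eats an apple".split()):
    print(tree)  # prints the parse tree for the sentence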

# 4.4 Different types of parsers
# 4.4.1 The recursive descent parser
# 4.4.2 The shift-reduce parser
# 4.4.3 The chart parser
# 4.4.4 The regexp parser
import nltk
from nltk.chunk.regexp import *
chunk_rules = ChunkRule("<.*>+", "chunk everything")
reg_parser = RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}               # Preposition
V: {<V.*>}              # Verb
PP: {<P> <NP>}          # PP -> P NP
VP: {<V> <NP|PP>*}      # VP -> V (NP|PP)*
''')
test_sent = "Mr. Obama played a big role in the Health insurance bill"
test_sent_pos = nltk.pos_tag(nltk.word_tokenize(test_sent))
parsed_out = reg_parser.parse(test_sent_pos)
print(parsed_out)

# 4.5 Dependency parsing (DP)
# probabilistic, projective dependency parser
from nltk.parse.stanford import StanfordParser
# https://nlp.stanford.edu/software/stanford-parser-full-2017-06-09.zip
english_parser = StanfordParser('D:/nltk_data/stanford-parser-full-2017-06-09/stanford-parser.jar',
                                'D:/nltk_data/stanford-parser-full-2017-06-09/stanford-parser-3.8.0-models.jar')
english_parser.raw_parse_sents(["this is the english parser test"])

# 4.6 Chunking
'''
from nltk.chunk.regexp import *
test_sent = "The prime minister announced he had asked the chief government whip, \
Philip Ruddock, to call a special party room meeting for 9am on Monday to consider the spill motion."
test_sent_pos = nltk.pos_tag(nltk.word_tokenize(test_sent))
rule_vp = ChunkRule(r'(<VB.*>)?(<VB.*>)+(<PRP>)?', 'Chunk VPs')
parser_vp = RegexpChunkParser([rule_vp], chunk_label = 'VP')
print(parser_vp.parse(test_sent_pos))
rule_np = ChunkRule(r'(<DT>?<RB>?)?<JJ|CD>*(<JJ|CD><,>)*(<NN.*>)+', 'Chunk NPs')
parser_np = RegexpChunkParser([rule_np], chunk_label="NP")
print(parser_np.parse(test_sent_pos))
'''

# 4.7 Information extraction
# 4.7.1 Named entity recognition (NER)
f = open("D:/nltk_data/ner_sample.txt")#  absolute path for the file of text for which we want NER
text = f.read()
sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
for sent in tagged_sentences:
    print(nltk.ne_chunk(sent))

# 4.7.2 Relation extraction
import re
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus = 'ieer', pattern = IN):
        print(nltk.sem.rtuple(rel))

02.06 Text Classification (NLTKEssentials06.py)

# 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》 Chapter 06: Text Classification
# win10 nltk3.2.4 python3.5.3/python3.6.1
# filename: NLTKEssentials06.py # Text Classification

# 6.2 Text classification
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import csv
def preprocessing(text):
    #text = text.decode("utf8")
    # tokenize into words
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # remove stopwords
    stop = stopwords.words('english')
    tokens = [token for token in tokens if token not in stop]
    # remove words less than three letters
    tokens = [word for word in tokens if len(word) >= 3]
    # lower capitalization
    tokens = [word.lower() for word in tokens]
    # lemmatize
    lmtzr = WordNetLemmatizer()
    tokens = [lmtzr.lemmatize(word) for word in tokens]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

# https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
sms = open('D:/nltk_data/SMSSpamCollection', encoding='utf8') # check the structure of this file!
sms_data = []
sms_labels = []
csv_reader = csv.reader(sms, delimiter = '\t')
for line in csv_reader:
    # adding the label (ham/spam) from the first column
    sms_labels.append(line[0])
    # adding the message text, cleaned by the preprocessing method above
    sms_data.append(preprocessing(line[1]))
sms.close()

# 6.3 Sampling
import sklearn
import numpy as np
trainset_size = int(round(len(sms_data)*0.70))
# this threshold gives a 70:30 train/test split
print('The training set size for this classifier is ' + str(trainset_size) + '\n')
x_train = np.array([''.join(el) for el in sms_data[0: trainset_size]])
y_train = np.array([el for el in sms_labels[0: trainset_size]])
x_test = np.array([''.join(el) for el in sms_data[trainset_size:len(sms_data)]])
y_test = np.array([el for el in sms_labels[trainset_size:len(sms_labels)]])

print(x_train)
print(y_train)

from sklearn.feature_extraction.text import CountVectorizer
sms_exp = []
for line in sms_data:
    sms_exp.append(preprocessing(line))
vectorizer = CountVectorizer(min_df = 1, encoding='utf-8')
X_exp = vectorizer.fit_transform(sms_exp)
print("||".join(vectorizer.get_feature_names()))
print(X_exp.toarray())

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df = 2, ngram_range=(1, 2),
                             stop_words = 'english', strip_accents = 'unicode', norm = 'l2')
X_train = vectorizer.fit_transform(x_train)
X_test = vectorizer.transform(x_test)

# 6.3.1 Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
clf = MultinomialNB().fit(X_train, y_train)
y_nb_predicted = clf.predict(X_test)
print(y_nb_predicted)
print('\n confusion_matrix \n')
#cm = confusion_matrix(y_test, y_pred)
cm = confusion_matrix(y_test, y_nb_predicted)
print(cm)
print('\n Here is the classification report:')
print(classification_report(y_test, y_nb_predicted))

feature_names = vectorizer.get_feature_names()
coefs = clf.coef_
intercept = clf.intercept_
coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
n = 10
top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
for (coef_1, fn_1), (coef_2, fn_2) in top:
    print('\t%.4f\t%-15s\t\t%.4f\t%-15s' %(coef_1, fn_1, coef_2, fn_2))

# 6.3.2 Decision trees
from sklearn import tree
clf = tree.DecisionTreeClassifier().fit(X_train.toarray(), y_train)
y_tree_predicted = clf.predict(X_test.toarray())
print(y_tree_predicted)
print('\n Here is the classification report:')
print(classification_report(y_test, y_tree_predicted))

# 6.3.3 Stochastic gradient descent
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
clf = SGDClassifier(alpha = 0.0001, n_iter=50).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('\n Here is the classification report:')
print(classification_report(y_test, y_pred))
print(' \n confusion_matrix \n')
cm = confusion_matrix(y_test, y_pred)
print(cm)

# 6.3.4 Logistic regression
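# The book lists logistic regression without code; a minimal sketch (my addition, not the
# book's) using scikit-learn's LogisticRegression on the same TF-IDF features:
from sklearn.linear_model import LogisticRegression
logreg_clf = LogisticRegression().fit(X_train, y_train)
y_logreg_predicted = logreg_clf.predict(X_test)
print('\n Here is the classification report:')
print(classification_report(y_test, y_logreg_predicted))
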
# 6.3.5 Support vector machines
from sklearn.svm import LinearSVC
svm_classifier = LinearSVC().fit(X_train, y_train)
y_svm_predicted = svm_classifier.predict(X_test)
print('\n Here is the classification report:')
print(classification_report(y_test, y_svm_predicted))
cm = confusion_matrix(y_test, y_svm_predicted)
print(cm)

# 6.4 The random forest
from sklearn.ensemble import RandomForestClassifier
RF_clf = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)
predicted = RF_clf.predict(X_test)
print('\n Here is the classification report:')
print(classification_report(y_test, predicted))
cm = confusion_matrix(y_test, predicted)
print(cm)

# 6.5 Text clustering
# K-means
from sklearn.cluster import KMeans, MiniBatchKMeans
from collections import defaultdict
true_k = 5
km = KMeans(n_clusters = true_k, init='k-means++', max_iter=100, n_init= 1)
kmini = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1, init_size=1000, batch_size=1000, verbose=2)
km_model = km.fit(X_train)
kmini_model = kmini.fit(X_train)
print("For K-mean clustering ")
clustering = defaultdict(list)
for idx, label in enumerate(km_model.labels_):
    clustering[label].append(idx)
print("For K-mean Mini batch clustering ")
clustering = defaultdict(list)
for idx, label in enumerate(kmini_model.labels_):
    clustering[label].append(idx)

# 6.6 Topic modeling in text
# https://pypi.python.org/pypi/gensim#downloads
import gensim
from gensim import corpora, models, similarities
from itertools import chain
import nltk
from nltk.corpus import stopwords
from operator import itemgetter
import re
documents = [document for document in sms_data]
stoplist = stopwords.words('english')
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
print(texts)


dictionary = corpora.Dictionary(texts)