
NLTK01 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》

01 My take on NLTK

Most introductions to NLP mention the NLTK library, which had me assuming it was an indispensable powerhouse. Having worked through it, I find NLTK of limited use for real projects: much of it attacks NLP from the semantics and grammar side, which does not feel very reliable in practice, and it ships with few Chinese corpora. Most books and blog posts covering NLTK are also rather dated.
Although 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》 was first published in June 2017, much of its content is already dated, the material is almost entirely English-language, and the book contains many typesetting and textual errors.
《Python自然語言處理》 (Natural Language Processing with Python) by Steven Bird, Ewan Klein & Edward Loper, translated by 陳濤, 張旭, 催楊, 劉海平, gives a more comprehensive introduction: its code is extremely dated, but its coverage of the concepts is thorough.

02 Organized code from selected chapters

Below is the code from chapters 1, 2, 3, 4, 6 and 8, cleaned up so that it runs under Windows 10 with nltk 3.2.4 and Python 3.5.3/3.6.1. Be sure to download the nltk_data packages, and install any missing libraries as you go. In particular, pywin32-221.win-amd64-py3.6.exe / pywin32-221.win-amd64-py3.5.exe has to be downloaded manually [https://sourceforge.net/projects/pywin32/files/pywin32/Build%20221/].
Any data that needs to be downloaded is linked or described in the code itself.
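If the full nltk.download() GUI keeps failing, the individual packages can be fetched one by one instead. The package list below is my own summary of what the later chapters appear to need, so adjust it as required:

import nltk
# download only the corpora/models used by these scripts
for pkg in ['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger',
            'maxent_ne_chunker', 'words', 'brown', 'ieer']:
    nltk.download(pkg)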

02.01 Introduction to Natural Language Processing (NLTKEssentials01.py)

# 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》 Chapter 01: Introduction to Natural Language Processing
# win10 nltk3.2.4 python3.5.3/python3.6.1
# filename: NLTKEssentials01.py # Introduction to Natural Language Processing

import nltk
#nltk.download() # a full download takes a long time and may need several attempts before it succeeds
print("Python and NLTK installed successfully")
'''Python and NLTK installed successfully'''

# 1.2 先從Python開始
# 1.2.1 列表
lst = [1, 2, 3, 4]
print(lst)
'''[1, 2, 3, 4]'''
# print('Fisrt element: ' + lst[0])
# '''TypeError: must be str, not int'''
print('Fisrt element: ' + str(lst[0]))
'''Fisrt element: 1'''
print('First element: ' + str(lst[0]))
print('last element: ' + str(lst[-1]))
print('first three elemenets: ' + str(lst[0:2]))
print('last three elements: ' + str(lst[-3:]))
'''
First element: 1
last element: 4
first three elemenets: [1, 2]
last three elements: [2, 3, 4]
'''

# 1.2.2 Helping yourself
print(dir(lst))
'''
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
'''
print(' , '.join(dir(lst)))
'''
__add__ , __class__ , __contains__ , __delattr__ , __delitem__ , __dir__ , __doc__ , __eq__ , __format__ , __ge__ , __getattribute__ , __getitem__ , __gt__ , __hash__ , __iadd__ , __imul__ , __init__ , __init_subclass__ , __iter__ , __le__ , __len__ , __lt__ , __mul__ , __ne__ , __new__ , __reduce__ , __reduce_ex__ , __repr__ , __reversed__ , __rmul__ , __setattr__ , __setitem__ , __sizeof__ , __str__ , __subclasshook__ , append , clear , copy , count , extend , index , insert , pop , remove , reverse , sort
'''
help(lst.index)
'''
Help on built-in function index:

index(...) method of builtins.list instance
    L.index(value, [start, [stop]]) -> integer -- return first index of value.
    Raises ValueError if the value is not present.
'''

mystring = "Monty Python ! And the holy Grail ! \n"
print(mystring.split())
'''['Monty', 'Python', '!', 'And', 'the', 'holy', 'Grail', '!']'''
print(mystring.strip())
'''Monty Python ! And the holy Grail !'''
print(mystring.lstrip())
'''Monty Python ! And the holy Grail ! '''
print(mystring.rstrip())
'''Monty Python ! And the holy Grail !'''
print(mystring.upper())
'''MONTY PYTHON ! AND THE HOLY GRAIL ! '''
print(mystring.replace('!', ''))
'''Monty Python  And the holy Grail  '''

# 1.2.3 Regular expressions
import re
if re.search('Python', mystring):
    print("We found python ")
else:
    print("No ")
'''We found python '''
import re
print(re.findall('!', mystring))
'''['!', '!']'''

# 1.2.4 Dictionaries
word_freq = {}
for tok in mystring.split():
    if tok in word_freq:
        word_freq[tok] += 1
    else:
        word_freq[tok] = 1
print(word_freq)
'''{'Monty': 1, 'Python': 1, '!': 2, 'And': 1, 'the': 1, 'holy': 1, 'Grail': 1}'''

# 1.2.5 Writing functions
import sys
def wordfreq(mystring):
    ''' Function to generate the frequency distribution of the given text '''
    print(mystring)
    word_freq = {}
    for tok in mystring.split():
        if tok in word_freq:
            word_freq[tok] += 1
        else:
            word_freq[tok] = 1
    print(word_freq)

def main():
    str = "This is my fist python program"
    wordfreq(str)

if __name__ == '__main__':
    main()
'''
This is my fist python program
{'This': 1, 'is': 1, 'my': 1, 'fist': 1, 'python': 1, 'program': 1}
'''

# 1.3 Diving into NLTK
from urllib import request
response = request.urlopen('http://python.org/')
html = response.read()
html = html.decode('utf-8')
print(len(html))
'''48141'''
#print(html)
tokens = [tok for tok in html.split()]
print("Total no of tokens :" + str(len(tokens)))
'''Total no of tokens :2901'''
print(tokens[0: 100])
'''
['<!doctype', 'html>', '<!--[if', 'lt', 'IE', '7]>', '<html', 'class="no-js', 'ie6', 'lt-ie7', 'lt-ie8', 'lt-ie9">', '<![endif]-->', '<!--[if', 'IE', '7]>', '<html', 'class="no-js', 'ie7', 'lt-ie8', 'lt-ie9">', '<![endif]-->', '<!--[if', 'IE', '8]>', '<html', 'class="no-js', 'ie8', 'lt-ie9">', '<![endif]-->', '<!--[if', 'gt', 'IE', '8]><!--><html', 'class="no-js"', 'lang="en"', 'dir="ltr">', '<!--<![endif]-->', '<head>', '<meta', 'charset="utf-8">', '<meta', 'http-equiv="X-UA-Compatible"', 'content="IE=edge">', '<link', 'rel="prefetch"', 'href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">', '<meta', 'name="application-name"', 'content="Python.org">', '<meta', 'name="msapplication-tooltip"', 'content="The', 'official', 'home', 'of', 'the', 'Python', 'Programming', 'Language">', '<meta', 'name="apple-mobile-web-app-title"', 'content="Python.org">', '<meta', 'name="apple-mobile-web-app-capable"', 'content="yes">', '<meta', 'name="apple-mobile-web-app-status-bar-style"', 'content="black">', '<meta', 'name="viewport"', 'content="width=device-width,', 'initial-scale=1.0">', '<meta', 'name="HandheldFriendly"', 'content="True">', '<meta', 'name="format-detection"', 'content="telephone=no">', '<meta', 'http-equiv="cleartype"', 'content="on">', '<meta', 'http-equiv="imagetoolbar"', 'content="false">', '<script', 'src="/static/js/libs/modernizr.js"></script>', '<link', 'href="/static/stylesheets/style.css"', 'rel="stylesheet"', 'type="text/css"', 'title="default"', '/>', '<link', 'href="/static/stylesheets/mq.css"', 'rel="stylesheet"', 'type="text/css"', 'media="not', 'print,', 'braille,']
'''
import re
tokens = re.split('\W+', html)
print(len(tokens))
'''6131'''
print(tokens[0: 100])
'''
['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no', 'js', 'ie6', 'lt', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '7', 'html', 'class', 'no', 'js', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '8', 'html', 'class', 'no', 'js', 'ie8', 'lt', 'ie9', 'endif', 'if', 'gt', 'IE', '8', 'html', 'class', 'no', 'js', 'lang', 'en', 'dir', 'ltr', 'endif', 'head', 'meta', 'charset', 'utf', '8', 'meta', 'http', 'equiv', 'X', 'UA', 'Compatible', 'content', 'IE', 'edge', 'link', 'rel', 'prefetch', 'href', 'ajax', 'googleapis', 'com', 'ajax', 'libs', 'jquery', '1', '8', '2', 'jquery', 'min', 'js', 'meta', 'name', 'application', 'name', 'content', 'Python', 'org', 'meta', 'name', 'msapplication', 'tooltip', 'content', 'The', 'official']
'''

'''pip3 install bs4 lxml'''
import nltk
from bs4 import BeautifulSoup
#clean = nltk.clean_html(html)
#tokens = [tok for tok in clean.split()]
soup = BeautifulSoup(html, "lxml")
clean = soup.get_text()
tokens = [tok for tok in clean.split()]
print(tokens[:100])
'''
['Welcome', 'to', 'Python.org', '{', '"@context":', '"http://schema.org",', '"@type":', '"WebSite",', '"url":', '"https://www.python.org/",', '"potentialAction":', '{', '"@type":', '"SearchAction",', '"target":', '"https://www.python.org/search/?q={search_term_string}",', '"query-input":', '"required', 'name=search_term_string"', '}', '}', 'var', '_gaq', '=', '_gaq', '||', '[];', "_gaq.push(['_setAccount',", "'UA-39055973-1']);", "_gaq.push(['_trackPageview']);", '(function()', '{', 'var', 'ga', '=', "document.createElement('script');", 'ga.type', '=', "'text/javascript';", 'ga.async', '=', 'true;', 'ga.src', '=', "('https:'", '==', 'document.location.protocol', '?', "'https://ssl'", ':', "'http://www')", '+', "'.google-analytics.com/ga.js';", 'var', 's', '=', "document.getElementsByTagName('script')[0];", 's.parentNode.insertBefore(ga,', 's);', '})();', 'Notice:', 'While', 'Javascript', 'is', 'not', 'essential', 'for', 'this', 'website,', 'your', 'interaction', 'with', 'the', 'content', 'will', 'be', 'limited.', 'Please', 'turn', 'Javascript', 'on', 'for', 'the', 'full', 'experience.', 'Skip', 'to', 'content', '', 'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', '', 'The', 'Python', 'Network']
'''
import operator
freq_dis = {}
for tok in tokens:
    if tok in freq_dis:
        freq_dis[tok] += 1
    else:
        freq_dis[tok] = 1
sorted_freq_dist = sorted(freq_dis.items(), key = operator.itemgetter(1), reverse = True)
print(sorted_freq_dist[:25])
'''
[('Python', 60), ('>>>', 24), ('and', 22), ('is', 18), ('the', 18), ('to', 17), ('of', 15), ('=', 14), ('Events', 11), ('News', 11), ('a', 10), ('for', 10), ('More', 9), ('#', 9), ('3', 8), ('in', 8), ('Community', 7), ('with', 7), ('...', 7), ('Docs', 6), ('Guide', 6), ('Software', 6), ('now', 5), ('that', 5), ('The', 5)]
'''
import nltk
Freq_dist_nltk = nltk.FreqDist(tokens)
print(Freq_dist_nltk)
'''<FreqDist with 600 samples and 1105 outcomes>'''
for k, v in Freq_dist_nltk.items():
    print(str(k) + ':' + str(v))
'''
This:1
[fruit.upper():1
Forums:2
Check:1
...
GUI:1
Intuitive:1
X:2
growth:1
advance:1
'''
# below is the plot for the frequency distributions
# plot the word-frequency distribution (cumulative=False)
Freq_dist_nltk.plot(50, cumulative=False)

## stop word removal
#stopwords = [word.strip().lower() for word in open("PATH/english.stop.txt")]
#clean_tokens = [tok for tok in tokens if len(tok.lower()) > 1 and (tok.lower() not in stopwords)]
#Freq_dist_nltk = nltk.FreqDist(clean_tokens)
#Freq_dist_nltk.plot(50, cumulative = False)

02.02 Text Wrangling and Cleansing (NLTKEssentials02.py)

# 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》 Chapter 02: Text Wrangling and Cleansing
# win10 nltk3.2.4 python3.5.3/python3.6.1
# filename: NLTKEssentials02.py # Text Wrangling and Cleansing

# tokenization, stemming, lemmatization, stop word removal

# 2.1 Text wrangling
'''
# examples.csv
"test01",99
"test02",999
"test03",998
"test04",997
"test05",996
'''
import csv
with open('examples.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter = ',', quotechar = '"')
    for line in reader:
        print(line[1])
'''
99
999
998
997
996
'''

'''
# examples.json
{
  "array": [1, 2, 3, 4],
  "boolean": true,
  "object": {"a": "b"},
  "string": "Hello, World"
}
'''
import json
jsonfile = open('examples.json')
data = json.load(jsonfile)
print(data['string'])
'''Hello, World'''
with open('examples.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
    print(data['string'])
'''Hello, World'''

# 2.2 Text cleansing
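# The book treats cleansing in prose only; this is a small illustrative sketch of my own
# (not from the book): strip markup with a regex, drop punctuation, normalise case/whitespace.
import re
raw = "<p>Hello,   World! This is <b>noisy</b> text.</p>"   # made-up sample string
no_tags = re.sub(r'<[^>]+>', ' ', raw)        # remove HTML tags
no_punct = re.sub(r'[^\w\s]', ' ', no_tags)   # remove punctuation
cleaned = ' '.join(no_punct.lower().split())  # collapse whitespace and lowercase
print(cleaned)
'''hello world this is noisy text'''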

# 2.3 Sentence splitting
import nltk
inputstring = 'This is an examples sent. The sentence splitter will split on sent markers. Ohh really !!'
from nltk.tokenize import sent_tokenize
#all_sent = sent_tokenize(inputstring, language="english")
all_sent = sent_tokenize(inputstring)
print(all_sent)

import nltk.tokenize.punkt
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
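# quick usage check (my addition): the untrained Punkt tokenizer splits on the same sentence markers
print(tokenizer.tokenize(inputstring))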


# 2.4 Tokenization  http://text-processing.com/demo
s = "Hi Everyone ! hola gr8"
print(s.split())
from nltk.tokenize import word_tokenize
word_tokenize(s)

from nltk.tokenize import regexp_tokenize, wordpunct_tokenize, blankline_tokenize
regexp_tokenize(s, pattern = '\w+')
regexp_tokenize(s, pattern = '\d+')
wordpunct_tokenize(s)
blankline_tokenize(s)


# 2.5 Stemming
# eat eating eaten eats ==> eat
# stemming is hard to get right for Chinese and Japanese
import nltk
from nltk.stem import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer
pst = PorterStemmer()
lst = LancasterStemmer()
print(lst.stem("eating"))
'''eat'''
print(pst.stem("shopping"))
'''shop'''


# 2.6 Lemmatization; a lemma is the root form of a word
from nltk.stem import WordNetLemmatizer
wlem = WordNetLemmatizer()
wlem.lemmatize("ate")
# Resource 'corpora/wordnet.zip/wordnet/' not found.  Please use the NLTK Downloader to obtain the resource:  >>> nltk.download()
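# Note (my addition): lemmatize() assumes the noun POS by default, which is why "ate" comes
# back unchanged; passing the verb POS returns the expected lemma.
print(wlem.lemmatize("ate"))           # 'ate' (treated as a noun)
print(wlem.lemmatize("ate", pos="v"))  # 'eat'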


# 2.7 Stop word removal
import nltk
from nltk.corpus import stopwords
stoplist = stopwords.words('english')
text = "This is just a test"
cleanwordlist = [word for word in text.split() if word not in stoplist]
print(cleanwordlist)
'''['This', 'test']'''


# 2.8 Rare word removal
'''
import nltk
token = text.split()
freq_dist = nltk.FreqDist(token)
# FreqDist keys are not sorted by count in Python 3, so take the 50 least common explicitly
rarewords = [word for word, count in freq_dist.most_common()[-50:]]
after_rare_words = [word for word in token if word not in rarewords]
print(after_rare_words)
'''

# 2.9 Spelling correction (spellchecker)
from nltk.metrics import edit_distance
print(edit_distance("rain", "shine")) # 3
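# A toy spelling-correction sketch (my addition, not from the book): pick, from a small
# illustrative candidate list, the word closest to the misspelling by edit distance.
candidates = ['rain', 'shine', 'brain', 'rainy']
misspelt = 'raiin'
print(min(candidates, key=lambda w: edit_distance(misspelt, w)))  # 'rain'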

02.03 Part-of-Speech Tagging (NLTKEssentials03.py)

# 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》 Chapter 03: Part-of-Speech Tagging
# win10 nltk3.2.4 python3.5.3/python3.6.1
# filename: NLTKEssentials03.py # Part-of-Speech Tagging

# 3.1 Part-of-speech tagging
# part of speech (POS)
# Penn Treebank

import nltk
from nltk import word_tokenize
s = "I was watching TV"
print(nltk.pos_tag(word_tokenize(s)))


tagged = nltk.pos_tag(word_tokenize(s))
allnoun = [word for word, pos in tagged if pos in ['NN', 'NNP']]
print(allnoun)


# 3.1.1 The Stanford tagger
# https://nlp.stanford.edu/software/stanford-postagger-full-2017-06-09.zip
from nltk.tag.stanford import StanfordPOSTagger
import nltk
stan_tagger = StanfordPOSTagger('D:/nltk_data/stanford-postagger-full-2017-06-09/models/english-bidirectional-distsim.tagger',
                                'D:/nltk_data/stanford-postagger-full-2017-06-09/stanford-postagger.jar')
s = "I was watching TV"
tokens = nltk.word_tokenize(s)
stan_tagger.tag(tokens)

# 3.1.2 A closer look at taggers
from nltk.corpus import brown
import nltk
tags = [tag for (word, tag) in brown.tagged_words(categories = 'news')]
print(nltk.FreqDist(tags))

brown_tagged_sents = brown.tagged_sents(categories = 'news')
default_tagger = nltk.DefaultTagger('NN')
print(default_tagger.evaluate(brown_tagged_sents))


# 3.1.3 Sequential taggers
# 1 N-gram taggers
from nltk.tag import UnigramTagger
from nltk.tag import DefaultTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
train_data = brown_tagged_sents[:int(len(brown_tagged_sents) * 0.9)]
test_data = brown_tagged_sents[int(len(brown_tagged_sents) * 0.9):]
unigram_tagger = UnigramTagger(train_data, backoff=default_tagger)
print(unigram_tagger.evaluate(test_data))
bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
print(bigram_tagger.evaluate(test_data))
trigram_tagger = TrigramTagger(train_data, backoff=bigram_tagger)
print(trigram_tagger.evaluate(test_data))

# 2 The regular expression tagger
from nltk.tag.sequential import RegexpTagger
regexp_tagger = RegexpTagger(
    [(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
     (r'(The|the|A|a|An|an)$', 'AT'),  # articles
     (r'.*able$', 'JJ'), # adjectives
     (r'.*ness$', 'NN'), # nouns formed from adj
     (r'.*ly$', 'RB'),   # adverbs
     (r'.*s$', 'NNS'),   # plural nouns
     (r'.*ing$', 'VBG'), # gerunds
     (r'.*ed$', 'VBD'),  # past tense verbs
     (r'.*', 'NN')       # nouns (default)
    ])
print(regexp_tagger.evaluate(test_data))

# 3.1.4 The Brill tagger
# 3.1.5 Machine learning based taggers
# maximum entropy classifier (MEC)
# hidden Markov model (HMM)
# conditional random field (CRF)
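# The book stops at naming these models. As an optional sketch (my addition, not the book's),
# NLTK's built-in HMM tagger can be trained on the same Brown split prepared above; training
# may take a while, and unseen words hurt its accuracy.
from nltk.tag.hmm import HiddenMarkovModelTagger
hmm_tagger = HiddenMarkovModelTagger.train(train_data)
print(hmm_tagger.tag(nltk.word_tokenize("I was watching TV")))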

# 3.2 Named entity recognition (NER)
# the NER tagger
import nltk
from nltk import ne_chunk
sent = "Mark is studying at Stanford University in California"
print(ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary = False))

from nltk.tag.stanford import StanfordNERTagger
# https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip
st = StanfordNERTagger('D:/nltk_data/stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz',
                       'D:/nltk_data/stanford-ner-2017-06-09/stanford-ner.jar')
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())

02.04 Parsing Text Structure (NLTKEssentials04.py)

# 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》 Chapter 04: Parsing Text Structure
# win10 nltk3.2.4 python3.5.3/python3.6.1
# filename: NLTKEssentials04.py # Parsing Text Structure

# 4.1 Shallow versus deep parsing
# CFG (context-free grammar)
# PCFG (probabilistic context-free grammar)
# shallow parsing
# deep parsing

# 4.2 Two approaches to parsing
# rule-based
# probability-based

# 4.3 Why do we need parsing?
# syntactic parser
'''
import nltk
from nltk import CFG
toy_grammar = nltk.CFG.fromstring(
"""
S -> NP VP  # S indicate the entire sentence
VP -> V NP  # VP is verb phrase the
V -> "eats" | "drinks" # V is verb
NP -> Det N # NP is noun phrase (chunk that has noun in it)
Det -> "a" | "an" | "the" # Det is determiner used in the sentences
N -> "president" | "Obama" | "apple" | "coke" # N some example nouns
""")
toy_grammar.productions()
'''
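# A minimal parsing sketch (my addition): the same toy grammar, rewritten without the inline
# comments so that CFG.fromstring accepts it, parsed with NLTK's chart parser.
import nltk
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
V -> "eats" | "drinks"
NP -> Det N
Det -> "a" | "an" | "the"
N -> "president" | "Obama" | "apple" | "coke"
""")
toy_parser = nltk.ChartParser(toy_grammar)
for tree in toy_parser.parse("the president eats an apple".split()):
    print(tree)  # prints the parse tree for the sentence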

# 4.4 Different types of parsers
# 4.4.1 The recursive descent parser
# 4.4.2 The shift-reduce parser
# 4.4.3 The chart parser
# 4.4.4 The regexp parser
import nltk
from nltk.chunk.regexp import *
chunk_rules = ChunkRule("<.*>+", "chunk everything")
reg_parser = RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}               # Preposition
V: {<V.*>}              # Verb
PP: {<P> <NP>}          # PP -> P NP
VP: {<V> <NP|PP>*}      # VP -> V (NP|PP)*
''')
test_sent = "Mr. Obama played a big role in the Health insurance bill"
test_sent_pos = nltk.pos_tag(nltk.word_tokenize(test_sent))
parsed_out = reg_parser.parse(test_sent_pos)
print(parsed_out)

# 4.5 Dependency parsing (DP)
# probabilistic, projective dependency parser
from nltk.parse.stanford import StanfordParser
# https://nlp.stanford.edu/software/stanford-parser-full-2017-06-09.zip
english_parser = StanfordParser('D:/nltk_data/stanford-parser-full-2017-06-09/stanford-parser.jar',
                                'D:/nltk_data/stanford-parser-full-2017-06-09/stanford-parser-3.8.0-models.jar')
english_parser.raw_parse_sents(["this is the english parser test"])

# 4.6 Chunking
'''
from nltk.chunk.regexp import *
test_sent = "The prime minister announced he had asked the chief government whip, \
Philip Ruddock, to call a special party room meeting for 9am on Monday to consider the spill motion."
test_sent_pos = nltk.pos_tag(nltk.word_tokenize(test_sent))
rule_vp = ChunkRule(r'(<VB.*>)?(<VB.*>)+(<PRP>)?', 'Chunk VPs')
parser_vp = RegexpChunkParser([rule_vp], chunk_label = 'VP')
print(parser_vp.parse(test_sent_pos))
rule_np = ChunkRule(r'(<DT>?<RB>?)?<JJ|CD>*(<JJ|CD><,>)*(<NN.*>)+', 'Chunk NPs')
parser_np = RegexpChunkParser([rule_np], chunk_label="NP")
print(parser_np.parse(test_sent_pos))
'''

# 4.7 Information extraction
# 4.7.1 Named entity recognition (NER)
f = open("D:/nltk_data/ner_sample.txt")#  absolute path for the file of text for which we want NER
text = f.read()
sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
for sent in tagged_sentences:
    print(nltk.ne_chunk(sent))

# 4.7.2 Relation extraction
import re
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus = 'ieer', pattern = IN):
        print(nltk.sem.rtuple(rel))

02.06 Text Classification (NLTKEssentials06.py)

# 《NLTK基礎教程--用NLTK和Python庫構建機器學習應用》 Chapter 06: Text Classification
# win10 nltk3.2.4 python3.5.3/python3.6.1
# filename: NLTKEssentials06.py # Text Classification

# 6.2 Text classification
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import csv
def preprocessing(text):
    #text = text.decode("utf8")
    # tokenize into words
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # remove stopwords
    stop = stopwords.words('english')
    tokens = [token for token in tokens if token not in stop]
    # remove words less than three letters
    tokens = [word for word in tokens if len(word) >= 3]
    # lower capitalization
    tokens = [word.lower() for word in tokens]
    # lemmatize
    lmtzr = WordNetLemmatizer()
    tokens = [lmtzr.lemmatize(word) for word in tokens]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

# https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
sms = open('D:/nltk_data/SMSSpamCollection', encoding='utf8') # check the structure of this file!
sms_data = []
sms_labels = []
csv_reader = csv.reader(sms, delimiter = '\t')
for line in csv_reader:
    # adding the label (ham/spam) from the first column
    sms_labels.append(line[0])
    # adding the message text, cleaned by the preprocessing method above
    sms_data.append(preprocessing(line[1]))
sms.close()

# 6.3 Sampling
import sklearn
import numpy as np
trainset_size = int(round(len(sms_data)*0.70))
# this threshold gives a 70:30 train/test split
print('The training set size for this classifier is ' + str(trainset_size) + '\n')
x_train = np.array([''.join(el) for el in sms_data[0: trainset_size]])
y_train = np.array([el for el in sms_labels[0: trainset_size]])
x_test = np.array([''.join(el) for el in sms_data[trainset_size:len(sms_data)]])
y_test = np.array([el for el in sms_labels[trainset_size:len(sms_labels)]])

print(x_train)
print(y_train)

from sklearn.feature_extraction.text import CountVectorizer
sms_exp = []
for line in sms_data:
    sms_exp.append(preprocessing(line))
vectorizer = CountVectorizer(min_df = 1, encoding='utf-8')
X_exp = vectorizer.fit_transform(sms_exp)
print("||".join(vectorizer.get_feature_names()))
print(X_exp.toarray())

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df = 2, ngram_range=(1, 2),
                             stop_words = 'english', strip_accents = 'unicode', norm = 'l2')
X_train = vectorizer.fit_transform(x_train)
X_test = vectorizer.transform(x_test)

# 6.3.1 Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
clf = MultinomialNB().fit(X_train, y_train)
y_nb_predicted = clf.predict(X_test)
print(y_nb_predicted)
print('\n confusion_matrix \n')
#cm = confusion_matrix(y_test, y_pred)
cm = confusion_matrix(y_test, y_nb_predicted)
print(cm)
print('\n Here is the classification report:')
print(classification_report(y_test, y_nb_predicted))

feature_names = vectorizer.get_feature_names()
coefs = clf.coef_
intercept = clf.intercept_
coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
n = 10
top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
for (coef_1, fn_1), (coef_2, fn_2) in top:
    print('\t%.4f\t%-15s\t\t%.4f\t%-15s' %(coef_1, fn_1, coef_2, fn_2))

# 6.3.2 Decision trees
from sklearn import tree
clf = tree.DecisionTreeClassifier().fit(X_train.toarray(), y_train)
y_tree_predicted = clf.predict(X_test.toarray())
print(y_tree_predicted)
print('\n Here is the classification report:')
print(classification_report(y_test, y_tree_predicted))

# 6.3.3 Stochastic gradient descent
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
clf = SGDClassifier(alpha = 0.0001, n_iter=50).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('\n Here is the classification report:')
print(classification_report(y_test, y_pred))
print(' \n confusion_matrix \n')
cm = confusion_matrix(y_test, y_pred)
print(cm)

# 6.3.4 Logistic regression
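# The book lists logistic regression without code; a minimal sketch (my addition, not the
# book's) using scikit-learn's LogisticRegression on the same TF-IDF features:
from sklearn.linear_model import LogisticRegression
logreg_clf = LogisticRegression().fit(X_train, y_train)
y_logreg_predicted = logreg_clf.predict(X_test)
print('\n Here is the classification report:')
print(classification_report(y_test, y_logreg_predicted))
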
# 6.3.5 Support vector machines
from sklearn.svm import LinearSVC
svm_classifier = LinearSVC().fit(X_train, y_train)
y_svm_predicted = svm_classifier.predict(X_test)
print('\n Here is the classification report:')
print(classification_report(y_test, y_svm_predicted))
cm = confusion_matrix(y_test, y_svm_predicted)
print(cm)

# 6.4 The random forest
from sklearn.ensemble import RandomForestClassifier
RF_clf = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)
predicted = RF_clf.predict(X_test)
print('\n Here is the classification report:')
print(classification_report(y_test, predicted))
cm = confusion_matrix(y_test, predicted)
print(cm)

# 6.5 Text clustering
# K-means
from sklearn.cluster import KMeans, MiniBatchKMeans
from collections import defaultdict
true_k = 5
km = KMeans(n_clusters = true_k, init='k-means++', max_iter=100, n_init= 1)
kmini = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1, init_size=1000, batch_size=1000, verbose=2)
km_model = km.fit(X_train)
kmini_model = kmini.fit(X_train)
print("For K-mean clustering ")
clustering = defaultdict(list)
for idx, label in enumerate(km_model.labels_):
    clustering[label].append(idx)
print("For K-mean Mini batch clustering ")
clustering = defaultdict(list)
for idx, label in enumerate(kmini_model.labels_):
    clustering[label].append(idx)

# 6.6 Topic modeling in text
# https://pypi.python.org/pypi/gensim#downloads
import gensim
from gensim import corpora, models, similarities
from itertools import chain
import nltk
from nltk.corpus import stopwords
from operator import itemgetter
import re
documents = [document for document in sms_data]
stoplist = stopwords.words('english')
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
print(texts)


dictionary = corpora.Dictionary(texts)