
Text Preprocessing in Python: Steps, Tools, and Examples

Example 8. Stemming using NLTK:

Code:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
input_str = "There are several types of stemming algorithms."
input_str = word_tokenize(input_str)
for word in input_str:
    print(stemmer.stem(word))

Output:

there are sever type of stem algorithm .

(Each stem is printed on its own line; NLTK's Porter stemmer also lowercases tokens by default.)

Lemmatization

The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As opposed to stemming, lemmatization does not simply chop off inflections. Instead it uses lexical knowledge bases to get the correct base forms of words.

Example 9. Lemmatization using NLTK:

Code:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
input_str = "been had done languages cities mice"
input_str = word_tokenize(input_str)
for word in input_str:
    print(lemmatizer.lemmatize(word))

Output:

been had done language city mouse

(Each lemma is printed on its own line. Because lemmatize() treats every word as a noun by default, the verb forms been, had, and done are returned unchanged; only the nouns are reduced to their base forms.)

Part-of-speech tagging (POS)

Part-of-speech tagging aims to assign a part of speech (noun, verb, adjective, adverb, etc.) to each word of a given text based on its definition and its context.

Example 10. Part-of-speech tagging using TextBlob:

Code:

from textblob import TextBlob

input_str = "Parts of speech examples: an article, to write, interesting, easily, and, of"
result = TextBlob(input_str)
print(result.tags)

Output:

[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]

Chunking (shallow parsing)

Chunking is a natural language process that identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and links them to higher order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.) [23]. Chunking tools: NLTK, TreeTagger chunker, Apache OpenNLP, General Architecture for Text Engineering (GATE), FreeLing.

Example 11. Chunking using NLTK:

The first step is to determine the part of speech for each word:

Code:

from textblob import TextBlob

input_str = "A black television and a white stove were bought for the new apartment of John."
result = TextBlob(input_str)
print(result.tags)

Output:

[('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]

The second step is chunking:

Code:

import nltk

reg_exp = "NP: {<DT>?<JJ>*<NN>}"
rp = nltk.RegexpParser(reg_exp)
result = rp.parse(result.tags)
print(result)

Output:

(S (NP A/DT black/JJ television/NN) and/CC (NP a/DT white/JJ stove/NN) were/VBD bought/VBN for/IN (NP the/DT new/JJ apartment/NN) of/IN John/NNP)

It is also possible to draw the sentence tree structure by calling result.draw().
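Beyond printing or drawing the tree, the noun-phrase chunks can be pulled out programmatically by walking its subtrees. A minimal, self-contained sketch (the tagged tuples are copied from the tagger output above so the snippet runs on its own):

```python
import nltk

# POS-tagged tokens, as produced in the first step of the chunking example
tags = [("A", "DT"), ("black", "JJ"), ("television", "NN"), ("and", "CC"),
        ("a", "DT"), ("white", "JJ"), ("stove", "NN"), ("were", "VBD"),
        ("bought", "VBN"), ("for", "IN"), ("the", "DT"), ("new", "JJ"),
        ("apartment", "NN"), ("of", "IN"), ("John", "NNP")]

# same NP grammar as above: optional determiner, any adjectives, then a noun
rp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = rp.parse(tags)

# walk the tree and print only the NP chunks as plain strings
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
# prints:
#   A black television
#   a white stove
#   the new apartment
```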

Named entity recognition

Named-entity recognition (NER) aims to find named entities in text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.).