
Text Preprocessing in Python: Steps, Tools, and Examples

Example 8. Stemming using NLTK:

Code:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
input_str = "There are several types of stemming algorithms."
input_str = word_tokenize(input_str)
for word in input_str:
    print(stemmer.stem(word))

Output:

there are sever type of stem algorithm .

(Each stem is printed on its own line; NLTK's Porter stemmer also lowercases tokens by default.)

Lemmatization

The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As opposed to stemming, lemmatization does not simply chop off inflections. Instead it uses lexical knowledge bases to get the correct base forms of words.

Example 9. Lemmatization using NLTK:

Code:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
input_str = "been had done languages cities mice"
input_str = word_tokenize(input_str)
for word in input_str:
    print(lemmatizer.lemmatize(word))

Output:

been had done language city mouse

(Each lemma is printed on its own line. Because lemmatize() treats every word as a noun by default, the verb forms been, had, and done are returned unchanged; only the nouns are reduced to their base forms.)

Part-of-speech tagging (POS)

Part-of-speech tagging aims to assign a part of speech (noun, verb, adjective, adverb, etc.) to each word of a given text based on its definition and its context.

Example 10. Part-of-speech tagging using TextBlob:

Code:

from textblob import TextBlob

input_str = "Parts of speech examples: an article, to write, interesting, easily, and, of"
result = TextBlob(input_str)
print(result.tags)

Output:

[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]

Chunking (shallow parsing)

Chunking is a natural language process that identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and links them to higher order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.) [23]. Chunking tools: NLTK, TreeTagger chunker, Apache OpenNLP, General Architecture for Text Engineering (GATE), FreeLing.

Example 11. Chunking using NLTK:

The first step is to determine the part of speech for each word:

Code:

from textblob import TextBlob

input_str = "A black television and a white stove were bought for the new apartment of John."
result = TextBlob(input_str)
print(result.tags)

Output:

[('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]

The second step is chunking:

Code:

import nltk

reg_exp = "NP: {<DT>?<JJ>*<NN>}"
rp = nltk.RegexpParser(reg_exp)
result = rp.parse(result.tags)
print(result)

Output:

(S (NP A/DT black/JJ television/NN) and/CC (NP a/DT white/JJ stove/NN) were/VBD bought/VBN for/IN (NP the/DT new/JJ apartment/NN) of/IN John/NNP)

It is also possible to draw the sentence tree structure by calling result.draw().
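Beyond printing or drawing the tree, the noun-phrase chunks can be pulled out programmatically by walking its subtrees. A minimal, self-contained sketch (the tagged tuples are copied from the tagger output above so the snippet runs on its own):

```python
import nltk

# POS-tagged tokens, as produced in the first step of the chunking example
tags = [("A", "DT"), ("black", "JJ"), ("television", "NN"), ("and", "CC"),
        ("a", "DT"), ("white", "JJ"), ("stove", "NN"), ("were", "VBD"),
        ("bought", "VBN"), ("for", "IN"), ("the", "DT"), ("new", "JJ"),
        ("apartment", "NN"), ("of", "IN"), ("John", "NNP")]

# same NP grammar as above: optional determiner, any adjectives, then a noun
rp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = rp.parse(tags)

# walk the tree and print only the NP chunks as plain strings
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
# prints:
#   A black television
#   a white stove
#   the new apartment
```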

Named entity recognition

Named-entity recognition (NER) aims to find named entities in text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.).