
How to Clean Text for Machine Learning with Python

You cannot go straight from raw text to fitting a machine learning or deep learning model.

You must clean your text first, which means splitting it into words and handling punctuation and case.

In fact, there is a whole suite of text preparation methods that you may need to use, and the choice of methods really depends on your natural language processing task.

In this tutorial, you will discover how you can clean and prepare your text ready for modeling with machine learning.

After completing this tutorial, you will know:

  • How to get started by developing your own very simple text cleaning tools.
  • How to take a step up and use the more sophisticated methods in the NLTK library.
  • How to prepare text when using modern text representation methods like word embeddings.

Let’s get started.

  • Update Nov/2017: Fixed a code typo in the ‘split into words’ section, thanks David Comfort.
How to Clean Text for Machine Learning with Python
Photo by Bureau of Land Management, some rights reserved.

Tutorial Overview

This tutorial is divided into 6 parts; they are:

  1. Metamorphosis by Franz Kafka
  2. Text Cleaning is Task Specific
  3. Manual Tokenization
  4. Tokenization and Cleaning with NLTK
  5. Additional Text Cleaning Considerations
  6. Tips for Cleaning Text for Word Embedding


Metamorphosis by Franz Kafka

Let’s start off by selecting a dataset.

In this tutorial, we will use the text from the book Metamorphosis by Franz Kafka. No specific reason, other than it’s short, I like it, and you may like it too. I expect it’s one of those classics that most students have to read in school.

The full text for Metamorphosis is available for free from Project Gutenberg.

You can download the ASCII text version from the book's Project Gutenberg page (https://www.gutenberg.org/ebooks/5200).

Download the file and place it in your current working directory with the file name “metamorphosis.txt“.

The file contains header and footer information that we are not interested in, specifically copyright and license information. Open the file, delete the header and footer information, and save the file as “metamorphosis_clean.txt“.
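If you prefer to script this cleanup, here is a rough sketch (my own, not from the original tutorial) that assumes the usual “*** START OF …” and “*** END OF …” marker lines Project Gutenberg places around the body text; you should still eyeball the result:

# rough sketch: strip the Project Gutenberg header and footer, assuming the
# usual '*** START OF ...' and '*** END OF ...' marker lines are present
with open('metamorphosis.txt', 'rt') as f:
    raw = f.read()
body = raw.split('*** START OF', 1)[1]  # everything after the start marker
body = body.split('\n', 1)[1]           # drop the rest of the marker line itself
body = body.split('*** END OF', 1)[0]   # everything before the end marker
with open('metamorphosis_clean.txt', 'wt') as f:
    f.write(body.strip())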

The start of the clean file should look like:

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.

The file should end with:

And, as if in confirmation of their new dreams and good intentions, as soon as they reached their destination Grete was the first to get up and stretch out her young body.

Poor Gregor…

Text Cleaning Is Task Specific

After actually getting a hold of your text data, the first step in cleaning up text data is to have a strong idea about what you’re trying to achieve, and in that context review your text to see what exactly might help.

Take a moment to look at the text. What do you notice?

Here’s what I see:

  • It’s plain text so there is no markup to parse (yay!).
  • The translation of the original German uses UK English (e.g. “travelling“).
  • The lines are artificially wrapped with new lines at about 70 characters (meh).
  • There are no obvious typos or spelling mistakes.
  • There’s punctuation like commas, apostrophes, quotes, question marks, and more.
  • There’s hyphenated descriptions like “armour-like”.
  • There’s a lot of use of the em dash (“—“) to continue sentences (maybe replace with commas?).
  • There are names (e.g. “Mr. Samsa“).
  • There do not appear to be numbers that require handling (e.g. 1999).
  • There are section markers (e.g. “II” and “III”), and we have removed the first “I”.

I’m sure there is a lot more going on to the trained eye.

We are going to look at general text cleaning steps in this tutorial.

Nevertheless, consider some possible objectives we may have when working with this text document.

For example:

  • If we were interested in developing a Kafkaesque language model, we may want to keep all of the case, quotes, and other punctuation in place.
  • If we were interested in classifying documents as “Kafka” and “Not Kafka,” maybe we would want to strip case, punctuation, and even trim words back to their stem.

Use your task as the lens by which to choose how to ready your text data.

Manual Tokenization

Text cleaning is hard, but the text we have chosen to work with is pretty clean already.

We could just write some Python code to clean it up manually, and this is a good exercise for those simple problems that you encounter. Tools like regular expressions and splitting strings can get you a long way.

1. Load Data

Let’s load the text data so that we can work with it.

The text is small and will load quickly and easily fit into memory. This will not always be the case and you may need to write code to memory map the file. Tools like NLTK (covered in the next section) will make working with large files much easier.
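As an aside, here is a minimal sketch (mine, not part of the original tutorial) of memory-mapping a file with Python’s standard mmap module, so the operating system pages data in on demand rather than reading it all at once:

# minimal sketch: memory-map a text file instead of reading it all into memory
import mmap

with open('metamorphosis_clean.txt', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # only the pages actually touched are loaded into memory
        print(mm.readline().decode('utf-8'))  # first line of the file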

We can load the entire “metamorphosis_clean.txt” into memory as follows:

# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

Running the example loads the whole file into memory ready to work with.

2. Split by Whitespace

Clean text often means a list of words or tokens that we can work with in our machine learning models.

This means converting the raw text into a list of words and saving it again.

A very simple way to do this would be to split the document by whitespace, including spaces, newlines, tabs, and more. We can do this in Python with the split() function on the loaded string.

# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
print(words[:100])

Running the example splits the document into a long list of words and prints the first 100 for us to review.

We can see that punctuation is preserved (e.g. “wasn’t” and “armour-like“), which is nice. We can also see that end of sentence punctuation is kept with the last word (e.g. “thought.”), which is not great.

['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"What\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human']

3. Select Words

Another approach might be to use the regex module (re) and split the document into words by selecting for strings of alphanumeric characters (a-z, A-Z, 0-9 and ‘_’).

For example:

# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split based on words only
import re
words = re.split(r'\W+', text)
print(words[:100])

Again, running the example, we can see that we get our list of words. This time, we can see that “armour-like” is now two words, “armour” and “like” (fine), but contractions like “What’s” are also split into two words, “What” and “s” (not great).

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room']

4. Split by Whitespace and Remove Punctuation

Note: This example was written for Python 3.

We may want the words, but without the punctuation like commas and quotes. We also want to keep contractions together.

One way would be to split the document into words by white space (as in “2. Split by Whitespace“), then use string translation to replace all punctuation with nothing (e.g. remove it).

Python provides a constant called string.punctuation that gives a good list of punctuation characters. For example:

import string
print(string.punctuation)

Results in:

1 !"#$%&'()*+,-./:;<=>[email protected][\]^_`{|}~

Python offers a function called translate() that will map one set of characters to another.
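As a small illustration of that character-to-character form (a sketch of mine, reusing the em-dash-to-comma idea floated earlier):

# map each character in the first string to the character at the same
# position in the second string
table = str.maketrans('—', ',')
print('one thing—another'.translate(table))  # prints: one thing,another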

We can use the function maketrans() to create a mapping table. We can create an empty mapping table, but the third argument of this function allows us to list all of the characters to remove during the translation process. For example:

table = str.maketrans('', '', string.punctuation)

We can put all of this together, load the text file, split it into words by white space, then translate each word to remove the punctuation.

# load text
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
print(stripped[:100])

We can see that this has had the desired effect, mostly.

Contractions like “What’s” have become “Whats” but “armour-like” has become “armourlike“.

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'Whats', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasnt', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human']

If you know anything about regex, then you know things can get complex from here.
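For example, one possible regex (my suggestion, not from the original tutorial) keeps contractions and hyphenated words intact by allowing apostrophes and hyphens inside a token:

# sketch: keep runs of word characters, allowing internal apostrophes and
# hyphens, so "wasn't" and "armour-like" survive as single tokens
import re
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
words = re.findall(r"\w+(?:['-]\w+)*", text)
print(words[:100])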

5. Normalizing Case

It is common to convert all words to one case.

This means that the vocabulary will shrink in size, but some distinctions are lost (e.g. “Apple” the company vs “apple” the fruit is a commonly used example).

We can convert all words to lowercase by calling the lower() function on each word.

For example:

filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])

Running the example, we can see that all words are now lowercase.

['one', 'morning,', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'he', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'his', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"what\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'it', "wasn't", 'a', 'dream.', 'his', 'room,', 'a', 'proper', 'human']
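As an aside, you can quantify how much the vocabulary shrinks by comparing distinct-token counts before and after lowercasing (a quick sketch of mine):

# compare vocabulary sizes with and without case normalization
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
words = text.split()
print(len(set(words)))                     # mixed-case vocabulary size
print(len(set(w.lower() for w in words)))  # lowercased vocabulary size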

Note

Cleaning text is really hard, problem specific, and full of tradeoffs.

Remember, simple is better.

Simpler text data, simpler models, smaller vocabularies. You can always make things more complex later to see if it results in better model skill.

Next, we’ll look at some of the tools in the NLTK library that offer more than simple string splitting.

Tokenization and Cleaning with NLTK

The Natural Language Toolkit, or NLTK for short, is a Python library written for working with and modeling text.

It provides good tools for loading and cleaning text that we can use to get our data ready for working with machine learning and deep learning algorithms.

1. Install NLTK

You can install NLTK using your favorite package manager, such as pip:

sudo pip install -U nltk

After installation, you will need to install the data used with the library, including a great set of documents that you can use later for testing other tools in NLTK.

There are a few ways to do this, such as from within a script:

import nltk
nltk.download()

Or from the command line:

python -m nltk.downloader all

For more help installing and setting up NLTK, see the installation instructions on the NLTK website (https://www.nltk.org/install.html).

2. Split into Sentences

A useful first step is to split the text into sentences.

Some modeling tasks prefer input to be in the form of paragraphs or sentences, such as word2vec. You could first split your text into sentences, split each sentence into words, then save each sentence to file, one per line.

NLTK provides the sent_tokenize() function to split text into sentences.

The example below loads the “metamorphosis_clean.txt” file into memory, splits it into sentences, and prints the first sentence.

# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into sentences
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[0])

Running the example, we can see that although the document is split into sentences, each sentence still preserves the newline from the artificial wrapping of the lines in the original document.

One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin.
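If your goal is the word2vec-style preparation described above, here is a minimal sketch (my own wiring of the same NLTK calls, not code from the original tutorial) that tokenizes each sentence and writes one sentence per line; joining the tokens with single spaces also collapses the artificial line wraps. The output file name is just an example:

# sketch: save one tokenized sentence per line for word2vec-style training
from nltk import sent_tokenize
from nltk.tokenize import word_tokenize  # covered in the next section

filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
sentences = sent_tokenize(text)
with open('metamorphosis_sentences.txt', 'wt') as out:  # hypothetical output file
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        out.write(' '.join(tokens) + '\n')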

3. Split into Words

NLTK provides a function called word_tokenize() for splitting strings into tokens (nominally words).

It splits tokens based on white space and punctuation. For example, commas and periods are taken as separate tokens. Contractions are split apart (e.g. “What’s” becomes “What” “‘s“). Quotes are kept, and so on.

For example:

# load data
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:100])

Running the code, we can see that punctuation marks are now tokens of their own that we could then decide to specifically filter out.
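As a quick illustration of such filtering, here is a sketch that keeps only the purely alphabetic tokens:

# split into word tokens, then keep only tokens made entirely of letters,
# which drops standalone punctuation tokens like ',' and '.'
from nltk.tokenize import word_tokenize
filename = 'metamorphosis_clean.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
tokens = word_tokenize(text)
words = [word for word in tokens if word.isalpha()]
print(words[:100])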