Natural Language Processing for Fuzzy String Matching with Python

阿新 • Published: 2018-12-28

Fuzzy string search can be used in various applications, such as:

A spell checker and typo corrector. For example, when a user types “Missisaga” into Google, a list of hits is returned along with “Showing results for mississauga”. That is, the search query returns results even if the user input contains additional or missing characters, or other types of spelling errors.

Software that checks for duplicate records. For example, a customer may be listed multiple times with different purchases in the database because of different spellings of their name (e.g. Abigail Martin vs. Abigail Martinez), a new address, or a mistakenly entered phone number.
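The spell-correction use case can be sketched with nothing but the standard library. This is only an illustration, not how Google or Fuzzywuzzy works: the vocabulary list and the `suggest` helper are hypothetical names introduced here, and `difflib.get_close_matches` stands in for a real spelling model.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of known place names (not from the article).
KNOWN_CITIES = ["mississauga", "toronto", "montreal", "vancouver"]


def suggest(query, vocabulary=KNOWN_CITIES, cutoff=0.6):
    """Return the closest known spelling for query, or None if nothing is close enough."""
    matches = get_close_matches(query.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(suggest("Missisaga"))  # mississauga
```

Raising `cutoff` makes the helper more conservative, which matters when a wrong correction is worse than no correction.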

Speaking of dedupe, it may not be as easy as it sounds, particularly if you have hundreds of thousands of records. Even Expedia does not get it 100% right.

This post will explain what fuzzy string matching is together with its use cases and give examples using Python’s Fuzzywuzzy library.

Each hotel has its own nomenclature for naming its rooms, and the same goes for Online Travel Agencies (OTAs). For example, for one room in the same hotel, Expedia uses “Studio, 1 King Bed with Sofa bed, Corner”, while Booking.com may find it safe to show the room simply as “Corner King Studio”.

There is nothing wrong here, but it can lead to confusion when we want to compare room rates between OTAs, or when one OTA wants to make sure another OTA follows the rate parity agreement. In other words, to be able to compare prices, we must make sure that we are comparing apples to apples.

One of the most consistently frustrating issues for price comparison websites and apps is figuring out automatically whether two items (or hotel rooms) are the same thing.

FuzzyWuzzy in Python

Fuzzywuzzy is a Python library that uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.
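For readers new to the term, Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The function below is a minimal textbook sketch of that definition, written for illustration; it is not Fuzzywuzzy's own code, and the name `levenshtein` is chosen here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("missisaga", "mississauga"))  # 2 (two missing letters)
```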

To demonstrate, I created my own data set: for the same hotel property, I take a room type from Expedia, let’s say “Suite, 1 King Bed (Parlor)”, and match it to a room type on Booking.com, which is “King Parlor Suite”. With a little experience, most humans would know they are the same thing. Following this methodology, I created a small data set with over 100 room type pairs, which can be found on Github.

Using this data set, we are going to test how Fuzzywuzzy thinks. In other words, we are using Fuzzywuzzy to match records between two data sources.

import pandas as pd

df = pd.read_csv('room_type.csv')
df.head(10)

I created the data set myself, so it is very clean.

There are several ways to compare two strings in Fuzzywuzzy. Let’s try them one by one.

ratio compares the similarity of the entire strings, in order.

from fuzzywuzzy import fuzz

fuzz.ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

This tells us that the “Deluxe Room, 1 King Bed” and “Deluxe King Room” pair is about 62% similar.

fuzz.ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

The “Traditional Double Room, 2 Double Beds” and “Double Room with Two Double Beds” pair is about 69% similar.

fuzz.ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

The “Room, 2 Double Beds (19th to 25th Floors)” and “Two Double Beds - Location Room (19th to 25th Floors)” pair is about 74% similar.

I am disappointed with these. It turns out, the naive approach is far too sensitive to minor differences in word order, missing or extra words, and other such issues.
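The sensitivity to word order is easy to reproduce with a rough stand-in for a whole-string score. The sketch below uses the standard library's `difflib.SequenceMatcher`; `simple_ratio` is a name introduced here, and its scores are not guaranteed to equal those of `fuzz.ratio`.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Same three words, once in the same order and once reordered.
same_order = simple_ratio("Deluxe King Room", "Deluxe King Room")
reordered = simple_ratio("Deluxe King Room", "King Room Deluxe")
print(same_order, reordered)
```

The reordered pair scores well below 100 even though both strings contain exactly the same words, which is the weakness the remaining scorers try to address.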

partial_ratio compares partial string similarity.

We are still using the same data pairs.

fuzz.partial_ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

fuzz.partial_ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

fuzz.partial_ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

For my data set, comparing partial strings does not bring better results overall. Let’s continue.

token_sort_ratio ignores word order.
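The idea behind partial matching can be sketched as sliding the shorter string across the longer one and keeping the best window score. This is a simplified illustration built on `difflib`; `simple_partial_ratio` is a name introduced here, and Fuzzywuzzy may select candidate windows differently, so exact scores can differ.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any same-length window of the longer one."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0  # assumption made for this sketch: an empty string matches nothing
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


print(simple_partial_ratio("King", "Deluxe King Room"))  # 100
```

A substring always scores 100 here, which helps when one name is contained in the other but does little for names that use different words or a different word order.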

fuzz.token_sort_ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

fuzz.token_sort_ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

fuzz.token_sort_ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

Best so far.
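Ignoring word order amounts to normalizing both strings, sorting their words, and then comparing the results as whole strings. The sketch below shows that idea with `difflib`; the helper names and the exact normalization (lowercasing and treating punctuation as spaces) are assumptions made for illustration, not Fuzzywuzzy's code.

```python
import re
from difflib import SequenceMatcher


def normalize_tokens(text: str) -> list:
    """Lowercase, treat non-alphanumeric characters as spaces, and split into words."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def simple_token_sort_ratio(a: str, b: str) -> int:
    """Compare two strings after sorting their words alphabetically."""
    sorted_a = " ".join(sorted(normalize_tokens(a)))
    sorted_b = " ".join(sorted(normalize_tokens(b)))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())


print(simple_token_sort_ratio("King Room, Deluxe", "Deluxe King Room"))  # 100
```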

token_set_ratio ignores duplicated words. It is similar to token_sort_ratio, but a little more flexible.

fuzz.token_set_ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

100

fuzz.token_set_ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

fuzz.token_set_ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

Looks like token_set_ratio is the best fit for my data. Based on this discovery, I decided to apply token_set_ratio to my entire data set.

def get_ratio(row):
    name = row['Expedia']
    name1 = row['Booking.com']
    return fuzz.token_set_ratio(name, name1)

len(df[df.apply(get_ratio, axis=1) > 70]) / len(df)
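Why does the set-based score do so well on room names? One way to picture it: split both strings into sets of words, then compare the shared words against the shared words plus each side's leftovers, and keep the best score. The sketch below follows that idea with `difflib`; the function names and normalization are assumptions for illustration, and the scores need not match `fuzz.token_set_ratio` in every case.

```python
import re
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    return round(100 * SequenceMatcher(None, a, b).ratio())


def _token_set(text: str) -> set:
    # Lowercase, treat punctuation as spaces, and drop duplicate words.
    return set(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def simple_token_set_ratio(a: str, b: str) -> int:
    """Best of three comparisons built from shared and unshared words."""
    tokens_a, tokens_b = _token_set(a), _token_set(b)
    shared = " ".join(sorted(tokens_a & tokens_b))
    combined_a = (shared + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    combined_b = (shared + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    return max(
        _ratio(shared, combined_a),
        _ratio(shared, combined_b),
        _ratio(combined_a, combined_b),
    )


print(simple_token_set_ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room'))  # 100
```

Every word of “Deluxe King Room” also appears in “Deluxe Room, 1 King Bed”, so the shared words equal the second string exactly. That explains the score of 100 reported above for this pair.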

0.9029126213592233

Over 90% of the pairs have a match score above 70. Not too shabby!
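The same share-above-threshold calculation can be written without pandas, which makes it easy to try several cut-offs. In this sketch, `match_rate`, `difflib_score`, and the four toy pairs are all hypothetical and introduced here for illustration; they are not the article's data set, and any scorer that returns a number from 0 to 100 could be passed in.

```python
from difflib import SequenceMatcher


def match_rate(pairs, scorer, threshold=70):
    """Fraction of (left, right) pairs whose score is strictly above threshold."""
    if not pairs:
        return 0.0
    hits = sum(1 for left, right in pairs if scorer(left, right) > threshold)
    return hits / len(pairs)


def difflib_score(a: str, b: str) -> int:
    """Stand-in scorer from the standard library, scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


# Toy pairs made up for this sketch: two identical pairs and two unrelated ones.
sample_pairs = [
    ("King Parlor Suite", "King Parlor Suite"),
    ("Corner King Studio", "Corner King Studio"),
    ("abc", "xyz"),
    ("Deluxe King Room", "12345"),
]

for threshold in (50, 70, 90):
    print(threshold, match_rate(sample_pairs, difflib_score, threshold))
```

Sweeping the threshold like this shows how sensitive the headline match rate is to the chosen cut-off before committing to one value.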
