Natural Language Processing for Fuzzy String Matching with Python

Fuzzy string search can be used in various applications, such as:

  • Spell checking and typo correction. For example, when a user types “Missisaga” into Google, a list of hits is returned along with “Showing results for mississauga”. That is, the search query returns results even if the user input contains extra or missing characters, or other kinds of spelling errors.
  • Duplicate-record detection. For example, a customer may be listed multiple times in a database, with different purchases, because of different spellings of their name (e.g. Abigail Martin vs. Abigail Martinez), a new address, or a mistakenly entered phone number.

Speaking of deduplication, it may not be as easy as it sounds, particularly if you have hundreds of thousands of records. Even Expedia does not get it 100% right:

Source: Expedia

This post will explain what fuzzy string matching is together with its use cases and give examples using Python’s Fuzzywuzzy library.

Each hotel has its own nomenclature for naming its rooms, and the same goes for Online Travel Agencies (OTAs). For example, for one room in the same hotel, Expedia calls it “Studio, 1 King Bed with Sofa bed, Corner”, while Booking.com may find it safe to show the room simply as “Corner King Studio”.

There is nothing wrong here, but it can lead to confusion when we want to compare room rates between OTAs, or when one OTA wants to make sure another OTA follows the rate parity agreement. In other words, to be able to compare prices, we must make sure that we are comparing apples to apples.

One of the most consistently frustrating issues for price comparison websites and apps is figuring out automatically whether two items (or hotel rooms) are the same thing.

FuzzyWuzzy in Python

Fuzzywuzzy is a Python library that uses Levenshtein distance to calculate the differences between sequences, in a simple-to-use package.
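For readers who have not met Levenshtein distance before, here is a minimal pure-Python sketch of the classic definition: the smallest number of single-character insertions, deletions, or substitutions needed to turn one string into another. The function name `levenshtein` is my own, and this is only an illustration of the idea. Fuzzywuzzy reports normalized 0 to 100 scores and does not necessarily compute them this way.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


# The misspelling from the search example is two insertions away.
print(levenshtein("Missisaga", "Mississauga"))  # 2
```

A small distance relative to the string length is what makes “Missisaga” a plausible typo for “Mississauga”.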

To demonstrate, I created my own data set: for the same hotel property, I take a room type from Expedia, let’s say “Suite, 1 King Bed (Parlor)”, and match it to a room type on Booking.com, which is “King Parlor Suite”. With a little experience, most humans would know they are the same thing. Following this methodology, I created a small data set of over 100 room type pairs, which can be found on GitHub.

Using this data set, we are going to test how Fuzzywuzzy thinks. In other words, we are using Fuzzywuzzy to match records between two data sources.

import pandas as pd

df = pd.read_csv('room_type.csv')
df.head(10)
Figure 1

I created the data set myself, so it is very clean.

There are several ways to compare two strings in Fuzzywuzzy. Let’s try them one by one.

  • ratio compares the similarity of the entire strings, in order.

from fuzzywuzzy import fuzz

fuzz.ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

62

This tells us that the “Deluxe Room, 1 King Bed” and “Deluxe King Room” pair is about 62% similar.

fuzz.ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

69

The “Traditional Double Room, 2 Double Beds” and “Double Room with Two Double Beds” pair is about 69% similar.

fuzz.ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

74

The “Room, 2 Double Beds (19th to 25th Floors)” and “Two Double Beds - Location Room (19th to 25th Floors)” pair is about 74% similar.

I am disappointed with these. It turns out, the naive approach is far too sensitive to minor differences in word order, missing or extra words, and other such issues.
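To see the word-order sensitivity in isolation, here is a minimal sketch that uses `difflib.SequenceMatcher` from the standard library as a stand-in for a whole-string similarity score. That choice is an assumption made for illustration, and `simple_ratio` is a hypothetical helper name. Fuzzywuzzy's own scores can differ depending on the backend installed, so this is not expected to reproduce the numbers above.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale (illustrative stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Identical strings score 100.
print(simple_ratio('Deluxe King Room', 'Deluxe King Room'))

# The same three words in a different order score lower, because the
# comparison works on the character sequence, not on the set of words.
print(simple_ratio('King Deluxe Room', 'Deluxe King Room'))
```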

  • partial_ratio compares partial string similarity.

We are still using the same data pairs.

fuzz.partial_ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

69

fuzz.partial_ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

83

fuzz.partial_ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

63

For my data set, comparing partial strings does not give better results overall. Let’s continue.
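The idea behind a partial comparison can be sketched as a sliding window: take the shorter string, compare it against every same-length slice of the longer one, and keep the best score. The brute-force version below is my own simplification, with hypothetical helper names and `difflib` as the scorer. The library is likely to use a faster strategy to pick candidate slices, so treat this as a picture of the concept rather than its implementation.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale (illustrative stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_similarity(a: str, b: str) -> int:
    """Best score of the shorter string against any same-length slice of the longer."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    window = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + window])
        for start in range(len(longer) - window + 1)
    )


# 'King Bed' appears verbatim inside the longer room name, so the best slice is exact.
print(partial_similarity('King Bed', 'Deluxe Room, 1 King Bed'))  # 100
```

This also hints at why partial matching does little for room names: the shared words are usually scattered across the string rather than sitting together in one contiguous slice.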

  • token_sort_ratio ignores word order.

fuzz.token_sort_ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

84

fuzz.token_sort_ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

78

fuzz.token_sort_ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

83

Best so far.
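Why does ignoring word order help? A rough sketch of the token-sort idea: normalize each string, split it into words, sort the words alphabetically, and only then compare. The tokenization rule below (lowercase, keep ASCII letters and digits) and the helper names are assumptions of mine for illustration. The library's own preprocessing may differ in detail.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale (illustrative stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sort_tokens(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and sort the words."""
    words = re.sub(r"[^0-9a-z]+", " ", text.lower()).split()
    return " ".join(sorted(words))


def token_sort_similarity(a: str, b: str) -> int:
    """Compare two strings after sorting their words."""
    return simple_ratio(sort_tokens(a), sort_tokens(b))


print(sort_tokens('Deluxe Room, 1 King Bed'))  # 1 bed deluxe king room

# Reordered words no longer cost anything.
print(token_sort_similarity('King Deluxe Room', 'Deluxe King Room'))  # 100
```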

  • token_set_ratio ignores duplicated words. It is similar to token_sort_ratio, but a little more flexible.

fuzz.token_set_ratio('Deluxe Room, 1 King Bed', 'Deluxe King Room')

100

fuzz.token_set_ratio('Traditional Double Room, 2 Double Beds', 'Double Room with Two Double Beds')

78

fuzz.token_set_ratio('Room, 2 Double Beds (19th to 25th Floors)', 'Two Double Beds - Location Room (19th to 25th Floors)')

97

It looks like token_set_ratio is the best fit for my data. Based on this finding, I decided to apply token_set_ratio to my entire data set.
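The score of 100 for “Deluxe Room, 1 King Bed” vs. “Deluxe King Room” deserves a closer look. As I understand the token-set approach, each string is split into a set of words, and three strings are built: the shared words alone, the shared words plus the leftovers from the first string, and the shared words plus the leftovers from the second. The best pairwise score wins. The sketch below follows that description with my own simplified tokenization, hypothetical helper names, and `difflib` as the scorer, so it is an approximation rather than the library's code.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale (illustrative stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def word_set(text: str) -> set:
    """Lowercase, replace punctuation with spaces, and return the unique words."""
    return set(re.sub(r"[^0-9a-z]+", " ", text.lower()).split())


def token_set_similarity(a: str, b: str) -> int:
    words_a, words_b = word_set(a), word_set(b)
    common = " ".join(sorted(words_a & words_b))
    # Shared words followed by the words unique to each side.
    combined_a = (common + " " + " ".join(sorted(words_a - words_b))).strip()
    combined_b = (common + " " + " ".join(sorted(words_b - words_a))).strip()
    return max(
        simple_ratio(common, combined_a),
        simple_ratio(common, combined_b),
        simple_ratio(combined_a, combined_b),
    )


# Every word of 'Deluxe King Room' also appears in the longer name, so the
# shared-words string equals the second combined string and the score is 100.
print(token_set_similarity('Deluxe Room, 1 King Bed', 'Deluxe King Room'))  # 100
```

The same property is worth keeping in mind as a caveat: whenever one name's words are a subset of the other's, this kind of score reaches 100 even though the longer name carries extra information.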

def get_ratio(row):
    name = row['Expedia']
    name1 = row['Booking.com']
    return fuzz.token_set_ratio(name, name1)

len(df[df.apply(get_ratio, axis=1) > 70]) / len(df)

0.9029126213592233

With a threshold of 70, over 90% of the pairs score above it. Not too shabby!
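The share of pairs counted as matches depends on the threshold, so it can be worth checking a few values before settling on one. Below is a small sketch in plain Python. The helper `share_above` is a hypothetical name of mine, and the input is just the three token_set_ratio scores reported above for the example pairs, not the full data set.

```python
def share_above(scores, threshold):
    """Fraction of scores strictly greater than the threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(score > threshold for score in scores) / len(scores)


# token_set_ratio scores reported above for the three example pairs.
example_scores = [100, 78, 97]

for threshold in (70, 80, 90):
    print(threshold, share_above(example_scores, threshold))
```

On the full data set, the same loop could be fed with the values returned by `df.apply(get_ratio, axis=1)`.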