An NLP Approach to Mining Online Reviews using Topic Modeling (with Python codes)

E-commerce has revolutionized the way we shop. That phone you’ve been saving up to buy for months? It’s just a search and a few clicks away. Items are delivered within a matter of days (sometimes even the next day!).

Online retailers face far fewer constraints on inventory and floor space, so they can list as many different products as they want. Brick-and-mortar stores, by contrast, can stock only a limited number of products because their space is finite.

I remember placing orders for books at my local bookstore and waiting over a week for them to arrive. It seems like a story from ancient times now!

But online shopping comes with its own caveats. One of the biggest challenges is verifying the authenticity of a product. Is it as good as advertised on the e-commerce site? Will the product last more than a year? Are the reviews given by other customers really true or are they false advertising? These are important questions customers need to ask before splurging their money.

This is a great place to experiment and apply Natural Language Processing (NLP) techniques. This article will help you understand the significance of harnessing online product reviews with the help of Topic Modeling.

Please go through the below articles in case you need a quick refresher on Topic Modeling:

Table of Contents

  1. Importance of Online Reviews
  2. Problem Statement
  3. Why Topic Modeling for this task?
  4. Python Implementation
  5. Other methods to leverage online reviews
  6. What’s Next?

Importance of Online Reviews

A few days back, I took the e-commerce plunge and purchased a smartphone online. It was well within my budget, and it had a better-than-decent rating of 4.5 out of 5.

Unfortunately, it turned out to be a bad decision as the battery backup was well below par. I didn’t go through the reviews of the product and made a hasty decision to buy it based on its ratings alone. And I know I’m not the only one out there who made this mistake!

Ratings alone do not give a complete picture of the products we wish to purchase, as I found to my detriment. So, as a precautionary measure, I always advise people to read a product’s reviews before deciding whether to buy it or not.

But then an interesting problem comes up. What if the number of reviews is in the hundreds or thousands? It’s just not feasible to go through all those reviews, right? And this is where natural language processing comes up trumps.

Setting the Problem Statement

A problem statement is the seed from which your analysis blooms. Therefore, it is really important to have a solid, clear and well-defined problem statement.

How can we analyze a large number of online reviews using Natural Language Processing (NLP)? Let’s define this problem.

Online product reviews are a great source of information for consumers. From the sellers’ point of view, online reviews can be used to gauge consumers’ feedback on the products or services they are selling. However, the sheer number of reviews and the amount of information in them is often overwhelming, so an intelligent system capable of finding key insights (topics) in these reviews would be of great help to both consumers and sellers. This system will serve two purposes:

  1. Enable consumers to quickly extract the key topics covered by the reviews without having to go through all of them
  2. Help the sellers/retailers get consumer feedback in the form of topics (extracted from the consumer reviews)

To solve this task, we will use the concept of Topic Modeling (LDA) on Amazon Automotive Review data. You can download it from this link. Similar datasets for other categories of products can be found here.

Why Should You Use Topic Modeling for This Task?

As the name suggests, Topic Modeling is a process to automatically identify topics present in a text object and to derive hidden patterns exhibited by a text corpus. Topic Models are very useful for multiple purposes, including:

  • Document clustering
  • Organizing large blocks of textual data
  • Information retrieval from unstructured text
  • Feature selection

A good topic model, when trained on some text about the stock market, should produce topics made up of words like “bid”, “trading”, “dividend”, “exchange”, etc. The image below illustrates how a typical topic model works:

In our case, instead of text documents, we have thousands of online product reviews for the items listed under the ‘Automotive’ category. Our aim here is to extract a certain number of groups of important words from the reviews. These groups of words are basically the topics which would help in ascertaining what the consumers are actually talking about in the reviews.
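To make the idea of "groups of important words" concrete, here is a minimal sketch of what a fitted topic model hands back: each topic is a set of words with probabilities, and each review is a mixture of topics. Everything in it (the topic names, the words, the numbers, and the helpers top_words and word_probability) is invented for illustration and does not come from the dataset or from any library.

```python
# Hypothetical topics for automotive reviews: each topic is a set of words
# with probabilities. All names and numbers here are invented for illustration.
topic_word = {
    "wipers": {"blade": 0.40, "wiper": 0.30, "rain": 0.20, "install": 0.10},
    "battery": {"battery": 0.45, "charger": 0.30, "cable": 0.15, "install": 0.10},
}

# Hypothetical topic mixture for a single review (weights sum to 1).
review_topics = {"wipers": 0.75, "battery": 0.25}


def top_words(word_probs, n=3):
    """Return the n highest-probability words of one topic."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def word_probability(word, doc_topics, topics):
    """P(word | review) = sum over topics of P(topic | review) * P(word | topic)."""
    return sum(
        weight * topics[name].get(word, 0.0)
        for name, weight in doc_topics.items()
    )


# The top words are what we would read off as "the topic".
for name, word_probs in topic_word.items():
    print(name, top_words(word_probs))

# How likely is the word "blade" in this particular review?
print(word_probability("blade", review_topics, topic_word))
```

In practice we never write these tables by hand. The whole point of topic modeling is to estimate them from the reviews, and we then interpret each topic by looking at its top words.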

Python Implementation

In this section, we’ll power up our Jupyter notebooks (or any other IDE you use for Python!). Here we’ll work on the problem statement defined above to extract useful topics from our online reviews dataset using the concept of Latent Dirichlet Allocation (LDA).

Note: As I mentioned in the introduction, I highly recommend going through this article to understand what LDA is and how it works.
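As a very brief reminder, LDA assumes every document was generated by first picking a mixture of topics and then drawing each word from one of those topics. The notation below is the standard textbook one rather than something defined in this article: K topics, D documents, Dirichlet hyperparameters α and β, topic-word distributions φ, document-topic distributions θ, and a topic assignment z for the n-th word w of document d.

```latex
\begin{align*}
\phi_k   &\sim \mathrm{Dirichlet}(\beta),  & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n}  &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Training works backwards from the observed words w to estimates of φ (what each topic is about) and θ (which topics each review covers).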

Let’s first load all the necessary libraries:

import nltk
from nltk import FreqDist
nltk.download('stopwords') # run this one time

import pandas as pd
pd.set_option("display.max_colwidth", 200)
import numpy as np
import re
import spacy

import gensim
from gensim import corpora

# libraries for visualization
import pyLDAvis
import pyLDAvis.gensim # in pyLDAvis 3.x this module is named pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

To import the data, first extract it to your working directory and then use the read_json() function of pandas to read it into a pandas dataframe.

df = pd.read_json('Automotive_5.json', lines=True)
df.head()
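The lines=True argument tells pandas that the file holds one JSON object per line (the "JSON Lines" layout) rather than a single big JSON document. The sketch below shows the same idea with only the standard library. The two records are made up, and the field names asin, overall and reviewText are my assumption about how the review file is laid out, so check them against your own copy.

```python
import io
import json

# Two made-up records in JSON Lines form: one complete JSON object per line.
# Field names are assumed to mirror the review file; the values are invented.
sample_file = io.StringIO(
    '{"asin": "B000000001", "overall": 5, "reviewText": "Works great."}\n'
    '{"asin": "B000000002", "overall": 2, "reviewText": "Stopped working after a week."}\n'
)

# Parse each non-empty line on its own, which is roughly what lines=True does.
records = [json.loads(line) for line in sample_file if line.strip()]

for record in records:
    print(record["overall"], record["reviewText"])
```

Each parsed record becomes one row of the dataframe, with the JSON keys as column names.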