
Project Spotlight: Event Recommendation in Python with Artem Yankov

This is a project spotlight with Artem Yankov.

Could you please introduce yourself?

My name is Artem Yankov; I have worked as a software engineer at Badgeville for the last 3 years, where I use Ruby and Scala. My prior background includes various languages such as Assembly, C/C++, Python, Clojure, and JavaScript.

I love hacking on small projects and exploring different fields. For instance, two almost random fields I’ve looked at were robotics and malware analysis. I can’t say I became an expert, but I did have a lot of fun. A small robot I built looked amazingly ugly, but it could mirror my arm motions by “seeing” them through an MS Kinect.

I didn’t do any machine learning at all until last year, when I finished Andrew Ng’s course on Coursera, and I really loved it.

What is your project called and what does it do?

The project is called hapsradar.com and it’s an event recommendation site focused on what’s happening right now or in the near future.

I am a terrible planner of my weekends, and I usually find myself wondering what to do if I suddenly decide to do something outside of my home and the internet. My typical routine for finding out what was going on was to go to sites like meetup.com and eventbrite.com, browse through tons of categories, click lots of buttons, and read lists of current events.

So when I finished the machine learning course and started looking for projects to practice my skills on, I realized I could automate this event-seeking process by fetching event lists from those sites and then building recommendations based on what I like.

HapsRadar by Artem Yankov

The site is very minimalistic and currently pulls events from only two sites: meetup.com and eventbrite.com. A user needs to rate at least 100 events before the recommendation engine kicks in. It then runs every night, trains on the user’s likes and dislikes, and tries to predict events the user might like.

How did you get started?

I started simply because I wanted to practice my machine learning skills, and to make it more fun I chose to solve a real problem I had. After some evaluation I decided to use Python for my recommender. Here’s how I put it together.

Events are fetched using the standard APIs provided by meetup.com and eventbrite.com and stored in PostgreSQL. I emailed both companies before I started my crawlers to double-check that I could do such a thing, specifically because I wanted to run these crawlers every day to keep my database updated with all the events.

The guys were very nice about it, and Eventbrite even bumped up my API rate limit without any questions. Meetup.com has a nice streaming API that lets you subscribe to all the changes as they happen. I wanted to crawl yelp.com as well, since they have event lists, but they prohibit this completely.

After I had a first cut of the data, I built a simple site that displayed the events within some range of a given zip code (I currently fetch events only for the US).

Now for the recommender part. The main material for building my features was the event title and description. I decided that things like the time of day an event happens, or how far it is from your home, wouldn’t add much value, because I just wanted a simple answer to the question: is this event relevant to my interests?

Idea #1. Predict topics

Some of the fetched events have tags or categories, some of them don’t.

Initially I thought I could use the tagged events to predict tags for the untagged ones and then use the tags as training features. After spending some time on that, I figured it wasn’t a good idea. Most of the tagged events had just 1-3 tags, and these were often very inaccurate or even completely random.

I think Eventbrite allows clients to type anything as a tag, and people are just not very good at coming up with good words. Plus, the number of tags per event was usually so low that it wasn’t enough to judge the event even with human intelligence 🙂

Of course, it was possible to find already accurately classified text and use it to predict topics, but that again posed a lot of additional questions: Where would I get the classified text? How relevant would it be to my event descriptions? How many tags should I use? So I decided to look for other ideas.

Idea #2. LDA Topic modeling

After some research I found an awesome python library called gensim which implements LDA (Latent Dirichlet Allocation) for topic modelling.

It’s worth noting that “topics” here does not mean topics as defined in English, like “sports”, “music”, or “programming”. Topics in LDA are probability distributions over words. Roughly speaking, it finds clusters of words that occur together with a certain probability; each such cluster is a “topic”. You can then feed the model a new document, and it infers topics for that document as well.

Using LDA is pretty straightforward. First, I cleaned the documents (in my case, a document is an event’s description and title) by removing English stop words, commas, HTML tags, etc. Then I built a dictionary based on all event descriptions:

from gensim import corpora, models
dct = corpora.Dictionary(clean_documents)
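The cleaning pass that produces clean_documents isn’t shown above. A minimal sketch of what it might contain follows; the stop-word list and regexes here are my own illustrative assumptions, not the project’s actual code:

```python
import re

# Tiny sample stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "at"}

def clean_document(text):
    """Rough cleaning pass: strip HTML tags, lowercase, keep alphabetic tokens,
    and drop stop words, so the result is a token list ready for gensim."""
    text = re.sub(r"<[^>]+>", " ", text)         # remove HTML tags
    words = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
    return [w for w in words if w not in STOP_WORDS]

print(clean_document("<p>Intro to the Python meetup, 7pm!</p>"))
# ['intro', 'python', 'meetup', 'pm']
```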

Then I filtered out very rare words (here, words appearing in fewer than two documents):

dct.filter_extremes(no_below=2)

To train the model, all documents need to be converted into bag-of-words form:

corpus = [dct.doc2bow(doc) for doc in clean_documents]

And then the model is created like this:

from gensim.models import ldamodel
lda = ldamodel.LdaModel(corpus=corpus, id2word=dct, num_topics=num_topics)

Where num_topics is the number of topics to be modelled on the documents; in my case it was 100. Then, to convert any document from its bag-of-words form to its topic representation (a sparse vector of (topic, probability) pairs):

x = lda[doc_bow]

So now I can get a feature vector for any given event, and I can easily build the training matrix for the events the user rated:

docs_bow = [dct.doc2bow(doc) for doc in rated_events]
X_train = [lda[doc_bow] for doc_bow in docs_bow]

That looked like a more or less decent solution: using an SVM (Support Vector Machine) classifier I got about 85% accuracy, and when I looked at the events predicted for me, they did look quite accurate.

Note: not all classifiers support sparse matrices, and sometimes you need to convert to a full matrix. Gensim has a way to do that:

gensim.matutils.sparse2full(sparse_matrix, num_topics)
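To make the conversion concrete, here is a small pure-Python sketch of what that call does (my own illustration, not gensim’s implementation): topics the model assigned no probability to become explicit zeros.

```python
def sparse2full_sketch(sparse_vec, length):
    """Expand gensim-style (index, value) pairs into a dense list of
    the given length, filling unmentioned indices with 0.0."""
    dense = [0.0] * length
    for idx, value in sparse_vec:
        dense[idx] = value
    return dense

# e.g. an LDA output that covers 3 of 5 topics:
doc_topics = [(0, 0.5), (2, 0.3), (4, 0.2)]
print(sparse2full_sketch(doc_topics, 5))  # [0.5, 0.0, 0.3, 0.0, 0.2]
```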

Idea #3. TF-IDF Vectorizer

Another idea I wanted to try for building features was a TF-IDF vectorizer.

Scikit-learn supports it out of the box. It assigns a weight to each word in a document based on the frequency of the word in that document, scaled down by how often the word appears across the corpus of documents. So a word that shows up in most documents gets a low weight, which helps filter out the noise. To build the vectorizer from all the documents:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.5, sublinear_tf=True, stop_words='english')
vectorizer.fit(all_events)

And then, to transform a given set of documents to their TF-IDF representation:

X_train = vectorizer.transform(rated_events)
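As a toy illustration of why corpus-wide frequent words get low weight, here is the classic tf·idf formula computed by hand. Note this is a simplified sketch: scikit-learn’s TfidfVectorizer uses a smoothed variant, so its exact numbers differ.

```python
import math

def tfidf(word, doc, corpus):
    """Classic tf-idf: term frequency in the document times
    log(number of documents / number of documents containing the word)."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in corpus if word in d)
    return tf * math.log(len(corpus) / df)

corpus = [
    ["free", "beer", "at", "the", "meetup"],
    ["yoga", "at", "the", "park"],
    ["jazz", "night", "at", "the", "bar"],
]

# "the" occurs in every document -> idf = log(3/3) = 0 -> zero weight
print(tfidf("the", corpus[0], corpus))  # 0.0
# "beer" occurs in only one document -> nonzero weight
print(round(tfidf("beer", corpus[0], corpus), 3))
```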

Now, when I tried to feed that to a classifier, it took a really long time, and the results were bad. That’s actually not a surprise, because in this case almost every word is a feature. So I started looking for a way to select the best-performing features.

Scikit-learn provides SelectKBest, to which you pass a scoring function and the number of features to select, and it performs the magic for you. For scoring I used chi2 (the chi-squared test), and I can’t tell you exactly why. I just empirically found that it performed better in my case and put “study the theory behind chi2” in my todo bucket.

from sklearn.feature_selection import SelectKBest, chi2
num_features = 100
ch2 = SelectKBest(chi2, k=num_features)
X_train = ch2.fit_transform(X_train, y_train).toarray()

And that’s it. X_train is my training set.
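The whole TF-IDF + chi2 + SVM route can be sketched end to end as a scikit-learn Pipeline. The toy events and like/dislike labels below are made up for illustration; they stand in for the real rated events:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Invented stand-in data: three "liked" tech events, three "disliked" music events.
events = [
    "python machine learning meetup downtown",
    "hands-on workshop deep learning with python",
    "python data science talk neural networks",
    "jazz concert live music night",
    "rock band performance at the music hall",
    "classical music orchestra evening concert",
]
likes = [1, 1, 1, 0, 0, 0]  # 1 = liked, 0 = disliked

# Vectorize text, keep the 5 most discriminative words, then train an SVM.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True, stop_words="english")),
    ("select", SelectKBest(chi2, k=5)),
    ("clf", SVC(kernel="linear")),
])
pipeline.fit(events, likes)

print(pipeline.predict(["late night music concert"]))
```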

Training classifier

I’m not happy to admit it, but there wasn’t much science involved in how I chose the classifier. I just tried a bunch of them and chose the one that performed best. In my case it was SVM. As for the parameters, I used grid search to choose the best ones, and scikit-learn provides all of that out of the box. In code it looks like this:

from sklearn import svm, grid_search

clf = svm.SVC()
params = dict(gamma=[0.001, 0.01, 0.1, 0.2, 1, 10, 100], C=[1, 10, 100, 1000], kernel=["linear", "rbf"])
clf = grid_search.GridSearchCV(clf, param_grid=params, cv=5, scoring='f1')

I chose the f1-score as the scoring method just because it’s the one I more or less understand. Grid search will try all combinations of the parameters above, perform cross-validation, and find the parameters that perform best.
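In current scikit-learn the grid search lives in sklearn.model_selection rather than sklearn.grid_search. A runnable sketch with a trimmed parameter grid, using the bundled iris data (first two classes, as a stand-in for the real event features so f1 is binary):

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
X, y = X[:100], y[:100]  # keep only classes 0 and 1 for a binary f1 score

# Smaller grid than the article's, to keep the example fast.
params = dict(
    gamma=[0.001, 0.01, 0.1, 1],
    C=[1, 10, 100],
    kernel=["linear", "rbf"],
)
search = GridSearchCV(svm.SVC(), param_grid=params, cv=5, scoring="f1")
search.fit(X, y)

# best_params_ holds the winning combination, best_score_ its mean CV f1.
print(search.best_params_)
print(search.best_score_)
```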

I tried feeding this classifier both the X_train with topics modelled by LDA and the TF-IDF + chi2 one. Both performed similarly, but subjectively the TF-IDF + chi2 solution seemed to generate better predictions. I was pretty much satisfied with the results for v1 and spent the rest of the time fixing the website’s UI.

What are some interesting discoveries you made?

One of the things I learned is that if you are building a recommendation system and expect your users to come and rate a bunch of things at once before it can work, you are wrong.

I tried the site on my friends, and although the rating process seemed very easy and fast to me, it was pretty hard to make them spend a few minutes clicking a “like” button. That was alright, since my main goal was to practice my skills and build a tool for myself, but I figured that if I want to make something bigger out of it, I need to make the rating process simpler.

Another thing I learned is that in order to be more efficient I need to understand the algorithms better. Tweaking parameters is way more fun when you understand what you are doing.

What do you want to do next on the project?

My main problem currently is the UI. I want to keep it minimalistic, but I need to figure out how to make the rating process more fun and convenient. Event browsing could be better too.

After that part is done, I’m thinking of searching for new sources of events: conferences, concerts, etc. Maybe I’ll add a mobile app for that as well.

Learn More

Thanks Artem.

Do you have a machine learning side project?

If you have a side project that uses machine learning and want to be featured like Artem, please contact me.

