Building a Content-Based Recommender System with Natural Language Processing

阿新 • Published: 2018-11-27

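Dataset download: https://query.data.world/s/uikepcpffyo2nhig52xxeevdialfl7
 1. Extract the data: movie title, genre, director, actors, and plot.
 2. Clean the data:
      Plot: use rake_nltk to strip the stop words and rank the keywords.
      Director and actors: remove the spaces so that each first name plus last name becomes a single word.
 3. Concatenate all the keywords into a bag_of_words and compute the similarity between movies.
 4. Recommend the top 10 movies for a given title.
 Key techniques: rake_nltk, cosine_similarity from sklearn, and CountVectorizer from sklearn.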
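
To make steps 2 and 3 concrete before the full script, here is a minimal pure-Python sketch of what the name merging, the bag of words, and the cosine similarity boil down to. It is only an illustration under stated assumptions: the three movies and their keywords are hypothetical toy data, `merge_name` and `cosine` are helper names invented for this sketch, and splitting on whitespace is a simplification of what CountVectorizer does with its default tokenizer.

```python
import math
from collections import Counter


def merge_name(full_name):
    """Collapse a full name into one lowercase word, e.g. 'Joel Coen' -> 'joelcoen'."""
    return full_name.lower().replace(' ', '')


def cosine(counts_a, counts_b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data, not taken from the IMDB file.
bags = {
    'Movie A': 'crime drama ' + merge_name('Joel Coen') + ' kidnapping',
    'Movie B': 'crime thriller ' + merge_name('Joel Coen') + ' heist',
    'Movie C': 'animation family toys friendship',
}

# Whitespace tokenization stands in for CountVectorizer here (an assumption).
counts = {title: Counter(text.split()) for title, text in bags.items()}

sim_ab = cosine(counts['Movie A'], counts['Movie B'])  # shares 'crime' and 'joelcoen'
sim_ac = cosine(counts['Movie A'], counts['Movie C'])  # shares no words
print(sim_ab, sim_ac)
```

Movies A and B share two of their four words, so they score 0.5, while A and C share none and score 0. Merging the name into one word is what keeps two different people who share a first name from counting as a match.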

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
from rake_nltk import Rake
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

pd.set_option('display.max_columns', 100)
df = pd.read_csv('IMDB_Top250Engmovies2_OMDB_Detailed.csv')
print(df.head())
print(df.shape)
df = df[['Title','Genre','Director','Actors','Plot']]
print(df.head())
# discarding the commas between the actors' full names and getting only the first three names
df['Actors'] = df['Actors'].map(lambda x: x.split(',')[:3])
print("==== Actors ====")
print(df['Actors'][:3])

# putting the genres in a list of words
df['Genre'] = df['Genre'].map(lambda x: x.lower().split(','))
print("=== Genres ===")
print(df['Genre'][:3])

# splitting the director's full name into its parts
df['Director'] = df['Director'].map(lambda x: x.split(' '))
print("=== Directors ===")
print(df['Director'][:3])

# merging together first and last name for each actor and director, so it's considered as one word
# and there is no mix up between people sharing a first name
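# iterrows() yields a copy of each row, so assigning to `row` is not guaranteed
# to write back to df; transform the columns directly instead
# turn each first name plus last name into a single lowercase word
df['Actors'] = df['Actors'].map(lambda names: [x.lower().replace(' ', '') for x in names])
df['Director'] = df['Director'].map(lambda parts: ''.join(parts).lower())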

for index, row in df.iterrows():
    if index < 3:
        print("=== Actors ===")
        print(row['Actors'])
        print("=== Director ===")
        print(row['Director'])

# extracting the key words of one plot
def extract_key_words(plot):
    # instantiating Rake; by default it uses the English stop words from NLTK
    # and discards all punctuation characters
    # (the NLTK data it relies on may have to be fetched first with nltk.download)
    r = Rake()

    # extracting the words by passing the text
    r.extract_keywords_from_text(plot)

    # getting the dictionary with key words and their scores
    key_words_dict_scores = r.get_word_degrees()

    # only the key words themselves are kept
    return list(key_words_dict_scores.keys())

# assigning the key words to the new column; mapping over the column writes the
# result back to df, which assigning to `row` inside iterrows() does not
df['Key_words'] = df['Plot'].map(extract_key_words)
print("===key_words===")
print(df['Key_words'][:3])

# dropping the Plot column
df.drop(columns=['Plot'], inplace=True)

df.set_index('Title', inplace = True)
print(df.head())

print("===df.columns====")
print(df.columns)

# list-valued columns are joined with spaces; plain strings are used as they are
def to_text(value):
    return ' '.join(value) if isinstance(value, list) else str(value)

# concatenating every remaining column into one string per movie
feature_columns = list(df.columns)
df['bag_of_words'] = df[feature_columns].apply(
    lambda row: ' '.join(to_text(row[col]) for col in feature_columns), axis=1)

for index, row in df.iterrows():
    print("===bag_of_words===")
    print(row['bag_of_words'])

df.drop(columns=[col for col in df.columns if col != 'bag_of_words'], inplace=True)

print("===head:===")
print(df.head())

# instantiating and generating the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])

# creating a Series for the movie titles so they are associated to an ordered numerical
# list I will use later to match the indexes
indices = pd.Series(df.index)
indices[:5]
print("===indices:===")
print(indices)
# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
# cosine_sim
print("===cosine_sim:===")
print(cosine_sim)


# function that takes in movie title as input and returns the top 10 recommended movies
def recommendations(title, cosine_sim=cosine_sim):
    recommended_movies = []

    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]
    print("===idx:====")
    print(idx)

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)
    print("===score_series===")
    print(score_series)

    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)

    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])

    return recommended_movies

print(recommendations('Fargo'))
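
The function above takes `iloc[1:11]` of the sorted scores, which relies on the queried movie itself landing in first place. If another movie had an identical bag of words, both would score 1.0 and the order between them would not be guaranteed. The following is a minimal sketch of one way to skip the queried movie by index rather than by position. It is not part of the original script: `top_n_similar` is a hypothetical helper and the similarity row is made-up data.

```python
import heapq


def top_n_similar(sim_row, query_idx, n=10):
    """Return the indexes of the n highest scores in sim_row, skipping query_idx."""
    # Pair each score with its index and leave out the queried movie itself.
    candidates = ((score, i) for i, score in enumerate(sim_row) if i != query_idx)
    # nlargest sorts by score first, then by index for equal scores.
    return [i for score, i in heapq.nlargest(n, candidates)]


# Hypothetical similarity row for movie 0; movie 2 is an exact duplicate (score 1.0).
row = [1.0, 0.2, 1.0, 0.7, 0.0]
print(top_n_similar(row, query_idx=0, n=3))  # → [2, 3, 1]
```

With the real data, one row of `cosine_sim` would be passed as `sim_row`, and the returned indexes looked up in `indices` to get the titles.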
