
Data mining with Python


Data mining aims to let a computer make decisions based on existing data.

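The first step in data mining is usually to create a dataset that describes some aspect of the real world. A dataset has two main parts: (1) samples, which represent objects in the real world, and (2) features, which describe the samples in the dataset.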
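As a small illustration of samples and features, the sketch below turns a few shopping baskets into a 0/1 table. The baskets are made-up data and the names `baskets`, `features` and `dataset` are chosen only for this example.

```python
# Made-up shopping baskets: each basket is one sample.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

# The features are the distinct items, kept in a fixed (sorted) order.
features = sorted(set().union(*baskets))

# One row per sample, one 0/1 column per feature:
# 1 means the basket contains that item, 0 means it does not.
dataset = [[1 if item in basket else 0 for item in features] for basket in baskets]

print(features)
for row in dataset:
    print(row)
```

Each row is a sample and each column is a feature, which is the layout most data mining libraries expect.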

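The next step is to tune the algorithm. Every data mining algorithm has parameters, either built into the algorithm itself or supplied by the user, and these parameters affect the specific decisions the algorithm makes.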

There are many ways to measure how good a rule is; the most common are support and confidence.

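Support is the number of times a rule holds in the dataset, so it measures how often the rule occurs. Confidence measures how accurate the rule is when it applies: of all the samples that satisfy the rule's premise (its "if" part), it is the proportion whose outcome also matches the rule's conclusion. To compute it, first count how many times the rule holds, then divide that count by the number of samples that satisfy the premise.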
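The two measures can be sketched on a toy example. The transactions below are made up, and `support_and_confidence` is a hypothetical helper written for this illustration, not part of any library.

```python
# Made-up transactions: each set holds the items one customer bought.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def support_and_confidence(premise, conclusion, transactions):
    """Return (support, confidence) for the rule 'if premise then conclusion'."""
    premise = frozenset(premise)
    # Number of transactions in which the premise holds.
    premise_count = sum(1 for t in transactions if premise <= t)
    # Number of transactions in which premise and conclusion both hold (support).
    rule_count = sum(1 for t in transactions if premise <= t and conclusion in t)
    # Confidence is undefined when the premise never occurs; report 0.0 here.
    confidence = rule_count / premise_count if premise_count else 0.0
    return rule_count, confidence


# Rule: "if a customer buys bread, they also buy milk".
support, confidence = support_and_confidence({"bread"}, "milk", transactions)
print(support, confidence)
```

Bread appears in four transactions and three of them also contain milk, so the rule has a support of 3 and a confidence of 3/4.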

Affinity analysis: determining how closely related samples (objects) are, based on the similarity between them.

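Frequent itemsets are built by taking the items that occur frequently in the dataset and selecting those that also occur together.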
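For pairs of items, co-occurrence can be counted directly. The sketch below uses made-up baskets and an arbitrary threshold of 2; it is a brute-force count that only scales to small data, which is the motivation for a smarter search.

```python
from collections import Counter
from itertools import combinations

# Made-up baskets: each set holds the items bought together.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

# Count every pair of items that appears in the same basket.
# Sorting the basket makes each pair a tuple in a consistent order.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep only the pairs seen together at least min_support times.
min_support = 2  # arbitrary threshold for this example
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent_pairs)
```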

The Apriori algorithm

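The Apriori algorithm is part of affinity analysis and is designed specifically to find frequent itemsets in a dataset. Its basic procedure is to build new candidate itemsets from the frequent itemsets found in the previous step and then test whether those candidates are frequent enough. The algorithm iterates as follows.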

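(1) Put each item into an itemset that contains only that item, producing the initial frequent itemsets. Use only items that reach the minimum support.

(2) Look for supersets of the existing frequent itemsets and use them to generate new candidate itemsets.

(3) Test how frequent the new candidate itemsets are and discard those that are not frequent enough. If no new frequent itemsets are found, jump to the last step.

(4) Store the newly found frequent itemsets and go back to step (2).

(5) Return all the frequent itemsets found.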
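The five steps above can be sketched compactly as follows. This is a minimal illustration on made-up data, assuming support means "number of transactions that contain the itemset"; the function name `apriori` and the variable names are chosen for this example and are not the code used later in this post.

```python
from collections import defaultdict


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Step 1: single-item itemsets that reach the minimum support.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset((item,))] += 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(current)

    while current:
        # Step 2: extend each frequent itemset by one item that co-occurs with it.
        candidates = set()
        for t in transactions:
            for itemset in current:
                if itemset <= t:
                    for other in t - itemset:
                        candidates.add(itemset | {other})
        # Step 3: count each candidate once per transaction, drop the infrequent.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        # Step 4: store the new frequent itemsets; the loop returns to step 2.
        all_frequent.update(current)

    # Step 5: return everything found.
    return all_frequent


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
result = apriori(transactions, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum support of 2, the toy data yields three frequent single items and two frequent pairs; no three-item set is frequent, so the loop stops there.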


Recommending movies with affinity analysis

# Author: song
# coding: utf-8
import os
import sys
from collections import defaultdict
from operator import itemgetter

import pandas as pd

# Load the ratings. The file is tab-separated and has no header row
# (header=None), so the column names are supplied explicitly.
ratings_filename = os.path.join(os.getcwd(), "Data", "ml-100k", "u.data")
all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None,
                          names=["UserID", "MovieID", "Rating", "Datetime"])
# Parse the timestamp column (seconds since the epoch).
all_ratings["Datetime"] = pd.to_datetime(all_ratings["Datetime"], unit="s")

# Apriori implementation. Rules have the form: if a user likes certain movies,
# they will also like this movie. As an extension, we also ask whether users
# who like several given movies like another one.

# New feature "Favorable": whether the user liked the movie.
all_ratings["Favorable"] = all_ratings["Rating"] > 3
# Use a subset of the data as the training set; this shrinks the search space
# and speeds up Apriori.
ratings = all_ratings[all_ratings["UserID"].isin(range(200))]
# Only the rows in which a user liked a movie.
favorable_ratings = ratings[ratings["Favorable"]]
# Which movies each user liked: group by UserID and collect the MovieIDs.
favorable_reviews_by_users = dict(
    (k, frozenset(v.values))
    for k, v in favorable_ratings.groupby("UserID")["MovieID"])
# Number of fans per movie.
num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()
# print(num_favorable_by_movie.sort_values("Favorable", ascending=False)[:5])

frequent_itemsets = {}
min_support = 50  # minimum support

# Build an itemset containing only the movie itself for every movie and keep
# it if it is frequent enough. Movie IDs are stored in frozensets.
frequent_itemsets[1] = dict(
    (frozenset((movie_id,)), row["Favorable"])
    for movie_id, row in num_favorable_by_movie.iterrows()
    if row["Favorable"] > min_support)


def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    """Take the frequent itemsets found so far, build supersets, test their frequency."""
    counts = defaultdict(int)
    # Go through every user and the movies they liked.
    for user, reviews in favorable_reviews_by_users.items():
        # Check whether each previously found itemset is a subset of this
        # user's favorable reviews, i.e. the user liked every movie in it.
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                # Every other movie the user liked extends the itemset into a
                # superset; increment that superset's count.
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    # Keep only the itemsets that reach the minimum support.
    return dict([(itemset, frequency) for itemset, frequency in counts.items()
                 if frequency >= min_support])


for k in range(2, 20):
    cur_frequent_itemsets = find_frequent_itemsets(
        favorable_reviews_by_users, frequent_itemsets[k - 1], min_support)
    frequent_itemsets[k] = cur_frequent_itemsets
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("I found {} frequent itemsets of length{}".format(len(cur_frequent_itemsets), k))
        sys.stdout.flush()
# Drop the itemsets of length 1: a rule needs a premise and a conclusion.
del frequent_itemsets[1]

# Go through the frequent itemsets of every length and generate rules from
# each one: every movie in turn is the conclusion, the rest form the premise.
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
print(candidate_rules[:5])

# Two dictionaries: how often each rule holds (correct) and how often its
# premise applies but the conclusion does not (incorrect).
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1

# Confidence: times the rule holds divided by times the premise applies.
rule_confidence = {
    candidate_rule: correct_counts[candidate_rule]
    / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
    for candidate_rule in candidate_rules}
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    print("Rule: If a person recommends {0} they will alsorecommend {1}".format(premise, conclusion))
    print(" - Confidence:{0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")

# Load the movie titles from the u.item file with pandas.
movie_name_filename = os.path.join(os.getcwd(), "Data", "ml-100k", "u.item")
movie_name_data = pd.read_csv(movie_name_filename, delimiter="|", header=None,
                              encoding="mac-roman")
movie_name_data.columns = [
    "MovieID", "Title", "Release Date", "Video Release", "IMDB", "<UNK>",
    "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime",
    "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical",
    "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]


def get_movie_name(movie_id):
    """Look up a movie title by its ID."""
    title_object = movie_name_data[movie_name_data["MovieID"] == movie_id]["Title"]
    title = title_object.values[0]
    return title


# Print the top rules again, this time with movie titles.
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they willalso recommend {1}".format(premise_names, conclusion_name))
    print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")

# Evaluation: count how often each rule holds, exactly as before, except that
# the test data is used instead of the training data.
# The test set is everything that is not in the training set.
test_dataset = all_ratings[~all_ratings["UserID"].isin(range(200))]
test_favorable = test_dataset[test_dataset["Favorable"]]
test_favorable_by_users = dict(
    (k, frozenset(v.values))
    for k, v in test_favorable.groupby("UserID")["MovieID"])
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
test_confidence = {
    candidate_rule: correct_counts[candidate_rule]
    / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
    for candidate_rule in rule_confidence}
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they will alsorecommend {1}".format(premise_names, conclusion_name))
    print(" - Train Confidence:{0:.3f}".format(rule_confidence.get((premise, conclusion), -1)))
    print(" - Test Confidence:{0:.3f}".format(test_confidence.get((premise, conclusion), -1)))
    print("")

Output

I found 93 frequent itemsets of length2
I found 295 frequent itemsets of length3
I found 593 frequent itemsets of length4
I found 785 frequent itemsets of length5
I found 677 frequent itemsets of length6
I found 373 frequent itemsets of length7
I found 126 frequent itemsets of length8
I found 24 frequent itemsets of length9
I found 2 frequent itemsets of length10
Did not find any frequent itemsets of length 11
[(frozenset({79}), 258), (frozenset({258}), 79), (frozenset({50}), 64), (frozenset({64}), 50), (frozenset({127}), 181)]
Rule #1
Rule: If a person recommends frozenset({98, 172, 127, 174, 7}) they will alsorecommend 64
 - Confidence:1.000

Rule #2
Rule: If a person recommends frozenset({56, 1, 64, 127}) they will alsorecommend 98
 - Confidence:1.000

Rule #3
Rule: If a person recommends frozenset({64, 100, 181, 174, 79}) they will alsorecommend 56
 - Confidence:1.000

Rule #4
Rule: If a person recommends frozenset({56, 100, 181, 174, 127}) they will alsorecommend 50
 - Confidence:1.000

Rule #5
Rule: If a person recommends frozenset({98, 100, 172, 79, 50, 56}) they will alsorecommend 7
 - Confidence:1.000

Rule #1
Rule: If a person recommends Silence of the Lambs, The (1991), Empire Strikes Back, The (1980), Godfather, The (1972), Raiders of the Lost Ark (1981), Twelve Monkeys (1995) they willalso recommend Shawshank Redemption, The (1994)
 - Confidence: 1.000

Rule #2
Rule: If a person recommends Pulp Fiction (1994), Toy Story (1995), Shawshank Redemption, The (1994), Godfather, The (1972) they willalso recommend Silence of the Lambs, The (1991)
 - Confidence: 1.000

Rule #3
Rule: If a person recommends Shawshank Redemption, The (1994), Fargo (1996), Return of the Jedi (1983), Raiders of the Lost Ark (1981), Fugitive, The (1993) they willalso recommend Pulp Fiction (1994)
 - Confidence: 1.000

Rule #4
Rule: If a person recommends Pulp Fiction (1994), Fargo (1996), Return of the Jedi (1983), Raiders of the Lost Ark (1981), Godfather, The (1972) they willalso recommend Star Wars (1977)
 - Confidence: 1.000

Rule #5
Rule: If a person recommends Silence of the Lambs, The (1991), Fargo (1996), Empire Strikes Back, The (1980), Fugitive, The (1993), Star Wars (1977), Pulp Fiction (1994) they willalso recommend Twelve Monkeys (1995)
 - Confidence: 1.000

Rule #1
Rule: If a person recommends Silence of the Lambs, The (1991), Empire Strikes Back, The (1980), Godfather, The (1972), Raiders of the Lost Ark (1981), Twelve Monkeys (1995) they will alsorecommend Shawshank Redemption, The (1994)
 - Train Confidence:1.000
 - Test Confidence:0.854

Rule #2
Rule: If a person recommends Pulp Fiction (1994), Toy Story (1995), Shawshank Redemption, The (1994), Godfather, The (1972) they will alsorecommend Silence of the Lambs, The (1991)
 - Train Confidence:1.000
 - Test Confidence:0.870

Rule #3
Rule: If a person recommends Shawshank Redemption, The (1994), Fargo (1996), Return of the Jedi (1983), Raiders of the Lost Ark (1981), Fugitive, The (1993) they will alsorecommend Pulp Fiction (1994)
 - Train Confidence:1.000
 - Test Confidence:0.756

Rule #4
Rule: If a person recommends Pulp Fiction (1994), Fargo (1996), Return of the Jedi (1983), Raiders of the Lost Ark (1981), Godfather, The (1972) they will alsorecommend Star Wars (1977)
 - Train Confidence:1.000
 - Test Confidence:0.975

Rule #5
Rule: If a person recommends Silence of the Lambs, The (1991), Fargo (1996), Empire Strikes Back, The (1980), Fugitive, The (1993), Star Wars (1977), Pulp Fiction (1994) they will alsorecommend Twelve Monkeys (1995)
 - Train Confidence:1.000
 - Test Confidence:0.609
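One property of find_frequent_itemsets is worth keeping in mind when reading these numbers: a user whose favorable set contains a k-item superset adds to its count once for every frequent (k-1)-item subset that can be extended into it, so a count can be larger than the number of users who actually liked all k movies. The sketch below is a possible variant, not the code that produced the output above; the name `find_frequent_itemsets_dedup` and the toy data are assumptions made for this illustration. It collects each user's supersets in a set first, so every superset is counted at most once per user.

```python
from collections import defaultdict


def find_frequent_itemsets_dedup(favorable_reviews_by_users, k_1_itemsets, min_support):
    """Variant of find_frequent_itemsets that counts a superset once per user."""
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        # Gather this user's supersets in a set so duplicates collapse.
        supersets = set()
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_reviewed_movie in reviews - itemset:
                    supersets.add(itemset | frozenset((other_reviewed_movie,)))
        for superset in supersets:
            counts[superset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Made-up data: three users and the movie IDs each one rated favorably.
favorable = {
    1: frozenset({10, 20, 30}),
    2: frozenset({10, 20}),
    3: frozenset({20, 30}),
}
singles = {frozenset({10}): 2, frozenset({20}): 3, frozenset({30}): 2}
pairs = find_frequent_itemsets_dedup(favorable, singles, min_support=2)
print(pairs)
```

On this toy input only user 1 liked both movie 10 and movie 30, so the pair {10, 30} is dropped here. The original function would count it twice for that user (once starting from {10}, once from {30}) and keep it. Switching to this variant on the real data would change the itemset counts and rules reported above.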


