
Learning Data Mining with Python: Affinity Analysis

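I recently started looking into data mining with Python, working mainly from the book 《Python資料探勘入門與實踐》 (Learning Data Mining with Python). This post collects my notes on the book's content and on some problems I ran into along the way.

Data mining aims to let computers make decisions based on existing data.

The first step in data mining is usually to build a dataset. A dataset consists mainly of:

(1) Samples: objects in the real world

(2) Features: descriptions of the samples in the dataset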

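The first topic I studied was affinity analysis, which determines how closely related individual samples are based on how similar they are.

The example uses a dataset of grocery purchases covering five products: bread, milk, cheese, apples, and bananas.

Each feature can only take the value 0 or 1, indicating whether the product was purchased, not how many were purchased.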
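To make the layout concrete, here is a minimal sketch of such a dataset built by hand. The five rows in `toy_X` are made-up values for illustration, not taken from the book's data file, and the column order is assumed to follow the product list above.

```python
import numpy as np

# Column order is an assumption matching the product list above
features = ["bread", "milk", "cheese", "apple", "banana"]

# Hypothetical toy data: each row is one shopper (a sample),
# each column is 1 if that item was bought and 0 otherwise (a feature)
toy_X = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
])

n_samples, n_features = toy_X.shape

# Summing a 0/1 column counts how many shoppers bought that item
buyers_per_item = dict(zip(features, toy_X.sum(axis=0).tolist()))
print(n_samples, n_features)
print(buyers_per_item)
```

Because the values are only 0 or 1, a column sum is a purchase count, which is what the rule statistics later in the post are built on.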

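Given the samples and features, we want to find rules of the form "if a person buys X, they are likely to also buy Y".

Once we have the rules, we need a way to judge how good they are. Two measures are used here: support and confidence. Support is the number of samples in which a rule holds; confidence is the fraction of the samples matching the premise in which the conclusion also holds.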
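As a small worked example of the two measures, the sketch below scores the single rule "if a person buys cheese, they also buy banana" on a hand-made dataset. The data and the helper name `rule_stats` are hypothetical, introduced only for illustration.

```python
import numpy as np

# Hypothetical toy data; columns: bread, milk, cheese, apple, banana
toy_X = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 1],
])


def rule_stats(X, premise, conclusion):
    """Return (support, confidence) for the rule premise -> conclusion."""
    premise_rows = X[:, premise] == 1               # samples where the premise applies
    both = premise_rows & (X[:, conclusion] == 1)   # samples where the rule holds
    support = int(both.sum())
    n_premise = int(premise_rows.sum())
    # Guard against a premise that never occurs
    confidence = support / n_premise if n_premise else 0.0
    return support, confidence


support, confidence = rule_stats(toy_X, premise=2, conclusion=4)
print(support, confidence)
```

In this toy data four shoppers bought cheese and three of them also bought bananas, so the support is 3 and the confidence is 3/4.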

The code is as follows:

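"""
From the book 《Python資料探勘入門與實踐》
Affinity analysis
Each column of the dataset records whether an item was bought: bread, milk, cheese, apple, banana
support: the number of times a rule holds
confidence: the fraction of cases in which a rule holds, out of those where its premise applies
"""
import numpy as np
from collections import defaultdict  # dict that supplies a default value (0 for int) for missing keys
from operator import itemgetter  # builds the key function used to sort dictionary items by value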


dataset_filename = r'F:\Python\pycharm\DataAnalysis_test\data\affinity_dataset.txt'
X = np.loadtxt(dataset_filename)
# print(X[:15])  # show the first 15 rows
features = ["bread", "milk", "cheese", "apple", "banana"]  # feature names, in column order

"""Count how many people bought apples"""
# num_apple_buy = 0
# for sample in X:
#     if sample[3] == 1:
#         num_apple_buy += 1
# print("{0} people bought Apples".format(num_apple_buy))

"""Build the rule dictionaries"""
valid_rules = defaultdict(int)  # number of times each rule held
invalid_rules = defaultdict(int)  # number of times each rule failed
num_occurances = defaultdict(int)  # number of times each premise (the "if ..." part) occurred
n_features = 5  # number of features
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue
        else:
            num_occurances[premise] += 1  # the premise applies to this sample
            for conclusion in range(n_features):
                if premise == conclusion:
                    continue
                else:
                    if sample[conclusion] == 1:
                        valid_rules[(premise, conclusion)] += 1  # rule held
                    else:
                        invalid_rules[(premise, conclusion)] += 1  # rule failed

# Compute each rule's confidence (how often the rule is right) and support (how many times it held)
support = valid_rules
confidence = defaultdict(float)
for (premise, conclusion) in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurances[premise]

"""Print a rule together with its confidence and support"""
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("rule: if a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print("confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print("support: {0}".format(support[(premise, conclusion)]))


"""Sort the rules to find the best ones"""
def best_rule():
    sorted_support = sorted(support.items(),
                            key=itemgetter(1),  # sort by the dictionary values
                            reverse=True)  # descending order
    # Same ordering by confidence; not used below, swap it in to rank by confidence instead
    sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
    for index in range(5):  # print the five rules with the highest support
        print("RULE #{0}".format(index + 1))
        premise, conclusion = sorted_support[index][0]
        print_rule(premise, conclusion, support, confidence, features)

if __name__ == '__main__':
    premise = 2
    conclusion = 4
    # print_rule(premise, conclusion, support, confidence, features)
    best_rule()
    # print(valid_rules)

The output lists the five rules with the highest support, each with its scores:

RULE #1
rule: if a person buys cheese they will also buy banana
confidence: 0.659
support: 27
RULE #2
rule: if a person buys banana they will also buy cheese
confidence: 0.458
support: 27
RULE #3
rule: if a person buys apple they will also buy cheese
confidence: 0.694
support: 25
RULE #4
rule: if a person buys cheese they will also buy apple
confidence: 0.610
support: 25
RULE #5
rule: if a person buys banana they will also buy apple
confidence: 0.356
support: 21
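The ranking above is by support. As an alternative sketch, the same counts can be obtained without the nested sample loop: for a 0/1 matrix, `X.T @ X` gives the number of samples in which each pair of items was bought together, and its diagonal gives the per-item purchase counts. The toy data and the function name `rank_by_confidence` are assumptions for illustration; this is not the book's implementation.

```python
import numpy as np

features = ["bread", "milk", "cheese", "apple", "banana"]

# Hypothetical toy data, one row per shopper
toy_X = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 1],
])


def rank_by_confidence(X, top_n=3):
    """Return the top_n rules as (premise, conclusion, support, confidence)."""
    X = np.asarray(X, dtype=int)
    co_counts = X.T @ X                  # co_counts[i, j] = samples containing both i and j
    premise_counts = np.diag(co_counts)  # samples in which item i was bought
    rules = []
    for i in range(X.shape[1]):
        if premise_counts[i] == 0:
            continue  # premise never occurs, so confidence is undefined
        for j in range(X.shape[1]):
            # Keep only rules that held at least once, as the loop version does
            if i != j and co_counts[i, j] > 0:
                rules.append((i, j, int(co_counts[i, j]),
                              float(co_counts[i, j] / premise_counts[i])))
    # Sort by confidence first, then by support, both descending
    rules.sort(key=lambda r: (r[3], r[2]), reverse=True)
    return rules[:top_n]


top_rules = rank_by_confidence(toy_X)
for premise, conclusion, support, confidence in top_rules:
    print("{0} -> {1}: confidence {2:.3f}, support {3}".format(
        features[premise], features[conclusion], confidence, support))
```

Ranking by confidence and by support can disagree, as RULE #1 and RULE #3 in the output above show, so it is worth looking at both numbers before trusting a rule.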

Download link for the dataset used in this example: grocery purchase dataset download
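If the file is not at hand, a stand-in with the same layout can be generated to exercise the code. This is only a sketch under assumptions: the purchase probabilities, the sample count of 100, and the output filename `affinity_dataset_synthetic.txt` are all made up here, so the rules found on it will not match the output shown above.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(seed=14)  # fixed seed so reruns produce the same file

n_samples, n_features = 100, 5  # assumed sizes, not taken from the original file
# Assumed per-item purchase probabilities: bread, milk, cheese, apple, banana
purchase_prob = np.array([0.3, 0.5, 0.4, 0.35, 0.6])

# Each entry is 1 with that item's probability, otherwise 0
synthetic_X = (rng.random((n_samples, n_features)) < purchase_prob).astype(int)

# Write whitespace-separated integers, a layout np.loadtxt reads by default
out_path = os.path.join(tempfile.gettempdir(), "affinity_dataset_synthetic.txt")
np.savetxt(out_path, synthetic_X, fmt="%d")

# Round trip to confirm the file loads the way the script above expects
loaded = np.loadtxt(out_path)
print(loaded.shape)
```

Pointing `dataset_filename` at `out_path` then lets the script run end to end. Since the columns here are generated independently, any high-confidence rule it reports reflects only how common the concluded item is, not a real association.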