
Data Mining | Affinity Analysis (Part 2)

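The previous post introduced a simple affinity analysis, but it computed the support and confidence of only a single rule. This post shows how to compute the support and confidence of every rule.

First, create three dictionaries: one counting the cases in which each rule holds (valid rules), one counting the cases in which it fails (invalid rules), and one counting how often each premise occurs.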

# Dictionaries holding the valid and invalid counts of each rule
from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)   # number of samples in which each premise occurs
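
As a brief aside on why defaultdict(int) is convenient here, the following minimal sketch shows that a missing key starts at 0, so a counter can be incremented without initialising it first. The name counts is only an illustration and does not appear in the article's code.

```python
from collections import defaultdict

counts = defaultdict(int)      # a missing key is created with int() == 0
counts[(1, 2)] += 1            # no KeyError, no explicit initialisation
counts[(1, 2)] += 1

print(counts[(1, 2)])          # count for a key that was incremented twice
print((0, 3) in counts)        # a key that was never touched is absent
```

One thing to keep in mind: merely reading a missing key with counts[key] also inserts it with the value 0, so use `in` or .get() when a lookup should not modify the dictionary.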

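With the dictionaries in place, go through the samples, decide for each rule whether it holds, and record the result in the corresponding dictionary.

Each key is a tuple such as (1, 2), which stands for the rule "a customer who bought milk also bought cheese". In valid_rules, a value of 7 for that key would mean that 7 customers bought both milk and cheese.

A nested loop checks each condition and updates the matching dictionary.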

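# A key such as (1, 2) stands for the rule milk -> cheese; the value is a count
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue  # premise: the customer bought this product
        num_occurences[premise] += 1       # premise holds, so count one more occurrence of it
        for conclusion in range(n_features):  # conclusion: what else was bought given the premise
            if premise == conclusion:      # skip rules whose premise and conclusion are the same
                continue
            if sample[conclusion] == 1:    # rule holds: count it in the valid-rules dictionary
                valid_rules[(premise, conclusion)] += 1
            else:                          # rule fails: count it in the invalid-rules dictionary
                invalid_rules[(premise, conclusion)] += 1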
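
To see the counting loop at work on something small enough to verify by hand, here is a minimal sketch that runs the same logic on a made-up matrix of 4 baskets and 3 products. The toy data is an assumption for illustration only and is not the article's dataset.

```python
from collections import defaultdict
import numpy as np

# Hypothetical toy data: 4 baskets (rows), 3 products (columns 0, 1, 2).
X = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
])
n_samples, n_features = X.shape

valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)

for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue                      # premise not bought in this basket
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:
                continue                  # a product does not imply itself
            if sample[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1

print(dict(num_occurences))
print(valid_rules[(0, 1)], invalid_rules[(0, 1)])
```

For every rule, the valid and invalid counts add up to the number of occurrences of its premise, which is a handy sanity check on the loop.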

Printing the three dictionaries (valid_rules, invalid_rules and num_occurences, in that order) gives:

defaultdict(<class 'int'>, {(1, 2): 8, (1, 4): 20, (2, 1): 8, (2, 4): 29, (4, 1): 20, (4, 2): 29, (2, 3): 19, (3, 2): 19, (3, 4): 24, (4, 3): 24, (1, 3): 15, (3, 1): 15, (0, 1): 23, (0, 2): 4, (0, 3): 11, (0, 4): 21, (1, 0): 23, (2, 0): 4, (3, 0): 11, (4, 0): 21})
defaultdict(<class 'int'>, {(1, 0): 23, (1, 3): 31, (2, 0): 32, (2, 3): 17, (4, 0): 40, (4, 3): 37, (1, 2): 38, (1, 4): 26, (2, 1): 28, (3, 0): 28, (3, 1): 24, (4, 1): 41, (3, 2): 20, (3, 4): 15, (2, 4): 7, (0, 1): 20, (0, 3): 32, (4, 2): 32, (0, 2): 39, (0, 4): 22})
defaultdict(<class 'int'>, {1: 46, 2: 36, 4: 61, 3: 39, 0: 43})

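With these counts we can compute the support and confidence of each rule, which requires two more dictionaries: one for support and one for confidence.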

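# Compute support and confidence and store them in dictionaries
support = valid_rules # support: number of times the rule holds
confidence = defaultdict(float) # confidence: how often the rule holds when its premise occurs
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurences[premise]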
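
As a quick worked check of the confidence formula, the sketch below plugs in the counts for rule (0, 1) taken from the three dictionaries printed earlier. The variable names are chosen for illustration only.

```python
# Counts for rule (0, 1), copied from the dictionaries printed earlier.
valid = 23          # valid_rules[(0, 1)]
invalid = 20        # invalid_rules[(0, 1)]
occurrences = 43    # num_occurences[0]

# Every occurrence of the premise is either a valid or an invalid case.
assert valid + invalid == occurrences

confidence_01 = valid / occurrences   # confidence = valid count / premise count
print("{0:.3f}".format(confidence_01))
```

Note that support in this article is a raw count of samples in which the rule holds, not a fraction of the dataset.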

Printing support and then confidence gives:

defaultdict(<class 'int'>, {(1, 2): 8, (1, 4): 20, (2, 1): 8, (2, 4): 29, (4, 1): 20, (4, 2): 29, (2, 3): 19, (3, 2): 19, (3, 4): 24, (4, 3): 24, (1, 3): 15, (3, 1): 15, (0, 1): 23, (0, 2): 4, (0, 3): 11, (0, 4): 21, (1, 0): 23, (2, 0): 4, (3, 0): 11, (4, 0): 21})
defaultdict(<class 'float'>, {(1, 2): 0.17391304347826086, (1, 4): 0.43478260869565216, (2, 1): 0.2222222222222222, (2, 4): 0.8055555555555556, (4, 1): 0.32786885245901637, (4, 2): 0.47540983606557374, (2, 3): 0.5277777777777778, (3, 2): 0.48717948717948717, (3, 4): 0.6153846153846154, (4, 3): 0.39344262295081966, (1, 3): 0.32608695652173914, (3, 1): 0.38461538461538464, (0, 1): 0.5348837209302325, (0, 2): 0.09302325581395349, (0, 3): 0.2558139534883721, (0, 4): 0.4883720930232558, (1, 0): 0.5, (2, 0): 0.1111111111111111, (3, 0): 0.28205128205128205, (4, 0): 0.3442622950819672})

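With these dictionaries we can look up the support and confidence of any rule.

First, define a list of product names so that the output is easy to read.

Then define a function that prints a rule together with its support and confidence.

The code is as follows: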

features = ["bread", "milk", "cheese", "apples", "bananas"]

def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    rule = (premise, conclusion)
    print("Rule: if a customer buys {0}, they will also buy {1}".format(premise_name, conclusion_name))
    print("Support: {0}".format(support[rule]))
    print("Confidence: {0:.3f}".format(confidence[rule]))

if __name__ == '__main__':
    premise = 0
    conclusion = 1
    print_rule(premise, conclusion, support, confidence, features)

This looks up the support and confidence of the rule "customers who bought bread also bought milk". The result:

Rule: if a customer buys bread, they will also buy milk
Support: 23
Confidence: 0.535
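
Since any rule can be looked up the same way, one possible extension is to list every rule that shares a premise. The sketch below does this for premise 0 (bread), using counts copied from the dictionaries printed earlier. The helper describe_rule is hypothetical and not part of the article's code, and the product names are written in English here.

```python
features = ["bread", "milk", "cheese", "apples", "bananas"]

# Counts for premise 0 (bread), copied from the printed dictionaries.
support = {(0, 1): 23, (0, 2): 4, (0, 3): 11, (0, 4): 21}
num_occurences = {0: 43}
confidence = {rule: count / num_occurences[rule[0]]
              for rule, count in support.items()}

def describe_rule(rule):
    """Hypothetical helper: one-line summary of a (premise, conclusion) rule."""
    premise, conclusion = rule
    return "{0} -> {1}: support={2}, confidence={3:.3f}".format(
        features[premise], features[conclusion], support[rule], confidence[rule])

lines = [describe_rule(rule) for rule in support]
print("\n".join(lines))
```

Returning a string instead of printing inside the helper makes the output easier to reuse or test. The rules appear in insertion order, since no sorting is applied.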

Complete code:

# coding: utf-8
from collections import defaultdict

import numpy as np

# Name of the dataset file
dataset_filename = "affinity_dataset.txt"
# Load the dataset
X = np.loadtxt(dataset_filename)
n_samples, n_features = X.shape

# Goal: find rules of the form "a customer who bought X may also want to buy Y"
# Rule quality: support (number of times the rule holds) and confidence (how often it holds)
# A rule consists of two parts: a premise and a conclusion


# Dictionaries holding the valid and invalid counts of each rule
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)   # number of samples in which each premise occurs

# A key such as (1, 2) stands for the rule milk -> cheese; the value is a count
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue  # premise: the customer bought this product
        num_occurences[premise] += 1       # premise holds, so count one more occurrence of it
        for conclusion in range(n_features):  # conclusion: what else was bought given the premise
            if premise == conclusion:      # skip rules whose premise and conclusion are the same
                continue
            if sample[conclusion] == 1:    # rule holds: count it in the valid-rules dictionary
                valid_rules[(premise, conclusion)] += 1
            else:                          # rule fails: count it in the invalid-rules dictionary
                invalid_rules[(premise, conclusion)] += 1

# Compute support and confidence and store them in dictionaries
support = valid_rules # support: number of times the rule holds
confidence = defaultdict(float) # confidence: how often the rule holds when its premise occurs
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurences[premise]

features = ["bread", "milk", "cheese", "apples", "bananas"]

def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    rule = (premise, conclusion)
    print("Rule: if a customer buys {0}, they will also buy {1}".format(premise_name, conclusion_name))
    print("Support: {0}".format(support[rule]))
    print("Confidence: {0:.3f}".format(confidence[rule]))


if __name__ == '__main__':
    premise = 0
    conclusion = 1
    print_rule(premise, conclusion, support, confidence, features)
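
As a cross-check on the nested loops, the same counts can also be obtained with a matrix product. This is an alternative technique, not the method used in this article. The sketch assumes X contains only 0s and 1s and that every product was bought at least once, and it uses a made-up toy matrix rather than affinity_dataset.txt.

```python
import numpy as np

# Hypothetical toy data; in the article X is loaded from affinity_dataset.txt.
X = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
])

# co_counts[i, j] = number of baskets containing both product i and product j
co_counts = X.T @ X
# The diagonal holds the number of baskets containing product i (premise count)
occurrences = np.diag(co_counts)
# Divide row i by the occurrences of premise i to get confidence of rule (i, j)
confidence = co_counts / occurrences[:, None]
# premise == conclusion is not a rule, so blank out the diagonal
np.fill_diagonal(confidence, np.nan)

print(co_counts)
print(confidence)
```

Here co_counts[i, j] plays the role of valid_rules[(i, j)] and occurrences[i] the role of num_occurences[i], so the two approaches should agree on the same data.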

We now have the support and confidence of every rule. The next post will cover how to sort the rules and pick out the best ones.