Python Data Mining

阿新 • Published: 2018-05-22


Data mining aims to let computers make decisions based on existing data.

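The first step in data mining is usually to create a dataset that describes some aspect of the real world. A dataset has two main parts:

1. Samples, which represent objects in the real world.
2. Features, which describe the samples in the dataset.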
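As a small illustration of samples and features, the sketch below stores a made-up dataset as rows (samples) and columns (features). The item names and values are hypothetical and are not taken from the MovieLens data used later.

```python
# Hypothetical toy dataset: each row is one sample (a customer's basket),
# each column is one feature (1 if the item was bought, 0 otherwise).
features = ["bread", "milk", "cheese"]
samples = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]

n_samples = len(samples)
n_features = len(features)

# Per-feature totals: how many samples have each feature set.
totals = {name: sum(row[i] for row in samples) for i, name in enumerate(features)}

print("samples:", n_samples, "features:", n_features)
print(totals)
```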

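The next step is to tune the algorithm. Every data mining algorithm has parameters, either built into the algorithm itself or supplied by the user, and these parameters affect the decisions the algorithm makes.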

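There are several ways to measure how good a rule is; the most common are support and confidence.

Support is the number of times a rule holds in the dataset, so it measures how often the rule applies. Confidence measures how accurate the rule is: among all cases that satisfy the premise (the "if" part of the rule), it is the proportion whose outcome matches the rule's conclusion. To compute it, count how many times the rule holds, then divide by the number of cases that share the same premise.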
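The two measures can be sketched in a few lines. The transactions and the helper name `support_and_confidence` below are hypothetical and only illustrate the definitions above; support is reported as a raw count.

```python
# Hypothetical toy transactions; each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "cheese"},
    {"bread", "cheese"},
    {"milk"},
    {"bread", "milk"},
]


def support_and_confidence(transactions, premise, conclusion):
    """Return (support, confidence) for the rule: if premise then conclusion."""
    premise = set(premise)
    # Number of transactions in which the premise applies.
    premise_count = sum(1 for t in transactions if premise <= t)
    # Number of transactions in which the rule holds (premise and conclusion both present).
    rule_count = sum(1 for t in transactions if premise <= t and conclusion in t)
    # Guard against a premise that never occurs.
    confidence = rule_count / premise_count if premise_count else 0.0
    return rule_count, confidence


support, confidence = support_and_confidence(transactions, {"bread"}, "milk")
print("support = {}, confidence = {:.2f}".format(support, confidence))
```

Here the premise "bread" appears in four baskets and "milk" accompanies it in three of them, so the rule has a support of 3 and a confidence of 0.75.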

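Affinity analysis determines how closely related samples (objects) are, based on how similar they are to one another.

Frequent itemsets are formed by taking the items that appear frequently in the dataset and selecting those that appear together.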
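One way to picture this, as a rough sketch on hypothetical baskets: first keep the items that are frequent on their own, then count how often pairs of those items occur together. The threshold `min_support = 2` is an assumption chosen for the toy data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "cheese"},
    {"bread", "cheese"},
    {"milk"},
    {"bread", "milk"},
]
min_support = 2  # assumed threshold for this toy example

# Count individual items and keep only the frequent ones.
item_counts = Counter(item for basket in baskets for item in basket)
frequent_items = {item for item, n in item_counts.items() if n >= min_support}

# Count co-occurring pairs among the frequent items, once per basket.
pair_counts = Counter()
for basket in baskets:
    kept = sorted(basket & frequent_items)
    for pair in combinations(kept, 2):
        pair_counts[frozenset(pair)] += 1

frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
for pair, n in frequent_pairs.items():
    print(sorted(pair), n)
```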

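The Apriori algorithm

Apriori is part of affinity analysis and is designed specifically to find frequent itemsets in a dataset. The basic procedure is to build new candidate itemsets from the frequent itemsets found in the previous step and test whether those candidates are frequent enough. The algorithm then iterates as follows.

(1) Put each item into an itemset that contains only itself to create the initial frequent itemsets. Use only items that reach the minimum support.

(2) Find supersets of the existing frequent itemsets to discover new frequent itemsets, and use them to generate new candidate itemsets.

(3) Test how frequent the new candidate itemsets are and discard those that are not frequent enough. If there are no new frequent itemsets, jump to the last step.

(4) Store the newly discovered frequent itemsets and go back to step (2).

(5) Return all the frequent itemsets that were found.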
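The five steps above can be sketched on a toy example as follows. This is a minimal illustration under stated assumptions, not the article's implementation: the function name `apriori`, the transactions, and the threshold are hypothetical, and candidates are collected in a set per transaction so that each transaction adds at most 1 to a candidate's count.

```python
from collections import defaultdict


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least min_support transactions."""
    # Step 1: single-item itemsets that reach the minimum support.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset((item,))] += 1
    current = {s: n for s, n in counts.items() if n >= min_support}
    all_frequent = dict(current)

    while current:
        # Step 2: build candidate supersets by adding one item to a frequent itemset.
        counts = defaultdict(int)
        for t in transactions:
            candidates = set()
            for itemset in current:
                if itemset <= t:
                    for other in t - itemset:
                        candidates.add(itemset | {other})
            # Each transaction contributes at most once per candidate.
            for candidate in candidates:
                counts[candidate] += 1
        # Step 3: discard candidates that are not frequent enough.
        current = {s: n for s, n in counts.items() if n >= min_support}
        # Step 4: store the new frequent itemsets, then repeat from step 2.
        all_frequent.update(current)
    # Step 5: return everything that was found.
    return all_frequent


# Hypothetical toy transactions.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
result = apriori(transactions, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum support of 3, every single item and every pair is frequent in this toy data, while the triple {a, b, c} appears only twice and is discarded, which ends the loop.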


Recommending movies with affinity analysis

# Author:song
# coding: utf-8
import os
import pandas as pd
import sys
from collections import defaultdict

ratings_filename = os.path.join(os.getcwd(), "Data","ml-100k","u.data")
# Load the dataset: use tab as the delimiter, tell pandas not to treat the first row
# as a header (header=None), and name the columns
all_ratings = pd.read_csv(ratings_filename, delimiter='\t',
                          header=None, names=['UserID', 'MovieID', 'Rating', 'Datetime'])
all_ratings["Datetime"] = pd.to_datetime(all_ratings['Datetime'], unit='s')  # parse the timestamps

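# Apriori implementation. The rules have the form: if a user likes certain movies, they will also like this movie.
# As an extension of that rule, we also look at whether users who like several movies like another movie.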

all_ratings["Favorable"] = all_ratings["Rating"] > 3  # new feature Favorable: whether the user liked the movie
ratings = all_ratings[all_ratings['UserID'].isin(range(200))]  # use a subset as the training set to shrink the search space and speed up Apriori
favorable_ratings = ratings[ratings["Favorable"]]  # only the rows in which a user liked a movie

# Which movies each user likes: group by UserID and collect the movies in each group
favorable_reviews_by_users = dict((k, frozenset(v.values))
                                  for k, v in favorable_ratings.groupby('UserID')['MovieID'])
num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()  # number of fans of each movie
# print(num_favorable_by_movie.sort_values("Favorable", ascending=False)[:5])
frequent_itemsets = {}
min_support = 50  # minimum support
# For each movie, build an itemset that contains only that movie and keep it if it is frequent enough.
# Movie IDs are stored in a frozenset.
frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"])
                            for movie_id, row in num_favorable_by_movie.iterrows()
                            if row["Favorable"] > min_support)

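# Take the frequent itemsets found so far, build their supersets, and test how frequent they are
def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():  # every user and their favorable reviews
        for itemset in k_1_itemsets:  # is this previously found itemset a subset of the user's reviews?
            if itemset.issubset(reviews):  # if so, the user has reviewed every movie in the itemset
                # For each movie the user reviewed that is not in the itemset, build a superset and update its count
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    # Note: a superset can be reached from several of its subsets,
                    # so one user may add more than 1 to the same superset's count
                    counts[current_superset] += 1
    # Keep only the itemsets that reach the minimum support
    return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])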
for k in range(2, 20):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k-1], min_support)
    frequent_itemsets[k] = cur_frequent_itemsets
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("I found {} frequent itemsets of length{}".format(len(cur_frequent_itemsets), k))
        sys.stdout.flush()

del frequent_itemsets[1]  # drop the itemsets of length 1


# Go through the frequent itemsets of each length and generate rules from every itemset
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
print(candidate_rules[:5])


# Two dictionaries that count how often each rule holds (correct) and how often it fails (incorrect)
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)

for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1

# Confidence of each rule: the number of times the rule holds divided by the number of times its premise occurs
rule_confidence = {candidate_rule: correct_counts[candidate_rule]/
                   float(correct_counts[candidate_rule] +incorrect_counts[candidate_rule])
                   for candidate_rule in candidate_rules}

from operator import itemgetter
sorted_confidence = sorted(rule_confidence.items(),key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    print("Rule: If a person recommends {0} they will alsorecommend {1}".format(premise, conclusion))
    print(" - Confidence:{0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")


# Load the movie titles from the u.item file with pandas
movie_name_filename = os.path.join(os.getcwd(), "Data","ml-100k","u.item")
movie_name_data = pd.read_csv(movie_name_filename, delimiter="|",header=None, encoding = "mac-roman")
movie_name_data.columns = ["MovieID", "Title", "Release Date",
                           "Video Release", "IMDB", "<UNK>",
                           "Action", "Adventure", "Animation",
                           "Children's", "Comedy", "Crime",
                           "Documentary","Drama", "Fantasy",
                           "Film-Noir","Horror", "Musical",
                           "Mystery", "Romance", "Sci-Fi",
                           "Thriller","War", "Western"]

def get_movie_name(movie_id):  # look up a movie's title by its ID
    title_object = movie_name_data[movie_name_data["MovieID"] == movie_id]["Title"]
    title = title_object.values[0]
    return title

for index in range(5):  # print the rules with movie titles
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they willalso recommend {1}".format(premise_names, conclusion_name))
    print(" - Confidence: {0:.3f}".format(rule_confidence[(premise,conclusion)]))
    print("")


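# Evaluation: count how often each rule holds, exactly as before.
# The only difference is that the test data is used instead of the training data.
test_dataset = all_ratings[~all_ratings["UserID"].isin(range(200))]  # everything outside the training set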
test_favorable = test_dataset[test_dataset["Favorable"]]
test_favorable_by_users = dict((k, frozenset(v.values)) for k, v in test_favorable.groupby("UserID")["MovieID"])
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise,conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1

test_confidence = {candidate_rule: correct_counts[candidate_rule] /
                   float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in rule_confidence}

for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they will alsorecommend {1}".format(premise_names, conclusion_name))
    print(" - Train Confidence:{0:.3f}".format(rule_confidence.get((premise, conclusion), -1)))
    print(" - Test Confidence:{0:.3f}".format(test_confidence.get((premise, conclusion),-1)))
    print("")

Output

I found 93 frequent itemsets of length2
I found 295 frequent itemsets of length3
I found 593 frequent itemsets of length4
I found 785 frequent itemsets of length5
I found 677 frequent itemsets of length6
I found 373 frequent itemsets of length7
I found 126 frequent itemsets of length8
I found 24 frequent itemsets of length9
I found 2 frequent itemsets of length10
Did not find any frequent itemsets of length 11
[(frozenset({79}), 258), (frozenset({258}), 79), (frozenset({50}), 64), (frozenset({64}), 50), (frozenset({127}), 181)]
Rule #1
Rule: If a person recommends frozenset({98, 172, 127, 174, 7}) they will alsorecommend 64
 - Confidence:1.000

Rule #2
Rule: If a person recommends frozenset({56, 1, 64, 127}) they will alsorecommend 98
 - Confidence:1.000

Rule #3
Rule: If a person recommends frozenset({64, 100, 181, 174, 79}) they will alsorecommend 56
 - Confidence:1.000

Rule #4
Rule: If a person recommends frozenset({56, 100, 181, 174, 127}) they will alsorecommend 50
 - Confidence:1.000

Rule #5
Rule: If a person recommends frozenset({98, 100, 172, 79, 50, 56}) they will alsorecommend 7
 - Confidence:1.000

Rule #1
Rule: If a person recommends Silence of the Lambs, The (1991), Empire Strikes Back, The (1980), Godfather, The (1972), Raiders of the Lost Ark (1981), Twelve Monkeys (1995) they willalso recommend Shawshank Redemption, The (1994)
 - Confidence: 1.000

Rule #2
Rule: If a person recommends Pulp Fiction (1994), Toy Story (1995), Shawshank Redemption, The (1994), Godfather, The (1972) they willalso recommend Silence of the Lambs, The (1991)
 - Confidence: 1.000

Rule #3
Rule: If a person recommends Shawshank Redemption, The (1994), Fargo (1996), Return of the Jedi (1983), Raiders of the Lost Ark (1981), Fugitive, The (1993) they willalso recommend Pulp Fiction (1994)
 - Confidence: 1.000

Rule #4
Rule: If a person recommends Pulp Fiction (1994), Fargo (1996), Return of the Jedi (1983), Raiders of the Lost Ark (1981), Godfather, The (1972) they willalso recommend Star Wars (1977)
 - Confidence: 1.000

Rule #5
Rule: If a person recommends Silence of the Lambs, The (1991), Fargo (1996), Empire Strikes Back, The (1980), Fugitive, The (1993), Star Wars (1977), Pulp Fiction (1994) they willalso recommend Twelve Monkeys (1995)
 - Confidence: 1.000

Rule #1
Rule: If a person recommends Silence of the Lambs, The (1991), Empire Strikes Back, The (1980), Godfather, The (1972), Raiders of the Lost Ark (1981), Twelve Monkeys (1995) they will alsorecommend Shawshank Redemption, The (1994)
 - Train Confidence:1.000
 - Test Confidence:0.854

Rule #2
Rule: If a person recommends Pulp Fiction (1994), Toy Story (1995), Shawshank Redemption, The (1994), Godfather, The (1972) they will alsorecommend Silence of the Lambs, The (1991)
 - Train Confidence:1.000
 - Test Confidence:0.870

Rule #3
Rule: If a person recommends Shawshank Redemption, The (1994), Fargo (1996), Return of the Jedi (1983), Raiders of the Lost Ark (1981), Fugitive, The (1993) they will alsorecommend Pulp Fiction (1994)
 - Train Confidence:1.000
 - Test Confidence:0.756

Rule #4
Rule: If a person recommends Pulp Fiction (1994), Fargo (1996), Return of the Jedi (1983), Raiders of the Lost Ark (1981), Godfather, The (1972) they will alsorecommend Star Wars (1977)
 - Train Confidence:1.000
 - Test Confidence:0.975

Rule #5
Rule: If a person recommends Silence of the Lambs, The (1991), Fargo (1996), Empire Strikes Back, The (1980), Fugitive, The (1993), Star Wars (1977), Pulp Fiction (1994) they will alsorecommend Twelve Monkeys (1995)
 - Train Confidence:1.000
 - Test Confidence:0.609


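To compare training and test confidence at a glance, the short sketch below summarizes the gap for the five rules. The numbers are copied from the output above; the variable names are only illustrative.

```python
# Confidence values for the top five rules, copied from the output above.
train_confidence = [1.000, 1.000, 1.000, 1.000, 1.000]
test_confidence = [0.854, 0.870, 0.756, 0.975, 0.609]

# Gap between training and test confidence for each rule.
gaps = [train - test for train, test in zip(train_confidence, test_confidence)]

# Average test confidence and the rule that loses the most on the test data.
mean_test = sum(test_confidence) / len(test_confidence)
worst_rule = gaps.index(max(gaps)) + 1  # rules are numbered from 1

print("Mean test confidence: {0:.3f}".format(mean_test))
print("Largest drop: rule #{0} ({1:.3f})".format(worst_rule, max(gaps)))
```

All five rules have a confidence of 1.000 on the training data but lower values on the test data, with rule #5 dropping the most. This suggests that confidence measured only on the training set overstates how well a rule generalizes.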
