Data Analysis Series Highlights (Part 3)

阿新 • Published: 2019-01-07


Data Analysis (Part 3)

Before analyzing the UCI data, it helps to understand a few decision tree concepts.

For background, here is a blog post on decision trees:
```
http://www.cnblogs.com/yonghao/p/5061873.html
```
Basic characteristics of a decision tree (DT)
- A DT is a supervised learning method, so it needs labeled data
- It runs as a single process, so it is not well suited to very large datasets
- PS: It works quite well on small, clean datasets
UCI data characteristics: UCI credit approval data set
- 690 data entries, a relatively small dataset
- 15 attributes, which is quite few
- only 5% missing values
- 2 classes
Comparing these two lists, a DT should work well for our dataset

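With that in mind, we can try implementing a decision tree in code. Using the skeleton provided by Mr. Duan (段老師), write your own code following these steps: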

Copy and paste your code into the function readfile(file_name), under the comment # Your code here.
Make sure your input and output match what is described in the docstring.
Make a minor improvement to handle missing data: in this case, use the string "missing" to represent missing data. Note that it appears as "?" in the data file.
Implement is_missing(value), class_counts(rows), is_numeric(value) as directed in the docstring
Implement class Determine. This object represents a node of our DT.
- It has 2 inputs and one method.
- We can think of it as the question we are asking at each node.
Implement the method partition(rows, question) as described in the docstring
- Use the Determine class to partition data into 2 groups
Implement the method gini(rows) as described in the docstring
- Here is the formula for Gini impurity: $G = 1 - \sum_{i=1}^{n} p_i^2$
  - where n is the number of classes
  - $p_i$ is the percentage of the given class i
Implement the method info_gain(left, right, current_uncertainty) as described in the docstring
- Here is the formula for Information Gain: $\text{Gain} = G_{\text{current}} - p_{\text{left}}\, G_{\text{left}} - p_{\text{right}}\, G_{\text{right}}$
  - where $G_{\text{left}}$ and $G_{\text{right}}$ are the Gini impurities of the left and right branches
  - $G_{\text{current}}$ is current_uncertainty
  - $p_{\text{left}}$ is the percentage/probability of the left branch, same story for $p_{\text{right}}$
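
To make the two formulas concrete, here is a small worked example on made-up labels. It is only a sketch: the helper name `gini_of_labels` and the toy lists are assumptions for illustration and are not part of the skeleton.

```python
from collections import Counter


def gini_of_labels(labels):
    """Gini impurity of a list of class labels (hypothetical helper)."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())


# Toy node with 4 '+' and 4 '-' labels, split into two branches of 4.
parent = ['+', '+', '+', '+', '-', '-', '-', '-']
left = ['+', '+', '+', '-']
right = ['+', '-', '-', '-']

p_left = len(left) / len(parent)
p_right = 1 - p_left
gain = (gini_of_labels(parent)
        - p_left * gini_of_labels(left)
        - p_right * gini_of_labels(right))

print(gini_of_labels(parent))  # 0.5
print(gini_of_labels(left))    # 0.375
print(gain)                    # 0.125
```

A perfectly mixed two-class node has impurity 0.5, and the split lowers the weighted impurity to 0.375, so the gain is 0.125.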

My code is as follows, for reference only:

def readfile(file_name):
    """
    This function reads a data file and returns structured and cleaned data in a list
    :param file_name: relative path under data folder
    :return: data, in this case it should be a 2-D list of the form
    [[data1_1, data1_2, ...],
     [data2_1, data2_2, ...],
     [data3_1, data3_2, ...],
     ...]

    e.g.
    [['a', 58.67, 4.46, 'u', 'g', 'q', 'h', 3.04, 't', 't', 6.0, 'f', 'g', '00043', 560.0, '+'],
     ['a', 24.5, 0.5, 'u', 'g', 'q', 'h', 1.5, 't', 'f', 0.0, 'f', 'g', '00280', 824.0, '+'],
     ['b', 27.83, 1.54, 'u', 'g', 'w', 'v', 3.75, 't', 't', 5.0, 't', 'g', '00100', 3.0, '+'],
    ...]

    A couple of things you should note:
    1. You need to handle missing data. In this case let's use "missing" to represent all missing data
    2. Be careful of data types. For instance,
        "58.67" and "0.2356" should be numbers and not strings
        "00043" should be a string and not a number
        It is OK to treat all numbers as float in this case. (You don't need to worry about differentiating integer and float)
    """
    # Your code here
    output = []
    # never use built-in names unless you mean to replace them
    # the with-block closes the file even if parsing fails
    with open(file_name, 'r') as data_file:
        for line in data_file:
            line = line.strip()  # also safe when the last line has no newline
            if not line:
                continue  # skip blank lines
            row = []
            for field in line.split(","):
                if field == '?':
                    row.append('missing')
                elif field.isdigit() and len(field) > 1 and field.startswith('0'):
                    # zero-padded codes such as "00043" stay strings
                    # (note: this rule also keeps values such as "06" as strings)
                    row.append(field)
                else:
                    try:
                        # all numbers are treated as float, as the docstring allows
                        row.append(float(field))
                    except ValueError:
                        row.append(field)
            output.append(row)
    return output


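The per-field conversion rule can be tried in isolation. The sketch below assumes a hypothetical helper `parse_field` and a made-up input line (not taken from crx.data), so it runs without the data file.

```python
def parse_field(field):
    """Convert one raw CSV field (hypothetical helper, not part of the skeleton)."""
    if field == '?':
        return 'missing'          # missing values are given as "?"
    if field.isdigit() and len(field) > 1 and field.startswith('0'):
        return field              # zero-padded codes stay strings
    try:
        return float(field)       # every other number becomes a float
    except ValueError:
        return field              # categorical values stay strings


# Made-up line for illustration only.
raw_line = "b,?,1.54,u,00100,3,+"
row = [parse_field(field) for field in raw_line.split(",")]
print(row)  # ['b', 'missing', 1.54, 'u', '00100', 3.0, '+']
```
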
def is_missing(value):
    """
    Determines if the given value is missing data; refer back to readfile(), where we defined what "missing" data is
    :param value: value to be checked
    :return: boolean (True, False) of whether the input value is the same as our "missing" notation
    """
    return value == 'missing'


def class_counts(rows):
    """
    Count how many data samples there are for each label
    :param rows: Input is a 2D list in the form of what you have returned in readfile()
    :return: Output is a dictionary/map in the form:
    {"label_1": #count,
     "label_2": #count,
     "label_3": #count,
     ...
    }
    """
    # First attempt: hard-coded, so it only works for the current labels ('+', '-').
    # The extended versions below handle data with arbitrary labels.
    # label_dict = {}
    # count1 = 0
    # count2 = 0
    # # rows is the result returned by readfile
    # for row in rows:
    #     if row[-1] == '+':
    #         count1 += 1
    #     elif row[-1] == '-':
    #         count2 += 1
    # label_dict['+'] = count1
    # label_dict['-'] = count2
    # return label_dict

    # Extended version 1
    # Works for any set of labels, using two loops. The first loop collects the distinct labels into label_list.
    # The second loops over label_list and, in a nested loop over all rows, counts the rows whose label matches.
    # label_dict = {}
    # label_list = []
    # for row in rows:
    #     label = row[-1]
    #     if label_list == []:
    #         label_list.append(label)
    #     else:
    #         if label in label_list:
    #             continue
    #         else:
    #             label_list.append(label)
    #
    # for label_i in label_list:
    #     count_row_i = 0
    #     for row_i in rows:
    #         if label_i == row_i[-1]:
    #             count_row_i += 1
    #     label_dict[label_i] = count_row_i
    # print(label_dict)
    # return label_dict
    #

    # Extended version 2
    # Uses the dictionary's keys to record which labels have been seen and accumulates their counts in one pass
    label_dict = {}
    for row in rows:
        label = row[-1]
        if label in label_dict:
            label_dict[label] += 1
        else:
            label_dict[label] = 1
    return label_dict


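As an alternative to the hand-written loop, the standard library's `collections.Counter` does the same counting. This is a sketch, not the skeleton's required solution; the function name `class_counts_counter` and the toy rows are made up for illustration.

```python
from collections import Counter


def class_counts_counter(rows):
    """Count rows per label, assuming the label is the last item of each row."""
    return dict(Counter(row[-1] for row in rows))


# Made-up rows for illustration only.
toy_rows = [['a', 1.0, '+'], ['b', 2.0, '-'], ['a', 3.0, '+']]
print(class_counts_counter(toy_rows))  # {'+': 2, '-': 1}
```
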
def is_numeric(value):
    """
    Test if the input is a number (float/int)
    :param value: Input is a value to be tested
    :return: Boolean (True/False)
    """
    # Your code here
    # An earlier attempt used eval(), which evaluates a string as an expression and returns the result:
    # if type(eval(str(value))) == int or type(eval(str(value))) == float:
    #     return True
    # eval() is not required, and some blog posts note that it carries security risks

    # readfile() has already converted numeric fields to numbers, while letters,
    # "missing" and zero-padded strings such as "00043" remain strings, so a type check is enough
    return isinstance(value, (int, float)) and not isinstance(value, bool)


class Determine:
    """
    This class is used for comparison. It stores a column index and a value;
    the match method compares numbers or strings.
    Think of it as the "question" asked at each node of the decision tree, e.g.:
        Is today's temperature cold or hot?
        Is today's weather sunny, cloudy, or rainy?
    """
    def __init__(self, column, value):
        """
        initial structure of our object
        :param column: column index of our "question"
        :param value: splitting value of our "question"
        """
        self.column = column
        self.value = value

    def match(self, example):
        """
        Compares example data and self.value
        note that you need to determine whether the data asked is numeric or categorical/string
        Be careful with missing data
        :param example: a full row of data
        :return: boolean(True/False) of whether input data is GREATER THAN self.value (numeric) or the SAME AS self.value (string)
        """
        # Your code here. "missing" is a string too, so it needs no special case
        cell = example[self.column]
        # Use > only when both sides are numeric. This check was added for column 10,
        # whose values mix zero-padded strings with numbers; a str cannot be compared with a number.
        if is_numeric(cell) and is_numeric(self.value):
            return cell > self.value
        return cell == self.value

    def __repr__(self):
        """
        Used when printing the tree
        :return: the question as a readable string
        """
        # ">" mirrors the GREATER THAN comparison in match()
        condition = ">" if is_numeric(self.value) else "is"
        # header is not defined in this excerpt; it is expected to hold the column names
        return "{} {} {}?".format(
            header[self.column], condition, str(self.value))


def partition(rows, question):
    """
    Split the data: rows that satisfy the question go into true_rows, the rest into false_rows
    :param rows: data set/subset
    :param question: Determine object you implemented above
    :return: 2 lists based on the answer of the question
    """
    # Your code here. question is a Determine object
    true_rows, false_rows = [], []
    # Iterate over the 2-D list because Determine.match only looks at the value at the given index of a single row
    for row in rows:
        if question.match(row):
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows


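To see how a question splits rows, including a row with missing data, here is a self-contained sketch. `ThresholdQuestion` and `split_rows` are hypothetical stand-ins for Determine and partition so the example runs on its own; the toy rows are made up.

```python
class ThresholdQuestion:
    """Hypothetical stand-in for Determine: '>' for numbers, '==' otherwise."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, example):
        cell = example[self.column]
        both_numeric = (isinstance(cell, (int, float))
                        and isinstance(self.value, (int, float)))
        return cell > self.value if both_numeric else cell == self.value


def split_rows(rows, question):
    """Return (rows that match the question, rows that do not)."""
    true_rows = [row for row in rows if question.match(row)]
    false_rows = [row for row in rows if not question.match(row)]
    return true_rows, false_rows


# Made-up rows; the third one has a missing value in column 1.
toy_rows = [['a', 0.5, '+'], ['b', 2.5, '-'], ['a', 'missing', '+'], ['b', 3.0, '-']]
true_rows, false_rows = split_rows(toy_rows, ThresholdQuestion(1, 1.8))
print(true_rows)   # [['b', 2.5, '-'], ['b', 3.0, '-']]
print(false_rows)  # [['a', 0.5, '+'], ['a', 'missing', '+']]
```

With this design the row with "missing" falls into the false branch, because a string is never compared with `>` and never equals the numeric threshold.
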
def gini(rows):
    """
    Compute the Gini value of a set of rows, one way of expressing dispersion
    :param rows: data set/subset
    :return: the Gini value (impurity)
    """
    data_set_size = len(rows)    # total number of rows
    class_dict = class_counts(rows)
    sum_subgini = 0
    for class_count in class_dict.values():
        sum_subgini += (class_count / data_set_size) ** 2
    # a separate name avoids shadowing the function gini itself
    impurity = 1 - sum_subgini
    return impurity



def info_gain(left, right, current_uncertainty):
    """
    Compute the information gain
    Please refer to the .md tutorial for details
    :param left: left branch
    :param right: right branch
    :param current_uncertainty: current uncertainty (data)
    """
    p_left = len(left) / (len(left) + len(right))
    p_right = 1 - p_left
    return current_uncertainty - p_left * gini(left) - p_right * gini(right)




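# Use this data to test the quality of your own code
# A raw string keeps the backslashes in the Windows path literal
data = readfile(r"E:\data\crx.data")
# Column 2 holds numbers, so the splitting value must be a number, not the string '1.8'
t, f = partition(data, Determine(2, 1.8))
print(info_gain(t, f, gini(data)))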

January 2, 2019
