
Data Analysis Series: Condensed Highlights (Part 3)


Data Analysis (Part 3)

Before analyzing the UCI data, it helps to review a few decision tree concepts.

  • A recommended blog post on decision trees:
    http://www.cnblogs.com/yonghao/p/5061873.html
  • Basic characteristics of a decision tree (DT)

    • A DT is a supervised learning method, so we need labeled data

    • It runs as a single process, so it is not well suited to very large datasets

    • PS: It works quite well on small, clean datasets

  • Characteristics of the UCI data: UCI credit approval data set

    • 690 data entries, a relatively small dataset

    • 15 attributes, which is quite few

    • Only about 5% of the data has missing values

    • 2 classes

  • Comparing these two lists, we can expect a DT to work well on our dataset
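
To check figures like these on a local copy of the data, a small summary helper can be sketched as below. The function name `summarize` and the three sample lines are assumptions made for illustration (the lines are made up, not real records); it only relies on the file being comma-separated, on "?" marking a missing value, and on the class label sitting in the last column.

```python
def summarize(lines):
    """Return basic size and missing-value statistics for comma-separated rows."""
    rows = [line.strip().split(",") for line in lines if line.strip()]
    n_rows = len(rows)
    # The last column is the class label; the rest are attributes.
    n_attributes = len(rows[0]) - 1 if rows else 0
    # A row counts as incomplete if any of its fields is the "?" marker.
    rows_with_missing = sum(1 for row in rows if "?" in row)
    classes = sorted({row[-1] for row in rows})
    return {
        "rows": n_rows,
        "attributes": n_attributes,
        "missing_fraction": rows_with_missing / n_rows if n_rows else 0.0,
        "classes": classes,
    }


# Three made-up lines in the same layout as the data file (hypothetical values).
sample = [
    "a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,00043,560,+\n",
    "b,?,1.54,u,g,w,v,3.75,t,t,5,t,g,00100,3,+\n",
    "a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,00280,824,-\n",
]
stats = summarize(sample)
print(stats)
```

Passing the lines of the real file (for example `open(path).readlines()`) instead of `sample` would give the corresponding numbers for the full dataset.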

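With that in mind, we can try implementing a decision tree in code. Using the skeleton provided by our instructor Duan (段老師), write your own code following these steps: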

  • Copy and paste your code to function readfile(file_name) under the comment # Your code here.

  • Make sure your input and output match what is described in the docstring

  • Make a minor improvement to handle missing data: use the string "missing" to represent missing values. Note that in the raw file a missing value is given as "?".

  • Implement is_missing(value), class_counts(rows), is_numeric(value) as directed in the docstring
  • Implement the class Determine. This object represents a node of our DT.
    • It takes 2 inputs and has one method.

    • We can think of it as the question we ask at each node.

  • Implement the method partition(rows, question) as described in the docstring
    • Use Determine class to partition data into 2 groups

  • Implement the method gini(rows) as described in the docstring
    • Here is the formula for Gini impurity: $G = 1 - \sum_{i=1}^{n} p_i^2$

      • where n is the number of classes

      • $p_i$ is the percentage of the given class i

  • Implement the method info_gain(left, right, current_uncertainty) as described in the docstring
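    • Here is the formula for Information Gain: $IG = I_{current} - p_{left} \cdot \mathrm{gini}(left) - p_{right} \cdot \mathrm{gini}(right)$

      • where $p_{left} = \frac{|left|}{|left| + |right|}$ and $p_{right} = 1 - p_{left}$

      • $I_{current}$ is current_uncertainty

      • $p_{left}$ is the percentage/probability of the left branch, same story for $p_{right}$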
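
As a quick sanity check of the two formulas, the arithmetic can be worked on a toy list of labels. This is a minimal sketch: the helper names `gini_of_labels` and `gain` and the toy labels are assumptions for illustration, it operates on bare label lists rather than on full rows, and returning 0.0 for an empty branch is a convention chosen here.

```python
from collections import Counter


def gini_of_labels(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0  # convention chosen here for an empty branch
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def gain(left, right):
    """Impurity of the parent minus the size-weighted impurity of both branches."""
    parent = left + right
    p_left = len(left) / len(parent)
    p_right = 1.0 - p_left
    return (
        gini_of_labels(parent)
        - p_left * gini_of_labels(left)
        - p_right * gini_of_labels(right)
    )


parent = ["+", "+", "-", "-"]
print(gini_of_labels(parent))        # 1 - (0.5**2 + 0.5**2) = 0.5
print(gain(["+", "+"], ["-", "-"]))  # perfect split: 0.5 - 0 - 0 = 0.5
print(gain(["+", "-"], ["+", "-"]))  # uninformative split: 0.5 - 0.25 - 0.25 = 0.0
```

A split that separates the classes completely recovers all of the parent's impurity as gain, while a split that leaves each branch as mixed as the parent gains nothing.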

  • My code is as follows, for reference only

    def readfile(file_name):
        """
        This function reads data file and returns structured and cleaned data in a list
        :param file_name: relative path under data folder
        :return: data, in this case it should be a 2-D list of the form
        [[data1_1, data1_2, ...],
        [data2_1, data2_2, ...],
        [data3_1, data3_2, ...],
        ...]

        i.e.
        [['a', 58.67, 4.46, 'u', 'g', 'q', 'h', 3.04, 't', 't', 6.0, 'f', 'g', '00043', 560.0, '+'],
        ['a', 24.5, 0.5, 'u', 'g', 'q', 'h', 1.5, 't', 'f', 0.0, 'f', 'g', '00280', 824.0, '+'],
        ['b', 27.83, 1.54, 'u', 'g', 'w', 'v', 3.75, 't', 't', 5.0, 't', 'g', '00100', 3.0, '+'],
        ...]

        Couple things you should note:
        1. You need to handle missing data. In this case let's use "missing" to represent all missing data
        2. Be careful of data types. For instance,
        "58.67" and "0.2356" should be number and not a string
        "00043" should be string but not a number
        It is OK to treat all numbers as float in this case. (You don't need to worry about differentiating integer and float)
        """
        # Your code here
        output = []
        # Avoid shadowing built-in names (list, str, ...) unless you mean to replace them.
        # "with" closes the file even if parsing raises an exception.
        with open(file_name, 'r') as data_file:
            for line in data_file:
                # strip() drops the newline without cutting a real character
                # when the last line has no trailing newline
                line = line.strip()
                if not line:
                    continue
                row = []
                for substr in line.split(","):
                    if substr == '?':
                        row.append('missing')
                    elif substr.isdigit() and len(substr) > 1 and substr.startswith('0'):
                        # zero-prefixed codes such as "00043" stay strings
                        row.append(substr)
                    else:
                        try:
                            # all other numbers are stored as float
                            row.append(float(substr))
                        except ValueError:
                            row.append(substr)
                output.append(row)
        return output


    def is_missing(value):
        """
        Determines if the given value is a missing data, please refer back to readfile() where we defined what is a "missing" data
        :param value: value to be checked
        :return: boolean (True, False) of whether the input value is the same as our "missing" notation
        """
        return value == 'missing'


    def class_counts(rows):
        """
        Count how many data samples there are for each label
        :param rows: Input is a 2D list in the form of what you have returned in readfile()
        :return: Output is a dictionary/map in the form:
        {"label_1": #count,
        "label_2": #count,
        "label_3": #count,
        ...
        }
        """
        # Hard-coded version: only works for the labels given here ('+', '-').
        # The extended versions below handle arbitrary labels.
        # label_dict = {}
        # count1 = 0
        # count2 = 0
        # # rows is the result returned by readfile
        # for row in rows:
        #     if row[-1] == '+':
        #         count1 += 1
        #     elif row[-1] == '-':
        #         count2 += 1
        # label_dict['+'] = count1
        # label_dict['-'] = count2
        # return label_dict

        # Extended version 1
        # Counts data with any set of labels using two loops. The first loop collects
        # the distinct labels into label_list. The second walks label_list and, for each
        # label, runs a nested loop over all rows to count the rows carrying that label.
        # label_dict = {}
        # label_list = []
        # for row in rows:
        #     label = row[-1]
        #     if label not in label_list:
        #         label_list.append(label)
        #
        # for label_i in label_list:
        #     count_row_i = 0
        #     for row_i in rows:
        #         if label_i == row_i[-1]:
        #             count_row_i += 1
        #     label_dict[label_i] = count_row_i
        # print(label_dict)
        # return label_dict

        # Extended version 2
        # Uses the dictionary's keys to remember which labels have been seen
        # and accumulates the count for each one in a single pass.
        label_dict = {}
        for row in rows:
            label = row[-1]
            if label in label_dict:
                label_dict[label] += 1
            else:
                label_dict[label] = 1
        return label_dict


    def is_numeric(value):
        """
        Test if the input is a number(float/int)
        :param value: Input is a value to be tested
        :return: Boolean (True/False)
        """
        # Your code here
        # An earlier attempt used eval() to turn the string into an expression and
        # checked the type of the result. eval() is not needed here, and it is a
        # security risk on untrusted input.
        # readfile() has already converted numeric fields to numbers and left letters,
        # "missing", and zero-prefixed codes such as "00043" as strings, so a type
        # check is enough. bool is excluded because it is a subclass of int.
        return isinstance(value, (int, float)) and not isinstance(value, bool)


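    class Determine:
        """
        This class performs the comparison. It stores a column index and a value.
        The match method compares numbers or strings.
        It can be thought of as the "question" asked at each node of the decision tree, e.g.:
        Is today's temperature cold or hot?
        Is today's weather sunny, cloudy, or rainy?
        """
        def __init__(self, column, value):
            """
            initial structure of our object
            :param column: column index of our "question"
            :param value: splitting value of our "question"
            """
            self.column = column
            self.value = value

        def match(self, example):
            """
            Compares example data and self.value
            note that you need to determine whether the data asked is numeric or categorical/string
            Be careful for missing data
            :param example: a full row of data
            :return: boolean(True/False) of whether input data is GREATER THAN self.value (numeric) or the SAME AS self.value (string)
            """
            # Your code here. "missing" is a string too, so it needs no special case.
            example_value = example[self.column]
            # The type check on self.value was added because of column 10, whose values
            # are of mixed types (zero-prefixed strings as well as numbers), and a number
            # cannot be compared with a str. The parentheses matter: without them
            # "and" binds tighter than "or" and a float self.value always took this branch.
            if is_numeric(example_value) and isinstance(self.value, (int, float)):
                return example_value > self.value
            else:
                return example_value == self.value

        def __repr__(self):
            """
            Used when printing the tree
            :return: the question as a readable string
            """
            # match() tests "greater than", so the printed condition is ">".
            if is_numeric(self.value):
                condition = ">"
            else:
                condition = "is"
            # header is not defined in this listing; it is assumed to be the list of
            # column names supplied elsewhere (for example by the skeleton).
            return "{} {} {}?".format(
                header[self.column], condition, str(self.value))

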
    def partition(rows, question):
        """
        Split the data: rows that satisfy the question go into true_rows, the rest into false_rows
        :param rows: data set/subset
        :param question: Determine object you implemented above
        :return: 2 lists based on the answer of the question
        """
        # Your code here. question is a Determine object.
        true_rows, false_rows = [], []
        # We iterate over the 2-D list because Determine.match only looks at
        # the value at the given index within a single row.
        for row in rows:
            if question.match(row):
                true_rows.append(row)
            else:
                false_rows.append(row)
        return true_rows, false_rows


    def gini(rows):
        """
        Compute the Gini value of a set of rows, a measure of how mixed the labels are
        :param rows: data set/subset
        :return: the Gini impurity
        """
        data_set_size = len(rows)  # total number of rows
        class_dict = class_counts(rows)
        sum_subgini = 0
        for class_count in class_dict.values():
            sum_subgini += (class_count / data_set_size) ** 2
        # use a name other than "gini" so the function itself is not shadowed
        impurity = 1 - sum_subgini
        return impurity


    def info_gain(left, right, current_uncertainty):
        """
        Compute the information gain
        Please refer to the .md tutorial for details
        :param left: left branch
        :param right: right branch
        :param current_uncertainty: current uncertainty (data)
        """
        p_left = len(left) / (len(left) + len(right))
        p_right = 1 - p_left
        return current_uncertainty - p_left * gini(left) - p_right * gini(right)


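    # Use this data to check the quality of your code
    # A raw string keeps the backslashes in the Windows path literal.
    data = readfile(r"E:\data\crx.data")
    # Column 2 is numeric, so the split value is the number 1.8, not the string '1.8'.
    t, f = partition(data, Determine(2, 1.8))
    print(info_gain(t, f, gini(data)))
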
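
If the data file is not at hand, the same kind of check can be run on a few in-memory rows. The sketch below is self-contained and deliberately simplified: the class name `Question`, the helper `split`, and the toy rows are all assumptions for illustration, not part of the skeleton. It shows the numeric rule (greater than), the categorical rule (equal to), and where a "missing" value ends up.

```python
class Question:
    """Hypothetical stand-in for a tree node's question (column index plus value)."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, row):
        cell = row[self.column]
        both_numeric = isinstance(cell, (int, float)) and isinstance(
            self.value, (int, float)
        )
        # Numbers are compared with ">"; everything else, including "missing", with "==".
        return cell > self.value if both_numeric else cell == self.value


def split(rows, question):
    """Return (rows that match the question, rows that do not)."""
    true_rows = [row for row in rows if question.match(row)]
    false_rows = [row for row in rows if not question.match(row)]
    return true_rows, false_rows


# Toy rows in the form [category, number, label]; the values are made up.
rows = [
    ["a", 2.5, "+"],
    ["b", 0.5, "-"],
    ["a", "missing", "-"],
    ["b", 3.0, "+"],
]

high, low = split(rows, Question(1, 1.8))
print([r[2] for r in high])  # labels of rows whose column 1 is greater than 1.8
print([r[2] for r in low])   # includes the "missing" row, which is not a number

is_a, not_a = split(rows, Question(0, "a"))
print(len(is_a), len(not_a))
```

Sending "missing" to the false branch is simply a consequence of the equality test; other treatments of missing values are possible and are not covered here.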

January 2, 2019
