Data Analysis Series: Condensed Highlights (Part 3)
Data Analysis (Part 3)
Before analyzing the UCI data, it helps to review a few decision tree concepts.
- A recommended blog post on decision trees: http://www.cnblogs.com/yonghao/p/5061873.html
- Basic characteristics of a decision tree (DT)
  - A DT is a supervised learning method, so it needs labeled data
  - It runs as a single process, so it is not well suited to very large datasets
  - PS: it works quite well on small, clean datasets
- Characteristics of the UCI data (UCI Credit Approval data set)
  - 690 data entries, a relatively small dataset
  - 15 attributes, which is quite few
  - Missing values: only 5%
  - 2 classes

Comparing these two lists, we can expect a DT to work well for our dataset.
With that in mind, we can try implementing a decision tree in code. Use the skeleton provided by Teacher Duan (段老師) and write your own code following the steps below.
- Copy and paste your code into the function readfile(file_name), under the comment # Your code here.
- Make sure your input and output match what I described in the docstring
- Make a minor improvement to handle missing data: in this case, let's use the string "missing" to represent missing data. Note that it is given as "?" in the data file
- Implement is_missing(value), class_counts(rows), and is_numeric(value) as directed in the docstring
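As a rough illustration of the last helper, is_numeric(value) could look like the sketch below. It assumes, as in the readfile docstring, that numeric fields have already been parsed into Python numbers, while codes such as "00043" and the "missing" marker stay strings. The skeleton's own docstring may ask for something different.

```python
def is_numeric(value):
    """Return True if value is an int or float (but not a bool)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


print(is_numeric(58.67))      # True
print(is_numeric("00043"))    # False: zero-padded codes are kept as strings
print(is_numeric("missing"))  # False: the missing-data marker is a string
```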
- Implement the class Determine. This object represents a node of our DT.
  - It has 2 inputs and a method.
  - We can think of it as the question we ask at each node.
- Implement the method partition(rows, question) as described in the docstring
  - Use the Determine class to partition the data into 2 groups
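The two steps above, a question object and a partition function that uses it, might be sketched as follows. The attribute names column and value, the method name match, and the matching rule (>= for numbers, == for everything else, so a "missing" entry never matches a numeric question) are assumptions made for illustration; the skeleton's docstrings are the authority on the real interface.

```python
def is_numeric(value):
    """Helper repeated here so the sketch is self-contained."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


class Determine:
    """A question asked at a node: compare one column against one value."""

    def __init__(self, column, value):
        self.column = column  # index of the attribute to test (assumed name)
        self.value = value    # threshold or category to compare against

    def match(self, row):
        """Return True if the row answers the question with 'yes'."""
        val = row[self.column]
        if is_numeric(val) and is_numeric(self.value):
            return val >= self.value
        # Categorical values, and 'missing' entries, fall back to equality.
        return val == self.value


def partition(rows, question):
    """Split rows into (true_rows, false_rows) according to the question."""
    true_rows, false_rows = [], []
    for row in rows:
        if question.match(row):
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows


# Tiny made-up rows: [category, number, label]
sample_rows = [["a", 58.67, "+"], ["b", 24.5, "-"], ["a", "missing", "+"]]
true_rows, false_rows = partition(sample_rows, Determine(1, 30.0))
print(true_rows)   # rows whose second column is >= 30.0
print(false_rows)  # everything else, including the row with 'missing'
```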
- Implement the method gini(rows) as described in the docstring
  - Here is the formula for Gini impurity:
    $$\mathrm{Gini} = 1 - \sum_{i=1}^{n} p_i^{2}$$
    where
    - $n$ is the number of classes
    - $p_i$ is the percentage of the given class $i$
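Gini impurity is one minus the sum of the squared class proportions. A minimal sketch, assuming the class label is the last element of each row (as in the readfile output) and that an empty list has impurity 0:

```python
from collections import Counter


def gini(rows):
    """Gini impurity of a list of rows whose last element is the label."""
    total = len(rows)
    if total == 0:
        return 0.0  # assumption: an empty group is treated as pure
    counts = Counter(row[-1] for row in rows)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


print(gini([["+"], ["+"]]))  # 0.0: only one class, no impurity
print(gini([["+"], ["-"]]))  # 0.5: two classes, evenly mixed
```

For a two-class problem such as this one, the value ranges from 0 (pure) to 0.5 (evenly mixed).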
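- Implement the method info_gain(left, right, current_uncertainty) as described in the docstring
  - Here is the formula for Information Gain:
    $$\mathrm{Gain} = \mathrm{Gini}_{\text{current}} - \bigl(p_{\text{left}}\,\mathrm{Gini}(\text{left}) + p_{\text{right}}\,\mathrm{Gini}(\text{right})\bigr)$$
    where
    - $\mathrm{Gini}_{\text{current}}$ is current_uncertainty
    - $p_{\text{left}}$ is the percentage/probability of the left branch, and likewise $p_{\text{right}}$ for the right branch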
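Information gain is the current uncertainty minus the size-weighted Gini impurity of the two branches. The sketch below is one way to write it, again assuming the label is the last element of each row and that at least one of the two branches is non-empty. The gini helper is repeated so the block runs on its own.

```python
from collections import Counter


def gini(rows):
    """Gini impurity of rows whose last element is the class label."""
    total = len(rows)
    if total == 0:
        return 0.0
    counts = Counter(row[-1] for row in rows)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def info_gain(left, right, current_uncertainty):
    """Uncertainty of the parent minus the weighted impurity of the children."""
    # Assumes len(left) + len(right) > 0.
    p_left = len(left) / (len(left) + len(right))
    p_right = 1.0 - p_left
    return current_uncertainty - p_left * gini(left) - p_right * gini(right)


parent = [["+"], ["+"], ["-"], ["-"]]
current = gini(parent)  # 0.5 for an even two-class mix

# A split that separates the classes perfectly recovers all the uncertainty.
print(info_gain([["+"], ["+"]], [["-"], ["-"]], current))  # 0.5
# A split that leaves both branches evenly mixed gains nothing.
print(info_gain([["+"], ["-"]], [["+"], ["-"]], current))  # 0.0
```

When building the tree, the question with the highest gain at a node is the one worth splitting on.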
My code is as follows, for reference only:
def readfile(file_name):
    """
    This function reads a data file and returns structured and cleaned data in a list
    :param file_name: relative path under data folder
    :return: data, in this case it should be a 2-D list of the form
    [[data1_1, data1_2, ...],
     [data2_1, data2_2, ...],
     [data3_1, data3_2, ...],
     ...]
    i.e.
    [['a', 58.67, 4.46, 'u', 'g', 'q', 'h', 3.04, 't', 't', 6.0, 'f', 'g', '00043', 560.0, '+'],
     ['a', 24.5, 0.5, 'u', 'g', 'q', 'h', 1.5, 't', 'f', 0.0, 'f', 'g', '00280', 824.0, '+'],
     ['b', 27.83, 1.54, 'u', 'g', 'w', 'v', 3.75, 't', 't', 5.0, 't', 'g', '00100', 3.0, '+'],
     ...]
    A couple of things you should note:
    1. You need to handle missing data. In this case let's use "missing" to represent all missing data
    2. Be careful of data types. For instance,
       "58.67" and "0.2356" should be numbers and not strings
       "00043" should be a string and not a number
       It is OK to treat all numbers as float in this case. (You don't need to worry about differentiating integer and float)
    """
    # Your code here
    output = []
    # never use built-in names (list, str, ...) unless you mean to replace them
    # "with" closes the file even if parsing fails part-way through
    with open(file_name, 'r') as data_file:
        for line in data_file:
            # rstrip only removes the newline; slicing with [:-1] would cut a
            # real character off a final line that has no trailing newline
            line = line.rstrip('\n')
            if not line:
                continue  # skip blank lines
            data = []
            for substr in line.split(','):
                if substr.isdigit() and len(substr) > 1 and substr.startswith('0'):
                    # zero-padded codes such as '00043' stay strings
                    data.append(substr)
                    continue
                try:
                    # all other numbers become floats, as the docstring allows
                    data.append(float(substr))
                except ValueError:
                    # '?' marks missing data; other text is kept as-is
                    data.append('missing' if substr == '?' else substr)
            output.append(data)
    return output


def is_missing(value):
    """
    Determines if the given value is missing data; please refer back to readfile() where we defined what "missing" data is
    :param value: value to be checked
    :return: boolean (True, False) of whether the input value is the same as our "missing" notation
    """
    return value == 'missing'


def class_counts(rows):
    """
    Count how many data samples there are for each label
    :param rows: Input is a 2D list in the form of what you have returned in readfile()
    :return: Output is a dictionary/map in the form:
    {"label_1": #count,
     "label_2": #count,
     "label_3": #count,
     ...
    }
    """
    # This first version is hard-coded: it only counts the labels given in the
    # current data ('+' and '-'). The versions below extend it so that data
    # with arbitrary labels can be counted.
    # label_dict = {}
    # count1 = 0
    # count2 = 0
    # # rows is the result returned by readfile()
    # for row in rows:
    #     if row[-1] == '+':
    #         count1 += 1
    #     elif row[-1] == '-':
    #         count2 += 1
    # label_dict['+'] = count1
    # label_dict['-'] = count2
    # return label_dict

    # Extended version 1
    # This version can count data with any set of labels. It uses two loops.
    # The first loop collects every distinct label present in the data into a
    # list, lable_list. The second loop iterates over the labels in lable_list
    # and, importantly, nests a loop over all rows inside it, counting how many
    # rows carry the same label as the current entry of lable_list.
    # label_dict = {}