Data Mining Methods and Practice with R (3): Decision Tree Analysis
Decision trees are built for two purposes: exploration and prediction. For exploration, the data used to grow the tree are the training data; once the tree has grown, it reveals the information hidden in those data. For prediction, the rules derived from the tree can be used to predict future data. Because we must consider how well the model classifies future data, after building the tree on the training data we use test data to measure the model's robustness and classification performance. Through this validation process we arrive at the best classification rules, which are then used to predict future data.
1. Decision Tree Construction Theory
Building a decision tree involves four steps: data preparation, tree growing, tree pruning, and rule extraction.
1.1 Data Preparation
The data used for decision tree analysis contain two kinds of variables: the target variable, determined by the problem itself, and the attribute variables chosen from the problem's background and context to serve as splitting variables. How easily the splitting variables can be understood and interpreted determines the quality of the analysis results. Splitting variables fall into four types:
(1) Binary attributes: the test condition produces two outcomes.
(2) Nominal attributes: the outcomes are represented by the attribute's distinct values; for example, blood type has the four categories A, B, AB, and O.
(3) Ordinal attributes: these can produce binary or multi-way splits; the values may be grouped, but the groups must respect the ordering of the attribute values. For example, age can be grouped into young, middle-aged, and old.
(4) Continuous attributes: the test condition can be expressed as x < a or x >= a. The decision tree must consider every possible split point and then choose the best one.
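To make the split-point search concrete, here is a minimal sketch in R using made-up data; the helper names `entropy` and `best_split` are introduced only for this illustration. It scores every midpoint between adjacent attribute values by information gain and returns the best threshold.

```r
# Sketch: exhaustive search for the best split point of a continuous attribute.
# 'x' is the continuous attribute, 'y' the class label; both are hypothetical data.
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

best_split <- function(x, y) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2      # midpoints between adjacent values
  gains <- sapply(candidates, function(a) {
    left  <- y[x < a]
    right <- y[x >= a]
    entropy(y) - length(left)  / length(y) * entropy(left) -
                 length(right) / length(y) * entropy(right)
  })
  candidates[which.max(gains)]                          # threshold with the largest gain
}

set.seed(1)
x <- c(rnorm(20, 100), rnorm(20, 150))                  # hypothetical attribute values
y <- rep(c("No", "Yes"), each = 20)                     # hypothetical class labels
best_split(x, y)
```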
Once the data have been obtained, they are divided into a training set and a test set. The split can be made as follows.
Data splitting divides the data into a training set, a test set, and a validation set. The training set is used to build the model, the test set to evaluate whether the model is overly complex and how well it generalizes, and the validation set to measure how good the model is, for example its misclassification rate or mean squared error. A good model should still fit unseen data well; if the error on the test data keeps growing as the model becomes more and more complex, the model is overfitting.
The splitting proportions are defined in different ways, but each subset should remain representative of the original data. One approach is to draw 80% of the data as the training set for model building and keep the remaining 20% for validating the model. Another approach is k-fold cross-validation: the data are divided into k equal parts, k-1 parts are used to train the model and the remaining part to test it, and this is repeated k times so that every record serves in both the training and the test sets; the averaged result then represents the model's validity. This method suits situations with few observations, because it makes full use of the data, but its drawback is the long computation time.
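As an illustration, here is a minimal base-R sketch of k-fold cross-validation for a classification tree, using the rpart package and the Pima.tr data that appear later in this article; k = 5 and the random seed are arbitrary choices.

```r
library(rpart)   # CART trees
library(MASS)    # Pima.tr data
data("Pima.tr")

k <- 5
set.seed(1)
folds <- sample(rep(1:k, length.out = nrow(Pima.tr)))   # randomly assign each record to a fold

err <- sapply(1:k, function(i) {
  train <- Pima.tr[folds != i, ]                        # k-1 folds for training
  test  <- Pima.tr[folds == i, ]                        # 1 fold held out for testing
  fit   <- rpart(type ~ ., data = train)
  pred  <- predict(fit, test, type = "class")
  mean(pred != test$type)                               # misclassification rate on the held-out fold
})
mean(err)                                               # average error estimates the model's validity
```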
During tree construction, if a model has a very low error rate on the training data but a very high error rate on the test data, the tree is overfitting and cannot be used to score other data. Therefore, after building the tree, it should be pruned appropriately according to its classification performance on the test data, in order to improve the accuracy of its classifications or predictions and avoid overfitting.
1.2 Splitting Criteria
The splitting criterion determines the size of the tree, both its width and its depth. Common criteria include information gain, the Gini index, the chi-square statistic, and the information gain ratio.
Suppose the training data set has k classes C1, C2, …, Ck, and attribute A has l distinct values A1, A2, …, Al. The counts can be cross-tabulated as follows:
| Attribute \ Class | C1 | C2 | … | Ck | Total |
|---|---|---|---|---|---|
| A1 | x11 | x12 | … | x1k | x1. |
| A2 | x21 | x22 | … | x2k | x2. |
| … | … | … | … | … | … |
| Al | xl1 | xl2 | … | xlk | xl. |
| Total | x.1 | x.2 | … | x.k | N |
(1) Information gain
Information gain measures, from the likelihoods or probabilities of the classes, how much information the data carry under different splitting conditions.
Let x.j denote the number of records in class Cj and N the total number of records, so that the probability of class Cj is pj = x.j/N. From information theory, the information contributed by a class is -log2(pj), so the total information Info(D) contributed by the classes C1, C2, …, Ck is:
Info(D)= - (x.1/N)*log2(x.1/N) - (x.2/N)*log2(x.2/N) - … - (x.k/N)*log2(x.k/N)
Info(D) is also called the entropy and is commonly used to measure how mixed the data are. When all classes are equally likely, Info(D) reaches its maximum of log2(k) (equal to 1 when there are two classes), meaning the class composition is as complex as possible.
Suppose the data set D is split on attribute A into l subsets Di, where xi. is the total number of records with attribute value Ai and xij is the number of those records that belong to class Cj. The information under attribute value Ai, Info(Ai), is then:
Info(Ai) = - (xi1/xi.)*log2(xi1/xi.) - (xi2/xi.)*log2(xi2/xi.) - … - (xik/xi.)*log2(xik/xi.)
The information of attribute A is then the weighted sum of these values, weighted by the number of records under each attribute value:
InfoA(D)= (x1./N)*Info(A1) + (x2./N)*Info(A2) + … + (xl./N)*Info(Al)
The information gain is the total information of the original data minus the total information after the split, and it expresses how much attribute A contributes as a splitting attribute. Computing this for every candidate attribute and comparing the results identifies the attribute with the best information gain.
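A minimal sketch of this calculation in R, using a hypothetical l x k contingency table (three attribute values, two classes); the function names are introduced only for illustration.

```r
# Entropy of a vector of class counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                                   # skip empty classes so log2(0) never appears
  -sum(p * log2(p))
}

# Information gain of an attribute from its attribute-by-class count table
info_gain <- function(tab) {
  N <- sum(tab)
  info_D <- entropy(colSums(tab))                             # Info(D), from the class totals
  info_A <- sum(rowSums(tab) / N * apply(tab, 1, entropy))    # InfoA(D), weighted by row sizes
  info_D - info_A
}

tab <- matrix(c(30, 10,
                20, 20,
                 5, 15), nrow = 3, byrow = TRUE)  # hypothetical counts: rows A1..A3, columns C1..C2
info_gain(tab)
```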
(2) Gini index
The Gini index measures the impurity of a data set with respect to all classes:
Gini(D) = 1 - p1^2 - p2^2 - … - pk^2
The impurity Gini(Ai) of the records under attribute value Ai is:
Gini(Ai) = 1 - (xi1/xi.)^2 - (xi2/xi.)^2 - … - (xik/xi.)^2
The overall impurity under attribute A is:
GiniA(D)= (x1./N)*Gini(A1) + (x2./N)*Gini(A2) + … + (xl./N)*Gini(Al)
The contribution of attribute A to reducing the impurity is:
deltaGini(A) = Gini(D) - GiniA(D)
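The same hypothetical table can be scored with the Gini criterion; again the helper names are only illustrative.

```r
# Gini impurity of a vector of class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# deltaGini(A): impurity of the whole data set minus the weighted impurity after splitting on A
gini_reduction <- function(tab) {
  N <- sum(tab)
  gini(colSums(tab)) - sum(rowSums(tab) / N * apply(tab, 1, gini))
}

tab <- matrix(c(30, 10, 20, 20, 5, 15), nrow = 3, byrow = TRUE)   # hypothetical counts
gini_reduction(tab)
```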
(3) Chi-square statistic
The chi-square statistic uses a contingency table to measure the dependence between two categorical variables: the larger the sample chi-square statistic, the stronger the dependence between the attribute and the class.
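In R this is simply a chi-square test of independence on the attribute-by-class table; the counts below are the same hypothetical example used above.

```r
tab <- matrix(c(30, 10, 20, 20, 5, 15), nrow = 3, byrow = TRUE)   # hypothetical counts
chisq.test(tab)$statistic   # larger statistic => stronger dependence between attribute and class
```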
(4) Information gain ratio
The information gain ratio takes into account the information carried by the candidate attribute itself and brings it into the tree: the most suitable splitting attribute is found by dividing the information gain by the information content of the splitting attribute.
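A sketch of the gain ratio for the same hypothetical table; here the split information is taken to be the entropy of the attribute's own value distribution.

```r
entropy <- function(counts) {
  p <- counts / sum(counts); p <- p[p > 0]
  -sum(p * log2(p))
}

gain_ratio <- function(tab) {
  N <- sum(tab)
  gain       <- entropy(colSums(tab)) - sum(rowSums(tab) / N * apply(tab, 1, entropy))
  split_info <- entropy(rowSums(tab))          # information carried by the attribute itself
  gain / split_info
}

tab <- matrix(c(30, 10, 20, 20, 5, 15), nrow = 3, byrow = TRUE)   # hypothetical counts
gain_ratio(tab)
```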
(5) Variance reduction
When the target variable is continuous, variance reduction can be used as the splitting criterion.
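A minimal sketch of variance reduction for a continuous target, using made-up data; the weighted within-group variances use R's sample variance, which is close enough for illustration.

```r
# Variance of the target before splitting minus the weighted variance within each branch
variance_reduction <- function(y, g) {
  n <- length(y)
  var(y) - sum(tapply(y, g, function(v) length(v) / n * var(v)))
}

set.seed(1)
y <- c(rnorm(20, mean = 5), rnorm(20, mean = 8))   # hypothetical continuous target
g <- rep(c("A1", "A2"), each = 20)                 # hypothetical attribute values
variance_reduction(y, g)
```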
1.3 Decision Tree Pruning
Decision trees can be pruned before or after growing. Pre-pruning is applied while the tree is growing: thresholds for stopping growth are set in advance, and when a split's evaluation value does not reach the threshold the tree stops growing, for example requiring the information gain to exceed 0.1 or a node to contain a minimum number of observations. Pre-pruning is efficient, but it may prune too aggressively. Post-pruning is less efficient, but it is very effective at curing an overfitted tree.
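With rpart, pre-pruning thresholds of this kind can be set through rpart.control; the specific values below are arbitrary examples for illustration, not recommendations.

```r
library(rpart)
library(MASS)
data("Pima.tr")

pre_pruned <- rpart(type ~ ., data = Pima.tr,
                    control = rpart.control(cp = 0.05,      # a split must improve the fit by at least cp
                                            minsplit = 20,  # a node needs at least 20 observations to be split
                                            maxdepth = 4))  # cap the depth of the tree
```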
1.4 Rule Extraction
Once the tree has been grown and pruned, it can be used to extract the information hidden in the data.
2. Decision Tree Algorithms
| Algorithm | CART | C4.5/C5.0 | CHAID |
|---|---|---|---|
| Data types handled | Discrete, continuous | Discrete, continuous | Discrete |
| Splitting of continuous attributes | Binary only | Unrestricted | Cannot handle |
| Split criterion (categorical dependent variable) | Gini dispersion index | Information gain ratio | Chi-square test |
| Split criterion (continuous dependent variable) | Variance reduction | Variance reduction | Chi-square or F test (must first be converted to a categorical variable) |
| Split type (categorical independent variable) | Binary splits | Multi-way splits | Multi-way splits |
| Split type (continuous independent variable) | Binary splits | Binary splits | Multi-way splits (must first be converted to a categorical variable) |
| Pruning method | Cost-complexity pruning | Error-based pruning | None |
3. Model Evaluation
A decision tree classifier can be evaluated from two angles: (1) objectively, using results on the test data, such as the misclassification rate, to identify the better tree; (2) because the extracted rules vary with the problem, after the objective evaluation a domain expert usually still has to select the most appropriate tree in light of the problem's background.
4. Decision Tree Applications
4.1 CART Decision Trees
Load the packages and the data set:
> library(rpart)
> library(MASS)
> data("Pima.tr")
> str(Pima.tr)
'data.frame': 200 obs. of 8 variables:
$ npreg: int 5 7 5 0 0 5 3 1 3 2 ...
$ glu : int 86 195 77 165 107 97 83 193 142 128 ...
$ bp : int 68 70 82 76 60 76 58 50 80 78 ...
$ skin : int 28 33 41 43 25 27 31 16 15 37 ...
$ bmi : num 30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
$ ped : num 0.364 0.163 0.156 0.259 0.133 ...
$ age : int 24 55 35 26 23 52 25 24 63 31 ...
$ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
The Pima data are already split into two parts: Pima.tr is the training set and Pima.te is the test set.
> # First build the tree without pruning, so the complexity parameter cp is set to 0
> cart_tree1 = rpart(type~., Pima.tr, control = rpart.control(cp = 0))
> summary(cart_tree1)
Call:
rpart(formula = type ~ ., data = Pima.tr, control = rpart.control(cp = 0))
n= 200
CP nsplit rel error xerror xstd
1 0.22058824 0 1.0000000 1.0000000 0.09851844
2 0.16176471 1 0.7794118 0.9852941 0.09816108
3 0.07352941 2 0.6176471 0.8235294 0.09337946
4 0.05882353 3 0.5441176 0.7941176 0.09233140
5 0.01470588 4 0.4852941 0.6176471 0.08470895
6 0.00000000 7 0.4411765 0.7500000 0.09064718
Node number 1: 200 observations, complexity param=0.2205882
predicted class=No expected loss=0.34 P(node) =1
class counts: 132 68
probabilities: 0.660 0.340
left son=2 (109 obs) right son=3 (91 obs)
Primary splits:
glu < 123.5 to the left, improve=19.624700, (0 missing)
age < 28.5 to the left, improve=15.016410, (0 missing)
npreg < 6.5 to the left, improve=10.465630, (0 missing)
bmi < 27.35 to the left, improve= 9.727105, (0 missing)
skin < 22.5 to the left, improve= 8.201159, (0 missing)
Surrogate splits:
age < 30.5 to the left, agree=0.685, adj=0.308, (0 split)
bp < 77 to the left, agree=0.650, adj=0.231, (0 split)
npreg < 6.5 to the left, agree=0.640, adj=0.209, (0 split)
skin < 32.5 to the left, agree=0.635, adj=0.198, (0 split)
bmi < 30.85 to the left, agree=0.575, adj=0.066, (0 split)
Node number 2: 109 observations, complexity param=0.01470588
predicted class=No expected loss=0.1376147 P(node) =0.545
class counts: 94 15
probabilities: 0.862 0.138
left son=4 (74 obs) right son=5 (35 obs)
Primary splits:
age < 28.5 to the left, improve=3.2182780, (0 missing)
npreg < 6.5 to the left, improve=2.4578310, (0 missing)
bmi < 33.5 to the left, improve=1.6403660, (0 missing)
bp < 59 to the left, improve=0.9851960, (0 missing)
skin < 24 to the left, improve=0.8342926, (0 missing)
Surrogate splits:
npreg < 4.5 to the left, agree=0.798, adj=0.371, (0 split)
bp < 77 to the left, agree=0.734, adj=0.171, (0 split)
skin < 36.5 to the left, agree=0.725, adj=0.143, (0 split)
bmi < 38.85 to the left, agree=0.716, adj=0.114, (0 split)
glu < 66 to the right, agree=0.688, adj=0.029, (0 split)
Node number 3: 91 observations, complexity param=0.1617647
predicted class=Yes expected loss=0.4175824 P(node) =0.455
class counts: 38 53
probabilities: 0.418 0.582
left son=6 (35 obs) right son=7 (56 obs)
Primary splits:
ped < 0.3095 to the left, improve=6.528022, (0 missing)
bmi < 28.65 to the left, improve=6.473260, (0 missing)
skin < 19.5 to the left, improve=4.778504, (0 missing)
glu < 166 to the left, improve=4.104532, (0 missing)
age < 39.5 to the left, improve=3.607390, (0 missing)
Surrogate splits:
glu < 126.5 to the left, agree=0.670, adj=0.143, (0 split)
bp < 93 to the right, agree=0.659, adj=0.114, (0 split)
bmi < 27.45 to the left, agree=0.659, adj=0.114, (0 split)
npreg < 9.5 to the right, agree=0.648, adj=0.086, (0 split)
skin < 20.5 to the left, agree=0.637, adj=0.057, (0 split)
Node number 4: 74 observations
predicted class=No expected loss=0.05405405 P(node) =0.37
class counts: 70 4
probabilities: 0.946 0.054
Node number 5: 35 observations, complexity param=0.01470588
predicted class=No expected loss=0.3142857 P(node) =0.175
class counts: 24 11
probabilities: 0.686 0.314
left son=10 (9 obs) right son=11 (26 obs)
Primary splits:
glu < 90 to the left, improve=2.3934070, (0 missing)
bmi < 33.4 to the left, improve=1.3714290, (0 missing)
bp < 68 to the right, improve=0.9657143, (0 missing)
ped < 0.334 to the left, improve=0.9475564, (0 missing)
skin < 39.5 to the right, improve=0.7958592, (0 missing)
Surrogate splits:
ped < 0.1795 to the left, agree=0.8, adj=0.222, (0 split)
Node number 6: 35 observations, complexity param=0.05882353
predicted class=No expected loss=0.3428571 P(node) =0.175
class counts: 23 12
probabilities: 0.657 0.343
left son=12 (27 obs) right son=13 (8 obs)
Primary splits:
glu < 166 to the left, improve=3.438095, (0 missing)
ped < 0.2545 to the right, improve=1.651429, (0 missing)
skin < 25.5 to the left, improve=1.651429, (0 missing)
npreg < 3.5 to the left, improve=1.078618, (0 missing)
bp < 73 to the right, improve=1.078618, (0 missing)
Surrogate splits:
bp < 94.5 to the left, agree=0.8, adj=0.125, (0 split)
Node number 7: 56 observations, complexity param=0.07352941
predicted class=Yes expected loss=0.2678571 P(node) =0.28
class counts: 15 41
probabilities: 0.268 0.732
left son=14 (11 obs) right son=15 (45 obs)
Primary splits:
bmi < 28.65 to the left, improve=5.778427, (0 missing)
age < 39.5 to the left, improve=3.259524, (0 missing)
npreg < 6.5 to the left, improve=2.133215, (0 missing)
ped < 0.8295 to the left, improve=1.746894, (0 missing)
skin < 22 to the left, improve=1.474490, (0 missing)
Surrogate splits:
skin < 19.5 to the left, agree=0.839, adj=0.182, (0 split)
Node number 10: 9 observations
predicted class=No expected loss=0 P(node) =0.045
class counts: 9 0
probabilities: 1.000 0.000
Node number 11: 26 observations, complexity param=0.01470588
predicted class=No expected loss=0.4230769 P(node) =0.13
class counts: 15 11
probabilities: 0.577 0.423
left son=22 (19 obs) right son=23 (7 obs)
Primary splits:
bp < 68 to the right, improve=1.6246390, (0 missing)
bmi < 33.4 to the left, improve=1.6173080, (0 missing)
npreg < 6.5 to the left, improve=0.9423077, (0 missing)
skin < 39.5 to the right, improve=0.6923077, (0 missing)
ped < 0.334 to the left, improve=0.4923077, (0 missing)
Surrogate splits:
glu < 94.5 to the right, agree=0.808, adj=0.286, (0 split)
ped < 0.2105 to the right, agree=0.808, adj=0.286, (0 split)
Node number 12: 27 observations
predicted class=No expected loss=0.2222222 P(node) =0.135
class counts: 21 6
probabilities: 0.778 0.222
Node number 13: 8 observations
predicted class=Yes expected loss=0.25 P(node) =0.04
class counts: 2 6
probabilities: 0.250 0.750
Node number 14: 11 observations
predicted class=No expected loss=0.2727273 P(node) =0.055
class counts: 8 3
probabilities: 0.727 0.273
Node number 15: 45 observations
predicted class=Yes expected loss=0.1555556 P(node) =0.225
class counts: 7 38
probabilities: 0.156 0.844
Node number 22: 19 observations
predicted class=No expected loss=0.3157895 P(node) =0.095
class counts: 13 6
probabilities: 0.684 0.316
Node number 23: 7 observations
predicted class=Yes expected loss=0.2857143 P(node) =0.035
class counts: 2 5
probabilities: 0.286 0.714
> par(xpd = TRUE); plot(cart_tree1); text(cart_tree1)
> # Predict on the test set and compute the prediction accuracy
> pre_cart_tree1 = predict(cart_tree1, Pima.te, type = "class")
> matrix1 = table(Type = Pima.te$type, predict = pre_cart_tree1)
> matrix1
predict
Type No Yes
No 223 0
Yes 109 0
> accuracy_tree1 = sum(diag(matrix1))/sum(matrix1)
> accuracy_tree1
[1] 0.6716867
> # Prune the fitted tree, setting cp to 0.03
> cart_tree2 = prune(cart_tree1, cp = 0.03)
> par(xpd = TRUE); plot(cart_tree2); text(cart_tree2)
> # Predict on the test set with the pruned model and compute the accuracy
> pre_cart_tree2 = predict(cart_tree2, Pima.te, type = "class")
> matrix2 = table(Type = Pima.te$type, predict = pre_cart_tree2)
> matrix2
predict
Type No Yes
No 223 0
Yes 109 0
> accuracy_tree2 = sum(diag(matrix2))/sum(matrix2)
> accuracy_tree2
[1] 0.6716867
> # Prune the tree further, setting cp to 0.1
> cart_tree3 = prune(cart_tree2, cp = 0.1)
> par(xpd = TRUE); plot(cart_tree3); text(cart_tree3)
> # Predict on the test set with the further-pruned model and compute the accuracy
> pre_cart_tree3 = predict(cart_tree3, Pima.te, type = "class")
> matrix3 = table(Type = Pima.te$type, predict = pre_cart_tree3)
> matrix3
predict
Type No Yes
No 223 0
Yes 109 0
> accuracy_tree3 = sum(diag(matrix3))/sum(matrix3)
> accuracy_tree3
[1] 0.6716867
On this test set the unpruned tree and the trees pruned with cp = 0.03 and cp = 0.1 achieve the same accuracy (about 67.2%), so pruning, even as far as cp = 0.1, greatly simplifies the model with essentially no loss of accuracy.
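Rather than trying cp values by hand, one could also inspect the cross-validated errors that rpart has already computed and prune at the cp with the smallest xerror; a sketch, continuing from cart_tree1 above:

```r
printcp(cart_tree1)    # cp table with cross-validated error (xerror) for each candidate cp
plotcp(cart_tree1)     # the same information as a plot

# prune at the cp value with the smallest cross-validated error
best_cp <- cart_tree1$cptable[which.min(cart_tree1$cptable[, "xerror"]), "CP"]
cart_tree_best <- prune(cart_tree1, cp = best_cp)
```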
4.2 C5.0 Decision Trees
> # C5.0 decision tree analysis
> library(C50)
> library(MASS)
> data("Pima.tr")
> str(Pima.tr)
'data.frame': 200 obs. of 8 variables:
$ npreg: int 5 7 5 0 0 5 3 1 3 2 ...
$ glu : int 86 195 77 165 107 97 83 193 142 128 ...
$ bp : int 68 70 82 76 60 76 58 50 80 78 ...
$ skin : int 28 33 41 43 25 27 31 16 15 37 ...
$ bmi : num 30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
$ ped : num 0.364 0.163 0.156 0.259 0.133 ...
$ age : int 24 55 35 26 23 52 25 24 63 31 ...
$ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
> C50_tree2 = C5.0(type~., Pima.tr, control=C5.0Control(noGlobalPruning = TRUE)) # grow the tree without global pruning
> summary(C50_tree2)
Call:
C5.0.formula(formula = type ~ ., data = Pima.tr, control = C5.0Control(noGlobalPruning = TRUE))
C5.0 [Release 2.07 GPL Edition] Sat Sep 16 12:12:54 2017
-------------------------------
Class specified by attribute `outcome'
Read 200 cases (8 attributes) from undefined.data
Decision tree:
glu <= 123: No (109/15)
glu > 123:
:...bmi > 28.6:
:...ped <= 0.344: No (29/12)
: ped > 0.344: Yes (41/5)
bmi <= 28.6:
:...age <= 32: No (11)
age > 32:
:...bp > 80: No (3)
bp <= 80:
:...ped <= 0.162: No (2)
ped > 0.162: Yes (5)
Evaluation on training data (200 cases):
Decision Tree
----------------
Size Errors
7 32(16.0%) <<
(a) (b) <-classified as
---- ----
127 5 (a): class No
27 41 (b): class Yes
Attribute usage:
100.00% glu
45.50% bmi
38.50% ped
10.50% age
5.00% bp
Time: 0.0 secs
> plot(C50_tree2)
> C50_tree3 = C5.0(type~., Pima.tr, control=C5.0Control(noGlobalPruning = FALSE)) # grow the tree with global pruning
> summary(C50_tree3)
Call:
C5.0.formula(formula = type ~ ., data = Pima.tr, control = C5.0Control(noGlobalPruning = FALSE))
C5.0 [Release 2.07 GPL Edition] Sat Sep 16 12:14:14 2017
-------------------------------
Class specified by attribute `outcome'
Read 200 cases (8 attributes) from undefined.data
Decision tree:
glu <= 123: No (109/15)
glu > 123:
:...bmi <= 28.6: No (21/5)
bmi > 28.6:
:...ped <= 0.344: No (29/12)
ped > 0.344: Yes (41/5)
Evaluation on training data (200 cases):
Decision Tree
----------------
Size Errors
4 37(18.5%) <<
(a) (b) <-classified as
---- ----
127 5 (a): class No
32 36 (b): class Yes
Attribute usage:
100.00% glu
45.50% bmi
35.00% ped
Time: 0.0 secs
> plot(C50_tree3)
> pre_C50_Cla2 = predict(C50_tree2, Pima.te, type = "class")
> matrix2 = table(Type = Pima.te$type, predict = pre_C50_Cla2)
> matrix2
predict
Type No Yes
No 193 30
Yes 58 51
> accuracy_tree2 = sum(diag(matrix2))/sum(matrix2)
> accuracy_tree2
[1] 0.7349398
> pre_C50_Cla3 = predict(C50_tree3, Pima.te, type = "class")
> matrix3 = table(Type = Pima.te$type, predict = pre_C50_Cla3)
> matrix3
predict
Type No Yes
No 195 28
Yes 60 49
> accuracy_tree3 = sum(diag(matrix3))/sum(matrix3)
> accuracy_tree3
[1] 0.7349398
We find that pruning does not change the test-set accuracy here, but the pruned model is clearly easier to interpret.
4.3 CHAID Decision Trees
> # CHAID decision tree analysis
> # CHAID handles only discrete attributes, so every continuous variable must be discretized first; post-pruning is not an issue here
> install.packages("CHAID") # if the package cannot be found on CRAN, download it from https://r-forge.r-project.org/R/?group_id=343 and install it
> library(CHAID)
> # Load the training and test data sets
> data("Pima.tr")
> data("Pima.te")
> # Combine the two data sets
> Pima = rbind(Pima.tr, Pima.te)
> # Discretize the variables and print the resulting levels
> level_name = {}
> for(i in 1:7)
+ {
+ Pima[,i] = cut(Pima[,i], breaks = 3, ordered_result = TRUE, include.lowest = TRUE)
+ level_name <- rbind(level_name, levels(Pima[,i]))
+ }
> level_name = data.frame(level_name)
> row.names(level_name) = colnames(Pima)[1:7]
> colnames(level_name) = paste("L",1:3,sep="")
> level_name
L1 L2 L3
npreg [-0.017,5.67] (5.67,11.3] (11.3,17]
glu [55.9,104] (104,151] (151,199]
bp [23.9,52.7] (52.7,81.3] (81.3,110]
skin [6.91,37.7] (37.7,68.3] (68.3,99.1]
bmi [18.2,34.5] (34.5,50.8] (50.8,67.1]
ped [0.0827,0.863] (0.863,1.64] (1.64,2.42]
age [20.9,41] (41,61] (61,81.1]
> # Use the first 200 records as the training set and the remaining 332 as the test set
> Pima.tr = Pima[1:200,]
> Pima.te = Pima[201:nrow(Pima),]
> CHAID_tree = chaid(type~., Pima.tr)
> CHAID_tree
Model formula:
type ~ npreg + glu + bp + skin + bmi + ped + age
Fitted party:
[1] root
| [2] glu in [55.9,104]
| | [3] age in [20.9,41]: No (n = 50, err = 6.0%)
| | [4] age in (41,61], (61,81.1]: No (n = 10, err = 40.0%)
| [5] glu in (104,151]
| | [6] age in [20.9,41]: No (n = 86, err = 27.9%)
| | [7] age in (41,61], (61,81.1]: Yes (n = 15, err = 26.7%)
| [8] glu in (151,199]: Yes (n = 39, err = 33.3%)
Number of inner nodes: 3
Number of terminal nodes: 5
> plot(CHAID_tree)
> # Predict on the test set and compute the prediction accuracy
> pre_CHAID_tree = predict(CHAID_tree, Pima.te)
> matrix = table(Type = Pima.te$type, predict = pre_CHAID_tree)
> matrix
predict
Type No Yes
No 199 24
Yes 47 62
> accuracy_tree = sum(diag(matrix))/sum(matrix)
> accuracy_tree
[1] 0.7861446