Some understanding of《Improved Use of Continuous Attributes in C4.5》

Here are the formulas given in
“Improved Use of Continuous Attributes in C4.5”,
Journal of Artificial Intelligence Research 4 (1996), 77-90:

$$Info(D)=-\sum_{j=1}^{C}p(D,j)\cdot\log_2(p(D,j))$$

$$Gain(D,T)=Info(D)-\sum_{i=1}^{k}\frac{|D_i|}{|D|}\cdot Info(D_i)$$

$$Split(D,T)=-\sum_{i=1}^{k}\frac{|D_i|}{|D|}\cdot\log_2\left(\frac{|D_i|}{|D|}\right)$$
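
To check my reading of these three formulas, here is a minimal Python sketch. The function names `info`, `gain` and `split` are my own (they are not from the paper), and I assume that D is given as a plain list of class labels and that the test T is represented by the list of label subsets D_1..D_k it produces.

```python
from collections import Counter
from math import log2


def info(labels):
    """Info(D): entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain(labels, partition):
    """Gain(D, T): Info(D) minus the size-weighted Info of each subset D_i."""
    n = len(labels)
    return info(labels) - sum(len(d) / n * info(d) for d in partition if d)


def split(labels, partition):
    """Split(D, T): entropy of the subset sizes |D_i| / |D|."""
    n = len(labels)
    return -sum(len(d) / n * log2(len(d) / n) for d in partition if d)


# Toy data (made up for illustration): a test that separates the classes perfectly.
labels = ["a", "a", "b", "b"]
partition = [["a", "a"], ["b", "b"]]
print(info(labels), gain(labels, partition), split(labels, partition))
```

Empty subsets are skipped so that `log2(0)` is never evaluated; that guard is my own choice for the sketch.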

The following is my understanding:
------------------first change-----------------------------
The ordinary gain ratio is
$$Gain\_Ratio=\frac{Gain(D,T)}{Split(D,T)}$$

My understanding of the "first change" is then
$$Gain\_Ratio\_adjusted=\frac{Gain(D,T)-\frac{\log_2(N-1)}{|D|}}{Split(D,T)}$$
where N is the number of distinct values of the continuous attribute in D (so there are N-1 candidate thresholds) and |D| is the number of cases.
Is this right?
Many thanks~
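
To make the adjusted formula above concrete, here is a small Python sketch of how I read it. The function name `adjusted_gain_ratio` and its parameter names are hypothetical, and I assume the denominator of the penalty is |D|, the number of cases. Returning 0.0 when no threshold exists or the split information is zero is also my own assumption, not something I took from the paper.

```python
from math import log2


def adjusted_gain_ratio(gain, split, n_distinct, n_cases):
    """(Gain - log2(N - 1) / |D|) / Split for a continuous attribute.

    gain, split: Gain(D, T) and Split(D, T) for the chosen threshold
    n_distinct:  N, the number of distinct values of the attribute in D
    n_cases:     |D|, the number of cases
    """
    if n_distinct < 2 or split <= 0:
        return 0.0  # assumption: no usable threshold, so score it as 0
    penalty = log2(n_distinct - 1) / n_cases
    return (gain - penalty) / split


# Made-up numbers: N = 5 and |D| = 8 give a penalty of log2(4) / 8 = 0.25.
print(adjusted_gain_ratio(gain=0.5, split=1.0, n_distinct=5, n_cases=8))
```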
--------------------second change---------------------------
The relevant part of the paper for the "second change" is:
"This seems to be an unnecessary complication, so the threshold t is chosen instead to maximize gain. Once the threshold is chosen, however, the final selection of the attribute to be used for the test is still made on the basis of the gain ratio criterion using the adjusted gain."
My understanding is:


1st step:
For each continuous feature, choose the threshold t that maximizes the plain gain, $Gain(D,T)_{max}$,
not the gain ratio, $Gain\_Ratio_{max}$,
and not the adjusted gain, $(Gain(D,T)-\log_2(N-1)/|D|)_{max}$.
2nd step:
Choose the best feature according to:
$$Gain\_Ratio(discrete\ feature)=\frac{Gain(D,T)}{Split(D,T)}$$
$$Gain\_Ratio\_adjusted(continuous\ feature)=\frac{Gain(D,T)-\frac{\log_2(N-1)}{|D|}}{Split(D,T)}$$
Finally, just choose the feature whose Gain Ratio or adjusted Gain Ratio is the largest.
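
The two steps above can be sketched in Python as follows. This is only my reading of the procedure, not C4.5's actual code: all function names are hypothetical, the candidate thresholds are assumed to be the distinct attribute values except the largest (with the test `value <= t`), and features are assumed to arrive as plain lists tagged "continuous" or "discrete".

```python
from collections import Counter
from math import log2


def info(labels):
    """Info(D): entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain_and_split(labels, parts):
    """Return (Gain(D, T), Split(D, T)) for the non-empty subsets in parts."""
    n = len(labels)
    parts = [p for p in parts if p]
    g = info(labels) - sum(len(p) / n * info(p) for p in parts)
    s = -sum(len(p) / n * log2(len(p) / n) for p in parts)
    return g, s


def score_discrete(values, labels):
    """Plain gain ratio for a discrete feature (one subset per value)."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    g, s = gain_and_split(labels, list(groups.values()))
    return g / s if s > 0 else 0.0


def score_continuous(values, labels):
    """Return (adjusted gain ratio, threshold) for a continuous feature."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return 0.0, None  # assumption: no threshold possible, score 0
    # 1st step: pick the threshold by maximum plain gain.
    best = None
    for t in distinct[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        g, s = gain_and_split(labels, [left, right])
        if best is None or g > best[0]:
            best = (g, s, t)
    g, s, t = best
    # 2nd step: score the chosen threshold with the adjusted gain ratio.
    penalty = log2(len(distinct) - 1) / len(labels)
    return (g - penalty) / s, t


def best_feature(columns, kinds, labels):
    """Pick the feature with the largest (adjusted) gain ratio."""
    scores = {}
    for name, values in columns.items():
        if kinds[name] == "continuous":
            scores[name], _ = score_continuous(values, labels)
        else:
            scores[name] = score_discrete(values, labels)
    return max(scores, key=scores.get), scores


# Toy data, made up for illustration only.
labels = ["y", "y", "n", "n"]
columns = {"temp": [1.0, 2.0, 3.0, 4.0], "color": ["r", "g", "r", "g"]}
kinds = {"temp": "continuous", "color": "discrete"}
print(best_feature(columns, kinds, labels))
```

Note that the penalty only enters in the 2nd step. Since it is the same for every threshold of a given feature, it would not change which threshold wins in the 1st step anyway; it only affects the comparison between features.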


Is this understanding right?
Many thanks~