Data Mining: The ID3 Algorithm

阿新 • Published: 2019-01-26

1 Overview

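1.1
    ID3 is a decision-tree classification algorithm developed by J. Ross Quinlan
in 1986. It builds the tree top-down with a greedy strategy: at each node it
picks the attribute with the largest information gain. Information gain
measures how well an attribute separates the sample set into classes.
Because it always splits on the most informative attribute first, ID3 tends to
produce small trees that are fast to query. C4.5 is an improved version of ID3:
it can handle continuous data and uses the information gain ratio instead of
information gain.
To understand information gain, we first need to look at information entropy.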

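1.2 Information entropy
    Information entropy is the expected value of the information content
(-log p) of a random variable. It measures how uncertain the information is:
the larger the entropy, the harder the outcome is to pin down. Processing
information means resolving that uncertainty, which is a process of reducing entropy.
   Entropy(X) = -Sum(p(xi) * log(p(xi))) {i: 1 <= i <= n}
   p(xi) is the probability that X takes the value xi; the logarithm is base 2.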
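
The formula can be sketched in a few lines of C++. This is only an illustration, not part of the original code: the function name `entropy_from_counts` and the choice to pass class counts instead of probabilities are assumptions made here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Shannon entropy in bits of a discrete distribution given as class counts.
// counts[i] is the number of samples in class i (hypothetical input layout).
double entropy_from_counts(const std::vector<int> &counts)
{
    int total = 0;
    for (std::size_t i = 0; i < counts.size(); ++i)
        total += counts[i];
    assert(total > 0);  // entropy of an empty sample set is not defined here

    double h = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        if (counts[i] == 0)
            continue;  // by convention 0 * log2(0) contributes 0
        double p = static_cast<double>(counts[i]) / total;
        h -= p * std::log2(p);
    }
    return h;
}
```

Two equally likely classes give 1 bit, and a set that contains only one class gives 0: the more mixed the set, the higher the entropy.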

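1.3 Information gain
    Information gain measures how much attribute A reduces the entropy of the
sample set X. The larger the gain, the better A is suited to classifying X.
   Gain(A, X) = Entropy(X) - Sum(|Xv| / |X| * Entropy(Xv)) {v: every possible value of A}
   Xv is the subset of X whose samples have value v for attribute A; |Xv| is the number of samples in Xv.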
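
As a rough sketch of the Gain(A, X) formula, the function below computes the gain of a single attribute. The names `information_gain` and `class_counts_t`, and the representation of each sample as an (attribute value, class label) pair, are assumptions for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

typedef std::map<int, int> class_counts_t;  // class label -> count

// Entropy in bits of the class distribution in counts; total is their sum.
static double entropy(const class_counts_t &counts, double total)
{
    double h = 0.0;
    for (class_counts_t::const_iterator it = counts.begin();
            it != counts.end(); ++it) {
        double p = it->second / total;
        if (p > 0.0)
            h -= p * std::log2(p);
    }
    return h;
}

// Gain(A, X) for one attribute A. Each element of samples is a
// (value of A, class label) pair; this layout is an assumption of the sketch.
double information_gain(const std::vector<std::pair<int, int> > &samples)
{
    assert(!samples.empty());
    class_counts_t all;                      // class counts over X
    std::map<int, class_counts_t> by_value;  // class counts over each Xv
    for (std::size_t i = 0; i < samples.size(); ++i) {
        ++all[samples[i].second];
        ++by_value[samples[i].first][samples[i].second];
    }

    double n = static_cast<double>(samples.size());
    double gain = entropy(all, n);
    for (std::map<int, class_counts_t>::const_iterator v = by_value.begin();
            v != by_value.end(); ++v) {
        double nv = 0.0;  // |Xv|
        for (class_counts_t::const_iterator c = v->second.begin();
                c != v->second.end(); ++c)
            nv += c->second;
        gain -= nv / n * entropy(v->second, nv);
    }
    return gain;
}
```

An attribute whose values split the classes perfectly recovers the full entropy of X as gain, while an attribute that leaves every subset as mixed as X has gain 0.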

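2 ID3 algorithm flow
   Input: sample set S, attribute set A
   Output: the ID3 decision tree.
   1) If every attribute has been processed, return; otherwise go to 2).
   2) Find the attribute a with the largest information gain and make it a node.
       If attribute a alone is enough to classify the samples, return; otherwise go to 3).
   3) For each possible value v of attribute a, do the following:
       i. Take all samples whose value of a is v as a subset Sv of S;
       ii. Build the attribute set AT = A - {a};
       iii. Run ID3 recursively with sample set Sv and attribute set AT as input.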
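
The steps above can be sketched compactly as follows. This is not the implementation from section 4: the types `Sample` and `Node` and the functions `build_id3` and `classify` are hypothetical names, and two choices are assumptions of this sketch: purity is checked on entry to each call, and a node that runs out of attributes falls back to the majority class.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <memory>
#include <set>
#include <vector>

struct Sample {
    std::vector<int> attrs;  // attrs[a] is the value of attribute a
    int label;               // class label
};

struct Node {
    int attr = -1;   // attribute tested here, -1 for a leaf
    int label = -1;  // class label of a leaf, -1 for an inner node
    std::map<int, std::unique_ptr<Node> > next;  // attribute value -> subtree
};

// Count how many samples carry each class label.
static std::map<int, int> label_counts(const std::vector<Sample> &s)
{
    std::map<int, int> counts;
    for (std::size_t i = 0; i < s.size(); ++i)
        ++counts[s[i].label];
    return counts;
}

// Entropy in bits of the class labels in s.
static double entropy(const std::vector<Sample> &s)
{
    double h = 0.0;
    std::map<int, int> counts = label_counts(s);
    for (auto it = counts.begin(); it != counts.end(); ++it) {
        double p = static_cast<double>(it->second) / s.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Step 3 i: group the samples by their value of attribute a.
static std::map<int, std::vector<Sample> > split(const std::vector<Sample> &s, int a)
{
    std::map<int, std::vector<Sample> > parts;
    for (std::size_t i = 0; i < s.size(); ++i)
        parts[s[i].attrs[a]].push_back(s[i]);
    return parts;
}

std::unique_ptr<Node> build_id3(const std::vector<Sample> &s, const std::set<int> &attrs)
{
    assert(!s.empty());
    std::unique_ptr<Node> node(new Node);
    std::map<int, int> counts = label_counts(s);
    if (counts.size() == 1 || attrs.empty()) {
        // Leaf: one class left, or no attribute left (then take the majority).
        int best_n = 0;
        for (auto it = counts.begin(); it != counts.end(); ++it)
            if (it->second > best_n) { best_n = it->second; node->label = it->first; }
        return node;
    }
    // Step 2: choose the attribute with the largest information gain.
    double best_gain = -1.0;
    for (auto a = attrs.begin(); a != attrs.end(); ++a) {
        std::map<int, std::vector<Sample> > parts = split(s, *a);
        double gain = entropy(s);
        for (auto p = parts.begin(); p != parts.end(); ++p)
            gain -= static_cast<double>(p->second.size()) / s.size() * entropy(p->second);
        if (gain > best_gain) { best_gain = gain; node->attr = *a; }
    }
    // Step 3: recurse on each subset Sv with AT = A - {a}.
    std::set<int> rest(attrs);
    rest.erase(node->attr);
    std::map<int, std::vector<Sample> > parts = split(s, node->attr);
    for (auto p = parts.begin(); p != parts.end(); ++p)
        node->next[p->first] = build_id3(p->second, rest);
    return node;
}

// Walk the tree for one attribute vector; -1 means an unseen attribute value.
int classify(const Node &n, const std::vector<int> &attrs)
{
    if (n.attr < 0)
        return n.label;
    auto it = n.next.find(attrs[n.attr]);
    return it == n.next.end() ? -1 : classify(*it->second, attrs);
}
```

Calling `build_id3(samples, attribute_indices)` returns the root of the tree, and `classify(*root, values)` walks it for one sample. Using `std::unique_ptr` for the children keeps the sketch free of manual cleanup; it needs a C++11 compiler.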

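3 An example
   3.1
   This example comes from Quinlan's paper.
   Suppose there is an outdoor activity, and whether it can take place depends on several weather factors.
   Each combination of factors leads to one of two outcomes, that is, two classes: the activity can take place or it cannot.
   We use P to mean the activity can take place and N to mean it cannot.
   The table below lists the sample set: how different combinations of weather factors affect the activity.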

                      Attribute                       class
   outlook     temperature    humidity    windy
   ---------------------------------------------------------
   sunny       hot            high        false       N
   sunny       hot            high        true        N
   overcast    hot            high        false       P
   rain        mild           high        false       P
   rain        cool           normal      false       P
   rain        cool           normal      true        N
   overcast    cool           normal      true        P
   sunny       mild           high        false       N
   sunny       cool           normal      false       P
   rain        mild           normal      false       P
   sunny       mild           normal      true        P
   overcast    mild           high        true        P
   overcast    hot            normal      false       P
   rain        mild           high        true        N

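   3.2
   The probability that the activity cannot take place is 5/14.
   The probability that the activity can take place is 9/14.
   So the entropy of the sample set is: -5/14*log2(5/14) - 9/14*log2(9/14) = 0.940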

   3.3
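   Next, look at how the entropy is computed for the attribute outlook.
   When outlook is sunny,
   the probability that the activity cannot take place is 3/5, and
   the probability that it can take place is 2/5.
   So the entropy for sunny is: -3/5*log2(3/5) - 2/5*log2(2/5) = 0.971

   The entropy for the other values of outlook is computed the same way:
   entropy when outlook is overcast: 0
   entropy when outlook is rain: 0.971

   Information gain of outlook: gain(outlook) = 0.940 - (5/14*0.971 + 4/14*0 + 5/14*0.971) = 0.246

   The information gain of the other attributes is computed in a similar way:
   gain(temperature) = 0.029
   gain(humidity) = 0.151
   gain(windy) = 0.048

   The attribute with the largest information gain is outlook.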
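
The figures in 3.2 and 3.3 can be checked with a short program. The sketch below is independent of the code in section 4; the integer encoding of the table and the names `dataset_entropy`, `attribute_gain` and `best_attribute` are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cmath>

// The 14 rows of the table in 3.1: outlook, temperature, humidity, windy, class.
// Encoding (an assumption of this sketch): outlook sunny=0 overcast=1 rain=2,
// temperature hot=0 mild=1 cool=2, humidity high=0 normal=1,
// windy false=0 true=1, class N=0 P=1.
static const int kRows = 14, kAttrs = 4;
static const int kData[kRows][kAttrs + 1] = {
    {0, 0, 0, 0, 0}, {0, 0, 0, 1, 0}, {1, 0, 0, 0, 1}, {2, 1, 0, 0, 1},
    {2, 2, 1, 0, 1}, {2, 2, 1, 1, 0}, {1, 2, 1, 1, 1}, {0, 1, 0, 0, 0},
    {0, 2, 1, 0, 1}, {2, 1, 1, 0, 1}, {0, 1, 1, 1, 1}, {1, 1, 0, 1, 1},
    {1, 0, 1, 0, 1}, {2, 1, 0, 1, 0}};

// Entropy in bits of a two-class set with n samples of N and p samples of P.
static double entropy2(int n, int p)
{
    double total = n + p, h = 0.0;
    assert(total > 0);
    if (n > 0) h -= n / total * std::log2(n / total);
    if (p > 0) h -= p / total * std::log2(p / total);
    return h;
}

// Entropy of the whole sample set (section 3.2).
double dataset_entropy()
{
    int n = 0, p = 0;
    for (int r = 0; r < kRows; ++r)
        (kData[r][kAttrs] ? p : n) += 1;
    return entropy2(n, p);
}

// Information gain of attribute attr (0..3) over the whole set (section 3.3).
double attribute_gain(int attr)
{
    assert(attr >= 0 && attr < kAttrs);
    int n[3] = {0, 0, 0}, p[3] = {0, 0, 0};  // at most 3 values per attribute
    for (int r = 0; r < kRows; ++r)
        (kData[r][kAttrs] ? p : n)[kData[r][attr]] += 1;

    double g = dataset_entropy();
    for (int v = 0; v < 3; ++v)
        if (n[v] + p[v] > 0)
            g -= (n[v] + p[v]) / static_cast<double>(kRows) * entropy2(n[v], p[v]);
    return g;
}

// Index of the attribute with the largest gain.
int best_attribute()
{
    int best = 0;
    for (int a = 1; a < kAttrs; ++a)
        if (attribute_gain(a) > attribute_gain(best))
            best = a;
    return best;
}
```

Computed at full precision, the gains come out at roughly 0.2467, 0.0292, 0.1518 and 0.0481, so the three-decimal figures quoted above agree to within rounding, and `best_attribute()` selects outlook.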

   3.4
   Split the samples into 3 subsets according to outlook, then run the algorithm
   recursively on each subset together with the remaining attributes.
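
To make the split concrete, the sketch below groups the 14 rows by outlook and checks which subsets already contain a single class. The arrays repeat the outlook and class columns of the table with a hypothetical encoding (sunny=0, overcast=1, rain=2; N=0, P=1), and the function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

static const int kRows = 14;
// Outlook and class columns of the table in 3.1, in row order.
static const int kOutlook[kRows] = {0, 0, 1, 2, 2, 2, 1, 0, 0, 2, 0, 1, 1, 2};
static const int kClass[kRows]   = {0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0};

// Group row indices by outlook value: the three subsets of step 3.4.
std::map<int, std::vector<int> > split_by_outlook()
{
    std::map<int, std::vector<int> > parts;
    for (int r = 0; r < kRows; ++r)
        parts[kOutlook[r]].push_back(r);
    return parts;
}

// True when every row in rows has the same class, so the subset is a leaf.
bool is_pure(const std::vector<int> &rows)
{
    assert(!rows.empty());
    for (std::size_t i = 1; i < rows.size(); ++i)
        if (kClass[rows[i]] != kClass[rows[0]])
            return false;
    return true;
}
```

The sunny and rain subsets each hold 5 rows with mixed classes, so they are split again using temperature, humidity and windy. The overcast subset holds 4 rows that are all P, so it becomes a leaf straight away.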

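4 Code demonstration
   4.1
   Notes on the code:
   The code only demonstrates the example from the previous section. It was written in a hurry, without careful design or coding,
   and has only had preliminary testing on Fedora 16, so it contains some errors and rough spots.
   4.2
   Compile:
       g++ -g -W -Wall -Wextra -o mytest main.cpp id3.cpp
   4.3
   Run:
       ./mytest
   4.4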

id3.h:
================================================
// Thu Jul 12 15:07:10 CST 2012
// author: 李小丹 (Li Shao Dan), courtesy name 殊恆 (shuheng)
// K.I.S.S
// S.P.O.T

#ifndef ID3_H
#define ID3_H

#include <list>
#include <map>
#include <utility>

// Attribute indices and values are >= 0; index 0 holds the class label.
// An index or class of -1 means "not decided yet".
class id3_classify {
public:
   id3_classify(int);
   ~id3_classify();

public:
   int push_sample(const int *, int);
   int classify();
   int match(const int *);
   void print_tree();

private:
   typedef std::list<std::list<std::pair<int, int> > > sample_space_t;

   struct tree_node {
       int index;
       int classification;
       std::map<int, struct tree_node *> next;
       sample_space_t unclassified;
   };

private:

   void clear(struct tree_node *);
   int recur_classify(struct tree_node *, int);
   int recur_match(const int *, struct tree_node *);
   int max_gain(struct tree_node *);
   double cal_entropy(const std::map<int, int> &, double);
   int cal_max_gain(const sample_space_t &);
   int cal_split(struct tree_node *, int);
   void att_statistics(const sample_space_t &,
           std::map<int, std::map<int, int> > &,
           std::map<int, std::map<int, std::map<int, int> > > &,
           std::map<int, int> &);
   double cal_gain(std::map<int, int> &,
           std::map<int, std::map<int, int> > &,
           double, double);

   int is_classfied(const sample_space_t &);
   void dump_tree(struct tree_node *);

private:
   sample_space_t unclassfied;
   struct tree_node *root;
   std::map<int, int> *attribute_values;
   int dimension;
};

#endif
===================================================

id3.cpp:
==================================================
// Mon Jul 16 10:07:43 CST 2012
// author: 李小丹 (Li Shao Dan), courtesy name 殊恆 (shuheng)
// K.I.S.S
// S.P.O.T

#include <iostream>

#include <cmath>
#include <cassert>

#include "id3.h"

using namespace std;

id3_classify::id3_classify(int d)
:root(new struct tree_node), dimension(d)
{
   root->index = -1;
   root->classification = -1;
}

id3_classify::~id3_classify()
{
   clear(root);
}

int id3_classify::push_sample(const int *vec, int c)
{
   list<pair<int, int> > v;

   for(int i = 0; i < dimension; ++i)
       v.push_back(make_pair(i + 1, vec[i]));
   v.push_front(make_pair(0, c));

   root->unclassified.push_back(v);

   return 0;
}

int id3_classify::classify()
{
   return recur_classify(root, dimension);
}

int id3_classify::match(const int *v)
{
   return recur_match(v, root);
}

void id3_classify::clear(struct tree_node *node)
{
   unclassfied.clear();

   std::map<int, struct tree_node *> &next = node->next;
   for(std::map<int, struct tree_node *>::iterator pos
           = next.begin(); pos != next.end(); ++pos)
       clear(pos->second);

   next.clear();
   delete node;
}

int id3_classify::recur_classify(struct tree_node *node, int dim)
{
   sample_space_t &unclassified = node->unclassified;
   int cls;
   if((cls = is_classfied(unclassified)) >= 0) {
       node->index = -1;
       node->classification = cls;
       return 0;
   }
   int ret = max_gain(node);
   unclassified.clear();
   if(ret < 0) return 0;

   map<int, struct tree_node *> &next = node->next;
   for(map<int, struct tree_node *>::iterator pos
           = next.begin(); pos != next.end(); ++pos)
       recur_classify(pos->second, dim - 1);

   return 0;
}

int id3_classify::is_classfied(const sample_space_t &ss)
{
   const list<pair<int, int> > &f = ss.front();
   if(f.size() == 1)
       return f.front().second;

   int cls = -1;   // class of the first sample; initialized to avoid an uninitialized read
   for(list<pair<int, int> >::const_iterator p
           = f.begin(); p != f.end(); ++p) {
           if(!p->first) {
               cls = p->second;
               break;
           }
   }
   for(sample_space_t::const_iterator s
           = ss.begin(); s != ss.end(); ++s) {
       const list<pair<int, int> > &v = *s;
       for(list<pair<int, int> >::const_iterator vp
               = v.begin(); vp != v.end(); ++vp) {
           if(!vp->first) {
               if(cls != vp->second)
                   return -1;
               else
                   break;
           }
       }
   }
   return cls;
}

int id3_classify::max_gain(struct tree_node *node)
{
   // index of max attribute gain
   int mai = cal_max_gain(node->unclassified);
   assert(mai >= 0);
   node->index = mai;
   cal_split(node, mai);
   return 0;
}

int id3_classify::cal_max_gain(const sample_space_t &ss)
{
   map<int, map<int, int> >att_val;
   map<int, map<int, map<int, int> > >val_cls;
   map<int, int> cls;

   att_statistics(ss, att_val, val_cls, cls);

   double s = (double)ss.size();
   double entropy = cal_entropy(cls, s);

   double mag = -1;        // max information gain
   int mai = -1; // index of max information gain

   for(map<int, map<int, int> >::iterator p
           = att_val.begin(); p != att_val.end(); ++p) {
       double g;
       if((g = cal_gain(p->second, val_cls[p->first],
                       s, entropy)) > mag) {
           mag = g;
           mai = p->first;
       }
   }
   if(!att_val.size() && !val_cls.size() && cls.size())
       return 0;
   return mai;
}

void id3_classify::att_statistics(const sample_space_t &ss,
       map<int, map<int, int> > &att_val,
       map<int, map<int, map<int, int> > > &val_cls,
       map<int, int> &cls)
{
   for(sample_space_t::const_iterator spl = ss.begin();
           spl != ss.end(); ++spl) {
       const list<pair<int, int> > &v = *spl;
       int c = -1;   // class label of this sample (stored under index 0)
       for(list<pair<int, int> >::const_iterator vp
               = v.begin(); vp != v.end(); ++vp) {
           if(!vp->first) {
               c = vp->second;
               break;
           }
       }
       ++cls[c];
       for(list<pair<int, int> >::const_iterator vp
               = v.begin(); vp != v.end(); ++vp) {
           if(vp->first) {
               ++att_val[vp->first][vp->second];
               ++val_cls[vp->first][vp->second][c];
           }
       }
   }
}

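// Entropy (in bits) of a distribution given as value -> count, where s is
// the total number of samples the counts were taken from.
double id3_classify::cal_entropy(const map<int, int> &att, double s)
{
   double entropy = 0;
   for(map<int, int>::const_iterator pos = att.begin();
           pos != att.end(); ++pos) {
       if(pos->second <= 0) continue;   // 0 * log2(0) is taken as 0
       double tmp = pos->second / s;
       entropy += tmp * log2(tmp);
   }
   return -entropy;
}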

double id3_classify::cal_gain(map<int, int> &att_val,
       map<int, map<int, int> > &val_cls,
       double s, double entropy)
{
   double gain = entropy;
   for(map<int, int>::const_iterator att = att_val.begin();
           att != att_val.end(); ++att) {
       double r = att->second / s;
       double e = cal_entropy(val_cls[att->first], att->second);
       gain -= r * e;
   }
   return gain;
}

int id3_classify::cal_split(struct tree_node *node, int idx)
{
   map<int, struct tree_node *> &next = node->next;
   sample_space_t &unclassified = node->unclassified;

   for(sample_space_t::iterator sp = unclassified.begin();
           sp != unclassified.end(); ++sp) {
       list<pair<int, int> > &v = *sp;
       for(list<pair<int, int> >::iterator vp = v.begin();
               vp != v.end(); ++vp) {
           if(vp->first == idx) {
               struct tree_node *tmp;
               if(!(tmp = next[vp->second])) {
                   tmp = new struct tree_node;
                   tmp->index = -1;
                   tmp->classification = -1;
                   next[vp->second] = tmp;
               }
               v.erase(vp);
               tmp->unclassified.push_back(v);
               break;
           }
       }
   }
   return 0;
}

int id3_classify::recur_match(const int *v, struct tree_node *node)
{
   if(node->index < 0)
       return node->classification;

   map<int, struct tree_node *>::iterator p;
   map<int, struct tree_node *> &next = node->next;

   if((p = next.find(v[node->index-1])) == next.end())
       return -1;

   return recur_match(v, p->second);
}

void id3_classify::print_tree()
{
   return dump_tree(root);
}

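// Pre-order dump: I = attribute index tested at the node (-1 for a leaf),
// C = class of a leaf (-1 for an inner node), N = number of children.
void id3_classify::dump_tree(struct tree_node *node)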
{
   cout << "I: " << node->index << endl;
   cout << "C: " << node->classification << endl;
   cout << "N: " << node->next.size() << endl;
   cout << "+++++++++++++++++++++++\n";

   map<int, struct tree_node *> &next = node->next;
   for(map<int, struct tree_node *>::iterator p
           = next.begin(); p != next.end(); ++p) {
       dump_tree(p->second);
   }
}
====================================================

main.cpp:
===================================================
// Wed Jul 18 13:59:10 CST 2012
// author: 李小丹 (Li Shao Dan), courtesy name 殊恆 (shuheng)
// K.I.S.S
// S.P.O.T

#include <iostream>

#include "id3.h"

using namespace std;

int main()
{
   enum outlook {SUNNY, OVERCAST, RAIN};
   enum temp {HOT, MILD, COOL};
   enum hum {HIGH, NORMAL};
   enum windy {WEAK, STRONG};   // WEAK = windy false, STRONG = windy true in the table

   int samples[14][4] = {
       {SUNNY   ,       HOT ,      HIGH ,       WEAK },
       {SUNNY   ,       HOT ,      HIGH ,       STRONG},
       {OVERCAST,       HOT ,      HIGH ,       WEAK },
       {RAIN    ,       MILD,      HIGH ,       WEAK },
       {RAIN    ,       COOL,      NORMAL,       WEAK },
       {RAIN    ,       COOL,      NORMAL,       STRONG},
       {OVERCAST,       COOL,      NORMAL,       STRONG},
       {SUNNY   ,       MILD,      HIGH ,       WEAK },
       {SUNNY   ,       COOL,      NORMAL,       WEAK },
       {RAIN    ,       MILD,      NORMAL,       WEAK },
       {SUNNY   ,       MILD,      NORMAL,       STRONG},
       {OVERCAST,       MILD,      HIGH ,       STRONG},
       {OVERCAST,       HOT ,      NORMAL,       WEAK },
       {RAIN    ,       MILD,      HIGH ,       STRONG}};

   id3_classify cls(4);   // 4 attributes per sample
   // The second argument of push_sample is the class label: 0 = N, 1 = P.
   cls.push_sample((int *)&samples[0], 0);
   cls.push_sample((int *)&samples[1], 0);
   cls.push_sample((int *)&samples[2], 1);
   cls.push_sample((int *)&samples[3], 1);
   cls.push_sample((int *)&samples[4], 1);
   cls.push_sample((int *)&samples[5], 0);
   cls.push_sample((int *)&samples[6], 1);
   cls.push_sample((int *)&samples[7], 0);
   cls.push_sample((int *)&samples[8], 1);
   cls.push_sample((int *)&samples[9], 1);
   cls.push_sample((int *)&samples[10], 1);
   cls.push_sample((int *)&samples[11], 1);
   cls.push_sample((int *)&samples[12], 1);
   cls.push_sample((int *)&samples[13], 0);

   cls.classify();
   cls.print_tree();
   cout << "===============================\n";
   for(int i = 0; i < 14; ++i)
       cout << cls.match((int *)&samples[i]) << endl;
   return 0;
}
================================================
