機器學習中K-means聚類演算法原理及C語言實現

機器學習中K-means聚類演算法原理及C語言實現

本人以前主要focus在傳統音訊的軟體開發,接觸到的演算法主要是音訊訊號處理相關的,如各種編解碼演算法和回聲消除演算法等。最近切到語音識別上,接觸到的演算法就變成了各種機器學習演算法,如GMM等。K-means作為其中比較簡單的一種肯定是要好好掌握的。今天就講講K-means的基本原理和程式碼實現。其中基本原理簡述(主要是因為:1,K-means比較簡單;2,網上有很多講K-means基本原理的),重點放在程式碼實現上。

 

1, K-means基本原理

K均值(K-means)聚類演算法是無監督聚類(聚類(clustering)是將資料集中的樣本劃分為若干個通常是不相交的子集,每個子集稱為一個“簇(cluster)”)演算法中的一種,也是最常用的聚類演算法。K表示類別數,Means表示均值。K-means主要思想是在給定K值和若干樣本(點)的情況下,把每個樣本(點)分到離其最近的類簇中心點所代表的類簇中,所有點分配完畢之後,根據一個類簇內的所有點重新計算該類簇的中心點(取平均值),然後再迭代的進行分配點和更新類簇中心點的步驟,直至類簇中心點的變化很小,或者達到指定的迭代次數。

 

 K-means演算法流程如下:

(a)隨機選取K個初始cluster center

(b)分別計算所有樣本到這K個cluster center的距離

(c)如果樣本離cluster center Ci最近,那麼這個樣本屬於Ci點簇;如果到多個cluster center的距離相等,則可劃分到任意簇中

(d)按距離對所有樣本分完簇之後,計算每個簇的均值(最簡單的方法就是求樣本每個維度的平均值),作為新的cluster center

(e)重複(b)(c)(d)直到新的cluster center和上輪cluster center變化很小或者達到指定的迭代次數,演算法結束

 

2, 演算法實現

我主要偏底層開發,最熟悉語言是C,所以程式碼是用C語言來實現的。在二維平面上隨機分佈著一些點(示意圖略),

用K-means演算法對其分類,其中類的個數(即K值)和點的個數人為指定。具體的程式碼如下:

#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<math.h>

#define MAX_ROUNDS 100    //maximum number of k-means iterations allowed

//Structure representing one 2-D data point
typedef struct Point{
  int x_value;           //coordinate of the point on the X axis
  int y_value;           //coordinate of the point on the Y axis
  int cluster_id;        //id of the cluster this point belongs to (-1 = not yet assigned)
}Point;
Point* data;
 
//Structure representing one cluster center (centroid)
typedef struct ClusterCenter{
  double x_value;
  double y_value;
  int cluster_id;
}ClusterCenter;
ClusterCenter* cluster_center;

//Scratch structure used while recomputing a cluster center
//(holds the running coordinate sums, then the means)
typedef struct CenterCalc{
  double x_value;
  double y_value;
}CenterCalc;
CenterCalc *center_calc;
 
int is_continue;                               //whether the k-means iteration should continue (1) or stop (0)
int* cluster_center_init_index;        //index of the point initially chosen as each cluster's center
double* distance_from_center;      //distances from one point to every cluster center
int* data_size_per_cluster;            //number of points currently in each cluster
int data_size_total;                        //total number of points
char filename[200];                       //name of the file containing the point data
int cluster_count;                          //number of clusters (the K in k-means)
 
//Forward declarations for the k-means pipeline (see definitions below).
void memoryAlloc();
void memoryFree();
void readDataFromFile();
void initialCluster();
void calcDistance2OneCenter(int pointID, int centerID);
void calcDistance2AllCenters(int pointID);
void partition4OnePoint(int pointID);
void partition4AllPointOneCluster();
void calcClusterCenter();
void kmeans();
void compareNewOldClusterCenter(CenterCalc* center_calc);
 
int main(int argc, char* argv[])
{
    if( argc != 4 )
    {
        printf("This application needs 3 parameters to run:"
            "\n the 1st is the size of data set,"
            "\n the 2nd is the file name that contains data"
            "\n the 3rd indicates the cluster_count"
            "\n");
        exit(1);
    }

    data_size_total = atoi(argv[1]);
    strcat(filename, argv[2]);
    cluster_count = atoi(argv[3]);
    //1, memory alloc
    memoryAlloc();
    //2, read point data from file
    readDataFromFile();
    //3, initial cluster
    initialCluster();
    //4, run k-means
    kmeans();
    //5, memory free & end
    memoryFree();
    
    return 0;
}

void memoryAlloc()
{
  data = (Point*)malloc(sizeof(struct Point) * (data_size_total));
  if( !data )
  {
    printf("malloc error:data!");
    exit(1);
  }
  cluster_center_init_index = (int*)malloc(sizeof(int) * (cluster_count));
  if( !cluster_center_init_index )
  {
    printf("malloc error:cluster_center!\n");
    exit(1);
  }
  distance_from_center = (double*)malloc(sizeof(double) * (cluster_count));
  if( !distance_from_center )
  {
    printf("malloc error: distance_from_center!\n");
    exit(1);
  }
  cluster_center = (ClusterCenter*)malloc(sizeof(struct ClusterCenter) * (cluster_count));
  if( !cluster_center )
  {
    printf("malloc cluster center new error!\n");
    exit(1);
  }

  center_calc = (CenterCalc*)malloc(sizeof(CenterCalc) * cluster_count);
  if( !center_calc )
  {
    printf("malloc error: center_calc!\n");
    exit(1);
  }

  data_size_per_cluster = (int*)malloc(sizeof(int) * (cluster_count));
  if( !data_size_per_cluster )
  {
    printf("malloc error: data_size_per_cluster\n");
    exit(1);
  }
 
}

//Release every buffer allocated by memoryAlloc() and clear the global
//pointers so any accidental later use fails fast instead of touching
//freed memory.
void memoryFree()
{
  free(data);
  free(cluster_center_init_index);
  free(distance_from_center);
  free(cluster_center);
  free(center_calc);
  free(data_size_per_cluster);

  data = NULL;
  cluster_center_init_index = NULL;
  distance_from_center = NULL;
  cluster_center = NULL;
  center_calc = NULL;
  data_size_per_cluster = NULL;
}

//Read the x and y coordinates of all data_size_total points from 'filename'
//into the global 'data' array. Aborts the program if the file cannot be
//opened or does not contain enough well-formed "x y" pairs.
void readDataFromFile()
{
  int i;
  FILE* fp;

  if( NULL == (fp = fopen(filename, "r")))
  {
    printf("open file(%s) error!\n", filename);
    exit(1);
  }

  for( i = 0; i < data_size_total; i++ )
  {
    if( 2 != fscanf(fp, "%d %d ", &data[i].x_value, &data[i].y_value))
    {
      //fix: previously only printed and kept going, leaving the point
      //uninitialized; a malformed file must stop the run
      printf("fscanf error: %d\n", i);
      fclose(fp);
      exit(1);
    }
    data[i].cluster_id = -1;    //every point starts unassigned

    printf("After reading, point index:%d, X:%d, Y:%d, cluster_id:%d\n", i, data[i].x_value, data[i].y_value, data[i].cluster_id);
  }

  fclose(fp);    //fix: the stream was previously never closed (resource leak)
}
 

//根據傳入的cluster_count來隨機的選擇一個點作為 一個cluster的center  
void initialCluster()
{
  int i,j;
  int random;
    
  //產生初始化的cluster_count個聚類  
  for( i = 0; i < cluster_count; i++ )
  {
    cluster_center_init_index[i] = -1;
  }
  //隨機選擇一個點作為每個cluster的center(不重複)
  for( i = 0; i < cluster_count; i++ )
  {
    Reselect:
        random = rand() % (data_size_total - 1);
        for(j = 0; j < i; j++) {
            if(random == cluster_center_init_index[j])
                goto Reselect;
        }

    cluster_center_init_index[i] = random;
    printf("cluster_id: %d, located in point index:%d\n", i, random);  
  }
  //將隨機選擇的點作為center,同時這個點的cluster id也就確定了
  for( i = 0; i < cluster_count; i++ )
  {
    cluster_center[i].x_value = data[cluster_center_init_index[i]].x_value;
    cluster_center[i].y_value = data[cluster_center_init_index[i]].y_value;
    cluster_center[i].cluster_id = i;
    data[cluster_center_init_index[i]].cluster_id = i;

    printf("cluster_id:%d, index:%d, x_value:%f, y_value:%f\n", cluster_center[i].cluster_id, cluster_center_init_index[i], cluster_center[i].x_value, cluster_center[i].y_value);
  }
}
 

//Euclidean distance from point 'point_id' to cluster center 'center_id';
//the result is stored in distance_from_center[center_id].
void calcDistance2OneCenter(int point_id, int center_id)
{
  double dx = data[point_id].x_value - cluster_center[center_id].x_value;
  double dy = data[point_id].y_value - cluster_center[center_id].y_value;
  distance_from_center[center_id] = sqrt(dx * dx + dy * dy);
}
 
//Fill distance_from_center[] with the distance from point 'point_id'
//to every cluster center.
void calcDistance2AllCenters(int point_id)
{
  int center_id;
  for( center_id = 0; center_id < cluster_count; center_id++ )
  {
    calcDistance2OneCenter(point_id, center_id);
  }
}
 
//Assign point 'point_id' to the cluster whose center is nearest
//(ties go to the lowest-numbered cluster, as before).
void partition4OnePoint(int point_id)
{
  int nearest = 0;
  int i;

  for( i = 1; i < cluster_count; i++ )
  {
    if( distance_from_center[i] < distance_from_center[nearest] )
    {
      nearest = i;
    }
  }

  data[point_id].cluster_id = cluster_center[nearest].cluster_id;
}

//One assignment pass: every unassigned point (cluster_id == -1) is placed
//into the cluster with the nearest center. Points that already carry a
//cluster id — the initial centers during round one — are left untouched.
void partition4AllPointOneCluster()
{
  int point_id;
  for( point_id = 0; point_id < data_size_total; point_id++ )
  {
    if( data[point_id].cluster_id == -1 )
    {
      calcDistance2AllCenters(point_id);  //distances from this point to all centers
      partition4OnePoint(point_id);       //pick the nearest one
    }
  }
}

//重新計算新的cluster center
void calcClusterCenter()
{
  int i;

  memset(center_calc, 0, sizeof(CenterCalc) * cluster_count);
  memset(data_size_per_cluster, 0, sizeof(int) * cluster_count);
  //分別對每個cluster內的每個點的X和Y求和,並計每個cluster內點的個數
  for( i = 0; i < data_size_total; i++ )
  {
    center_calc[data[i].cluster_id].x_value += data[i].x_value;
    center_calc[data[i].cluster_id].y_value += data[i].y_value;
    data_size_per_cluster[data[i].cluster_id]++;
  }
  //計算每個cluster內點的X和Y的均值作為center
  for( i = 0; i < cluster_count; i++ )
  {
     if(data_size_per_cluster[i] != 0) {
        center_calc[i].x_value = center_calc[i].x_value/ (double)(data_size_per_cluster[i]);
        center_calc[i].y_value = center_calc[i].y_value/ (double)(data_size_per_cluster[i]);

        printf(" cluster %d point cnt:%d\n", i, data_size_per_cluster[i]);
        printf(" cluster %d center: X:%f, Y:%f\n", i, center_calc[i].x_value, center_calc[i].y_value);
    }
    else
          printf(" cluster %d count is zero\n", i);
  }
 
  //比較新的和舊的cluster center值的差別。如果是相等的,則停止K-means演算法。
  compareNewOldClusterCenter(center_calc);
 
  //將新的cluster center的值放入cluster_center結構體中
  for( i = 0; i < cluster_count; i++ )
  {
    cluster_center[i].x_value = center_calc[i].x_value;
    cluster_center[i].y_value = center_calc[i].y_value;
    cluster_center[i].cluster_id = i;
  }

  //在重新計算了新的cluster center之後,要重新來為每一個Point進行聚類,所以data中用於表示聚類ID的cluster_id要都重新置為-1。
  for( i = 0; i < data_size_total; i++ )
  {
    data[i].cluster_id = -1;
  }
}
 
//Set is_continue to 1 if any center moved since the previous round, or 0 when
//all centers are identical. Exact double comparison is a valid convergence
//test here: an unchanged point assignment reproduces bitwise-identical means.
void compareNewOldClusterCenter(CenterCalc* new_center)
{
  int i;

  is_continue = 0;    //0 = converged, 1 = keep iterating
  for( i = 0; i < cluster_count; i++ )
  {
    int moved = (new_center[i].x_value != cluster_center[i].x_value)
             || (new_center[i].y_value != cluster_center[i].y_value);
    if( moved )
    {
      is_continue = 1;
      break;
    }
  }
}
 
//Run the k-means main loop: alternate the assignment pass and the center
//update until the centers stop moving or MAX_ROUNDS is reached.
void kmeans()
{
  int round;
  for( round = 1; round <= MAX_ROUNDS; round++ )
  {
    printf("\nRounds : %d             \n", round);
    partition4AllPointOneCluster();   //assign every point to its nearest center
    calcClusterCenter();              //recompute centers and set is_continue
    if( is_continue == 0 )
    {
       printf("\n after %d rounds, the classification is ok and can stop.\n", round);
       break;
    }
  }
}

編譯後生成可執行檔案kmeans,輸入的檔案裡共有6個點,分別為(0, 0), (4, 4), (4, 5), (0, 1), (3, 6) ,(4, 9),要求分成兩類。執行可執行程式後得到結果如下:

$ ./kmeans 6 data 2
After reading, point index:0, X:0, Y:0, cluster_id:-1
After reading, point index:1, X:4, Y:4, cluster_id:-1
After reading, point index:2, X:4, Y:5, cluster_id:-1
After reading, point index:3, X:0, Y:1, cluster_id:-1
After reading, point index:4, X:3, Y:6, cluster_id:-1
After reading, point index:5, X:4, Y:9, cluster_id:-1


cluster_id: 0, located in point index:3
cluster_id: 1, located in point index:1
cluster_id:0, index:3, x_value:0.000000, y_value:1.000000
cluster_id:1, index:1, x_value:4.000000, y_value:4.000000

Rounds : 1             
 cluster 0 point cnt:2
 cluster 0 center: X:0.000000, Y:0.500000
 cluster 1 point cnt:4
 cluster 1 center: X:3.750000, Y:6.000000

Rounds : 2             
 cluster 0 point cnt:2
 cluster 0 center: X:0.000000, Y:0.500000
 cluster 1 point cnt:4
 cluster 1 center: X:3.750000, Y:6.000000

 after 2 rounds, the classification is ok and can stop.


即兩輪後聚類就好了,(0, 0),(0, 1)一類,(4, 4), (4, 5), (3, 6) ,(4, 9)一類。