
Machine Learning Algorithm Principles Summary Series --- Algorithm Basics (13): Fuzzy C-means Clustering

I am studying for a master's degree at Chonnam National University in South Korea, and the FCM algorithm was the main content of Professor Lim's lectures this semester. He said he had just published a paper combining FCM with genetic algorithms and all sorts of EEG signal processing, building a computational-intelligence model for analysing the EEG signals of AD (Alzheimer's disease) patients. In short, difficult in every way.

I. Principle in Detail

The fuzzy c-means clustering algorithm, fuzzy c-means algorithm (FCMA), is also written FCM. Among the many fuzzy clustering algorithms, FCM is the most widely applied and most successful: by optimising an objective function it obtains each sample point's degree of membership in every cluster center, and uses these memberships to decide each point's cluster, thereby classifying the sample data automatically.

What is clustering?
Suppose the sample set is X = {x1, x2, …, xn}. Partition it into c fuzzy groups and find the cluster center cj (j = 1, 2, …, C) of each group such that the objective function is minimised.
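Concretely, the objective being minimised is the standard FCM criterion (the textbook formulation; u_{ij} denotes the membership of sample x_i in cluster j, and m > 1 is the fuzzifier):

J_m(U, C) = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^m \, \lVert x_i - c_j \rVert^2, \qquad \text{subject to } \sum_{j=1}^{c} u_{ij} = 1 \text{ for every } i.

Alternately minimising over U and over C yields the update rules that the code in Section II implements:

u_{ij} = 1 \Big/ \sum_{k=1}^{c} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{2/(m-1)}, \qquad c_j = \frac{\sum_{i=1}^{n} u_{ij}^m \, x_i}{\sum_{i=1}^{n} u_{ij}^m}.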

C-Means Clustering:
A fixed number of clusters.
One centroid per cluster.
Each data point belongs to the cluster with the nearest centroid.
Fuzzy C-Means Clustering relaxes this:
Clusters are fuzzy sets.
A point's membership degree in a cluster can be any number between 0 and 1.
All of a point's membership degrees must sum to 1 (see the small sketch after this list).
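A tiny illustration of the constraint (made-up numbers, using numpy): in hard clustering each row of the membership matrix U is one-hot, while in fuzzy clustering any non-negative row that sums to 1 is allowed.

import numpy as np

# Three points, two clusters; rows are points, columns are clusters.
hard_U = np.array([[1, 0],
                   [0, 1],
                   [1, 0]])            # k-means style: memberships are 0 or 1
fuzzy_U = np.array([[0.9, 0.1],
                    [0.3, 0.7],
                    [0.5, 0.5]])       # FCM style: any values in [0, 1]

# Both satisfy the sum-to-one constraint on every row.
assert np.allclose(hard_U.sum(axis=1), 1)
assert np.allclose(fuzzy_U.sum(axis=1), 1)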

The difference between K-means and Fuzzy C-means:

K-means clustering: a hard clustering algorithm whose memberships take only the two values 0 and 1; it is derived from the criterion of minimising the within-cluster sum of squared errors.
Fuzzy c-means clustering: a fuzzy clustering algorithm that generalises k-means; memberships may take any value in the interval [0, 1], and it is derived from the criterion of minimising the weighted within-cluster sum of squared errors.
Both methods iterate toward the final clustering, i.e. the cluster centers and the membership values. Neither can guarantee the optimal solution: both may converge to a local extremum, and fuzzy c-means may even stop at a saddle point.
Between K-means and C-means there is a sense in which C-means is contained in K-means: C-means is one particular realisation, while K-means is the broader concept.
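To make the "weighted" criterion concrete, here is a small sketch of the FCM membership rule (synthetic numbers; fcm_membership is my own helper, not from the course code). It shows how the fuzzifier m interpolates between nearly hard, k-means-like assignments and almost uniform memberships:

import numpy as np

def fcm_membership(x, centers, m):
    # Squared Euclidean distances from x to every center.
    d2 = np.sum((x - centers) ** 2, axis=1)
    # frac[j, k] = (d2[j] / d2[k]) ** (1 / (m - 1)); row sums give 1 / u_j.
    frac = (d2[:, None] / d2[None, :]) ** (1.0 / (m - 1.0))
    return 1.0 / frac.sum(axis=1)

x = np.array([1.0, 0.0])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])
print(fcm_membership(x, centers, m=1.1))  # ~[1.0, 0.0]: nearly hard
print(fcm_membership(x, centers, m=2.0))  # [0.8, 0.2]: a soft split
print(fcm_membership(x, centers, m=5.0))  # ~[0.59, 0.41]: nearly uniform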

II. Code Implementation

The midterm assignment was to build an unsupervised Fuzzy c-means model on the iris dataset: implement the algorithm and measure its classification accuracy.
Dataset: the UCI Iris dataset (150 samples, 4 numeric features, and 3 species with 50 samples each; the species label is the last column of each line).
fuzzy_model.py

from numpy import dot, array, sum, zeros, outer, any


# Fuzzy C-Means class
class FuzzyCMeans(object):
    """
    Fuzzy C-Means convergence.

    Use this class to instantiate a fuzzy c-means object. The object must be
    given a training set and initial conditions. The training set is a list or
    an array of N-dimensional vectors; the initial conditions are a list of the
    initial membership values for every vector in the training set -- thus, the
    length of both lists must be the same. The number of columns in the initial
    conditions must equal the number of classes; that is, if you are
    classifying into ``C`` classes, the initial conditions must have ``C``
    columns.

    There are restrictions on the initial conditions: first, no column can be
    all zeros or all ones -- if that happens, the class described by that
    column is unnecessary; second, the memberships of every example must sum to
    one -- that is, the values across the columns of each row must sum to one.
    This means that the initial condition is a fuzzy partition into ``C``
    subsets.

    """
    def __init__(self, training_set, initial_conditions, m=2.):
        """
        Initializes the algorithm.

        :Parameters:
          training_set
            A list or array of vectors containing the data to be classified.
            Each of the vectors in this list *must* have the same dimension, or
            the algorithm won't behave correctly. Notice that each vector can
            be given as a tuple -- internally, everything is converted to
            arrays.
          initial_conditions
            A list or array of vectors containing the initial membership values
            associated to each example in the training set. Each column of this
            array contains the membership assigned to the corresponding class
            for that vector. Notice that each vector can be given as a tuple --
            internally, everything is converted to arrays.
          m
            This is the aggregation value. The bigger it is, the smoother the
            classification will be. Please consult the bibliography on the
            subject. ``m`` must be bigger than 1. Its default value is 2.
        """
        self.__x = array(training_set)
        self.__mu = array(initial_conditions)
        self.m = m
        '''The fuzziness coefficient. Must be bigger than 1; the closer it is
        to 1, the smoother the membership curves will be.'''
        self.__c = self.centers()

    def __getc(self):
        return self.__c

    def __setc(self, c):
        self.__c = array(c).reshape(self.__c.shape)

    c = property(__getc, __setc)
    '''A ``numpy`` array containing the centers of the classes in the
    algorithm. Each line represents a center, and the number of lines is the
    number of classes. This property is read and write, but care must be taken
    when setting new centers: if the dimensions are not exactly the same as
    given in the instantiation of the class (*ie*, *C* centers of dimension
    *N*), an exception will be raised.'''

    def __getmu(self):
        return self.__mu

    mu = property(__getmu, None)
    '''The membership values for every vector in the training set. This
    property is modified at each step of the execution of the algorithm. This
    property is not writable.'''

    def __getx(self):
        return self.__x

    x = property(__getx, None)
    '''The vectors on which the algorithm bases its convergence. This property
    is not writable.'''

    def centers(self):
        """
        Given the present state of the algorithm, recalculates the centers,
        that is, the position of the vectors representing each of the classes.
        Notice that this method modifies the state of the algorithm if any
        change was made to any parameter. This method receives no arguments and
        will seldom be used externally. It can be useful if you want to step
        through the algorithm. *This method has a collateral effect!* If you
        use it, the ``c`` property (see above) will be modified.

        :Returns:
          A vector containing, in each line, the position of the centers of the
          algorithm.
        """
        mm = self.__mu ** self.m
        # Each center is the mean of the samples weighted by u ** m.
        c = dot(self.__x.T, mm) / sum(mm, axis=0)
        self.__c = c.T
        return self.__c

    def membership(self):
        """
        Given the present state of the algorithm, recalculates the membership
        of each example in each class. That is, it modifies the initial
        conditions to represent an evolved state of the algorithm. Notice that
        this method modifies the state of the algorithm if any change was made
        to any parameter.

        :Returns:
          A vector containing, in each line, the membership of the
          corresponding example in each class.
        """
        x = self.__x
        c = self.__c
        M, _ = x.shape
        C, _ = c.shape
        r = zeros((M, C))
        m1 = 1. / (self.m - 1.)
        for k in range(M):
            # Squared Euclidean distances from example k to every center.
            den = sum((x[k] - c) ** 2., axis=1)
            if any(den == 0):
                # Example coincides with a center: keep the current memberships.
                return self.__mu
            frac = outer(den, 1. / den) ** m1
            r[k, :] = 1. / sum(frac, axis=1)
        self.__mu = r
        return self.__mu

    def step(self):
        """
        This method runs one step of the algorithm. It might be useful to track
        the changes in the parameters.

        :Returns:
          The norm of the change in the membership values of the examples. It
          can be used to track convergence and as an estimate of the error.
        """
        old = self.__mu
        self.membership()
        self.centers()
        # Squared Frobenius norm of the change in the membership matrix.
        return sum((self.__mu - old) ** 2.)

    def __call__(self, emax=1.e-10, imax=20):
        """
        The ``__call__`` interface is used to run the algorithm until
        convergence is found.

        :Parameters:
          emax
            Specifies the maximum error admitted in the execution of the
            algorithm. It defaults to 1.e-10. The error is tracked according to
            the norm returned by the ``step()`` method.
          imax
            Specifies the maximum number of iterations admitted in the
            execution of the algorithm. It defaults to 20.

        :Returns:
          An array containing, at each line, the vectors representing the
          centers of the clustered regions.
        """
        error = 1.
        i = 0
        while error > emax and i < imax:
            error = self.step()
            i = i + 1
        return self.c
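
A quick smoke test of the class (my own sketch, not part of the assignment; it assumes the file above is saved as fuzzy_model.py and that numpy is installed): on two well-separated Gaussian blobs the centers should land near (0, 0) and (8, 8).

smoke_test.py

import numpy as np
from fuzzy_model import FuzzyCMeans

rng = np.random.RandomState(0)
data = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 8.0])

# Random row-stochastic initial memberships (each row sums to 1).
u0 = rng.rand(40, 2)
u0 = u0 / u0.sum(axis=1, keepdims=True)

fcm = FuzzyCMeans(data, u0, m=2.0)
print(fcm(emax=1e-8, imax=100))  # the two cluster centers
print(fcm.mu.argmax(axis=1))     # hardened cluster assignments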

main.py

# -*- coding:utf-8 -*-
import random
from fuzzy_model import FuzzyCMeans

# Upper bound for the random integers drawn when randomising U.
MAX = 10000.0


def load_data(file):
    data = []
    cluster_location = []
    with open(str(file), 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            current = line.split(",")
            # All columns except the last are numeric features.
            current_dummy = [float(value) for value in current[:-1]]
            label = current[-1]
            if label == "Iris-setosa":
                cluster_location.append(0)
            elif label == "Iris-versicolor":
                cluster_location.append(1)
            else:
                cluster_location.append(2)
            data.append(current_dummy)
    return data, cluster_location


def randomise_data(data):
    """
    This function randomises the data,
    and also keeps record of the order of randomisation.
    """
    order = list(range(0, len(data)))
    random.shuffle(order)
    new_data = [[] for _ in range(0, len(data))]
    for index in range(0, len(order)):
        new_data[index] = data[order[index]]
    return new_data, order


def initialise_U(data, cluster_number):
    """
    This function randomises U such that the rows add up to 1.
    It requires a global MAX.
    """
    global MAX
    U = []
    for i in range(0, len(data)):
        current = []
        rand_sum = 0.0
        for j in range(0, cluster_number):
            dummy = random.randint(1, int(MAX))
            current.append(dummy)
            rand_sum += dummy
        for j in range(0, cluster_number):
            current[j] = current[j] / rand_sum
        U.append(current)
    return U
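

def initialise_U_np(data, cluster_number):
    # Optional sketch (my addition, not part of the original assignment):
    # a flat Dirichlet draw yields rows that sum to exactly 1 by construction,
    # replacing the integer-and-normalise loop above. Assumes numpy is installed.
    import numpy as np
    return np.random.dirichlet(np.ones(cluster_number), size=len(data))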


def normalise_U(U):
    """
    This de-fuzzifies U at the end of the clustering: each point is
    assigned to the cluster in which its membership is maximal.
    """
    for i in range(0, len(U)):
        maximum = max(U[i])
        for j in range(0, len(U[0])):
            if U[i][j] != maximum:
                U[i][j] = 0
            else:
                U[i][j] = 1
    return U
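

def normalise_U_np(U):
    # Vectorised sketch of the same de-fuzzification (my addition): one-hot
    # encode the largest membership in each row. Unlike the loop above, ties
    # are resolved in favour of the first maximum.
    import numpy as np
    U = np.asarray(U)
    return np.eye(U.shape[1], dtype=int)[np.argmax(U, axis=1)]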


def de_randomise_data(data, order):
    """
    This function restores the original order of the data;
    pass the order list returned by randomise_data() as the argument.
    """
    new_data = [[] for i in range(0, len(data))]
    for index in range(len(order)):
        new_data[order[index]] = data[index]
    return new_data
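

def randomise_data_np(data):
    # Sketch of an equivalent shuffle using a numpy permutation (my addition):
    # shuffled item i is data[order[i]], the same convention as randomise_data
    # above, so de_randomise_data inverts it unchanged.
    import numpy as np
    order = np.random.permutation(len(data))
    return [data[i] for i in order], list(order)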


def checker_iris(final_location):
    """
    Finds the percentage of points matching the real clustering.
    Assumes the Iris file lists 50 points per species, in order.
    """
    right = 0.0
    for k in range(0, 3):
        checker = [0, 0, 0]
        for i in range(0, 50):
            for j in range(0, len(final_location[0])):
                if final_location[i + (50 * k)][j] == 1:
                    checker[j] += 1
        right += max(checker)
    answer = right / 150 * 100
    return str(answer) + " % accuracy"
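

def checker_ari(final_location, true_labels):
    # Optional sketch (assumes scikit-learn is installed): the adjusted Rand
    # index scores the clustering against true_labels (the cluster_location
    # list from load_data) without assuming any cluster ordering, unlike the
    # positional 50-per-species check in checker_iris above.
    import numpy as np
    from sklearn.metrics import adjusted_rand_score
    predicted = np.argmax(np.asarray(final_location), axis=1)
    return adjusted_rand_score(true_labels, predicted)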


if __name__ == '__main__':
    data, cluster_location = load_data("iris.txt")
    # print(data)
    data, order = randomise_data(data)
    # print("========")
    # print(cluster_location)
    initU = initialise_U(data, 3)
    # print('init membership matrix')
    # print(initU)
    # The fuzzifier m controls how soft the memberships are; it must be > 1.
    m = 2.0
    fcm = FuzzyCMeans(data, initU, m)
    print('Optimal clustering centers')
    # emax=0 forces the full imax (default 20) iterations.
    print(fcm(emax=0))
    print('Optimal membership matrix')
    print(fcm.mu)
    nu = normalise_U(fcm.mu)
    final_location = de_randomise_data(nu, order)
    print(checker_iris(final_location))

Just yesterday I heard someone in a video say that supervised deep networks are extremely hot right now and everyone wants to achieve something there, while unsupervised learning gets little attention and little research because its classification performance is not particularly good. But think about it: from birth we grow from nothing to something, from small to big, and we learn most things without any training labels; that is a kind of unsupervised learning. So this direction still holds great research value.