互相關資訊和歸一化互相關資訊

阿新 • • 發佈：2019-01-09

實驗室最近用到nmi( Normalized Mutual information )評價聚類效果,在網上找了一下這個演算法的實現,發現滿意的不多.

浙江大學蔡登教授有一個，http://www.zjucadcg.cn/dengcai/Data/code/MutualInfo.m ，他在資料探勘屆地位很高,他實現這個演算法的那篇論文引用率高達三位數。但這個實現，恕個人能力有限，我實在是沒有看懂:變數命名極為個性，看的如墜雲霧；程式碼倒數第二行作者自己添加註釋why complex，我就更不懂了；最要命的是使用他的函式MutualInfo(L1,L2)得到的結果不等於MutualInfo(L2,L1)，沒有對稱性！

還有個python的版本http://blog.sun.tc/2010/10/mutual-informationmi-and-normalized-mutual-informationnmi-for-numpy.html，這個感覺很靠譜，作者對nmi的理解和我是一樣的。

我的理解來自wiki和stanford，其實很簡單，先說一下問題：例如stanford中介紹的一個例子：

比如標準結果是圖中的叉叉點點圈圈，我的聚類結果是圖中標註的三個圈。

或者我的結果: A = [1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3];

標準的結果 : B = [1 2 1 1 1 1 1 2 2 2 2 3 1 1 3 3 3];

問題：衡量我的結果和標準結果有多大的區別，若我的結果和他的差不多，結果應該為1，若我做出來的結果很差，結果應趨近於0。

MI可以按照下面的公式計算。X=unique(A)=[1 2 3]，Y=unique(B)=[1 2 3];

分子p(x,y)為x和y的聯合分佈概率，

p(1,1)=5/17,p(1,2)=1/17,p(1,3)=0;

p(2,1)=1/17,p(2,2)=4/17,p(2,3)=1/17;

p(3,1)=2/17,p(3,2)=0,p(3,3)=3/17;

分母p(x)為x的概率函式，p(y)為y的概率函式，x和y分別來自於A和B中的分佈，所以即使x=y時，p(x)和p(y)也可能是不一樣的。

對p(x)： p(1)=6/17p(2)=6/17p(3)=5/17

對p(y)： p(1)=8/17p(2)=5/17P(3)=4/17

這樣就可以算出MI值了。

標準化互聚類資訊也很簡單，有幾個不同的版本，大體思想都是相同的，都是用熵做分母將MI值調整到0與1之間。一個比較多見的實現是下面所示：

H(X)和H(Y)分別為X和Y的熵，下面的公式中log的底b=2。

例如H(X) = -p(1)*log2(p(1)) - -p(2)*log2(p(2)) -p(3)*log2(p(3))。

matlab實現程式碼如下

function MIhat = nmi( A, B )

%NMI Normalized mutual information
% http://en.wikipedia.org/wiki/Mutual_information
% http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

% Author: http://www.cnblogs.com/ziqiao/ [2011/12/13]

if length( A ) ~= length( B)
    error('length( A ) must == length( B)');end
total = length(A);
A_ids = unique(A);
B_ids = unique(B);

% Mutual information
MI = 0;
for idA = A_ids
    for idB = B_ids
         idAOccur = find( A == idA );
         idBOccur = find( B == idB );
         idABOccur = intersect(idAOccur,idBOccur);

         px = length(idAOccur)/total;
         py = length(idBOccur)/total;
         pxy = length(idABOccur)/total;

         MI = MI + pxy*log2(pxy/(px*py)+eps); % eps : the smallest positive number

    end
end

% Normalized Mutual information
Hx = 0; % Entropies
for idA = A_ids
    idAOccurCount = length( find( A == idA ) );
    Hx = Hx - (idAOccurCount/total) * log2(idAOccurCount/total + eps);
end
Hy = 0; % Entropies
for idB = B_ids
    idBOccurCount = length( find( B == idB ) );
    Hy = Hy - (idBOccurCount/total) * log2(idBOccurCount/total + eps);
end

MIhat = 2 * MI / (Hx+Hy);
end

% Example :
% (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html)
% A = [1 1 1 1 1 1   2 2 2 2 2 2    3 3 3 3 3];
% B = [1 2 1 1 1 1   1 2 2 2 2 3    1 1 3 3 3];
% nmi(A,B)

% ans = 0.3646

為了節省執行時間，將for迴圈用矩陣運算代替，1百萬的資料量執行 1.795723second，上面的方法執行3.491053 second；

但是這種方法太佔記憶體空間，五百萬時，利用matlab2011版本的記憶體設定就顯示Out of memory了。

function MIhat = nmi( A, B )
%NMI Normalized mutual information
% http://en.wikipedia.org/wiki/Mutual_information
% http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
% Author: http://www.cnblogs.com/ziqiao/ [2011/12/15]

if length( A ) ~= length( B)
error('length( A ) must == length( B)');end
total = length(A);
A_ids = unique(A);
A_class = length(A_ids);
B_ids = unique(B);
B_class = length(B_ids);
% Mutual information
idAOccur = double (repmat( A, A_class, 1) == repmat( A_ids', 1, total ));idBOccur = double (repmat( B, B_class, 1) == repmat( B_ids', 1, total ));idABOccur = idAOccur * idBOccur';Px = sum(idAOccur') / total;Py = sum(idBOccur') / total;Pxy = idABOccur / total;
MImatrix = Pxy .* log2(Pxy ./(Px' * Py)+eps);MI = sum(MImatrix(:))
% Entropies
Hx = -sum(Px .* log2(Px + eps),2);
Hy = -sum(Py .* log2(Py + eps),2);
%Normalized Mutual information
MIhat = 2 * MI / (Hx+Hy);

% MIhat = MI / sqrt(Hx*Hy); another version of NMI

end

% Example :
% (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html)
% A = [1 1 1 1 1 1   2 2 2 2 2 2    3 3 3 3 3];
% B = [1 2 1 1 1 1   1 2 2 2 2 3    1 1 3 3 3];
% nmi(A,B)

% ans = 0.3646

參考： stanford的講解：http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

wiki百科的講解：http://en.wikipedia.org/wiki/Mutual_information

某作者的python的實現：http://blog.sun.tc/2010/10/mutual-informationmi-and-normalized-mutual-informationnmi-for-numpy.html

蔡登的matlab實現：http://www.zjucadcg.cn/dengcai/Data/code/MutualInfo.m

互相關資訊和歸一化互相關資訊

互相關資訊和歸一化互相關資訊

NCC(Normalized Cross Correlation)歸一化互相關原理和C++程式碼實現

歸一化互相關演算法

影象匹配 | NCC 歸一化互相關損失 | 程式碼 + 講解

第9章優化方法和歸一化

梯度下降、過擬合和歸一化

標準化和歸一化對機器學習經典模型的影響

資料特徵的標準化和歸一化你瞭解多少？

R語言中的scale函式及標準化和歸一化的理解

正則化和歸一化

歸一化(softmax)、資訊熵、交叉熵

為什麼要資料歸一化和歸一化方法為什麼要資料歸一化和歸一化方法

機器學習L1和L2正規化和歸一化

特徵的標準化和歸一化

標準化和歸一化

連續特徵離散化和歸一化

機器學習資料標準和歸一化

標準化，歸一化和的概念與適用範圍整理

批歸一化（Batch Normalization）、L1正則化和L2正則化

鏈碼的平移歸一化和旋轉歸一化

互相關資訊和歸一化互相關資訊

相關推薦