
Implementing Bayesian Networks with the R bnlearn Package


1. Loading the packages and importing data

library(bnlearn)  # available on CRAN: install with install.packages("bnlearn"), or download the package and copy it into the library folder.

library(Rgraphviz)  # used for plotting. This package is not on CRAN; download it from Bioconductor at http://www.bioconductor.org/pack ... ws.html#___Software



data(learning.test)  # load the example data set; every variable in the data frame must be either a factor (discrete) or numeric (continuous).

lear.test = read.csv("***.csv", colClasses = "factor")  # data can also be imported directly from a csv file. Note that Boolean 0-1 variables or graded variables such as 1-3 must be explicitly declared as factors, otherwise many BN functions will fail: read.csv automatically converts only character columns to factors, nothing else.
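If in doubt, the column types can be checked and coerced before learning; a minimal sketch, using the lear.test data frame read above:

str(lear.test)                           # verify that every column is a factor
lear.test[] = lapply(lear.test, factor)  # coerce all columns to factors if any were missed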



The package covers three aspects of Bayesian networks: structure learning (via constraint-based, score-based and hybrid algorithms), parameter learning (via maximum likelihood and Bayesian estimators) and inference.

Constraint-based algorithms, also known as conditional independence learners, are all optimized derivatives of the Inductive Causation algorithm (Verma and Pearl, 1991). These algorithms use conditional independence tests to detect the Markov blankets of the variables, which in turn are used to compute the structure of the Bayesian network.

Score-based learning algorithms are general purpose heuristic optimization algorithms which rank network structures with respect to a goodness-of-fit score.

Hybrid algorithms combine aspects of both constraint-based and score-based algorithms, as they use conditional independence tests (usually to reduce the search space) and network scores (to find the optimal network in the reduced space) at the same time.

Several functions for parameter estimation, parametric inference, bootstrap, cross-validation and stochastic simulation are also available. Furthermore, advanced plotting capabilities are implemented on top of the Rgraphviz and lattice packages mentioned above.
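For the parameter learning step, bn.fit() supports both estimators; a minimal sketch on the learning.test data (the iss value below is an arbitrary choice):

res = hc(learning.test)                                             # learn a structure first
fit.mle = bn.fit(res, learning.test, method = "mle")                # maximum likelihood estimates
fit.bayes = bn.fit(res, learning.test, method = "bayes", iss = 10)  # Bayesian posterior estimates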

2. Constraint-based algorithms

The constraint-based algorithms available in bnlearn are gs, iamb, fast.iamb and inter.iamb. Calling them is simple: the function name with the data frame as its argument is enough. For structure learning you can also define blacklists and whitelists of arcs, introducing expert knowledge into the learning process (see the sketch after the basic call below).

res = gs(learning.test)
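Blacklists and whitelists are two-column (from, to) matrices of arcs; a minimal sketch with arbitrarily chosen arcs on learning.test:

bl = matrix(c("A", "B"), ncol = 2, dimnames = list(NULL, c("from", "to")))  # forbid the arc A -> B
wl = matrix(c("D", "E"), ncol = 2, dimnames = list(NULL, c("from", "to")))  # force the arc D -> E
res = gs(learning.test, blacklist = bl, whitelist = wl)
graphviz.plot(res)  # plot the learned network (requires Rgraphviz)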



Grow-Shrink (gs): based on the Grow-Shrink Markov blanket, the first (and simplest) Markov blanket detection algorithm (Margaritis, 2003) used in a structure learning algorithm.

Incremental Association (iamb): based on the Markov blanket detection algorithm of the same name (Tsamardinos et al., 2003), which is based on a two-phase selection scheme (a forward selection followed by an attempt to remove false positives).

Fast Incremental Association (fast.iamb): a variant of IAMB which uses speculative stepwise forward selection to reduce the number of conditional independence tests (Yaramakala and Margaritis, 2005).

Interleaved Incremental Association (inter.iamb): another variant of IAMB which uses forward stepwise selection (Tsamardinos et al., 2003) to avoid false positives in the Markov blanket detection phase.



The computational complexity of these algorithms is polynomial in the number of tests, usually O(N^2) (O(N^4) in the worst case), where N is the number of variables. Execution time scales linearly with the size of the data set.



Available (conditional) independence tests

The conditional independence tests used in constraint-based algorithms are, in practice, statistical tests on the data set. Available tests (and the respective labels) are:



discrete case (multinomial distribution)

mutual information: an information-theoretic distance measure. It is proportional to the log-likelihood ratio (they differ by a factor of 2n) and is related to the deviance of the tested models. The asymptotic chi-square test (mi), the Monte Carlo permutation test (mc-mi), the sequential Monte Carlo permutation test (smc-mi), and the semiparametric test (sp-mi) are implemented.

shrinkage estimator for the mutual information (mi-sh): an improved asymptotic chi-square test based on the James-Stein estimator for the mutual information.

Pearson's X^2: the classical Pearson's X^2 test for contingency tables. The asymptotic chi-square test (x2), the Monte Carlo permutation test (mc-x2), the sequential Monte Carlo permutation test (smc-x2) and the semiparametric test (sp-x2) are implemented.



continuous case (multivariate normal distribution)

linear correlation: the linear correlation coefficient. The exact Student's t test (cor), the Monte Carlo permutation test (mc-cor) and the sequential Monte Carlo permutation test (smc-cor) are implemented.

Fisher's Z: a transformation of the linear correlation with asymptotic normal distribution. Used by commercial software (such as TETRAD II) for the PC algorithm (an R implementation is present in the pcalg package on CRAN). The asymptotic normal test (zf), the Monte Carlo permutation test (mc-zf) and the sequential Monte Carlo permutation test (smc-zf) are implemented.

mutual information: an information-theoretic distance measure, as in the discrete case. Again it is proportional to the log-likelihood ratio (they differ by a factor of 2n). The asymptotic chi-square test (mi-g), the Monte Carlo permutation test (mc-mi-g) and the sequential Monte Carlo permutation test (smc-mi-g) are implemented.

shrinkage estimator for the mutual information (mi-g-sh): as in the discrete case, an improved asymptotic chi-square test based on the James-Stein estimator for the mutual information.
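These tests can also be run directly through ci.test; a minimal sketch on learning.test, with an arbitrarily chosen triple of variables:

ci.test("B", "F", "A", data = learning.test, test = "mi")     # is B independent of F given A? (asymptotic chi-square)
ci.test("B", "F", "A", data = learning.test, test = "mc-mi")  # the Monte Carlo permutation variant of the same test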



3. Score-based algorithms

The available score-based learning algorithms are:



Hill-Climbing (hc): a greedy hill climbing search on the space of directed graphs. The optimized implementation uses score caching, score decomposability and score equivalence to reduce the number of duplicated tests.

Tabu Search (tabu): a modified hill climbing able to escape local optima by selecting a network that minimally decreases the score function.



Random restart with a configurable number of perturbing operations is implemented for both algorithms.
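A minimal sketch of both searches on learning.test; the score, restart and perturb values below are arbitrary choices:

res = hc(learning.test, score = "bic", restart = 5, perturb = 5)  # 5 random restarts, 5 perturbing operations each
res.tabu = tabu(learning.test, score = "bic")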



Available network scores (and the respective labels) are:

discrete case (multinomial distribution)

the multinomial log-likelihood (loglik) score, which is equivalent to the entropy measure used in Weka.

the Akaike Information Criterion score (aic).

the Bayesian Information Criterion score (bic), which is equivalent to the Minimum Description Length (MDL) and is also known as the Schwarz Information Criterion.

the logarithm of the Bayesian Dirichlet equivalent score (bde), a score equivalent Dirichlet posterior density.

the logarithm of the modified Bayesian Dirichlet equivalent score (mbde) for mixtures of experimental and observational data (not score equivalent).

the logarithm of the K2 score (k2), a Dirichlet posterior density (not score equivalent; the K2 algorithm is also available in the MATLAB Bayes Net Toolbox).



continuous case (multivariate normal distribution)

the multivariate Gaussian log-likelihood (loglik-g) score.

the corresponding Akaike Information Criterion score (aic-g).

the corresponding Bayesian Information Criterion score (bic-g).

a score equivalent Gaussian posterior density (bge).
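A learned network can be scored against the data with the score() function; a minimal sketch (the iss value below is an arbitrary choice):

res = hc(learning.test)
score(res, learning.test, type = "bic")
score(res, learning.test, type = "bde", iss = 10)  # iss: imaginary sample size of the Dirichlet prior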



*** Note *****************

The log-likelihood of the model is the value that is maximized by the procedure that computes the maximum likelihood estimates of the parameters (the B_i).

The Deviance is equal to -2*log-likelihood.

Akaike's Information Criterion (AIC) is -2*log-likelihood + 2*k, where k is the number of estimated parameters.

The Bayesian Information Criterion (BIC) is -2*log-likelihood + k*log(n), where k is the number of estimated parameters and n is the sample size. The Bayesian Information Criterion is also known as the Schwarz criterion.

The Akaike Information Criterion (AIC), created and developed by the Japanese statistician Hirotugu Akaike, is a standard measure of the goodness of fit of a statistical model. It is built on the concept of entropy and weighs the complexity of the estimated model against how well the model fits the data.

In the general case AIC can be written as: AIC = (2k - 2L)/n

This is the per-observation form; multiplying through by n recovers the -2*log-likelihood + 2*k form given above. It assumes that the model errors are independent and normally distributed, where k is the number of parameters, L is the log-likelihood, and n is the number of observations.

The value of AIC depends on L and k: the smaller k is, the smaller AIC is, and the larger L is, the smaller AIC is. A small k means a parsimonious model, and a large L means an accurate one. AIC is therefore similar to the adjusted coefficient of determination, in that it weighs both parsimony and accuracy when evaluating a model.

Specifically, for a linear model with Gaussian errors, L = -(n/2)*ln(2*pi) - (n/2)*ln(sse/n) - n/2, where n is the sample size and sse is the residual sum of squares.

Increasing the number of free parameters improves the goodness of fit, so AIC rewards goodness of fit while penalizing overfitting: the preferred model is the one with the smallest AIC value. The AIC approach seeks the model that explains the data best with the fewest free parameters.
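A quick check of the formulas above in plain R, using the built-in cars data set and an arbitrarily chosen linear model:

fit = lm(dist ~ speed, data = cars)
k = length(coef(fit)) + 1                  # estimated parameters, counting the error variance
n = nrow(cars)
as.numeric(-2 * logLik(fit) + 2 * k)       # equals AIC(fit)
as.numeric(-2 * logLik(fit) + k * log(n))  # equals BIC(fit)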



4. Hybrid algorithms

The available hybrid learning algorithms are:



Max-Min Hill-Climbing (mmhc): a hybrid algorithm which combines the Max-Min Parents and Children algorithm (to restrict the search space) and the Hill-Climbing algorithm (to find the optimal network structure in the restricted space).

Restricted Maximization (rsmax2): a more general implementation of the Max-Min Hill-Climbing, which can use any combination of constraint-based and score-based algorithms.
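As the descriptions suggest, mmhc corresponds to rsmax2 with mmpc restricting the search space and hill climbing maximizing the score; a minimal sketch:

res1 = mmhc(learning.test)
res2 = rsmax2(learning.test, restrict = "mmpc", maximize = "hc")  # same combination as mmhc
compare(res1, res2)  # compare the two learned structures arc by arc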



5. Other algorithms

Other (constraint-based) local discovery algorithms

These algorithms learn the structure of the undirected graph underlying the Bayesian network, which is known as the skeleton of the network or the (partial) correlation graph. All the arcs are therefore undirected, and no attempt is made to detect their orientation. They are often used in hybrid learning algorithms. A minimal usage sketch follows the list.

Max-Min Parents and Children (mmpc): a forward selection technique for neighbourhood detection based on the maximization of the minimum association measure observed with any subset of the nodes selected in the previous iterations (Tsamardinos, Brown and Aliferis, 2006).

Hiton Parents and Children (si.hiton.pc): a fast forward selection technique for neighbourhood detection designed to exclude nodes early based on the marginal association. The implementation follows the Semi-Interleaved variant of the algorithm described in Aliferis et al. (2010).

Chow-Liu (chow.liu): an application of the minimum-weight spanning tree and the information inequality. It learns the tree structure closest to the true one in the probability space (Chow and Liu, 1968).

ARACNE (aracne): an improved version of the Chow-Liu algorithm that is able to learn polytrees (Margolin et al., 2006).
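A minimal usage sketch (all three functions return graphs whose arcs are undirected):

sk = mmpc(learning.test)          # skeleton via Max-Min Parents and Children
sk2 = si.hiton.pc(learning.test)  # skeleton via Semi-Interleaved HITON-PC
tr = chow.liu(learning.test)      # tree approximation via Chow-Liu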

6. Bayesian network classifiers

These algorithms are aimed at classification, and favour predictive power over the ability to recover the correct network structure. The implementation in bnlearn assumes that all variables, including the target variable, are discrete. A minimal usage sketch follows the list.

Naive Bayes (naive.bayes): a very simple algorithm that assumes the explanatory variables are independent of each other given the target variable, and classifies using the posterior probability of the target variable.

Tree-Augmented Naive Bayes (tree.bayes): an improvement over naive Bayes; this algorithm uses the Chow-Liu algorithm to approximate the dependence structure of the explanatory variables.
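A minimal classification sketch on learning.test, arbitrarily treating variable A as the class (this mirrors the usage documented in the package):

nb = naive.bayes(learning.test, training = "A")  # the remaining variables are the predictors
pred = predict(nb, learning.test)                # in-sample predictions, for illustration only
table(pred, learning.test$A)                     # confusion matrix against the true classes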

