
Markov Blanket-Embedded Genetic Algorithm for Gene Selection

#Citation

##BibTeX

@article{ZHU20073236, title = "Markov blanket-embedded genetic algorithm for gene selection", journal = "Pattern Recognition", volume = "40", number = "11", pages = "3236 - 3248", year = "2007", issn = "0031-3203", doi = "https://doi.org/10.1016/j.patcog.2007.02.007", url = "http://www.sciencedirect.com/science/article/pii/S0031320307000945", author = "Zexuan Zhu and Yew-Soon Ong and Manoranjan Dash", keywords = "Microarray, Feature selection, Markov blanket, Genetic algorithm (GA), Memetic algorithm (MA)" }

##Normal

Zexuan Zhu, Yew-Soon Ong, Manoranjan Dash, Markov blanket-embedded genetic algorithm for gene selection, Pattern Recognition, Volume 40, Issue 11, 2007, Pages 3236-3248, ISSN 0031-3203, https://doi.org/10.1016/j.patcog.2007.02.007. (http://www.sciencedirect.com/science/article/pii/S0031320307000945) Keywords: Microarray; Feature selection; Markov blanket; Genetic algorithm (GA); Memetic algorithm (MA)

#Abstract

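Microarray technologies: the goal is to identify the smallest possible set of genes that still achieves good predictive performance.

Proposes a Markov blanket-embedded genetic algorithm (MBEGA) for the gene selection problem.

Irrelevant and redundant features are eliminated based on both the Markov blanket and predictive power in the classifier model.

Compared with filter, wrapper, and standard GA methods.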

evaluation criteria: classification accuracy, number of selected genes, computational cost, and robustness

#Main content


##Markov Blanket

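$F$: the set of all features. $C$: the class.

The Markov blanket of a feature $F_i$ is defined as follows:

Definition (Markov blanket). Let $M$ be a subset of features that does not contain $F_i$, i.e., $M \subseteq F$ and $F_i \notin M$. $M$ is a Markov blanket of $F_i$ if, given $M$, $F_i$ is conditionally independent of $\left( F \cup C \right) - M - \left\{ F_i \right\}$, i.e., $P \left( F - M - \left\{ F_i \right\}, C | F_i, M \right) = P \left( F - M - \left\{ F_i \right\}, C | M \right)$.

Given $X$, two attributes $A$ and $B$ are conditionally independent if $P \left( A | X, B \right) = P \left( A | X \right)$; that is, $B$ provides no information about $A$ beyond what $X$ already provides. If a feature $F_i$ has a Markov blanket $M$ within the currently selected feature subset, then $F_i$ provides no information about $C$ beyond what $M$ already provides, so $F_i$ can be safely removed. However, determining the conditional independence of features is usually very expensive computationally, so only a single feature is used to approximate the Markov blanket of $F_i$.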

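Definition (approximate Markov blanket). For two features $F_i$ and $F_j$ with $i \neq j$, $F_j$ can be regarded as an approximate Markov blanket of $F_i$ if $SU_{j,C} \geq SU_{i,C}$ and $SU_{i,j} \geq SU_{i,C}$, where the symmetrical uncertainty (SU) measures the correlation between features (including the class $C$) and is defined as:

$$SU \left( F_i, F_j \right) = 2 \left[ \frac{IG \left( F_i | F_j \right)}{H \left( F_i \right) + H \left( F_j \right)} \right]$$

$IG \left( F_i | F_j \right)$: the information gain between features $F_i$ and $F_j$.

$H \left( F_i \right)$ and $H \left( F_j \right)$: the entropies of features $F_i$ and $F_j$.

$SU_{i,C}$: the correlation between feature $F_i$ and the class $C$, called the C-correlation.

A feature is considered relevant if its C-correlation is higher than a user-given threshold $\gamma$, i.e., $SU_{i,C} > \gamma$.

A feature that has no approximate Markov blanket is a predominant feature.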
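The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming discretized feature values and the identity $IG(A|B) = H(A) + H(B) - H(A, B)$; the function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(a, b):
    """SU(A, B) = 2 * IG(A|B) / (H(A) + H(B)), with IG(A|B) = H(A) + H(B) - H(A, B)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a + h_b == 0:
        return 0.0  # both variables are constant; treated as uncorrelated (assumption)
    info_gain = h_a + h_b - entropy(list(zip(a, b)))
    return 2.0 * info_gain / (h_a + h_b)

def is_approx_markov_blanket(f_j, f_i, c):
    """True if F_j is an approximate Markov blanket of F_i for class C."""
    su_ic = symmetrical_uncertainty(f_i, c)
    return (symmetrical_uncertainty(f_j, c) >= su_ic
            and symmetrical_uncertainty(f_i, f_j) >= su_ic)
```

Continuous expression values would need to be discretized before calling these helpers; how the paper does that is not covered in these notes.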

##Markov Blanket-Embedded Genetic Algorithm


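If the fitness difference between two individuals is less than $\varepsilon$, the individual with fewer features is considered better.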
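The tie-breaking rule above can be written as a small comparison function. This sketch assumes that a higher fitness value is better (for example, classification accuracy); the function name and the default value of `eps` are illustrative, not values from the paper.

```python
def is_better(fitness_a, n_features_a, fitness_b, n_features_b, eps=0.01):
    """Return True if individual A should be preferred over individual B.

    Assumes higher fitness is better; eps=0.01 is a placeholder value.
    """
    if abs(fitness_a - fitness_b) < eps:
        # Fitness values are effectively tied: prefer the smaller feature subset.
        return n_features_a < n_features_b
    return fitness_a > fitness_b
```

With this rule, an individual with accuracy 0.900 and 10 genes is preferred over one with accuracy 0.905 and 20 genes when `eps` is 0.01.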

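Lamarckian learning: the locally improved individual is placed back into the population to compete for reproductive opportunities, which forces the genotype to reflect the improvement.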


$X$: the selected feature subset. $Y$: the excluded feature subset.


The C-correlation of each feature is computed only once.

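Search range $L$: defines the maximum number of $Add$ and $Del$ operations, which gives $L^2$ combinations of operations. The combinations are tried in random order until an improvement is obtained.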
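The search over combinations of operations can be sketched as follows. The exact $Add$ and $Del$ operators were given in figures that are missing from these notes, so the versions here are simplified stand-ins: Add moves the excluded features with the highest C-correlation into $X$, and Del moves the selected features with the lowest C-correlation out of $X$. The range 1..$L$ for each count, the strict-improvement test, and all names are assumptions.

```python
import itertools
import random

def local_search(selected, excluded, c_corr, fitness, search_range=4, rng=random):
    """Try the L*L (adds, dels) combinations in random order; stop at the first improvement.

    selected / excluded: sets of feature indices (X and Y in the notes).
    c_corr: dict feature -> C-correlation, computed once beforehand.
    fitness: callable taking a set of features, returning a score (higher is better).
    """
    best_score = fitness(selected)
    combos = list(itertools.product(range(1, search_range + 1), repeat=2))
    rng.shuffle(combos)
    for n_add, n_del in combos:
        x, y = set(selected), set(excluded)
        # Simplified Add (assumption): take the most class-correlated excluded features.
        to_add = sorted(y, key=c_corr.get, reverse=True)[:n_add]
        x.update(to_add)
        y.difference_update(to_add)
        # Simplified Del (assumption): drop the least class-correlated selected features.
        to_del = sorted(x, key=c_corr.get)[:n_del]
        x.difference_update(to_del)
        y.update(to_del)
        score = fitness(x)
        if score > best_score:
            # Lamarckian learning: the improved subset replaces the original individual.
            return x, y, score
    return set(selected), set(excluded), best_score
```

Because `c_corr` is passed in as a precomputed dictionary, the C-correlations are not recomputed inside the loop; only the fitness function is called once per candidate subset.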


Lamarckian learning process

This is followed by the usual evolutionary operations:

  1. linear ranking selection
  2. uniform crossover
  3. mutation operators with elitism
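The three operators listed above can be sketched on binary chromosomes (1 = gene selected). This is an illustration under assumptions: the binary encoding, the selection pressure `eta_plus`, and the mutation rate are placeholder choices, and the limit on the number of selected features per chromosome is not enforced here.

```python
import random

def linear_ranking_probs(fitnesses, eta_plus=1.5):
    """Selection probabilities that grow linearly with fitness rank (worst to best)."""
    n = len(fitnesses)
    if n == 1:
        return [1.0]
    eta_minus = 2.0 - eta_plus
    order = sorted(range(n), key=lambda i: fitnesses[i])  # worst first
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        probs[idx] = (eta_minus + (eta_plus - eta_minus) * rank / (n - 1)) / n
    return probs

def uniform_crossover(p1, p2, rng=random):
    """Each gene position is swapped between the two children with probability 0.5."""
    c1, c2 = list(p1), list(p2)
    for k in range(len(c1)):
        if rng.random() < 0.5:
            c1[k], c2[k] = c2[k], c1[k]
    return c1, c2

def mutate(chrom, rate=0.02, rng=random):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chrom]

def next_generation(pop, fitnesses, rng=random):
    """One generation: elitism, then ranking selection, crossover, and mutation."""
    best = max(range(len(pop)), key=lambda i: fitnesses[i])
    new_pop = [list(pop[best])]  # elitism: the best individual survives unchanged
    probs = linear_ranking_probs(fitnesses)
    while len(new_pop) < len(pop):
        p1, p2 = rng.choices(pop, weights=probs, k=2)
        for child in uniform_crossover(p1, p2, rng):
            if len(new_pop) < len(pop):
                new_pop.append(mutate(child, rng=rng))
    return new_pop
```

Ranking selection depends only on the order of the fitness values, not their magnitudes, which keeps the selection pressure stable when accuracies in the population are close together.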

##Experiments

MBEGA method

Methods considered for comparison:

  1. the FCBF (fast correlation-based filter)
  2. BIRS (best incremental ranked subset)
  3. standard GA feature selection algorithms

FCBF — a fast correlation based filter method

  1. selects a subset of relevant features whose C-correlations are larger than a given threshold $\gamma$
  2. sorts the relevant features in descending order in terms of C-correlation
  3. redundant features are eliminated one-by-one in a descending order

A feature is redundant only if it has an approximate Markov blanket

Result: a set of predominant features with zero redundant features in terms of C-correlation
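The FCBF steps above can be sketched as below. To keep the example self-contained, the C-correlations and the pairwise SU values are passed in rather than computed from data; the function signature is an assumption made for illustration.

```python
def fcbf(c_corr, su, gamma=0.0):
    """Return the predominant features in descending order of C-correlation.

    c_corr: dict feature -> SU between that feature and the class.
    su: callable (feature_i, feature_j) -> SU between the two features.
    """
    # Steps 1-2: keep the relevant features, sorted by descending C-correlation.
    ranked = sorted((f for f in c_corr if c_corr[f] > gamma),
                    key=c_corr.get, reverse=True)
    selected = []
    # Step 3: drop a feature when an already kept feature is an approximate
    # Markov blanket of it. Kept features rank higher, so SU_{j,C} >= SU_{i,C}
    # holds by construction and only SU_{i,j} >= SU_{i,C} needs checking.
    for f_i in ranked:
        if not any(su(f_i, f_j) >= c_corr[f_i] for f_j in selected):
            selected.append(f_i)
    return selected
```

A feature that has already been eliminated is never used to eliminate another one, which is why the check only runs against `selected`.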

BIRS: a scheme similar to FCBF, except that it evaluates the goodness of features using a classifier

  1. ranking the genes according to some measure of interest
  2. sequentially selects the ranked features one-by-one based on their incremental usefulness

The classifier is called as many times as there are features
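The incremental scheme can be sketched as a single pass over the ranked genes. This is a simplified illustration: a feature is kept when it raises the score by more than a hypothetical `min_gain`, which stands in for whatever acceptance test BIRS actually applies, and `evaluate` is a user-supplied callable (for example, cross-validated accuracy of a classifier).

```python
def incremental_ranked_subset(ranked_features, evaluate, min_gain=0.0):
    """Walk the ranking once, keeping each feature that improves the subset score.

    ranked_features: features sorted from most to least promising.
    evaluate: callable taking a list of features and returning a score
              (higher is better); it is called exactly once per feature.
    min_gain: hypothetical threshold for counting a feature as useful.
    """
    subset, best = [], float("-inf")
    for f in ranked_features:
        score = evaluate(subset + [f])  # one classifier call per ranked feature
        if score > best + min_gain:
            subset.append(f)
            best = score
    return subset
```

The loop makes the cost visible: one call to `evaluate` per ranked feature, regardless of how many features end up in the subset.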

$BIRS_F$ or $BIRS_W$: the ranking is based on C-correlation (i.e., symmetrical uncertainty between feature $F_i$ and the class $C$) or on individual predictive power, respectively

$BIRS_F$ takes less time

###Synthetic data

Ten 10-fold cross-validations with the C4.5 classifier

10 independent runs

The maximum number of selected features in each chromosome, m, is set to 50.

###Microarray data


The .632+ bootstrap


$K$ resamplings
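The formula used in the paper was in a figure that is missing from these notes. As a hedged illustration, the sketch below follows the standard .632+ estimator of Efron and Tibshirani, which combines the training (resubstitution) error, the out-of-bag error, and the no-information error rate. Treating `err_oob` as the out-of-bag error averaged over the $K$ resamplings is an assumption, and all function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_split(n, rng=random):
    """Draw n indices with replacement; indices never drawn form the out-of-bag set."""
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    oob = [i for i in range(n) if i not in drawn]
    return train, oob

def no_information_rate(y_true, y_pred):
    """gamma = sum over classes l of p_l * (1 - q_l), where p_l and q_l are the
    observed proportions of class l among the true and the predicted labels."""
    n = len(y_true)
    p, q = Counter(y_true), Counter(y_pred)
    return sum((p[c] / n) * (1.0 - q[c] / n) for c in p)

def bootstrap_632_plus(err_train, err_oob, gamma):
    """.632+ error estimate from training error, out-of-bag error, and gamma."""
    err_oob_c = min(err_oob, gamma)  # out-of-bag error capped at the no-information rate
    if gamma > err_train and err_oob_c > err_train:
        r = (err_oob_c - err_train) / (gamma - err_train)  # relative overfitting rate
    else:
        r = 0.0
    err_632 = 0.368 * err_train + 0.632 * err_oob
    return err_632 + (err_oob_c - err_train) * (0.368 * 0.632 * r) / (1.0 - 0.368 * r)
```

When there is no overfitting (`r` is 0) the estimate reduces to the plain .632 bootstrap; as `r` approaches 1 the weight shifts toward the out-of-bag error.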

the support vector machine (SVM) — microarray classification problems

one-versus-rest strategy — multi-class datasets

the linear kernel SVM
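The one-versus-rest strategy can be sketched independently of the underlying classifier. In this illustration `train_binary` is a hypothetical user-supplied callable that trains a binary model on labels +1/-1 and returns a scoring function (for a linear SVM, that would be its decision value); no particular SVM library is assumed.

```python
def train_one_vs_rest(samples, labels, train_binary):
    """Train one binary scorer per class, each separating that class from the rest."""
    models = {}
    for cls in sorted(set(labels)):
        binary_labels = [1 if y == cls else -1 for y in labels]
        models[cls] = train_binary(samples, binary_labels)
    return models

def predict_one_vs_rest(models, x):
    """Predict the class whose binary scorer returns the largest decision value."""
    return max(models, key=lambda cls: models[cls](x))
```

A dataset with $k$ classes therefore needs $k$ binary models, one per class.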
