Bayesian generalized linear model (GLM) | 貝葉斯廣義線性回歸實例

阿新 • • 發佈：2018-04-05

gamma tail merge detailed 變量 clas under acc sig

學習GLM的時候在網上找不到比較通俗易懂的教程。這裏以一個實例應用來介紹GLM。

We used a Bayesian generalized linear model (GLM) to assign every gene to one or more cell populations, as previously described (Zeisel et al., 2015).

在單細胞RNA-seq的分析中，可以用GLM來尋找marker。

貝葉斯 + 廣義 + 線性回歸

線性回歸：這個最基礎，大部分人應該都知道。為什麽找marker問題可以轉化為線性回歸問題？我們可以把每一個基因的表達當作自變量，把最終的類別作為因變量，擬合線性模型，然後根據系數的分布來得到marker。

廣義：因變量（響應變量）可以服從多種分布（思考：為什麽下文要用負二項分布）；

貝葉斯：是一種新的思維方式，所有的系數都有自己的分布。

The GLM models the measured gene expression of a cell as realizations of a Negative Binomial probability distribution whose mean is determined by a linear combination of K predictors xi with coefficient bi.

技術分享圖片

For each cell, the outcome and predictors are known and the aim is to determine the posterior probability distributions of the coefficients.

As predictors, we use a continuous Baseline predictor and a categorical Cell Type predictor. The Baseline predictor value is the cell’s molecule count normalized to the average molecule count of all cells and takes account of the fact that we expect every gene to have a baseline expression proportional to the total number of expressed molecules within a particular cell. While the Cell Type predictor is set to 1 for the cluster BackSPIN assignation of the cell, and 0 for the other classes. From the definition of the model it follows that the coefficient bk for a Cell Type predictor xk can be interpreted as the additional number of molecules of a particular gene that are present as a result of the cell being of cell type k. A more detailed description of the model, including explanation of the prior probabilities used for the fitting as well as the full source code of the model, is provided elsewhere (Zeisel et al., 2015). The Stan (http://mc-stan.org) source is copied below for completeness:

data {
int < lower = 0 > N; # number of outcomes
int < lower = 0 > K; # number of predictors
matrix < lower = 0 > [N,K] x; # predictor matrix
int y[N]; # outcomes
}
parameters {
vector < lower = 1 > [K] beta; # coefficients
real < lower = 0.001 > r; # overdispersion
}
model {
vector < lower = 0.001 > [N] mu;
vector < lower = 1.001 > [N] rv;
# priors
r !cauchy(0, 1);
beta !pareto(1, 1.5);
# vectorize the overdispersion
for (n in 1:N) {
rv[n] < - square(r + 1) - 1;
}
# regression
mu < - x * (beta - 1) + 0.001;
y !neg_binomial(mu ./ rv, 1 / rv[1]);
}

To determine which genes are higher than basal expression in each population we compared the posterior probability distributions of the Baseline coefficient and the Cell Type coefficient. A gene was considered as marking a cell population if (1) its cell-typespecific coefficient exceeded the Baseline coefficient with 99.8% (95% for the mouse adult) posterior probability, and (2) the median of its posterior distribution did not fall below a threshold q set to 35% of the median posterior probability of the highest expressing group, and (3) the median of the highest-expressing cell type was greater than 0.4. For every gene this corresponds to a binary pattern (0 if the conditions are not met and 1 if they are), and genes can therefore be grouped according to their binarized expression patterns.

We use those binarized patterns to call transcription factor specificity. Our definition of a transcription factor gene was based of annotations provided by the merged annotation of PANTHER GO (Mi et al., 2013) and FANTOM5 (Okazaki et al., 2002), this list was further curated and missing genes and occasional misannotations corrected.

The feature selection procedure is based on the largest difference between the observed coefficient of variation (CV) and the predicted CV (estimated by a non-linear noise model learned from the data) See Figure S1C. In particular, Support Vector Regression (SVR, Smola and Vapnik, 1997) was used for this purpose (scikit-learn python implementation, default parameters with gamma = 0.06; Pedregosa et al., 2011).

特征選取：尋找觀察CV值和預測CV值之間的最大差異。

SVR支持向量回歸

Similarities between clusters within a species were summarized using a Pearson’s correlation coefficient calculated on the binarized matrix (Figures 1C and 1D).

參考：從線性模型到廣義線性模型（1）——模型假設篇

Bayesian generalized linear model (GLM) | 貝葉斯廣義線性回歸實例

gamma tail merge detailed 變量 clas under acc sig 學習GLM的時候在網上找不到比較通俗易懂的教程。這裏以一個實例應用來介紹GLM。 We used a Bayesian generalized linear model

Bayesian generalized linear model (GLM) | 貝葉斯廣義線性回歸實例

貝葉斯 + 廣義 + 線性回歸

Bayesian generalized linear model (GLM) | 貝葉斯廣義線性回歸實例

2.常用演算法（推導）演算法分類，演算法原理，演算法設計，推導---SVM，DTree，樸素貝葉斯，線性迴歸等；

貝葉斯線性迴歸（Bayesian Linear Regression）

貝葉斯層次型模型引數估計 Bayesian hierarchical model parameter estimation with Stan

樸素貝葉斯（Naive Bayesian）

聯合概率與聯合分佈、條件概率與條件分佈、邊緣概率與邊緣分佈、貝葉斯定理、生成模型（Generative Model）和判別模型（Discriminative Model）的區別

貝葉斯網路Bayesian Network (樸素貝葉斯，Naive )

貝葉斯網路Bayesian network基礎理論及演算法介紹

貝葉斯優化(Bayesian Optimization)深入理解

貝葉斯網路(Bayesian Network)

[PRML] Bayesian Learning 貝葉斯學習方法

Bayesian（貝葉斯）

概率圖模型（PGM）：貝葉斯網（Bayesian network）初探

貝葉斯算法的基本原理和算法實現

我對貝葉斯分類器的理解

<Machine Learning in Action >之二樸素貝葉斯 C#實現文章分類

（筆記）斯坦福機器學習第六講--樸素貝葉斯

基於的樸素貝葉斯的文本分類（附完整代碼(spark/java）

樸素貝葉斯分類算法

Knowledge Tracing -- 基於貝葉斯的學生知識點追蹤（BKT）

Bayesian generalized linear model (GLM) | 貝葉斯廣義線性回歸實例

貝葉斯 + 廣義 + 線性回歸

相關推薦