
[Translation] Statistical Modeling: The Two Cultures (Parts 4 and 5)

Please do not repost this translation without notifying me; plagiarism in particular will not be tolerated.

 

Abstract 

1. Introduction 

2. Road map

3. Projects in consulting

4. Return to the university

5. The use of data models

6. The limitations of data models

7. Algorithmic modeling

8. Rashomon and the multiplicity of good models

9. Occam and simplicity vs. accuracy

10. Bellman and the curse of dimensionality

11. Information from a black box

12. Final remarks

 


 

Statistical Modeling: The Two Cultures 

統計建模:兩種文化 (Statistical Modeling: The Two Cultures)

 

Leo Breiman

Professor, Department of Statistics, University of California, Berkeley, California

 

4. RETURN TO THE UNIVERSITY

I had one tip about what research in the university was like. A friend of mine, a prominent statistician from the Berkeley Statistics Department, visited me in Los Angeles in the late 1970s. After I described the decision tree method to him, his first question was, “What’s the model for the data?”

 

4. Return to the University

I had one hint of what research at the university was like. A friend of mine, a prominent statistician from the Berkeley Statistics Department, visited me in Los Angeles in the late 1970s. After I described the decision tree method to him, his first question was: “What’s the model for the data?”

 

4.1 Statistical Research


Upon my return, I started reading the Annals of Statistics, the flagship journal of theoretical statistics, and was bemused. Every article started with Assume that the data are generated by the following model: ...

followed by mathematics exploring inference, hypothesis testing and asymptotics. There is a wide spectrum of opinion regarding the usefulness of the theory published in the Annals of Statistics to the field of statistics as a science that deals with data. I am at the very low end of the spectrum. Still, there have been some gems that have combined nice theory and significant applications. An example is wavelet theory. Even in applications, data models are universal. For instance, in the Journal of the American Statistical Association (JASA), virtually every article contains a statement of the form: Assume that the data are generated by the following model: ...

I am deeply troubled by the current and past use of data models in applications, where quantitative conclusions are drawn and perhaps policy decisions made.
 

After I returned to the university, I started reading the Annals of Statistics, the flagship journal of theoretical statistics, and I was bemused. Every article started with “Assume that the data are generated by the following model: ...”, followed by mathematics exploring inference, hypothesis testing and asymptotics. There is a wide spectrum of opinion about how useful the theory published in the Annals of Statistics is to statistics as a science that deals with data; I am at the very low end of that spectrum. Still, there have been some gems that combine nice theory with significant applications, wavelet theory being one example. Even in applications, data models are universal. For instance, in the Journal of the American Statistical Association (JASA), virtually every article contains a statement of the form “Assume that the data are generated by the following model: ...”

I am deeply troubled by this current and past use of data models in applications where quantitative conclusions are drawn and perhaps policy decisions made.

[Translator's note: I have also seen a paper arguing that p-values are widely abused, which is roughly the same point. Of course everyone likes a good model with high accuracy, but many machine learning algorithms lack the kind of complete theory that statistics has, so it is unclear how to guarantee their stability; that is probably the biggest problem.]

 

5. THE USE OF DATA MODELS


Statisticians in applied research consider data modeling as the template for statistical analysis: Faced with an applied problem, think of a data model. This enterprise has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature. Then parameters are estimated and conclusions are drawn. But when a model is fit to data to draw quantitative conclusions:

• The conclusions are about the model’s mechanism, and not about nature’s mechanism. It follows that:
• If the model is a poor emulation of nature, the conclusions may be wrong.

 

5. The Use of Data Models

Statisticians in applied research treat data modeling as the template for statistical analysis: faced with an applied problem, think of a data model. At the heart of this enterprise is the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for the complex mechanism devised by nature. The parameters are then estimated and conclusions are drawn. But when a model is fit to data to draw quantitative conclusions:

  • The conclusions are about the model's mechanism, not about nature's mechanism. It follows that:
  • If the model is a poor emulation of nature, the conclusions may be wrong.

 

These truisms have often been ignored in the enthusiasm for fitting data models. A few decades ago, the commitment to data models was such that even simple precautions such as residual analysis or goodness-of-fit tests were not used. The belief in the infallibility of data models was almost religious. It is a strange phenomenon—once a model is made, then it becomes truth and the conclusions from it are infallible.

 

These truisms have often been ignored in the enthusiasm for fitting data models. A few decades ago, the commitment to data models was such that even simple precautions such as residual analysis or goodness-of-fit tests were not used. The belief in the infallibility of data models was almost religious. It is a strange phenomenon: once a model is made, it is taken to be true, and the conclusions drawn from it are treated as infallible.

 

5.1 An Example

I illustrate with a famous (also infamous) example: assume the data are generated by independent draws from the model

(R)    y = b0 + b1·x1 + ... + bM·xM + ε,

where the coefficients {bm} are to be estimated, ε is N(0, σ²) and σ² is to be estimated. Given that the data are generated this way, elegant tests of hypotheses, confidence intervals, distributions of the residual sum-of-squares and asymptotics can be derived. This made the model attractive in terms of the mathematics involved. This theory was used both by academic statisticians and others to derive significance levels for coefficients on the basis of model (R), with little consideration as to whether the data on hand could have been generated by a linear model. Hundreds, perhaps thousands of articles were published claiming proof of something or other because the coefficient was significant at the 5% level.

 

I will illustrate with a famous (also infamous) example: assume the data are generated by independent draws from model (R) displayed above, where the coefficients {bm} are to be estimated, ε follows a normal distribution with mean 0 and variance σ², and σ² is also to be estimated. Given that the data are generated this way, elegant tests of hypotheses, confidence intervals, distributions of the residual sum-of-squares and asymptotics can be derived, which makes the model attractive in terms of the mathematics involved. Academic statisticians and others used this theory to derive significance levels for coefficients on the basis of model (R), with little consideration of whether the data at hand could have been generated by a linear model. Hundreds, perhaps thousands, of articles were published claiming to have proved something or other because a coefficient was significant at the 5% level. [Translator's note: but does machine learning really get at the underlying mechanism?]
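[Translator's note: below is a minimal sketch of my own, in Python with numpy and statsmodels, of the workflow Breiman is criticizing: simulate data from a linear model of the form (R), fit it by least squares, and read off which coefficients are “significant at the 5% level”. All names and numbers are made up for illustration; this is not code from the paper.]

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, M = 200, 3
X = rng.normal(size=(n, M))
beta = np.array([2.0, 0.0, -1.0])                      # the toy "true" coefficients b1..bM
y = 1.0 + X @ beta + rng.normal(size=n)                # model (R): y = b0 + sum(bm*xm) + eps

res = sm.OLS(y, sm.add_constant(X)).fit()              # estimate b0..bM and sigma^2
print(res.params)                                      # estimated coefficients
print(res.pvalues < 0.05)                              # "significant at the 5% level" -- the step Breiman questions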

 

Goodness-of-fit was demonstrated mostly by giving the value of the multiple correlation coefficient R2 which was often closer to zero than one and which could be overinflated by the use of too many parameters. Besides computing R2, nothing else was done to see if the observational data could have been generated by model (R). For instance, a study was done several decades ago by a well-known member of a university statistics department to assess whether there was gender discrimination in the salaries of the faculty. All personnel files were examined and a data base set up which consisted of salary as the response variable and 25 other variables which characterized academic performance; that is, papers published, quality of journals published in, teaching record, evaluations, etc. Gender appears as a binary predictor variable.

Goodness of fit was demonstrated mostly by giving the value of the multiple correlation coefficient R², which was often closer to zero than to one and which could be overinflated by the use of too many parameters [Translator's note: overfitting?]. Besides computing R², nothing else was done to check whether the observed data could have been generated by model (R). As an example, several decades ago a well-known member of a university statistics department carried out a study to assess whether there was gender discrimination in faculty salaries. All personnel files were examined and a database was set up, with salary as the response variable and 25 other variables characterizing academic performance: papers published, quality of the journals published in, teaching record, evaluations, and so on. Gender entered as a binary predictor variable.
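[Translator's note: the small Python sketch below (my own, using numpy and statsmodels on made-up data) shows how in-sample R² can be inflated simply by adding parameters: even columns of pure noise push it upward.]

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=(n, 2))
y = x[:, 0] + rng.normal(size=n)                       # only the first predictor actually matters

for extra in (0, 10, 30):                              # add 0, 10 or 30 columns of pure noise
    X = np.hstack([x, rng.normal(size=(n, extra))])
    r2 = sm.OLS(y, sm.add_constant(X)).fit().rsquared
    print(f"{2 + extra:2d} predictors: in-sample R^2 = {r2:.3f}")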

 

A linear regression was carried out on the data and the gender coefficient was significant at the 5% level. That this was strong evidence of sex discrimination was accepted as gospel. The design of the study raises issues that enter before the consideration of a model—Can the data gathered answer the question posed? Is inference justified when your sample is the entire population? Should a data model be used? The deficiencies in analysis occurred because the focus was on the model and not on the problem.

A linear regression was carried out on the data, and the gender coefficient was significant at the 5% level. That this was strong evidence of sex discrimination was accepted as gospel. Yet the design of the study raises issues that come up before a model is even considered: can the data gathered answer the question posed? Is inference justified when the sample is the entire population? Should a data model be used at all? The deficiencies in the analysis occurred because the focus was on the model and not on the problem.

 

The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions. At the time, there were few objections from the statistical profession about the fairy-tale aspect of the procedure, but, hidden in an elementary textbook, Mosteller and Tukey (1977) discuss many of the fallacies possible in regression and write “The whole area of guided regression is fraught with intellectual, statistical, computational, and subject matter difficulties.”

The linear regression model led to many erroneous conclusions, published in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians would agree that this is a suspect way to arrive at conclusions. At the time, there were few objections from the statistical profession to the fairy-tale aspect of the procedure. But, hidden in an elementary textbook, Mosteller and Tukey (1977) discuss many of the possible fallacies of regression and write: “The whole area of guided regression is fraught with intellectual, statistical, computational, and subject matter difficulties.”

 

Even currently, there are only rare published critiques of the uncritical use of data models. One of the few is David Freedman, who examines the use of regression models (1994); the use of path models (1987) and data modeling (1991, 1995). The analysis in these papers is incisive.

Even now, published critiques of the uncritical use of data models are rare. One of the few critics is David Freedman, who examines the use of regression models (1994), of path models (1987), and data modeling in general (1991, 1995). The analysis in these papers is incisive.

 

5.2 Problems in Current Data Modeling


Current applied practice is to check the data model fit using goodness-of-fit tests and residual analysis. At one point, some years ago, I set up a simulated regression problem in seven dimensions with a controlled amount of nonlinearity. Standard tests of goodness-of-fit did not reject linearity until the nonlinearity was extreme. Recent theory supports this conclusion. Work by Bickel, Ritov and Stoker (2001) shows that goodness-of-fit tests have very little power unless the direction of the alternative is precisely specified. The implication is that omnibus goodness-of-fit tests, which test in many directions simultaneously, have little power, and will not reject until the lack of fit is extreme.

 

5.2 Problems in Current Data Modeling

Current applied practice is to check the fit of a data model using goodness-of-fit tests and residual analysis. At one point, some years ago, I set up a simulated regression problem in seven dimensions with a controlled amount of nonlinearity. Standard goodness-of-fit tests did not reject linearity until the nonlinearity was extreme. Recent theory supports this conclusion: work by Bickel, Ritov and Stoker (2001) shows that goodness-of-fit tests have very little power unless the direction of the alternative is precisely specified. The implication is that omnibus goodness-of-fit tests, which test in many directions simultaneously, have little power and will not reject until the lack of fit is extreme.
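[Translator's note: the Python sketch below is my own rough recreation of the kind of simulation Breiman describes, not his actual experiment: seven predictors, a tunable amount of nonlinearity, and a RESET-style omnibus F-test built from powers of the fitted values. Running it prints the p-value as the nonlinearity grows, so one can see how large the departure from linearity has to be before such an undirected test starts to reject.]

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n, p = 100, 7
X = rng.normal(size=(n, p))
Xc = sm.add_constant(X)

for c in (0.0, 0.2, 0.5, 1.0, 2.0):                    # amount of nonlinearity added to a linear model
    y = X.sum(axis=1) + c * X[:, 0] ** 2 + rng.normal(size=n)
    lin = sm.OLS(y, Xc).fit()                          # the (possibly misspecified) linear data model
    aug = sm.OLS(y, np.column_stack([Xc, lin.fittedvalues ** 2, lin.fittedvalues ** 3])).fit()
    df_full = n - Xc.shape[1] - 2                      # residual df of the augmented model
    F = ((lin.ssr - aug.ssr) / 2) / (aug.ssr / df_full)
    print(f"nonlinearity c = {c:3.1f}: omnibus p-value = {stats.f.sf(F, 2, df_full):.3f}")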

 

Furthermore, if the model is tinkered with on the basis of the data, that is, if variables are deleted or nonlinear combinations of the variables added, then goodness-of-fit tests are not applicable. Residual analysis is similarly unreliable. In a discussion after a presentation of residual analysis in a seminar at Berkeley in 1993, William Cleveland, one of the fathers of residual analysis, admitted that it could not uncover lack of fit in more than four to five dimensions. The papers I have read on using residual analysis to check lack of fit are confined to data sets with two or three variables.

Furthermore, if the model has been tinkered with on the basis of the data, that is, if variables have been deleted or nonlinear combinations of the variables added, then goodness-of-fit tests are no longer applicable. Residual analysis is similarly unreliable. In a discussion after a presentation on residual analysis at a Berkeley seminar in 1993, William Cleveland, one of the fathers of residual analysis, admitted that it could not uncover lack of fit in more than four to five dimensions. The papers I have read that use residual analysis to check lack of fit are confined to data sets with two or three variables. [Translator's note: what? Everything I was taught uses residual analysis even with a dozen or more variables.]

 

With higher dimensions, the interactions between the variables can produce passable residual plots for a variety of models. A residual plot is a goodness-of-fit test, and lacks power in more than a few dimensions. An acceptable residual plot does not imply that the model is a good fit to the data.

In higher dimensions, interactions between the variables can produce passable residual plots for a wide range of models. A residual plot is itself a goodness-of-fit test, and it loses power beyond a few dimensions. An acceptable residual plot does not mean the model is a good fit to the data.

 

There are a variety of ways of analyzing residuals. For instance, Landwehr, Pregibon and Shoemaker (1984, with discussion) give a detailed analysis of fitting a logistic model to a three-variable data set using various residual plots. But each of the four discussants presents other methods for the analysis. One is left with an unsettled sense about the arbitrariness of residual analysis.

There are many ways of analyzing residuals. For example, Landwehr, Pregibon and Shoemaker (1984, with discussion) give a detailed analysis, using various residual plots, of fitting a logistic model to a three-variable data set. But each of the four discussants presents other methods of analysis, and one is left with an unsettled sense of the arbitrariness of residual analysis.

 

Misleading conclusions may follow from data models that pass goodness-of-fit tests and residual checks. But published applications to data often show little care in checking model fit using these methods or any other. For instance, many of the current application articles in JASA that fit data models have very little discussion of how well their model fits the data. The question of how well the model fits the data is of secondary importance compared to the construction of an ingenious stochastic model.

Misleading conclusions can follow even from data models that pass goodness-of-fit tests and residual checks. Yet published applications often show little care in checking model fit by these or any other methods. For example, many of the current applied articles in JASA that fit data models contain very little discussion of how well the model fits the data; that question is treated as secondary to the construction of an ingenious stochastic model.

 

5.3 The Multiplicity of Data Models


One goal of statistics is to extract information from the data about the underlying mechanism producing the data. The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses. For instance, logistic regression in classification is frequently used because it produces a linear combination of the variables with weights that give an indication of the variable importance. The end result is a simple picture of how the prediction variables affect the response variable plus confidence intervals for the weights. Suppose two statisticians, each one with a different approach to data modeling, fit a model to the same data set. Assume also that each one applies standard goodness-of-fit tests, looks at residuals, etc., and is convinced that their model fits the data. Yet the two models give different pictures of nature’s mechanism and lead to different conclusions.

 

5.3 The Multiplicity of Data Models

One goal of statistics is to extract information from the data about the underlying mechanism that produced it. The greatest strength of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and the response [Translator's note: this was emphasized in my courses as well]. For example, logistic regression is frequently used in classification because it produces a linear combination of the variables whose weights give an indication of variable importance. The end result is a simple picture of how the predictor variables affect the response variable, plus confidence intervals for the weights. Now suppose two statisticians, each with a different approach to data modeling, fit a model to the same data set. Suppose also that each applies standard goodness-of-fit tests, looks at the residuals, and so on, and each is convinced that his model fits the data. Yet the two models give different pictures of nature's mechanism and lead to different conclusions.
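[Translator's note: a minimal Python sketch of my own (statsmodels, made-up data) of the “simple picture” being described: a logistic data model whose fitted weights, together with their confidence intervals, are read as indications of variable importance.]

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 400
X = rng.normal(size=(n, 3))
true_logit = 0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1]       # toy "true" mechanism; the third variable is irrelevant
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)      # the logistic data model
print(res.params)                                      # fitted weights: the "variable importance" picture
print(res.conf_int())                                  # confidence intervals for the weights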

 

McCullagh and Nelder (1989) write “Data will often point with almost equal emphasis on several possible models, and it is important that the statistician recognize and accept this.” Well said, but different models, all of them equally good, may give different pictures of the relation between the predictor and response variables. The question of which one most accurately reflects the data is difficult to resolve. One reason for this multiplicity is that goodness-of-fit tests and other methods for checking fit give a yes–no answer. With the lack of power of these tests with data having more than a small number of dimensions, there will be a large number of models whose fit is acceptable. There is no way, among the yes–no methods for gauging fit, of determining which is the better model. A few statisticians know this. Mountain and Hsiao (1989) write, “It is difficult to formulate a comprehensive model capable of encompassing all rival models. Furthermore, with the use of finite samples, there are dubious implications with regard to the validity and power of various encompassing tests that rely on asymptotic theory.”

 

McCullagh and Nelder (1989) write: “Data will often point with almost equal emphasis on several possible models, and it is important that the statistician recognize and accept this.” Well said, but different models, all equally good, may give different pictures of the relation between the predictor and response variables. Which one most accurately reflects the data is a question that is difficult to resolve. One reason for this multiplicity is that goodness-of-fit tests and other methods of checking fit give only a yes–no answer. Since these tests have little power once the data have more than a few dimensions, there will be a large number of models whose fit is acceptable, and among such yes–no methods of gauging fit there is no way to tell which model is better. A few statisticians are aware of this. Mountain and Hsiao (1989) write: “It is difficult to formulate a comprehensive model capable of encompassing all rival models. Furthermore, with the use of finite samples, there are dubious implications with regard to the validity and power of various encompassing tests that rely on asymptotic theory.”
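[Translator's note: a toy Python illustration of my own of the multiplicity problem: two strongly correlated predictors give two rival linear models that fit about equally well, yet each tells a different story about which variable matters.]

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
z = rng.normal(size=n)                                 # hidden common cause
x1 = z + 0.3 * rng.normal(size=n)
x2 = z + 0.3 * rng.normal(size=n)                      # x1 and x2 are strongly correlated
y = z + rng.normal(size=n)

for name, x in (("model A (y on x1)", x1), ("model B (y on x2)", x2)):
    res = sm.OLS(y, sm.add_constant(x)).fit()
    print(f"{name}: coefficient = {res.params[1]:.2f}, R^2 = {res.rsquared:.3f}")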

 

Data models in current use may have more damaging results than the publications in the social sciences based on a linear regression analysis. Just as the 5% level of significance became a de facto standard for publication, the Cox model for the analysis of survival times and logistic regression for survive–nonsurvive data have become the de facto standard for publication in medical journals. That different survival models, equally well fitting, could give different conclusions is not an issue.

Data models in current use may have more damaging consequences than the social-science publications based on linear regression analysis. Just as the 5% significance level became a de facto standard for publication, the Cox model for the analysis of survival times and logistic regression for survive–nonsurvive data have become the de facto standards for publication in medical journals. That different survival models, fitting equally well, could give different conclusions is simply not treated as an issue.

 

5.4 Predictive Accuracy


The most obvious way to see how well the model box emulates nature's box is this: put a case x down nature's box getting an output y. Similarly, put the same case x down the model box getting an output y′. The closeness of y and y′ is a measure of how good the emulation is. For a data model, this translates as: fit the parameters in your model by using the data, then, using the model, predict the data and see how good the prediction is.

 

5.4 Predictive Accuracy

The most obvious way to see whether a model emulates nature's box well is this: put a case x into nature's box and get an output y; put the same case x into the model box and get an output y′. The closer y′ is to y, the better the emulation. For a data model, this translates as: fit the parameters of the model using the data, then use the model to predict the data and see how good the predictions are.

 

Prediction is rarely perfect. There are usually many unmeasured variables whose effect is referred to as “noise.” But the extent to which the model box emulates nature’s box is a measure of how well our model can reproduce the natural phenomenon producing the data.

Prediction is rarely perfect. There are usually many unmeasured variables whose effect shows up as “noise.” But the extent to which the model box emulates nature's box is a measure of how well our model can reproduce the natural phenomenon that produced the data. [Translator's note: here it suddenly clicked for me: data science is, at bottom, still about the data. What we hope for from modeling is to approximate the mechanism that generated the data, not to build flashy, good-looking models; solving the problem is what actually matters.]
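[Translator's note: a minimal Python sketch of my own (numpy and scikit-learn, simulated data) of the comparison just described: hold out part of the data to stand in for fresh cases from nature's box, predict it with the fitted model, and measure how close the predictions y′ come to the observed y.]

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)             # fit the parameters on the training data
y_pred = model.predict(X_te)                           # y' from the model box on held-out cases
print("held-out mean squared error:", np.mean((y_te - y_pred) ** 2))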

 

McCullagh and Nelder (1989) in their book on generalized linear models also think the answer is obvious. They write, “At first sight it might seem as though a good model is one that fits the data very well; that is, one that makes μ̂ (the model predicted value) very close to y (the response value).” Then they go on to note that the extent of the agreement is biased by the number of parameters used in the model and so is not a satisfactory measure. They are, of course, right. If the model has too many parameters, then it may overfit the data and give a biased estimate of accuracy. But there are ways to remove the bias. To get a more unbiased estimate of predictive accuracy, cross-validation can be used, as advocated in an important early work by Stone (1974). If the data set is larger, put aside a test set.

In their book on generalized linear models, McCullagh and Nelder (1989) also address this question. They write: “At first sight it might seem as though a good model is one that fits the data very well; that is, one that makes μ̂ (the model predicted value) very close to y (the response value).” They then note that the extent of this agreement is biased by the number of parameters used in the model, and so is not a satisfactory measure. They are, of course, right: if a model has too many parameters it may overfit the data and give a biased estimate of accuracy. But there are ways to remove the bias. To get a more nearly unbiased estimate of predictive accuracy, cross-validation can be used, as advocated in an important early work by Stone (1974). If the data set is larger, set aside a test set. [Translator's note: I am not entirely sure about the translation of this last sentence.]
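[Translator's note: a short Python sketch of my own (scikit-learn, simulated data) of the cross-validation idea advocated by Stone (1974): estimate predictive accuracy by repeatedly fitting on part of the data and scoring on the held-out part, rather than reporting in-sample fit.]

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(size=500)

scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print("5-fold cross-validation estimate of prediction MSE:", -scores.mean())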

 

Mosteller and Tukey (1977) were early advocates of cross-validation. They write, “Cross-validation is a natural route to the indication of the quality of any data-derived quantity. ... We plan to cross-validate carefully wherever we can.”

Judging by the infrequency of estimates of predictive accuracy in JASA, this measure of model fit that seems natural to me (and to Mosteller and Tukey) is not natural to others. More publication of predictive accuracy estimates would establish standards for comparison of models, a practice that is common in machine learning.

Mosteller and Tukey (1977) were early advocates of cross-validation. They write: “Cross-validation is a natural route to the indication of the quality of any data-derived quantity. ... We plan to cross-validate carefully wherever we can.”

Judging by how rarely estimates of predictive accuracy appear in JASA, this measure of model fit, which seems natural to me (and to Mosteller and Tukey), is not natural to others. Publishing more estimates of predictive accuracy would establish standards for comparing models, a practice that is common in machine learning.