[Translation] Statistical Modeling: The Two Cultures (Part 6)

Please do not repost this without notifying me, and above all do not plagiarize it.

 

Abstract 

1. Introduction 

2. Road map

3. Projects in consulting

4. Return to the university

5. The use of data models

6. The limitations of data models

7. Algorithmic modeling

8. Rashomon and the multiplicity of good models

9. Occam and simplicity vs. accuracy

10. Bellman and the curse of dimensionality

11. Information from a black box

12. Final remarks

 


 

Statistical Modeling: The Two Cultures 


 

Leo Breiman

Professor, Department of Statistics, University of California, Berkeley, California

 

6. THE LIMITATIONS OF DATA MODELS


With the insistence on data models, multivariate analysis tools in statistics are frozen at discriminant analysis and logistic regression in classification and multiple linear regression in regression. Nobody really believes that multivariate data is multivariate normal, but that data model occupies a large number of pages in every graduate textbook on multivariate statistical analysis.

6. The limitations of data models

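Because of the insistence on data models, the multivariate analysis tools of statistics have stalled at discriminant analysis and logistic regression for classification, and at multiple linear regression for regression. Nobody truly believes that multivariate data follow a multivariate normal distribution, yet that data model fills a large share of the pages in every graduate textbook on multivariate statistical analysis.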

 

With data gathered from uncontrolled observations on complex systems involving unknown physical, chemical, or biological mechanisms, the a priori assumption that nature would generate the data through a parametric model selected by the statistician can result in questionable conclusions that cannot be substantiated by appeal to goodness-of-fit tests and residual analysis. Usually, simple parametric models imposed on data generated by complex systems, for example, medical data, financial data, result in a loss of accuracy and information as compared to algorithmic models (see Section 11).

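When data come from uncontrolled observation of complex systems whose physical, chemical, or biological mechanisms are unknown, assuming in advance that nature generates the data through a parametric model chosen by the statistician can lead to questionable conclusions, ones that goodness-of-fit tests and residual analysis cannot back up. As a rule, forcing a simple parametric model onto data produced by a complex system (medical or financial data, for example) loses accuracy and information relative to algorithmic models (see Section 11).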
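As a rough illustration of the accuracy loss described above (this is not an example from Breiman's paper), the following Python sketch fits a straight-line data model and a simple algorithmic model to the same synthetic data and compares their test error. The data-generating function, the sample sizes, and the choice of k-nearest-neighbour averaging as the "algorithmic model" are all assumptions made for illustration.

```python
import numpy as np


def make_data(n, rng):
    """Hypothetical 'complex system': a nonlinear response plus noise."""
    x = rng.uniform(-3.0, 3.0, size=(n, 1))
    y = np.sin(2.0 * x[:, 0]) + 0.1 * rng.normal(size=n)
    return x, y


def fit_linear(x, y):
    """Data model: y = b0 + b1 * x + noise, fitted by least squares."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_linear(coef, x):
    """Predictions from the fitted straight line."""
    return np.column_stack([np.ones(len(x)), x]) @ coef


def predict_knn(x_train, y_train, x_new, k=5):
    """Algorithmic model: average the responses of the k nearest training points."""
    dist = np.abs(x_new[:, None, 0] - x_train[None, :, 0])
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


def mse(y, y_hat):
    """Mean squared prediction error."""
    return float(np.mean((y - y_hat) ** 2))


rng = np.random.default_rng(0)
x_train, y_train = make_data(500, rng)
x_test, y_test = make_data(500, rng)

coef = fit_linear(x_train, y_train)
mse_linear = mse(y_test, predict_linear(coef, x_test))
mse_knn = mse(y_test, predict_knn(x_train, y_train, x_test, k=5))

print(f"linear model test MSE: {mse_linear:.3f}")
print(f"k-NN model test MSE:   {mse_knn:.3f}")
```

On data like these, the straight line cannot follow the oscillating response, so its held-out error should be far larger than that of the nearest-neighbour average. The comparison says nothing about real medical or financial data; it only shows the kind of gap the paragraph has in mind.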

 

There is an old saying “If all a man has is a hammer, then every problem looks like a nail.” The trouble for statisticians is that recently some of the problems have stopped looking like nails. I conjecture that the result of hitting this wall is that more complicated data models are appearing in current published applications. Bayesian methods combined with Markov Chain Monte Carlo are cropping up all over. This may signify that as data becomes more complex, the data models become more cumbersome and are losing the advantage of presenting a simple and clear picture of nature’s mechanism.

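As the old saying goes, “If all a man has is a hammer, then every problem looks like a nail.” The trouble for statisticians is that some problems have recently stopped looking like nails. I suspect that running into this wall is the reason more complicated data models are showing up in currently published applications. Bayesian methods combined with Markov Chain Monte Carlo are appearing everywhere. This may mean that as data grow more complex, data models grow more cumbersome and lose their advantage of giving a simple, clear picture of nature’s mechanism.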

 

Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems. The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed.

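Approaching a problem by searching for a data model puts on an a priori straitjacket that limits statisticians’ ability to handle a wide range of statistical problems. The best available solution to a data problem may be a data model, or it may be an algorithmic model. The data and the problem should guide the solution. Solving a wider range of data problems requires a larger set of tools.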

 

Perhaps the damaging consequence of the insistence on data models is that statisticians have ruled themselves out of some of the most interesting and challenging statistical problems that have arisen out of the rapidly increasing ability of computers to store and manipulate data. These problems are increasingly present in many fields, both scientific and commercial, and solutions are being found by nonstatisticians.

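Perhaps the most damaging consequence of insisting on data models is that statisticians have shut themselves out of some of the most interesting and challenging statistical problems, problems that arise from computers’ rapidly growing ability to store and manipulate data. These problems appear more and more in many fields, scientific and commercial alike, and the solutions are being found by nonstatisticians.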