
[Translation] Statistical Modeling: The Two Cultures (Part 3)

Please do not repost without notifying me first; plagiarism in particular is unacceptable.

 

Abstract 

1. Introduction 

2. Road map

3. Projects in consulting

4. Return to the university

5. The use of data models

6. The limitations of data models

7. Algorithmic modeling

8. Rashomon and the multiplicity of good models

9. Occam and simplicity vs. accuracy

10. Bellman and the curse of dimensionality

11. Information from a black box

12. Final remarks

 


 

Statistical Modeling: The Two Cultures 


 

Leo Breiman

Professor, Department of Statistics, University of California, Berkeley, California

 

3. PROJECTS IN CONSULTING

As a consultant I designed and helped supervise surveys for the Environmental Protection Agency (EPA) and the state and federal court systems. Controlled experiments were designed for the EPA, and I analyzed traffic data for the U.S. Department of Transportation and the California Transportation Department. Most of all, I worked on a diverse set of prediction projects. Here are some examples:

Predicting next-day ozone levels.

Using mass spectra to identify halogen-containing compounds.

Predicting the class of a ship from high altitude radar returns.

Using sonar returns to predict the class of a submarine.

Identity of hand-sent Morse Code.

Toxicity of chemicals.

On-line prediction of the cause of a freeway traffic breakdown.

Speech recognition.

The sources of delay in criminal trials in state court systems.

To understand the nature of these problems and the approaches taken to solve them, I give a fuller description of the first two on the list.

 


3.1 The Ozone Project

In the mid-to-late 1960s ozone levels became a serious health problem in the Los Angeles Basin. Three different alert levels were established. At the highest, all government workers were directed not to drive to work, children were kept off playgrounds and outdoor exercise was discouraged.

The major source of ozone at that time was automobile tailpipe emissions. These rose into the low atmosphere and were trapped there by an inversion layer. A complex chemical reaction, aided by sunlight, cooked away and produced ozone two to three hours after the morning commute hours. The alert warnings were issued in the morning, but would be more effective if they could be issued 12 hours in advance. In the mid-1970s, the EPA funded a large effort to see if ozone levels could be accurately predicted 12 hours in advance.

Commuting patterns in the Los Angeles Basin are regular, with the total variation in any given daylight hour varying only a few percent from one weekday to another. With the total amount of emissions about constant, the resulting ozone levels depend on the meteorology of the preceding days. A large data base was assembled consisting of lower and upper air measurements at U.S. weather stations as far away as Oregon and Arizona, together with hourly readings of surface temperature, humidity, and wind speed at the dozens of air pollution stations in the Basin and nearby areas.

 


Altogether, there were daily and hourly readings of over 450 meteorological variables for a period of seven years, with corresponding hourly values of ozone and other pollutants in the Basin. Let x be the predictor vector of meteorological variables on the nth day. There are more than 450 variables in x since information several days back is included. Let y be the ozone level on the (n + 1)st day. Then the problem was to construct a function f(x) such that for any future day and future predictor variables x for that day, f(x) is an accurate predictor of the next day's ozone level y.
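
In code, the setup looks roughly like the sketch below (Python, synthetic arrays; the per-day variable count and the three-day history are illustrative assumptions, not the actual EPA database): each day's predictor x stacks the current and preceding days' meteorological readings, and the target y is the next day's ozone level.

```python
import numpy as np

# Schematic sketch with synthetic data (NOT the actual EPA database).
# Assumption: 150 meteorological variables per day and a 3-day history,
# giving the 450-dimensional predictor described in the text.
rng = np.random.default_rng(0)
n_days, n_vars, lag = 7 * 365, 150, 3
weather = rng.normal(size=(n_days, n_vars))  # daily meteorological readings
ozone = rng.normal(size=n_days)              # daily ozone levels, simplified

X, y = [], []
for n in range(lag - 1, n_days - 1):
    X.append(weather[n - lag + 1 : n + 1].ravel())  # x: day n plus 2 prior days
    y.append(ozone[n + 1])                          # y: ozone on day n + 1
X, y = np.array(X), np.array(y)
print(X.shape)  # (n_samples, 450)
```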


To estimate predictive accuracy, the first five years of data were used as the training set. The last two years were set aside as a test set. The algorithmic modeling methods available in the pre-1980s decades seem primitive now. In this project large linear regressions were run, followed by variable selection. Quadratic terms in, and interactions among, the retained variables were added and variable selection used again to prune the equations. In the end, the project was a failure—the false alarm rate of the final predictor was too high. I have regrets that this project can’t be revisited with the tools available today.
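
Breiman gives only this outline of the procedure, so the following is a minimal sketch under stated assumptions: it uses modern scikit-learn stand-ins (SequentialFeatureSelector for the era's stepwise variable selection, PolynomialFeatures for the quadratic and interaction terms), the synthetic X, y arrays from the sketch above, and arbitrary feature counts.

```python
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import PolynomialFeatures

# Time-ordered split: first five of seven years train, last two test.
split = int(len(X) * 5 / 7)
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

# Pass 1: large linear regression with forward variable selection.
sel = SequentialFeatureSelector(LinearRegression(), n_features_to_select=10,
                                direction="forward").fit(X_tr, y_tr)
X_tr1, X_te1 = sel.transform(X_tr), sel.transform(X_te)

# Pass 2: quadratic terms in, and interactions among, the retained
# variables, then a second selection pass to prune the equation.
quad = PolynomialFeatures(degree=2, include_bias=False)
X_tr2, X_te2 = quad.fit_transform(X_tr1), quad.transform(X_te1)
sel2 = SequentialFeatureSelector(LinearRegression(), n_features_to_select=15,
                                 direction="forward").fit(X_tr2, y_tr)
model = LinearRegression().fit(sel2.transform(X_tr2), y_tr)
print("test R^2:", model.score(sel2.transform(X_te2), y_te))
```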

[Translator's note: a little awkward for me, since these are exactly the methods I was trained on.]

3.2 The Chlorine Project

The EPA samples thousands of compounds a year and tries to determine their potential toxicity. In the mid-1970s, the standard procedure was to measure the mass spectra of the compound and to try to determine its chemical structure from its mass spectra.


Measuring the mass spectra is fast and cheap. But the determination of chemical structure from the mass spectra requires a painstaking examination by a trained chemist. The cost and availability of enough chemists to analyze all of the mass spectra produced daunted the EPA. Many toxic compounds contain halogens. So the EPA funded a project to determine if the presence of chlorine in a compound could be reliably predicted from its mass spectra.


Mass spectra are produced by bombarding the compound with ions in the presence of a magnetic field. The molecules of the compound split and the lighter fragments are bent more by the magnetic field than the heavier. Then the fragments hit an absorbing strip, with the position of the fragment on the strip determined by the molecular weight of the fragment. The intensity of the exposure at that position measures the frequency of the fragment. The resultant mass spectrum has numbers reflecting frequencies of fragments from molecular weight 1 up to the molecular weight of the original compound. The peaks correspond to frequent fragments and there are many zeroes. The available data base consisted of the known chemical structure and mass spectra of 30,000 compounds.


The mass spectrum predictor vector x is of variable dimensionality. Molecular weight in the data base varied from 30 to over 10,000. The variable to be predicted is

y = 1: contains chlorine,
y = 2: does not contain chlorine.

 

The problem is to construct a function f(x) that is an accurate predictor of y where x is the mass spectrum of the compound.

To measure predictive accuracy the data set was randomly divided into a 25,000 member training set and a 5,000 member test set. Linear discriminant analysis was tried, then quadratic discriminant analysis. These were difficult to adapt to the variable dimensionality. By this time I was thinking about decision trees. The hallmarks of chlorine in mass spectra were researched. This domain knowledge was incorporated into the decision tree algorithm by the design of the set of 1,500 yes–no questions that could be applied to a mass spectra of any dimensionality. The result was a decision tree that gave 95% accuracy on both chlorines and nonchlorines (see Breiman, Friedman, Olshen and Stone, 1984).
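
The paper does not list the 1,500 questions, so the sketch below invents two simple families of yes/no questions as hypothetical stand-ins (a peak present at a given weight; a peak pair two units apart, suggestive of chlorine's 35/37 isotope pattern). The point it illustrates is the mechanism: each question maps a spectrum of any dimensionality to 0/1, producing the fixed-length binary features a decision tree can split on. Data, thresholds, and weight ranges are all invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-ins for the yes/no questions (the real 1,500 are not
# given in the paper). Each works on a spectrum of ANY length.
def has_peak(spec, w, thresh=0.1):
    return int(w < len(spec) and spec[w] > thresh)

def has_pair_two_apart(spec, w, thresh=0.1):
    # Chlorine's 35/37 isotopes tend to produce peak pairs 2 units apart.
    return int(has_peak(spec, w, thresh) and has_peak(spec, w + 2, thresh))

QUESTIONS = ([lambda s, w=w: has_peak(s, w) for w in range(30, 130)] +
             [lambda s, w=w: has_pair_two_apart(s, w) for w in range(30, 130)])

def featurize(spectra):
    # Variable-dimensional spectra -> fixed-length binary feature matrix.
    return np.array([[q(s) for q in QUESTIONS] for s in spectra])

# Synthetic spectra of varying length: sparse peaks, many zeroes; planted
# peak pairs for the "chlorine" class (y = 1), none for y = 2.
rng = np.random.default_rng(1)
def fake_spectrum(chlorinated):
    n = int(rng.integers(60, 300))
    spec = (rng.random(n) < 0.05) * rng.random(n)
    if chlorinated:
        w = int(rng.integers(35, n - 2))
        spec[w], spec[w + 2] = 1.0, 0.3
    return spec

y = rng.integers(1, 3, size=400)
X = featurize([fake_spectrum(c == 1) for c in y])
tree = DecisionTreeClassifier(max_depth=8).fit(X[:300], y[:300])
print("test accuracy:", tree.score(X[300:], y[300:]))
```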

[Translator's note: I ran into the same kind of trouble with variable dimensionality when I was learning 2-D GAM modeling; it was exceptionally cumbersome.]

Linear Discriminant Analysis (LDA)

A method for classifying objects of two or more classes by their features, with applications in statistics, pattern recognition, and machine learning.

LDA is very similar to PCA. The difference is that LDA projects the data so as to separate the classes, whereas PCA projects the data onto its directions of greatest common variation; in both cases the dimensionality reduction is carried out through eigenvalues and eigenvectors.

The PCA transform minimizes the mean-squared error between the original data and its estimated low-dimensional reconstruction. PCA tends to extract the largest features that the data share and to ignore small differences between samples, so an OCR system built on PCA has trouble distinguishing the letters Q and O. LDA, by contrast, maximizes between-class variance and minimizes within-class variance: its goal is to shrink the differences within each class and widen the differences between classes. As a result, LDA performs better than PCA in some applications.

(Excerpted from the OpenCV學堂 article "LDA (Linear Discriminant Analysis) 演算法介紹" [1].)

Quadratic Discriminant Analysis (QDA)

Similar to LDA, except that it can form nonlinear decision boundaries, and the Gaussian distribution assumed for each class has its own covariance matrix.

(Excerpted from CSDN user NirHeavenX's post "sklearn淺析(五)——Discriminant Analysis" [2].)
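
For a concrete contrast between the two methods described in these notes, here is a small scikit-learn example (toy Gaussian data, not the mass-spectra problem): LDA pools one covariance matrix across classes and draws a linear boundary, while QDA estimates one covariance per class and can draw a quadratic one.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# Two Gaussian classes with deliberately different covariances,
# the situation where QDA's class-specific covariances pay off.
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], 200)
X2 = rng.multivariate_normal([2, 2], [[4.0, 1.5], [1.5, 4.0]], 200)
X = np.vstack([X1, X2])
y = np.array([1] * 200 + [2] * 200)

for clf in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    clf.fit(X, y)
    print(type(clf).__name__, "training accuracy:", clf.score(X, y))
```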

 

3.3 Perceptions on Statistical Analysis

As I left consulting to go back to the university, these were the perceptions I had about working with data to find answers to problems:

(a) Focus on finding a good solution—that’s what consultants get paid for.

(b) Live with the data before you plunge into modeling.

(c) Search for a model that gives a good solution, either algorithmic or data.

(d) Predictive accuracy on test sets is the criterion for how good the model is.

(e) Computers are an indispensable partner.

 


References:

[1] https://www.sohu.com/a/159765142_823210

[2] https://blog.csdn.net/qsczse943062710/article/details/75977118