AI-014: Study Notes 49 on Professor Andrew Ng's Machine Learning Course

These are my study notes on Andrew Ng's machine learning course series. Video address:

49. Machine learning system design: prioritizing what to work on: spam classification example

Take building a spam filter as an example. First, build a classifier:

Choose high-frequency words as the features.
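Choosing high-frequency words as features can be sketched as follows. This is a minimal illustration, not the course's code: the helper names (`tokenize`, `build_vocabulary`, `featurize`), the three sample emails, and the vocabulary size of 4 are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(emails, size):
    """Return the `size` most frequent words across the training emails."""
    counts = Counter(word for email in emails for word in tokenize(email))
    return [word for word, _ in counts.most_common(size)]


def featurize(email, vocabulary):
    """Map an email to a 0/1 vector: 1 if the vocabulary word occurs in it."""
    present = set(tokenize(email))
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical toy training set, for illustration only.
emails = [
    "Buy now, big discount on watches",
    "Discount deal: buy cheap watches now",
    "Meeting notes for the project are attached",
]
vocabulary = build_vocabulary(emails, size=4)
features = featurize("Buy a discount ticket", vocabulary)
```

In practice the vocabulary would be far larger than four words, and the resulting vectors would be fed to whatever classifier is being trained.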

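Ways to reduce the classifier's error rate, for example:

  • Collect a large amount of data
  • Use sophisticated features extracted from the email routing information (such as the sender and the subject line), for example an empty subject or @saler.com
  • Use sophisticated features extracted from the email body, for example words such as "discount" and "promotion"
  • Detect deliberate misspellings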

50. Machine Learning system design: Error analysis

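Methodology:

Error analysis:

Look at how the misclassified examples are distributed across categories. The categories with the largest share are where improving the algorithm pays off most; try new approaches (more data, more features, ...) and then check what mainly causes the remaining errors.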
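The tallying step of error analysis can be sketched in a few lines. The category labels below are hypothetical, standing in for labels assigned by hand while reading misclassified validation emails; `category_shares` is a name made up for this example.

```python
from collections import Counter


def category_shares(labels):
    """Return (category, share of all errors) pairs, largest share first."""
    counts = Counter(labels)
    total = len(labels)
    return [(category, count / total) for category, count in counts.most_common()]


# Hypothetical hand-assigned categories of misclassified emails.
misclassified = [
    "phishing", "pharma", "phishing", "replica", "phishing",
    "other", "pharma", "phishing", "phishing", "other",
]
shares = category_shares(misclassified)

for category, share in shares:
    print(f"{category}: {share:.0%}")
```

The category at the top of the list is the natural first candidate for new features or more data.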

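Ideally the algorithm returns a single quantitative evaluation result, such as an error rate. The error rates obtained with different features or methods (for example, with or without stemming) then decide which direction to take:

If the error rate is lower with stemming, adopt the version of the algorithm that uses stemming.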
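Comparing variants by one number can be sketched as below. The predictions are made-up stand-ins for the output of two hypothetical classifier variants on the same validation set; no actual stemmer is run here.

```python
def error_rate(predictions, labels):
    """Fraction of examples where the prediction differs from the label."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


def pick_best(variant_predictions, labels):
    """Return the variant name with the lowest error rate, plus all error rates."""
    errors = {
        name: error_rate(preds, labels)
        for name, preds in variant_predictions.items()
    }
    best = min(errors, key=errors.get)
    return best, errors


# Hypothetical validation labels and predictions (1 = spam, 0 = not spam).
labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
variant_predictions = {
    "without_stemming": [1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
    "with_stemming": [1, 0, 1, 1, 0, 1, 1, 0, 0, 1],
}
best, errors = pick_best(variant_predictions, labels)
```

With these illustrative numbers the stemming variant has the lower error rate and would be kept.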

51. Machine learning system design: Error metric for skewed classes

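Skewed classes: one class is far rarer than the other.

Accuracy: the fraction of all examples that are classified correctly.

Precision: of the examples predicted positive, the fraction that are actually positive.

Recall: of the examples that are actually positive, the fraction that are predicted positive.

The higher the precision and the recall, the better.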

If a classifier is getting high precision and high recall, then we can be confident that the algorithm is doing well, even if the classes are very skewed.

So for the problem of skewed classes, precision and recall give us more direct insight into how the learning algorithm is doing, and this is often a much better way to evaluate our learning algorithms than looking at classification error or classification accuracy when the classes are very skewed.
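A small sketch shows why accuracy misleads on skewed classes. The data is hypothetical (5 positives out of 1000), and treating precision and recall as 0 when their denominator is 0 is a convention assumed for this example.

```python
def confusion_counts(predictions, labels):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    return tp, fp, fn, tn


def precision_recall_accuracy(predictions, labels):
    """Return (precision, recall, accuracy); undefined ratios are reported as 0."""
    tp, fp, fn, tn = confusion_counts(predictions, labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(labels)
    return precision, recall, accuracy


# Hypothetical skewed data set: only 0.5% of the examples are positive.
labels = [1] * 5 + [0] * 995
always_negative = [0] * len(labels)  # a "classifier" that ignores its input

precision, recall, accuracy = precision_recall_accuracy(always_negative, labels)
```

Here the trivial predictor reaches 99.5% accuracy while its recall is 0, which is exactly the failure that precision and recall expose.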

51. Machine learning system design: Trading off precision and recall

threshold

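With a high threshold, few examples are flagged, but anything that is flagged is almost certainly correct -> high precision, low recall. For example, with spam you do not want to lose legitimate email.

With a low threshold, many examples are flagged, but many of them are false alarms -> low precision, high recall. For example, when predicting cancer, stay suspicious :)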
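The trade-off can be seen by sweeping the threshold over a classifier's predicted probabilities. The probabilities and labels below are invented for illustration, and the three threshold values are arbitrary choices.

```python
def predict(probabilities, threshold):
    """Predict 1 when the estimated probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_recall(predictions, labels):
    """Return (precision, recall); undefined ratios are reported as 0."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical classifier outputs and true labels.
probabilities = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

results = {
    threshold: precision_recall(predict(probabilities, threshold), labels)
    for threshold in (0.3, 0.5, 0.9)
}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.9 moves precision up to 1.0 while recall drops from 1.0 to 0.25.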

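Use the F score, F1 = 2PR / (P + R), where P is precision and R is recall, to judge with a single number whether the combination of precision and recall is acceptable.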
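Assuming the "F function" here is the F1 score, F1 = 2PR / (P + R), a sketch of using it to rank candidates follows. The candidate names and their precision/recall values are illustrative, and returning 0 when both inputs are 0 is a convention assumed for the example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs for three candidate algorithms.
candidates = {
    "A": (0.5, 0.4),
    "B": (0.7, 0.1),
    "C": (0.02, 1.0),
}
scores = {name: f1_score(p, r) for name, (p, r) in candidates.items()}
best = max(scores, key=scores.get)
```

A plain average would favor candidate C, which flags nearly everything; the F1 score stays low unless both precision and recall are reasonably high.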

52. Machine learning system design: data for machine learning

Under such conditions, a larger training set improves the algorithm's performance.

In this case, a large training set gives good results, and which algorithm is used matters much less.

Key tests:

First, can a human expert look at the features x and confidently predict the value of y?

Second, can we actually get a large training set and train a learning algorithm with many parameters on it?

If you can do both, you can often get a very good algorithm.