
Spark Machine Learning Notes (4): Building Classification Models with Spark Python (Part 1)

Thus, when the estimate of w^T x is greater than or equal to the threshold 0, the SVM labels the data point 1; otherwise it labels it 0 (the threshold is a model parameter that the SVM can adapt).

The SVM's loss function is known as the hinge loss, defined as:

    L(w; x, y) = max(0, 1 - y * w^T x)

where the label y is taken to be -1 or 1 in this formulation.

The SVM is a maximum-margin classifier: it tries to learn a weight vector that separates the classes as widely as possible. On many classification tasks the SVM not only performs very well but also scales linearly to large datasets.
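The thresholded decision rule and the hinge loss described above can be sketched in plain Python. This is a minimal sketch for intuition only; the helper names are ours, not MLlib's, and the loss takes labels as -1/1:

```python
def dot(w, x):
    # inner product w^T x
    return sum(wi * xi for wi, xi in zip(w, x))

def svm_predict(w, x, threshold=0.0):
    # label 1 when w^T x >= threshold, otherwise 0
    return 1 if dot(w, x) >= threshold else 0

def hinge_loss(w, x, y):
    # hinge loss max(0, 1 - y * w^T x); y must be -1 or 1 here
    return max(0.0, 1.0 - y * dot(w, x))
```

Points far on the correct side of the margin incur zero loss, while points that violate the margin or are misclassified are penalized linearly.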

In the figure below, based on the earlier simple binary classification example, we plot the decision functions of logistic regression (blue line) and the linear SVM (red line):

As the figure shows, the SVM effectively focuses on the data points closest to the decision function (the margin lines are drawn as red dashed lines).

1.2 The Naive Bayes Model

Naive Bayes is a probabilistic model that makes predictions by computing the probability of a data point belonging to each class. The model assumes that each feature contributes to the class probability independently (that is, the features are conditionally independent of one another), which is where the "naive" in the name comes from.

Under this assumption, the probability of belonging to a class is expressed as a product of probabilities: the probability of each feature occurring given the class (the conditional probabilities), and the probability of the class itself (the prior probability). This makes training direct and tractable: the class priors and the feature conditional probabilities are estimated from frequencies in the data. Classification then amounts to selecting the most probable class given the features and these probabilities.

There is one further assumption, about the distribution of the features, whose parameters are estimated from the data. MLlib implements multinomial naive Bayes, which assumes that the feature distribution is multinomial, representing non-negative frequency counts of the features.

This assumption is well suited to binary features (for example 1-of-k encoding, where only one of the k dimensions of a feature vector is 1 and the rest are 0; the bag-of-words model introduced earlier is a typical binary feature representation).
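The training and prediction scheme just described (frequency-based priors and conditionals, prediction as a product of probabilities) can be sketched in a few lines of plain Python. This is an illustrative multinomial naive Bayes with Laplace smoothing, not MLlib's implementation; all names here are ours:

```python
import math
from collections import defaultdict

def train_multinomial_nb(X, y, alpha=1.0):
    """X: non-negative count vectors; y: class labels.
    Returns log-priors and Laplace-smoothed per-class log feature probabilities."""
    n_features = len(X[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0.0] * n_features)
    for xi, yi in zip(X, y):
        class_counts[yi] += 1
        for j, v in enumerate(xi):
            feature_counts[yi][j] += v
    n = len(y)
    # prior: relative frequency of each class in the training data
    log_prior = {c: math.log(k / n) for c, k in class_counts.items()}
    # conditional: smoothed relative frequency of each feature within a class
    log_cond = {}
    for c, counts in feature_counts.items():
        total = sum(counts) + alpha * n_features
        log_cond[c] = [math.log((v + alpha) / total) for v in counts]
    return log_prior, log_cond

def predict_nb(x, log_prior, log_cond):
    # pick the class maximizing log P(c) + sum_j x_j * log P(feature_j | c)
    return max(log_prior,
               key=lambda c: log_prior[c] + sum(v * lc for v, lc in zip(x, log_cond[c])))
```

Working in log space turns the product of probabilities into a sum and avoids numerical underflow.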

The figure below shows the decision function of naive Bayes on the binary classification sample.

1.3 Decision Trees

A decision tree is a powerful non-probabilistic model that can express complex nonlinear patterns and feature interactions. Decision trees perform well on many tasks, are relatively easy to understand and interpret, can handle both categorical and numerical features, and do not require the input data to be normalized or standardized. They are also well suited to ensemble methods, such as an ensemble of many decision trees, known as a decision forest.

A decision tree model is like a tree whose leaves are the classifications (values 0 or 1) and whose branches are the features. As shown in Figure 5-6, the two binary outputs are "stay at home" and "go to the beach", and the feature is the weather.

The decision tree algorithm is a top-down approach that starts at the root node (or feature). At each step it evaluates the information gain of each candidate feature split and selects the feature that best splits the dataset. Information gain is computed as the node impurity (the degree to which the labels at the node are dissimilar, or heterogeneous) minus the weighted sum of the impurities of the two child nodes produced by the split. For classification tasks, two measures are used to select the best split: Gini impurity and entropy.
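The impurity calculation above can be made concrete with a small sketch (the helper names are ours; entropy and Gini impurity as usually defined, and information gain as parent impurity minus the weighted child impurities):

```python
import math
from collections import Counter

def entropy(labels):
    # -sum over classes of p * log2(p)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # 1 - sum over classes of p^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right, impurity=entropy):
    # parent impurity minus the size-weighted impurities of the two children
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted
```

A split that yields two pure children achieves the maximum possible gain; a split that leaves both children with the parent's label mix gains nothing.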


2 Extracting Suitable Features from the Data

Recalling the earlier post on data processing and feature extraction, most machine learning models operate on numerical data in the form of feature vectors. In addition, supervised learning methods such as classification and regression require the target variable (or the class variable in the multiclass case) to be supplied together with the feature vectors.

Classification models in MLlib operate on LabeledPoint objects, which wrap the target variable (the label) together with the feature vector:

                LabeledPoint(label: Double, features: Vector) 

Although many examples of using classification models work with datasets that are already in vector format, in practice you usually still need to extract features from the raw data, including packaging numerical features, normalizing or scaling features, and representing categorical features with 1-of-k encoding.
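As a quick illustration of the 1-of-k encoding mentioned above (a hypothetical helper, not part of MLlib):

```python
def one_of_k(value, categories):
    # encode a categorical value as a k-dimensional binary vector,
    # with a single 1.0 at the index of `value` within `categories`
    vec = [0.0] * len(categories)
    vec[categories.index(value)] = 1.0
    return vec
```

For example, `one_of_k('business', ['business', 'sports', 'tech'])` gives `[1.0, 0.0, 0.0]`.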


Extracting features from the Kaggle/StumbleUpon evergreen classification dataset

Since the MovieLens dataset used for the recommendation model in the previous post has nothing to do with classification, this chapter uses a different dataset. It comes from a Kaggle competition, with data provided by StumbleUpon. The problem is to predict whether a page recommended on the site is ephemeral (short-lived, quickly falling out of popularity) or evergreen (popular for a long time).

To make the data easier for Spark to work with, we need to remove the first line of the file, which contains the column names. Change into the directory containing the data file and run:

sed 1d train.tsv > train_noheader.tsv

Read in the data:
rawData = sc.textFile('/Users/youwei.tan/Downloads/train_noheader.tsv')
records = rawData.map(lambda x: x.split('\t'))
print(records.first())

Output:
[u'"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"', u'"4042"', u'"{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees in its crystal ball The predictions are part of an annual tradition for the Armonk New York based company which surveys its 3 000 researchers to find five ideas expected to take root in the next five years IBM the world s largest provider of computer services looks to Silicon Valley for input gleaning many ideas from its Almaden research center in San Jose California Holographic conversations projected from mobile phones lead this year s list The predictions also include air breathing batteries computer programs that can tell when and where traffic jams will take place environmental information generated by sensors in cars and phones and cities powered by the heat thrown off by computer servers These are all stretch goals and that s good said Paul Saffo managing director of foresight at the investment advisory firm Discern in San Francisco In an era when pessimism is the new black a little dose of technological optimism is not a bad thing For IBM it s not just idle speculation The company is one of the few big corporations investing in long range research projects and it counts on innovation to fuel growth Saffo said Not all of its predictions pan out though IBM was overly optimistic about the spread 
of speech technology for instance When the ideas do lead to products they can have broad implications for society as well as IBM s bottom line he said Research Spending They have continued to do research when all the other grand research organizations are gone said Saffo who is also a consulting associate professor at Stanford University IBM invested 5 8 billion in research and development last year 6 1 percent of revenue While that s down from about 10 percent in the early 1990s the company spends a bigger share on research than its computing rivals Hewlett Packard Co the top maker of personal computers spent 2 4 percent last year At Almaden scientists work on projects that don t always fit in with IBM s computer business The lab s research includes efforts to develop an electric car battery that runs 500 miles on one charge a filtration system for desalination and a program that shows changes in geographic data IBM rose 9 cents to 146 04 at 11 02 a m in New York Stock Exchange composite trading The stock had gained 11 percent this year before today Citizen Science The list is meant to give a window into the company s innovation engine said Josephine Cheng a vice president at IBM s Almaden lab All this demonstrates a real culture of innovation at IBM and willingness to devote itself to solving some of the world s biggest problems she said Many of the predictions are based on projects that IBM has in the works One of this year s ideas that sensors in cars wallets and personal devices will give scientists better data about the environment is an expansion of the company s citizen science initiative Earlier this year IBM teamed up with the California State Water Resources Control Board and the City of San Jose Environmental Services to help gather information about waterways Researchers from Almaden created an application that lets smartphone users snap photos of streams and creeks and report back on conditions The hope is that these casual observations will help 
local and state officials who don t have the resources to do the work themselves Traffic Predictors IBM also sees data helping shorten commutes in the next five years Computer programs will use algorithms and real time traffic information to predict which roads will have backups and how to avoid getting stuck Batteries may last 10 times longer in 2015 than today IBM says Rather than using the current lithium ion technology new models could rely on energy dense metals that only need to interact with the air to recharge Some electronic devices might ditch batteries altogether and use something similar to kinetic wristwatches which only need to be shaken to generate a charge The final prediction involves recycling the heat generated by computers and data centers Almost half of the power used by data centers is currently spent keeping the computers cool IBM scientists say it would be better to harness that heat to warm houses and offices In IBM s first list of predictions compiled at the end of 2006 researchers said instantaneous speech translation would become the norm That hasn t happened yet While some programs can quickly translate electronic documents and instant messages and other apps can perform limited speech translation there s nothing widely available that acts like the universal translator in Star Trek Second Life The company also predicted that online immersive environments such as Second Life would become more widespread While immersive video games are as popular as ever Second Life s growth has slowed Internet users are flocking instead to the more 2 D environments of Facebook Inc and Twitter Inc Meanwhile a 2007 prediction that mobile phones will act as a wallet ticket broker concierge bank and shopping assistant is coming true thanks to the explosion of smartphone applications Consumers can pay bills through their banking apps buy movie tickets and get instant feedback on potential purchases all with a few taps on their phones The nice thing about the 
list is that it provokes thought Saffo said If everything came true they wouldn t be doing their job To contact the reporter on this story Ryan Flinn in San Francisco at rflinn bloomberg net To contact the editor responsible for this story Tom Giles at tgiles5 bloomberg net by 2015, your mobile phone will project a 3-d image of anyone who calls and your laptop will be powered by kinetic energy. at least that\\u2019s what international business machines corp. sees in its crystal ball."",""url"":""bloomberg news 2010 12 23 ibm predicts holographic calls air breathing batteries by 2015 html""}"', u'"business"', u'"0.789131"', u'"2.055555556"', u'"0.676470588"', u'"0.205882353"', u'"0.047058824"', u'"0.023529412"', u'"0.443783175"', u'"0"', u'"0"', u'"0.09077381"', u'"0"', u'"0.245831182"', u'"0.003883495"', u'"1"', u'"1"', u'"24"', u'"0"', u'"5424"', u'"170"', u'"8"', u'"0.152941176"', u'"0.079129575"', u'"0"']

The data description on the dataset page tells us that the first four columns are the URL, the page ID, the raw text content, and the category assigned to the page. The next 22 columns contain various numerical or categorical features. The last column is the target: 1 means evergreen and 0 means ephemeral.

We will process the numerical features directly, using a simple approach. Because each categorical variable is binary, these variables already amount to 1-of-k encoded features, so no extra feature extraction is needed.

Because of formatting issues in the data, we do some cleaning first, stripping the extra quotation marks (") along the way. The dataset also contains missing values marked with "?"; in this example we simply replace them with 0:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors

trimmed = records.map(lambda x: [field.replace('"', '') for field in x])
label = trimmed.map(lambda x: x[-1])

data = trimmed.map(lambda x: (x[-1], x[4:-1])) \
              .map(lambda p: (int(p[0]),
                              [0.0 if v == '?' else float(v) for v in p[1]])) \
              .map(lambda p: LabeledPoint(p[0], Vectors.dense(p[1])))
data.take(5)

Output:
[LabeledPoint(0.0, [0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]),
 LabeledPoint(1.0, [0.574147,3.677966102,0.50802139,0.288770053,0.213903743,0.144385027,0.468648998,0.0,0.0,0.098707403,0.0,0.203489628,0.088652482,1.0,1.0,40.0,0.0,4973.0,187.0,9.0,0.181818182,0.125448029]),
 LabeledPoint(1.0, [0.996526,2.382882883,0.562015504,0.321705426,0.120155039,0.042635659,0.525448029,0.0,0.0,0.072447859,0.0,0.22640177,0.120535714,1.0,1.0,55.0,0.0,2240.0,258.0,11.0,0.166666667,0.057613169]),
 LabeledPoint(1.0, [0.801248,1.543103448,0.4,0.1,0.016666667,0.0,0.480724749,0.0,0.0,0.095860566,0.0,0.265655744,0.035343035,1.0,0.0,24.0,0.0,2737.0,120.0,5.0,0.041666667,0.100858369]),
 LabeledPoint(0.0, [0.719157,2.676470588,0.5,0.222222222,0.12345679,0.043209877,0.446143274,0.0,0.0,0.024908425,0.0,0.228887247,0.050473186,1.0,1.0,14.0,0.0,12032.0,162.0,10.0,0.098765432,0.082568807])]


Cache the data and count the number of samples:

data.cache()
numData = data.count()
print(numData)

Output:
7395

Before processing the dataset any further, note that the numerical data contains negative feature values. The naive Bayes model requires non-negative features and will raise an error if it encounters negative values. We therefore build a separate input dataset of feature vectors for naive Bayes, setting any negative feature values to 0:

nbdata = trimmed.map(lambda x: (x[-1], x[4:-1])) \
                .map(lambda p: (int(p[0]),
                                [0.0 if v == '?' else float(v) for v in p[1]])) \
                .map(lambda p: (p[0], [0.0 if v < 0 else v for v in p[1]])) \
                .map(lambda p: LabeledPoint(p[0], Vectors.dense(p[1])))
nbdata.take(2)

Output:
[LabeledPoint(0.0, [0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]),
 LabeledPoint(1.0, [0.574147,3.677966102,0.50802139,0.288770053,0.213903743,0.144385027,0.468648998,0.0,0.0,0.098707403,0.0,0.203489628,0.088652482,1.0,1.0,40.0,0.0,4973.0,187.0,9.0,0.181818182,0.125448029])]

3 Training Classification Models

Now that we have extracted basic features from the dataset and built the RDDs, let's train some models. To compare the performance of different models, we will train logistic regression, an SVM, naive Bayes, and a decision tree. You will find that the training procedure is almost identical for every model; what differs is the set of configurable parameters specific to each one. MLlib sets sensible defaults in most cases, but in practice the best configuration should be chosen through evaluation techniques, which we will discuss in a later post.

# import the relevant classes
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.tree import DecisionTree

numIteration = 10   # number of iterations
maxTreeDepth = 5    # maximum tree depth

numClass = label.distinct().count()     # number of classes
print('number of classes:', numClass)

# train the logistic regression, SVM, naive Bayes and decision tree models
lrModel = LogisticRegressionWithSGD.train(data, numIteration)
svmModel = SVMWithSGD.train(data, numIteration)
nbModel = NaiveBayes.train(nbdata)
dtModel = DecisionTree.trainClassifier(data, numClass, {}, impurity='entropy', maxDepth=maxTreeDepth)
print('logistic regression model:', lrModel)
print('SVM model:', svmModel)
print('naive Bayes model:', nbModel)
print('decision tree model:', dtModel)

Output:
number of classes: 2
logistic regression model: (weights=[-0.110216274454,-0.493200344739,-0.0712665620384,-0.0214744216778,0.00276706475384,0.00246385887598,-1.33300460292,0.0525232672351,0.0,-0.0320576776,-0.00653638798541,-0.0613702511674,-0.14975863133,-0.13648187383,-0.121161700009,-15.6451616669,-0.0177690355464,745.987958686,-7.73567729685,-1.38587998188,-0.0355600416613,-0.0352085128613], intercept=0.0)
SVM model: (weights=[-0.122188386978,-0.527510758159,-0.0742371782434,-0.0206667449306,0.00546395033577,0.00409811283781,-1.54824523474,0.0607028905087,0.0,-0.037008323802,-0.007374037142,-0.067970375864,-0.172289581054,-0.148716595522,-0.129369384966,-18.0315472516,-0.0202704220321,1025.48043141,-5.05188911633,-1.54111193167,-0.038689478606,-0.0397619278886], intercept=0.0)
naive Bayes model: <pyspark.mllib.classification.NaiveBayesModel object at 0x11291ab90>
decision tree model: DecisionTreeModel classifier of depth 5 with 61 nodes

4 Using the Classification Models

We now have four models trained on the input labels and features; next, let's see how to use them to make predictions. For now we use the same training data to illustrate each model's predict method. Logistic regression is shown here; the other models are used in the same way.

dataPoint = data.first()
prediction = lrModel.predict(dataPoint.features)
print('predicted label:', prediction)
print('true label:', int(dataPoint.label))

Output:
predicted label: 1
true label: 0

We can see that for the first sample in the training data the model predicts 1, while the true label is 0, so the prediction is wrong.
