Machine Learning with PySpark (Clustering)

BisectingKMeans

class pyspark.ml.clustering.BisectingKMeans(self, featuresCol="features", predictionCol="prediction", maxIter=20, seed=None, k=4, minDivisibleClusterSize=1.0)

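Parameters

maxIter: maximum number of iterations
k: desired number of clusters
minDivisibleClusterSize: minimum number of points (if >= 1.0) or minimum proportion of points (if < 1.0) that a cluster must contain to be divisible
fit(dataset, params=None): fits a model to the input dataset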
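The two readings of minDivisibleClusterSize can be illustrated with a small pure-Python sketch. The helper name `is_divisible` and its arguments are hypothetical and are not part of the PySpark API; the sketch only mirrors the documented meaning of the parameter and does not reproduce Spark's internal splitting logic.

```python
def is_divisible(cluster_size, total_points, min_divisible_cluster_size=1.0):
    """Hypothetical helper: applies the documented meaning of minDivisibleClusterSize."""
    if min_divisible_cluster_size >= 1.0:
        # Values >= 1.0 are read as an absolute minimum number of points.
        threshold = min_divisible_cluster_size
    else:
        # Values < 1.0 are read as a minimum fraction of all points.
        threshold = min_divisible_cluster_size * total_points
    return cluster_size >= threshold


# A cluster of 3 points cannot be split when at least 5 points are required.
print(is_divisible(3, 100, 5.0))    # False
# A cluster of 30 points out of 100 meets a 25% threshold.
print(is_divisible(30, 100, 0.25))  # True
```

With the default value of 1.0, any non-empty cluster passes this check.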

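Methods and attributes of the fitted model

clusterCenters(): returns the cluster centers as a list of NumPy arrays
computeCost(dataset): computes the sum of squared distances between the input points and their nearest cluster centers
transform(dataset): predicts the cluster for each row of the input data
hasSummary: whether the trained model has a summary
summary: returns the training summary
Getter and setter methods for the parameters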
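To make the quantity returned by computeCost() concrete, here is a minimal pure-Python sketch of the sum of squared distances for the four sample points used in this article's code. The centers (0.5, 0.5) and (8.5, 8.5) are an assumption: they are simply the means of the two obvious groups, not values read back from a fitted model, and the function names are illustrative only.

```python
def squared_distance(point, center):
    # Squared Euclidean distance between two equal-length sequences.
    return sum((a - b) ** 2 for a, b in zip(point, center))


def within_cluster_cost(points, centers):
    # Each point contributes the squared distance to its nearest center.
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centers = [(0.5, 0.5), (8.5, 8.5)]  # assumed: the means of the two groups

cost = within_cluster_cost(points, centers)
print(cost)  # 2.0
```

Every point lies at squared distance 0.5 from its center, so the total is 4 x 0.5 = 2.0.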

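Attributes of the summary

cluster: DataFrame with the predicted cluster index for each training data point
clusterSizes: size of each cluster
k: number of clusters
predictions: DataFrame produced by the model's transform method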
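The relationship between the per-row predictions and clusterSizes can be sketched without Spark. The `predictions` list below is hypothetical sample data standing in for the prediction column; it is not output from a real model.

```python
from collections import Counter

predictions = [0, 0, 1, 1]  # hypothetical cluster index for each training row
k = 2                       # number of clusters

# Count how many rows were assigned to each cluster index 0..k-1.
counts = Counter(predictions)
cluster_sizes = [counts.get(i, 0) for i in range(k)]
print(cluster_sizes)  # [2, 2]
```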

Code

from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import BisectingKMeans
# Assumes an active SparkSession named spark
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
bkm = BisectingKMeans(k=2, minDivisibleClusterSize=1.0)
model = bkm.fit(df)
centers = model.clusterCenters()
len(centers)
model.computeCost(df)
model.hasSummary
summary = model.summary
summary.k
summary.clusterSizes
# Predict
transformed = model.transform(df).select("features", "prediction")
rows = transformed.collect()
rows[0].prediction == rows[1].prediction
rows[2].prediction == rows[3].prediction

KMeans

class pyspark.ml.clustering.KMeans(self, featuresCol="features", predictionCol="prediction", k=2, initMode="k-means||", initSteps=2, tol=1e-4, maxIter=20, seed=None)

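Parameters

initMode: initialization algorithm, either "random" or "k-means||"
initSteps: number of steps for the k-means|| initialization; must be > 0
fit(dataset, params=None): fits a model to the input dataset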

Methods and attributes of the fitted model

clusterCenters(): same as BisectingKMeans
computeCost(): same as BisectingKMeans
summary: same as BisectingKMeans
transform(): same as BisectingKMeans
Getter and setter methods for the parameters
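Conceptually, the prediction step of k-means assigns each point to its nearest center. The following is a minimal pure-Python sketch of that assignment, not the PySpark implementation; the centers are assumed values (the means of the two groups in the sample data) and `nearest_center` is an illustrative name.

```python
def nearest_center(point, centers):
    # Index of the center with the smallest squared Euclidean distance.
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return distances.index(min(distances))


points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centers = [(0.5, 0.5), (8.5, 8.5)]  # assumed centers, not fitted values

predictions = [nearest_center(p, centers) for p in points]
print(predictions)  # [0, 0, 1, 1]
```

The first two points share one cluster index and the last two share the other, which is the same property the equality checks in the code below verify on the fitted model. In a real fit, which group receives index 0 can depend on initialization.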

Code

from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans, KMeansModel
data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)
centers = model.clusterCenters()
len(centers)
#2
model.computeCost(df)
#2.000...
transformed = model.transform(df).select("features", "prediction")
rows = transformed.collect()
rows[0].prediction == rows[1].prediction
#True
rows[2].prediction == rows[3].prediction
#True
model.hasSummary
#True
summary = model.summary
summary.k
#2
summary.clusterSizes
#[2, 2]
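import tempfile
temp_path = tempfile.mkdtemp()  # any writable directory works; a temporary one is used here
kmeans_path = temp_path + "/kmeans"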
kmeans.save(kmeans_path)
kmeans2 = KMeans.load(kmeans_path)
kmeans2.getK()
#2
model_path = temp_path + "/kmeans_model"
model.save(model_path)
model2 = KMeansModel.load(model_path)
model2.hasSummary
#False
model.clusterCenters()[0] == model2.clusterCenters()[0]
#array([ True,  True], dtype=bool)
model.clusterCenters()[1] == model2.clusterCenters()[1]
#array([ True,  True], dtype=bool)

GaussianMixture

class pyspark.ml.clustering.GaussianMixture(self, featuresCol="features", predictionCol="prediction", k=2, probabilityCol="probability", tol=0.01, maxIter=100, seed=None)

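Parameters

k: number of independent Gaussians in the mixture; must be > 1
maxIter: maximum number of iterations; must be >= 0
tol: convergence tolerance of the iterative algorithm; must be >= 0
fit(dataset, params=None): fits a model to the input dataset
Getter and setter methods for the parameters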

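Methods and attributes of the fitted model

gaussiansDF: the Gaussian distributions as a DataFrame; each row represents one Gaussian, with two columns: mean (Vector) and cov (Matrix)
hasSummary: whether the model has a training summary
summary: returns the training summary
transform(dataset, params=None): predicts the cluster for each row of the input data
weights: weights of the Gaussians in the mixture; they sum to 1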
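How the mixture weights and the individual Gaussians combine into per-cluster probabilities can be sketched in one dimension with plain Python. All numbers below (weights, means, variances, and the query point) are made up for illustration, and the function names are hypothetical; PySpark itself works with multivariate Gaussians and full covariance matrices.

```python
import math


def gaussian_pdf(x, mean, var):
    # Density of a 1-D normal distribution with the given mean and variance.
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, weights, means, variances):
    # Posterior probability of each component: weight * density, normalized to sum to 1.
    weighted = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(weighted)
    return [p / total for p in weighted]


weights = [0.5, 0.5]    # illustrative mixture weights, summing to 1
means = [0.0, 4.0]      # illustrative component means
variances = [1.0, 1.0]  # illustrative component variances

probs = responsibilities(0.5, weights, means, variances)
prediction = probs.index(max(probs))  # most probable component
print(probs, prediction)
```

The point 0.5 lies much closer to the first mean, so the first component receives almost all of the probability and becomes the predicted cluster.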

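Attributes of the summary

cluster: DataFrame with the predicted cluster for each training data point
clusterSizes: size of each cluster (number of data points in the cluster)
k: number of clusters the model was trained with
predictions: DataFrame produced by the model's transform method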

Code

from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import GaussianMixture, GaussianMixtureModel
data = [(Vectors.dense([-0.1, -0.05 ]),),(Vectors.dense([-0.01, -0.1]),),(Vectors.dense([0.9, 0.8]),),(Vectors.dense([0.75,0.935]),),(Vectors.dense([-0.83, -0.68]),),(Vectors.dense([-0.91, -0.76]),)]
df = spark.createDataFrame(data, ["features"])
gm = GaussianMixture(k=3, tol=0.0001,maxIter=10, seed=10)
model = gm.fit(df)
model.hasSummary
#True
summary = model.summary
summary.k
#3
summary.clusterSizes
#[2, 2, 2]
weights = model.weights
len(weights)
#3
model.gaussiansDF.show()
transformed=model.transform(df).select("features","prediction")
rows = transformed.collect()
rows[4].prediction == rows[5].prediction
#True
rows[2].prediction == rows[3].prediction
#True
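import tempfile
temp_path = tempfile.mkdtemp()  # any writable directory works; a temporary one is used here
gmm_path = temp_path + "/gmm"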
gm.save(gmm_path)
gm2 = GaussianMixture.load(gmm_path)
gm2.getK()
#3
model_path = temp_path + "/gmm_model"
model.save(model_path)
model2 = GaussianMixtureModel.load(model_path)
model2.hasSummary
#False
model2.weights == model.weights
#True
model2.gaussiansDF.show()