1. 程式人生 > >R語言中的資料探勘演算法

R語言中的資料探勘演算法

      R是用於統計分析、繪圖的語言和操作環境。R是屬於GNU系統的一個自由、免費、原始碼開放的軟體,它是一個用於統計計算和統計製圖的優秀工具。

                                                                                                                                                                                                                                                                      ——百度百科

      由於R語言可以很好地進行統計計算等工作,提供了一系列對聚類、分類演算法實現的包,所以對於資料探勘等工作有很大的幫助。

   一、基於密度的DBSCAN演算法

      在進行呼叫DBSCAN演算法的介面之前,需要使用命令安裝依賴庫,命令如下:

           install.packages("fpc", dependencies = TRUE)

     在R語言的fpc包中提供了實現DBSCAN聚類演算法並進行視覺化的函式,如下:

    dbscan(data, eps, MinPts, scale, method, seeds, showplot, countmode)

    data進行聚類的資料(可以是原始資料矩陣,也可以是一個距離矩陣);

    eps密度(掃描半徑);

    MinPts:最小包含點數;

    scale是否對data標準化(T/F);

    mehtod三個可選引數如下,

    rawdata視為原始資料,並避免計算距離矩陣(儲存儲存器,也可以是慢);

    distdata視為距離矩陣(比較快,但記憶體價格昂貴);

    hybrid計算部分距離矩陣(適度的記憶體需求,非常快);

    seeds:T/F

    showplot:是否畫聚類結果圖(三個可選引數:0,不畫:1,每次迭代畫;2,每次子迭代畫);

    countmode:NULL或者一個用於報告進度的向量。

    樣例程式碼如下:

new1 <- c(0,5183.328938,11420.98223,21320.32421,16989.59236,14899.47468,18480.556186,10386.55199,9236.277226,10180.589785)
new2 <- c(5183.328938,0,12360.82514,22350.72344,16893.23695,20657.25945,11074.88822,11074.88822,9924.613457,9591.926128)
new3 <- c(11420.98223,12360.82514,0,2090.117679,21019.15289,21105.79131,12360.82514,12360.82514,12360.82514,11031.75103)
new4 <- c(21320.32421,22350.72344,2090.117679,0,21019.15289,21105.79131,12360.82514,12360.82514,13603.98286,12071.69154)
new5 <- c(16989.59236,16893.23695,21019.15289,21019.15289,0,5183.328938,17945.32085,15775.28119,20562.67213,20268.02825)
new6 <- c(14899.47468,20657.25945,21105.79131,21105.79131,5183.328938,0,21674.62059,21674.62059,16989.59236,16694.94848)
new7 <- c(18480.556186,11074.88822,12360.82514,12360.82514,17945.32085,21674.62059,0,5576.559036,11954.7204,13959.63176)
new8 <- c(10386.55199,11074.88822,12360.82514,12360.82514,15775.28119,21674.62059,5576.559036,0,11954.7204,13959.63176)
new9 <- c(9236.277226,9924.613457,12360.82514,13603.98286,20562.67213,16989.59236,11954.7204,11954.7204,0,6782.135558)
new10 <- c(10180.589785,9591.926128,11031.75103,12071.69154,20268.02825,16694.94848,13959.63176,13959.63176,6782.135558,0)
X <- rbind(new1,new2,new3,new4,new5,new6,new7,new8,new9,new10)
#X <- scale(X)  #標準化
X #距離矩陣
Y <- as.dist(X)
#Y
par(bg="white")
model <- dbscan(X,MinPts=2,eps=7000,scale=F,showplot=2,method="dist")
model 
plot(model,X,main="DBSCAN聚類結果",ylab="",xlab="") 

  二、層次聚類(hierarchicalclustering)

      在R語言中提供了hcluster(data,method)函式進行層次聚類,具體引數不再詳細分析。

      樣例程式碼如下:

new1 <- c(0,5183.328938,11420.98223,11420.98223,16989.59236,14899.47468,8480.556186,10386.55199,9236.277226,8180.589785)
new2 <- c(5183.328938,0,12360.82514,12360.82514,16893.23695,20657.25945,11074.88822,11074.88822,9924.613457,9591.926128)
new3 <- c(11420.98223,12360.82514,0,2090.117679,21019.15289,21105.79131,12360.82514,12360.82514,12360.82514,11031.75103)
new4 <- c(11420.98223,12360.82514,2090.117679,0,21019.15289,21105.79131,12360.82514,12360.82514,13603.98286,12071.69154)
new5 <- c(16989.59236,16893.23695,21019.15289,21019.15289,0,5183.328938,17945.32085,15775.28119,20562.67213,20268.02825)
new6 <- c(14899.47468,20657.25945,21105.79131,21105.79131,5183.328938,0,21674.62059,21674.62059,16989.59236,16694.94848)
new7 <- c(8480.556186,11074.88822,12360.82514,12360.82514,17945.32085,21674.62059,0,5576.559036,11954.7204,13959.63176)
new8 <- c(10386.55199,11074.88822,12360.82514,12360.82514,15775.28119,21674.62059,5576.559036,0,11954.7204,13959.63176)
new9 <- c(9236.277226,9924.613457,12360.82514,13603.98286,20562.67213,16989.59236,11954.7204,11954.7204,0,6782.135558)
new10 <- c(8180.589785,9591.926128,11031.75103,12071.69154,20268.02825,16694.94848,13959.63176,13959.63176,6782.135558,0)
X <- rbind(new1,new2,new3,new4,new5,new6,new7,new8,new9,new10)
Y <- as.dist(X)
Y
out.hclust <- hclust(Y,"single")   #最短距離法
cbind(hc1$merge,hc1$height)
rownames(S)=paste("new",1:10,"")
plclust(out.hclust,sub="",xlab="",ylab="",main="層次聚類結果圖")       #對結果畫圖  
#rect.hclust(out.hclust,k=5)                   #用矩形畫出分為5類的區域
out.id=cutree(out.hclust,k=5)                 #得到分為5類的數值
out.id

更多細節:

https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Hierarchical_Clustering