[Original] Algorithm Sharing (5): The DBSCAN Clustering Algorithm

阿新 • Published: 2018-12-26

Introduction

DBSCAN (Density-based spatial clustering of applications with noise) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors) and marks as outliers the points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and one of the most cited in the scientific literature.

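How it works

DBSCAN is a density-based clustering algorithm, and the procedure is fairly simple. Points that lie close together (a center point and its neighbors) are grouped into one cluster. The algorithm then keeps looking up the neighbors of those neighbors and adding them to the same cluster until the cluster can no longer grow, and then moves on to the remaining unvisited points.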
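The expansion process described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pseudocode from the original paper: the names `region_query` and `dbscan` are hypothetical, distances are Euclidean, and the neighbor search is a brute-force scan.

```python
from math import dist

NOISE = -1
UNVISITED = None


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not dense enough to start a cluster; a later cluster may still claim it.
            labels[i] = NOISE
            continue
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                # Border point: it joins the cluster but is not expanded further.
                labels[j] = cluster_id
                continue
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                # j is dense too, so its neighbors are candidates for this cluster.
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (50.0, 50.0)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The two tight groups become clusters 0 and 1, and the isolated point is labeled -1 (noise). The brute-force scan makes this sketch quadratic in the number of points; real implementations typically use a spatial index for the neighbor lookup.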

Algorithm pseudocode

Sub-method pseudocode

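DBSCAN requires two parameters: ε (eps), a distance, and the minimum number of points required to form a dense region (minPts). A center point and its neighbors can form a cluster only when at least MinPts points lie within radius Eps of the center point (in the original paper's definition, the count includes the center point itself).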
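To make the role of the two parameters concrete, the sketch below labels each point as core, border, or noise for a given Eps and MinPts. The function name `classify_points` is hypothetical, distances are Euclidean, and a point's neighborhood is assumed to include the point itself.

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Return 'core', 'border', or 'noise' for every point."""
    # Neighborhood of each point, counting the point itself.
    neighborhoods = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighborhoods]
    kinds = []
    for i, neighborhood in enumerate(neighborhoods):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighborhood):
            # Not dense itself, but within eps of a core point.
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds


points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.8, 0.0), (5.0, 5.0)]
strict = classify_points(points, eps=1.0, min_pts=3)
loose = classify_points(points, eps=1.0, min_pts=2)
print(strict)  # ['core', 'core', 'core', 'border', 'noise']
print(loose)   # ['core', 'core', 'core', 'core', 'noise']
```

Lowering MinPts from 3 to 2 promotes the fourth point from border to core, while the isolated point stays noise in both runs because nothing lies within Eps of it.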

Implementation

Python

Code

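import numpy as np
from sklearn.cluster import DBSCAN

# Mean Earth radius in km, used to convert a distance in km to radians.
KMS_PER_RADIAN = 6371.0088

def main_fun():
    # (latitude, longitude) pairs in degrees
    loc_data = [(40.8379295833, -73.70228875), (40.750613794,-73.993434906), (40.6927066969, -73.8085984165), (40.7489736586, -73.9859616017), (40.8379525833, -73.70209875), (40.6997066969, -73.8085234165), (40.7484436586, -73.9857316017)]
    # With metric='haversine', eps is an angle in radians, so convert from km.
    # The 1 km radius is an assumed value; tune it for your data.
    epsilon = 1.0 / KMS_PER_RADIAN
    # The haversine metric expects [latitude, longitude] in radians.
    db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(loc_data))
    labels = db.labels_
    print(labels)
    # Label -1 marks noise and is not counted as a cluster.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    for i in range(n_clusters_):
        # np.where returns a tuple of arrays; the first holds the point indices.
        for j in np.where(labels == i)[0]:
            print(loc_data[j])

if __name__ == '__main__':
    main_fun()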
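When scikit-learn's DBSCAN is used with metric='haversine', coordinates are passed in radians and eps is interpreted in radians as well, so a radius in kilometers has to be converted first. The sketch below shows that conversion and checks a few real distances between the sample points. The helper names `haversine_km` and `km_to_radians` are hypothetical, and the Earth radius constant is an assumed mean value.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def km_to_radians(km):
    """Convert a surface distance in km to the angle (radians) it subtends."""
    return km / EARTH_RADIUS_KM


p1 = (40.8379295833, -73.70228875)
p2 = (40.8379525833, -73.70209875)
p3 = (40.750613794, -73.993434906)

near = haversine_km(p1, p2)   # two points a few meters apart
far = haversine_km(p1, p3)    # two points tens of km apart
eps_1km = km_to_radians(1.0)
print(near, far, eps_1km)
```

Printing a few pairwise distances like this is a quick way to pick an eps that separates the groups you expect: here the first two points are only meters apart, while the third is tens of kilometers away.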

Scala

Dependency

<dependency>
  <groupId>org.scalanlp</groupId>
  <artifactId>nak_2.11</artifactId>
  <version>1.3</version>
</dependency>

Code

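import breeze.linalg.DenseMatrix
import nak.cluster.{DBSCAN, GDBSCAN, Kmeans}

object DbscanExample {
  def main(args: Array[String]): Unit = {
    // One row per point: (latitude, longitude) in degrees
    val matrix = DenseMatrix(
      (40.8379295833, -73.70228875),
      (40.6927066969, -73.8085984165),
      (40.7489736586, -73.9859616017),
      (40.8379525833, -73.70209875),
      (40.6997066969, -73.8085234165),
      (40.7484436586, -73.9857316017),
      (40.750613794,-73.993434906))

    // Euclidean distance is computed on raw degrees, so epsilon is in degrees too.
    // 0.01 degrees (very roughly 1 km at this latitude) is an assumed radius;
    // a value like 1000.0 would put every point into a single cluster.
    val gdbscan = new GDBSCAN(
      DBSCAN.getNeighbours(epsilon = 0.01, distance = Kmeans.euclideanDistance),
      DBSCAN.isCorePoint(minPoints = 1)
    )
    val clusters = gdbscan cluster matrix
    // Print each cluster's id and size, followed by the coordinates of its points
    clusters.foreach { cluster =>
      println(cluster.id + ", " + cluster.points.length)
      cluster.points.foreach(p => p.value.data.foreach(println))
    }
  }
}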

See the reference below for the details of the algorithm.

Reference: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

Other:

http://www.cs.fsu.edu/~ackerman/CIS5930/notes/DBSCAN.pdf

https://www.oreilly.com/ideas/clustering-geolocated-data-using-spark-and-dbscan
