[Original] Algorithm Sharing (5): The DBSCAN Clustering Algorithm
Introduction
DBSCAN (Density-based spatial clustering of applications with noise) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors).
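How it works
DBSCAN is a density-based clustering algorithm, and the procedure is fairly simple: points that lie close together (a center point and its neighbors) are grouped into one cluster. The algorithm then keeps adding the neighbors of those neighbors to the cluster until it can no longer grow, and moves on to the remaining unvisited points.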
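The expansion loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `region_query` and `dbscan` are hypothetical, Euclidean distance is assumed, a point counts as its own neighbor, and the neighborhood search is a brute-force scan.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, 2, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        # points[i] is a center point: start a new cluster and grow it
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # neighbors of neighbors
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=2))  # [0, 0, 0, 1, 1, -1]
```

The brute-force `region_query` makes the whole run quadratic in the number of points; a spatial index such as a ball tree is the usual way to speed up the neighborhood search.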
Algorithm pseudocode
Sub-procedure pseudocode
DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a dense region (minPts).
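In other words, DBSCAN has two main parameters: the distance Eps and the minimum number of neighbors MinPts. A center point and its neighbors can form a cluster only when the number of points within radius Eps of the center point is at least MinPts.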
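To make the roles of the two parameters concrete, the sketch below labels each point as core, border, or noise for a given Eps and MinPts. The function name `classify_points` is hypothetical, Euclidean distance is assumed, and a point is counted as its own neighbor (the convention scikit-learn uses for `min_samples`).

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Return 'core', 'border' or 'noise' for each point."""
    # Neighborhood of every point, the point itself included
    neighbours = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbours]
    kinds = []
    for i, n in enumerate(neighbours):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in n):
            kinds.append("border")  # not dense itself, but within eps of a core point
        else:
            kinds.append("noise")
    return kinds


points = [(0, 0), (0, 1), (1, 0), (2, 0), (9, 9)]
print(classify_points(points, eps=1.0, min_pts=3))
# ['core', 'border', 'core', 'border', 'noise']
```

Under this counting convention, setting MinPts to 1 makes every point a core point, so no point is ever reported as noise.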
Code implementation
Python
Code
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def main_fun():
    # (latitude, longitude) pairs in degrees
    loc_data = [(40.8379295833, -73.70228875),
                (40.750613794, -73.993434906),
                (40.6927066969, -73.8085984165),
                (40.7489736586, -73.9859616017),
                (40.8379525833, -73.70209875),
                (40.6997066969, -73.8085234165),
                (40.7484436586, -73.9857316017)]
    # The haversine metric works in radians, so eps must be an angle.
    # Assumption: the intended radius is 10 km.
    epsilon = 10 / EARTH_RADIUS_KM
    db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree',
                metric='haversine').fit(np.radians(loc_data))
    labels = db.labels_
    print(labels)
    # Label -1 marks noise and does not count as a cluster
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    for i in range(n_clusters_):
        # np.where returns a tuple; take the index array of the first axis
        for j in np.where(labels == i)[0]:
            print(loc_data[j])


if __name__ == '__main__':
    main_fun()
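Because the haversine metric takes coordinates in radians and returns an angle, a radius given in kilometers has to be divided by the Earth's radius before it is passed as `eps`. The sketch below shows that conversion with a hand-written haversine formula. The names `haversine_rad` and `km_to_eps` are hypothetical helpers, and the Earth radius constant is an assumed mean value.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_rad(p, q):
    """Great-circle distance between two (lat, lon) points given in degrees.

    The result is an angle in radians; multiply by the Earth radius for km.
    """
    lat1, lon1 = map(radians, p)
    lat2, lon2 = map(radians, q)
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * asin(sqrt(a))


def km_to_eps(km):
    """Convert a radius in kilometers into an eps value in radians."""
    return km / EARTH_RADIUS_KM


a = (40.8379295833, -73.70228875)
b = (40.750613794, -73.993434906)
print(haversine_rad(a, b) * EARTH_RADIUS_KM)  # distance between a and b in km
print(km_to_eps(10))                          # eps for a 10 km radius
```

Any `eps` larger than π radians covers the whole sphere, so with such a value every point ends up in the same cluster.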
Scala
Dependency
<dependency>
<groupId>org.scalanlp</groupId>
<artifactId>nak_2.11</artifactId>
<version>1.3</version>
</dependency>
Code
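import breeze.linalg.DenseMatrix
import nak.cluster.{DBSCAN, GDBSCAN, Kmeans}

// One row per point: (latitude, longitude) in degrees
val matrix = DenseMatrix(
  (40.8379295833, -73.70228875),
  (40.6927066969, -73.8085984165),
  (40.7489736586, -73.9859616017),
  (40.8379525833, -73.70209875),
  (40.6997066969, -73.8085234165),
  (40.7484436586, -73.9857316017),
  (40.750613794, -73.993434906))

val gdbscan = new GDBSCAN(
  // Euclidean distance is applied to the raw coordinates, so epsilon is in degrees.
  // Assumption: 0.1 degrees (roughly 10 km at this latitude) is the intended radius.
  DBSCAN.getNeighbours(epsilon = 0.1, distance = Kmeans.euclideanDistance),
  DBSCAN.isCorePoint(minPoints = 1)
)

val clusters = gdbscan cluster matrix

// Print each cluster's id and size, followed by the coordinates of its points
clusters.foreach(cluster => {
  println(cluster.id + ", " + cluster.points.length)
  cluster.points.foreach(p => p.value.data.foreach(println))
})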
See the reference below for the details of the algorithm.
Reference: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
Other:
http://www.cs.fsu.edu/~ackerman/CIS5930/notes/DBSCAN.pdf
https://www.oreilly.com/ideas/clustering-geolocated-data-using-spark-and-dbscan