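Scikit-learn: Clustering
阿新 • Published: 2019-01-05
Comparison of clustering results
Comparing different clustering algorithms on the sklearn examples
Overview of clustering methods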
Method name | Parameters | Scalability | Use case | Geometry (metric used) |
---|---|---|---|---|
K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
Ward hierarchical clustering | number of clusters | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points |
Agglomerative clustering | number of clusters, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non-Euclidean distances | Any pairwise distance |
DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
Birch | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction | Euclidean distance between points |
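To make the "Parameters" column concrete, here is a minimal sketch that maps each row of the table to its scikit-learn estimator and fits it on toy data. The dataset and every parameter value are assumptions chosen for illustration; they are not tuned and do not come from the original post.

```python
import numpy as np
from sklearn.cluster import (
    AffinityPropagation,
    AgglomerativeClustering,
    Birch,
    DBSCAN,
    KMeans,
    MeanShift,
    SpectralClustering,
)
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data: three compact blobs (illustrative only).
X, _ = make_blobs(
    n_samples=150,
    centers=[[-3, -3], [0, 3], [3, -3]],
    cluster_std=0.6,
    random_state=0,
)

# One estimator per table row; keyword arguments mirror the "Parameters" column.
estimators = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Affinity propagation": AffinityPropagation(damping=0.9, random_state=0),
    "Mean-shift": MeanShift(bandwidth=2.0),
    "Spectral clustering": SpectralClustering(n_clusters=3, random_state=0),
    "Ward hierarchical clustering": AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "Agglomerative clustering": AgglomerativeClustering(n_clusters=3, linkage="average"),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),  # eps is the neighborhood size
    "Gaussian mixtures": GaussianMixture(n_components=3, random_state=0),
    "Birch": Birch(branching_factor=50, threshold=0.5, n_clusters=3),
}

results = {}
for name, estimator in estimators.items():
    labels = estimator.fit_predict(X)
    results[name] = labels
    # DBSCAN marks noise with the label -1, so exclude it from the cluster count.
    n_found = len(set(labels) - {-1})
    print(f"{name}: {n_found} clusters")
```

All of these estimators expose `fit_predict`, which makes side-by-side comparison straightforward. On real data the choice should follow the "Scalability" and "Use case" columns rather than this toy run.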
DBSCAN clustering
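Code example

    import os
    import pickle
    import pwd  # Unix-only; used below to pick the plotting behavior per user

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from geopy import distance
    from sklearn.cluster import DBSCAN

    # CWD, DBSCAN_R (eps, in the units returned by Dist, i.e. meters) and
    # DBSCAN_MIN_S (min_samples) are defined elsewhere in the author's project;
    # their values are not shown in the post.


    def Dist(x, y):
        """Distance in meters between two (longitude, latitude) points."""
        # geopy expects (latitude, longitude), so swap the order here.
        # distance.vincenty() was removed in geopy 2.0; geodesic() replaces it.
        return distance.geodesic((x[1], x[0]), (y[1], y[0])).meters


    df = pd.read_pickle(os.path.join(CWD, 'middlewares/df.pkl'))
    ll = df[['longitude', 'latitude']].values
    x, y = ll[:, 0], ll[:, 1]

    # Fit DBSCAN with the custom metric and cache the fitted model on disk.
    print('starting dbscan...')
    dbscaner = DBSCAN(eps=DBSCAN_R, min_samples=DBSCAN_MIN_S, metric=Dist, n_jobs=-1).fit(ll)
    with open(os.path.join(CWD, 'middlewares/dbscaner.pkl'), 'wb') as f:
        pickle.dump(dbscaner, f)
    print('dbscan dumping end...')

    # Reload the cached model and plot one color per cluster.
    with open(os.path.join(CWD, 'middlewares/dbscaner.pkl'), 'rb') as f:
        dbscaner = pickle.load(f)
    labels = dbscaner.labels_
    # print(set(labels))
    colors = plt.cm.Spectral(np.linspace(0, 1, len(set(labels))))
    for k, col in zip(set(labels), colors):
        marker = '.'
        if k == -1:
            # Label -1 marks noise points: draw them as black crosses.
            col = 'k'
            marker = 'x'
        inds_k = labels == k
        plt.scatter(x[inds_k], y[inds_k], marker=marker, color=col)

    # Save the figure on one account, show it interactively on the other.
    user = pwd.getpwuid(os.geteuid()).pw_name
    if user == 'piting':
        plt.savefig('./1.png')
    elif user == 'pipi':
        plt.show()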
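A Python callable passed as `metric` is evaluated pair by pair, which tends to be slow on large datasets. As a hedged alternative, and not the author's method, the sketch below swaps in scikit-learn's built-in `haversine` metric with a ball tree. It measures great-circle distance on a sphere, so results can differ slightly from geopy's ellipsoidal distance. The coordinates, `eps_m`, and `min_samples` are made-up values for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation)

rng = np.random.default_rng(0)
# Hypothetical (longitude, latitude) data: two dense groups plus one distant outlier.
group_a = np.array([116.40, 39.90]) + rng.normal(scale=0.001, size=(30, 2))
group_b = np.array([121.47, 31.23]) + rng.normal(scale=0.001, size=(30, 2))
outlier = np.array([[100.0, 0.0]])
lon_lat = np.vstack([group_a, group_b, outlier])

# The haversine metric expects [latitude, longitude] in radians,
# so reverse the column order before converting.
lat_lon_rad = np.radians(lon_lat[:, ::-1])

eps_m = 1000.0  # neighborhood radius in meters (assumed value)
db = DBSCAN(
    eps=eps_m / EARTH_RADIUS_M,  # haversine distances are in radians
    min_samples=5,
    metric="haversine",
    algorithm="ball_tree",
).fit(lat_lon_rad)

labels = db.labels_
n_clusters = len(set(labels) - {-1})  # -1 is the noise label
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

Dividing the radius in meters by the Earth's radius converts it to the radians that `haversine` works in; the rest of the pipeline (reading `labels_`, plotting noise separately) stays the same as in the code above.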