
Scikit-learn: Clustering

Comparing clustering results

Comparison of different clustering algorithms in sklearn on example datasets

Figure (../_images/sphx_glr_plot_cluster_comparison_0011.png): A comparison of the clustering algorithms in scikit-learn
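To make the comparison concrete without the figure, here is a minimal sketch that runs two of the algorithms on a toy dataset with non-flat geometry and scores each against the known labels. The dataset (make_moons), the eps and min_samples values, and the variable names are assumptions chosen for illustration; they are not taken from the original comparison script.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a small toy dataset with non-flat geometry.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

# K-Means partitions by distance to centroids, so it prefers convex clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups points by density; eps and min_samples are illustrative guesses.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true labels.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN  ARI: {ari_dbscan:.2f}")
```

On this kind of data the density-based method should follow the two curved shapes, while K-Means tends to cut straight across them.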

Overview of clustering methods

| Method name | Parameters | Scalability | Usecase | Geometry (metric used) |
|---|---|---|---|---|
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points |
| Agglomerative clustering | number of clusters, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non-Euclidean distances | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
| Birch | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction | Euclidean distance between points |
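The K-Means row mentions that MiniBatch code is what lets it scale to very large n_samples. As a rough sketch of what that looks like in practice, the snippet below fits KMeans and MiniBatchKMeans on the same synthetic blobs and checks that both recover the generating labels. The blob centers, sample count, and batch_size are assumptions picked for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three well-separated blobs (hypothetical centers chosen for this sketch).
centers = np.array([[0.0, 0.0], [6.0, 6.0], [-6.0, 6.0]])
X, y_true = make_blobs(n_samples=20000, centers=centers,
                       cluster_std=1.0, random_state=0)

# Full K-Means uses every sample in each iteration.
full = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# MiniBatchKMeans updates the centroids from small random batches instead.
mini = MiniBatchKMeans(n_clusters=3, batch_size=1024, n_init=3,
                       random_state=0).fit(X)

ari_full = adjusted_rand_score(y_true, full.labels_)
ari_mini = adjusted_rand_score(y_true, mini.labels_)
print(f"KMeans ARI:          {ari_full:.3f}")
print(f"MiniBatchKMeans ARI: {ari_mini:.3f}")
```

On easy data like this the two should agree closely; the mini-batch variant trades a little accuracy for lower cost per iteration on large datasets.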

DBSCAN clustering

Code example

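import os
import pickle
import pwd

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from geopy import distance
from sklearn.cluster import DBSCAN

# CWD, DBSCAN_R (eps, in meters) and DBSCAN_MIN_S (min_samples) are assumed
# to be defined elsewhere in the project.


def Dist(x, y):
    # Rows of ll are (longitude, latitude); geopy expects (latitude, longitude).
    # geodesic replaces distance.vincenty, which was removed in geopy 2.0.
    return distance.geodesic((x[1], x[0]), (y[1], y[0])).meters


df = pd.read_pickle(os.path.join(CWD, 'middlewares/df.pkl'))
ll = df[['longitude', 'latitude']].values
x, y = ll[:, 0], ll[:, 1]

# Fit DBSCAN with the custom distance and cache the fitted model on disk.
print('starting dbscan...')
dbscaner = DBSCAN(eps=DBSCAN_R, min_samples=DBSCAN_MIN_S, metric=Dist, n_jobs=-1).fit(ll)
with open(os.path.join(CWD, 'middlewares/dbscaner.pkl'), 'wb') as f:
    pickle.dump(dbscaner, f)
print('dbscan dumping end...')

# Reload the cached model and plot each cluster in its own color.
with open(os.path.join(CWD, 'middlewares/dbscaner.pkl'), 'rb') as f:
    dbscaner = pickle.load(f)
labels = dbscaner.labels_
# print(set(labels))
colors = plt.cm.Spectral(np.linspace(0, 1, len(set(labels))))
for k, col in zip(set(labels), colors):
    marker = '.'
    if k == -1:
        # Label -1 marks noise points: draw them as black crosses.
        col = 'k'
        marker = 'x'
    inds_k = labels == k
    plt.scatter(x[inds_k], y[inds_k], marker=marker, color=col)

# Save the figure or show it interactively, depending on the current user.
if pwd.getpwuid(os.geteuid()).pw_name == 'piting':
    plt.savefig('./1.png')
elif pwd.getpwuid(os.geteuid()).pw_name == 'pipi':
    plt.show()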
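A Python callable passed as metric is evaluated pair by pair, which can get slow on larger datasets. One possible alternative for longitude/latitude data is scikit-learn's built-in haversine metric with a ball tree. This is a different technique from the geopy distance above: it treats the Earth as a sphere, so distances are approximate. The sketch below uses synthetic coordinates, and the names EARTH_RADIUS_M, ll_demo, and eps_meters are hypothetical values introduced only for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation

rng = np.random.default_rng(0)
# Synthetic (longitude, latitude) rows in degrees, same column order as `ll` above.
centers_lonlat = np.array([[116.40, 39.90], [121.47, 31.23]])
clusters = [c + rng.normal(scale=0.001, size=(50, 2)) for c in centers_lonlat]
outlier = np.array([[0.0, 0.0]])  # one isolated point far from both groups
ll_demo = np.vstack(clusters + [outlier])

# The haversine metric expects [latitude, longitude] in radians,
# so swap the columns before converting.
latlon_rad = np.radians(ll_demo[:, ::-1])

# Haversine distances are in radians on the unit sphere,
# so eps in meters has to be divided by the Earth radius.
eps_meters = 500.0  # hypothetical neighborhood radius
db = DBSCAN(eps=eps_meters / EARTH_RADIUS_M, min_samples=5,
            metric='haversine', algorithm='ball_tree').fit(latlon_rad)

labels_demo = db.labels_
n_clusters = len(set(labels_demo) - {-1})
n_noise = int((labels_demo == -1).sum())
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

If ellipsoidal accuracy matters, the geopy-based callable remains the more precise choice; the haversine variant is mainly a speed trade-off.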

ref: