
Source code walkthrough: the k-means++ centroid-initialization method `_k_init` (called by `k_means`)

This article reflects my personal understanding. Since I am new to this topic and my abilities are limited, there may be mistakes; corrections in the comments are welcome, and I will gladly learn from them. Thank you.
```python
import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.utils.extmath import stable_cumsum


def _k_init(X, n_clusters, x_squared_norms, random_state, n_local_trials=None):
    """Initialize the centroids according to k-means++.
    @:parameter X : input data; should be double precision (dtype=np.float64).
    @:parameter n_clusters : integer, the number of centroids.
    @:parameter x_squared_norms : array, shape (n_samples,), the squared
        Euclidean norm of each data point.
    @:parameter random_state : numpy.RandomState, the random number generator
        used to initialize the centers.
    @:parameter n_local_trials : integer, optional. The number of seeding
        trials for each center (except the first), of which the one reducing
        inertia the most is greedily chosen. Set to None to make the number
        of trials depend logarithmically on the number of seeds
        (2 + log(k)); this is the default.

    k-means++ picks the initial cluster centers in a particular way so that
    k-means converges faster.
    """
    n_samples, n_features = X.shape
    centers = np.empty((n_clusters, n_features), dtype=X.dtype)
    assert x_squared_norms is not None, 'x_squared_norms None in _k_init'

    # If the number of local seeding trials was not set, set it here
    if n_local_trials is None:
        # This is what Arthur/Vassilvitskii tried, but did not report
        # specific results for other than mentioning in the conclusion
        # that it helped.
        n_local_trials = 2 + int(np.log(n_clusters))

    # Pick the first center at random
    center_id = random_state.randint(n_samples)
    if sp.issparse(X):
        centers[0] = X[center_id].toarray()
    else:
        centers[0] = X[center_id]

    # Initialize the array of closest distances and compute the current potential
    closest_dist_sq = euclidean_distances(
        centers[0, np.newaxis], X, Y_norm_squared=x_squared_norms,
        squared=True)  # squared distances from every point in X to the first center
    current_pot = closest_dist_sq.sum()  # sum of the squared distances (the "potential")

    # Pick the remaining n_clusters - 1 centers
    for c in range(1, n_clusters):
        # Sample candidate centers with probability proportional to the
        # squared distance to the closest existing center
        rand_vals = random_state.random_sample(n_local_trials) * current_pot
        # Insert rand_vals into the (sorted) cumulative sum of the distance
        # array and return the insertion indices (inverse-CDF sampling)
        candidate_ids = np.searchsorted(stable_cumsum(closest_dist_sq),
                                        rand_vals)

        # Compute squared distances from every point to each candidate
        distance_to_candidates = euclidean_distances(
            X[candidate_ids], X, Y_norm_squared=x_squared_norms, squared=True)

        # Decide which candidate center is the best
        best_candidate = None
        best_pot = None
        best_dist_sq = None
        for trial in range(n_local_trials):
            # Compute potential when including center candidate:
            # element-wise minimum of the two arrays, i.e. the distance to the
            # closest center if this candidate were added
            new_dist_sq = np.minimum(closest_dist_sq,
                                     distance_to_candidates[trial])
            new_pot = new_dist_sq.sum()  # the resulting potential

            # Store the result if it is the best trial so far
            if (best_candidate is None) or (new_pot < best_pot):
                best_candidate = candidate_ids[trial]
                best_pot = new_pot
                best_dist_sq = new_dist_sq

        # Permanently add the best center candidate found in the local
        # trials to the set of centers
        if sp.issparse(X):
            centers[c] = X[best_candidate].toarray()
        else:
            centers[c] = X[best_candidate]
        current_pot = best_pot
        closest_dist_sq = best_dist_sq

    return centers
```
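To make the sampling scheme concrete, here is a minimal dense-only sketch of the same seeding idea in plain NumPy. The helper name `kmeans_pp_init` is my own, and it swaps sklearn's `euclidean_distances`/`stable_cumsum` for direct NumPy computations, so it is an illustration of the technique rather than the library's implementation:

```python
import numpy as np

def kmeans_pp_init(X, n_clusters, rng, n_local_trials=None):
    """Dense-only sketch of k-means++ seeding (hypothetical helper)."""
    n_samples, n_features = X.shape
    centers = np.empty((n_clusters, n_features), dtype=X.dtype)

    if n_local_trials is None:
        n_local_trials = 2 + int(np.log(n_clusters))  # 2 + log(k) trials

    # First center: uniformly at random
    centers[0] = X[rng.randint(n_samples)]

    # Squared distance from every point to its closest chosen center
    closest_dist_sq = ((X - centers[0]) ** 2).sum(axis=1)
    current_pot = closest_dist_sq.sum()

    for c in range(1, n_clusters):
        # D^2 sampling: invert the CDF of closest_dist_sq via searchsorted
        rand_vals = rng.random_sample(n_local_trials) * current_pot
        candidate_ids = np.searchsorted(np.cumsum(closest_dist_sq), rand_vals)

        # Greedily keep the candidate that reduces the potential the most
        best_pot, best_id, best_dist_sq = None, None, None
        for cid in candidate_ids:
            d2 = ((X - X[cid]) ** 2).sum(axis=1)
            new_dist_sq = np.minimum(closest_dist_sq, d2)
            new_pot = new_dist_sq.sum()
            if best_pot is None or new_pot < best_pot:
                best_pot, best_id, best_dist_sq = new_pot, cid, new_dist_sq

        centers[c] = X[best_id]
        current_pot, closest_dist_sq = best_pot, best_dist_sq

    return centers

# Two well-separated blobs: the two seeds should land in different blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 100.0])
centers = kmeans_pp_init(X, 2, rng)
print(centers)  # one seed near (0, 0), one near (100, 100)
```

Because the sampling weight of a point is its squared distance to the nearest existing center, a point in the far blob is overwhelmingly more likely to be drawn as the second seed than a near neighbor of the first seed, which is exactly why k-means++ tends to spread the initial centers out.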

Reference: https://github.com/scikit-learn/scikit-learn