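If you search online for machine-learning tutorial series, most of them open with KNN (k nearest neighbors), because the algorithm is so simple that there is hardly anything to say about it.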





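As mentioned earlier, KNN judges how similar two samples are by the distance between them (in other words, the distance between two points in N-dimensional space). That raises a question: how do we express "the distance between two points"?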

Distance in two-dimensional space: $$\sqrt {(x^{(a)}-x^{(b)})^2+(y^{(a)}-y^{(b)})^2}$$

Distance in three-dimensional space: $$\sqrt {(x^{(a)}-x^{(b)})^2+(y^{(a)}-y^{(b)})^2+(z^{(a)}-z^{(b)})^2}$$

Generalizing, distance in N-dimensional space: $$\sqrt {(x_1^{(a)}-x_1^{(b)})^2+(x_2^{(a)}-x_2^{(b)})^2+\cdots+(x_n^{(a)}-x_n^{(b)})^2} =\sqrt {\sum_{i=1}^n(x_i^{(a)}-x_i^{(b)})^2}$$
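The N-dimensional formula above translates almost word for word into NumPy. This is only a minimal sketch: the helper name `euclidean_distance` is made up for illustration and is not part of any library.

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two points with the same number of dimensions."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Square the per-dimension differences, sum them, take the square root.
    return np.sqrt(np.sum((a - b) ** 2))

d2 = euclidean_distance([0, 0], [3, 4])        # two points in 2-D space
d3 = euclidean_distance([1, 2, 3], [1, 2, 3])  # the same point in 3-D space
print(d2, d3)
```

The same function works for any number of dimensions, as long as both points have the same length.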


In practice there are other ways to measure distance, such as the Manhattan distance: $$\sum_{i=1}^n |x_i^{(a)}-x_i^{(b)}|$$

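The Euclidean distance and the Manhattan distance can both be expressed as special cases of the Minkowski distance $$\left(\sum_{i=1}^n |x_i^{(a)}-x_i^{(b)}|^p\right)^{\frac 1 p}$$ where $p=1$ gives the Manhattan distance and $p=2$ gives the Euclidean distance.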
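To see that the Minkowski distance really covers both cases, here is a small sketch that evaluates it with p=1 and p=2 on the same pair of points. The function name `minkowski_distance` is chosen for illustration only.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = [0, 0], [3, 4]
manhattan = minkowski_distance(a, b, p=1)  # |3| + |4|
euclidean = minkowski_distance(a, b, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```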



Metrics intended for real-valued vector spaces:

identifier      class name            args      distance function
"euclidean"     EuclideanDistance               sqrt(sum((x - y)^2))
"manhattan"     ManhattanDistance               sum(|x - y|)
"chebyshev"     ChebyshevDistance               max(|x - y|)
"minkowski"     MinkowskiDistance     p         sum(|x - y|^p)^(1/p)
"wminkowski"    WMinkowskiDistance    p, w      sum(|w * (x - y)|^p)^(1/p)
"seuclidean"    SEuclideanDistance    V         sqrt(sum((x - y)^2 / V))
"mahalanobis"   MahalanobisDistance   V or VI   sqrt((x - y)' V^-1 (x - y))

Metrics intended for integer-valued vector spaces: Though intended for integer-valued vectors, these are also valid metrics in the case of real-valued vectors.

identifier      class name            distance function
"hamming"       HammingDistance       N_unequal(x, y) / N_tot
"canberra"      CanberraDistance      sum(|x - y| / (|x| + |y|))
"braycurtis"    BrayCurtisDistance    sum(|x - y|) / (sum(|x|) + sum(|y|))
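A few of the listed distance functions, written out directly in NumPy, make the formulas easier to read. This sketch does not call scikit-learn's metric classes; it just evaluates the expressions by hand on two example vectors chosen for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 4.0, 3.0, 1.0])

# max(|x - y|): the largest difference along any single dimension
chebyshev = np.max(np.abs(x - y))

# N_unequal(x, y) / N_tot: the fraction of positions where the vectors differ
hamming = np.sum(x != y) / x.size

# sum(|x - y| / (|x| + |y|)); this sketch assumes no position has x = y = 0
canberra = np.sum(np.abs(x - y) / (np.abs(x) + np.abs(y)))

# sum(|x - y|) / (sum(|x|) + sum(|y|))
braycurtis = np.sum(np.abs(x - y)) / (np.sum(np.abs(x)) + np.sum(np.abs(y)))

print(chebyshev, hamming, canberra, braycurtis)
```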



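Now that we know how to compute distances, it seems our KNN is ready to work. But another problem comes up. Consider this scenario: we choose K=3 and, as luck would have it, the three nearest samples turn out to be equally distant from the query while each belongs to a different class. How should we classify the query then?

If that example feels too extreme, consider this one: we compute the three points nearest to the sample under test, p1, p2 and p3, which belong to classes A, B and B respectively. The distance from the sample to p1 is 1, to p2 is 100, and to p3 is 50. The sample is clearly very close to p1, so assigning it to class A is the more reasonable choice. A plain vote among p1, p2 and p3, however, assigns it to class B.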
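The p1, p2, p3 scenario can be replayed in a few lines. The sketch below compares a plain majority vote with a vote in which each neighbor counts as 1 / distance. Inverse-distance weighting is just one possible weighting scheme, and the code assumes no neighbor sits at distance 0.

```python
from collections import Counter, defaultdict

# (class label, distance to the query point) for the neighbors p1, p2, p3
neighbors = [("A", 1.0), ("B", 100.0), ("B", 50.0)]

# Plain majority vote: every neighbor counts exactly once.
majority = Counter(label for label, _ in neighbors).most_common(1)[0][0]

# Distance-weighted vote: each neighbor contributes 1 / distance.
scores = defaultdict(float)
for label, dist in neighbors:
    scores[label] += 1.0 / dist
weighted = max(scores, key=scores.get)

print(majority, weighted)
```

With weights, class A scores 1.0 while class B scores only 0.01 + 0.02 = 0.03, so the single close neighbor outvotes the two distant ones.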



  • 'uniform': uniform weights. All points in each neighborhood are weighted equally.
  • 'distance': weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
  • [callable]: a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.

'uniform' means all neighbors carry equal weight. It is the default value in sklearn.
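Here is a sketch of how the three kinds of `weights` values behave on a toy data set. The data are made up for illustration: one class-A point at distance 1 from the query and two class-B points at distances 50 and 100. The inverse-square callable is an arbitrary example of a user-defined weight function.

```python
from sklearn.neighbors import KNeighborsClassifier

# One-dimensional toy data: the query sits at 0.
X = [[1], [50], [100]]
y = ["A", "B", "B"]
query = [[0]]

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)
# A callable receives an array of distances and returns weights of the same shape.
custom_knn = KNeighborsClassifier(n_neighbors=3, weights=lambda d: 1.0 / d ** 2).fit(X, y)

uniform_pred = uniform_knn.predict(query)[0]
distance_pred = distance_knn.predict(query)[0]
custom_pred = custom_knn.predict(query)[0]
print(uniform_pred, distance_pred, custom_pred)
```

With equal weights the two B points win the vote. With either distance-based weighting the single nearby A point dominates.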





class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)

n_neighbors int, optional (default = 5)

Number of neighbors to use by default for kneighbors queries.

weights str or callable, optional (default = 'uniform')

Weight function used in prediction. Possible values:

algorithm {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional

Algorithm used to compute the nearest neighbors:

  • 'ball_tree' will use BallTree
  • 'kd_tree' will use KDTree
  • 'brute' will use a brute-force search.
  • 'auto' will attempt to decide the most appropriate algorithm based on the values passed to the fit method.

Note: fitting on sparse input will override the setting of this parameter, using brute force.
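The `algorithm` option only changes how the neighbors are searched for, not which neighbors are found. As a rough check under that assumption, the sketch below fits the same randomly generated data (synthetic, for illustration only) with each option and compares the predictions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data, generated only for this illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
queries = rng.normal(size=(20, 2))

predictions = {}
for algorithm in ("brute", "kd_tree", "ball_tree", "auto"):
    clf = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm).fit(X, y)
    predictions[algorithm] = clf.predict(queries)

# All search strategies are exact, so the predicted labels should agree.
same = all(np.array_equal(predictions["brute"], p) for p in predictions.values())
print(same)
```

On larger data sets the choice mainly affects fitting and query time, which is why 'auto' is usually a reasonable starting point.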

leaf_size int, optional (default = 30)

Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

p integer, optional (default = 2)

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

metric string or callable, default 'minkowski'

The distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of the DistanceMetric class for a list of available metrics.
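The effect of `p` can be observed through the `kneighbors` method, which returns the distances to the nearest neighbors. In this sketch the two training points and the helper `nearest_distances` are made up for illustration. The default `metric='minkowski'` is left untouched and only `p` changes.

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [3, 4]]
y = [0, 1]

def nearest_distances(p):
    """Distances from the origin to both training points under Minkowski order p."""
    clf = KNeighborsClassifier(n_neighbors=2, p=p).fit(X, y)
    distances, _ = clf.kneighbors([[0, 0]])
    return distances[0]

d_manhattan = nearest_distances(p=1)  # second neighbor is at |3| + |4|
d_euclidean = nearest_distances(p=2)  # second neighbor is at sqrt(3^2 + 4^2)
print(d_manhattan, d_euclidean)
```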

metric_params dict, optional (default = None)

Additional keyword arguments for the metric function.

n_jobs int or None, optional (default=None)

The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. Doesn’t affect fit method.

>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> from sklearn.neighbors import KNeighborsClassifier
>>> neigh = KNeighborsClassifier(n_neighbors=3)
>>> neigh.fit(X, y)
KNeighborsClassifier(...)
>>> print(neigh.predict([[1.1]]))
[0]
>>> print(neigh.predict_proba([[0.9]]))
[[0.66666667 0.33333333]]
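To connect the library call back to the algorithm, here is a from-scratch sketch that mirrors the example above on the same data. It is not how scikit-learn is implemented internally. The helper `knn_predict_proba` is a hypothetical name, and the sketch assumes Euclidean distance with uniform weights.

```python
import numpy as np
from collections import Counter

def knn_predict_proba(X_train, y_train, x_query, k):
    """Fraction of the k nearest training samples that belong to each class."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_query = np.asarray(x_query, dtype=float)
    # Euclidean distance from the query to every training sample.
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Indices of the k closest samples.
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    classes = sorted(set(y_train.tolist()))
    return {c: votes.get(c, 0) / k for c in classes}

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

proba = knn_predict_proba(X, y, [0.9], k=3)
# The predicted class is the one with the largest share of votes.
label = max(knn_predict_proba(X, y, [1.1], k=3).items(), key=lambda kv: kv[1])[0]
print(proba, label)
```

For the query 0.9 the three nearest samples are 1, 0 and 2, with labels 0, 0 and 1, which gives the same 2/3 versus 1/3 split that `predict_proba` reports.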






This brings us to a new topic: data normalization. Normalization maps all the data onto the same scale.


Min-max normalization maps every value into the range [0, 1]:

$$x_{scale} = \frac {x - x_{min}} {x_{max} - x_{min}}$$
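The formula above, which rescales each feature by its minimum and maximum, can be sketched in NumPy as follows. The function name `min_max_scale` and the sample matrix are made up for illustration, and the code assumes no feature column is constant (otherwise the denominator would be 0).

```python
import numpy as np

def min_max_scale(X):
    """Rescale every column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

# Two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 1000.0]])
X_scaled = min_max_scale(X)
print(X_scaled)
```

scikit-learn ships the same transformation as `MinMaxScaler` in `sklearn.preprocessing`.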



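Standardization rescales the data to zero mean and unit variance, where $x_{mean}$ is the mean of the feature and $S$ is its standard deviation:

$$x_{scale} = \frac {x - x_{mean}} S$$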
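The formula above subtracts the mean and divides by S. The sketch below reads S as the standard deviation of each feature, using the population standard deviation (NumPy's default); that reading is an assumption here. The function name `standardize` is illustrative only.

```python
import numpy as np

def standardize(X):
    """Shift every column of X to mean 0 and scale it to standard deviation 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 1000.0]])
X_std = standardize(X)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

Unlike min-max scaling, the result is not confined to a fixed range, which makes it less sensitive to a single extreme value. scikit-learn provides this transformation as `StandardScaler` in `sklearn.preprocessing`.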



