scikit-learn_cookbook1: High-Performance Machine Learning - NumPy
Source code download
Main topics covered in this chapter:
- NumPy basics
- Loading the iris dataset
- Viewing the iris dataset
- Viewing the iris dataset with pandas
- Plotting with NumPy and matplotlib
- A minimal machine learning recipe - SVM classification
- Introducing cross-validation
- Putting it all together
- Machine learning overview - classification and regression
Introduction
In this chapter we will learn how to make predictions with scikit-learn. Machine learning emphasizes measuring predictive ability, and scikit-learn lets us make accurate and fast predictions. We will examine the iris dataset, which consists of measurements of three iris species: Iris Setosa, Iris Versicolor, and Iris Virginica.
To measure the predictions, we will (a minimal sketch follows this list):
- Hold back some of the data for testing
- Build a model using only the training data
- Measure predictive power on the test set
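Below is a minimal sketch of the hold-out step, assuming the iris data introduced later in this chapter; the 25% test size and the random_state value are illustrative choices, not from the original text.
#!python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Hold some data back for testing; the split parameters are illustrative.
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=7)
print(X_train.shape, X_test.shape)   # (112, 4) (38, 4)

The model is then fit on X_train and y_train only, and its predictions are scored against y_test, as in the SVM quick start at the end of this chapter.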
Approaches to the problems we want to solve (a brief orientation sketch follows this list):
- Classification:
  - Non-text data, such as the iris measurements
- Regression
- Clustering
- Dimensionality reduction
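As a quick orientation (my own mapping, not something stated in the original text), one commonly used scikit-learn estimator for each problem type listed above:
#!python
from sklearn.svm import SVC                        # classification
from sklearn.linear_model import LinearRegression  # regression
from sklearn.cluster import KMeans                 # clustering
from sklearn.decomposition import PCA              # dimensionality reduction

# All of these follow the same estimator interface: fit(), then predict() or transform().
for estimator in (SVC(), LinearRegression(), KMeans(n_clusters=3), PCA(n_components=2)):
    print(type(estimator).__name__)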
NumPy basics
Data science often deals with structured tables of data, and the scikit-learn library requires two-dimensional NumPy arrays. In this section you will learn about:
- NumPy shape and dimension
#!python
In [1]: import numpy as np
In [2]: np.arange(10)
Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [3]: array_1 = np.arange(10)
In [4]: array_1.shape
Out[4]: (10,)
In [5]: array_1.ndim
Out[5]: 1
In [6]: array_1.reshape((5,2))
Out[6]:
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
In [7]: array_1 = array_1.reshape((5,2))
In [8]: array_1.ndim
Out[8]: 2
- NumPy broadcasting
#!python
In [9]: array_1 + 1
Out[9]:
array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10]])
In [10]: array_2 = np.arange(10)
In [11]: array_2 * array_2
Out[11]: array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])
In [12]: array_2 = array_2 ** 2  # Note that this is equivalent to array_2 * array_2
In [13]: array_2
Out[13]: array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])
In [14]: array_2 = array_2.reshape((5,2))
In [15]: array_2
Out[15]:
array([[ 0,  1],
       [ 4,  9],
       [16, 25],
       [36, 49],
       [64, 81]])
In [16]: array_1 = array_1 + 1
In [17]: array_1
Out[17]:
array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10]])
In [18]: array_1 + array_2
Out[18]:
array([[ 1,  3],
       [ 7, 13],
       [21, 31],
       [43, 57],
       [73, 91]])

scikit-learn-cookbook-numpy-compare-rule.png
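To make the comparison rule from the figure concrete, here is a small sketch of my own (not part of the original session): arrays broadcast when their trailing dimensions match, or when one of them is 1.
#!python
import numpy as np

a = np.arange(10).reshape((5, 2))   # shape (5, 2)
b = np.array([10, 100])             # shape (2,)
print(a + b)                        # b is stretched across the 5 rows

c = np.arange(5).reshape((5, 1))    # shape (5, 1)
print(a + c)                        # c is stretched across the 2 columns

# a + np.array([10, 100, 1000]) would raise a ValueError,
# because the trailing dimensions (2 and 3) do not match.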
- Initializing NumPy arrays and dtypes
#!python
In [19]: np.zeros((5,2))
Out[19]:
array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])
In [20]: np.ones((5,2), dtype=int)    # np.int is deprecated in recent NumPy; the builtin int works
Out[20]:
array([[1, 1],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1]])
In [21]: np.empty((5,2), dtype=float) # np.empty returns uninitialized memory; np.float is likewise deprecated
Out[21]:
array([[0.00000000e+000, 0.00000000e+000],
       [6.90082649e-310, 6.90082647e-310],
       [6.90072710e-310, 6.90072711e-310],
       [6.90083466e-310, 0.00000000e+000],
       [6.90083921e-310, 1.90979621e-310]])
- Indexing
#!python
In [22]: array_1[0,0]   # Find the value in the first row and first column
Out[22]: 1
In [23]: array_1[0,:]   # View the first row
Out[23]: array([1, 2])
In [24]: array_1[:,0]   # View the first column
Out[24]: array([1, 3, 5, 7, 9])
In [25]: array_1[2:5, :]
Out[25]:
array([[ 5,  6],
       [ 7,  8],
       [ 9, 10]])
In [26]: array_1
Out[26]:
array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10]])
In [27]: array_1[2:5,0]
Out[27]: array([5, 7, 9])
- Boolean arrays
#!python
In [28]: array_1 > 5
Out[28]:
array([[False, False],
       [False, False],
       [False,  True],
       [ True,  True],
       [ True,  True]])
In [29]: array_1[array_1 > 5]
Out[29]: array([ 6,  7,  8,  9, 10])
- Arithmetic operations
#!python
In [30]: array_1.sum()
Out[30]: 55
In [31]: array_1.sum(axis=1)   # Find all the sums by row
Out[31]: array([ 3,  7, 11, 15, 19])
In [32]: array_1.sum(axis=0)   # Find all the sums by column
Out[32]: array([25, 30])
In [33]: array_1.mean(axis=0)
Out[33]: array([5., 6.])
- NaN values
#!python
# scikit-learn does not accept np.nan
In [34]: array_3 = np.array([np.nan, 0, 1, 2, np.nan])
In [35]: np.isnan(array_3)
Out[35]: array([ True, False, False, False,  True])
In [36]: array_3[~np.isnan(array_3)]
Out[36]: array([0., 1., 2.])
In [37]: array_3[np.isnan(array_3)] = 0
In [38]: array_3
Out[38]: array([0., 0., 1., 2., 0.])
scikit-learn only accepts two-dimensional NumPy arrays of real numbers with no missing np.nan values. From experience, it is best to either replace np.nan with some value or drop those entries altogether. Personally, I like to keep track of a boolean mask and keep the shape of the data roughly the same, because this leads to fewer coding errors and more coding flexibility.
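A short sketch of that boolean-mask habit, using the same toy array as above; imputing with zero is only an example, not a recommendation from the original text.
#!python
import numpy as np

array_3 = np.array([np.nan, 0, 1, 2, np.nan])
nan_mask = np.isnan(array_3)     # remember which entries were missing
array_3[nan_mask] = 0            # impute a value; the shape stays (5,)
print(array_3)                   # array([0., 0., 1., 2., 0.])
print(array_3[~nan_mask])        # the mask still identifies the original, non-imputed entries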
Loading the data
#!python
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import matplotlib.pyplot as plt
In [4]: from sklearn import datasets
In [5]: iris = datasets.load_iris()
In [6]: iris.data
Out[6]:
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       ...
       [6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]])   # 150 rows in total (output abridged)
In [7]: iris.data.shape
Out[7]: (150, 4)
In [8]: iris.data[0]
Out[8]: array([5.1, 3.5, 1.4, 0.2])
In [9]: iris.feature_names
Out[9]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
In [10]: iris.target
Out[10]: array([0, 0, 0, ..., 1, 1, 1, ..., 2, 2, 2])   # 150 labels, 50 of each class (output abridged)
In [11]: iris.target.shape
Out[11]: (150,)
In [12]: iris.target_names
Out[12]: array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
- Viewing the data with pandas
#!python
import numpy as np               # Load the numpy library for fast array computations
import pandas as pd              # Load the pandas data-analysis library
import matplotlib.pyplot as plt  # Load the pyplot visualization library
%matplotlib inline
from sklearn import datasets

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['sepal length (cm)'].hist(bins=30)

scikit-learn-cookbook1-pandas1.png
#!python
for class_number in np.unique(iris.target):
    plt.figure(1)
    iris_df['sepal length (cm)'].iloc[np.where(iris.target == class_number)[0]].hist(bins=30)
The helper expression
#!python
np.where(iris.target == class_number)[0]

returns the row indices belonging to the current class. On the last pass of the loop it evaluates to:
#!python
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
       113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
       126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
       139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149], dtype=int64)
Plotting with matplotlib and NumPy
#!python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(np.arange(10), np.arange(10))
plt.plot(np.arange(10), np.exp(np.arange(10)))

# Put two plots together in one figure
plt.figure()
plt.subplot(121)
plt.plot(np.arange(10), np.exp(np.arange(10)))
plt.subplot(122)
plt.scatter(np.arange(10), np.exp(np.arange(10)))

plt.figure()
plt.subplot(211)
plt.plot(np.arange(10), np.exp(np.arange(10)))
plt.subplot(212)
plt.scatter(np.arange(10), np.exp(np.arange(10)))

plt.figure()
plt.subplot(221)
plt.plot(np.arange(10), np.exp(np.arange(10)))
plt.subplot(222)
plt.scatter(np.arange(10), np.exp(np.arange(10)))
plt.subplot(223)
plt.scatter(np.arange(10), np.exp(np.arange(10)))
plt.subplot(224)
plt.scatter(np.arange(10), np.exp(np.arange(10)))

from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
target = iris.target

# Resize the figure for better viewing
plt.figure(figsize=(12,5))
# First subplot
plt.subplot(121)
# Visualize the first two columns of data:
plt.scatter(data[:,0], data[:,1], c=target)
# Second subplot
plt.subplot(122)
# Visualize the last two columns of data:
plt.scatter(data[:,2], data[:,3], c=target)
A minimal machine learning quick start - support vector machine classification
To make predictions, we will (a sketch follows this list):
- State the problem to be solved
- Choose a model to solve the problem
- Train the model
- Make predictions
- Measure how well the model performed
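Here is a hedged sketch of these five steps on the iris data; the SVC hyperparameters and the 25% hold-out below are illustrative choices, not prescribed by the original text.
#!python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. The problem: predict the iris species from its four measurements.
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=7)

# 2. Choose a model: a support vector classifier.
clf = SVC(kernel='rbf', gamma='scale')

# 3. Train the model on the training data only.
clf.fit(X_train, y_train)

# 4. Make predictions on the held-out test set.
y_pred = clf.predict(X_test)

# 5. Measure how well the model performed.
print(accuracy_score(y_test, y_pred))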