1. 程式人生 > >第四章:NumPy基礎Day6-7

第四章:NumPy基礎Day6-7

說明:本文章為Python資料處理學習日誌,記錄內容為實現書本內容時遇到的錯誤以及一些與書本不一致的地方,一些簡單操作則不再贅述。日誌主要內容來自書本《利用Python進行資料分析》,Wes McKinney著,機械工業出版社。

1、ndarray

ndarray介紹

ndarray的說明:

Class docstring:

ndarray(shape, dtype=float, buffer=None, offset=0,strides=None, order=None)

An array object represents a multidimensional, homogeneous array of fixed-size items. An associated data-type object describes the format of each element in the array (its byte-order, how many bytes it occupies in memory, whether it is an integer, a floating point number, or something else, etc.)

Arrays should be constructed using array, zeros or empty (refer to the See Also section below). The parameters given here refer to a low-level method (ndarray(...)) for instantiating an array.

引數介紹:

Parameters

shape : tuple of ints
Shape of created array.

dtype : data-type, optional


Any object that can be interpreted as a numpy data type.

buffer : object exposing buffer interface, optional
Used to fill the array with data.

offset : int, optional
Offset of array data in buffer.

strides : tuple of ints, optional
Strides of data in memory.

order : {‘C’, ‘F’}, optional
Row-major (C-style) or column-major (Fortran-style) order.

屬性:

Attributes

T : ndarray
Transpose of the array.

data : buffer
The array’s elements, in memory.

dtype : dtype object
Describes the format of the elements in the array.

flags : dict
Dictionary containing information related to memory use, e.g.,’C_CONTIGUOUS’, ‘OWNDATA’, ‘WRITEABLE’, etc.

flat : numpy.flatiter object
Flattened version of the array as an iterator. The iterator allows assignments, e.g., x.flat = 3 (See ndarray.flat for
assignment examples; TODO).

imag : ndarray
Imaginary part of the array.

real : ndarray
Real part of the array.

size : int
Number of elements in the array.

itemsize : int
The memory use of each array element in bytes.

nbytes : int
The total number of bytes required to store the array data, i.e., itemsize * size.

ndim : int
The array’s number of dimensions.

shape : tuple of ints
Shape of the array.

strides : tuple of ints
The step-size required to move from one element to the next in memory. For example, a contiguous (3, 4) array of type int16 in C-order has strides (8, 2). This implies that to move from element to element in memory requires jumps of 2 bytes.To move from row-to-row, one needs to jump 8 bytes at a time(2 * 4).

ctypes : ctypes object
Class containing properties of the array needed for interaction with ctypes.

base : ndarray
If the array is a view into another array, that array is its base(unless that array is also a view). The base array is where
the array data is actually stored.

書本注

P87 astype函式
轉化為int則會報錯:

num_strings = np.array(['1.25','-9.6','42'],dtype = np.string_)

num_strings.dtype
Out[40]: dtype('S4')

num_strings.astype(float64)
Out[41]: array([  1.25,  -9.6 ,  42.  ])

num_strings.astype(int)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-6d0676c88a7e> in <module>()
----> 1 num_strings.astype(int)

ValueError: invalid literal for int() with base 10: '1.25' 

P89檢視與copy
用“=”定義新的陣列,改變其一,另一個都會隨之變化:

arr = np.arange(10)

arr
Out[78]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

arr1 = arr

arr1
Out[80]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

arr1[3:5]=233

arr1
Out[82]: array([  0,   1,   2, 233, 233,   5,   6,   7,   8,   9])

arr
Out[83]: array([  0,   1,   2, 233, 233,   5,   6,   7,   8,   9])

arr[3:5]=0

arr
Out[98]: array([0, 1, 2, 0, 0, 5, 6, 7, 8, 9])

arr1
Out[99]: array([0, 1, 2, 0, 0, 5, 6, 7, 8, 9])

若不想這樣,可以用copy()函式得到副本而非檢視:

arr = np.arange(10)

arr1 = arr.copy()

arr
Out[103]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

arr1
Out[104]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

arr1[5:8] = 233

arr1
Out[106]: array([  0,   1,   2,   3,   4, 233, 233, 233,   8,   9])

arr
Out[107]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

arr[1:3] = 666

arr
Out[109]: array([  0, 666, 666,   3,   4,   5,   6,   7,   8,   9])

arr1
Out[110]: array([  0,   1,   2,   3,   4, 233, 233, 233,   8,   9])

P90三維array

arr3d = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])

儲存方式如下圖:
這裡寫圖片描述

參考上圖,則容易理解以下操作結果:

arr3d[0]
Out[15]: 
array([[1, 2, 3],
       [4, 5, 6]])

arr3d[0,0]
Out[16]: array([1, 2, 3])

arr3d[:,:1]
Out[17]: 
array([[[1, 2, 3]],

       [[7, 8, 9]]])

arr3d[:,:1,1:]
Out[18]: 
array([[[2, 3]],

       [[8, 9]]])

P93二維array的切片
儘管選取的元素是一樣的,但格式上是有差別的:

arr2d[2,:].shape
Out[20]: (3L,)

arr2d[2:,:].shape
Out[21]: (1L, 3L)

arr2d[2,:]
Out[22]: array([7, 8, 9])

arr2d[2:,:]
Out[23]: array([[7, 8, 9]]) #這裡多一層[]

P93布林型索引
注意維度:

data[names == 'Bob'] #選中為True的行
Out[40]: 
array([[ 0.50050011,  0.29686535, -0.28906449,  2.01808417],
       [ 1.11494101,  0.16982801,  0.71012665,  0.27181318]])

data[names == 'Bob',3] #此時的3為列
Out[41]: array([ 2.01808417,  0.27181318])

data[names == 'Bob'][1] #此時1為行
Out[43]: array([ 1.11494101,  0.16982801,  0.71012665,  0.27181318])

data[names == 'Bob'][1,2] #此時1為行,2為列
Out[44]: 0.71012664552279547

P94選擇除“Bob”以外的值
負號(-)用於邏輯計算是deprecated,建議使用“~”:

data[-(names == 'Bob')]
-c:1: DeprecationWarning: numpy boolean negative, the `-` operator, is deprecated, use the `~` operator or the logical_not function instead.
Out[45]: 
array([[ 0.14050569,  0.3441191 , -1.33003662,  0.42766074],
       [-0.45652879, -1.51870623, -2.45291715,  0.86532186],
       [ 0.775377  , -0.73127956,  0.02678242, -0.73168782],
       [ 0.80537817, -1.1611737 ,  1.31483485,  1.79852968],
       [-0.07039251,  0.1692129 , -1.7043453 ,  0.03626713]])

data[~(names == 'Bob')]
Out[46]: 
array([[ 0.14050569,  0.3441191 , -1.33003662,  0.42766074],
       [-0.45652879, -1.51870623, -2.45291715,  0.86532186],
       [ 0.775377  , -0.73127956,  0.02678242, -0.73168782],
       [ 0.80537817, -1.1611737 ,  1.31483485,  1.79852968],
       [-0.07039251,  0.1692129 , -1.7043453 ,  0.03626713]])

P95花式索引
索引並不是按值來索引,而是按行。即[4,3,0,6]為行,而並非值,書上例子易誤解:

for i in range(8):
    arr[i]=i*i

arr
Out[67]: 
array([[  0.,   0.,   0.,   0.],
       [  1.,   1.,   1.,   1.],
       [  4.,   4.,   4.,   4.],
       [  9.,   9.,   9.,   9.],
       [ 16.,  16.,  16.,  16.],
       [ 25.,  25.,  25.,  25.],
       [ 36.,  36.,  36.,  36.],
       [ 49.,  49.,  49.,  49.]])

arr[[4,3,2,6]]
Out[68]: 
array([[ 16.,  16.,  16.,  16.],
       [  9.,   9.,   9.,   9.],
       [  4.,   4.,   4.,   4.],
       [ 36.,  36.,  36.,  36.]])

arr[[-3,-1,-7]]
Out[69]: 
array([[ 25.,  25.,  25.,  25.],
       [ 49.,  49.,  49.,  49.],
       [  1.,   1.,   1.,   1.]])

2、通用函式

實際上是對陣列的每一個元素進行運算,用通用函式會方便許多,完全不用寫任何迴圈!!!

3、利用陣列進行資料處理

示例實現結果圖:
這裡寫圖片描述

函式說明

mean()函式(其他如sum均類似,不贅述)

Signature:
mean(a, axis=None, dtype=None, out=None, keepdims=False)
Docstring:
Compute the arithmetic mean along the specified axis.

Returns the average of the array elements. The average is taken over the flattened array by default, otherwise over the specified axis.float64 intermediate and return values are used for integer inputs.

Parameters:
a : array_like
Array containing numbers whose mean is desired. If a is not an array, a conversion is attempted.
axis : None or int or tuple of ints, optional
Axis or axes along which the means are computed. The default is to compute the mean of the flattened array. If this is a tuple of ints, a mean is performed over multiple axes,instead of a single axis or all the axes as before.
dtype : data-type, optional
Type to use in computing the mean. For integer inputs, the default is float64; for floating point inputs, it is the same as the input dtype.
out : ndarray, optional
Alternate output array in which to place the result. The default is None; if provided, it must have the same shape as the expected output, but the type will be cast if necessary.See doc.ufuncs for details.
keepdims : bool, optional
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original arr.

Returns
m : ndarray, see dtype parameter above
If out=None, returns a new array containing the mean values, otherwise a reference to the output array is returned.

書本注

P103邏輯運算
分別實現三種方法,第三個表示式似乎有點問題。第三種方法同樣因為版本問題,有些符號需要更改,如求反改“-”為“~”,亦或改“-”為“^”(這裡實際上並不是求亦或,而是表示式出了問題):

cond1 = array([True,True,False,False])

cond2 = array([True,False,True,False])

result1=[]

for i in range(4):
    if cond1[i] and cond2[i]:
        result1.append(0)
    elif cond1[i]:
        result1.append(1)
    elif cond2[i]:
        result1.append(2)
    else:
        result1.append(3)


result1
Out[179]: [0, 1, 2, 3]

result2 = np.where(cond1&cond2,0,
    np.where(cond1,1,
        np.where(cond2,2,3)))

result2
Out[181]: array([0, 1, 2, 3])

result3 = 1*(cond1-cond2)+2*(cond2& -cond1)+3* -(cond1|cond2)
-c:1: DeprecationWarning: numpy boolean subtract, the `-` operator, is deprecated, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
-c:1: DeprecationWarning: numpy boolean negative, the `-` operator, is deprecated, use the `~` operator or the logical_not function instead.

result3
Out[184]: array([0, 1, 3, 3])

result3 = 1*(cond1^cond2)+2*(cond2&~cond1)+3*~(cond1|cond2)

result3
Out[186]: array([0, 1, 3, 3])

該邏輯的卡諾圖為下圖:
這裡寫圖片描述

表示式為:
F(AB)=0×AB+1×A¯B+2×AB¯+3×A¯B¯

應將表示式修改為:

result3 = 1*(cond1&~cond2)+2*(cond2&~cond1)+3*~(cond1|cond2)

result3
Out[188]: array([0, 1, 2, 3])

P104數學和統計方法
在此說明一些mean()的引數axis(其他方法的引數axis均與mean()函式相同):

arr = random.randn(5,4) #Canopy中默認了很多東西,此處可以不加np.

arr
Out[214]: 
array([[-1.35496721, -0.15978328,  0.38585305, -0.31769088],
       [ 1.02980411, -0.66722211,  0.7295801 , -1.08059653],
       [-0.78498058, -1.32036524,  0.96689313, -0.72398153],
       [ 0.1016877 ,  1.3240266 ,  1.70886154, -0.60597185],
       [-1.79229484, -2.20550415,  0.69782981,  0.68054199]])

arr.mean(1) #此處的引數1即為axis的值,延axis1計算,故計算結果為每一行的平均值
Out[215]: array([-0.36164708,  0.0028914 , -0.46560855,  0.632151  , -0.6548568 ])

arr.mean(0) #此處的引數0也為axis的值,延axis0計算,故計算結果為每一列的平均值
Out[216]: array([-0.56015016, -0.60576964,  0.89780353, -0.40953976])

arr.mean(2) #因為arr為二維陣列,axis並沒有2,故報錯
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-217-9e6caf702904> in <module>()
----> 1 arr.mean(2) #因為arr為二維陣列,axis並沒有2,故報錯

E:\Enthought\hzk\User\lib\site-packages\numpy\core\_methods.pyc in _mean(a, axis, dtype, out, keepdims)
     54     arr = asanyarray(a)
     55 
---> 56     rcount = _count_reduce_items(arr, axis)
     57     # Make this warning show up first
     58     if rcount == 0:

E:\Enthought\hzk\User\lib\site-packages\numpy\core\_methods.pyc in _count_reduce_items(arr, axis)
     48     items = 1
     49     for ax in axis:
---> 50         items *= arr.shape[ax]
     51     return items
     52 

IndexError: tuple index out of range 

arr3d = random.randn(3,3,3)

arr3d
Out[219]: 
array([[[-0.44556533,  1.38462328, -0.41991462],
        [-0.94348762,  0.24127859,  0.10032657],
        [-1.00207669,  0.09779813,  1.39707048]],

       [[ 1.82511376,  0.21457405,  0.2793703 ],
        [-2.52457174, -0.48253527,  0.11162123],
        [ 1.32029231,  1.25795631, -0.86304957]],

       [[ 1.2496241 ,  0.60886197, -0.79459398],
        [-3.11245287,  1.19690542,  0.59970726],
        [-0.75850899,  1.11220732,  0.7452851 ]]])

arr3d.mean(2) #改為三維陣列則不會報錯
Out[220]: 
array([[ 0.17304778, -0.20062749,  0.16426397],
       [ 0.77301937, -0.96516193,  0.57173302],
       [ 0.3546307 , -0.43861339,  0.36632781]])

注意:axis引數0,1,指的是延axis0或axis1計算,這一點可以從cumsum()和cumprod()函式中體現出了:

arr = array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])

arr
Out[231]: 
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

arr.cumsum(0)
Out[232]: 
array([[ 1,  2,  3],
       [ 5,  7,  9],
       [12, 15, 18],
       [22, 26, 30]])

arr.cumsum(1)
Out[233]: 
array([[ 1,  3,  6],
       [ 4,  9, 15],
       [ 7, 15, 24],
       [10, 21, 33]])

arr.cumprod(0)
Out[234]: 
array([[   1,    2,    3],
       [   4,   10,   18],
       [  28,   80,  162],
       [ 280,  880, 1944]])

arr.cumprod(1)
Out[235]: 
array([[   1,    2,    6],
       [   4,   20,  120],
       [   7,   56,  504],
       [  10,  110, 1320]])

二維陣列的axis示意圖為:
這裡寫圖片描述

5、用於陣列的檔案輸入輸出

存的檔名最好不要帶空格“ ”,儘量用下劃線“_”代替,不然按Tab鍵無法自動載入。

6、線性代數

書本注

P110矩陣運算
可能是精度問題,計算XX1並不得單位矩陣E:

x = array(randn(5,5),dtype='f')

x
Out[328]: 
array([[-0.22578056,  0.11948908, -0.34033489, -1.17195463,  0.03571514],
       [-0.32171333,  0.30542919, -2.02457905, -0.58019698, -0.97900975],
       [-0.66901571, -1.31925774, -1.30758452,  1.68790388,  0.91629416],
       [ 1.85947955,  0.29689333,  0.06040272, -0.38136303,  1.94833517],
       [ 1.61013746,  0.54303652, -1.0496527 ,  1.74626613,  0.73138028]], dtype=float32)

inv(x)
Out[329]: 
array([[-3.83887243,  1.60917497, -0.23554267,  1.54410875, -1.47681177],
       [ 5.01534271, -2.0442934 , -0.20324162, -1.94631255,  2.45808005],
       [-0.25406945, -0.28958428, -0.14813125, -0.02293177, -0.12855303],
       [ 0.56339014, -0.47421208,  0.07606155, -0.51252431,  0.6077475 ],
       [ 3.0176971 , -1.30811489,  0.27525163, -0.7634539 ,  1.15783525]], dtype=float32)

dot(x,inv(x))
Out[330]: 
array([[  1.00000000e+00,  -1.49011612e-08,   4.65661287e-09,
         -3.72529030e-09,  -8.19563866e-08],
       [  2.38418579e-07,   9.99999881e-01,   0.00000000e+00,
         -5.96046448e-08,   0.00000000e+00],
       [  4.76837158e-07,   1.19209290e-07,   1.00000000e+00,
          5.96046448e-08,   0.00000000e+00],
       [  0.00000000e+00,   2.38418579e-07,  -5.96046448e-08,
          1.00000012e+00,  -2.38418579e-07],
       [ -2.38418579e-07,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   1.00000012e+00]], dtype=float32)

7、範例:隨機漫步

書本注

P112
繪圖直接用程式碼:

plot(walk)
Out[351]: [<matplotlib.lines.Line2D at 0x157ae7f0>]

結果:
這裡寫圖片描述