Cris 的 Python 資料分析筆記 01:NumPy 基本知識
阿新 • • 發佈:2018-11-19
01. NumPy基本知識
文章目錄
1. numpy 的第一個函式 genfromtxt
import numpy as np
world_alcohol = np.genfromtxt('world_alcohol.txt',delimiter=',',dtype='str')
# <class 'numpy.ndarray'>
print(type(world_alcohol))
print(world_alcohol)
print(help(np.genfromtxt))
<class 'numpy.ndarray'> [['Year' 'WHO region' 'Country' 'Beverage Types' 'Display Value'] ['1986' 'Western Pacific' 'Viet Nam' 'Wine' '0'] ['1986' 'Americas' 'Uruguay' 'Other' '0.5'] ... ['1987' 'Africa' 'Malawi' 'Other' '0.75'] ['1989' 'Americas' 'Bahamas' 'Wine' '1.5'] ['1985' 'Africa' 'Malawi' 'Spirits' '0.31']] Help on function genfromtxt in module numpy.lib.npyio: genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None, usecols=None, names=None, excludelist=None, deletechars=None, replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None, encoding='bytes') Load data from a text file, with missing values handled as specified. Each line past the first `skip_header` lines is split at the `delimiter` character, and characters following the `comments` character are discarded. Parameters ---------- fname : file, str, pathlib.Path, list of str, generator File, filename, list, or generator to read. If the filename extension is `.gz` or `.bz2`, the file is first decompressed. Note that generators must return byte strings in Python 3k. The strings in a list or produced by a generator are treated as lines. dtype : dtype, optional Data type of the resulting array. If None, the dtypes will be determined by the contents of each column, individually. comments : str, optional The character used to indicate the start of a comment. All the characters occurring on a line after a comment are discarded delimiter : str, int, or sequence, optional The string used to separate values. By default, any consecutive whitespaces act as delimiter. An integer or sequence of integers can also be provided as width(s) of each field. skiprows : int, optional `skiprows` was removed in numpy 1.10. Please use `skip_header` instead. skip_header : int, optional The number of lines to skip at the beginning of the file. skip_footer : int, optional The number of lines to skip at the end of the file. converters : variable, optional The set of functions that convert the data of a column to a value. The converters can also be used to provide a default value for missing data: ``converters = {3: lambda s: float(s or 0)}``. missing : variable, optional `missing` was removed in numpy 1.10. Please use `missing_values` instead. missing_values : variable, optional The set of strings corresponding to missing data. filling_values : variable, optional The set of values to be used as default when the data are missing. usecols : sequence, optional Which columns to read, with 0 being the first. For example, ``usecols = (1, 4, 5)`` will extract the 2nd, 5th and 6th columns. names : {None, True, str, sequence}, optional If `names` is True, the field names are read from the first line after the first `skip_header` lines. This line can optionally be proceeded by a comment delimeter. If `names` is a sequence or a single-string of comma-separated names, the names will be used to define the field names in a structured dtype. If `names` is None, the names of the dtype fields will be used, if any. excludelist : sequence, optional A list of names to exclude. This list is appended to the default list ['return','file','print']. Excluded names are appended an underscore: for example, `file` would become `file_`. deletechars : str, optional A string combining invalid characters that must be deleted from the names. defaultfmt : str, optional A format used to define default field names, such as "f%i" or "f_%02i". autostrip : bool, optional Whether to automatically strip white spaces from the variables. replace_space : char, optional Character(s) used in replacement of white spaces in the variables names. By default, use a '_'. case_sensitive : {True, False, 'upper', 'lower'}, optional If True, field names are case sensitive. If False or 'upper', field names are converted to upper case. If 'lower', field names are converted to lower case. unpack : bool, optional If True, the returned array is transposed, so that arguments may be unpacked using ``x, y, z = loadtxt(...)`` usemask : bool, optional If True, return a masked array. If False, return a regular array. loose : bool, optional If True, do not raise errors for invalid values. invalid_raise : bool, optional If True, an exception is raised if an inconsistency is detected in the number of columns. If False, a warning is emitted and the offending lines are skipped. max_rows : int, optional The maximum number of rows to read. Must not be used with skip_footer at the same time. If given, the value must be at least 1. Default is to read the entire file. .. versionadded:: 1.10.0 encoding : str, optional Encoding used to decode the inputfile. Does not apply when `fname` is a file object. The special value 'bytes' enables backward compatibility workarounds that ensure that you receive byte arrays when possible and passes latin1 encoded strings to converters. Override this value to receive unicode arrays and pass strings as input to converters. If set to None the system default is used. The default value is 'bytes'. .. versionadded:: 1.14.0 Returns ------- out : ndarray Data read from the text file. If `usemask` is True, this is a masked array. See Also -------- numpy.loadtxt : equivalent function when no data is missing. Notes ----- * When spaces are used as delimiters, or when no delimiter has been given as input, there should not be any missing data between two fields. * When the variables are named (either by a flexible dtype or with `names`, there must not be any header in the file (else a ValueError exception is raised). * Individual values are not stripped of spaces by default. When using a custom converter, make sure the function does remove spaces. References ---------- .. [1] NumPy User Guide, section `I/O with NumPy <http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html>`_. Examples --------- >>> from io import StringIO >>> import numpy as np Comma delimited file with mixed dtype >>> s = StringIO("1,1.3,abcde") >>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'), ... ('mystring','S5')], delimiter=",") >>> data array((1, 1.3, 'abcde'), dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')]) Using dtype = None >>> s.seek(0) # needed for StringIO example only >>> data = np.genfromtxt(s, dtype=None, ... names = ['myint','myfloat','mystring'], delimiter=",") >>> data array((1, 1.3, 'abcde'), dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')]) Specifying dtype and names >>> s.seek(0) >>> data = np.genfromtxt(s, dtype="i8,f8,S5", ... names=['myint','myfloat','mystring'], delimiter=",") >>> data array((1, 1.3, 'abcde'), dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')]) An example with fixed-width columns >>> s = StringIO("11.3abcde") >>> data = np.genfromtxt(s, dtype=None, names=['intvar','fltvar','strvar'], ... delimiter=[1,3,5]) >>> data array((1, 1.3, 'abcde'), dtype=[('intvar', '<i8'), ('fltvar', '<f8'), ('strvar', '|S5')]) None
2. numpy 的第二個函式 array
import numpy as np
vector = np.array([1,2,3])
# [1 2 3]
print(vector)
# <class 'numpy.ndarray'> numpy 中特殊的資料型別,可以理解為矩陣
print(type(vector))
matrix = np.array([[11,22,33],['cris','james','小哥哥'],[11.11,True,False,]])
'''
array 方法裡面的元素必須為同一個型別,否則將會把資料往更加通用的資料型別上轉換(自動型別轉換),例如 int-->float,其他資料型別-->str
[['11' '22' '33']
['cris' 'james' '小哥哥']
['11.11' 'True' 'False']]
'''
print(matrix)
# <class 'numpy.ndarray'>
print(type(matrix))
print(help(np.array))
[1 2 3]
<class 'numpy.ndarray'>
[['11' '22' '33']
['cris' 'james' '小哥哥']
['11.11' 'True' 'False']]
<class 'numpy.ndarray'>
Help on built-in function array in module numpy.core.multiarray:
array(...)
array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
Create an array.
Parameters
----------
object : array_like
An array, any object exposing the array interface, an object whose
__array__ method returns an array, or any (nested) sequence.
dtype : data-type, optional
The desired data-type for the array. If not given, then the type will
be determined as the minimum type required to hold the objects in the
sequence. This argument can only be used to 'upcast' the array. For
downcasting, use the .astype(t) method.
copy : bool, optional
If true (default), then the object is copied. Otherwise, a copy will
only be made if __array__ returns a copy, if obj is a nested sequence,
or if a copy is needed to satisfy any of the other requirements
(`dtype`, `order`, etc.).
order : {'K', 'A', 'C', 'F'}, optional
Specify the memory layout of the array. If object is not an array, the
newly created array will be in C order (row major) unless 'F' is
specified, in which case it will be in Fortran order (column major).
If object is an array the following holds.
===== ========= ===================================================
order no copy copy=True
===== ========= ===================================================
'K' unchanged F & C order preserved, otherwise most similar order
'A' unchanged F order if input is F and not C, otherwise C order
'C' C order C order
'F' F order F order
===== ========= ===================================================
When ``copy=False`` and a copy is made for other reasons, the result is
the same as if ``copy=True``, with some exceptions for `A`, see the
Notes section. The default order is 'K'.
subok : bool, optional
If True, then sub-classes will be passed-through, otherwise
the returned array will be forced to be a base-class array (default).
ndmin : int, optional
Specifies the minimum number of dimensions that the resulting
array should have. Ones will be pre-pended to the shape as
needed to meet this requirement.
Returns
-------
out : ndarray
An array object satisfying the specified requirements.
See Also
--------
empty, empty_like, zeros, zeros_like, ones, ones_like, full, full_like
Notes
-----
When order is 'A' and `object` is an array in neither 'C' nor 'F' order,
and a copy is forced by a change in dtype, then the order of the result is
not necessarily 'C' as expected. This is likely a bug.
Examples
--------
>>> np.array([1, 2, 3])
array([1, 2, 3])
Upcasting:
>>> np.array([1, 2, 3.0])
array([ 1., 2., 3.])
More than one dimension:
>>> np.array([[1, 2], [3, 4]])
array([[1, 2],
[3, 4]])
Minimum dimensions 2:
>>> np.array([1, 2, 3], ndmin=2)
array([[1, 2, 3]])
Type provided:
>>> np.array([1, 2, 3], dtype=complex)
array([ 1.+0.j, 2.+0.j, 3.+0.j])
Data-type consisting of more than one element:
>>> x = np.array([(1,2),(3,4)],dtype=[('a','<i4'),('b','<i4')])
>>> x['a']
array([1, 3])
Creating an array from sub-classes:
>>> np.array(np.mat('1 2; 3 4'))
array([[1, 2],
[3, 4]])
>>> np.array(np.mat('1 2; 3 4'), subok=True)
matrix([[1, 2],
[3, 4]])
None
3. numpy 的第三個函式 shape
import numpy as np
'''
通過 shape 函式可以檢視變數的資料型別,例如下面程式碼的(3,) 表示有3個元素的列表;(2,3)表示兩行三列的矩陣
'''
vector = [1,2,3]
result = np.shape(element)
print(result)
# (3,)
matrix = np.shape([[1,2,3],['cris',False,True]])
print(matrix)
# (2, 3)
(3,)
(2, 3)
4. numpy 的 ndarray 資料型別的 dtype 屬性
import numpy as np
'''
經過 numpy 的 array 函式後,資料就變成了 ndarray 資料型別(type函式),而 dtype 屬性可以檢視當前 ndarray 裡的每一個元素的資料型別
(注意元素的自動資料型別轉換)
'''
vector = np.array([1,2,3,'jj'])
# ['1' '2' '3' 'jj']
print(vector)
# <class 'numpy.ndarray'>
print(type(vector))
# <U11
print(vector.dtype)
['1' '2' '3' 'jj']
<class 'numpy.ndarray'>
<U11
5. numpy 的 ndarray 資料型別如何取值
import numpy as np
data = np.genfromtxt('world_alcohol.txt', delimiter=',',dtype=str,skip_header=1)
print(data)
# 類似 Python 的序列資料型別,可以指定取出二維矩陣位置的元素,第一個引數為行,第二個引數為列
# 預設索引都是從 0 開始
data_01 = data[1,4]
data_02 = data[2,3]
print(data_01)
print(data_02)
[['1986' 'Western Pacific' 'Viet Nam' 'Wine' '0']
['1986' 'Americas' 'Uruguay' 'Other' '0.5']
['1985' 'Africa' "Cte d'Ivoire" 'Wine' '1.62']
...
['1987' 'Africa' 'Malawi' 'Other' '0.75']
['1989' 'Americas' 'Bahamas' 'Wine' '1.5']
['1985' 'Africa' 'Malawi' 'Spirits' '0.31']]
0.5
Wine
6. numpy 的 ndarray 切片
import numpy as np
# 其實和 Python 中序列切片一模一樣,前包後不包
data = np.array([1,2,3,4,5])
# [1 2 3]
print(data[0:3])
[1 2 3]
7. numpy 的 二維陣列切片
import numpy as np
matrix = np.array([['james','USA',45],['cris','CHINA',33],['大帥','UK',11]])
# ['USA' 'CHINA' 'UK'] 可以對二維陣列取出所有行的制定列的值,:表示所有行
print(matrix[:,1])
'''
可以通過切片指定取指定的那幾列的所有行的值
[['james' 'USA']
['cris' 'CHINA']
['大帥' 'UK']]
'''
print(matrix[:,0:2])
'''
同理,可以取指定行的指定列的值,也就是說二維陣列變數可以通過切片的方式取出任意位置的值,切片的第一個引數是行,第二個引數代表列,並且這兩個引數
都是可以使用切片形式的
[['james' 'USA']
['cris' 'CHINA']]
'''
print(matrix[0:2,0:2])
['USA' 'CHINA' 'UK']
[['james' 'USA']
['cris' 'CHINA']
['大帥' 'UK']]
[['james' 'USA']
['cris' 'CHINA']]