Cris's Python Data Analysis Notes 01: NumPy Basics

01. NumPy Basics

1. NumPy's first function: genfromtxt

import numpy as np

world_alcohol = np.genfromtxt('world_alcohol.txt',delimiter=',',dtype='str')
# <class 'numpy.ndarray'>
print(type(world_alcohol))
print(world_alcohol)
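# help() prints the documentation itself and returns None, so print() adds a trailing None to the output
print(help(np.genfromtxt))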
<class 'numpy.ndarray'>
[['Year' 'WHO region' 'Country' 'Beverage Types' 'Display Value']
 ['1986' 'Western Pacific' 'Viet Nam' 'Wine' '0']
 ['1986' 'Americas' 'Uruguay' 'Other' '0.5']
 ...
 ['1987' 'Africa' 'Malawi' 'Other' '0.75']
 ['1989' 'Americas' 'Bahamas' 'Wine' '1.5']
 ['1985' 'Africa' 'Malawi' 'Spirits' '0.31']]
Help on function genfromtxt in module numpy.lib.npyio:

genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None, usecols=None, names=None, excludelist=None, deletechars=None, replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None, encoding='bytes')
    Load data from a text file, with missing values handled as specified.
    
    Each line past the first `skip_header` lines is split at the `delimiter`
    character, and characters following the `comments` character are discarded.
    
    Parameters
    ----------
    fname : file, str, pathlib.Path, list of str, generator
        File, filename, list, or generator to read.  If the filename
        extension is `.gz` or `.bz2`, the file is first decompressed. Note
        that generators must return byte strings in Python 3k.  The strings
        in a list or produced by a generator are treated as lines.
    dtype : dtype, optional
        Data type of the resulting array.
        If None, the dtypes will be determined by the contents of each
        column, individually.
    comments : str, optional
        The character used to indicate the start of a comment.
        All the characters occurring on a line after a comment are discarded
    delimiter : str, int, or sequence, optional
        The string used to separate values.  By default, any consecutive
        whitespaces act as delimiter.  An integer or sequence of integers
        can also be provided as width(s) of each field.
    skiprows : int, optional
        `skiprows` was removed in numpy 1.10. Please use `skip_header` instead.
    skip_header : int, optional
        The number of lines to skip at the beginning of the file.
    skip_footer : int, optional
        The number of lines to skip at the end of the file.
    converters : variable, optional
        The set of functions that convert the data of a column to a value.
        The converters can also be used to provide a default value
        for missing data: ``converters = {3: lambda s: float(s or 0)}``.
    missing : variable, optional
        `missing` was removed in numpy 1.10. Please use `missing_values`
        instead.
    missing_values : variable, optional
        The set of strings corresponding to missing data.
    filling_values : variable, optional
        The set of values to be used as default when the data are missing.
    usecols : sequence, optional
        Which columns to read, with 0 being the first.  For example,
        ``usecols = (1, 4, 5)`` will extract the 2nd, 5th and 6th columns.
    names : {None, True, str, sequence}, optional
        If `names` is True, the field names are read from the first line after
        the first `skip_header` lines.  This line can optionally be proceeded
        by a comment delimeter. If `names` is a sequence or a single-string of
        comma-separated names, the names will be used to define the field names
        in a structured dtype. If `names` is None, the names of the dtype
        fields will be used, if any.
    excludelist : sequence, optional
        A list of names to exclude. This list is appended to the default list
        ['return','file','print']. Excluded names are appended an underscore:
        for example, `file` would become `file_`.
    deletechars : str, optional
        A string combining invalid characters that must be deleted from the
        names.
    defaultfmt : str, optional
        A format used to define default field names, such as "f%i" or "f_%02i".
    autostrip : bool, optional
        Whether to automatically strip white spaces from the variables.
    replace_space : char, optional
        Character(s) used in replacement of white spaces in the variables
        names. By default, use a '_'.
    case_sensitive : {True, False, 'upper', 'lower'}, optional
        If True, field names are case sensitive.
        If False or 'upper', field names are converted to upper case.
        If 'lower', field names are converted to lower case.
    unpack : bool, optional
        If True, the returned array is transposed, so that arguments may be
        unpacked using ``x, y, z = loadtxt(...)``
    usemask : bool, optional
        If True, return a masked array.
        If False, return a regular array.
    loose : bool, optional
        If True, do not raise errors for invalid values.
    invalid_raise : bool, optional
        If True, an exception is raised if an inconsistency is detected in the
        number of columns.
        If False, a warning is emitted and the offending lines are skipped.
    max_rows : int,  optional
        The maximum number of rows to read. Must not be used with skip_footer
        at the same time.  If given, the value must be at least 1. Default is
        to read the entire file.
    
        .. versionadded:: 1.10.0
    encoding : str, optional
        Encoding used to decode the inputfile. Does not apply when `fname` is
        a file object.  The special value 'bytes' enables backward compatibility
        workarounds that ensure that you receive byte arrays when possible
        and passes latin1 encoded strings to converters. Override this value to
        receive unicode arrays and pass strings as input to converters.  If set
        to None the system default is used. The default value is 'bytes'.
    
        .. versionadded:: 1.14.0
    
    Returns
    -------
    out : ndarray
        Data read from the text file. If `usemask` is True, this is a
        masked array.
    
    See Also
    --------
    numpy.loadtxt : equivalent function when no data is missing.
    
    Notes
    -----
    * When spaces are used as delimiters, or when no delimiter has been given
      as input, there should not be any missing data between two fields.
    * When the variables are named (either by a flexible dtype or with `names`,
      there must not be any header in the file (else a ValueError
      exception is raised).
    * Individual values are not stripped of spaces by default.
      When using a custom converter, make sure the function does remove spaces.
    
    References
    ----------
    .. [1] NumPy User Guide, section `I/O with NumPy
           <http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html>`_.
    
    Examples
    ---------
    >>> from io import StringIO
    >>> import numpy as np
    
    Comma delimited file with mixed dtype
    
    >>> s = StringIO("1,1.3,abcde")
    >>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
    ... ('mystring','S5')], delimiter=",")
    >>> data
    array((1, 1.3, 'abcde'),
          dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
    
    Using dtype = None
    
    >>> s.seek(0) # needed for StringIO example only
    >>> data = np.genfromtxt(s, dtype=None,
    ... names = ['myint','myfloat','mystring'], delimiter=",")
    >>> data
    array((1, 1.3, 'abcde'),
          dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
    
    Specifying dtype and names
    
    >>> s.seek(0)
    >>> data = np.genfromtxt(s, dtype="i8,f8,S5",
    ... names=['myint','myfloat','mystring'], delimiter=",")
    >>> data
    array((1, 1.3, 'abcde'),
          dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
    
    An example with fixed-width columns
    
    >>> s = StringIO("11.3abcde")
    >>> data = np.genfromtxt(s, dtype=None, names=['intvar','fltvar','strvar'],
    ...     delimiter=[1,3,5])
    >>> data
    array((1, 1.3, 'abcde'),
          dtype=[('intvar', '<i8'), ('fltvar', '<f8'), ('strvar', '|S5')])

None
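
To try genfromtxt without the data file at hand, the sketch below reads the same comma-separated layout from an in-memory buffer. The three sample lines are a hypothetical stand-in for world_alcohol.txt (copied from the output above), and skip_header=1 drops the header row so that only data rows remain.

```python
from io import StringIO

import numpy as np

# Hypothetical stand-in for world_alcohol.txt: one header line plus two data rows
sample = StringIO(
    "Year,WHO region,Country,Beverage Types,Display Value\n"
    "1986,Western Pacific,Viet Nam,Wine,0\n"
    "1986,Americas,Uruguay,Other,0.5\n"
)

# Read every field as text and skip the header line
rows = np.genfromtxt(sample, delimiter=',', dtype=str, skip_header=1)
print(rows.shape)

# The last column is still text; convert it explicitly when numbers are needed
display_values = rows[:, 4].astype(float)
print(display_values.sum())
```

Reading everything as str keeps the mixed text and numeric columns in one array, at the cost of converting numeric columns by hand afterwards.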

2. NumPy's second function: array

import numpy as np

vector = np.array([1,2,3])
# [1 2 3]
print(vector)
# <class 'numpy.ndarray'> NumPy's own array type; a two-dimensional ndarray can be thought of as a matrix
print(type(vector))

matrix = np.array([[11,22,33],['cris','james','小哥哥'],[11.11,True,False,]])
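'''
    All elements of an ndarray share a single data type. If the inputs mix types, np.array converts them to a more general type (automatic type conversion), for example int --> float, and any other type mixed with strings --> str.
    [['11' '22' '33']
     ['cris' 'james' '小哥哥']
     ['11.11' 'True' 'False']]
'''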
print(matrix)
# <class 'numpy.ndarray'>
print(type(matrix))
print(help(np.array))
[1 2 3]
<class 'numpy.ndarray'>
[['11' '22' '33']
 ['cris' 'james' '小哥哥']
 ['11.11' 'True' 'False']]
<class 'numpy.ndarray'>
Help on built-in function array in module numpy.core.multiarray:

array(...)
    array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
    
    Create an array.
    
    Parameters
    ----------
    object : array_like
        An array, any object exposing the array interface, an object whose
        __array__ method returns an array, or any (nested) sequence.
    dtype : data-type, optional
        The desired data-type for the array.  If not given, then the type will
        be determined as the minimum type required to hold the objects in the
        sequence.  This argument can only be used to 'upcast' the array.  For
        downcasting, use the .astype(t) method.
    copy : bool, optional
        If true (default), then the object is copied.  Otherwise, a copy will
        only be made if __array__ returns a copy, if obj is a nested sequence,
        or if a copy is needed to satisfy any of the other requirements
        (`dtype`, `order`, etc.).
    order : {'K', 'A', 'C', 'F'}, optional
        Specify the memory layout of the array. If object is not an array, the
        newly created array will be in C order (row major) unless 'F' is
        specified, in which case it will be in Fortran order (column major).
        If object is an array the following holds.
    
        ===== ========= ===================================================
        order  no copy                     copy=True
        ===== ========= ===================================================
        'K'   unchanged F & C order preserved, otherwise most similar order
        'A'   unchanged F order if input is F and not C, otherwise C order
        'C'   C order   C order
        'F'   F order   F order
        ===== ========= ===================================================
    
        When ``copy=False`` and a copy is made for other reasons, the result is
        the same as if ``copy=True``, with some exceptions for `A`, see the
        Notes section. The default order is 'K'.
    subok : bool, optional
        If True, then sub-classes will be passed-through, otherwise
        the returned array will be forced to be a base-class array (default).
    ndmin : int, optional
        Specifies the minimum number of dimensions that the resulting
        array should have.  Ones will be pre-pended to the shape as
        needed to meet this requirement.
    
    Returns
    -------
    out : ndarray
        An array object satisfying the specified requirements.
    
    See Also
    --------
    empty, empty_like, zeros, zeros_like, ones, ones_like, full, full_like
    
    Notes
    -----
    When order is 'A' and `object` is an array in neither 'C' nor 'F' order,
    and a copy is forced by a change in dtype, then the order of the result is
    not necessarily 'C' as expected. This is likely a bug.
    
    Examples
    --------
    >>> np.array([1, 2, 3])
    array([1, 2, 3])
    
    Upcasting:
    
    >>> np.array([1, 2, 3.0])
    array([ 1.,  2.,  3.])
    
    More than one dimension:
    
    >>> np.array([[1, 2], [3, 4]])
    array([[1, 2],
           [3, 4]])
    
    Minimum dimensions 2:
    
    >>> np.array([1, 2, 3], ndmin=2)
    array([[1, 2, 3]])
    
    Type provided:
    
    >>> np.array([1, 2, 3], dtype=complex)
    array([ 1.+0.j,  2.+0.j,  3.+0.j])
    
    Data-type consisting of more than one element:
    
    >>> x = np.array([(1,2),(3,4)],dtype=[('a','<i4'),('b','<i4')])
    >>> x['a']
    array([1, 3])
    
    Creating an array from sub-classes:
    
    >>> np.array(np.mat('1 2; 3 4'))
    array([[1, 2],
           [3, 4]])
    
    >>> np.array(np.mat('1 2; 3 4'), subok=True)
    matrix([[1, 2],
            [3, 4]])

None
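
The automatic conversion described above can be checked directly by looking at dtype.kind after building arrays from mixed inputs. This is a small sketch; the variable names are only illustrative, and kind is used instead of the full dtype name because the exact integer and string widths vary by platform.

```python
import numpy as np

ints = np.array([1, 2, 3])            # all integers stay integers
floats = np.array([1, 2, 3.0])        # one float upcasts everything to float
texts = np.array([1, 2.5, 'cris'])    # one string turns everything into strings
flags = np.array([True, False, 2])    # booleans mixed with an int become ints

for name, arr in [('ints', ints), ('floats', floats),
                  ('texts', texts), ('flags', flags)]:
    # dtype.kind is a one-letter code: 'i' integer, 'f' float, 'U' Unicode string
    print(name, arr.dtype.kind, arr)
```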

3. NumPy's third function: shape

import numpy as np

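'''
    The shape function reports the dimensions of a variable (not its data type). In the code below, (3,) means a one-dimensional list of 3 elements; (2, 3) means a matrix with two rows and three columns.
'''
vector = [1,2,3]
result = np.shape(vector)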
print(result)
# (3,)
matrix = np.shape([[1,2,3],['cris',False,True]])
print(matrix)
# (2, 3)
(3,)
(2, 3)
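
Once the data is already an ndarray, the same information is available as the .shape attribute, alongside .ndim and .size. A minimal sketch, with grid as an illustrative name:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

print(np.shape(grid))  # function form, also accepts plain lists
print(grid.shape)      # attribute form, same tuple
print(grid.ndim)       # number of dimensions
print(grid.size)       # total number of elements
print(np.shape(7))     # a scalar has an empty shape tuple
```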

4. The dtype attribute of NumPy's ndarray

import numpy as np

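'''
    After passing through NumPy's array function, the data becomes an ndarray (as reported by the type function). The dtype attribute shows the data type shared by every element of that ndarray
    (note the automatic type conversion of the elements).
'''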

vector = np.array([1,2,3,'jj'])
# ['1' '2' '3' 'jj']
print(vector)
# <class 'numpy.ndarray'>
print(type(vector))
# <U11
print(vector.dtype)
['1' '2' '3' 'jj']
<class 'numpy.ndarray'>
<U11
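
The <U11 above is a Unicode string type; the number after U is a maximum length that, as far as I know, depends on the platform and NumPy version, so the sketch below checks only dtype.kind. It also shows astype, which converts numeric text back to numbers when every element is parseable. The variable names are illustrative.

```python
import numpy as np

mixed = np.array([1, 2, 3, 'jj'])
print(mixed.dtype.kind)   # 'U': every element was converted to a string

numeric_text = np.array(['1', '2', '3'])
as_int = numeric_text.astype(int)   # explicit conversion back to integers
print(as_int.dtype.kind)
print(as_int.sum())
```

Calling astype(int) on the mixed array would raise a ValueError, because 'jj' cannot be parsed as an integer.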

5. Reading values from a NumPy ndarray

import numpy as np

data = np.genfromtxt('world_alcohol.txt', delimiter=',',dtype=str,skip_header=1)
print(data)
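# As with Python sequences, an element of a two-dimensional array can be read by position: the first index is the row, the second is the column
# Indexes start at 0 by default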
data_01 = data[1,4]
data_02 = data[2,3]
print(data_01)
print(data_02)
[['1986' 'Western Pacific' 'Viet Nam' 'Wine' '0']
 ['1986' 'Americas' 'Uruguay' 'Other' '0.5']
 ['1985' 'Africa' "Cte d'Ivoire" 'Wine' '1.62']
 ...
 ['1987' 'Africa' 'Malawi' 'Other' '0.75']
 ['1989' 'Americas' 'Bahamas' 'Wine' '1.5']
 ['1985' 'Africa' 'Malawi' 'Spirits' '0.31']]
0.5
Wine
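
The same lookups can be reproduced without the data file. The sketch below hard-codes the first three data rows from the output above (an assumption made purely for illustration) and adds two further forms: a single index for a whole row, and a negative index counting from the end.

```python
import numpy as np

# First three data rows copied from the printed output above
rows = np.array([
    ['1986', 'Western Pacific', 'Viet Nam', 'Wine', '0'],
    ['1986', 'Americas', 'Uruguay', 'Other', '0.5'],
    ['1985', 'Africa', "Cte d'Ivoire", 'Wine', '1.62'],
])

second_value = rows[1, 4]   # row 1, column 4
third_row = rows[2]         # a single index returns the whole row
last_year = rows[-1, 0]     # negative indexes count from the end

print(second_value)
print(third_row)
print(last_year)
```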

6. Slicing a NumPy ndarray

import numpy as np

# Same syntax as slicing a Python sequence: the start index is included, the stop index is excluded
data = np.array([1,2,3,4,5])
# [1 2 3]
print(data[0:3])
[1 2 3]
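
A few more slice forms, plus one behavior that differs from Python lists: a basic slice of an ndarray is a view onto the same memory, so writing through the slice changes the original array. A short sketch with illustrative names:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

head = data[:3]          # omitted start means "from the beginning"
tail = data[-2:]         # last two elements
every_other = data[::2]  # step of 2

# Unlike a list slice, head shares memory with data
head[0] = 100
print(data)

# Take a copy when an independent array is needed
independent = data[:3].copy()
independent[1] = -1
print(data)
```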

7. Slicing a two-dimensional NumPy array

import numpy as np

matrix = np.array([['james','USA',45],['cris','CHINA',33],['大帥','UK',11]])
# ['USA' 'CHINA' 'UK'] Takes the specified column from every row of the two-dimensional array; ':' means all rows
print(matrix[:,1])

'''
    A slice in the column position selects those columns for every row.
    [['james' 'USA']
     ['cris' 'CHINA']
     ['大帥' 'UK']]
'''
print(matrix[:,0:2])

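'''
    Likewise, you can take specific columns of specific rows. Any region of a two-dimensional array can be selected by slicing: the first index addresses rows, the second addresses columns, and both may be slices.
    [['james' 'USA']
     ['cris' 'CHINA']]
'''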
print(matrix[0:2,0:2])
['USA' 'CHINA' 'UK']
[['james' 'USA']
 ['cris' 'CHINA']
 ['大帥' 'UK']]
[['james' 'USA']
 ['cris' 'CHINA']]
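
The same row and column slicing on a purely numeric array, which avoids the conversion to strings seen in the mixed matrix above. The scores array is made up for illustration.

```python
import numpy as np

scores = np.array([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90]])

second_column = scores[:, 1]        # one column from every row
first_two_columns = scores[:, 0:2]  # columns 0 and 1 from every row
top_left = scores[0:2, 0:2]         # rows 0-1, columns 0-1
last_row = scores[2, :]             # every column of row 2

print(second_column)
print(first_two_columns)
print(top_left)
print(last_row)
```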