Chapter 6: Data Loading, Storage, and File Formats (Days 12-14)

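Note: This post is part of a Python data-processing study log. It records the errors I ran into while reproducing the book's examples and the places where current behavior differs from the book; simple operations are not repeated here. The material mainly follows Python for Data Analysis (Chinese edition, 《利用Python進行資料分析》) by Wes McKinney, China Machine Press.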

Reading and Writing Data in Text Format

read_csv()

Signature:
pd.read_csv(
filepath_or_buffer,
sep=',',
delimiter=None,
header='infer',
names=None,
index_col=None,
usecols=None,
squeeze=False,
prefix=None,
mangle_dupe_cols=True,
dtype=None,
engine=None,
converters=None,
true_values=None,
false_values=None,
skipinitialspace=False,
skiprows=None,
skipfooter=None,
nrows=None,
na_values=None,
keep_default_na=True,
na_filter=True,
verbose=False,
skip_blank_lines=True,
parse_dates=False,
infer_datetime_format=False,
keep_date_col=False,
date_parser=None,
dayfirst=False,
iterator=False,
chunksize=None,
compression='infer',
thousands=None,
decimal='.',
lineterminator=None,
quotechar='"',
quoting=0,
escapechar=None,
comment=None,
encoding=None,
dialect=None,
tupleize_cols=False,
error_bad_lines=True,
warn_bad_lines=True,
skip_footer=0,
doublequote=True,
delim_whitespace=False,
as_recarray=False,
compact_ints=False,
use_unsigned=False,
low_memory=True,
buffer_lines=None,
memory_map=False,
float_precision=None)

Docstring: Read CSV (comma-separated) file into DataFrame. Also supports optionally iterating or breaking of the file into chunks.
Additional help can be found in the online docs for IO Tools <http://pandas.pydata.org/pandas-docs/stable/io.html>.

Parameters:
filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)
The string could be a URL. Valid URL schemes include http, ftp, s3, and
file. For file URLs, a host is expected. For instance, a local file could
be file://localhost/path/to/table.csv
sep : str, default ','
Delimiter to use. If sep is None, will try to automatically determine
this. Regular expressions are accepted and will force use of the python
parsing engine and will ignore quotes in the data.
delimiter : str, default None
Alternative argument name for sep.
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the data.
Default behavior is as if set to 0 if no names passed, otherwise
None. Explicitly pass header=0 to be able to replace existing
names. The header can be a list of integers that specify row locations for
a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not
specified will be skipped (e.g. 2 in this example is skipped). Note that
this parameter ignores commented lines and empty lines if
skip_blank_lines=True, so header=0 denotes the first line of data
rather than the first line of the file.
names : array-like, default None
List of column names to use. If file contains no header row, then you
should explicitly pass header=None
index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end
of each line, you might consider index_col=False to force pandas to not
use the first column as the index (row names)
usecols : array-like, default None
Return a subset of the columns.
Results in much faster parsing time and lower memory usage.
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
prefix : str, default None
Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
(Unsupported with engine='python'). Use str or object to preserve and
not interpret dtype.
engine : {'c', 'python'}, optional
Parser engine to use. The C engine is faster while the python engine is
currently more feature-complete.
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either
be integers or column labels
true_values : list, default None
Values to consider as True
false_values : list, default None
Values to consider as False
skipinitialspace : boolean, default False
Skip spaces after delimiter.
skiprows : list-like or integer, default None
Line numbers to skip (0-indexed) or number of lines to skip (int)
at the start of the file
skipfooter : int, default 0
Number of lines at bottom of file to skip (Unsupported with engine='c')
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files
na_values : str or list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific
per-column NA values. By default the following values are interpreted as
NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A',
'NA', 'NULL', 'NaN', 'nan'.
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN
values are overridden, otherwise they’re appended to.
na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In
data without any NAs, passing na_filter=False can improve the performance
of reading a large file
verbose : boolean, default False
Indicate number of NA values placed in non-numeric columns
skip_blank_lines : boolean, default True
If True, skip over blank lines rather than interpreting as NaN values
parse_dates : boolean or list of ints or names or list of lists or dict, default False
* boolean. If True -> try parsing the index.
* list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
* list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
* dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
Note: A fast-path exists for iso8601-formatted dates.

infer_datetime_format : boolean, default False
If True and parse_dates is enabled for a column, attempt to infer
the datetime format to speed up the processing
keep_date_col : boolean, default False
If True and parse_dates specifies combining multiple columns then
keep the original columns.
date_parser : function, default None
Function to use for converting a sequence of string columns to an array of
datetime instances. The default uses dateutil.parser.parser to do the
conversion. Pandas will try to call date_parser in three different ways,
advancing to the next if an exception occurs: 1) Pass one or more arrays
(as defined by parse_dates) as arguments; 2) concatenate (row-wise) the
string values from the columns defined by parse_dates into a single array
and pass that; and 3) call date_parser once for each row using one or more
strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst : boolean, default False
DD/MM format dates, international and European format
iterator : boolean, default False
Return TextFileReader object for iteration or getting chunks with
get_chunk().
chunksize : int, default None
Return TextFileReader object for iteration. See IO Tools docs for more
information on iterator and chunksize
<http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>.
compression : {'infer', 'gzip', 'bz2', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use gzip or
bz2 if filepath_or_buffer is a string ending in '.gz' or '.bz2',
respectively, and no decompression otherwise. Set to None for no
decompression.
thousands : str, default None
Thousands separator
decimal : str, default '.'
Character to recognize as decimal point (e.g. use ',' for European data).
lineterminator : str (length 1), default None
Character to break file into lines. Only valid with C parser.
quotechar : str (length 1), optional
The character used to denote the start and end of a quoted item. Quoted
items can include the delimiter and it will be ignored.
quoting : int or csv.QUOTE_* instance, default None
Control field quoting behavior per csv.QUOTE_* constants. Use one of
QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
Default (None) results in QUOTE_MINIMAL behavior.
escapechar : str (length 1), default None
One-character string used to escape delimiter when quoting is QUOTE_NONE.
comment : str, default None
Indicates remainder of line should not be parsed. If found at the beginning
of a line, the line will be ignored altogether. This parameter must be a
single character. Like empty lines (as long as skip_blank_lines=True),
fully commented lines are ignored by the parameter header but not by
skiprows. For example, if comment='#', parsing '#empty\na,b,c\n1,2,3'
with header=0 will result in 'a,b,c' being
treated as the header.
encoding : str, default None
Encoding to use for UTF when reading/writing (ex. 'utf-8'). List of Python
standard encodings <https://docs.python.org/3/library/codecs.html#standard-encodings>
dialect : str or csv.Dialect instance, default None
If None defaults to Excel dialect. Ignored if sep longer than 1 char
See csv.Dialect documentation for more details
tupleize_cols : boolean, default False
Leave a list of tuples on columns as is (default is to convert to
a Multi Index on the columns)
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no DataFrame will be returned.
If False, then these "bad lines" will be dropped from the DataFrame that is
returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output. (Only valid with C parser).

Returns:
result : DataFrame or TextParser
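
The Returns entry above says the result is either a DataFrame or a parser object. Here is a minimal sketch of the second case, assuming a small in-memory CSV (the csv_text data and variable names are made up for illustration): passing chunksize makes read_csv return an iterable reader that yields DataFrame pieces.

```python
import io

import pandas as pd

# Hypothetical 10-row CSV held in memory instead of a file on disk.
csv_text = "key,value\n" + "".join(f"k{i},{i}\n" for i in range(10))

# With chunksize set, read_csv returns an iterable reader, not a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

# Each item yielded by the reader is an ordinary DataFrame of up to 4 rows.
chunk_sizes = [len(chunk) for chunk in reader]
print(chunk_sizes)
```

Iterating this way keeps only one chunk in memory at a time, which is the point of the option for large files.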

read_table is similar to read_csv; its default sep is '\t', and everything else is largely the same.
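
A quick way to check that relationship, as a sketch on made-up tab-separated text (tsv_text is hypothetical): read_table with its defaults and read_csv with sep="\t" should produce the same frame.

```python
import io

import pandas as pd

# Hypothetical tab-separated data.
tsv_text = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"

# read_table splits on tabs by default; read_csv needs sep="\t" spelled out.
from_table = pd.read_table(io.StringIO(tsv_text))
from_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

same = from_table.equals(from_csv)
print(same)
```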

to_csv()

Signature: data.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True,
index_label=None, mode='w', encoding=None, compression=None,
quoting=None, quotechar='"', line_terminator='\n', chunksize=None,
tupleize_cols=False, date_format=None, doublequote=True,
escapechar=None, decimal='.', **kwds)

Docstring: Write DataFrame to a comma-separated values (csv) file

Parameters:
path_or_buf : string or file handle, default None
File path or object, if None is provided the result is returned as
a string.
sep : character, default ','
Field delimiter for the output file.
na_rep : string, default ''
Missing data representation
float_format : string, default None
Format string for floating point numbers
columns : sequence, optional
Columns to write
header : boolean or list of string, default True
Write out column names. If a list of string is given it is assumed
to be aliases for the column names
index : boolean, default True
Write row names (index)
index_label : string or sequence, or False, default None
Column label for index column(s) if desired. If None is given, and
header and index are True, then the index names are used. A
sequence should be given if the DataFrame uses MultiIndex. If
False do not print fields for index names. Use index_label=False
for easier importing in R
nanRep : None
deprecated, use na_rep
mode : str
Python write mode, default 'w'
encoding : string, optional
A string representing the encoding to use in the output file,
defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.
compression : string, optional
a string representing the compression to use in the output file,
allowed values are 'gzip', 'bz2',
only used when the first argument is a filename
line_terminator : string, default '\n'
The newline character or character sequence to use in the output
file
quoting : optional constant from csv module
defaults to csv.QUOTE_MINIMAL
quotechar : string (length 1), default '"'
character used to quote fields
doublequote : boolean, default True
Control quoting of quotechar inside a field
escapechar : string (length 1), default None
character used to escape sep and quotechar when appropriate
chunksize : int or None
rows to write at a time
tupleize_cols : boolean, default False
write multi_index columns as a list of tuples (if True)
or new (expanded format) (if False)
date_format : string, default None
Format string for datetime objects
decimal: string, default '.'
Character recognized as decimal separator. E.g. use ',' for
European data

Notes on the Book

P164 header=None
The header differs from the book; with header=None the columns get integer labels:

pd.read_csv('ex2.csv',header=None)
Out[14]: 
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo

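P166: reading a txt file
This no longer seems to need anything as complicated as what the book shows, since the functionality has been extended. The resulting layout differs slightly, though: without a sep, each line is read into a single column, and only sep='\s+' splits the fields: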

list(open('ex3.txt'))
Out[21]: 
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

result = pd.read_csv('ex3.txt')

result
Out[23]: 
               A         B         C
0  aaa -0.264438 -1.026059 -0.619500
1  bbb  0.927272  0.302904 -0.032399
2  ccc -0.264273 -0.386314 -0.217601
3  ddd -0.871858 -0.348382  1.100491

result = pd.read_table('ex3.txt')

result
Out[25]: 
               A         B         C
0  aaa -0.264438 -1.026059 -0.619500
1  bbb  0.927272  0.302904 -0.032399
2  ccc -0.264273 -0.386314 -0.217601
3  ddd -0.871858 -0.348382  1.100491

result = pd.read_table('ex3.txt', sep=r'\s+')

result
Out[27]: 
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
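
The same whitespace-delimited parse can be reproduced without the file. This is a sketch that assumes the text below matches ex3.txt as listed above; the raw string r"\s+" avoids Python's invalid-escape warning. Because the header line has one field fewer than the data lines, pandas uses the first column as the index.

```python
import io

import pandas as pd

# In-memory copy of the ex3.txt lines shown above.
text = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
    "ccc -0.264273 -0.386314 -0.217601\n"
    "ddd -0.871858 -0.348382  1.100491\n"
)

# A regular expression as separator: one or more whitespace characters.
result = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(list(result.index))
print(list(result.columns))
```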

P166: skiprows
Without skiprows, the function reads the comment lines as data and packs each contiguous run of text into a single cell:

pd.read_csv('ex4.csv')
Out[29]: 
                                                                      # hey!
a                                                  b        c   d    message
# just wanted to make things more difficult for... NaN      NaN NaN      NaN
# who reads CSV files with computers                anyway? NaN NaN      NaN
1                                                  2        3   4      hello
5                                                  6        7   8      world
9                                                  10       11  12       foo

pd.read_csv('ex4.csv',skiprows=[0,2,3])
Out[30]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
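
For a file like this, the comment parameter from the docstring above looks like an alternative to listing row numbers. A minimal sketch, assuming an in-memory file shaped like ex4.csv (the comment text is made up):

```python
import io

import pandas as pd

# Stand-in for ex4.csv: comment lines mixed in with the header and data.
csv_text = (
    "# hey!\n"
    "a,b,c,d,message\n"
    "# a comment line\n"
    "# another comment line\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# Option 1: skip the comment lines by their 0-indexed line numbers.
by_rows = pd.read_csv(io.StringIO(csv_text), skiprows=[0, 2, 3])

# Option 2: tell the parser which character starts a comment.
by_comment = pd.read_csv(io.StringIO(csv_text), comment="#")

print(by_rows.equals(by_comment))
```

The comment form does not depend on where the comment lines sit, so it should survive the file gaining or losing a line.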

P166: the meaning of na_values

na_values : str or list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A',
'NA', 'NULL', 'NaN', 'nan'.

na_values=['xxx'] means that any element of the DataFrame equal to xxx is marked as NaN:

result = pd.read_csv('ex5.csv')
result
Out[33]: 
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

result = pd.read_csv('ex5.csv',na_values=['5'])
result
Out[40]: 
  something    a   b     c   d message
0       one  1.0   2   3.0   4     NaN
1       two  NaN   6   NaN   8   world
2     three  9.0  10  11.0  12     foo

result = pd.read_csv('ex5.csv',na_values=['three'])
result
Out[42]: 
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2       NaN  9  10  11.0  12     foo
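
The docstring also mentions a dict form for per-column NA values. A sketch of that, assuming an in-memory file shaped like ex5.csv (the sentinels mapping is made up for illustration):

```python
import io

import pandas as pd

# Stand-in for ex5.csv.
csv_text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Per-column sentinels: 'foo' is NA only in message, 'two' only in something.
sentinels = {"message": ["foo"], "something": ["two"]}
result = pd.read_csv(io.StringIO(csv_text), na_values=sentinels)

print(result)
```

Unlike the list form, the dict form leaves other columns alone, so column a keeps its 5 and stays an integer column.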

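P168: displaying summary information

"""
Typing result and pressing Enter prints the full contents of result;
result.info() prints a summary instead.
"""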
result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
one      10000 non-null float64
two      10000 non-null float64
three    10000 non-null float64
four     10000 non-null float64
key      10000 non-null object
dtypes: float64(4), object(1)
memory usage: 390.7+ KB

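P170: to_csv()
The argument has been renamed from cols to columns. In the session below, passing cols has no effect and every column is written, while columns selects the subset:

data.to_csv(sys.stdout,index=False,cols=list('abc'))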
something,a,b,c,d,message
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

data.to_csv(sys.stdout,index=False,columns=list('abc'))
a,b,c
1,2,3.0
5,6,
9,10,11.0
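
A self-contained version of the columns call, as a sketch on a made-up frame, which also uses na_rep from the docstring above so that missing values are visible in the output. With no path given, to_csv returns the text as a string.

```python
import pandas as pd

# Hypothetical frame with one missing value in column c.
data = pd.DataFrame(
    {
        "something": ["one", "two"],
        "a": [1, 5],
        "b": [2, 6],
        "c": [3.0, None],
    }
)

# columns= picks the subset to write; na_rep= replaces the empty-string default.
text = data.to_csv(index=False, columns=["a", "b", "c"], na_rep="NULL")

print(text)
```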

P172: diaect
The book has a typo here; it should be dialect:

reader = csv.reader(f,diaect=my_dialect)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-97-fff676b470d2> in <module>()
----> 1 reader = csv.reader(f,diaect=my_dialect)

TypeError: 'diaect' is an invalid keyword argument for this function 

P172: error when writing; the class as defined does not seem to inherit a usable value for quoting

with open('mydata.csv','w') as f:
    writer = csv.writer(f,dialect=my_dialect)
    writer.writerow(('one','two','three'))
    writer.writerow(('1','2','3'))
    writer.writerow(('1','2','3'))
    writer.writerow(('1','2','3'))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-106-1f82c8cdfdb0> in <module>()
      1 with open('mydata.csv','w') as f:
----> 2     writer = csv.writer(f,dialect=my_dialect)
      3     writer.writerow(('one','two','three'))
      4     writer.writerow(('1','2','3'))
      5     writer.writerow(('1','2','3'))

TypeError: "quoting" must be an integer 

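"""
Redefine quoting in my_dialect; it must be an integer. I am not sure what it
means, so I set it to 0 for now (0 is QUOTE_MINIMAL according to the quoting
entry in the read_csv docstring above).
"""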
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = 0


with open('mydata.csv','w') as f:
    writer = csv.writer(f,dialect=my_dialect)
    writer.writerow(('one','two','three'))
    writer.writerow(('1','2','3'))
    writer.writerow(('1','2','3'))
    writer.writerow(('1','2','3'))


!cat mydata.csv
one;two;three
1;2;3
1;2;3
1;2;3
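
The quoting value can also be written with the named constant instead of a bare 0. The sketch below assumes a hypothetical SemicolonDialect class and writes to an in-memory buffer rather than mydata.csv. With csv.QUOTE_MINIMAL, only fields that contain a special character such as the delimiter get quoted.

```python
import csv
import io


class SemicolonDialect(csv.Dialect):
    """Hypothetical dialect mirroring my_dialect, with quoting named explicitly."""

    lineterminator = "\n"
    delimiter = ";"
    quotechar = '"'
    doublequote = True
    quoting = csv.QUOTE_MINIMAL  # same integer value as 0


# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer, dialect=SemicolonDialect)
writer.writerow(("one", "two", "three"))
writer.writerow(("1", "2;x", "3"))  # the middle field contains the delimiter
written = buffer.getvalue()

# Read it back with the same dialect to confirm the round trip.
rows = list(csv.reader(io.StringIO(written), dialect=SemicolonDialect))
print(written)
print(rows)
```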

P173: json.loads() and json.dumps()
After the round trip, the result still differs a little from the original obj:

obj = """
{"name":"Wes",
"place_lived":["United States","Spain","Germany"],
"pet":null,
"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},
{"naem":"Katie","age":33,"pet":"Cisco"}]
}
"""

import json

obj
Out[117]: '\n{"name":"Wes",\n"place_lived":["United States","Spain","Germany"],\n"pet":null,\n"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},\n{"naem":"Katie","age":33,"pet":"Cisco"}]\n}\n'

result = json.loads(obj)

result
Out[119]: 
{u'name': u'Wes',
 u'pet': None,
 u'place_lived': [u'United States', u'Spain', u'Germany'],
 u'siblings': [{u'age': 25, u'name': u'Scott', u'pet': u'Zuko'},
  {u'age': 33, u'naem': u'Katie', u'pet': u'Cisco'}]}

asjson = json..dumps(result)
  File "<ipython-input-120-b73195ced089>", line 1
    asjson = json..dumps(result)
                  ^
SyntaxError: invalid syntax


asjson = json.dumps(result)

asjson
Out[122]: '{"pet": null, "place_lived": ["United States", "Spain", "Germany"], "name": "Wes", "siblings": [{"pet": "Zuko", "age": 25, "name": "Scott"}, {"pet": "Cisco", "age": 33, "naem": "Katie"}]}'
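
The difference appears to be only in the text, not in the data. A minimal sketch on a shortened, made-up version of obj: the dumped string loses the original line breaks, but parsing it again gives back an equal Python object.

```python
import json

# Shortened, hypothetical version of the obj string above.
obj = """
{"name": "Wes",
 "pet": null,
 "siblings": [{"name": "Scott", "age": 25}]
}
"""

parsed = json.loads(obj)            # JSON text -> Python dict
as_json = json.dumps(parsed)        # Python dict -> JSON text (single line)
round_tripped = json.loads(as_json)

print(as_json == obj)               # the strings differ
print(round_tripped == parsed)      # the parsed data is the same
```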