
Python Scientific Computing: Reading txt, csv, and mat Files

1. txt Files

txt files are just about the most common text files around: we stash data in them and read it back later.
This is hard to explain without a concrete example, so let's walk through how to conveniently load a .txt file into an array.
We start with two text files named data1.txt and data2.txt. Their contents are identical; only the delimiter differs. Here are the screenshots:
data1.txt
[screenshot: contents of data1.txt]
data2.txt
[screenshot: contents of data2.txt]
One is space-delimited, the other comma-delimited.
Enough preamble; here are two ways to read these .txt files.

Ⅰ. Writing it yourself

Writing it yourself means using Python's own basic IO operations. The upside: no matter what the file looks like, you can tailor the reading and writing exactly to your needs. The downside: it's tedious and detail-heavy, and for simply formatted files you end up doing a lot of needless work.
Take data1.txt as the example:

# -*- coding: utf-8 -*-
import numpy as np

# load data
with open("data1.txt") as file:
    lines = file.readlines()
rows = len(lines)

# two columns of numeric data
datamat = np.zeros((rows, 2))

for row, line in enumerate(lines):
    # data1.txt is space-delimited; split() handles any whitespace
    datamat[row, :] = [float(x) for x in line.strip().split()]

print(datamat)
print(datamat.shape)

Result:
[screenshot: the printed array and its shape]
The data is read into an ndarray without trouble, and the code is basic enough that anyone with some background can follow it, so I won't walk through it. Still, that's a fair number of lines just to read one .txt file. It can actually be much simpler.

Ⅱ. Calling a library function

This function already appeared in the earlier numpy introduction; let's bring it up once more.
numpy.loadtxt(fname, dtype=<type 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0)
It was covered in detail before, so I won't repeat it here.

numpy.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ')

Purpose: save an array to a text file (think of it as the inverse of the function above)
Parameters:
fname : the file name to save to (see the docs for .gz support)
X : the array to write out
fmt : (optional) format for the saved values, i.e. the usual printf-style format specifiers; I won't review those here, brush up on them yourself.
delimiter : the delimiter, your choice. Default is a space " "
newline : the line separator, your choice. I suggest os.linesep. The default is "\n", but in my experience it sometimes misbehaves.
header : str, optional. String that will be written at the beginning of the file.
footer : str, optional. String that will be written at the end of the file.
comments : str, optional. String that will be prepended to the header and footer strings, to mark them as comments. Default: '# ', as expected by e.g. numpy.loadtxt.

Straight to an example of how they're used.
The goal: load data1.txt and data2.txt into arrays with loadtxt, then write the arrays back out to files named data3.txt, data4.txt, and data5.txt.

# -*- coding: utf-8 -*-
import numpy as np
import os

#load data1.txt
print("------Load data1.txt------")
data1=np.loadtxt("data1.txt",delimiter=' ')
print(data1)
print(data1.shape)
print("type of data1:",type(data1))
print("type of element of data1:",data1.dtype)
print("\n")
#load data2.txt
print("------Load data2.txt------")
data2=np.loadtxt("data2.txt",delimiter=',')
print(data2)
print(data2.shape)
print("type of data2:",type(data2))
print("type of element of data2:",data2.dtype)
print("\n")

#usecols
print("------usecols test:------")
#use the 2nd column (index 1)
test=np.loadtxt("data1.txt",delimiter=' ',usecols=(1,))
print(test)
print(test.shape)
print("type of test:",type(test))
print("type of element of test:",test.dtype)


#write test
np.savetxt("data3.txt",data1,fmt="%5.3f",delimiter=" ",newline=os.linesep)
np.savetxt("data4.txt",data1,fmt="%5.2f",delimiter=",",newline=os.linesep)
np.savetxt("data5.txt",test,fmt="%.3f",delimiter=" ",newline=os.linesep)

Results:
[screenshots: the printed arrays, shapes, and dtypes for data1, data2, and the usecols test]
Meanwhile, 3 new .txt files appear in the folder:
[screenshot: directory listing showing data3.txt, data4.txt, data5.txt]
The program needs no explanation: call the function, pick the right delimiter, and select the columns you want with the usecols parameter.
One caution when using a space " " as the delimiter: if some row fails to load, go look at that row for trailing spaces. A stray space at the end of a line can break the load, and it's very hard to spot, so be careful.
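One defensive workaround, as a sketch (loadtxt_stripped is a helper name made up here, not a numpy function): strip each line's trailing whitespace before handing the text to loadtxt.

import io
import numpy as np

def loadtxt_stripped(path, **kwargs):
    # drop trailing whitespace from every line so a stray space at the
    # end of a row cannot produce an extra empty field
    with open(path) as f:
        cleaned = "\n".join(line.rstrip() for line in f)
    return np.loadtxt(io.StringIO(cleaned), **kwargs)

data = loadtxt_stripped("data1.txt", delimiter=" ")
print(data.shape)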
All of the above boils down to two handy functions for reading and writing .txt files; there's no need to write the low-level IO yourself every time.

2. csv Files

A quick definition, excerpted from Baidu Baike:
Comma-Separated Values (CSV; sometimes called character-separated values, since the separator need not be a comma) files store tabular data (numbers and text) as plain text.
A CSV file consists of any number of records separated by line breaks; each record consists of fields separated by some other character or string, most commonly a comma or a tab. Usually all records share exactly the same sequence of fields.
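For instance, a tiny CSV file (contents invented here purely for illustration) might look like this:

name,age,score
alice,23,88.5
bob,31,92.0

Each line is one record, and the commas split it into three fields.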
A rough idea like that is enough.
I've been handed .csv data many times in data competitions, so the format clearly sees fairly wide use.
Python ships with a built-in csv library, so you can hand-write the IO code yourself, but the simpler route is the read_csv() function in Python's pandas library. Very convenient.
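Before moving on to read_csv, here is a minimal sketch of that manual route with the built-in csv module (assuming data2.txt from earlier: comma-separated numbers, no header row):

import csv
import numpy as np

with open("data2.txt", newline="") as f:
    reader = csv.reader(f)
    # convert every field to float, one list per row
    rows = [[float(x) for x in row] for row in reader]

datamat = np.array(rows)
print(datamat.shape)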
As usual, the function's details first:
pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)

You read that right: the signature really is that long. But few of the parameters come up in everyday use, so no need to panic at the sight of it.

Purpose:
read a .csv file into a DataFrame
Parameters:
filepath_or_buffer : a string giving a filesystem location or URL, or a file-like object
sep : the separator, default ',' (since that is the .csv default); see the docs for finer detail.
delimiter : same as above
delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g. ’ ’ or ’ ‘) will be used as the sep. Equivalent to setting sep=’\s+’. If this option is set to True, nothing should be passed in for the delimiter parameter.
New in version 0.18.1: support for the Python parser.
header : int or list of ints, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Default behavior is as if set to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
names : array-like, default None
List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed unless mangle_dupe_cols=True, which is the default.
index_col : int, sequence, or False, default None. The column to use as the row labels. If a sequence is given, a MultiIndex (hierarchical index) is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to not use the first column as the index (row names).
usecols : array-like, default None. Return a subset of the columns. Elements are either positional (integer indices into the columns) or strings corresponding to column names, whether provided by you or taken from the header row, e.g. [0, 1, 2] or ['foo', 'bar', 'baz'].
as_recarray : boolean, default False
DEPRECATED: this argument will be removed in a future version. Please call pd.read_csv(…).to_records() instead.
Return a NumPy recarray instead of a DataFrame after parsing the data. If set to True, this option takes precedence over the squeeze parameter. In addition, as row indices are not available in such a format, the index_col parameter will be ignored.
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
prefix : str, default None
Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1, …
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as ‘X.0’…’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} (Unsupported with engine=’python’). Use str or object to preserve and not interpret dtype.
engine : {‘c’, ‘python’}, optional
Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either be integers or column labels
true_values : list, default None
Values to consider as True
false_values : list, default None
Values to consider as False
skipinitialspace : boolean, default False
Skip spaces after delimiter.
skiprows : number of lines to skip at the start of the file (when given an int), or a list of 0-indexed row numbers to skip (when given a list).
skipfooter : int, default 0
Number of lines at bottom of file to skip (Unsupported with engine=’c’)
skip_footer : int, default 0. Deprecated; use skipfooter above instead.
nrows : number of rows to read, counted from the start of the file; very useful for reading a piece of a large file.
na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'.
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.
na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file
verbose : boolean, default False
Indicate number of NA values placed in non-numeric columns
skip_blank_lines : boolean, default True
If True, skip over blank lines rather than interpreting as NaN values
parse_dates : boolean or list of ints or names or list of lists or dict, default False
boolean. If True -> try parsing the index.
list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as
a single date column.
dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
Note: A fast-path exists for iso8601-formatted dates.
infer_datetime_format : boolean, default False
If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.
keep_date_col : boolean, default False
If True and parse_dates specifies combining multiple columns then keep the original columns.
date_parser : function, default None
Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst : boolean, default False
DD/MM format dates, international and European format
iterator : boolean, default False
Return TextFileReader object for iteration or getting chunks with get_chunk().
chunksize : the chunk size for reading the file piecewise (for iteration).
compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’
For on-the-fly decompression of on-disk data. If ‘infer’, then use gzip, bz2, zip or xz if filepath_or_buffer is a string ending in ‘.gz’, ‘.bz2’, ‘.zip’, or ‘xz’, respectively, and no decompression otherwise. If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None for no decompression.
New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.
thousands : str, default None
Thousands separator
decimal : str, default ‘.’
Character to recognize as decimal point (e.g. use ‘,’ for European data).
float_precision : string, default None
Specifies which converter the C engine should use for floating-point values. The options are None for the ordinary converter, high for the high-precision converter, and round_trip for the round-trip converter.
lineterminator : str (length 1), default None
Character to break file into lines. Only valid with C parser.
quotechar : str (length 1), optional
The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
quoting : int or csv.QUOTE_* instance, default 0
Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote : boolean, default True
When quotechar is specified and quoting is not QUOTE_NONE, indicate whether or not to interpret two consecutive quotechar elements INSIDE a field as a single quotechar element.
escapechar : str (length 1), default None
One-character string used to escape delimiter when quoting is QUOTE_NONE.
comment : str, default None
Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing '#empty\na,b,c\n1,2,3' with header=0 will result in 'a,b,c' being treated as the header.
encoding : str, default None. The encoding used for decoding when reading and writing (UTF by default). You can also specify your own; for GBK-encoded data, for example, use encoding="gbk".
dialect : str or csv.Dialect instance, default None
If None defaults to Excel dialect. Ignored if sep longer than 1 char See csv.Dialect documentation for more details
tupleize_cols : boolean, default False
Leave a list of tuples on columns as is (default is to convert to a Multi Index on the columns)
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will dropped from the DataFrame that is returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each “bad line” will be output. (Only valid with C parser).
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)
memory_map : boolean, default False
If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
Returns:
a DataFrame object

For experiments I have a .csv of over 2 million rows with 30-odd columns.
Example 1:

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import os

#load .csv
df=pd.read_csv("200W.csv")
print("type of df",type(df))
value=df.values
print("type of value:",type(value))
print("shape of value:",value.shape)

Result:
[screenshot: the printed types and shape]
I'll skip the part that imports pandas and look at the loading call: df=pd.read_csv("200W.csv"). No options at all, just the .csv file name, and in fact that's enough. It returns a DataFrame object, one of pandas' data structures; if you want more, dig deeper into the pandas library, which is pleasant and convenient to use, but that's beyond this post. Printing the type of df likewise shows DataFrame. A DataFrame then has a values attribute that hands the data back as an ndarray, which is exactly the thing we want. After value=df.values, the entire .csv file is effectively loaded into one ndarray object.
When I ran it, though, my memory very nearly blew up; it is, after all, around 2 million rows of data. On a beefy machine, do as you please; otherwise you need to think about ways to save memory.
You'll have noticed this function has an enormous number of parameters, none of which the example above used. A few highlights follow; consult the docs for the rest as you need them.

Example 2:

import numpy as np
import pandas as pd
import os

#load .csv
df=pd.read_csv("200W.csv",nrows=400,usecols=(0,1,2,5,6))
print("type of df",type(df))
value=df.values
print("type of value:",type(value))
print("shape of value:",value.shape)

Just two parameters were added: nrows=400 says I only want to read 400 rows, and the usecols list says I only need columns (0, 1, 2, 5, 6). The load immediately got much faster.
So sometimes it pays to analyze exactly what you need. There are plenty more parameters with all sorts of uses; go combine them yourself.
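On the memory point above, the chunksize parameter from the list is the usual escape hatch: it returns the file in pieces instead of one giant DataFrame. A minimal sketch (the per-chunk work here is just a placeholder):

import pandas as pd

# read "200W.csv" in 100000-row chunks instead of all at once;
# each chunk is an ordinary DataFrame, so memory stays bounded
total_rows = 0
for chunk in pd.read_csv("200W.csv", chunksize=100000):
    total_rows += len(chunk)  # stand-in for your real per-chunk work
print("rows:", total_rows)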

3. mat Files

.mat is MATLAB's standard format for storing data, and many machine learning tasks ship their data as .mat files. Python's scipy has dedicated functions that make loading and saving .mat files convenient.
Following the same routine as above: first the functions, then a couple of examples.
scipy.io.loadmat(file_name, mdict=None, appendmat=True, **kwargs)

Purpose: load a MATLAB file
Parameters:
file_name : the MATLAB file name (without the .mat suffix if appendmat=True); an already-opened file-like object can also be passed in.
mdict : dict, optional
Dictionary in which to insert matfile variables.
appendmat : if True, you can leave the .mat suffix off the file name
byte_order : str or None, optional
None by default, implying byte order guessed from mat file. Otherwise can be one of (‘native’, ‘=’, ‘little’, ‘<’, ‘BIG’, ‘>’).
mat_dtype : bool, optional
If True, return arrays in same dtype as would be loaded into MATLAB (instead of the dtype with which they are saved).
squeeze_me : bool, optional
Whether to squeeze unit matrix dimensions or not.
chars_as_strings : bool, optional
Whether to convert char arrays to string arrays.
matlab_compatible : bool, optional
Returns matrices as would be loaded by MATLAB (implies squeeze_me=False, chars_as_strings=False, mat_dtype=True, struct_as_record=True).
struct_as_record : bool, optional
Whether to load MATLAB structs as numpy record arrays, or as old-style numpy arrays with dtype=object. Setting this flag to False replicates the behavior of scipy version 0.7.x (returning numpy object arrays). The default setting is True, because it allows easier round-trip load and save of MATLAB files.
verify_compressed_data_integrity : bool, optional
Whether the length of compressed sequences in the MATLAB file should be checked, to ensure that they are not longer than we expect. It is advisable to enable this (the default) because overlong compressed sequences in MATLAB files generally indicate that the files have experienced some sort of corruption.
variable_names : None or sequence
If None (the default) - read all variables in file. Otherwise variable_names should be a sequence of strings, giving names of the matlab variables to read from the file. The reader will skip any variable with a name not in this sequence, possibly saving some read processing.
Returns:
mat_dict : a dict, with the variable names as keys and the loaded matrices as values.
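A quick aside: the variable_names parameter above can cut load time on big files by reading only what you need. A sketch, assuming the train.mat used in the example below (which contains a variable named 'X'):

from scipy.io import loadmat

# read only the variable "X" from train.mat; other variables are skipped
only_x = loadmat("train", variable_names=["X"])
print(only_x.keys())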

scipy.io.savemat(file_name, mdict, appendmat=True, format='5', long_field_names=False, do_compression=False, oned_as='row')

Purpose: save a dict of names and arrays into a .mat file
Parameters:
file_name : str or file-like object
Name of the .mat file (.mat extension not needed if appendmat == True). Can also pass open file_like object.
mdict : dict
Dictionary from which to save matfile variables.
appendmat : bool, optional
True (the default) to append the .mat extension to the end of the given filename, if not already present.
format : {‘5’, ‘4’}, string, optional
‘5’ (the default) for MATLAB 5 and up (to 7.2), ‘4’ for MATLAB 4 .mat files.
long_field_names : bool, optional
False (the default) - maximum field name length in a structure is 31 characters which is the documented maximum length. True - maximum field name length in a structure is 63 characters which works for MATLAB 7.6+.
do_compression : bool, optional
Whether or not to compress matrices on write. Default is False.
oned_as : {‘row’, ‘column’}, optional
If ‘column’, write 1-D numpy arrays as column vectors. If ‘row’, write 1-D numpy arrays as row vectors.
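Since the example below only exercises loadmat, here is a minimal savemat round-trip sketch (demo is a throwaway file name invented here):

import numpy as np
from scipy.io import savemat, loadmat

# save two arrays under the names "X" and "y", then load them back
X = np.arange(6.0).reshape(3, 2)
y = np.array([[1], [0], [1]])
savemat("demo", {"X": X, "y": y})        # appendmat=True adds ".mat"
back = loadmat("demo")
print(back["X"].shape, back["y"].shape)  # (3, 2) (3, 1)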

Example 1: a basic read test

# -*- coding: utf-8 -*-
import numpy as np
from scipy.io import loadmat
from scipy.io import savemat

result_dict=loadmat("train")

#inspect the returned type and its keys
print("type of result:",type(result_dict))
print("keys:",result_dict.keys())

#['X', '__header__', 'y', '__globals__', '__version__']
#inspect the contents behind each key
#X
print("X:",result_dict['X'])
print("type of X:",type(result_dict['X']))
print("shape of X:",result_dict['X'].shape)

#y
print("y:",result_dict['y'])
print("type of y:",type(result_dict['y']))
print("shape of y:",result_dict['y'].shape)

Result:
[screenshot: the printed keys, arrays, types, and shapes]
The program is simple, so no commentary, but the result is informative: the function returns a dict, so you fetch the data out of it by key, and each data matrix is a perfectly standard ndarray, which makes everything convenient.
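One last small sketch: besides your variables, the returned dict carries metadata keys such as '__header__'; filtering them out is a one-liner (reusing result_dict from the example above):

# keep only the actual MATLAB variables, dropping '__header__' and friends
data_vars = {k: v for k, v in result_dict.items() if not k.startswith("__")}
print(list(data_vars.keys()))  # e.g. ['X', 'y']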